amentaphd committed · Commit 190f9ed · verified · 1 Parent(s): 0200b44

Upload fine-tuned EU regulation embeddings model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": true,
4
+ "pooling_mode_mean_tokens": false,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
README.md ADDED
@@ -0,0 +1,849 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - generated_from_trainer
7
+ - dataset_size:46338
8
+ - loss:MatryoshkaLoss
9
+ - loss:MultipleNegativesRankingLoss
10
+ base_model: Snowflake/snowflake-arctic-embed-m-v2.0
11
+ widget:
12
+ - source_sentence: What are the anticipated financial effects that could arise from
13
+ material risks associated with resource use and circular economy, and how might
14
+ these risks impact the financial position, performance, and cash flows of an undertaking
15
+ over different time frames?
16
+ sentences:
17
+ - '(a)
18
+
19
+
20
+ anticipated financial effects due to material risks arising from material resource
21
+ use and circular economy -related impacts and dependencies and how these risks
22
+ have or could reasonably be expected to have) a material influence on the undertaking’s
23
+ financial position, financial performance performance, and cash flows over the
24
+ short-, medium- and long-term; and
25
+
26
+
27
+ (b)
28
+
29
+
30
+ anticipated financial effects due to material opportunities related to resource
31
+ use and circular economy.
32
+
33
+
34
+ The disclosure shall include:
35
+
36
+
37
+ (a)'
38
+ - combination of hydrocarbons obtained as a raffinate from a sulphuric acid treating
39
+ process. It consists of hydrocarbons having carbon numbers predominantly in the
40
+ range of C7 through C12 and boiling in the range of approximately 90 °C to 230
41
+ °C.) 649-351-00-7 265-115-2 64742-15-0 P Naphtha (petroleum), chemically neutralised
42
+ heavy; Low boiling point naphtha — unspecified (A complex combination of hydrocarbons
43
+ produced by a treating process to remove acidic materials. It consists of hydrocarbons
44
+ having carbon numbers predominantly in the range of C6 through C12 and boiling
45
+ in the range of approximately 65 °C to 230 °C.) 649-352-00-2 265-122-0 64742-22-9
46
+ P Naphtha (petroleum), chemically neutralised light; Low boiling point naphtha
47
+
48
+ - '2. Member States shall require any investment firm wishing to establish a branch
49
+ within the territory of another Member State or to use tied agents established
50
+ in another Member State in which it has not established a branch, first to notify
51
+ the competent authority of its home Member State and to provide it with the following
52
+ information:
53
+
54
+
55
+ (a) the Member States within the territory of which it plans to establish a branch
56
+ or the Member States in which it has not established a branch but plans to use
57
+ tied agents established there;
58
+
59
+
60
+ (b) a programme of operations setting out, inter alia, the investment services
61
+ and/or activities as well as the ancillary services to be offered;
62
+
63
+
64
+ (c) where established, the organisational structure of the branch and indicating
65
+ whether the branch intends to use tied agents and the identity of those tied agents;
66
+
67
+
68
+ (d) where tied agents are to be used in a Member State in which an investment
69
+ firm has not established a branch, a description of the intended use of the tied
70
+ agent(s) and an organisational structure, including reporting lines, indicating
71
+ how the agent(s) fit into the corporate structure of the investment firm;
72
+
73
+
74
+ (e) the address in the host Member State from which documents may be obtained;
75
+
76
+
77
+ (f) the names of those responsible for the management of the branch or of the
78
+ tied agent.
79
+
80
+
81
+ Where an investment firm uses a tied agent established in a Member State outside
82
+ its home Member State, such tied agent shall be assimilated to the branch, where
83
+ one is established, and shall in any event be subject to the provisions of this
84
+ Directive relating to branches.'
85
+ - source_sentence: What steps must the single point of contact take if the project
86
+ promoter submits an incomplete application for a Strategic Project, and how does
87
+ this affect the permit-granting process timeline?
88
+ sentences:
89
+ - '(1)
90
+
91
+
92
+ ‘cooling’ means the extraction of heat from an enclosed or indoor space (comfort
93
+ application) or from a process in order to reduce the space or process temperature
94
+ to, or maintain it at, a specified temperature (set point); for cooling systems,
95
+ the extracted heat is rejected into and absorbed by the ambient air, ambient water
96
+ or the ground, where the environment (air, ground, and water) provides a sink
97
+ for the heat extracted and thus functions as a cold source;
98
+
99
+
100
+ (2)'
101
+ - '1. Suppliers shall provide the manufacturer with all the information and documentation
102
+ necessary for the manufacturer to demonstrate the conformity of the packaging
103
+ and the packaging materials with this Regulation, including the technical documentation
104
+ referred to in Annex VII and required under or pursuant to Articles 5 to 11, in
105
+ one or more languages which can be easily understood by the manufacturer. That
106
+ information and documentation shall be provided in either paper or electronic
107
+ form.
108
+
109
+
110
+ 2. Where appropriate, the documentation and information required under Union legal
111
+ acts applicable to contact-sensitive packaging shall be part of the information
112
+ and documentation to be provided to the manufacturer pursuant to paragraph 1.'
113
+ - '6.
114
+
115
+
116
+ No later than 45 days following the receipt of a permit-granting application related
117
+ to a Strategic Project, the single point of contact concerned shall acknowledge
118
+ that the application is complete or, if the project promoter has not sent all
119
+ the information required to process an application, request the project promoter
120
+ to submit a complete application without undue delay, specifying which information
121
+ is missing. Where the application submitted is deemed to be incomplete a second
122
+ time, the single point of contact concerned shall not request information in areas
123
+ not covered in the first request for additional information and shall be entitled
124
+ only to request further evidence to complete the identified missing information.
125
+
126
+
127
+ The date of the acknowledgement referred to in the first subparagraph shall serve
128
+ as the start of the permit-granting process.
129
+
130
+
131
+ 7.
132
+
133
+
134
+ No later than one month from the date of acknowledgement referred to in paragraph
135
+ 6 of this Article, the single point of contact concerned shall draw up, in close
136
+ cooperation with the project promoter and other competent authorities concerned,
137
+ a detailed schedule for the permit-granting process. The schedule shall be published
138
+ by the project promoter on the website referred to in Article 8(5). The single
139
+ point of contact concerned shall update the schedule in the event that there are
140
+ significant changes that potentially affect the timing of the comprehensive decision.
141
+
142
+
143
+ 8.
144
+
145
+
146
+ The single point of contact concerned shall notify the project promoter when the
147
+ environmental impact assessment report referred in Article 5(1) of Directive 2011/92/EU
148
+ is due, taking into account the organisation of the permit-granting process in
149
+ the Member State concerned and the need to allow sufficient time to assess the
150
+ report. The period between the deadline for the submission of the environmental
151
+ impact assessment report and the actual submission of that report shall not be
152
+ counted towards the duration of the permit-granting process referred to in paragraphs
153
+ 1 and 2 of this Article.
154
+
155
+
156
+ 9.'
157
+ - source_sentence: What are the requirements for energy audits to be considered compliant
158
+ with the specified paragraph, and what role do voluntary agreements play in this
159
+ process?
160
+ sentences:
161
+ - '8. Member States shall develop programmes to encourage enterprises that are not
162
+ SMEs and that are not subject to paragraph 1 or 2 to undergo energy audits and
163
+ to subsequently implement the recommendations arising from those audits.
164
+
165
+
166
+ 9. Energy audits shall be considered to comply with paragraph 2 where they are:
167
+
168
+
169
+ (a) carried out in an independent manner, on the basis of the minimum criteria
170
+ set out in Annex VI; (b) implemented under voluntary agreements concluded between
171
+ organisations of stakeholders and a body appointed and supervised by the Member
172
+ State concerned, by another body to which the competent authorities have delegated
173
+ the responsibility concerned or by the Commission. --- ---'
174
+ - '3.1.1. The evaluation of all available information shall comprise:
175
+
176
+
177
+ the hazard identification based on all available information,
178
+
179
+
180
+ the establishment of the quantitative dose (concentration)-response (effect) relationship.
181
+
182
+
183
+ 3.1.2. When it is not possible to establish the quantitative dose (concentration)-response
184
+ (effect) relationship, then this should be justified and a semi-quantitative or
185
+ qualitative analysis shall be included.
186
+
187
+
188
+ 3.1.3. All information used to assess the effects on a specific environmental
189
+ sphere shall be briefly presented, if possible in the form of a table or tables.
190
+ The relevant test results (e.g. LC50 or NOEC) and test conditions (e.g. test duration,
191
+ route of administration) and other relevant information shall be presented, in
192
+ internationally recognised units of measurement for that effect.
193
+
194
+
195
+ 3.1.4. All information used to assess the environmental fate of the substance
196
+ shall be briefly presented, if possible in the form of a table or tables. The
197
+ relevant test results and test conditions and other relevant information shall
198
+ be presented, in internationally recognised units of measurement for that effect.
199
+
200
+
201
+ 3.1.5. If one study is available then a robust study summary should be prepared
202
+ for that study. Where there is more than one study addressing the same effect,
203
+ then the study or studies giving rise to the highest concern shall be used to
204
+ draw a conclusion and a robust study summary shall be prepared for that study
205
+ or studies and included as part of the technical dossier. Robust summaries will
206
+ be required of all key data used in the hazard assessment. If the study or studies
207
+ giving rise to the highest concern are not used, then this shall be fully justified
208
+ and included as part of the technical dossier, not only for the study being used
209
+ but also for all studies reaching a higher concern than the study being used.
210
+ For substances where all available studies indicate no hazards an overall assessment
211
+ of the validity of all studies should be performed.
212
+
213
+
214
+ 3.2. Step 2 : Classification and Labelling
215
+
216
+
217
+ ▼M51'
218
+ - impact of single-use packaging, in particular plastic carrier bags; --- --- (f)
219
+ the composting properties and appropriate waste management options for compostable
220
+ packaging in accordance with Article 9(2) of this Regulation; consumers shall
221
+ be informed that compostable packaging is not suitable for home composting and
222
+ that compostable packaging is not to be discarded in nature. --- ---
223
+ - source_sentence: In what scenario should information on toxic effects be listed
224
+ only once for a mixture?
225
+ sentences:
226
+ - 'In determining the energy savings from taxation-related policy measures introduced
227
+ under Article 10, the following principles shall apply: (a) credit shall be given
228
+ only for energy savings from taxation measures exceeding the minimum levels of
229
+ taxation applicable to fuels as required in Council Directive 2003/96/EC (2) or
230
+ 2006/112/EC (3); (b) short-run price elasticities for the calculation of the impact
231
+ of the energy taxation measures shall represent the responsiveness of energy demand
232
+ to price changes, and shall be estimated on the basis of recent and representative
233
+ official data sources, which are applicable for the Member State, and, where applicable,
234
+ on the basis of accompanying studies from an independent institute. If a different'
235
+ - 'Article 13
236
+
237
+
238
+ Project development assistance
239
+
240
+
241
+ 1.
242
+
243
+
244
+ The Commission shall, after consulting the Member States in accordance with Article
245
+ 21(2), point (c), determine the maximum amount of Innovation Fund support available
246
+ for project development assistance.
247
+
248
+
249
+ 2.
250
+
251
+
252
+ The Commission may award project development assistance in the form of technical
253
+ assistance to any project that falls within the scope of the Innovation Fund,
254
+ as set out in Article 10a(8), first and sixth subparagraphs of Directive 2003/87/EC.
255
+
256
+
257
+ 3.
258
+
259
+
260
+ The following activities may be funded by way of project development assistance:
261
+
262
+
263
+ (a)
264
+
265
+
266
+ improvement and development of project documentation or of components of the project
267
+ design with a view to ensuring the sufficient maturity of the project;
268
+
269
+
270
+ (b)
271
+
272
+
273
+ assessment of the feasibility of the project, including technical and economic
274
+ studies;
275
+
276
+
277
+ (c)
278
+
279
+
280
+ advice on the financial and legal structure of the project;
281
+
282
+
283
+ (d)
284
+
285
+
286
+ capacity building of the project proponent.
287
+
288
+
289
+ 4.
290
+
291
+
292
+ If project development assistance is implemented under indirect management, the
293
+ implementing entity shall carry out the selection procedure and take the decision
294
+ to award the project development assistance after having consulted the Commission.
295
+ The award criteria shall take into account the degree of innovation compared to
296
+ the state of the art, the potential to significantly reduce climate impacts and
297
+ to support widespread application, the maturity as well as the geographical and
298
+ sectoral balance in relation to the portfolio of funded projects.'
299
+ - 'effects of the mixture. The information on toxic effects shall be presented for
300
+ each substance, except for the following cases: (a) if the information is duplicated,
301
+ it shall be listed only once for the mixture overall, such as when two substances
302
+ both cause vomiting and diarrhoea; (b) if it is unlikely that these effects will
303
+ occur at the concentrations present, such as when a mild irritant is diluted to
304
+ below a certain concentration in a non-irritant solution; (c) where information
305
+ on interactions between substances in a mixture is not available, assumptions
306
+ shall not be made and instead the health effects of each substance shall be listed
307
+ separately. --- ---'
308
+ - source_sentence: How does the text suggest addressing the social aspects related
309
+ to low- and middle-income transport users in the context of zero-emission vehicle
310
+ initiatives?
311
+ sentences:
312
+ - '(b)
313
+
314
+
315
+ measures intended to accelerate the uptake of zero-emission vehicles or to provide
316
+ financial support for the deployment of fully interoperable refuelling and recharging
317
+ infrastructure for zero-emission vehicles, or measures to encourage a shift to
318
+ public transport and improve multimodality, or to provide financial support in
319
+ order to address social aspects concerning low- and middle-income transport users;
320
+
321
+
322
+ (c)
323
+
324
+
325
+ to finance their Social Climate Plan in accordance with Article 15 of Regulation
326
+ (EU) 2023/955;
327
+
328
+
329
+ (d)'
330
+ - If the planned change is implemented notwithstanding the first and second subparagraphs,
331
+ or if an unplanned change has taken place pursuant to which the AIFM’s management
332
+ of the AIF no longer complies with this Directive or the AIFM otherwise no longer
333
+ complies with this Directive, the competent authorities of the Member State of
334
+ reference of the AIFM shall take all due measures in accordance with Article 46,
335
+ including, if necessary, the express prohibition of marketing of the AIF.
336
+ - '(d)
337
+
338
+
339
+ for gas discharge lamps, 80 % shall be recycled.
340
+
341
+
342
+ Part 2: Minimum targets applicable by category from 15 August 2015 until 14 August
343
+ 2018 with reference to the categories listed in Annex I:
344
+
345
+
346
+ (a)
347
+
348
+
349
+ for WEEE falling within category 1 or 10 of Annex I,
350
+
351
+
352
+ 85 % shall be recovered, and
353
+
354
+
355
+ 80 % shall be prepared for re-use and recycled;
356
+
357
+
358
+ (b)
359
+
360
+
361
+ for WEEE falling within category 3 or 4 of Annex I,
362
+
363
+
364
+ 80 % shall be recovered, and
365
+
366
+
367
+ 70 % shall be prepared for re-use and recycled;
368
+
369
+
370
+ (c)
371
+
372
+
373
+ for WEEE falling within category 2, 5, 6, 7, 8 or 9 of Annex I,
374
+
375
+
376
+ 75 % shall be recovered, and
377
+
378
+
379
+ 55 % shall be prepared for re-use and recycled;
380
+
381
+
382
+ (d)
383
+
384
+
385
+ for gas discharge lamps, 80 % shall be recycled.'
386
+ pipeline_tag: sentence-similarity
387
+ library_name: sentence-transformers
388
+ metrics:
389
+ - cosine_accuracy@1
390
+ - cosine_accuracy@3
391
+ - cosine_accuracy@5
392
+ - cosine_accuracy@10
393
+ - cosine_precision@1
394
+ - cosine_precision@3
395
+ - cosine_precision@5
396
+ - cosine_precision@10
397
+ - cosine_recall@1
398
+ - cosine_recall@3
399
+ - cosine_recall@5
400
+ - cosine_recall@10
401
+ - cosine_ndcg@10
402
+ - cosine_mrr@10
403
+ - cosine_map@100
404
+ model-index:
405
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v2.0
406
+ results:
407
+ - task:
408
+ type: information-retrieval
409
+ name: Information Retrieval
410
+ dataset:
411
+ name: Unknown
412
+ type: unknown
413
+ metrics:
414
+ - type: cosine_accuracy@1
415
+ value: 0.7058518902123252
416
+ name: Cosine Accuracy@1
417
+ - type: cosine_accuracy@3
418
+ value: 0.9067840497151735
419
+ name: Cosine Accuracy@3
420
+ - type: cosine_accuracy@5
421
+ value: 0.9447609183497324
422
+ name: Cosine Accuracy@5
423
+ - type: cosine_accuracy@10
424
+ value: 0.9730709476954945
425
+ name: Cosine Accuracy@10
426
+ - type: cosine_precision@1
427
+ value: 0.7058518902123252
428
+ name: Cosine Precision@1
429
+ - type: cosine_precision@3
430
+ value: 0.3022613499050578
431
+ name: Cosine Precision@3
432
+ - type: cosine_precision@5
433
+ value: 0.18895218366994648
434
+ name: Cosine Precision@5
435
+ - type: cosine_precision@10
436
+ value: 0.09730709476954946
437
+ name: Cosine Precision@10
438
+ - type: cosine_recall@1
439
+ value: 0.7058518902123252
440
+ name: Cosine Recall@1
441
+ - type: cosine_recall@3
442
+ value: 0.9067840497151735
443
+ name: Cosine Recall@3
444
+ - type: cosine_recall@5
445
+ value: 0.9447609183497324
446
+ name: Cosine Recall@5
447
+ - type: cosine_recall@10
448
+ value: 0.9730709476954945
449
+ name: Cosine Recall@10
450
+ - type: cosine_ndcg@10
451
+ value: 0.851314896054128
452
+ name: Cosine Ndcg@10
453
+ - type: cosine_mrr@10
454
+ value: 0.8109469830857718
455
+ name: Cosine Mrr@10
456
+ - type: cosine_map@100
457
+ value: 0.8122768308333804
458
+ name: Cosine Map@100
459
+ ---
460
+
461
+ # SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v2.0
462
+
463
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
464
+
465
+ ## Model Details
466
+
467
+ ### Model Description
468
+ - **Model Type:** Sentence Transformer
469
+ - **Base model:** [Snowflake/snowflake-arctic-embed-m-v2.0](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0) <!-- at revision 95c2741480856aa9666782eb4afe11959938017f -->
470
+ - **Maximum Sequence Length:** 8192 tokens
471
+ - **Output Dimensionality:** 768 dimensions
472
+ - **Similarity Function:** Cosine Similarity
473
+ <!-- - **Training Dataset:** Unknown -->
474
+ <!-- - **Language:** Unknown -->
475
+ <!-- - **License:** Unknown -->
476
+
477
+ ### Model Sources
478
+
479
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
480
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
481
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
482
+
483
+ ### Full Model Architecture
484
+
485
+ ```
486
+ SentenceTransformer(
487
+ (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: GteModel
488
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
489
+ (2): Normalize()
490
+ )
491
+ ```
492
+
493
+ ## Usage
494
+
495
+ ### Direct Usage (Sentence Transformers)
496
+
497
+ First install the Sentence Transformers library:
498
+
499
+ ```bash
500
+ pip install -U sentence-transformers
501
+ ```
502
+
503
+ Then you can load this model and run inference.
504
+ ```python
505
+ from sentence_transformers import SentenceTransformer
506
+
507
+ # Download from the 🤗 Hub
508
+ model = SentenceTransformer("sentence_transformers_model_id")
509
+ # Run inference
510
+ sentences = [
511
+ 'How does the text suggest addressing the social aspects related to low- and middle-income transport users in the context of zero-emission vehicle initiatives?',
512
+ '(b)\n\nmeasures intended to accelerate the uptake of zero-emission vehicles or to provide financial support for the deployment of fully interoperable refuelling and recharging infrastructure for zero-emission vehicles, or measures to encourage a shift to public transport and improve multimodality, or to provide financial support in order to address social aspects concerning low- and middle-income transport users;\n\n(c)\n\nto finance their Social Climate Plan in accordance with Article 15 of Regulation (EU) 2023/955;\n\n(d)',
513
+ 'If the planned change is implemented notwithstanding the first and second subparagraphs, or if an unplanned change has taken place pursuant to which the AIFM’s management of the AIF no longer complies with this Directive or the AIFM otherwise no longer complies with this Directive, the competent authorities of the Member State of reference of the AIFM shall take all due measures in accordance with Article 46, including, if necessary, the express prohibition of marketing of the AIF.',
514
+ ]
515
+ embeddings = model.encode(sentences)
516
+ print(embeddings.shape)
517
+ # [3, 768]
518
+
519
+ # Get the similarity scores for the embeddings
520
+ similarities = model.similarity(embeddings, embeddings)
521
+ print(similarities.shape)
522
+ # [3, 3]
523
+ ```
524
+
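+ Since this repository's `config_sentence_transformers.json` defines a `"query"` prompt (`"query: "`), a natural retrieval pattern is to encode questions with `prompt_name="query"` and encode regulation passages without a prompt. The snippet below is a minimal sketch of that pattern; `"sentence_transformers_model_id"` remains a placeholder for this model's Hub id and the example texts are illustrative.
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id
+ 
+ queries = ["What role do voluntary agreements play in energy audits?"]
+ passages = [
+     "9. Energy audits shall be considered to comply with paragraph 2 where they are: ...",
+     "Article 13 Project development assistance ...",
+ ]
+ 
+ # Queries use the "query" prompt from config_sentence_transformers.json; passages are encoded as-is.
+ query_embeddings = model.encode(queries, prompt_name="query")
+ passage_embeddings = model.encode(passages)
+ 
+ # Cosine similarity (the configured similarity function) ranks passages for each query.
+ scores = model.similarity(query_embeddings, passage_embeddings)
+ print(scores.shape)
+ # [1, 2]
+ ```
+ 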
525
+ <!--
526
+ ### Direct Usage (Transformers)
527
+
528
+ <details><summary>Click to see the direct usage in Transformers</summary>
529
+
530
+ </details>
531
+ -->
532
+
533
+ <!--
534
+ ### Downstream Usage (Sentence Transformers)
535
+
536
+ You can finetune this model on your own dataset.
537
+
538
+ <details><summary>Click to expand</summary>
539
+
540
+ </details>
541
+ -->
542
+
543
+ <!--
544
+ ### Out-of-Scope Use
545
+
546
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
547
+ -->
548
+
549
+ ## Evaluation
550
+
551
+ ### Metrics
552
+
553
+ #### Information Retrieval
554
+
555
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
556
+
557
+ | Metric | Value |
558
+ |:--------------------|:-----------|
559
+ | cosine_accuracy@1 | 0.7059 |
560
+ | cosine_accuracy@3 | 0.9068 |
561
+ | cosine_accuracy@5 | 0.9448 |
562
+ | cosine_accuracy@10 | 0.9731 |
563
+ | cosine_precision@1 | 0.7059 |
564
+ | cosine_precision@3 | 0.3023 |
565
+ | cosine_precision@5 | 0.189 |
566
+ | cosine_precision@10 | 0.0973 |
567
+ | cosine_recall@1 | 0.7059 |
568
+ | cosine_recall@3 | 0.9068 |
569
+ | cosine_recall@5 | 0.9448 |
570
+ | cosine_recall@10 | 0.9731 |
571
+ | **cosine_ndcg@10** | **0.8513** |
572
+ | cosine_mrr@10 | 0.8109 |
573
+ | cosine_map@100 | 0.8123 |
574
+
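+ The table above was produced with the evaluator linked above. A comparable evaluation on a held-out query/passage split could be reproduced roughly as sketched below; the ids and texts are purely illustrative, the evaluator name is hypothetical, and the model id is a placeholder.
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import InformationRetrievalEvaluator
+ 
+ # Illustrative held-out split: query ids -> questions, document ids -> passages,
+ # and each query id -> the set of relevant document ids.
+ queries = {"q1": "What role do voluntary agreements play in energy audits?"}
+ corpus = {
+     "d1": "9. Energy audits shall be considered to comply with paragraph 2 where they are: ...",
+     "d2": "Article 13 Project development assistance ...",
+ }
+ relevant_docs = {"q1": {"d1"}}
+ 
+ model = SentenceTransformer("sentence_transformers_model_id")  # placeholder id
+ evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="eu-regulations-dev")
+ print(evaluator(model))  # accuracy@k, precision@k, recall@k, MRR@10, NDCG@10, MAP@100
+ ```
+ 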
575
+ <!--
576
+ ## Bias, Risks and Limitations
577
+
578
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
579
+ -->
580
+
581
+ <!--
582
+ ### Recommendations
583
+
584
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
585
+ -->
586
+
587
+ ## Training Details
588
+
589
+ ### Training Dataset
590
+
591
+ #### Unnamed Dataset
592
+
593
+ * Size: 46,338 training samples
594
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
595
+ * Approximate statistics based on the first 1000 samples:
596
+ | | sentence_0 | sentence_1 |
597
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
598
+ | type | string | string |
599
+ | details | <ul><li>min: 9 tokens</li><li>mean: 39.98 tokens</li><li>max: 286 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 248.72 tokens</li><li>max: 1315 tokens</li></ul> |
600
+ * Samples:
601
+ | sentence_0 | sentence_1 |
602
+ |:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
603
+ | <code>What is the maximum allowable reduction in excise duty for mixtures used as motor fuels containing biodiesel in Italy until 30 June 2004?</code> | <code>for waste oils which are reused as fuel, either directly after recovery or following a recycling process for waste oils, and where the reuse is subject to duty.<br><br>8. ITALY:<br><br>for differentiated rates of excise duty on mixtures used as motor fuels containing 5 % or 25 % of biodiesel until 30 June 2004. The reduction in excise duty may not be greater than the amount of excise duty payable on the volume of biofuels present in the products eligible for the reduction. The reduction in excise duty shall be adjusted to take account of changes in the price of raw materials to avoid overcompensating for the extra costs involved in the manufacture of biofuels;</code> |
604
+ | <code>What are the minimum indicative share percentages for the years 2023 to 2030, and how do these percentages relate to the interconnectivity levels of the Member States?</code> | <code>Such indicative shares may, in each year, amount to at least 5 % from 2023 to 2026 and at least 10 % from 2027 to 2030, or, where lower, to the level of interconnectivity of the Member State concerned in any given year.<br><br>In order to acquire further implementation experience, Member States may organise one or more pilot schemes where support is open to producers located in other Member States.<br><br>2.</code> |
605
+ | <code>What is the significance of the one-month period mentioned in the context?</code> | <code>one month after its notification, in accordance with the arrangements provided for in Article 23.</code> |
606
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
607
+ ```json
608
+ {
609
+ "loss": "MultipleNegativesRankingLoss",
610
+ "matryoshka_dims": [
611
+ 768,
612
+ 512,
613
+ 256,
614
+ 128,
615
+ 64
616
+ ],
617
+ "matryoshka_weights": [
618
+ 1,
619
+ 1,
620
+ 1,
621
+ 1,
622
+ 1
623
+ ],
624
+ "n_dims_per_step": -1
625
+ }
626
+ ```
627
+
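+ Because the loss above applies MultipleNegativesRankingLoss at 768, 512, 256, 128 and 64 dimensions, the embeddings can plausibly be truncated to one of those sizes for cheaper storage and search. The sketch below uses the `truncate_dim` argument of `SentenceTransformer`; the model id is a placeholder and the quality trade-off at each dimension has not been measured here.
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Keep only the first 256 dimensions of each embedding (one of the trained Matryoshka sizes).
+ model = SentenceTransformer("sentence_transformers_model_id", truncate_dim=256)  # placeholder id
+ 
+ embeddings = model.encode([
+     "What is the significance of the one-month period mentioned in the context?",
+     "one month after its notification, in accordance with the arrangements provided for in Article 23.",
+ ])
+ print(embeddings.shape)
+ # [2, 256]
+ ```
+ 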
628
+ ### Training Hyperparameters
629
+ #### Non-Default Hyperparameters
630
+
631
+ - `eval_strategy`: steps
632
+ - `num_train_epochs`: 4
633
+ - `fp16`: True
634
+ - `multi_dataset_batch_sampler`: round_robin
635
+
636
+ #### All Hyperparameters
637
+ <details><summary>Click to expand</summary>
638
+
639
+ - `overwrite_output_dir`: False
640
+ - `do_predict`: False
641
+ - `eval_strategy`: steps
642
+ - `prediction_loss_only`: True
643
+ - `per_device_train_batch_size`: 8
644
+ - `per_device_eval_batch_size`: 8
645
+ - `per_gpu_train_batch_size`: None
646
+ - `per_gpu_eval_batch_size`: None
647
+ - `gradient_accumulation_steps`: 1
648
+ - `eval_accumulation_steps`: None
649
+ - `torch_empty_cache_steps`: None
650
+ - `learning_rate`: 5e-05
651
+ - `weight_decay`: 0.0
652
+ - `adam_beta1`: 0.9
653
+ - `adam_beta2`: 0.999
654
+ - `adam_epsilon`: 1e-08
655
+ - `max_grad_norm`: 1
656
+ - `num_train_epochs`: 4
657
+ - `max_steps`: -1
658
+ - `lr_scheduler_type`: linear
659
+ - `lr_scheduler_kwargs`: {}
660
+ - `warmup_ratio`: 0.0
661
+ - `warmup_steps`: 0
662
+ - `log_level`: passive
663
+ - `log_level_replica`: warning
664
+ - `log_on_each_node`: True
665
+ - `logging_nan_inf_filter`: True
666
+ - `save_safetensors`: True
667
+ - `save_on_each_node`: False
668
+ - `save_only_model`: False
669
+ - `restore_callback_states_from_checkpoint`: False
670
+ - `no_cuda`: False
671
+ - `use_cpu`: False
672
+ - `use_mps_device`: False
673
+ - `seed`: 42
674
+ - `data_seed`: None
675
+ - `jit_mode_eval`: False
676
+ - `use_ipex`: False
677
+ - `bf16`: False
678
+ - `fp16`: True
679
+ - `fp16_opt_level`: O1
680
+ - `half_precision_backend`: auto
681
+ - `bf16_full_eval`: False
682
+ - `fp16_full_eval`: False
683
+ - `tf32`: None
684
+ - `local_rank`: 0
685
+ - `ddp_backend`: None
686
+ - `tpu_num_cores`: None
687
+ - `tpu_metrics_debug`: False
688
+ - `debug`: []
689
+ - `dataloader_drop_last`: False
690
+ - `dataloader_num_workers`: 0
691
+ - `dataloader_prefetch_factor`: None
692
+ - `past_index`: -1
693
+ - `disable_tqdm`: False
694
+ - `remove_unused_columns`: True
695
+ - `label_names`: None
696
+ - `load_best_model_at_end`: False
697
+ - `ignore_data_skip`: False
698
+ - `fsdp`: []
699
+ - `fsdp_min_num_params`: 0
700
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
701
+ - `fsdp_transformer_layer_cls_to_wrap`: None
702
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
703
+ - `deepspeed`: None
704
+ - `label_smoothing_factor`: 0.0
705
+ - `optim`: adamw_torch
706
+ - `optim_args`: None
707
+ - `adafactor`: False
708
+ - `group_by_length`: False
709
+ - `length_column_name`: length
710
+ - `ddp_find_unused_parameters`: None
711
+ - `ddp_bucket_cap_mb`: None
712
+ - `ddp_broadcast_buffers`: False
713
+ - `dataloader_pin_memory`: True
714
+ - `dataloader_persistent_workers`: False
715
+ - `skip_memory_metrics`: True
716
+ - `use_legacy_prediction_loop`: False
717
+ - `push_to_hub`: False
718
+ - `resume_from_checkpoint`: None
719
+ - `hub_model_id`: None
720
+ - `hub_strategy`: every_save
721
+ - `hub_private_repo`: None
722
+ - `hub_always_push`: False
723
+ - `gradient_checkpointing`: False
724
+ - `gradient_checkpointing_kwargs`: None
725
+ - `include_inputs_for_metrics`: False
726
+ - `include_for_metrics`: []
727
+ - `eval_do_concat_batches`: True
728
+ - `fp16_backend`: auto
729
+ - `push_to_hub_model_id`: None
730
+ - `push_to_hub_organization`: None
731
+ - `mp_parameters`:
732
+ - `auto_find_batch_size`: False
733
+ - `full_determinism`: False
734
+ - `torchdynamo`: None
735
+ - `ray_scope`: last
736
+ - `ddp_timeout`: 1800
737
+ - `torch_compile`: False
738
+ - `torch_compile_backend`: None
739
+ - `torch_compile_mode`: None
740
+ - `dispatch_batches`: None
741
+ - `split_batches`: None
742
+ - `include_tokens_per_second`: False
743
+ - `include_num_input_tokens_seen`: False
744
+ - `neftune_noise_alpha`: None
745
+ - `optim_target_modules`: None
746
+ - `batch_eval_metrics`: False
747
+ - `eval_on_start`: False
748
+ - `use_liger_kernel`: False
749
+ - `eval_use_gather_object`: False
750
+ - `average_tokens_across_devices`: False
751
+ - `prompts`: None
752
+ - `batch_sampler`: batch_sampler
753
+ - `multi_dataset_batch_sampler`: round_robin
754
+
755
+ </details>
756
+
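+ The exact training script is not included in this commit; the sketch below only illustrates how a comparable run could be assembled from the hyperparameters and losses listed above (4 epochs, batch size 8, fp16, MatryoshkaLoss wrapping MultipleNegativesRankingLoss). The example pair and the output directory are hypothetical.
+ 
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (
+     SentenceTransformer,
+     SentenceTransformerTrainer,
+     SentenceTransformerTrainingArguments,
+ )
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
+ 
+ # trust_remote_code is needed because the base model ships custom GTE modeling code (see config.json auto_map).
+ model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True)
+ 
+ # Stand-in for the 46,338 (sentence_0, sentence_1) question/passage training pairs.
+ train_dataset = Dataset.from_dict({
+     "sentence_0": ["What role do voluntary agreements play in energy audits?"],
+     "sentence_1": ["9. Energy audits shall be considered to comply with paragraph 2 where they are: ..."],
+ })
+ 
+ loss = MatryoshkaLoss(
+     model,
+     MultipleNegativesRankingLoss(model),
+     matryoshka_dims=[768, 512, 256, 128, 64],
+ )
+ 
+ args = SentenceTransformerTrainingArguments(
+     output_dir="arctic-embed-eu-regulations",  # hypothetical path
+     num_train_epochs=4,
+     per_device_train_batch_size=8,
+     fp16=True,
+ )
+ 
+ trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
+ trainer.train()
+ ```
+ 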
757
+ ### Training Logs
758
+ | Epoch | Step | Training Loss | cosine_ndcg@10 |
759
+ |:------:|:-----:|:-------------:|:--------------:|
760
+ | 0.0863 | 500 | 0.225 | - |
761
+ | 0.1726 | 1000 | 0.1337 | - |
762
+ | 0.2589 | 1500 | 0.1195 | - |
763
+ | 0.3452 | 2000 | 0.0803 | - |
764
+ | 0.4316 | 2500 | 0.0775 | - |
765
+ | 0.5179 | 3000 | 0.0714 | - |
766
+ | 0.6042 | 3500 | 0.0852 | - |
767
+ | 0.6905 | 4000 | 0.0718 | - |
768
+ | 0.7768 | 4500 | 0.0499 | - |
769
+ | 0.8631 | 5000 | 0.0665 | 0.8371 |
770
+ | 0.9494 | 5500 | 0.0674 | - |
771
+ | 1.0 | 5793 | - | 0.8416 |
772
+ | 1.0357 | 6000 | 0.0538 | - |
773
+ | 1.1220 | 6500 | 0.0606 | - |
774
+ | 1.2084 | 7000 | 0.0294 | - |
775
+ | 1.2947 | 7500 | 0.0129 | - |
776
+ | 1.3810 | 8000 | 0.0101 | - |
777
+ | 1.4673 | 8500 | 0.0072 | - |
778
+ | 1.5536 | 9000 | 0.0211 | - |
779
+ | 1.6399 | 9500 | 0.0133 | - |
780
+ | 1.7262 | 10000 | 0.0063 | 0.8513 |
781
+
782
+
783
+ ### Framework Versions
784
+ - Python: 3.10.15
785
+ - Sentence Transformers: 4.0.2
786
+ - Transformers: 4.49.0
787
+ - PyTorch: 2.6.0+cu126
788
+ - Accelerate: 0.26.0
789
+ - Datasets: 3.5.0
790
+ - Tokenizers: 0.21.1
791
+
792
+ ## Citation
793
+
794
+ ### BibTeX
795
+
796
+ #### Sentence Transformers
797
+ ```bibtex
798
+ @inproceedings{reimers-2019-sentence-bert,
799
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
800
+ author = "Reimers, Nils and Gurevych, Iryna",
801
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
802
+ month = "11",
803
+ year = "2019",
804
+ publisher = "Association for Computational Linguistics",
805
+ url = "https://arxiv.org/abs/1908.10084",
806
+ }
807
+ ```
808
+
809
+ #### MatryoshkaLoss
810
+ ```bibtex
811
+ @misc{kusupati2024matryoshka,
812
+ title={Matryoshka Representation Learning},
813
+ author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
814
+ year={2024},
815
+ eprint={2205.13147},
816
+ archivePrefix={arXiv},
817
+ primaryClass={cs.LG}
818
+ }
819
+ ```
820
+
821
+ #### MultipleNegativesRankingLoss
822
+ ```bibtex
823
+ @misc{henderson2017efficient,
824
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
825
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
826
+ year={2017},
827
+ eprint={1705.00652},
828
+ archivePrefix={arXiv},
829
+ primaryClass={cs.CL}
830
+ }
831
+ ```
832
+
833
+ <!--
834
+ ## Glossary
835
+
836
+ *Clearly define terms in order to be accessible across audiences.*
837
+ -->
838
+
839
+ <!--
840
+ ## Model Card Authors
841
+
842
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
843
+ -->
844
+
845
+ <!--
846
+ ## Model Card Contact
847
+
848
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
849
+ -->
config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "_name_or_path": "Snowflake/snowflake-arctic-embed-m-v2.0",
3
+ "architectures": [
4
+ "GteModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_hf_alibaba_nlp_gte.GteConfig",
9
+ "AutoModel": "Snowflake/snowflake-arctic-embed-m-v2.0--modeling_hf_alibaba_nlp_gte.GteModel"
10
+ },
11
+ "classifier_dropout": 0.1,
12
+ "hidden_act": "gelu",
13
+ "hidden_dropout_prob": 0.1,
14
+ "hidden_size": 768,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 3072,
17
+ "layer_norm_eps": 1e-12,
18
+ "layer_norm_type": "layer_norm",
19
+ "logn_attention_clip1": false,
20
+ "logn_attention_scale": false,
21
+ "matryoshka_dimensions": [
22
+ 256
23
+ ],
24
+ "max_position_embeddings": 8192,
25
+ "model_type": "gte",
26
+ "num_attention_heads": 12,
27
+ "num_hidden_layers": 12,
28
+ "pack_qkv": true,
29
+ "pad_token_id": 1,
30
+ "position_embedding_type": "rope",
31
+ "rope_scaling": null,
32
+ "rope_theta": 160000,
33
+ "torch_dtype": "float32",
34
+ "transformers_version": "4.49.0",
35
+ "type_vocab_size": 1,
36
+ "unpad_inputs": "true",
37
+ "use_memory_efficient_attention": "true",
38
+ "vocab_size": 250048
39
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "4.0.2",
4
+ "transformers": "4.49.0",
5
+ "pytorch": "2.6.0+cu126"
6
+ },
7
+ "prompts": {
8
+ "query": "query: "
9
+ },
10
+ "default_prompt_name": null,
11
+ "similarity_fn_name": "cosine"
12
+ }
configuration_hf_alibaba_nlp_gte.py ADDED
@@ -0,0 +1,145 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ GTE model configuration"""
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class GteConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
26
+ instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
27
+ configuration with the defaults will yield a similar configuration to that of the NEW
28
+ [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
29
+
30
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
31
+ documentation from [`PretrainedConfig`] for more information.
32
+
33
+
34
+ Args:
35
+ vocab_size (`int`, *optional*, defaults to 30522):
36
+ Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
37
+ `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
38
+ hidden_size (`int`, *optional*, defaults to 768):
39
+ Dimensionality of the encoder layers and the pooler layer.
40
+ num_hidden_layers (`int`, *optional*, defaults to 12):
41
+ Number of hidden layers in the Transformer encoder.
42
+ num_attention_heads (`int`, *optional*, defaults to 12):
43
+ Number of attention heads for each attention layer in the Transformer encoder.
44
+ intermediate_size (`int`, *optional*, defaults to 3072):
45
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
46
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
47
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
48
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
49
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
52
+ The dropout ratio for the attention probabilities.
53
+ max_position_embeddings (`int`, *optional*, defaults to 512):
54
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
55
+ just in case (e.g., 512 or 1024 or 2048).
56
+ type_vocab_size (`int`, *optional*, defaults to 2):
57
+ The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
58
+ initializer_range (`float`, *optional*, defaults to 0.02):
59
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
60
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
61
+ The epsilon used by the layer normalization layers.
62
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
63
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
64
+ rope_theta (`float`, *optional*, defaults to 10000.0):
65
+ The base period of the RoPE embeddings.
66
+ rope_scaling (`Dict`, *optional*):
67
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
68
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
69
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
70
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
71
+ these scaling strategies behave:
72
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
73
+ experimental feature, subject to breaking API changes in future versions.
74
+ classifier_dropout (`float`, *optional*):
75
+ The dropout ratio for the classification head.
76
+
77
+ Examples:
78
+
79
+ ```python
80
+ >>> from transformers import NewConfig, NewModel
81
+
82
+ >>> # Initializing a NEW izhx/new-base-en style configuration
83
+ >>> configuration = NewConfig()
84
+
85
+ >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
86
+ >>> model = NewModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+
92
+ model_type = "gte"
93
+
94
+ def __init__(
95
+ self,
96
+ vocab_size=30528,
97
+ hidden_size=768,
98
+ num_hidden_layers=12,
99
+ num_attention_heads=12,
100
+ intermediate_size=3072,
101
+ hidden_act="gelu",
102
+ hidden_dropout_prob=0.1,
103
+ attention_probs_dropout_prob=0.0,
104
+ max_position_embeddings=2048,
105
+ type_vocab_size=1,
106
+ initializer_range=0.02,
107
+ layer_norm_type='layer_norm',
108
+ layer_norm_eps=1e-12,
109
+ # pad_token_id=0,
110
+ position_embedding_type="rope",
111
+ rope_theta=10000.0,
112
+ rope_scaling=None,
113
+ classifier_dropout=None,
114
+ pack_qkv=True,
115
+ unpad_inputs=False,
116
+ use_memory_efficient_attention=False,
117
+ logn_attention_scale=False,
118
+ logn_attention_clip1=False,
119
+ **kwargs,
120
+ ):
121
+ super().__init__(**kwargs)
122
+
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_type = layer_norm_type
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.position_embedding_type = position_embedding_type
137
+ self.rope_theta = rope_theta
138
+ self.rope_scaling = rope_scaling
139
+ self.classifier_dropout = classifier_dropout
140
+
141
+ self.pack_qkv = pack_qkv
142
+ self.unpad_inputs = unpad_inputs
143
+ self.use_memory_efficient_attention = use_memory_efficient_attention
144
+ self.logn_attention_scale = logn_attention_scale
145
+ self.logn_attention_clip1 = logn_attention_clip1
eval/Information-Retrieval_evaluation_results.csv ADDED
@@ -0,0 +1,5 @@
1
+ epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,cosine-Precision@5,cosine-Recall@5,cosine-Precision@10,cosine-Recall@10,cosine-MRR@10,cosine-NDCG@10,cosine-MAP@100
2
+ 1.0,5793,0.6904885206283445,0.9026411185914034,0.9366476782323494,0.967892283790782,0.6904885206283445,0.6904885206283445,0.30088037286380115,0.9026411185914034,0.18732953564646984,0.9366476782323494,0.09678922837907819,0.967892283790782,0.7996936642198161,0.8416020003069878,0.8012900708593229
3
+ 2.0,11586,0.6941135853616434,0.8993612981184188,0.9375107888831348,0.9696185050923528,0.6941135853616434,0.6941135853616434,0.2997870993728063,0.8993612981184188,0.18750215777662693,0.9375107888831348,0.09696185050923527,0.9696185050923528,0.8014224200526644,0.8432305431101866,0.8029183926314174
4
+ 3.0,17379,0.693423096841015,0.8972898325565337,0.9387191437942344,0.9696185050923528,0.693423096841015,0.693423096841015,0.2990966108521779,0.8972898325565337,0.18774382875884688,0.9387191437942344,0.09696185050923527,0.9696185050923528,0.8008940730328616,0.8428217313706017,0.802428493797126
5
+ 4.0,23172,0.6903158984981874,0.8933195235629208,0.9368203003625065,0.9684101501812532,0.6903158984981874,0.6903158984981874,0.29777317452097357,0.8933195235629208,0.18736406007250128,0.9368203003625065,0.09684101501812532,0.9684101501812532,0.7984443458032272,0.8406723510269695,0.80003685068108
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e5180b42e713060bcf65a1ff5f11f8b27dca0230fc31b3f6512cfa7c99fd0726
3
+ size 1221487872
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Normalize",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 8192,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
3
+ size 17082988
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": true,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": true,
46
+ "cls_token": "<s>",
47
+ "eos_token": "</s>",
48
+ "extra_special_tokens": {},
49
+ "mask_token": "<mask>",
50
+ "max_length": 512,
51
+ "model_max_length": 32768,
52
+ "pad_to_multiple_of": null,
53
+ "pad_token": "<pad>",
54
+ "pad_token_type_id": 0,
55
+ "padding_side": "right",
56
+ "sep_token": "</s>",
57
+ "stride": 0,
58
+ "tokenizer_class": "XLMRobertaTokenizerFast",
59
+ "truncation_side": "right",
60
+ "truncation_strategy": "longest_first",
61
+ "unk_token": "<unk>"
62
+ }