stella_en_400M_v5_finetuned_embedding_scorer
This is a sentence-transformers model finetuned from NovaSearch/stella_en_400M_v5, intended to serve as a better embedding model for scoring text explanations of sparse autoencoders (SAEs).
It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: NovaSearch/stella_en_400M_v5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: json
- Language: en
- License: apache-2.0
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
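Once the model is loaded, the settings shown in the architecture printout can be read back through the standard sentence-transformers accessors. A minimal sketch, assuming the repository id used in the usage example below:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cristiano-sartori/stella_finetuned1")

# Should match the architecture printout above:
# 512-token inputs, 1024-dimensional sentence embeddings
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024

# The second module is the mean-pooling layer
print(model[1].pooling_mode_mean_tokens)         # True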
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/stella_finetuned1")
# Run inference
sentences = [
'Describe the techniques that typical dynamically scheduled\n processors use to achieve the same purpose of the following features\n of Intel Itanium: (a) Predicated execution; (b) advanced\n loads---that is, loads moved before a store and explicit check for\n RAW hazards; (c) speculative loads---that is, loads moved before a\n branch and explicit check for exceptions; (d) rotating register\n file.',
'Dynamically scheduled processors are designed to improve the efficiency of instruction execution by allowing the CPU to make decisions at runtime about the order of instruction execution. Let\'s break down each feature you mentioned from the Intel Itanium architecture and see how typical dynamically scheduled processors achieve similar goals.\n\n### (a) Predicated Execution\n\n**Intuition:**\nPredicated execution allows the processor to execute instructions based on certain conditions without using traditional branching (like `if` statements). This helps to avoid pipeline stalls that can occur when a branch is taken.\n\n**Example:**\nImagine you have the following pseudo-code:\n```c\nif (x > 0) {\n y = z + 1;\n} else {\n y = z - 1;\n}\n```\n\nIn a predicated execution model, instead of branching, the processor can execute both instructions but use a predicate (a boolean condition) to determine which result to keep:\n```assembly\np1 = (x > 0)\ny1 = z + 1; // Execute regardless\ny2 = z - 1; // Execute regardless\ny = p1 ? y1 : y2; // Keep the result based on p1\n```\n\n**Dynamically Scheduled Processors:**\nThese processors use techniques like "instruction scheduling" and "register renaming" to allow for instructions to be executed out of order while avoiding the pitfalls of branches. The hardware can evaluate conditions ahead of time and execute the necessary instructions while keeping track of which values are valid.\n\n### (b) Advanced Loads\n\n**Intuition:**\nAdvanced loads allow the processor to move load instructions (fetching data from memory) ahead of store instructions (writing data to memory), while also checking for Read After Write (RAW) hazards to ensure data correctness.\n\n**Example:**\nConsider the following pseudo-code:\n```c\na = b; // Store b into a\nc = a; // Load a into c\n```\n\nIf `a` is stored before it\'s loaded again, there can be a dependency. Advanced load techniques allow the processor to load `c = a` even before it knows if the value of `a` has been updated, as long as it checks that no store operation that modifies `a` has occurred.\n\n**Dynamically Scheduled Processors:**\nThese processors often use a technique called "out-of-order execution." They keep track of the status of loads and stores in a structure like a reorder buffer. When a load is scheduled, the processor checks if any previous instructions modify the value it needs (checking for RAW hazards) before executing the load.\n\n### (c) Speculative Loads\n\n**Intuition:**\nSpeculative loads allow the processor to execute load instructions before it knows the outcome of branches, which can keep the pipeline filled and reduce stalls.\n\n**Example:**\nConsider a branch that depends on a condition:\n```c\nif (x > 0) {\n a = b + c;\n}\n```\n\nInstead of waiting for the branch to be resolved, a speculative load might execute `load a` regardless of the branch\'s result. If the branch is taken, the processor can discard this load if it\'s not necessary.\n\n**Dynamically Scheduled Processors:**\nThese processors utilize "speculative execution," where they predict the likely path of execution based on past behavior. 
They perform loads and even entire blocks of instructions speculatively and have mechanisms to roll back if they guessed wrong while checking for exceptions (like accessing invalid memory).\n\n### (d) Rotating Register File\n\n**Intuition:**\nA rotating register file allows the processor to efficiently manage registers, effectively giving it more registers to work with by cycling through them for different contexts or states.\n\n**Example:**\nIn a simple program, if you have a limited number of registers but multiple functions, rotating registers means that as soon as one function completes, its registers can be reused for the next function without needing to save them to memory.\n\n**Dynamically Scheduled Processors:**\nMany dynamically scheduled processors use a "register renaming" technique, which allows them to allocate physical registers dynamically. When an instruction is ready to execute, it checks which registers are free and assigns one, effectively "rotating" the use of registers without the programmer needing to manage this directly.\n\n### Summary\n\nIn summary, dynamically scheduled processors use advanced techniques like out-of-order execution, speculative execution, and register renaming to achieve efficiency and performance similar to the features found in Intel Itanium. These techniques help to minimize stalls and maximize instruction throughput by allowing more flexibility in how instructions are executed relative to their dependencies and branch outcomes.',
"The question at hand explores whether it is possible to add new documents to a collection such that one document, , is ranked higher than another document, , based on a specific query, while also allowing for the possibility of ranking higher than simultaneously.\n\nTo analyze this problem, we begin by examining the two documents in question: \n\n- Document contains three occurrences of 'a', one occurrence of 'b', and none of 'c' (represented as ).\n- Document has one occurrence each of 'a', 'b', and 'c' (represented as ).\n\nGiven the query , our focus lies on the occurrences of 'a' and 'b' in both documents.\n\nNext, we calculate the term frequencies for the relevant terms in each document:\n\n- For , the term frequencies are:\n - \n - \n - \n\n- For , the term frequencies are:\n - \n - \n - \n\nThe total number of terms in each document is calculated as follows:\n\n- Total terms in (3 'a's + 1 'b' + 0 'c's).\n- Total terms in (1 'a' + 1 'b' + 1 'c').\n\nWe will apply the smoothed probabilistic retrieval model using the formula:\n\\[\nP(w | d) = \\frac{f_{d}(w) + \\lambda \\cdot P(w | C)}{N + \\lambda \\cdot |V|}\n\\]\nwhere is the total number of terms in the document, is the size of the vocabulary (which is 3 in this case), and is the probability of the word in the overall collection.\n\nAssuming a uniform distribution for the collection, we calculate:\n- \n- \n- \n\nNow, we compute the probabilities for the query terms for each document.\n\nFor document :\n- Probability of 'a':\n\\[\nP(a | d_1) = \\frac{3 + 0.5 \\cdot 0.4}{4 + 0.5 \\cdot 3} = \\frac{3 + 0.2}{4 + 1.5} = \\frac{3.2}{5.5} \\approx 0.5818\n\\]\n- Probability of 'b':\n\\[\nP(b | d_1) = \\frac{1 + 0.5 \\cdot 0.2}{4 + 0.5 \\cdot 3} = \\frac{1 + 0.1}{5.5} = \\frac{1.1}{5.5} \\approx 0.2\n\\]\n- Combined score for for the query :\n\\[\nP(q | d_1) = P(a | d_1) \\cdot P(b | d_1) \\approx 0.5818 \\cdot 0.2 \\approx 0.1164\n\\]\n\nFor document :\n- Probability of 'a':\n\\[\nP(a | d_2) = \\frac{1 + 0.5 \\cdot 0.4}{3 + 0.5 \\cdot 3} = \\frac{1 + 0.2}{4.5} = \\frac{1.2}{4.5} \\approx 0.2667\n\\]\n- Probability of 'b':\n\\[\nP(b | d_2) = \\frac{1 + 0.5 \\cdot 0.2}{3 + 0.5 \\cdot 3} = \\frac{1 + 0.1}{4.5} = \\frac{1.1}{4.5} \\approx 0.2444\n\\]\n- Combined score for for the query :\n\\[\nP(q | d_2) = P(a | d_2) \\cdot P(b | d_2) \\approx 0.2667 \\cdot 0.2444 \\approx 0.0652\n\\]\n\nAt this stage, we find that and , indicating that currently ranks higher than .\n\nTo explore the possibility of achieving both and , we consider the addition of new documents. While it is theoretically possible to manipulate rankings by introducing documents that alter the frequency of terms, the fundamental nature of probabilistic scoring means that achieving both conditions simultaneously is implausible. Specifically, any document that increases the score of will likely decrease the score of and vice versa due to the competitive nature of the scoring based on term frequencies.\n\nIn conclusion, while document addition can influence individual rankings, the inherent constraints of probabilistic retrieval prevent the simultaneous fulfillment of both ranking conditions. Therefore, the answer is **no, it is not possible** to enforce both rankings as required.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
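Because the model is finetuned for scoring SAE explanations, a natural workflow is to embed a reference explanation together with candidate explanations and rank the candidates by cosine similarity. The sketch below is illustrative only: the explanation strings are hypothetical, and only the repository id and the sentence-transformers API are taken from the example above.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cristiano-sartori/stella_finetuned1")

# Hypothetical reference explanation for an SAE feature, plus two candidates
reference = "This feature activates on mentions of European capital cities."
candidates = [
    "Fires on names of capital cities in Europe.",
    "Activates on arithmetic expressions involving fractions.",
]

# Embed and compare with the model's similarity function (cosine similarity)
ref_emb = model.encode([reference])
cand_embs = model.encode(candidates)
scores = model.similarity(ref_emb, cand_embs)  # shape [1, 2]

# A higher score means the candidate explanation is closer to the reference
for text, score in zip(candidates, scores[0]):
    print(f"{score.item():.3f}  {text}")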