# nickcdryan/bitter-retrieval-converted-infonce-bert
This is a retrieval model fine-tuned with Converted InfoNCE on the MS MARCO dataset, with additional validation on SQuAD.
## Model Details
- Base Model: google-bert/bert-base-uncased
- Training Method: Converted InfoNCE
- Training Data: MS MARCO soft-labeled dataset
- Validation Data: SQuAD v2 + MS MARCO
- Framework: PyTorch + Transformers
## Training Details
This model was trained using the bitter-retrieval framework with:
- Training Method: Converted InfoNCE
- Encoder: BERT-base-uncased
- Max Sequence Length: 512 tokens
- Batch Size: 32
- Epochs: 2
- Learning Rate: 2e-5
- Temperature: 0.02
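For illustration, the temperature-scaled InfoNCE objective underlying this kind of contrastive training can be sketched in plain Python as below. This is a generic in-batch InfoNCE; the exact "converted" variant used by the bitter-retrieval framework may differ, and `infonce_loss` is a hypothetical helper name, not part of the framework.

```python
import math

def infonce_loss(query_emb, passage_embs, pos_index, temperature=0.02):
    """In-batch InfoNCE: cross-entropy over temperature-scaled similarities.

    query_emb and passage_embs are assumed to be L2-normalized vectors
    (lists of floats), so the dot product equals cosine similarity.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity of the query to every passage, sharpened by the temperature
    logits = [dot(query_emb, p) / temperature for p in passage_embs]

    # Numerically stable log-sum-exp for the softmax denominator
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))

    # Negative log-probability of the positive passage
    return log_sum - logits[pos_index]
```

With a low temperature such as 0.02, even small similarity gaps between the positive and the in-batch negatives translate into near-zero loss, which is why the hyperparameter matters so much for contrastive training.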
## Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Load model and tokenizer
model = AutoModel.from_pretrained("nickcdryan/bitter-retrieval-converted-infonce-bert")
tokenizer = AutoTokenizer.from_pretrained("nickcdryan/bitter-retrieval-converted-infonce-bert")
model.eval()  # disable dropout for deterministic embeddings

def encode_text(text, prefix=""):
    """Encode text with an optional prefix and mean-pool into a single vector."""
    full_text = f"{prefix}{text}" if prefix else text
    inputs = tokenizer(full_text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        # Mean pooling over non-padding tokens
        attention_mask = inputs["attention_mask"]
        token_embeddings = outputs.last_hidden_state
        masked_embeddings = token_embeddings * attention_mask.unsqueeze(-1)
        sum_embeddings = masked_embeddings.sum(dim=1)
        count_tokens = attention_mask.sum(dim=1, keepdim=True)
        embeddings = sum_embeddings / count_tokens
        # L2 normalize so dot products equal cosine similarity
        embeddings = F.normalize(embeddings, dim=-1)

    return embeddings

# Example usage
query = "What is machine learning?"
passage = "Machine learning is a subset of artificial intelligence..."

# Encode with prefixes (recommended)
query_emb = encode_text(query, "query: ")
passage_emb = encode_text(passage, "passage: ")

# Compute similarity
similarity = torch.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")
```
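Because the embeddings are L2-normalized, ranking a set of candidate passages for a query reduces to sorting by dot product. A minimal sketch in plain Python, assuming unit-normalized embedding vectors as produced above (`rank_passages` is a hypothetical helper, not part of the model's API):

```python
def rank_passages(query_emb, passage_embs):
    """Return passage indices sorted best-first by cosine similarity.

    Embeddings are assumed L2-normalized, so the dot product
    equals cosine similarity.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Score every passage against the query, keeping the original index
    scores = [(dot(query_emb, p), i) for i, p in enumerate(passage_embs)]
    return [i for _, i in sorted(scores, reverse=True)]
```

In practice you would encode each passage once with `encode_text(passage, "passage: ")`, stack the results, and reuse them across queries.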
## Evaluation Metrics
The model was evaluated on both the SQuAD and MS MARCO datasets with the following metrics:
- Retrieval Accuracy: How often the correct passage is retrieved
- F1 Score: Token-level F1 between generated and reference answers
- Exact Match: Exact match between generated and reference answers
- LLM Judge: Semantic similarity judged by Gemini-2.0-flash
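For reference, token-level F1 and exact match follow the usual SQuAD-style definitions. A minimal sketch, omitting the full SQuAD answer normalization (article removal, punctuation stripping) that the official evaluation script applies:

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the predicted answer matches the reference exactly (case-insensitive)."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts shared tokens, respecting duplicates
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict and binary, while token F1 gives partial credit when the generated answer overlaps the reference.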
## Training Framework
This model was trained using the bitter-retrieval framework, which implements various contrastive learning methods for retrieval tasks.
## Citation

If you use this model, please cite:

```bibtex
@misc{bitter-retrieval-converted-infonce,
  title={Bitter Retrieval: Converted InfoNCE Fine-tuned BERT for Information Retrieval},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/nickcdryan/bitter-retrieval-converted-infonce-bert}}
}
```