nickcdryan/bitter-retrieval-converted-infonce-bert
This is a retrieval model fine-tuned with Converted InfoNCE on the MS MARCO dataset, with additional validation on SQuAD.
Model Details
- Base Model: google-bert/bert-base-uncased
- Training Method: Converted InfoNCE
- Training Data: MS MARCO soft-labeled dataset
- Validation Data: SQuAD v2 + MS MARCO
- Framework: PyTorch + Transformers
Training Details
This model was trained using the bitter-retrieval framework with:
- Training Method: Converted InfoNCE
- Encoder: BERT-base-uncased
- Max Sequence Length: 512 tokens
- Batch Size: 32
- Epochs: 2
- Learning Rate: 2e-5
- Temperature: 0.02
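The exact "Converted InfoNCE" objective is defined inside the bitter-retrieval framework and is not reproduced here. As a reference point, a standard in-batch InfoNCE loss using the temperature above can be sketched as follows (the `info_nce_loss` name and the in-batch-negatives setup are illustrative, not the framework's confirmed implementation):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.02):
    """In-batch InfoNCE sketch: the positive for query i is passage i;
    all other passages in the batch serve as negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    # (B, B) cosine-similarity matrix, sharpened by the temperature
    logits = query_emb @ passage_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

A low temperature such as 0.02 sharpens the softmax over the similarity matrix, so the loss focuses heavily on the hardest negatives in the batch.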
Usage
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Load model and tokenizer
model = AutoModel.from_pretrained("nickcdryan/bitter-retrieval-converted-infonce-bert")
tokenizer = AutoTokenizer.from_pretrained("nickcdryan/bitter-retrieval-converted-infonce-bert")

def encode_text(text, prefix=""):
    """Encode text with an optional prefix."""
    full_text = f"{prefix}{text}" if prefix else text
    inputs = tokenizer(full_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling over token embeddings, weighted by the attention mask
    attention_mask = inputs["attention_mask"]
    token_embeddings = outputs.last_hidden_state
    masked_embeddings = token_embeddings * attention_mask.unsqueeze(-1)
    sum_embeddings = masked_embeddings.sum(dim=1)
    count_tokens = attention_mask.sum(dim=1, keepdim=True)
    embeddings = sum_embeddings / count_tokens

    # L2 normalize so dot products equal cosine similarities
    embeddings = F.normalize(embeddings, dim=-1)
    return embeddings

# Example usage
query = "What is machine learning?"
passage = "Machine learning is a subset of artificial intelligence..."

# Encode with prefixes (recommended)
query_emb = encode_text(query, "query: ")
passage_emb = encode_text(passage, "passage: ")

# Compute similarity
similarity = torch.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")
```
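Because the embeddings are L2-normalized, scoring one query against many passages reduces to a single matrix product. A minimal ranking sketch, using random stand-in embeddings in place of `encode_text` so it runs without downloading the model (the passages are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

passages = [
    "Machine learning is a subset of artificial intelligence.",
    "The Eiffel Tower is located in Paris.",
    "Neural networks are trained with gradient descent.",
]

# Stand-ins for encode_text(query, "query: ") and encode_text(p, "passage: ")
query_emb = F.normalize(torch.randn(1, 768), dim=-1)                  # (1, 768)
passage_embs = F.normalize(torch.randn(len(passages), 768), dim=-1)   # (3, 768)

# With unit-norm embeddings, the dot product is the cosine similarity
scores = (query_emb @ passage_embs.T).squeeze(0)                      # (3,)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {passages[idx]}")
```

In a real pipeline you would precompute and cache `passage_embs` for the whole corpus, then score incoming queries against the cached matrix.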
Evaluation Metrics
The model was evaluated on both the SQuAD and MS MARCO datasets using the following metrics:
- Retrieval Accuracy: How often the correct passage is retrieved
- F1 Score: Token-level F1 between generated and reference answers
- Exact Match: Exact match between generated and reference answers
- LLM Judge: Semantic similarity judged by Gemini-2.0-flash
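Token-level F1 and exact match follow the usual SQuAD-style definitions. A minimal sketch of both (the framework's exact text normalization, e.g. punctuation and article stripping, may differ):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 over lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact string match after lowercasing and trimming whitespace."""
    return prediction.lower().strip() == reference.lower().strip()
```

Retrieval accuracy is then the fraction of queries for which the top-ranked passage is the gold passage, while F1 and exact match score the answers generated from the retrieved context.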
Training Framework
This model was trained using the bitter-retrieval framework, which implements various contrastive learning methods for retrieval tasks.
Citation
If you use this model, please cite:
@misc{bitter-retrieval-converted-infonce,
title={Bitter Retrieval: Converted InfoNCE Fine-tuned BERT for Information Retrieval},
author={Your Name},
year={2024},
howpublished={\url{https://huggingface.co/nickcdryan/bitter-retrieval-converted-infonce-bert}}
}