nickcdryan/bitter-retrieval-converted-infonce-bert

This is a retrieval model fine-tuned with Converted InfoNCE on the MS MARCO dataset, with additional validation on SQuAD.

Model Details

  • Base Model: google-bert/bert-base-uncased
  • Training Method: Converted InfoNCE
  • Training Data: MS MARCO soft-labeled dataset
  • Validation Data: SQuAD v2 + MS MARCO
  • Framework: PyTorch + Transformers

Training Details

This model was trained using the bitter-retrieval framework with:

  • Training Method: Converted InfoNCE (a loss sketch follows this list)
  • Encoder: BERT-base-uncased
  • Max Sequence Length: 512 tokens
  • Batch Size: 32
  • Epochs: 2
  • Learning Rate: 2e-5
  • Temperature: 0.02
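
For reference, here is a minimal sketch of an in-batch InfoNCE loss at this temperature. It assumes standard in-batch negatives; the exact "Converted" variant implemented by the bitter-retrieval framework is not documented here, so treat this as an illustration rather than the actual training code.

import torch
import torch.nn.functional as F

def infonce_loss(query_emb, passage_emb, temperature=0.02):
    """In-batch InfoNCE: the positive for query i is passage i."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    # Similarity matrix between every query and every passage in the batch
    logits = query_emb @ passage_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)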

Usage

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Load model and tokenizer
model = AutoModel.from_pretrained("nickcdryan/bitter-retrieval-converted-infonce-bert")
tokenizer = AutoTokenizer.from_pretrained("nickcdryan/bitter-retrieval-converted-infonce-bert")

def encode_text(text, prefix=""):
    """Encode text with an optional prefix (e.g. "query: " or "passage: ")."""
    full_text = f"{prefix}{text}" if prefix else text
    inputs = tokenizer(full_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Mean pooling
        attention_mask = inputs['attention_mask']
        token_embeddings = outputs.last_hidden_state
        masked_embeddings = token_embeddings * attention_mask.unsqueeze(-1)
        sum_embeddings = masked_embeddings.sum(dim=1)
        count_tokens = attention_mask.sum(dim=1, keepdim=True).clamp(min=1)  # avoid division by zero
        embeddings = sum_embeddings / count_tokens
        # L2 normalize
        embeddings = F.normalize(embeddings, dim=-1)
    
    return embeddings

# Example usage
query = "What is machine learning?"
passage = "Machine learning is a subset of artificial intelligence..."

# Encode with prefixes (recommended)
query_emb = encode_text(query, "query: ")
passage_emb = encode_text(passage, "passage: ")

# Compute similarity
similarity = torch.cosine_similarity(query_emb, passage_emb)
print(f"Similarity: {similarity.item():.4f}")

Evaluation Metrics

The model was evaluated on both the SQuAD and MS MARCO datasets with the following metrics:

  • Retrieval Accuracy: How often the correct passage is retrieved
  • F1 Score: Token-level F1 between generated and reference answers (sketched after this list)
  • Exact Match: Exact match between generated and reference answers
  • LLM Judge: Semantic similarity judged by Gemini-2.0-flash
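
Token-level F1 here follows the usual SQuAD convention: precision and recall over the tokens shared by the predicted and reference answers. A minimal sketch, assuming whitespace tokenization and no answer normalization:

from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer (SQuAD-style)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)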

Training Framework

This model was trained using the bitter-retrieval framework, which implements various contrastive learning methods for retrieval tasks.

Citation

If you use this model, please cite:

@misc{bitter-retrieval-converted-infonce,
  title={Bitter Retrieval: Converted InfoNCE Fine-tuned BERT for Information Retrieval},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/nickcdryan/bitter-retrieval-converted-infonce-bert}}
}