---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---

# Negative Sampling PMB Model

## Model Description

SODA-VEC embedding model trained with negative sampling (MultipleNegativesRankingLoss). This is the PMB (PubMed) version, optimized for biomedical text similarity tasks using the standard sentence-transformers approach.

This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

**Key Features:**

- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on the **microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling

## Training Details

### Training Data

- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning

### Training Procedure

**Loss Function**: MultipleNegativesRankingLoss, the standard in-batch negative sampling loss from sentence-transformers

**Base Model**: `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`

**Training Configuration:**

- **GPUs**: 4
- **Batch Size per GPU**: 32
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 512 (4 GPUs × 32 × 4 accumulation steps)
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)
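**Illustrative Training Sketch:**

The actual training run uses `scripts/soda-vec-train.py` (see the command below). As a rough, non-authoritative sketch of the same setup, the configuration above can be expressed with the classic sentence-transformers `fit` API: the base model, mean pooling, loss, per-GPU batch size, learning rate, warmup, and epoch count are taken from the list above, the title-abstract pair is a placeholder, and multi-GPU training and gradient accumulation are omitted.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Base encoder with mean pooling, as listed in the configuration above
word_embedding = models.Transformer("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# One InputExample per (title, abstract) pair; within a batch, every other
# abstract acts as an in-batch negative for MultipleNegativesRankingLoss
train_examples = [
    InputExample(texts=["Example paper title", "Example paper abstract ..."]),  # placeholder pair
    # ... 26.5M pairs in the full dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
```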
**Training Command:**

```bash
python scripts/soda-vec-train.py --config negative_sampling --push_to_hub --hub_org EMBO --save_limit 5
```

### Model Architecture

- **Base Architecture**: ModernBERT-base (12 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings (for VICReg-based models)

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("EMBO/negative_sampling_pmb")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Hugging Face Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/negative_sampling_pmb")
model = AutoModel.from_pretrained("EMBO/negative_sampling_pmb")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over non-padding tokens (matches the model's mean-pooling strategy)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2-normalize (optional for cosine similarity; used by the VICReg-based SODA-VEC models)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```

## Evaluation

The model has been evaluated on comprehensive biomedical benchmarks, including:

- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora

For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/source-data/soda-vec).

## Intended Use

This model is designed for:

- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages
- **Scientific Text Similarity**: Computing similarity between biomedical texts

## Limitations

- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general-domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking
- **Bias**: Inherits biases from the training data (PubMed Central corpus)

## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title  = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year   = {2024},
  url    = {https://github.com/source-data/soda-vec}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/source-data/soda-vec).

---

**Model Card Generated**: 2025-11-10