Overview

This model card describes the BioLinkBERT-based BALI-BERT model pre-trained with a Graph Attention Network (GAT) as the knowledge graph encoder. BALI-BERT is a biomedical language representation model enhanced through knowledge graph and language model alignment. It is based on the BERT architecture (specifically the PubMedBERT and BioLinkBERT variants) and has been pre-trained with the BALI method to incorporate external knowledge from biomedical Knowledge Graphs (KGs) such as UMLS.

The model demonstrates improved performance across several biomedical NLP tasks, including Question Answering (QA), Entity Linking (EL), and Relation Extraction (RE). The BALI approach aligns textual representations with structured biomedical knowledge by jointly training a language model and a KG encoder.
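
As a rough illustration of this idea (a minimal sketch, not the released BALI implementation), the snippet below pairs a BERT-style text encoder with a small GAT over KG concept nodes and aligns the two representation spaces with a symmetric contrastive loss. The module names, dimensions, and the use of torch_geometric are assumptions made for illustration only.

# Illustrative sketch: GAT encoder over KG nodes + contrastive text/KG alignment loss.
# All names, dimensions, and library choices here are assumptions, not the authors' code.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class KGEncoder(torch.nn.Module):
    """Two-layer GAT that maps initial concept embeddings to the LM hidden size."""
    def __init__(self, dim: int = 768, hidden: int = 256, heads: int = 4):
        super().__init__()
        self.conv1 = GATConv(dim, hidden, heads=heads)
        self.conv2 = GATConv(hidden * heads, dim, heads=1)

    def forward(self, x, edge_index):
        # x: [num_nodes, dim] initial node features; edge_index: [2, num_edges]
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

def alignment_loss(text_emb, kg_emb, temperature: float = 0.07):
    """InfoNCE-style loss over a batch of aligned (mention, concept) embedding pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    kg_emb = F.normalize(kg_emb, dim=-1)
    logits = text_emb @ kg_emb.t() / temperature   # pairwise similarities within the batch
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # symmetric cross-entropy: text -> concept and concept -> text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2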

Training Details

  • Text Corpus: Scientific abstracts from PubMed
  • Knowledge Graph: Unified Medical Language System (UMLS)
  • Training Size: 1.5M sentences from PubMed; 600K nodes from UMLS
  • Training Steps: ~65,000 steps
  • Batch Size: 256

πŸ“¦ How to Load the Model

# Install the dependency first: pip install transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("andorei/BALI-BERT-BioLinkBERT-base-GNN")
model = BertModel.from_pretrained("andorei/BALI-BERT-BioLinkBERT-base-GNN")
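
Once loaded, the model can be used like any other BERT encoder. The snippet below is an illustrative usage example (not part of the original card): it encodes a biomedical sentence and takes the [CLS] token embedding as a sentence representation.

import torch

# Encode a biomedical sentence and take the [CLS] vector as its representation.
inputs = tokenizer("Metformin is used to treat type 2 diabetes.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
sentence_embedding = outputs.last_hidden_state[:, 0, :]   # shape: [1, hidden_size]
print(sentence_embedding.shape)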

πŸ“œ Citation

If you use this model in your work, please cite the original paper:

@inproceedings{Sakhovskiy2025BALI,
  author = {Sakhovskiy, Andrey and Tutubalina, Elena},
  title = {BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25)},
  year = {2025}
}