GDC Cohort LLM - GPT2 / User + 100K Synthetic data
GDC Cohort LLM is a language model which translates natural language descriptions of patient cohorts from the NCI Genomic Data Commons (GDC) into the structured JSON cohort filters used by GDC for search, retrieval, and analysis of cancer genomic data.
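For example, a natural language request like "bam files from TCGA" corresponds to a nested GDC cohort filter along the lines of the sketch below (illustrative only; the exact fields and values emitted by the model may differ):

{
  "op": "and",
  "content": [
    {"op": "in", "content": {"field": "files.data_format", "value": ["bam"]}},
    {"op": "in", "content": {"field": "cases.project.program.name", "value": ["TCGA"]}}
  ]
}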
gdc-cohort-llm-gpt2-s100K is a variant of GDC Cohort LLM that fine-tunes a GPT2 model on user-derived and 100K synthetically sampled GDC cohort filters. This model is adapted from the pretrained weights of openai-community/gpt2.
GDC Cohort Copilot is the corresponding web app running on HuggingFace Spaces; it specifically uses the gdc-cohort-llm-gpt2-s1M version of GDC Cohort LLM. Full details of our model development are provided in our paper and GitHub repo.
Model Variations
| GDC Cohort LLM version | HuggingFace Link | Base Model | Training Data | Note |
|---|---|---|---|---|
| GPT2 / User data | uc-ctds/gdc-cohort-llm-gpt2-u | openai-community/gpt2 | User Data | |
| GPT2 / User + 100K Synthetic data | uc-ctds/gdc-cohort-llm-gpt2-s100K | openai-community/gpt2 | User + 100K Synthetic Data | |
| GPT2 / User + 1M Synthetic data | uc-ctds/gdc-cohort-llm-gpt2-s1M | openai-community/gpt2 | User + 1M Synthetic Data | Deployed with GDC Cohort Copilot |
| BART / User data | uc-ctds/gdc-cohort-llm-bart-u | facebook/bart-base | User Data | |
| Mistral LORA / User data | uc-ctds/gdc-cohort-llm-mistral-lora-u | mistralai/Mistral-7B-Instruct-v0.3 | User Data | |
Getting Started with GDC Cohort LLM
While GDC Cohort LLM is trained on structured JSON outputs, generation is greatly improved by using a structured generation framework with a JSON schema defined by a pydantic model. We provide a lightweight pydantic model for GDC cohort filter JSONs in our GitHub repo. Using this schema and vLLM for structured generation, this model can be used as follows:
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

from schema import GDCCohortSchema  # lightweight pydantic model from our GitHub repo

# JSON schema used to constrain generation to valid GDC cohort filters
JSON_SCHEMA = GDCCohortSchema.model_json_schema()

MODEL_NAME = "uc-ctds/gdc-cohort-llm-gpt2-s100K"
QUERY = "bam files from TCGA"

# Guided (structured) decoding constrained to the cohort filter schema
decoding_params = GuidedDecodingParams(json=JSON_SCHEMA)
sampling_params = SamplingParams(
    n=1,
    temperature=0,  # greedy decoding
    seed=42,
    max_tokens=1024,
    guided_decoding=decoding_params,
)

llm = LLM(model=MODEL_NAME)
outputs = llm.generate(
    prompts=[QUERY],
    sampling_params=sampling_params,
)

# The generated cohort filter JSON string
cohort_filter = outputs[0].outputs[0].text
print(cohort_filter)
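The generated string can also be validated with the same pydantic schema before it is used downstream. A minimal sketch, assuming GDCCohortSchema is a standard pydantic BaseModel as imported above:

import json

from pydantic import ValidationError

from schema import GDCCohortSchema  # same lightweight pydantic model as above

# Round-trip the generated text through the schema to confirm it is a valid cohort filter
try:
    validated = GDCCohortSchema.model_validate_json(cohort_filter)
except ValidationError as err:
    raise SystemExit(f"Model produced an invalid cohort filter: {err}")

# Compact JSON form of the validated filter
print(json.dumps(validated.model_dump(exclude_none=True)))

The validated filter can then be used with the GDC, for example as the filters parameter of GDC API queries.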
Performance
We demonstrate that our trained models can drastically outperform GPT-4o prompting, even when a full data dictionary is provided to GPT-4o. A detailed explanation of our evaluation metrics is provided in our paper.
| GDC Cohort LLM version | TPR | IoU | Exact | BERT |
|---|---|---|---|---|
| BART / User data | 0.117 | 0.078 | 0.028 | 0.735 |
| Mistral LORA / User data | 0.124 | 0.117 | 0.092 | 0.835 |
| GPT2 / User data | 0.365 | 0.331 | 0.221 | 0.819 |
| GPT2 / User + 100K Synthetic data | 0.783 | 0.748 | 0.607 | 0.902 |
| GPT2 / User + 1M Synthetic data | 0.855 | 0.832 | 0.702 | 0.919 |
| GPT-4o (prompting w/ data dict) | 0.720 | 0.698 | 0.558 | 0.894 |
Citation
@article{song2025gdc,
title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons},
author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L},
journal={arXiv preprint arXiv:2507.02221},
year={2025}
}