GDC Cohort LLM - GPT2 / User + 100K Synthetic data
GDC Cohort LLM is a language model which translates natural language descriptions of patient cohorts from the NCI Genomic Data Commons (GDC) into the structured JSON cohort filters used by GDC for search, retrieval, and analysis of cancer genomic data.
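For example, a natural language request like "bam files from TCGA" corresponds to a nested GDC cohort filter along the lines of the sketch below (illustrative only; the exact fields and values emitted by the model may differ):

{
  "op": "and",
  "content": [
    {"op": "in", "content": {"field": "files.data_format", "value": ["bam"]}},
    {"op": "in", "content": {"field": "cases.project.program.name", "value": ["TCGA"]}}
  ]
}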
gdc-cohort-llm-gpt2-s100K is a variant of GDC Cohort LLM that fine-tunes a GPT2 model on user-derived and 100K synthetically sampled GDC cohort filters. This model is adapted from the pretrained weights of openai-community/gpt2.
GDC Cohort Copilot is the corresponding web app running on HuggingFace Spaces; it specifically uses the gdc-cohort-llm-gpt2-s1M version of GDC Cohort LLM. Full details of our model development are provided in our paper and GitHub repo.
Model Variations
| GDC Cohort LLM version | HuggingFace Link | Base Model | Training Data | Note |
|---|---|---|---|---|
| GPT2 / User data | uc-ctds/gdc-cohort-llm-gpt2-u | openai-community/gpt2 | User Data | |
| GPT2 / User + 100K Synthetic data | uc-ctds/gdc-cohort-llm-gpt2-s100K | openai-community/gpt2 | User + 100K Synthetic Data | |
| GPT2 / User + 1M Synthetic data | uc-ctds/gdc-cohort-llm-gpt2-s1M | openai-community/gpt2 | User + 1M Synthetic Data | Deployed with GDC Cohort Copilot |
| BART / User data | uc-ctds/gdc-cohort-llm-bart-u | facebook/bart-base | User Data | |
| Mistral LORA / User data | uc-ctds/gdc-cohort-llm-mistral-lora-u | mistralai/Mistral-7B-Instruct-v0.3 | User Data | |
Getting Started with GDC Cohort LLM
While GDC Cohort LLM is trained on structured JSON outputs, generation is greatly improved by using a structured generation framework with a JSON schema defined by a pydantic model. We provide a lightweight pydantic model for GDC cohort filter JSONs in our GitHub repo. Using this schema and vLLM for structured generation, this model can be used as follows:
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

from schema import GDCCohortSchema  # lightweight pydantic model from our GitHub repo

# JSON schema used to constrain generation to valid GDC cohort filters
JSON_SCHEMA = GDCCohortSchema.model_json_schema()

MODEL_NAME = "uc-ctds/gdc-cohort-llm-gpt2-s100K"
QUERY = "bam files from TCGA"

# Guided (structured) decoding constrained to the cohort filter schema
decoding_params = GuidedDecodingParams(json=JSON_SCHEMA)
sampling_params = SamplingParams(
    n=1,
    temperature=0,  # greedy decoding
    seed=42,
    max_tokens=1024,
    guided_decoding=decoding_params,
)

llm = LLM(model=MODEL_NAME)
outputs = llm.generate(
    prompts=[QUERY],
    sampling_params=sampling_params,
)

# The generated cohort filter JSON string
cohort_filter = outputs[0].outputs[0].text
print(cohort_filter)
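The generated string can also be validated with the same pydantic schema before it is used downstream. A minimal sketch, assuming GDCCohortSchema is a standard pydantic BaseModel as imported above:

import json

from pydantic import ValidationError

from schema import GDCCohortSchema  # same lightweight pydantic model as above

# Round-trip the generated text through the schema to confirm it is a valid cohort filter
try:
    validated = GDCCohortSchema.model_validate_json(cohort_filter)
except ValidationError as err:
    raise SystemExit(f"Model produced an invalid cohort filter: {err}")

# Compact JSON form of the validated filter
print(json.dumps(validated.model_dump(exclude_none=True)))

The validated filter can then be used with the GDC, for example as the filters parameter of GDC API queries.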
Performance
We demonstrate that our trained models can drastically outperform GPT-4o prompting, even when a full data dictionary is provided to GPT-4o. A detailed explanation of our evaluation metrics is provided in our paper.
| GDC Cohort LLM version | TPR | IoU | Exact | BERT |
|---|---|---|---|---|
| BART / User data | 0.117 | 0.078 | 0.028 | 0.735 |
| Mistral LORA / User data | 0.124 | 0.117 | 0.092 | 0.835 |
| GPT2 / User data | 0.365 | 0.331 | 0.221 | 0.819 |
| GPT2 / User + 100K Synthetic data | 0.783 | 0.748 | 0.607 | 0.902 |
| GPT2 / User + 1M Synthetic data | 0.855 | 0.832 | 0.702 | 0.919 |
| GPT-4o (prompting w/ data dict) | 0.720 | 0.698 | 0.558 | 0.894 |
Citation
@article{song2025gdc,
title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons},
author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L},
journal={arXiv preprint arXiv:2507.02221},
year={2025}
}