
GDC Cohort LLM - GPT2 / User + 100K Synthetic data

GDC Cohort LLM is a language model that translates natural language descriptions of patient cohorts from the NCI Genomic Data Commons (GDC) into the structured JSON cohort filters used by the GDC for search, retrieval, and analysis of cancer genomic data.
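
For example, a description like "bam files from TCGA" corresponds to a nested op/content filter of roughly the following shape (an illustrative, hand-written sketch in the standard GDC filter syntax; the exact fields and value casing emitted by the model may differ):

{
  "op": "and",
  "content": [
    {
      "op": "in",
      "content": {
        "field": "files.data_format",
        "value": ["bam"]
      }
    },
    {
      "op": "in",
      "content": {
        "field": "cases.project.program.name",
        "value": ["TCGA"]
      }
    }
  ]
}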

gdc-cohort-llm-gpt2-s100K is the variant of GDC Cohort LLM obtained by training a GPT2 model on user-derived cohort filters plus 100K synthetically sampled GDC cohort filters. The model is adapted from the pretrained weights of openai-community/gpt2.

GDC Cohort Copilot is the corresponding web app, running on HuggingFace Spaces; it specifically uses the gdc-cohort-llm-gpt2-s1M version of GDC Cohort LLM. Full details of our model development are provided in our paper and GitHub repo.

Model Variations

| GDC Cohort LLM version | HuggingFace Link | Base Model | Training Data | Note |
|---|---|---|---|---|
| GPT2 / User data | uc-ctds/gdc-cohort-llm-gpt2-u | openai-community/gpt2 | User Data | |
| GPT2 / User + 100K Synthetic data | uc-ctds/gdc-cohort-llm-gpt2-s100K | openai-community/gpt2 | User + 100K Synthetic Data | |
| GPT2 / User + 1M Synthetic data | uc-ctds/gdc-cohort-llm-gpt2-s1M | openai-community/gpt2 | User + 1M Synthetic Data | Deployed with GDC Cohort Copilot |
| BART / User data | uc-ctds/gdc-cohort-llm-bart-u | facebook/bart-base | User Data | |
| Mistral LORA / User data | uc-ctds/gdc-cohort-llm-mistral-lora-u | mistralai/Mistral-7B-Instruct-v0.3 | User Data | |
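
Because this checkpoint is a standard GPT2 causal language model in safetensors format, it can also be loaded directly with the transformers library. The snippet below is a minimal sketch of that path (the schema-constrained generation described in the next section is preferred; without it, the generated text is not guaranteed to be a valid cohort filter):

# Minimal sketch: plain transformers loading, without schema-constrained decoding.
# The output may not be a well-formed GDC cohort filter; see the next section
# for the constrained-generation setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "uc-ctds/gdc-cohort-llm-gpt2-s100K"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("bam files from TCGA", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))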

Getting Started with GDC Cohort LLM

While GDC Cohort LLM is trained over structured JSON outputs, generation is greatly improved by using a structured generation framework with a JSON schema defined by a pydantic model. We provide a lightweight pydantic model for GDC cohort filter JSONs in our GitHub repo. Using this schema and vLLM for structured generation, the model can be used as follows:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# GDCCohortSchema is the lightweight pydantic model from our GitHub repo
from schema import GDCCohortSchema

JSON_SCHEMA = GDCCohortSchema.model_json_schema()

MODEL_NAME = "uc-ctds/gdc-cohort-llm-gpt2-s100K"
QUERY = "bam files from TCGA"

# Constrain decoding to the GDC cohort filter JSON schema
decoding_params = GuidedDecodingParams(json=JSON_SCHEMA)
sampling_params = SamplingParams(
    n=1,
    temperature=0,
    seed=42,
    max_tokens=1024,
    guided_decoding=decoding_params,
)

llm = LLM(model=MODEL_NAME)

outputs = llm.generate(
    prompts=[QUERY],
    sampling_params=sampling_params,
)
cohort_filter = outputs[0].outputs[0].text
print(cohort_filter)
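
The generated cohort_filter is a JSON string. As a rough sketch of one possible downstream use (illustrative only; it relies on the public GDC API at api.gdc.cancer.gov, and the requested fields and result size are arbitrary choices), the filter can be parsed and submitted to the GDC files endpoint:

import json

import requests

# Parse the generated filter and use it to query the public GDC API.
filters = json.loads(cohort_filter)
response = requests.post(
    "https://api.gdc.cancer.gov/files",
    json={
        "filters": filters,
        "fields": "file_id,file_name",
        "format": "JSON",
        "size": 10,
    },
)
response.raise_for_status()
for hit in response.json()["data"]["hits"]:
    print(hit["file_id"], hit["file_name"])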

Performance

We demonstrate that our trained models can drastically outperform GPT-4o prompting, even when providing the full data dictionary to GPT-4o. A detailed explanation of our evaluation metrics is provided in our paper.

| GDC Cohort LLM version | TPR | IoU | Exact | BERT |
|---|---|---|---|---|
| BART / User data | 0.117 | 0.078 | 0.028 | 0.735 |
| Mistral LORA / User data | 0.124 | 0.117 | 0.092 | 0.835 |
| GPT2 / User data | 0.365 | 0.331 | 0.221 | 0.819 |
| GPT2 / User + 100K Synthetic data | 0.783 | 0.748 | 0.607 | 0.902 |
| GPT2 / User + 1M Synthetic data | 0.855 | 0.832 | 0.702 | 0.919 |
| GPT-4o (prompting w/ data dict) | 0.720 | 0.698 | 0.558 | 0.894 |

Citation

@article{song2025gdc,
  title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons},
  author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L},
  journal={arXiv preprint arXiv:2507.02221},
  year={2025}
}