---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- modernbert
- oncology
- clinical-trials
- eligibility
- binary-classification
- text-matching
---
# TrialChecker-0825
**TrialChecker-0825** is a binary text classifier that estimates whether a given **clinical trial “space”** is a **reasonable consideration** for a patient, given the patient’s summary.
It is fine-tuned from **[`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large)** for sequence classification on pairs of *(trial space, patient summary)*.
> **Important:** This is a research prototype for model development, **not** a medical device and **not** intended for clinical decision-making.
---
## What counts as a “trial space”?
A *trial space* is a concise description of the target population a trial aims to enroll, focusing on:
- Cancer type & histology
- Burden of disease (curative vs metastatic)
- Prior or excluded treatments
- Required / excluded biomarkers
(Boilerplate exclusion rules, e.g. heart failure or uncontrolled brain metastases, are **not** part of the trial space itself; they can be screened separately, for example with OncoReasoning-3B, BoilerplateChecker-0825, or other logic.)
---
## Training summary
The classifier was trained with a script that:
1. Loads three sources of annotated patient–trial pairs:
- Pairs originating from space-specific eligibility checks
- “Patient→top-cohorts” checks (rounds 1–3)
- “Trial-space→top patients” checks (rounds 1–3)
2. Deduplicates by `['patient_summary', 'this_space']`
3. Builds the final text input as:
```
text = this_space + "\nNow here is the patient summary:" + patient_summary
```
4. Uses `eligibility_result` as the **binary label** (0/1)
5. Fine-tunes **ModernBERT-large** for sequence classification (2 labels) with `max_length` **2048**
### Key hyperparameters from training
- Base model: `answerdotai/ModernBERT-large`
- Max length: **2048**
- Optimizer settings: `learning_rate=2e-5`, `weight_decay=0.01`
- Batch size: `per_device_train_batch_size=4`
- Epochs: `2`
- Save strategy: `epoch`
- Tokenizer: `AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")`
- Data collator: `DataCollatorWithPadding`
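A minimal sketch of the corresponding `Trainer` setup, assuming standard `transformers` APIs; the output path and dataset handling are illustrative, and the original script may differ:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

BASE = "answerdotai/ModernBERT-large"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)

def tokenize(batch):
    # Compose each (trial space, patient summary) pair into one string, as in training
    texts = [
        s + "\nNow here is the patient summary:" + p
        for s, p in zip(batch["this_space"], batch["patient_summary"])
    ]
    return tokenizer(texts, truncation=True, max_length=2048)

# `train_ds` is assumed to be a datasets.Dataset with the columns above plus `label`:
# train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="trialchecker-0825",   # illustrative path
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
# trainer.train()
```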
---
## Intended use
- **Input:** a string describing the trial space and a patient summary string
- **Output:** probability that the trial is a **reasonable consideration** for that patient (not full eligibility)
Use cases:
- Ranking candidate trial spaces for a patient
- Early triage before detailed eligibility review (including boilerplate exclusions)
Out of scope:
- Confirming formal eligibility or safety
- Clinical decision support
---
## Inference (Transformers)
### Quick start (single example)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_REPO = "ksg-dfci/TrialChecker-0825"
tok = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_REPO).to(device)
model.eval()
this_space = (
    "Cancer type allowed: non-small cell lung cancer. "
    "Histology allowed: adenocarcinoma. "
    "Cancer burden allowed: metastatic disease. "
    "Prior treatment required: prior platinum-based chemo-immunotherapy allowed. "
    "Biomarkers required: ALK fusion."
)
patient_summary = (
    "Dx 2022 lung adenocarcinoma; metastatic to bone. Prior carbo/pem/pembro "
    "with best PR; ALK fusion detected by NGS. ECOG 1."
)

text = this_space + "\nNow here is the patient summary:" + patient_summary
enc = tok(text, return_tensors="pt", truncation=True, max_length=2048).to(device)

with torch.no_grad():
    logits = model(**enc).logits
probs = logits.softmax(-1).squeeze(0)

# Label mapping was set in training: {0: "NEGATIVE", 1: "POSITIVE"}
p_positive = float(probs[1])
print(f"Reasonable consideration probability: {p_positive:.3f}")
```
### Batched scoring
```python
from typing import List
import torch
def score_pairs(spaces: List[str], summaries: List[str], tokenizer, model,
                max_length=2048, batch_size=8):
    assert len(spaces) == len(summaries)
    device = next(model.parameters()).device
    scores = []
    for i in range(0, len(spaces), batch_size):
        batch_spaces = spaces[i:i + batch_size]
        batch_summaries = summaries[i:i + batch_size]
        texts = [
            s + "\nNow here is the patient summary:" + p
            for s, p in zip(batch_spaces, batch_summaries)
        ]
        enc = tokenizer(texts, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        probs = logits.softmax(-1)[:, 1]  # POSITIVE class probability
        scores.extend(probs.detach().cpu().tolist())
    return scores
# Example
spaces = [this_space] * 3
summaries = [patient_summary, "Different summary 1...", "Different summary 2..."]
scores = score_pairs(spaces, summaries, tok, model)
print(scores)
```
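Since the intended use is ranking, scores can be sorted directly. A small illustrative follow-up using the `score_pairs` helper above to rank candidate trial spaces for one patient (the extra space strings are placeholders):

```python
# Rank candidate trial spaces for a single patient by POSITIVE probability
candidate_spaces = [this_space, "Another trial space...", "A third trial space..."]
scores = score_pairs(candidate_spaces, [patient_summary] * len(candidate_spaces), tok, model)

ranked = sorted(zip(candidate_spaces, scores), key=lambda x: x[1], reverse=True)
for space, score in ranked:
    print(f"{score:.3f}  {space[:60]}...")
```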
### Thresholding & calibration
* Default decision: **0.5** on the POSITIVE probability.
* For better calibration/operating points, tune the threshold on a validation set (e.g., maximize F1, optimize Youden’s J, or set to a desired precision).
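
A minimal threshold-tuning sketch for the F1 criterion, assuming scikit-learn is available and that `val_scores` / `val_labels` (illustrative names) come from a held-out validation set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# val_scores: POSITIVE probabilities from score_pairs on a validation set (assumed)
# val_labels: ground-truth 0/1 labels for the same pairs (assumed)
precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)

# Pick the threshold that maximizes F1 (guard the denominator against zero)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])  # the last precision/recall point has no threshold
print(f"Best F1 threshold: {thresholds[best]:.3f} (F1={f1[best]:.3f})")
```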
---
## How to prepare inputs
**Trial space**: a compact description of the trial's target population, covering cancer type/histology, curative vs. metastatic setting, required or prohibited prior treatments, and required or excluded biomarkers.
**Patient summary**: a concise longitudinal summary of diagnosis, histology, current burden, biomarkers, and treatment history.
You can generate these inputs with your upstream LLM pipeline (e.g., `OncoReasoning-3B` for summarization and space extraction), but the classifier accepts any plain strings in the format shown above.
---
## Reproducibility (high-level)
Below is the minimal structure used by the training script to build the dataset before tokenization:
```python
# 1) Load and merge three labeled sources
# - space_specific_eligibility_checks.parquet
# - top_ten_cohorts_checked_round{1,2,3}.csv
# - top_twenty_patients_checked_round{1,2,3}.csv
# 2) Deduplicate by ['patient_summary','this_space'] and keep:
# - split, patient_summary, this_space, eligibility_result
# 3) Compose input text and label:
text = this_space + "\nNow here is the patient summary:" + patient_summary
label = int(eligibility_result) # 0 or 1
# 4) Tokenize with ModernBERT tokenizer (max_length=2048, truncation=True)
# 5) Train AutoModelForSequenceClassification (2 labels)
```
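A hedged sketch of that assembly with pandas; the file names come from the comments above, but the paths and exact column handling are assumptions:

```python
import pandas as pd

# 1) Load and merge the three labeled sources (paths are illustrative)
frames = [pd.read_parquet("space_specific_eligibility_checks.parquet")]
frames += [pd.read_csv(f"top_ten_cohorts_checked_round{r}.csv") for r in (1, 2, 3)]
frames += [pd.read_csv(f"top_twenty_patients_checked_round{r}.csv") for r in (1, 2, 3)]
df = pd.concat(frames, ignore_index=True)

# 2) Deduplicate by the (patient, space) pair and keep the needed columns
df = df.drop_duplicates(subset=["patient_summary", "this_space"])
df = df[["split", "patient_summary", "this_space", "eligibility_result"]]

# 3) Compose the input text and the binary label
df["text"] = df["this_space"] + "\nNow here is the patient summary:" + df["patient_summary"]
df["label"] = df["eligibility_result"].astype(int)
```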
To reproduce exactly, consult and run the original training script.
---
## Limitations & ethical considerations
* Outputs reflect training data and may contain biases or errors.
* The model estimates *reasonableness for consideration*, not strict eligibility.
* Not validated for safety-critical use; do not use for diagnosis or treatment decisions.
---
## Citation
If you use this model or parts of the pipeline, please cite this model card and the training script (ModernBERT TrialChecker fine-tuning). 