---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- modernbert
- oncology
- clinical-trials
- eligibility
- binary-classification
- text-matching
---


# TrialChecker-0825

**TrialChecker-0825** is a binary text classifier that estimates whether a **clinical trial “space”** is a **reasonable consideration** for a patient, based on the patient’s summary.  
It is fine-tuned from `answerdotai/ModernBERT-large` for sequence classification on pairs of *(trial space, patient summary)*.

> **Important:** This is a research prototype for model development, **not** a medical device and **not** intended for clinical decision-making.

---

## What counts as a “trial space”?

A *trial space* is a concise description of the target population a trial aims to enroll, focusing on:
- Cancer type & histology
- Burden of disease (curative vs metastatic)
- Prior or excluded treatments
- Required / excluded biomarkers

Boilerplate exclusion rules (e.g., heart failure, uncontrolled brain mets) are **not** part of the trial space itself. They can be screened separately by OncoReasoning-3B, BoilerplateChecker-0825, or other logic.

---

## Training summary

The classifier was trained with a script that:
1. Loads three sources of annotated patient–trial pairs:
   - Pairs originating from space-specific eligibility checks
   - “Patient→top-cohorts” checks (rounds 1–3)
   - “Trial-space→top patients” checks (rounds 1–3)
2. Deduplicates by `['patient_summary', 'this_space']`
3. Builds the final text input as:
```
text = this_space + "\nNow here is the patient summary:" + patient_summary
```
4. Uses `eligibility_result` as the **binary label** (0/1)
5. Fine-tunes **ModernBERT-large** (sequence classification, 2 labels) with `max_length` **2048**


### Key hyperparameters from training
- Base model: `answerdotai/ModernBERT-large`
- Max length: **2048**
- Optimizer settings: `learning_rate=2e-5`, `weight_decay=0.01`
- Batch size: `per_device_train_batch_size=4`
- Epochs: `2`
- Save strategy: `epoch`
- Tokenizer: `AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")`
- Data collator: `DataCollatorWithPadding`
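
As a rough sketch, the settings listed above could be handed to the Transformers `Trainer` along the following lines. The helper name `build_training_args` and the default `output_dir` are hypothetical, and any argument not listed above is left at the library default, which may differ from the original training script.

```python
# Hyperparameters copied from the list above. Everything else is left at
# the library defaults, which may not match the original training script.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "per_device_train_batch_size": 4,
    "num_train_epochs": 2,
    "save_strategy": "epoch",
}


def build_training_args(output_dir: str = "trialchecker-finetune"):
    """Build TrainingArguments; `output_dir` is a hypothetical path."""
    # Imported lazily so the settings can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```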

---

## Intended use

- **Input:** a string describing the trial space and a patient summary string  
- **Output:** probability that the trial is a **reasonable consideration** for that patient (not full eligibility)

Use cases:
- Ranking candidate trial spaces for a patient
- Early triage before detailed eligibility review (including boilerplate exclusions)

Out of scope:
- Confirming formal eligibility or safety
- Clinical decision support

---

## Inference (Transformers)


### Quick start (single example)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_REPO = "ksg-dfci/TrialChecker-0825"

tok = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_REPO).to(device)
model.eval()

this_space = (
    "Cancer type allowed: non-small cell lung cancer. "
    "Histology allowed: adenocarcinoma. "
    "Cancer burden allowed: metastatic disease. "
    "Prior treatment required: prior platinum-based chemo-immunotherapy allowed. "
    "Biomarkers required: ALK fusion."
)

patient_summary = (
    "Dx 2022 lung adenocarcinoma; metastatic to bone. Prior carbo/pem/pembro "
    "with best PR; ALK fusion detected by NGS. ECOG 1."
)

text = this_space + "\nNow here is the patient summary:" + patient_summary

enc = tok(text, return_tensors="pt", truncation=True, max_length=2048).to(device)
with torch.no_grad():
    logits = model(**enc).logits
probs = logits.softmax(-1).squeeze(0)

# Label mapping was set in training: {0: "NEGATIVE", 1: "POSITIVE"}
p_positive = float(probs[1])
print(f"Reasonable consideration probability: {p_positive:.3f}")
```

### Batched scoring

```python
from typing import List
import torch

def score_pairs(spaces: List[str], summaries: List[str], tokenizer, model, max_length=2048, batch_size=8):
    assert len(spaces) == len(summaries)
    device = next(model.parameters()).device
    scores = []

    for i in range(0, len(spaces), batch_size):
        batch_spaces = spaces[i:i+batch_size]
        batch_summaries = summaries[i:i+batch_size]
        texts = [s + "\nNow here is the patient summary:" + p for s, p in zip(batch_spaces, batch_summaries)]
        enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        probs = logits.softmax(-1)[:, 1]  # POSITIVE
        scores.extend(probs.detach().cpu().tolist())
    return scores

# Example
spaces = [this_space] * 3
summaries = [patient_summary, "Different summary 1...", "Different summary 2..."]
scores = score_pairs(spaces, summaries, tok, model)
print(scores)
```
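
For the ranking use case, the scores returned by `score_pairs` can simply be sorted. The helper below is a minimal sketch; the name `rank_spaces`, the `top_k` parameter, and the example values are illustrative and not part of the released code.

```python
from typing import List, Tuple


def rank_spaces(spaces: List[str], scores: List[float], top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k (space, score) pairs, highest score first."""
    if len(spaces) != len(scores):
        raise ValueError("spaces and scores must have the same length")
    ranked = sorted(zip(spaces, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Illustrative values only; in practice the scores come from score_pairs().
example_spaces = ["space A", "space B", "space C"]
example_scores = [0.12, 0.91, 0.47]
top = rank_spaces(example_spaces, example_scores, top_k=2)
print(top)
```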

### Thresholding & calibration

* Default decision: **0.5** on the POSITIVE probability.
* To choose a better operating point, tune the threshold on a held-out validation set (e.g., maximize F1, maximize Youden’s J, or pick the threshold that reaches a desired precision).

---

## How to prepare inputs

**Trial space**: a compact description of the trial's target population and disease context, covering cancer type/histology, disease burden (metastatic vs curative), prior or excluded treatments, and required/excluded biomarkers.

**Patient summary**: a concise longitudinal summary of diagnosis, histology, current burden, biomarkers, and treatment history.

You can generate these inputs with your upstream LLM pipeline (e.g., `OncoReasoning-3B` for summarization and space extraction), but the classifier accepts any plain strings in the format shown above.
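
Because the model was trained on one specific concatenation, it may help to centralize the joining logic in a small helper. This is a sketch only: `build_input` and `SEPARATOR` are hypothetical names, and the empty-string check is an added safeguard rather than something the original pipeline is known to do.

```python
# Same separator string as in the inference examples (note: no space after the colon).
SEPARATOR = "\nNow here is the patient summary:"


def build_input(this_space: str, patient_summary: str) -> str:
    """Join a trial space and a patient summary in the documented format."""
    if not this_space.strip() or not patient_summary.strip():
        raise ValueError("both this_space and patient_summary must be non-empty")
    # No whitespace is added or stripped, so the text matches the documented format.
    return this_space + SEPARATOR + patient_summary
```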

---

## Reproducibility (high-level)

Below is the minimal structure used by the training script to build the dataset before tokenization:

```python
# 1) Load and merge three labeled sources
#    - space_specific_eligibility_checks.parquet
#    - top_ten_cohorts_checked_round{1,2,3}.csv
#    - top_twenty_patients_checked_round{1,2,3}.csv

# 2) Deduplicate by ['patient_summary','this_space'] and keep:
#    - split, patient_summary, this_space, eligibility_result

# 3) Compose input text and label:
text  = this_space + "\nNow here is the patient summary:" + patient_summary
label = int(eligibility_result)  # 0 or 1

# 4) Tokenize with ModernBERT tokenizer (max_length=2048, truncation=True)
# 5) Train AutoModelForSequenceClassification (2 labels)
```
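
Steps 2 and 3 can be illustrated with a small runnable sketch in plain Python. The function name `build_examples` and the toy rows are made up for illustration, and keeping the first occurrence of a duplicate pair is an assumption; the original script may resolve duplicates differently.

```python
from typing import Dict, List

SEPARATOR = "\nNow here is the patient summary:"


def build_examples(rows: List[Dict]) -> List[Dict]:
    """Deduplicate rows by (patient_summary, this_space) and compose text/label."""
    seen = set()
    examples = []
    for row in rows:
        key = (row["patient_summary"], row["this_space"])
        if key in seen:
            continue  # assumption: keep the first occurrence of each pair
        seen.add(key)
        examples.append({
            "split": row["split"],
            "text": row["this_space"] + SEPARATOR + row["patient_summary"],
            "label": int(row["eligibility_result"]),
        })
    return examples


# Toy rows, for illustration only.
toy_rows = [
    {"split": "train", "patient_summary": "summary 1", "this_space": "space A", "eligibility_result": 1},
    {"split": "train", "patient_summary": "summary 1", "this_space": "space A", "eligibility_result": 1},
    {"split": "validation", "patient_summary": "summary 2", "this_space": "space A", "eligibility_result": 0},
]
examples = build_examples(toy_rows)
print(len(examples))
```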

To reproduce exactly, consult and run the original training script.

---

## Limitations & ethical considerations

* Outputs reflect training data and may contain biases or errors.
* The model estimates *reasonableness for consideration*, not strict eligibility.
* Not validated for safety-critical use; do not use for diagnosis or treatment decisions.

---

## Citation

If you use this model or parts of the pipeline, please cite this model card and the training script (ModernBERT TrialChecker fine-tuning). 
