---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- modernbert
- oncology
- clinical-trials
- eligibility
- binary-classification
- text-matching
---
# TrialChecker-0825
**TrialChecker-0825** is a binary text classifier that estimates whether a given **clinical trial “space”** is a **reasonable consideration** for a patient, given the patient’s summary.
It is fine-tuned from **[`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large)** for sequence classification on pairs of *(trial space, patient summary)*.
> **Important:** This is a research prototype for model development, **not** a medical device and **not** intended for clinical decision-making.
---
## What counts as a “trial space”?
A *trial space* is a concise description of the target population a trial aims to enroll, focusing on:
- Cancer type & histology
- Burden of disease (curative vs metastatic)
- Prior or excluded treatments
- Required / excluded biomarkers
(Boilerplate exclusion rules, e.g. heart failure or uncontrolled brain metastases, are **not** part of the trial space itself; they can be screened separately, for example with OncoReasoning-3B, BoilerplateChecker-0825, or other logic.)
---
## Training summary
The classifier was trained with a script that:
1. Loads three sources of annotated patient–trial pairs:
- Pairs originating from space-specific eligibility checks
- “Patient→top-cohorts” checks (rounds 1–3)
- “Trial-space→top patients” checks (rounds 1–3)
2. Deduplicates by `['patient_summary', 'this_space']`
3. Builds the final text input as:
```
text = this_space + "\nNow here is the patient summary:" + patient_summary
```
4. Uses `eligibility_result` as the **binary label** (0/1)
5. Fine-tunes **ModernBERT-large** for sequence classification (2 labels) with `max_length` **2048**
### Key hyperparameters from training
- Base model: `answerdotai/ModernBERT-large`
- Max length: **2048**
- Optimizer settings: `learning_rate=2e-5`, `weight_decay=0.01`
- Batch size: `per_device_train_batch_size=4`
- Epochs: `2`
- Save strategy: `epoch`
- Tokenizer: `AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")`
- Data collator: `DataCollatorWithPadding`
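A minimal sketch of the corresponding `Trainer` setup, assuming standard `transformers` APIs; the output path and dataset handling are illustrative, and the original script may differ:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

BASE = "answerdotai/ModernBERT-large"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)

def tokenize(batch):
    # Compose each (trial space, patient summary) pair into one string, as in training
    texts = [
        s + "\nNow here is the patient summary:" + p
        for s, p in zip(batch["this_space"], batch["patient_summary"])
    ]
    return tokenizer(texts, truncation=True, max_length=2048)

# `train_ds` is assumed to be a datasets.Dataset with the columns above plus `label`:
# train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="trialchecker-0825",   # illustrative path
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
# trainer.train()
```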
---
## Intended use
- **Input:** a string describing the trial space and a patient summary string
- **Output:** probability that the trial is a **reasonable consideration** for that patient (not full eligibility)
Use cases:
- Ranking candidate trial spaces for a patient
- Early triage before detailed eligibility review (including boilerplate exclusions)
Out of scope:
- Confirming formal eligibility or safety
- Clinical decision support
---
## Inference (Transformers)
### Quick start (single example)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_REPO = "ksg-dfci/TrialChecker-0825"
tok = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_REPO).to(device)
model.eval()
this_space = (
    "Cancer type allowed: non-small cell lung cancer. "
    "Histology allowed: adenocarcinoma. "
    "Cancer burden allowed: metastatic disease. "
    "Prior treatment required: prior platinum-based chemo-immunotherapy allowed. "
    "Biomarkers required: ALK fusion."
)
patient_summary = (
    "Dx 2022 lung adenocarcinoma; metastatic to bone. Prior carbo/pem/pembro "
    "with best PR; ALK fusion detected by NGS. ECOG 1."
)

text = this_space + "\nNow here is the patient summary:" + patient_summary
enc = tok(text, return_tensors="pt", truncation=True, max_length=2048).to(device)

with torch.no_grad():
    logits = model(**enc).logits
probs = logits.softmax(-1).squeeze(0)

# Label mapping was set in training: {0: "NEGATIVE", 1: "POSITIVE"}
p_positive = float(probs[1])
print(f"Reasonable consideration probability: {p_positive:.3f}")
```
### Batched scoring
```python
from typing import List
import torch
def score_pairs(spaces: List[str], summaries: List[str], tokenizer, model,
                max_length=2048, batch_size=8):
    assert len(spaces) == len(summaries)
    device = next(model.parameters()).device
    scores = []
    for i in range(0, len(spaces), batch_size):
        batch_spaces = spaces[i:i + batch_size]
        batch_summaries = summaries[i:i + batch_size]
        texts = [
            s + "\nNow here is the patient summary:" + p
            for s, p in zip(batch_spaces, batch_summaries)
        ]
        enc = tokenizer(texts, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        probs = logits.softmax(-1)[:, 1]  # POSITIVE class probability
        scores.extend(probs.detach().cpu().tolist())
    return scores
# Example
spaces = [this_space] * 3
summaries = [patient_summary, "Different summary 1...", "Different summary 2..."]
scores = score_pairs(spaces, summaries, tok, model)
print(scores)
```
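Since the intended use is ranking, scores can be sorted directly. A small illustrative follow-up using the `score_pairs` helper above to rank candidate trial spaces for one patient (the extra space strings are placeholders):

```python
# Rank candidate trial spaces for a single patient by POSITIVE probability
candidate_spaces = [this_space, "Another trial space...", "A third trial space..."]
scores = score_pairs(candidate_spaces, [patient_summary] * len(candidate_spaces), tok, model)

ranked = sorted(zip(candidate_spaces, scores), key=lambda x: x[1], reverse=True)
for space, score in ranked:
    print(f"{score:.3f}  {space[:60]}...")
```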
### Thresholding & calibration
* Default decision: **0.5** on the POSITIVE probability.
* For better calibration/operating points, tune the threshold on a validation set (e.g., maximize F1, optimize Youden’s J, or set to a desired precision).
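
A minimal threshold-tuning sketch for the F1 criterion, assuming scikit-learn is available and that `val_scores` / `val_labels` (illustrative names) come from a held-out validation set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# val_scores: POSITIVE probabilities from score_pairs on a validation set (assumed)
# val_labels: ground-truth 0/1 labels for the same pairs (assumed)
precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)

# Pick the threshold that maximizes F1 (guard the denominator against zero)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])  # the last precision/recall point has no threshold
print(f"Best F1 threshold: {thresholds[best]:.3f} (F1={f1[best]:.3f})")
```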
---
## How to prepare inputs
**Trial space**: a compact description of the trial's target population, covering cancer type/histology, curative vs. metastatic setting, required or prohibited prior treatments, and required or excluded biomarkers.
**Patient summary**: a concise longitudinal summary of diagnosis, histology, current burden, biomarkers, and treatment history.
You can generate these inputs with your upstream LLM pipeline (e.g., `OncoReasoning-3B` for summarization and space extraction), but the classifier accepts any plain strings in the format shown above.
---
## Reproducibility (high-level)
Below is the minimal structure used by the training script to build the dataset before tokenization:
```python
# 1) Load and merge three labeled sources
# - space_specific_eligibility_checks.parquet
# - top_ten_cohorts_checked_round{1,2,3}.csv
# - top_twenty_patients_checked_round{1,2,3}.csv
# 2) Deduplicate by ['patient_summary','this_space'] and keep:
# - split, patient_summary, this_space, eligibility_result
# 3) Compose input text and label:
text = this_space + "\nNow here is the patient summary:" + patient_summary
label = int(eligibility_result) # 0 or 1
# 4) Tokenize with ModernBERT tokenizer (max_length=2048, truncation=True)
# 5) Train AutoModelForSequenceClassification (2 labels)
```
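A hedged sketch of that assembly with pandas; the file names come from the comments above, but the paths and exact column handling are assumptions:

```python
import pandas as pd

# 1) Load and merge the three labeled sources (paths are illustrative)
frames = [pd.read_parquet("space_specific_eligibility_checks.parquet")]
frames += [pd.read_csv(f"top_ten_cohorts_checked_round{r}.csv") for r in (1, 2, 3)]
frames += [pd.read_csv(f"top_twenty_patients_checked_round{r}.csv") for r in (1, 2, 3)]
df = pd.concat(frames, ignore_index=True)

# 2) Deduplicate by the (patient, space) pair and keep the needed columns
df = df.drop_duplicates(subset=["patient_summary", "this_space"])
df = df[["split", "patient_summary", "this_space", "eligibility_result"]]

# 3) Compose the input text and the binary label
df["text"] = df["this_space"] + "\nNow here is the patient summary:" + df["patient_summary"]
df["label"] = df["eligibility_result"].astype(int)
```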
To reproduce exactly, consult and run the original training script.
---
## Limitations & ethical considerations
* Outputs reflect training data and may contain biases or errors.
* The model estimates *reasonableness for consideration*, not strict eligibility.
* Not validated for safety-critical use; do not use for diagnosis or treatment decisions.
---
## Citation
If you use this model or parts of the pipeline, please cite this model card and the training script (ModernBERT TrialChecker fine-tuning). 