T³: Test-Time Model Merging for Medical Vision-Language Models
Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
Mohamed bin Zayed University of Artificial Intelligence
This repository provides the official PyTorch implementation of our T³ Medical Model-Merging paper:

Figure 1: Dynamic test-time merging workflow of T³
Official implementation of T³: Test-Time Model Merging in Vision-Language Models for Zero-Shot Medical Imaging, a method for adaptive fusion of pretrained and fine-tuned vision-language models at test time using Jensen-Shannon divergence.
Key Features
- 🧠 Mutual Information Guidance: Uses JS divergence to measure model consensus.
- ⚡ Backpropagation-Free: No gradient updates required during inference.
- 🏥 Medical Modality Agnostic: Validated across 4 medical imaging domains.
- 🚀 Batch-Wise Efficiency: Reduces compute cost by 32x vs sample-wise merging.
- 📈 SOTA Performance: Outperforms 8+ baselines in accuracy & robustness.
Table of Contents
- Installation
- Method Overview
- Folder Structure
- Reproducing Results
- Pretrained Weights
- Datasets
- Citation
Installation
- Clone repository:
git clone https://github.com/Razaimam45/TCube.git T3
cd T3
- Create conda environment:
conda create -n t3 python=3.9
conda activate t3
pip install -r requirements.txt
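To verify the environment before running anything, a quick check such as the following helps (assuming requirements.txt installs PyTorch):
```python
# Quick sanity check; assumes requirements.txt installs torch.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```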
Method Overview
Adaptive Merging via Jensen-Shannon Divergence
The interpolation coefficient λ is computed dynamically for each sample using the following equation:
\lambda(x) = \lambda_{min} + (\lambda_{max} - \lambda_{min}) \cdot \sigma\bigl(\gamma \cdot \mathrm{JS}(p_{pt}(x) \Vert p_{ft}(x))\bigr)
Where:
- JS = Jensen–Shannon divergence between the pretrained and fine-tuned model predictions.
- σ = sigmoid function for smooth scaling.
- γ = scaling factor (default = 0.5).
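A minimal PyTorch sketch of this computation, assuming p_pt and p_ft are the two models' softmax outputs of shape (batch, num_classes); the lam_min/lam_max defaults below are illustrative, while gamma = 0.5 follows the default above:
```python
import torch

def js_divergence(p_pt: torch.Tensor, p_ft: torch.Tensor) -> torch.Tensor:
    """JS divergence between two batches of class distributions, shape (B, C)."""
    p_bar = 0.5 * (p_pt + p_ft)
    log_bar = p_bar.clamp_min(1e-12).log()
    kl_pt = (p_pt * (p_pt.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    kl_ft = (p_ft * (p_ft.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    return 0.5 * (kl_pt + kl_ft)

def merge_coefficient(p_pt, p_ft, lam_min=0.0, lam_max=1.0, gamma=0.5):
    """lambda(x) = lam_min + (lam_max - lam_min) * sigmoid(gamma * JS)."""
    return lam_min + (lam_max - lam_min) * torch.sigmoid(gamma * js_divergence(p_pt, p_ft))
```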
Visual Explanation of the Method
The figures below justify the method and illustrate its effectiveness:
Dynamic Weighting Based on Model Agreement
We propose using Jensen–Shannon (JS) divergence to measure mutual information between pretrained (p_pt) and fine-tuned (p_ft) model predictions, offering a more robust gauge of joint confidence than entropy-based methods like DaWin's entropy ratio:
R(x) = \frac{\mathcal{H}(p_{ft}(x))}{\mathcal{H}(p_{pt}(x)) + \mathcal{H}(p_{ft}(x))}
JS divergence explicitly captures agreement vs. disagreement by comparing full predictive distributions:
I(x) = \frac{1}{2} \Bigl(\mathrm{KL}(p_{pt}(x) \Vert \bar{p}(x)) + \mathrm{KL}(p_{ft}(x) \Vert \bar{p}(x))\Bigr)
where \bar{p}(x) = \frac{1}{2} \bigl(p_{pt}(x) + p_{ft}(x)\bigr).
This ensures:
- I(x) = 0 when the models fully agree.
- I(x) > 0 when confident predictions disagree.
Empirically, I(x) correlates positively with R(x) but better distinguishes disagreements, validating its use for adaptive merging.
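A quick numeric illustration of this point: for two confident predictions, I(x) is zero when they agree and clearly positive when they disagree, while the entropy ratio R(x) is 0.5 in both cases and cannot tell them apart:
```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def js(p, q):  # Jensen-Shannon divergence I(x), as defined above
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(dim=-1)
    return 0.5 * (kl(p, m) + kl(q, m))

p_a = torch.tensor([[0.9, 0.1]])  # confident prediction for class 0
p_b = torch.tensor([[0.1, 0.9]])  # equally confident, but for class 1

for name, (p_pt, p_ft) in [("agree", (p_a, p_a)), ("disagree", (p_a, p_b))]:
    i_x = js(p_pt, p_ft)
    r_x = entropy(p_ft) / (entropy(p_pt) + entropy(p_ft))
    print(f"{name:8s} I(x)={i_x.item():.3f}  R(x)={r_x.item():.3f}")
# agree    I(x)=0.000  R(x)=0.500
# disagree I(x)=0.368  R(x)=0.500   <- the entropy ratio cannot tell these apart
```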
Mutual Information vs. Entropy

Figure 3: Relationship between mutual information and entropy for adaptive merging.
Performance Across Modalities

Figure 4: T³ achieves superior performance across multiple medical imaging modalities.
Folder Structure
See our HuggingFace page for the expert models and evaluation datasets.
T3/
├── clip/ # CLIP model adaptations
├── data/ # Data Utilities
├── utils/ # Helper functions
├── models/ # Put your finetuned models HERE
├── dataset/ # Put your medimeta/medmnist-c eval data HERE
├── baselines.py # Comparison methods
├── t_cube.py # Core T³ implementation
├── BetaMixture.py # Auxiliary models
└── README.md # This document
Reproducing Results
To reproduce the results from the paper, you can run the t_cube.py script. This script handles the evaluation of T³ and its baselines across multiple datasets and severity levels. Additional baselines are available in baselines.py.
To understand the script better, in t_cube.py:
- Refer to the compute_samplewise_tcube_weights and compute_samplewise_tcube_weights_MI functions for entropy-based merging (the DaWin baseline) and our mutual-information-based merging, respectively.
- Check the evaluate_on_test_set function for how datasets and severity levels are processed.
- Explore the evaluate_tcube function for the merging and evaluation logic; a minimal sketch of the merging step is given below.
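At its core, T³ evaluates a merged model whose weights interpolate the pretrained and fine-tuned checkpoints with the coefficient λ. A minimal sketch of that step (function and variable names here are illustrative, not the exact API in t_cube.py; λ is assumed to weight the fine-tuned model):
```python
import torch

@torch.no_grad()
def merge_state_dicts(sd_pt: dict, sd_ft: dict, lam: float) -> dict:
    """theta_merged = (1 - lam) * theta_pt + lam * theta_ft (float params only)."""
    return {k: torch.lerp(sd_pt[k], sd_ft[k], lam) for k in sd_pt}

# Batch-wise usage (names are hypothetical): one lambda per test batch,
# one merge per batch, no gradient updates anywhere.
# lam = merge_coefficient(p_pt, p_ft).mean().item()
# model.load_state_dict(merge_state_dicts(sd_pt, sd_ft, lam))
```
Because λ is computed per batch rather than per sample, the interpolation runs once per batch, which is where the batch-wise efficiency gain over sample-wise merging comes from.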
Pretrained Weights
We provide pretrained weights for the following models:
- Generalist CLIP: A pretrained model for general vision-language tasks.
- Expert CLIPs: Four fine-tuned models, one for each of the following medical imaging domains:
- Breast Imaging
- Fundoscopy
- Cell Microscopy
- Retinal OCT
These weights are available in the model card at https://huggingface.co/razaimam45/TCube_Merging, under the models/finetuned subfolder.
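If you use the huggingface_hub library (an assumption of this sketch; any download method works), the expert weights can be fetched programmatically:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="razaimam45/TCube_Merging",
    allow_patterns=["models/finetuned/*"],  # fetch only the expert weights
    local_dir=".",                          # lands under ./models/finetuned/
)
```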
Datasets
We provide the Breast Imaging evaluation sets on our HuggingFace page; please download them from there.
To evaluate multiple modality datasets, pass the --testset argument with the modality names joined by slashes, e.g. python t_cube.py --testset 'bloodmnist/breastmnist/'. This evaluates the MedMNIST-C and MediMeta sets for each modality, i.e., four datasets in total.
If you need all modality datasets, you can find them as follows:
- MedMNIST datasets | In-Domain Fine-Tune Datasets
- MediMeta datasets | OOD-B2N Eval Datasets
- MedMNIST-C datasets | OOD-Corruptions Eval Datasets
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you find this work useful, please cite the arXiv version below:
@misc{imam2025t3testtimemodelmerging,
      title={T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis},
      author={Raza Imam and Hu Wang and Dwarikanath Mahapatra and Mohammad Yaqub},
      year={2025},
      eprint={2510.27265},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27265},
}
Contact
For questions or collaborations, contact Raza Imam. Please feel free to raise an issue if you face errors while reproducing the results.