T³: Test-Time Model Merging for Medical Vision-Language Models
Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
Mohamed bin Zayed University of Artificial Intelligence
This repository provides the official PyTorch implementation of our T³ Medical Model-Merging paper:

Figure 1: Dynamic test-time merging workflow of T³
Official implementation of T³: Test-Time Model Merging in Vision-Language Models for Zero-Shot Medical Imaging, a method for adaptive fusion of pretrained and fine-tuned vision-language models at test time using Jensen-Shannon divergence.
Key Features
- 🧠 Mutual Information Guidance: Uses JS divergence to measure model consensus.
- ⚡ Backpropagation-Free: No gradient updates required during inference.
- 🏥 Medical Modality Agnostic: Validated across 4 medical imaging domains.
- 🚀 Batch-Wise Efficiency: Reduces compute cost by 32x vs sample-wise merging.
- 📈 SOTA Performance: Outperforms 8+ baselines in accuracy & robustness.
Table of Contents
- Installation
- Method Overview
- Folder Structure
- Reproducing Results
- Pretrained Weights
- Datasets
- Citation
Installation
- Clone repository:
git clone https://github.com/Razaimam45/TCube.git T3
cd T3
- Create conda environment:
conda create -n t3 python=3.9
conda activate t3
pip install -r requirements.txt
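To verify the environment before running anything, a quick check such as the following helps (assuming requirements.txt installs PyTorch):
```python
# Quick sanity check; assumes requirements.txt installs torch.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```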
Method Overview
Adaptive Merging via Jensen-Shannon Divergence
The interpolation coefficient λ is computed dynamically for each sample using the following equation:
\lambda(x) = \lambda_{min} + (\lambda_{max} - \lambda_{min}) \cdot \sigma\bigl(\gamma \cdot \mathrm{JS}(p_{pt}(x) \Vert p_{ft}(x))\bigr)
Where:
- JS = Jensen–Shannon divergence between the pretrained and fine-tuned model predictions.
- σ = sigmoid function for smooth scaling.
- γ = scaling factor (default = 0.5).
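A minimal PyTorch sketch of this computation, assuming p_pt and p_ft are the two models' softmax outputs of shape (batch, num_classes); the lam_min/lam_max defaults below are illustrative, while gamma = 0.5 follows the default above:
```python
import torch

def js_divergence(p_pt: torch.Tensor, p_ft: torch.Tensor) -> torch.Tensor:
    """JS divergence between two batches of class distributions, shape (B, C)."""
    p_bar = 0.5 * (p_pt + p_ft)
    log_bar = p_bar.clamp_min(1e-12).log()
    kl_pt = (p_pt * (p_pt.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    kl_ft = (p_ft * (p_ft.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    return 0.5 * (kl_pt + kl_ft)

def merge_coefficient(p_pt, p_ft, lam_min=0.0, lam_max=1.0, gamma=0.5):
    """lambda(x) = lam_min + (lam_max - lam_min) * sigmoid(gamma * JS)."""
    return lam_min + (lam_max - lam_min) * torch.sigmoid(gamma * js_divergence(p_pt, p_ft))
```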
Visual Explanation of the Method
The figures below justify the method and illustrate its effectiveness:
Dynamic Weighting Based on Model Agreement
We propose using Jensen–Shannon (JS) divergence to measure mutual information between pretrained (p_pt) and fine-tuned (p_ft) model predictions, offering a more robust gauge of joint confidence than entropy-based methods like DaWin's entropy ratio:
R(x) = \frac{\mathcal{H}(p_{ft}(x))}{\mathcal{H}(p_{pt}(x)) + \mathcal{H}(p_{ft}(x))}
JS divergence explicitly captures agreement vs. disagreement by comparing full predictive distributions:
I(x) = \frac{1}{2} \Bigl(\mathrm{KL}(p_{pt}(x) \Vert \bar{p}(x)) + \mathrm{KL}(p_{ft}(x) \Vert \bar{p}(x))\Bigr)
where \bar{p}(x) = \frac{1}{2} \bigl(p_{pt}(x) + p_{ft}(x)\bigr).
This ensures:
- I(x) = 0 when the models fully agree.
- I(x) > 0 when confident predictions disagree.
Empirically, I(x) correlates positively with R(x) but better distinguishes disagreements, validating its use for adaptive merging.
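A quick numeric illustration of this point: for two confident predictions, I(x) is zero when they agree and clearly positive when they disagree, while the entropy ratio R(x) is 0.5 in both cases and cannot tell them apart:
```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def js(p, q):  # Jensen-Shannon divergence I(x), as defined above
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(dim=-1)
    return 0.5 * (kl(p, m) + kl(q, m))

p_a = torch.tensor([[0.9, 0.1]])  # confident prediction for class 0
p_b = torch.tensor([[0.1, 0.9]])  # equally confident, but for class 1

for name, (p_pt, p_ft) in [("agree", (p_a, p_a)), ("disagree", (p_a, p_b))]:
    i_x = js(p_pt, p_ft)
    r_x = entropy(p_ft) / (entropy(p_pt) + entropy(p_ft))
    print(f"{name:8s} I(x)={i_x.item():.3f}  R(x)={r_x.item():.3f}")
# agree    I(x)=0.000  R(x)=0.500
# disagree I(x)=0.368  R(x)=0.500   <- the entropy ratio cannot tell these apart
```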
Mutual Information vs. Entropy

Figure 3: Relationship between mutual information and entropy for adaptive merging.
Performance Across Modalities

Figure 4: T³ achieves superior performance across multiple medical imaging modalities.
Folder Structure
See our HuggingFace page for the expert models and evaluation datasets.
T3/
├── clip/ # CLIP model adaptations
├── data/ # Data Utilities
├── utils/ # Helper functions
├── models/ # Put your finetuned models HERE
├── dataset/ # Put your medimeta/medmnist-c eval data HERE
├── baselines.py # Comparison methods
├── t_cube.py # Core T³ implementation
├── BetaMixture.py # Auxiliary models
└── README.md # This document
Reproducing Results
To reproduce the results from the paper, you can run the t_cube.py script. This script handles the evaluation of T³ and its baselines across multiple datasets and severity levels. Additional baselines are available in baselines.py.
To understand the script better, in t_cube.py:
- Refer to the compute_samplewise_tcube_weights and compute_samplewise_tcube_weights_MI functions for entropy-based merging (the DaWin baseline) and our mutual-information-based merging, respectively.
- Check the evaluate_on_test_set function for how datasets and severity levels are processed.
- Explore the evaluate_tcube function for the merging and evaluation logic; a minimal sketch of the merging step is given below.
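At its core, T³ evaluates a merged model whose weights interpolate the pretrained and fine-tuned checkpoints with the coefficient λ. A minimal sketch of that step (function and variable names here are illustrative, not the exact API in t_cube.py; λ is assumed to weight the fine-tuned model):
```python
import torch

@torch.no_grad()
def merge_state_dicts(sd_pt: dict, sd_ft: dict, lam: float) -> dict:
    """theta_merged = (1 - lam) * theta_pt + lam * theta_ft (float params only)."""
    return {k: torch.lerp(sd_pt[k], sd_ft[k], lam) for k in sd_pt}

# Batch-wise usage (names are hypothetical): one lambda per test batch,
# one merge per batch, no gradient updates anywhere.
# lam = merge_coefficient(p_pt, p_ft).mean().item()
# model.load_state_dict(merge_state_dicts(sd_pt, sd_ft, lam))
```
Because λ is computed per batch rather than per sample, the interpolation runs once per batch, which is where the batch-wise efficiency gain over sample-wise merging comes from.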
Pretrained Weights
We provide pretrained weights for the following models:
- Generalist CLIP: A pretrained model for general vision-language tasks.
- Expert CLIPs: Four fine-tuned models, one for each of the following medical imaging domains:
- Breast Imaging
- Fundoscopy
- Cell Microscopy
- Retinal OCT
These weights are available in the model card at https://huggingface.co/razaimam45/TCube_Merging, under the models/finetuned subfolder.
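If you use the huggingface_hub library (an assumption of this sketch; any download method works), the expert weights can be fetched programmatically:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="razaimam45/TCube_Merging",
    allow_patterns=["models/finetuned/*"],  # fetch only the expert weights
    local_dir=".",                          # lands under ./models/finetuned/
)
```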
Datasets
We provide the Breast Imaging evaluation sets on our HuggingFace page; please download them from there.
To evaluate multiple modality datasets, pass the --testset argument with the modality names joined by slashes, e.g. python t_cube.py --testset 'bloodmnist/breastmnist/'. This evaluates the MedMNIST-C and MediMeta sets for each modality, i.e., four datasets in total.
If you need all modality datasets, you can find them as follows:
- MedMNIST datasets | In-Domain Fine-Tune Datasets
- MediMeta datasets | OOD-B2N Eval Datasets
- MedMNIST-C datasets | OOD-Corruptions Eval Datasets
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you find this work useful, please cite the arXiv version below:
@misc{imam2025t3testtimemodelmerging,
      title={T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis},
      author={Raza Imam and Hu Wang and Dwarikanath Mahapatra and Mohammad Yaqub},
      year={2025},
      eprint={2510.27265},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27265},
}
Contact
For questions or collaborations, contact Raza Imam. Please feel free to raise an issue if you face errors while reproducing the results.