
T³: Test-Time Model Merging for Medical Vision-Language Models

Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
Mohamed bin Zayed University of Artificial Intelligence

License: MIT | Paper | Weights

This repository provides the official PyTorch implementation of our T³ Medical Model-Merging paper:

T³ Workflow
Figure 1: Dynamic test-time merging workflow of T³

Official implementation of T³: Test-Time Model Merging in Vision-Language Models for Zero-Shot Medical Imaging, a method for adaptive fusion of pretrained and fine-tuned vision-language models at test time using Jensen-Shannon divergence.


Key Features

  • 🧠 Mutual Information Guidance: Uses JS divergence to measure model consensus.
  • ⚡ Backpropagation-Free: No gradient updates required during inference.
  • 🏥 Medical Modality Agnostic: Validated consistently across 4 medical imaging domains.
  • 🚀 Batch-Wise Efficiency: Reduces compute cost by 32x vs sample-wise merging.
  • 📈 SOTA Performance: Outperforms 8+ baselines in accuracy & robustness.

Installation

  1. Clone the repository:
git clone https://github.com/Razaimam45/TCube.git
cd T3
  2. Create the conda environment and install dependencies:
conda create -n t3 python=3.9
conda activate t3
pip install -r requirements.txt

Method Overview

Adaptive Merging via Jensen-Shannon Divergence

The interpolation coefficient λ is computed dynamically for each sample using the following equation:

\lambda(x) = \lambda_{min} + (\lambda_{max} - \lambda_{min}) \cdot \sigma\bigl(\gamma \cdot \mathrm{JS}(p_{pt}(x) \Vert p_{ft}(x))\bigr)

Where:

  • JS = Jensen-Shannon divergence between pretrained and fine-tuned model predictions.
  • σ = Sigmoid function for smooth scaling.
  • γ = Scaling factor (default=0.5).
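For concreteness, here is a minimal PyTorch sketch of this computation. The function names, tensor shapes, and the λ_min/λ_max defaults are illustrative assumptions, not the repository's actual API; see compute_samplewise_tcube_weights_MI in t_cube.py for the official implementation.

```python
import torch

def js_divergence(p_pt: torch.Tensor, p_ft: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-sample Jensen-Shannon divergence between two (batch, num_classes)
    probability tensors (rows sum to 1)."""
    p_bar = 0.5 * (p_pt + p_ft)  # mixture distribution \bar{p}(x)
    kl_pt = (p_pt * ((p_pt + eps).log() - (p_bar + eps).log())).sum(dim=-1)
    kl_ft = (p_ft * ((p_ft + eps).log() - (p_bar + eps).log())).sum(dim=-1)
    return 0.5 * (kl_pt + kl_ft)

def merging_coefficient(p_pt, p_ft, lam_min=0.0, lam_max=1.0, gamma=0.5):
    """lambda(x) = lam_min + (lam_max - lam_min) * sigmoid(gamma * JS).
    lam_min/lam_max defaults here are placeholders; gamma=0.5 follows the paper."""
    return lam_min + (lam_max - lam_min) * torch.sigmoid(gamma * js_divergence(p_pt, p_ft))
```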

Visual Explanation of the Method

The following illustrates the motivation behind the method and its effectiveness:

Dynamic Weighting Based on Model Agreement

We propose using Jensen–Shannon (JS) divergence to measure mutual information between pretrained (p_pt) and fine-tuned (p_ft) model predictions, offering a more robust gauge of joint confidence than entropy-based methods like DaWin's entropy ratio:

R(x) = \frac{\mathcal{H}(p_{ft}(x))}{\mathcal{H}(p_{pt}(x)) + \mathcal{H}(p_{ft}(x))}

JS divergence explicitly captures agreement vs. disagreement by comparing full predictive distributions:

I(x) = \frac{1}{2} \Bigl(\mathrm{KL}(p_{pt}(x) \Vert \bar{p}(x)) + \mathrm{KL}(p_{ft}(x) \Vert \bar{p}(x))\Bigr)

where

\bar{p}(x) = 0.5 \cdot (p_{pt}(x) + p_{ft}(x)).

This ensures:

  • I(x) = 0 when the models fully agree.
  • I(x) > 0 when confident predictions disagree.

Empirically, I(x) correlates positively with R(x) but better distinguishes disagreements, validating its use for adaptive merging.
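To make the contrast concrete, here is a small sketch (reusing js_divergence from the snippet above; the numbers are illustrative, not from the paper):

```python
import torch

def entropy(p: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shannon entropy of each row of a (batch, num_classes) probability tensor."""
    return -(p * (p + eps).log()).sum(dim=-1)

# Two confident but conflicting predictions over three classes.
p_pt = torch.tensor([[0.98, 0.01, 0.01]])
p_ft = torch.tensor([[0.01, 0.98, 0.01]])

R = entropy(p_ft) / (entropy(p_pt) + entropy(p_ft))  # ~0.5: both equally confident
I = js_divergence(p_pt, p_ft)                        # ~0.63, near the ln(2) maximum

# R(x) ~ 0.5 reads as "balanced confidence" and hides the conflict,
# while I(x) is close to its maximum and clearly flags the disagreement.
```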

  1. Mutual Information vs. Entropy
    MI vs Entropy
    Figure 3: Relationship between mutual information and entropy for adaptive merging.

  2. Performance Across Modalities
    Performance Comparison
    Figure 4: T³ achieves superior performance across multiple medical imaging modalities.


Folder Structure

Please check our HuggingFace page for the Expert Models and Evaluation Datasets.

T3/
├── clip/              # CLIP model adaptations
├── data/              # Data Utilities
├── utils/             # Helper functions
├── models/            # Put your finetuned models HERE
├── dataset/           # Put your medimeta/medmnist-c eval data HERE
├── baselines.py       # Comparison methods
├── t_cube.py          # Core T³ implementation
├── BetaMixture.py     # Auxiliary models
└── README.md          # This document

Reproducing Results

To reproduce the results from the paper, you can run the t_cube.py script. This script handles the evaluation of T³ and its baselines across multiple datasets and severity levels. Additional baselines are available in baselines.py.

To better understand the script, note the following in t_cube.py:

  • Refer to the compute_samplewise_tcube_weights and compute_samplewise_tcube_weights_MI functions for entropy-based merging (the DaWin baseline) and our mutual-information-based merging, respectively.
  • Check the evaluate_on_test_set function for how datasets and severities are processed.
  • Explore the evaluate_tcube function for the merging and evaluation logic.
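For orientation, here is a minimal, hypothetical sketch of what one batch-wise merging step might look like. It reuses js_divergence from the Method Overview sketch; the model interfaces, the single batch-level λ, and the interpolation direction are assumptions for illustration, not the repository's actual evaluate_tcube code.

```python
import torch

@torch.no_grad()
def merge_and_predict(model_pt, model_ft, model_merged, images, gamma=0.5):
    """One batch of T3-style test-time merging: backpropagation-free,
    with a single lambda per batch rather than per sample."""
    p_pt = model_pt(images).softmax(dim=-1)
    p_ft = model_ft(images).softmax(dim=-1)

    # Batch-level coefficient from the mean JS divergence (see the formula above);
    # one lambda per batch is what yields the speedup over sample-wise merging.
    lam = torch.sigmoid(gamma * js_divergence(p_pt, p_ft).mean())

    # Linear interpolation of parameters; weighting the fine-tuned model by
    # lambda is an assumption of this sketch. Integer buffers are left as-is.
    sd_pt, sd_ft = model_pt.state_dict(), model_ft.state_dict()
    model_merged.load_state_dict({
        k: ((1 - lam) * v + lam * sd_ft[k]) if v.is_floating_point() else v
        for k, v in sd_pt.items()
    })
    return model_merged(images).softmax(dim=-1)
```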

Pretrained Weights

We provide pretrained weights for the following models:

  1. Generalist CLIP: A pretrained model for general vision-language tasks.
  2. Expert CLIPs: Four fine-tuned models for the following medical imaging domains:
    • Breast Imaging
    • Fundoscopy
    • Cell Microscopy
    • Retinal OCT

To access these weights, please find them in the model card at https://huggingface.co/razaimam45/TCube_Merging under the models/finetuned subfolder.


Datasets

We provide the Breast Imaging evaluation sets on our HuggingFace page; please download them from there.

To run datasets from multiple modalities, pass the --testset argument with a value such as 'bloodmnist/breastmnist/'. This evaluates the medmnist-c and medimeta sets for each modality, resulting in an evaluation over four datasets.

If you need the datasets for all modalities, you can find them on the same HuggingFace page.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find this work useful, please cite the arXiv version below:

@misc{imam2025t3testtimemodelmerging,
      title={T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis}, 
      author={Raza Imam and Hu Wang and Dwarikanath Mahapatra and Mohammad Yaqub},
      year={2025},
      eprint={2510.27265},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27265}, 
}

Contact

For questions or collaborations, contact Raza Imam. Please feel free to raise an issue if you face errors reproducing the results.
