---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---

# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)** [![arXiv](https://img.shields.io/badge/arxiv-paper-179bd3)](https://arxiv.org/abs/2507.20388)
### Abstract

*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer*

An unofficial, illustrative sketch of the CM-MSA idea is included at the end of this card.

## 🆕 Updates
- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! ArXiv paper version and HuggingFace demo coming soon, stay tuned!

## ⚙️ Setup and Testing

For ease of setup, use a Linux machine with CUDA-ready devices (GPUs). To set up the environment, first run the provided setup script:
```bash
./environment_setup.sh
# or
bash environment_setup.sh
```
Note: in case of difficulties, ensure `environment_setup.sh` is executable by running:
```bash
chmod +x environment_setup.sh
```
Give the setup a couple of minutes to run. Once it finishes, you can optionally sanity-check the environment with the snippet at the end of this card. Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.

## 📚 Citation
```bibtex
@misc{brateanu2025modalformer,
      title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
      author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
      year={2025},
      eprint={2507.20388},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20388},
}
```

## 🙏 Acknowledgements

We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.

Paper: https://arxiv.org/pdf/2507.20388
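
## 🧪 Optional: Environment Sanity Check

After running `environment_setup.sh`, the quick check below confirms that PyTorch is installed and can see a CUDA device. This is a generic PyTorch snippet, not part of the official repository.

```python
# Generic PyTorch environment check (not repo-specific): verifies that
# the installed torch build can see a CUDA-capable GPU after setup.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```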
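
## 🧠 Illustrative: Cross-Modal Attention Sketch

The abstract describes a Cross-modal Multi-headed Self-Attention mechanism (CM-MSA) that fuses RGB features with modality-specific features to produce hybrid attention maps. The sketch below is *not* the authors' implementation; it illustrates one plausible reading of that description, with queries taken from the RGB stream and keys/values from the fused auxiliary-modality stream. The class name, projection layout, and fusion strategy are all assumptions; see the GitHub repository for the actual code.

```python
# Hypothetical sketch of cross-modal multi-headed self-attention.
# Assumption: RGB features supply queries, fused auxiliary-modality
# features supply keys/values. Not the official ModalFormer code.
import torch
import torch.nn as nn


class CrossModalMSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Queries from the RGB stream; keys/values from the multimodal stream.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb_feats: torch.Tensor, modal_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats, modal_feats: (batch, tokens, dim)
        b, n, d = rgb_feats.shape
        q = self.to_q(rgb_feats)
        k, v = self.to_kv(modal_feats).chunk(2, dim=-1)
        # Split into heads: (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Hybrid attention map: RGB queries attend over multimodal keys.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    msa = CrossModalMSA(dim=64, num_heads=4)
    rgb = torch.randn(1, 256, 64)   # RGB feature tokens
    aux = torch.randn(1, 256, 64)   # fused auxiliary-modality tokens
    print(msa(rgb, aux).shape)      # torch.Size([1, 256, 64])
```

In this reading, the attention map is "hybrid" because similarity is computed between RGB queries and multimodal keys, so the attention weights are informed by both streams before being applied to the multimodal values.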