---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---

# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)** [![arXiv](https://img.shields.io/badge/arxiv-paper-179bd3)](https://arxiv.org/abs/2507.20388)
### Abstract

*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer*

An unofficial, illustrative sketch of the CM-MSA idea is included at the end of this card.

## 🆕 Updates
- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! ArXiv paper version and HuggingFace demo coming soon, stay tuned!

## ⚙️ Setup and Testing

For ease of setup, use a Linux machine with CUDA-ready devices (GPUs). To set up the environment, first run the provided setup script:
```bash
./environment_setup.sh
# or
bash environment_setup.sh
```
Note: in case of difficulties, ensure `environment_setup.sh` is executable by running:
```bash
chmod +x environment_setup.sh
```
Give the setup a couple of minutes to run. Once it finishes, you can optionally sanity-check the environment with the snippet at the end of this card. Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.

## 📚 Citation
```bibtex
@misc{brateanu2025modalformer,
      title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
      author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
      year={2025},
      eprint={2507.20388},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20388},
}
```

## 🙏 Acknowledgements

We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.

Paper: https://arxiv.org/pdf/2507.20388
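
## 🧪 Optional: Environment Sanity Check

After running `environment_setup.sh`, the quick check below confirms that PyTorch is installed and can see a CUDA device. This is a generic PyTorch snippet, not part of the official repository.

```python
# Generic PyTorch environment check (not repo-specific): verifies that
# the installed torch build can see a CUDA-capable GPU after setup.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```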
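
## 🧠 Illustrative: Cross-Modal Attention Sketch

The abstract describes a Cross-modal Multi-headed Self-Attention mechanism (CM-MSA) that fuses RGB features with modality-specific features to produce hybrid attention maps. The sketch below is *not* the authors' implementation; it illustrates one plausible reading of that description, with queries taken from the RGB stream and keys/values from the fused auxiliary-modality stream. The class name, projection layout, and fusion strategy are all assumptions; see the GitHub repository for the actual code.

```python
# Hypothetical sketch of cross-modal multi-headed self-attention.
# Assumption: RGB features supply queries, fused auxiliary-modality
# features supply keys/values. Not the official ModalFormer code.
import torch
import torch.nn as nn


class CrossModalMSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Queries from the RGB stream; keys/values from the multimodal stream.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb_feats: torch.Tensor, modal_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats, modal_feats: (batch, tokens, dim)
        b, n, d = rgb_feats.shape
        q = self.to_q(rgb_feats)
        k, v = self.to_kv(modal_feats).chunk(2, dim=-1)
        # Split into heads: (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Hybrid attention map: RGB queries attend over multimodal keys.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    msa = CrossModalMSA(dim=64, num_heads=4)
    rgb = torch.randn(1, 256, 64)   # RGB feature tokens
    aux = torch.randn(1, 256, 64)   # fused auxiliary-modality tokens
    print(msa(rgb, aux).shape)      # torch.Size([1, 256, 64])
```

In this reading, the attention map is "hybrid" because similarity is computed between RGB queries and multimodal keys, so the attention weights are informed by both streams before being applied to the multimodal values.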