LISAt_PRE

LISAt_PRE is a multimodal large language model (MLLM) specialized for remote sensing. It targets tasks that require detailed visual understanding and natural language reasoning over satellite and aerial imagery.


Overview

LISAt_PRE enhances the LISAt framework by adapting it to remote-sensing applications, which require better handling of diverse visual data and specialized query types. The architecture integrates:

  • A Remote-CLIP ViT-L/14 vision encoder
  • A Vicuna-7B LLM for text understanding and reasoning
  • A linear projection module to align vision and language representations
  • A segmentation model trained on high-quality mask annotations

An architectural overview is shown in Figure 3 of the LISAt paper (Quenum et al., 2025; see Citation).
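
To illustrate the linear projection component listed above, here is a minimal sketch in plain Python of how patch features from a vision encoder could be mapped into an LLM's embedding space and concatenated with text embeddings. It is not the released implementation. The function names, the toy dimensions, and the ordering of image and text tokens are assumptions made for illustration. In the real model the widths would be the hidden sizes of the Remote-CLIP ViT-L/14 encoder and Vicuna-7B.

```python
import random


def make_linear(d_in, d_out, seed=0):
    """Return (weight, bias) for a linear layer with small random weights."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.02, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(tokens, weight, bias):
    """Apply y = W x + b to every vision token independently."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


# Toy sizes chosen for readability (hypothetical, not the model's real widths).
NUM_PATCHES, D_VISION, D_LLM = 4, 8, 16

# Stand-in for the vision encoder output: one feature vector per image patch.
rng = random.Random(1)
vision_tokens = [
    [rng.gauss(0.0, 1.0) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)
]

# Project vision features into the language model's embedding width.
weight, bias = make_linear(D_VISION, D_LLM)
projected = project(vision_tokens, weight, bias)

# Stand-in for three embedded text tokens of the user query.
text_tokens = [[0.0] * D_LLM for _ in range(3)]

# Both modalities now share one width, so they can form a single input sequence.
llm_input = projected + text_tokens
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is only the shape contract: after projection, every image token has the same width as a text token, which is what lets the language model attend over both.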


Key Features

  • Remote-Sensing Specialization: Trained on domain-specific imagery to handle the unique challenges of satellite data.
  • Multimodal Alignment: Combines textual and visual inputs through a unified architecture.
  • Training with PreGRES: LISAt_PRE is pre-trained on the PreGRES dataset using LoRA (Hu et al., 2021), before being fine-tuned on GRES.
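
As a rough illustration of the LoRA technique cited above (Hu et al., 2021), the sketch below shows the low-rank update in plain Python: a frozen weight W0 is combined with a trainable product B·A scaled by alpha / r. The rank, the alpha value, and the choice of adapted layers used for LISAt_PRE are not stated here, so the numbers below are hypothetical, and the helper names are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, lora_a, lora_b, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(lora_a)                 # A has shape (r, d_in)
    delta = matmul(lora_b, lora_a)  # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters in one LoRA adapter pair (A and B)."""
    return r * (d_in + d_out)


# Toy frozen weight with d_out = 2 and d_in = 3 (hypothetical values).
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[0.5, 0.5, 0.0]]  # rank r = 1, shape (1, 3)

# B starts at zero, so the adapted layer initially matches the frozen one.
lora_b_init = [[0.0], [0.0]]
w_start = lora_effective_weight(w0, lora_a, lora_b_init, alpha=2.0)

# After training, B is non-zero and shifts the effective weight.
lora_b_trained = [[1.0], [2.0]]
w_trained = lora_effective_weight(w0, lora_a, lora_b_trained, alpha=2.0)

# Parameter savings for a hypothetical 4096 x 4096 layer at rank 8.
full_params = 4096 * 4096
adapter_params = lora_param_count(4096, 4096, 8)
print(adapter_params, full_params)
```

For the hypothetical 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is the main reason LoRA is attractive for adapting a 7B-parameter model.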

Architecture

  • Language Model: Vicuna-7B (Chiang et al., 2023)
  • Vision Encoder: Remote-CLIP ViT-L/14 (Liu et al., 2024a)

Citation

If you use LISAt_PRE in your work, please cite:

@article{quenum2025lisat,
  title={LISAt: Language-Instructed Segmentation Assistant for Satellite Imagery},
  author={Quenum, Jerome and Hsieh, Wen-Han and Wu, Tsung-Han and Gupta, Ritwik and Darrell, Trevor and Chan, David M},
  journal={arXiv preprint arXiv:2505.02829},
  year={2025},
  url={https://arxiv.org/pdf/2505.02829}
}