GeoQwen-VL-2B-EuroSAT

Remote sensing scene classification model fine-tuned on EuroSAT.

Usage

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base_model, "tugrulkaya/GeoQwen-VL-2B-EuroSAT")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

Classes

AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake

GeoQwen-VL-2B-EuroSAT: A Transformative Story for Remote Sensing

Introduction

Remote sensing and satellite imagery play a critical role in understanding and monitoring changes on our planet. Extracting meaningful information from this vast and complex dataset can be challenging and time-consuming with traditional methods. This is where vision-language models (VLMs), which combine natural language processing and computer vision, come into play. This project describes the development process of GeoQwen-VL-2B-EuroSAT, a VLM designed to classify and interpret satellite images using natural language.

Motivation

In the field of geospatial intelligence, the need for systems that can automatically extract information from satellite images is growing. Accurate and rapid analysis in areas such as land cover classification, environmental monitoring, urban planning, and disaster management is vital. The primary motivation of this project was to develop a powerful VLM capable of processing large-scale satellite data, understanding complex visual information, and generating human-like descriptions. By adapting a state-of-the-art model like Qwen2-VL to the field of satellite imagery, these goals were aimed to be achieved.

Technical Approach

The project primarily involved fine-tuning the Qwen2-VL-2B-Instruct model using the Low-Rank Adaptation (LoRA) technique on the EuroSAT RGB dataset. This chosen technique allows for efficient training of large models even in resource-constrained environments (e.g., T4 GPU) like Colab. 4-bit quantization significantly reduced the model's memory footprint, enhancing its ability to work with larger models. The EuroSAT dataset consists of Sentinel-2 satellite images representing 10 different land cover classes in Europe. The model learned to recognize and describe land cover types in these images by using natural language instructions in a question-answering format.

Challenges and Solutions

  1. Hardware Constraints: Fine-tuning the large Qwen2-VL model was particularly challenging due to memory limitations. This problem was solved by using 4-bit quantization and the bitsandbytes library, allowing the model to fit within the 16GB memory of a T4 GPU.
  2. Training Duration: The model's size and the scope of the dataset required long training times. The efficiency of LoRA and the use of strategies like gradient_accumulation_steps helped optimize the training process.
  3. Data Preparation: Converting labels and class definitions in the EuroSAT dataset into natural language instructions was critical for the model's better understanding. The VLMDataCollator class automated this conversion, simplifying the training process.
  4. Hugging Face Integration: The seamless upload and version control of the model and processor to the Hugging Face Hub were ensured using the huggingface_hub library, providing shareability and reproducibility of the model.
  5. Post-Fine-tuning Inference Errors: The truncation=True and max_length=512 parameters used in VLMDataCollator during training were missing in the processor call during inference. This led to an IndexError due to inconsistent input_ids and attention_mask dimensions. The solution involved re-adding these parameters to the inference processor call to ensure consistency.

Results and Achievements

The greatest achievement of this project is the successful fine-tuning of a complex VLM like Qwen2-VL-2B-Instruct on the EuroSAT dataset with limited resources, thanks to LoRA and 4-bit quantization techniques. The developed GeoQwen-VL-2B-EuroSAT model:

  • Can accurately classify land cover types in satellite images.
  • Can convert visual data into meaningful natural language descriptions.
  • Has gained visual question-answering (Visual QA) capabilities on geospatial data.

This is an important step towards developing more accessible and powerful VLMs in the field of remote sensing.

Future Work and Use Cases

The potential application areas of the GeoQwen-VL-2B-EuroSAT model are extensive:

  • Environmental Monitoring: Tracking changes in forested areas, status of water bodies, or agricultural lands.
  • Urban Planning: Analyzing urban expansion, identification of industrial zones, and infrastructure development.
  • Disaster Management: Damage assessment and situation evaluation after disasters such as floods, earthquakes, or fires.
  • Agriculture: Crop type detection, yield prediction, and plant health monitoring.

Future work includes training the model on broader and more diverse remote sensing datasets (RESISC45, AID, BigEarthNet), experimenting with larger models like Qwen2.5-VL-7B, and adding more advanced capabilities such as change detection. Additionally, there are plans to create a Gradio demo space to increase the model's accessibility.

This project establishes an important bridge in creating value from remote sensing data and opens new horizons in the field of geospatial intelligence.

Visual Examples

EuroSAT Training Data Examples

Below are example images from the EuroSAT dataset, showcasing different land cover classes that the model was trained on. These visuals represent the various scenarios the model needs to recognize.

EuroSAT Training Examples

Model Prediction Output Example

This visual directly compares the GeoQwen-VL-2B-EuroSAT model's generated description for a satellite image against its ground truth label. It vividly illustrates the model's capability to interpret complex geospatial visuals and provide concise, accurate natural language explanations, showcasing its potential for automated scene understanding.

Model Prediction Example

Downloads last month
32
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for tugrulkaya/GeoQwen-VL-2B-EuroSAT

Base model

Qwen/Qwen2-VL-2B
Adapter
(99)
this model

Dataset used to train tugrulkaya/GeoQwen-VL-2B-EuroSAT

Space using tugrulkaya/GeoQwen-VL-2B-EuroSAT 1