GeoQwen-VL-2B-EuroSAT
Remote sensing scene classification model fine-tuned on EuroSAT.
Usage
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base_model, "tugrulkaya/GeoQwen-VL-2B-EuroSAT")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
Classes
AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake
GeoQwen-VL-2B-EuroSAT: A Transformative Story for Remote Sensing
Introduction
Remote sensing and satellite imagery play a critical role in understanding and monitoring changes on our planet. Extracting meaningful information from this vast and complex dataset can be challenging and time-consuming with traditional methods. This is where vision-language models (VLMs), which combine natural language processing and computer vision, come into play. This project describes the development process of GeoQwen-VL-2B-EuroSAT, a VLM designed to classify and interpret satellite images using natural language.
Motivation
In the field of geospatial intelligence, the need for systems that can automatically extract information from satellite images is growing. Accurate and rapid analysis in areas such as land cover classification, environmental monitoring, urban planning, and disaster management is vital. The primary motivation of this project was to develop a powerful VLM capable of processing large-scale satellite data, understanding complex visual information, and generating human-like descriptions. By adapting a state-of-the-art model like Qwen2-VL to the field of satellite imagery, these goals were aimed to be achieved.
Technical Approach
The project primarily involved fine-tuning the Qwen2-VL-2B-Instruct model using the Low-Rank Adaptation (LoRA) technique on the EuroSAT RGB dataset. This chosen technique allows for efficient training of large models even in resource-constrained environments (e.g., T4 GPU) like Colab. 4-bit quantization significantly reduced the model's memory footprint, enhancing its ability to work with larger models. The EuroSAT dataset consists of Sentinel-2 satellite images representing 10 different land cover classes in Europe. The model learned to recognize and describe land cover types in these images by using natural language instructions in a question-answering format.
Challenges and Solutions
- Hardware Constraints: Fine-tuning the large Qwen2-VL model was particularly challenging due to memory limitations. This problem was solved by using 4-bit quantization and the
bitsandbyteslibrary, allowing the model to fit within the 16GB memory of a T4 GPU. - Training Duration: The model's size and the scope of the dataset required long training times. The efficiency of LoRA and the use of strategies like
gradient_accumulation_stepshelped optimize the training process. - Data Preparation: Converting labels and class definitions in the EuroSAT dataset into natural language instructions was critical for the model's better understanding. The
VLMDataCollatorclass automated this conversion, simplifying the training process. - Hugging Face Integration: The seamless upload and version control of the model and processor to the Hugging Face Hub were ensured using the
huggingface_hublibrary, providing shareability and reproducibility of the model. - Post-Fine-tuning Inference Errors: The
truncation=Trueandmax_length=512parameters used inVLMDataCollatorduring training were missing in theprocessorcall during inference. This led to anIndexErrordue to inconsistentinput_idsandattention_maskdimensions. The solution involved re-adding these parameters to the inferenceprocessorcall to ensure consistency.
Results and Achievements
The greatest achievement of this project is the successful fine-tuning of a complex VLM like Qwen2-VL-2B-Instruct on the EuroSAT dataset with limited resources, thanks to LoRA and 4-bit quantization techniques. The developed GeoQwen-VL-2B-EuroSAT model:
- Can accurately classify land cover types in satellite images.
- Can convert visual data into meaningful natural language descriptions.
- Has gained visual question-answering (Visual QA) capabilities on geospatial data.
This is an important step towards developing more accessible and powerful VLMs in the field of remote sensing.
Future Work and Use Cases
The potential application areas of the GeoQwen-VL-2B-EuroSAT model are extensive:
- Environmental Monitoring: Tracking changes in forested areas, status of water bodies, or agricultural lands.
- Urban Planning: Analyzing urban expansion, identification of industrial zones, and infrastructure development.
- Disaster Management: Damage assessment and situation evaluation after disasters such as floods, earthquakes, or fires.
- Agriculture: Crop type detection, yield prediction, and plant health monitoring.
Future work includes training the model on broader and more diverse remote sensing datasets (RESISC45, AID, BigEarthNet), experimenting with larger models like Qwen2.5-VL-7B, and adding more advanced capabilities such as change detection. Additionally, there are plans to create a Gradio demo space to increase the model's accessibility.
This project establishes an important bridge in creating value from remote sensing data and opens new horizons in the field of geospatial intelligence.
Visual Examples
EuroSAT Training Data Examples
Below are example images from the EuroSAT dataset, showcasing different land cover classes that the model was trained on. These visuals represent the various scenarios the model needs to recognize.
Model Prediction Output Example
This visual directly compares the GeoQwen-VL-2B-EuroSAT model's generated description for a satellite image against its ground truth label. It vividly illustrates the model's capability to interpret complex geospatial visuals and provide concise, accurate natural language explanations, showcasing its potential for automated scene understanding.
- Downloads last month
- 32

