[WIP] GAEA: A Geolocation Aware Conversational Assistant

Summary

Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. In recent days, with the tremendous progress of large multimodal models (LMMs) — proprietary and open-source — researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, for more specialized downstream tasks, such as geolocalization, LMMs struggle. In this work, we propose solving this problem by introducing a conversational model, GAEA, that provides information regarding the location of an image as the user requires. No large-scale dataset enabling the training of such a model exists. Thus, we propose GAEA-1.4M, a comprehensive dataset comprising over 800k images and approximately 1.4M question-answer pairs, constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 3.5k image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 18.2% and the best proprietary model, GPT-4o, by 7.2%. We will publicly release our dataset and codes.

`GAEA` is the first open-source conversational model for conversational capabilities equipped with global-scale geolocalization.

Main contributions:

GAEA-Train: A Diverse Training Dataset: We propose GAEA-Train, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
GAEA-Bench: Evaluating Conversational Geolocalization: To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
GAEA: An Interactive Geolocalization Chatbot: We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
Benchmarking Against State-of-the-Art LMMs: We quantitatively compare our model’s performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.

This page is dedicated to the GAEA model

teaser