Authors: Mingyang Li, Henry Ning, Ethan Yang
Emory University Department of Computer Science, Atlanta, Georgia, USA
Geospatial Question Answering (QA) and tourism recommendation are important for location-based decision making. Recent Retrieval-Augmented Generation (RAG) approaches have achieved remarkable results in geographic reasoning. However, current methods still struggle to handle multimodal signals while balancing explicit spatial and temporal constraints. This paper presents GeoLens, a multimodal RAG framework that uses hybrid retrieval and multi-objective optimization to address this challenge. GeoLens outperforms recent Large Language Models (LLMs), exceeding the best LLM baseline by 52.3% in Precision@1 on TourismQA-Miami, and achieves competitive results against previous specialized systems, exceeding GeoLLM by 49.5% in Recall@10 on TourismQA-NYC, while using a lightweight LLM backbone. To the best of our knowledge, this is the first work to integrate multimodal signals into geospatial RAG. Our framework is suited to real-world geospatial recommendation applications at scalable cost.