MAPWise: Evaluating Vision-Language Models for Advanced Map Queries

IIIT Hyderabad, University of Utah, University of Delaware, University of Pennsylvania, Arizona State University

About


MAPWise is a novel benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in answering map-based questions. Choropleth maps, commonly used for geographical data representation, pose unique challenges for VLMs due to their reliance on color variations to encode information. Our study introduces a dataset featuring maps from the United States, India, and China, with 1,000 manually created question-answer pairs per region. The dataset includes a diverse set of question templates to test spatial reasoning, numerical comprehension, and pattern recognition. Our goal is to push the boundaries of VLM capabilities in handling complex map-based reasoning tasks.

TL;DR: MAPWise is a benchmark designed to evaluate Vision-Language Models (VLMs) on map-based questions. It includes 3,000 manually curated Q&A pairs across maps from the United States, India, and China, testing spatial reasoning, numerical comprehension, and pattern recognition.

MAPWise Dataset Creation Procedure

Data Sources

India

Data from the Reserve Bank of India's Handbook of Statistics on Indian States.

USA

Healthcare statistics from the Kaiser Family Foundation.

China

Socioeconomic data from the National Bureau of Statistics of China.

Map Variations

Discrete Maps

Defined categorical legend with clear boundaries.

Continuous Maps

Gradual spectrum-based legends.

Annotated Maps

Maps with labels to aid interpretation.

Hatched Maps

Alternative representations that use texture (hatching) patterns in place of color.
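
To make the difference between discrete and continuous legends concrete, here is a minimal sketch that renders the same toy grid once with a binned legend and once with a continuous colorbar, using only matplotlib. It is purely illustrative and is not the pipeline used to generate the MAPWise maps.

```python
# Purely illustrative: contrast a discrete (binned) legend with a continuous
# colorbar on the same toy grid. This is NOT the pipeline used to render the
# MAPWise maps; it only shows how the two legend styles differ.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, Normalize

values = np.random.default_rng(0).uniform(0, 100, size=(10, 10))  # toy data

fig, (ax_discrete, ax_continuous) = plt.subplots(1, 2, figsize=(10, 4))

# Discrete map: categorical legend with clear bin boundaries.
bins = [0, 20, 40, 60, 80, 100]
norm_d = BoundaryNorm(bins, ncolors=plt.cm.Blues.N)
im_d = ax_discrete.imshow(values, cmap="Blues", norm=norm_d)
fig.colorbar(im_d, ax=ax_discrete, ticks=bins, label="value range")
ax_discrete.set_title("Discrete legend")

# Continuous map: gradual spectrum-based legend.
im_c = ax_continuous.imshow(values, cmap="Blues", norm=Normalize(0, 100))
fig.colorbar(im_c, ax=ax_continuous, label="value")
ax_continuous.set_title("Continuous legend")

plt.tight_layout()
plt.show()
```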

Question Generation & Validation

To design a robust benchmark for map-based QA, we developed question templates that vary in difficulty, from simple yes/no questions to complex region-association tasks requiring spatial reasoning. The dataset includes three primary question types:

  • Binary: questions that require a yes or no answer based on map features.
  • Direct Value Extraction: tasks that involve retrieving a specific numerical or categorical value from the map.
  • Region Association: questions that ask models to identify or count regions meeting particular criteria, often requiring geospatial reasoning.

To ensure high dataset quality, expert annotators meticulously reviewed all questions and answers. A rigorous benchmarking process, including human evaluation, was conducted to assess dataset reliability. This evaluation highlights the limitations of existing Vision-Language Models (VLMs) in handling complex map-based reasoning tasks and provides valuable insights for future improvements.

Example Questions and Answer Types

| Answer Type | Example Question |
|---|---|
| Binary | Yes or no: Is California an outlier compared to its neighbors? |
| Single Word | Name the eastern-most state that belongs to a higher value range compared to all its neighbors. |
| List | Which states in the East China Sea region have a value higher than state Guangdong? |
| Range | What is the least value range in the west coast region? |
| Count | How many states bordering Canada have a value lower than New Mexico? |
| Ranking | Rank Rajasthan, Gujarat, and Jammu & Kashmir in terms of the legend value in the region bordering Pakistan. |
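
For illustration, a single benchmark item can be thought of as a record pairing a map image with one of the templated questions above. The field names below are hypothetical and do not reflect the released MAPWise schema.

```python
# Hypothetical record layout for one MAPWise-style QA pair.
# Field names are illustrative, not the released dataset schema.
qa_example = {
    "country": "USA",
    "map_variant": "discrete",   # discrete | continuous, optionally annotated / hatched
    "question": "How many states bordering Canada have a value lower than New Mexico?",
    "answer_type": "count",      # binary | single word | list | range | count | ranking
    "answer": "...",             # placeholder; not an actual gold label
}

print(f"[{qa_example['answer_type']}] {qa_example['question']}")
```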

Dataset Statistics

Figure: MAPWise dataset statistics.

Results & Analysis

Models Evaluated

We evaluated multiple Vision-Language Models (VLMs) on the MAPWise dataset. The models included both closed-source and open-source multimodal models:

  • Closed-Source Models: GPT-4o, Gemini 1.5 Flash
  • Open-Source Models: CogAgent, InternLM-XComposer2, Idefics 2, Qwen VL
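
As a rough illustration of how such models can be queried, the sketch below sends a map image and one benchmark-style question to GPT-4o through the OpenAI Python client. This is a generic pattern, not the authors' exact prompting harness; the image path and prompt wording are placeholders.

```python
# Sketch of querying a closed-source VLM (GPT-4o via the OpenAI Python client)
# with a map image plus a MAPWise-style question. Illustrative only; the image
# path and prompt wording are placeholders, not the paper's exact prompts.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("usa_discrete_map.png", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

question = "How many states bordering Canada have a value lower than New Mexico?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Answer using only the map. {question}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```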

In-Depth Analysis

We conducted a thorough analysis of closed-source model performance, the effectiveness of different prompting strategies, and biases in model predictions. The study examined how models performed across map types (discrete vs. continuous, colored vs. hatched), annotation settings (with vs. without labels), and countries. Additionally, model accuracy was compared across question and answer types, revealing significant gaps in spatial and numerical reasoning. For complete results, please see the appendix of the paper.
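
As a minimal sketch of how accuracy can be broken down by answer type, the snippet below tabulates exact-match accuracy from a flat results table. The column names and rows are assumptions for illustration only; list and range answers would in practice need more forgiving matching than strict string equality.

```python
# Sketch: exact-match accuracy per answer type from a flat results table.
# Column names and rows are made up for illustration; list/range answers
# typically require more lenient matching than strict string equality.
import pandas as pd

results = pd.DataFrame([
    {"answer_type": "binary", "gold": "yes",  "prediction": "Yes"},
    {"answer_type": "binary", "gold": "no",   "prediction": "yes"},
    {"answer_type": "count",  "gold": "3",    "prediction": "3"},
    {"answer_type": "range",  "gold": "0-20", "prediction": "20-40"},
])

results["correct"] = (
    results["gold"].str.strip().str.lower()
    == results["prediction"].str.strip().str.lower()
)
print(results.groupby("answer_type")["correct"].mean())
```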

Performance Tables

Counterfactual Experiments

To analyze how much models rely on internal knowledge versus actual map interpretation, we designed three types of counterfactual datasets. These datasets force models to rely exclusively on the provided maps by altering their textual or numerical elements. For results on additional models, please see the appendix of the paper.

  • Imaginary Names: Replaced state names with fictional ones.
  • Shuffled Names: Randomized state names while keeping values unchanged.
  • Jumbled Values: Swapped numerical values among states, legend unchanged.
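
The toy snippet below illustrates what the shuffled-names and jumbled-values perturbations mean in principle. The actual counterfactual datasets are re-rendered map images, and the state values used here are made up.

```python
# Toy illustration of the "Shuffled Names" and "Jumbled Values" perturbations.
# The real counterfactual datasets are re-rendered map images; the values
# below are made up purely to show the idea.
import random

states = ["California", "Texas", "Utah", "Ohio"]
values = [93.1, 71.4, 55.0, 62.3]
rng = random.Random(0)

# Shuffled Names: permute the state names while the values stay in place.
shuffled_names = states[:]
rng.shuffle(shuffled_names)
shuffled = dict(zip(shuffled_names, values))

# Jumbled Values: permute the values among states; the legend is unchanged.
jumbled_values = values[:]
rng.shuffle(jumbled_values)
jumbled = dict(zip(states, jumbled_values))

print("original:      ", dict(zip(states, values)))
print("shuffled names:", shuffled)
print("jumbled values:", jumbled)
```
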
Figure: Performance on counterfactual datasets.

Human Baseline

Human Evaluation Process

To establish a baseline, we conducted a human evaluation of the MAPWise dataset. The evaluation included 450 randomly selected questions spanning three countries: USA, India, and China. These questions were answered by three independent human annotators with expertise in map interpretation and spatial reasoning. The annotators followed a majority voting system to determine the final answers. The dataset included both discrete and continuous maps, covering various answer types such as binary, list, ranking, and count-based questions.
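
For concreteness, majority voting over three annotator answers can be implemented as below. This is a generic sketch, not the annotation tooling used for MAPWise; in practice, ties would be adjudicated rather than dropped.

```python
# Generic majority-vote aggregation over annotator answers (not the actual
# MAPWise annotation tooling). Returns None when no strict majority exists.
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None if there is no strict majority."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, freq = counts.most_common(1)[0]
    return answer if freq > len(answers) / 2 else None

print(majority_vote(["Yes", "yes", "No"]))  # -> "yes"
print(majority_vote(["3", "4", "5"]))       # -> None (no strict majority)
```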

Figure: Human baseline results.
Figure: Majority voting results.

Comparison: Human vs. Model Performance

Human evaluators consistently outperformed all Vision-Language Models (VLMs), particularly in reasoning-heavy tasks such as ranking and list-based questions. While human accuracy for binary and direct extraction tasks was near-perfect (above 95%), models struggled significantly, with even the best-performing model (GPT-4o) achieving only 71.52% binary accuracy and much lower recall on list-based questions.

This stark contrast underscores the challenges faced by VLMs in tasks requiring advanced spatial and numerical reasoning. Despite their sophisticated architectures, models exhibited biases, hallucinations, and errors when extracting values from the legend. The significant performance gap between humans and models highlights the need for advancements in multimodal learning techniques and improved reasoning strategies to bridge this divide.

Figure: Human vs. model performance.

Acknowledgment

We are grateful to Arqam Patel, Nirupama Ratna, and Jay Gala for their help with the creation of the MAPWise dataset. Their early efforts were instrumental in guiding the development of this research. We also thank Adnan Qidwai and Jennifer Sheffield for their valuable insights, which helped improve our work. We extend our sincere thanks to the reviewers for their insightful comments and suggestions, which have greatly enhanced the quality of this manuscript.

This research was partially supported by ONR Contract N00014-23-1-2364, and sponsored by the Army Research Office under Grant Number W911NF-20-1-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Authors

The MAPWise dataset was prepared by the following people:

Srija Mukhopadhyay, Abhishek Rajgaria, Prerana Khatiwada, Manish Shrivastava, Dan Roth, and Vivek Gupta.

BibTeX

@misc{mukhopadhyay2024mapwiseevaluatingvisionlanguagemodels,
      title={MAPWise: Evaluating Vision-Language Models for Advanced Map Queries}, 
      author={Srija Mukhopadhyay and Abhishek Rajgaria and Prerana Khatiwada and Vivek Gupta and Dan Roth},
      year={2024},
      eprint={2409.00255},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.00255}, 
}