MAPWise: Evaluating Vision-Language Models for Advanced Map Queries
MAPWise is a novel benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in answering map-based questions. Choropleth maps, commonly used for geographical data representation, pose unique challenges for VLMs due to their reliance on color variations to encode information. Our study introduces a dataset featuring maps from the United States, India, and China, with 1000 manually created question-answer pairs per region. The dataset includes a diverse set of question templates to test spatial reasoning, numerical comprehension, and pattern recognition. Our goal is to push the boundaries of VLM capabilities in handling complex map-based reasoning tasks.
TL;DR: MAPWise is a benchmark designed to evaluate Vision-Language Models (VLMs) on map-based questions. It includes 3,000 manually curated Q&A pairs across maps from the United States, India, and China, testing spatial reasoning, numerical comprehension, and pattern recognition.
The maps draw on real-world statistics for three countries:

- **India**: data from the Reserve Bank of India's Handbook of Statistics on Indian States.
- **USA**: healthcare statistics from the Kaiser Family Foundation.
- **China**: socioeconomic data from the National Bureau of Statistics of China.

Each map appears in several visual variants:

- **Discrete**: a defined categorical legend with clear boundaries.
- **Continuous**: a gradual, spectrum-based legend.
- **Annotated**: maps with region labels to aid interpretation.
- **Hatched**: alternative representations using texture instead of color.
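To make the dataset composition concrete, the sketch below shows how a single benchmark item might be structured. The field names and example values are our own illustrative assumptions, not the released schema.

```python
# A minimal, illustrative sketch of what a single MAPWise item could look like.
# Field names and values here are assumptions for exposition, not the released schema.
example_item = {
    "country": "USA",                      # one of: USA, India, China
    "map_image": "usa_discrete_0421.png",  # hypothetical file name
    "map_type": "discrete",                # discrete vs. continuous legend
    "hatched": False,                      # color vs. texture (hatched) rendering
    "annotated": True,                     # whether region labels are drawn on the map
    "question": "How many states bordering Canada have a value lower than New Mexico?",
    "answer_type": "count",                # binary / single word / list / range / count / ranking
    "answer": "3",                         # hypothetical gold answer
}

if __name__ == "__main__":
    for key, value in example_item.items():
        print(f"{key}: {value}")
```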
To design a robust benchmark for map-based QA, we developed question templates that vary in difficulty, from simple yes/no questions to complex region-association tasks requiring spatial reasoning. The dataset includes three primary question types: binary questions, which require a yes or no answer based on map features; direct value extraction tasks, which involve retrieving specific numerical or categorical values from the map; and region association questions, which challenge models to identify or count regions meeting particular criteria and often require geospatial reasoning.
To ensure high dataset quality, expert annotators meticulously reviewed all questions and answers. A rigorous benchmarking process, including human evaluation, was conducted to assess dataset reliability. This evaluation highlights the limitations of existing Vision-Language Models (VLMs) in handling complex map-based reasoning tasks and provides valuable insights for future improvements.
| Answer Type | Example Question |
|---|---|
| Binary | Yes or no: California is an outlier compared to its neighbors? |
| Single Word | Name the eastern-most state that belongs to a higher value range compared to all its neighbors. |
| List | Which states in the East China Sea region have a value higher than state Guangdong? |
| Range | What is the least value range in the west coast region? |
| Count | How many states bordering Canada have a value lower than New Mexico? |
| Ranking | Rank Rajasthan, Gujarat, and Jammu & Kashmir in terms of the legend value in the region bordering Pakistan. |
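As an illustration of how these answer types could be scored, the sketch below uses exact match for binary, single-word, range, and count answers and set-based precision and recall for list answers. It is a simplified stand-in under our own assumptions, not the paper's evaluation code.

```python
# Illustrative scoring helpers for the answer types above (not the official evaluation code).

def normalize(text: str) -> str:
    """Lowercase and strip whitespace so trivial formatting differences don't count as errors."""
    return text.strip().lower()

def exact_match(prediction: str, gold: str) -> bool:
    """Binary, single-word, range, and count answers can be scored by exact match."""
    return normalize(prediction) == normalize(gold)

def list_precision_recall(prediction: list[str], gold: list[str]) -> tuple[float, float]:
    """List answers are compared as sets; precision and recall are reported separately."""
    pred_set = {normalize(p) for p in prediction}
    gold_set = {normalize(g) for g in gold}
    if not pred_set or not gold_set:
        return 0.0, 0.0
    overlap = len(pred_set & gold_set)
    return overlap / len(pred_set), overlap / len(gold_set)

# Example usage with made-up predictions:
print(exact_match("Yes", "yes"))                                                         # True
print(list_precision_recall(["Fujian", "Zhejiang"], ["Fujian", "Jiangsu", "Zhejiang"]))  # (1.0, ~0.67)
```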
We evaluated multiple Vision-Language Models (VLMs) on the MAPWise dataset, spanning both closed-source models (GPT-4o, Gemini 1.5 Flash) and open-source models (Idefics2, InternLM, QwenVL, CogAgent).
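For reference, the sketch below shows one way a closed-source model such as GPT-4o might be queried with a map image and a question, assuming the OpenAI Python SDK; the prompt wording and file name are placeholders, not the exact setup used in the paper.

```python
# Sketch: querying a closed-source VLM with a map image plus a question.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the prompt wording is a placeholder, not the paper's exact prompt.
import base64
from openai import OpenAI

client = OpenAI()

def ask_map_question(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer the question using only the map.\n{question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example call (hypothetical file name):
# print(ask_map_question("usa_discrete_0421.png", "Is California an outlier compared to its neighbors?"))
```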
We conducted a thorough analysis of closed-source model performance, the effectiveness of different prompting strategies, and biases in model predictions. The study examined how models performed across map types (discrete vs. continuous, colored vs. hatched), annotations (with vs. without labels), and countries. Model accuracy was also compared across question-answer types, revealing significant gaps in spatial and numerical reasoning. For complete results, please see the Appendix of the paper.
The MAPWise dataset reveals a significant performance gap between current Vision-Language Models (VLMs) and human performance, especially on complex reasoning tasks (e.g., counting, listing). Existing VLMs have substantial limitations in reasoning abilities, highlighting the need for further research. The average gap is close to 50% on some task types.
GPT-4o consistently outperforms other models, followed by Gemini 1.5 Flash. Gemini excels with hatched maps due to better legend resolution, while GPT-4o is generally better due to stronger reasoning. Open-source models (Idefics2, InternLM) show promise but struggle with complex reasoning, particularly evident in QwenVL's low counting accuracy (4.26%). Both strong data extraction and sophisticated reasoning are crucial for geo-spatial understanding.
Standard Chain-of-Thought (CoT) prompting generally outperforms Explicit Extraction and Reasoning (EER), except for Gemini 1.5 Flash, which performs similarly with both, likely owing to its strong instruction-following capabilities. Larger models appear to follow an EER-like approach implicitly, showing progress in reasoning, while smaller models struggle with EER's added complexity.
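The difference between the two strategies can be illustrated with prompt templates. The wording below paraphrases the general idea (think step by step versus first extract the relevant values, then reason over them); it is not the exact prompts used in the paper.

```python
# Illustrative prompt templates for the two strategies (paraphrased, not the paper's exact prompts).

COT_PROMPT = (
    "You are given a choropleth map.\n"
    "Question: {question}\n"
    "Think step by step, then give the final answer."
)

EER_PROMPT = (
    "You are given a choropleth map.\n"
    "Step 1: Extract the legend ranges and the value (or range) for every region relevant to the question.\n"
    "Step 2: Reason over the extracted values only.\n"
    "Step 3: Give the final answer.\n"
    "Question: {question}"
)

question = "How many states bordering Canada have a value lower than New Mexico?"
print(COT_PROMPT.format(question=question))
print(EER_PROMPT.format(question=question))
```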
Models generally perform better on discrete maps than continuous maps, especially for counting and range extraction. Continuous maps pose challenges in legend range and color resolution. Models do better on single-word answers within the continuous category, possibly due to question simplicity.
Models consistently perform better on colored maps than hatched maps, indicating a preference for colored data representation. Even GPT-4o shows significant performance drops on hatched maps. Idefics2 is the most robust to this map type.
Models perform similarly on maps with and without annotations. Sometimes, performance is *better* without annotations. Annotations are not a crucial factor for model performance in map understanding, though they can sometimes be beneficial.
Open-source models show consistent performance across countries (USA, India, China), while closed-source models exhibit greater variation. Potential training data biases may contribute to performance differences in closed-source models across different geographic regions.
Models excel at binary questions, followed by single-word answers (strong data extraction). Closed-source models do well on range questions. Counting and listing tasks are the most difficult, requiring complex reasoning, external knowledge, and geospatial understanding. These are challenging even for humans. Models struggle with most questions concerning relative regions, except for binary questions.
Some models (CogAgent, QwenVL) perform below random baselines on binary questions and list precision. This is due to irrelevant responses, repeated tokens, failure to generate valid responses, poor data processing, hallucination, and a lack of task comprehension. Examples show failures in value deduction and task understanding.
To analyze how far models rely on internal knowledge versus actual map interpretation, we designed three types of counterfactual datasets that force models to focus exclusively on the provided maps by altering their textual or numerical elements. For results on additional models, please see the Appendix of the paper.
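As one hypothetical illustration of such a perturbation, the sketch below shuffles the values assigned to regions before the map would be re-rendered, so that memorized real-world statistics no longer match the image; the paper's actual counterfactual constructions are described in the Appendix.

```python
# Hypothetical sketch of one counterfactual perturbation: shuffling the values
# assigned to regions before re-rendering the map, so memorized real-world
# statistics no longer match the image. An illustration of the idea, not the
# paper's actual construction.
import random

def shuffle_region_values(region_values: dict[str, float], seed: int = 0) -> dict[str, float]:
    """Return a copy of the mapping with the same values reassigned to random regions."""
    rng = random.Random(seed)
    regions = list(region_values.keys())
    values = list(region_values.values())
    rng.shuffle(values)
    return dict(zip(regions, values))

original = {"California": 12.1, "Texas": 9.4, "New York": 7.8}  # made-up numbers
print(shuffle_region_values(original))
```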
To establish a baseline, we conducted a human evaluation of the MAPWise dataset. The evaluation included 450 randomly selected questions spanning three countries: USA, India, and China. These questions were answered by three independent human annotators with expertise in map interpretation and spatial reasoning. The annotators followed a majority voting system to determine the final answers. The dataset included both discrete and continuous maps, covering various answer types such as binary, list, ranking, and count-based questions.
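The majority-voting step can be summarized with a small sketch; the tie-breaking rule here is our own simplification.

```python
# Minimal sketch of majority voting over three annotators' answers.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; with three annotators a tie means all disagree,
    in which case we simply fall back to the first annotator's answer."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, count = counts.most_common(1)[0]
    return answer if count > 1 else answers[0]

print(majority_vote(["Yes", "yes", "No"]))  # -> "yes"
```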
Human evaluators consistently outperformed all Vision-Language Models (VLMs), particularly in reasoning-heavy tasks such as ranking and list-based questions. While human accuracy for binary and direct extraction tasks was near-perfect (above 95%), models struggled significantly, with even the best-performing model (GPT-4o) achieving only 71.52% binary accuracy and much lower recall on list-based questions.
This stark contrast underscores the challenges faced by VLMs in tasks requiring advanced spatial and numerical reasoning. Despite their sophisticated architectures, models exhibited biases, hallucinations, and errors when extracting values from the legend. The significant performance gap between humans and models highlights the need for advancements in multimodal learning techniques and improved reasoning strategies to bridge this divide.
We are grateful to Arqam Patel, Nirupama Ratna, and Jay Gala for their help with the creation of the MAPWise dataset. Their early efforts were instrumental in guiding the development of this research. We also thank Adnan Qidwai and Jennifer Sheffield for their valuable insights, which helped improve our work. We extend our sincere thanks to the reviewers for their insightful comments and suggestions, which have greatly enhanced the quality of this manuscript.
This research was partially supported by ONR Contract N00014-23-1-2364, and sponsored by the Army Research Office under Grant Number W911NF-20-1-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
The MAPWise dataset was prepared by Srija Mukhopadhyay, Abhishek Rajgaria, Prerana Khatiwada, Vivek Gupta, and Dan Roth. Please cite the paper as:
@misc{mukhopadhyay2024mapwiseevaluatingvisionlanguagemodels,
title={MAPWise: Evaluating Vision-Language Models for Advanced Map Queries},
author={Srija Mukhopadhyay and Abhishek Rajgaria and Prerana Khatiwada and Vivek Gupta and Dan Roth},
year={2024},
eprint={2409.00255},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.00255},
}