Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation

1ESAT-PSI, KU Leuven, 2Peking University,
3Nanyang Technological University, 4Fudan University
EMNLP 2024 (Findings)

Abstract

This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task, aiming to diagnose current models across various instruction categories at a finer-grained level. The framework is structured around the context-free grammar (CFG) of the task, which serves as the basis for the problem decomposition and the core premise of the instruction-category design. We propose a semi-automatic method for CFG construction with the help of Large Language Models (LLMs). We then derive and generate data spanning five principal instruction categories (i.e., direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation in numerical comprehension, heavy selection biases over directional concepts, and other notable findings offer guidance for the development of future language-guided navigation systems.

Recent work indicates that state-of-the-art performance on the standard VLN R2R task is high, even approaching human performance. However, as shown in our preliminary experiments below, even a simple intervention on a common R2R VLN dataset fails to elicit a consistently strong response, even from the best supervised model. Additionally, other observations, such as the non-negligible success rate of a randomly navigating agent and the unexpectedly low performance of Large Language Model (LLM)-based VLN models on standard datasets, motivate us to revisit the evaluation of VLN models.
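As an illustration only, the sketch below shows one simple intervention of this kind: swapping directional words in R2R instructions before re-evaluating a model. The specific perturbation and the assumed file layout (R2R_val_unseen.json) are hypothetical choices for this example, not a description of the exact preliminary experiment.

# A minimal sketch of one instruction-level intervention: swapping
# "left"/"right" in R2R instructions. The perturbation and file names are
# illustrative assumptions, not the exact setup of the preliminary study.
import json
import re

def swap_directions(instruction: str) -> str:
    """Swap every occurrence of 'left' and 'right' (case-insensitive)."""
    def repl(match):
        word = match.group(0)
        swapped = "right" if word.lower() == "left" else "left"
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(left|right)\b", repl, instruction, flags=re.IGNORECASE)

# Perturb every instruction in an R2R-style annotation file (assumed path)
# and save the result for re-evaluation with an unchanged agent.
with open("R2R_val_unseen.json") as f:
    episodes = json.load(f)
for ep in episodes:
    ep["instructions"] = [swap_directions(i) for i in ep["instructions"]]
with open("R2R_val_unseen_swapped.json", "w") as f:
    json.dump(episodes, f, indent=2)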



In this work, we introduce a new evaluation framework that focuses on atomic instructions, i.e., the singular actions fundamental to VLN instructions. Diagnosing VLN models at the atomic-instruction level allows us to gauge performance from various nuanced perspectives. Our approach begins by iteratively constructing a context-free grammar (CFG) with the help of an LLM to articulate and cover all components of VLN instructions in a unified representation (Section 3.1). We then categorize the atomic components of the CFG into five principal categories (Section 3.2). Building on these categorizations, we develop a semi-automatic process for annotating data for each atomic-instruction category, adhering to the CFG-defined natural instruction standards (Section 3.3). The five principal categories are: direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension.
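To make the CFG-based decomposition concrete, the snippet below sketches a toy grammar for atomic VLN instructions with nltk. The non-terminals, productions, and vocabulary are illustrative assumptions for this page and are far smaller than the grammar actually constructed with LLM assistance; a parsing failure on a new instruction would signal that the grammar needs another refinement round.

# A minimal sketch of atomic VLN instructions expressed as a context-free
# grammar. The non-terminals and productions are illustrative assumptions,
# not the actual CFG released with NavNuances.
import nltk

VLN_CFG = nltk.CFG.fromstring("""
    INSTR -> DIRECTION_CHANGE | LANDMARK | REGION | VERTICAL | NUMERICAL
    DIRECTION_CHANGE -> 'turn' DIR | 'turn' 'around'
    DIR -> 'left' | 'right'
    LANDMARK -> 'walk' 'towards' 'the' OBJ
    REGION -> 'go' 'to' 'the' ROOM
    VERTICAL -> VERB 'the' 'stairs'
    VERB -> 'go' 'up' | 'go' 'down'
    NUMERICAL -> 'take' NUM 'steps' 'forward'
    OBJ -> 'sofa' | 'table' | 'painting'
    ROOM -> 'kitchen' | 'bedroom' | 'hallway'
    NUM -> 'two' | 'three' | 'four'
""")

parser = nltk.ChartParser(VLN_CFG)

# Parse a candidate atomic instruction; an empty parse result means the
# instruction is not yet covered by the grammar.
tokens = "go up the stairs".split()
for tree in parser.parse(tokens):
    print(tree)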



Experiments

A subset of the experimental results is presented in the figures below. Figure 6 shows the performance of different methods on the five atomic-instruction categories of the NavNuances dataset. For simple turning commands, supervised methods lag significantly behind LLM-based approaches, even though the supervised methods perform much better on the standard R2R dataset. Additionally, we observe a substantial performance decline in LLM-based methods when handling vertical-movement tasks, which may explain their lower overall performance on R2R, as approximately 35% of that data involves navigating stairs. Figure 3 highlights the limitations of current VLN models in terms of numerical comprehension: no performance improvement over a random agent is observed under either of the two assumptions.
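For readers reproducing such per-category comparisons, the scores reduce to straightforward bookkeeping over episode outcomes. The snippet below is a minimal sketch assuming a binary success flag per episode; the field names and the choice of success rate as the metric are assumptions for illustration, not the NavNuances evaluation code.

# A minimal sketch of per-category evaluation bookkeeping, assuming each
# episode record carries its atomic-instruction category and a binary
# success flag. Field names are illustrative assumptions.
from collections import defaultdict

def success_rate_by_category(episodes):
    """episodes: iterable of dicts like {"category": "direction_change", "success": True}"""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for ep in episodes:
        totals[ep["category"]] += 1
        successes[ep["category"]] += int(ep["success"])
    return {cat: successes[cat] / totals[cat] for cat in totals}

results = success_rate_by_category([
    {"category": "direction_change", "success": True},
    {"category": "direction_change", "success": False},
    {"category": "vertical_movement", "success": True},
])
print(results)  # {'direction_change': 0.5, 'vertical_movement': 1.0}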


BibTeX

@inproceedings{wang2024navigating,
  title={Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation},
  author={Wang, Zehao and Wu, Minye and Cao, Yixin and Ma, Yubo and Chen, Meiqi and Tuytelaars, Tinne},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
  year={2024},
  publisher={Association for Computational Linguistics},
}