Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval

Politecnico di Torino 1, University of Oxford 2
ICRA 2025 (Under Review)

*Corresponding Author: davide.buoso@polito.it

Indicates Equal Supervision

High-level demonstration of S2P: the robot must reach the red mark from its current location, controlled solely via the external camera shown in the figure. S2P proposes candidate keypoints (in yellow) and draws them onto the original image before requesting a feasible trajectory from an off-the-shelf VLM. The VLM then outputs a trajectory (in green) as a sequence of keypoints, ideally yielding a path that avoids obstacles (e.g., keypoints 3 and 9).
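As a rough illustration of the annotation step described in the caption, the sketch below overlays numbered candidate keypoints on a camera frame before it is sent to the VLM. The grid-based proposal strategy, function names, and file paths are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the keypoint-annotation step (assumed grid proposal strategy).
import cv2

def propose_keypoints(image, step=80):
    """Place a coarse grid of candidate keypoints over the frame (illustrative heuristic)."""
    h, w = image.shape[:2]
    return [(x, y) for y in range(step, h, step) for x in range(step, w, step)]

def draw_keypoints(image, points):
    """Draw numbered candidate keypoints onto a copy of the frame."""
    annotated = image.copy()
    for idx, (x, y) in enumerate(points):
        cv2.circle(annotated, (x, y), 10, (0, 255, 255), -1)  # yellow marker
        cv2.putText(annotated, str(idx), (x + 12, y + 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)
    return annotated

frame = cv2.imread("external_camera_frame.jpg")          # hypothetical TPV frame
annotated = draw_keypoints(frame, propose_keypoints(frame))
cv2.imwrite("annotated_frame.jpg", annotated)             # image passed to the VLM
```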


Abstract

This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. While most existing learning-based approaches to path planning require extensive task-specific training or fine-tuning, we demonstrate how such training can be avoided in most practical cases. To this end, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning that completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring only a fraction of the task-specific data typically used by trained models, or even relying solely on online data. Our method enables the effective use of a generally trained VLM in a flexible and cost-efficient way, and requires no additional sensing beyond a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) navigation and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique improves the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario, and is comparable to trained models in the FPV one, with as few as 20 demonstrations.

Can we plan and navigate in a training-free manner by leveraging VQA and ICL?

Select2Plan (S2P) is a novel framework for high-level robot navigation that eliminates the need for extensive training or fine-tuning. The system combines off-the-shelf Vision-Language Models (VLMs) with structured Visual Question-Answering (VQA) to make navigation decisions without specialised data collection.
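The sketch below shows what such a structured VQA query could look like with an OpenAI-style chat API. The model name, prompt wording, and expected output format are placeholders; the paper's exact VLM and prompt may differ.

```python
# A minimal sketch of the structured VQA query to an off-the-shelf VLM.
import base64
from openai import OpenAI

client = OpenAI()

def ask_for_trajectory(annotated_image_path, goal_description):
    """Ask the VLM to return a trajectory as a sequence of keypoint indices."""
    with open(annotated_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "The image shows a robot, numbered candidate keypoints, and a goal: "
        f"{goal_description}. Reply with the sequence of keypoint indices the robot "
        "should follow to reach the goal while avoiding obstacles, e.g. [3, 9, 12]."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name, not necessarily the one used in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # e.g. "[3, 9, 12]"
```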


Can we adapt to different perspectives?

S2P adapts effortlessly to different navigation scenarios, including First-Person View (FPV) and Third-Person View (TPV). This versatility enables it to operate in a range of conditions, from autonomous vehicle pathfinding to CCTV-based robotic control.

S2P incorporates In-Context Learning (ICL) combined with an experiential memory to generate robust plans.
The memory retrieval system enables the robot to learn from previous experiences and generalise to new, unseen environments, improving navigation performance by as much as 50% without any additional training. In the traditional FPV setup, S2P performs comparably to heavily trained models while using far fewer data points, making it an efficient solution for high-level robot planning and navigation.
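A minimal sketch of such experiential-memory retrieval is shown below, assuming each stored episode carries a precomputed image embedding and retrieval is done by cosine similarity. The episode format and embedding choice are assumptions, not the paper's exact design.

```python
# Sketch of nearest-neighbour retrieval of past episodes to use as ICL demonstrations.
import numpy as np

class EpisodicMemory:
    def __init__(self):
        self.embeddings = []   # one vector per stored episode (precomputed)
        self.episodes = []     # e.g. (annotated image path, chosen trajectory) pairs

    def add(self, embedding, episode):
        self.embeddings.append(np.asarray(embedding, dtype=np.float32))
        self.episodes.append(episode)

    def retrieve(self, query_embedding, k=3):
        """Return the k most similar past episodes for use as in-context examples."""
        query = np.asarray(query_embedding, dtype=np.float32)
        bank = np.stack(self.embeddings)
        sims = bank @ query / (np.linalg.norm(bank, axis=1) * np.linalg.norm(query) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [self.episodes[i] for i in top]

# The retrieved (image, trajectory) pairs are prepended to the VQA prompt as
# demonstrations before querying the VLM about the current scene.
```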


What matters in the context?

We also investigate how different data sources affect the performance of S2P. In the TPV scenario, we evaluate context drawn from several sources: the same deployment scenario and the same rooms; the same scenario but different rooms; humans as demonstrators; and, finally, a different scenario with a different robotic platform (online data).
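Restricting retrieval to one of these context sources can be expressed as a simple filter over the episodic memory, as in the self-contained sketch below; the source tags and episode fields are illustrative assumptions.

```python
# Sketch of restricting ICL retrieval to a single context source for the ablation above.
def filter_by_source(episodes, source_tag):
    """Keep only episodes whose metadata matches the requested context source,
    e.g. 'same_rooms', 'different_rooms', 'human_demo', or 'online_data'."""
    return [ep for ep in episodes if ep["source"] == source_tag]

demo_bank = [
    {"source": "same_rooms", "image": "ep01.jpg", "trajectory": [3, 9, 12]},
    {"source": "online_data", "image": "ep02.jpg", "trajectory": [1, 5, 7]},
]
print(filter_by_source(demo_bank, "same_rooms"))
```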

Results in FPV scenario

In the FPV setup, our model, Select2Plan (S2P), was evaluated against state-of-the-art methods across a variety of scenarios. The average Success Rate (SR) of 46.16% in known scenes with known objects reflects the model's ability to perform well even with a minimal set of demonstrations. While the best-performing competitor was trained on 8 million episodes, S2P required only a fraction of the data, specifically one episode per object type (15 in total). Despite this, S2P achieved comparable results, highlighting the efficiency of our approach in leveraging pre-trained Vision-Language Models (VLMs) for knowledge transfer.

Results in TPV scenario

Our In-Context Learning (ICL) approach demonstrated significant improvements in the TPV scenario. The highest Trajectory Score (TS) achieved was 270.70 in context scenario A, which allows unrestricted retrieval from the database, compared to 147.82 for the baseline zero-shot approach. This represents a 54.6% improvement, indicating that our model can effectively leverage contextual information to enhance navigational accuracy and safety. Beyond the overall TS improvement, the framework showed a remarkable 38% reduction in selections of dangerous points (DS), i.e. locations with a high risk of collision. This reduction is crucial for real-world applications, where safe and reliable navigation is paramount. Such performance gains were consistent across the different context scenarios, further highlighting the versatility and adaptability of the proposed approach.