BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion

Code and data will be made publicly available upon acceptance.

Abstract

Motivation

Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods typically rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, such as those hidden behind furniture or moving humans.

To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region, including its occluded areas.
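To make the output representation concrete, the sketch below renders a single target location as a Gaussian peak on an ego-centric BEV grid. The grid extent, resolution, and Gaussian width are illustrative assumptions, not BEACON's actual settings.

```python
import numpy as np

def bev_affordance_heatmap(target_xy, extent_m=5.0, resolution_m=0.1, sigma_m=0.3):
    """Render a target location as a Gaussian peak on an ego-centric BEV grid.

    target_xy: (x, y) offset of the target from the robot in metres
    (x forward, y left). The grid covers [-extent_m, extent_m] on each
    axis, so the robot sits at the centre cell. All parameter values are
    illustrative assumptions, not BEACON's actual configuration.
    """
    n = int(2 * extent_m / resolution_m)                      # cells per side
    centres = (np.arange(n) + 0.5) * resolution_m - extent_m  # cell centres (m)
    gx, gy = np.meshgrid(centres, centres, indexing="ij")
    d2 = (gx - target_xy[0]) ** 2 + (gy - target_xy[1]) ** 2
    return np.exp(-d2 / (2 * sigma_m ** 2))                   # unnormalised Gaussian

heat = bev_affordance_heatmap((1.0, -0.5))
peak = np.unravel_index(np.argmax(heat), heat.shape)
```

Because the heatmap is defined over metric space rather than image pixels, the peak can fall in a region no camera currently sees.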

BEACON introduction

Method

Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features.
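The depth-derived BEV features mentioned above can be obtained by unprojecting each depth image into 3D points and pooling them onto a top-down grid. The sketch below shows this generic projection for a single camera; the intrinsics, grid parameters, and height filter are illustrative assumptions (with four cameras, each point cloud would first be rotated into the shared ego frame), and this is not BEACON's exact module.

```python
import numpy as np

def depth_to_bev_occupancy(depth, fx, fy, cx, cy,
                           extent_m=5.0, resolution_m=0.25, max_height_m=1.0):
    """Bin one depth image's unprojected points into a top-down occupancy grid.

    A generic sketch of depth-derived BEV features; all parameter values
    are illustrative assumptions, not BEACON's actual configuration.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth                                       # forward distance (m)
    x = (u - cx) * z / fx                           # right of camera (m)
    y = (v - cy) * z / fy                           # below camera (m)
    n = int(2 * extent_m / resolution_m)
    i = ((z + extent_m) / resolution_m).astype(int) # BEV row: forward axis
    j = ((x + extent_m) / resolution_m).astype(int) # BEV col: lateral axis
    valid = ((z > 0) & (np.abs(y) < max_height_m)   # drop floor/ceiling points
             & (i >= 0) & (i < n) & (j >= 0) & (j < n))
    bev = np.zeros((n, n))
    np.add.at(bev, (i[valid], j[valid]), 1.0)       # count points per cell
    return bev

# Toy 4x4 depth image, every pixel 2 m away (toy intrinsics).
bev = depth_to_bev_occupancy(np.full((4, 4), 2.0), 2.0, 2.0, 2.0, 2.0)
```

In BEACON these geometric features are fused with the VLM's output; the occupancy grid here stands in for whatever learned BEV encoding the depth branch produces.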

BEACON architecture

Results

Using an occlusion-aware dataset built in the Habitat simulator, we conduct a detailed experimental analysis to validate both our BEV-space formulation and the design choices of each module.

Results
Ablation

Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations.
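The metric above averages the success rate over several distance thresholds. A minimal sketch of that averaging is shown below; the threshold values are illustrative assumptions, and the per-sample distances would be geodesic distances in the scene (a plain Euclidean or precomputed distance is used here as a placeholder).

```python
import numpy as np

def accuracy_over_thresholds(dists, thresholds=(0.5, 1.0, 1.5, 2.0)):
    """Average success rate over distance thresholds.

    dists: per-sample distance (m) from the predicted target to the ground
    truth. The paper uses geodesic distance; the threshold values here are
    illustrative assumptions, not the evaluation protocol's actual values.
    """
    dists = np.asarray(dists, dtype=float)
    accs = [(dists <= t).mean() for t in thresholds]  # success rate per threshold
    return float(np.mean(accs))

avg_acc = accuracy_over_thresholds([0.3, 0.8, 1.2, 2.5])
```

Averaging over thresholds rewards predictions that are close to the target even when they miss the tightest threshold, which suits heatmap-style outputs.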

Data Overview

We use data from 70 scenes, with 75K training samples and 12K unseen validation samples.

Data overview

Examples of language-conditioned local navigation under occlusion. The blue boxes mark the robot, the red boxes highlight humans and objects that cause occlusions, and the green boxes indicate target regions.

Qualitative Analysis

Result example 1

These two examples show BEACON correctly grounding the instruction under heavy occlusion. Because it predicts affordances in BEV space rather than only over visible pixels, BEACON can infer plausible target regions even when they are occluded, where image-space baselines cannot.

Result example 2

These two examples illustrate typical failure modes: confusing the referred landmark or spatial relation, and ambiguity about how far the robot should proceed. Even in these cases, the predicted targets remain spatially feasible.