Data Overview
We use data from 70 scenes, comprising 75K training samples and 12K validation samples drawn from unseen scenes.
Examples of language-conditioned local navigation under occlusion. The blue boxes mark the robot, the red boxes highlight humans and objects that cause occlusions, and the green boxes indicate target regions.
Qualitative Analysis
These two examples show BEACON correctly grounding the instruction under heavy occlusion. Unlike image-space baselines, BEACON can infer plausible target regions even when they are occluded, because it predicts affordance in BEV space beyond the visible pixels.
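To make the BEV idea concrete, the sketch below shows one minimal way a target location could be decoded from a predicted BEV affordance map: take the peak cell of the grid and convert it to metric coordinates relative to the robot. The function name, grid layout, and cell size are illustrative assumptions, not BEACON's actual decoding procedure.

```python
import numpy as np

def select_target_from_bev(affordance: np.ndarray, cell_size: float = 0.1):
    """Pick the peak cell of an (H, W) BEV affordance map and convert it
    to metric (x, y) offsets, with the robot at the grid center.

    Hypothetical helper for illustration; the paper's decoding may differ.
    """
    h, w = affordance.shape
    iy, ix = np.unravel_index(np.argmax(affordance), affordance.shape)
    # Offset from the grid center, scaled from cells to meters.
    x = float((ix - w // 2) * cell_size)
    y = float((iy - h // 2) * cell_size)
    return x, y

# Example: a peak 10 cells (1 m) ahead of the robot. Because the map is
# in BEV space, the peak can sit behind an occluder in the camera image.
grid = np.zeros((64, 64))
grid[32, 42] = 1.0
print(select_target_from_bev(grid))  # (1.0, 0.0)
```

The key property this illustrates is that the argmax is taken over a metric grid rather than over image pixels, so occlusion in the camera view does not remove candidate target cells.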
These two examples illustrate typical failure modes: confusion about the referred landmark or spatial relation, and ambiguity in how far the robot should proceed. Even in these failure cases, the predicted targets remain spatially feasible.