Point What You Mean: Visually Grounded Instruction Policy

Hang Yu¹,³*, Juntu Zhao²,³*, Yufeng Liu²,³, Kaiyu Li³, Cheng Ma³, Di Zhang¹, Yingdong Hu⁴, Guang Chen¹, Junyuan Xie³, Junliang Guo³‡, Junqiao Zhao¹†, Yang Gao³,⁴†

¹Tongji University, ²Shanghai Jiao Tong University, ³Spirit AI, ⁴Tsinghua University

*Equal Contribution. Work done during internship at Spirit AI.
†Corresponding authors. ‡Project leader.


Point-VLA addresses the limits of language-only object reference in robot manipulation by grounding instructions visually with bounding boxes. It reaches a 92.5% average success rate, roughly 3x that of a text-only VLA (32.4%).

Demo Videos

Cluttered Picking

OOD Object Picking

Precise Picking

Plain Placement

Container Placement

Abstract

Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes.

In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (bounding boxes) to resolve referential ambiguity and enable precise, object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort.

We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.

Motivation

The Problem: Language Is Not Enough

Current VLA models rely solely on text to specify manipulation targets. But language is inherently ambiguous for spatial referring:

Verbose & error-prone — Describing "the red cup to the left of the blue bottle, behind the plate" is complex and often still insufficient in cluttered scenes.
Language-action gap — Even when a VLM localizes the correct object with 60–70% accuracy, the downstream text-only VLA achieves only ~25% manipulation success.

Our Insight: Point, Don't Describe

Humans solve this effortlessly — when words fall short, we simply point. Point-VLA adopts the same intuition:

Text-only: "Pick up the bottle to the right of the leftmost bottles, in the middle of the desk" — complex, ambiguous.
Point-VLA: "Pick up" + a bounding box on the target — simple, precise.

High-level intent stays in language; precise spatial reference moves to vision.

Method

Visually Grounded Instructions

Point-VLA overlays a bounding box on the camera image to provide pixel-level spatial grounding. The model is co-trained on both textual and visually grounded instructions, yielding a single unified policy that works in either mode.
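
As an illustration of the overlay step, here is a minimal sketch that draws a rectangle outline onto a camera frame with NumPy. The function name `overlay_box`, the color, the line thickness, and the max-exclusive `(x_min, y_min, x_max, y_max)` box convention are all assumptions made for this example; the source does not specify how Point-VLA renders its boxes.

```python
import numpy as np

def overlay_box(image, box, color=(255, 0, 0), thickness=2):
    """Return a copy of an HxWx3 uint8 image with a rectangle outline drawn on it.

    box is (x_min, y_min, x_max, y_max) in pixels, max-exclusive (assumed convention).
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    # Clamp the box to the image so out-of-range inputs do not wrap around.
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    out = image.copy()
    if x0 >= x1 or y0 >= y1:
        return out  # empty box: nothing to draw
    # Keep the outline no thicker than half the box in either direction.
    t = max(1, min(thickness, (x1 - x0 + 1) // 2, (y1 - y0 + 1) // 2))
    out[y0:y0 + t, x0:x1] = color  # top edge
    out[y1 - t:y1, x0:x1] = color  # bottom edge
    out[y0:y1, x0:x0 + t] = color  # left edge
    out[y0:y1, x1 - t:x1] = color  # right edge
    return out

frame = np.zeros((48, 64, 3), dtype=np.uint8)  # stand-in for a camera frame
marked = overlay_box(frame, (10, 8, 30, 24))
```

The function returns a copy, so the same raw frame can still be used for the text-only mode of the co-trained policy.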

Inference Pipeline: Users provide visual grounding through a GUI or gestures and combine it with a simple text command.
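
To make the GUI path concrete, the sketch below turns a mouse drag into a bounding box and pairs it with a short text command. `Box`, `box_from_drag`, and the dictionary layout of `instruction` are hypothetical names chosen for this example, not the interface used by Point-VLA.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x_min: int
    y_min: int
    x_max: int
    y_max: int

def box_from_drag(start, end, width, height):
    """Convert the two corner points of a mouse drag into a clamped pixel box."""
    (xa, ya), (xb, yb) = start, end
    # The user may drag in any direction, so order the corners first.
    x_min, x_max = sorted((xa, xb))
    y_min, y_max = sorted((ya, yb))

    def clamp(value, upper):
        # Keep coordinates inside [0, upper] in case the drag leaves the window.
        return max(0, min(value, upper))

    return Box(clamp(x_min, width), clamp(y_min, height),
               clamp(x_max, width), clamp(y_max, height))

# A drag that starts bottom-right and ends above the top edge of a 100x80 view.
box = box_from_drag(start=(120, 90), end=(40, -5), width=100, height=80)
instruction = {"text": "Pick up", "box": box}
```

The text carries only the high-level intent, while the box carries the spatial reference.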

Automatic Data Annotation Pipeline

We use multimodal large language models (MLLMs) to automatically generate bounding box annotations from demonstration videos, so little to no manual labeling is needed. Two augmentation strategies improve generalization:

Random Translation — forces the model to learn relative position, not absolute coordinates.
Localized CutMix — prevents overfitting to specific object appearances.
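
One plausible reading of random translation is to shift the image content and its bounding box by the same random offset, so the box never appears at a fixed absolute position. The sketch below follows that reading; the function name, the zero-fill of vacated pixels, and the `max_shift` parameter are assumptions, and the actual procedure may differ.

```python
import numpy as np

def random_translate(image, box, max_shift, rng):
    """Shift image content and its box by the same random (dx, dy) offset.

    Vacated pixels are zero-filled and the box is clipped to the image.
    box is (x_min, y_min, x_max, y_max), max-exclusive (assumed convention).
    """
    h, w = image.shape[:2]
    dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = np.zeros_like(image)
    # Source and destination windows that remain inside the image after the shift.
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    x0, y0, x1, y1 = box
    new_box = (int(np.clip(x0 + dx, 0, w)), int(np.clip(y0 + dy, 0, h)),
               int(np.clip(x1 + dx, 0, w)), int(np.clip(y1 + dy, 0, h)))
    return out, new_box, (dx, dy)

rng = np.random.default_rng(0)
image = np.zeros((40, 40), dtype=np.uint8)
image[10:20, 10:20] = 1  # toy "object" that exactly fills the box
shifted, new_box, (dx, dy) = random_translate(image, (10, 10, 20, 20), 5, rng)
```

Because the object and the box move together, the box still encloses the object after the shift, while its absolute coordinates change from sample to sample.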

Training Pipeline: Automatic data annotation with MLLM, combined with data augmentation strategies (random translation and localized CutMix).
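
The source does not spell out how localized CutMix is applied. A simple interpretation, sketched below, replaces the pixels inside the bounding box with a same-size patch cut from a donor image, so the box location stays informative while the object appearance varies. The function name, the donor image, and the uniform patch sampling are assumptions for this example.

```python
import numpy as np

def localized_cutmix(image, box, donor, rng):
    """Paste a random donor patch, the same size as the box, inside the box.

    box is (x_min, y_min, x_max, y_max), max-exclusive, and is assumed to lie
    inside the image (assumed convention).
    """
    x0, y0, x1, y1 = box
    bh, bw = y1 - y0, x1 - x0
    dh, dw = donor.shape[:2]
    if bh <= 0 or bw <= 0 or bh > dh or bw > dw:
        raise ValueError("box must be non-empty and fit inside the donor image")
    # Pick the top-left corner of the donor patch uniformly at random.
    sy = int(rng.integers(0, dh - bh + 1))
    sx = int(rng.integers(0, dw - bw + 1))
    out = image.copy()
    out[y0:y1, x0:x1] = donor[sy:sy + bh, sx:sx + bw]
    return out

rng = np.random.default_rng(0)
image = np.zeros((32, 32, 3), dtype=np.uint8)
donor = np.full((32, 32, 3), 200, dtype=np.uint8)  # stand-in for another frame
mixed = localized_cutmix(image, (4, 6, 12, 16), donor, rng)
```

Only the region inside the box changes; the rest of the scene is left untouched.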

Results

Evaluation Tasks: Six real-world manipulation tasks spanning diverse challenges — irregular object picking, OOD object picking, cluttered scenario picking, egg-slot picking, plain placement, and egg-slot placement.

Performance Comparison

We compare against two baselines, Text VLA and Interleave-VLA, across six real-world tasks:

Point-VLA achieves 92.5% average success, outperforming Text VLA by +60.1 pts and Interleave-VLA by +52.5 pts.
Cluttered picking: 94.3% vs. 43.3% (Text VLA) — visual grounding excels when language cannot disambiguate.
Egg-slot picking: 86.7% vs. 10.0% (Text VLA) — precise spatial tasks benefit most from bounding box cues.

Success rate (%) per task:
Method	Irregular Object	OOD Object	Clutter Scenario	Egg from Slot	Plain Tabletop	Egg into Slot	Average
Text VLA	30.0	57.5	43.3	10.0	30.0	23.3	32.4
Interleave-VLA	60.0	86.7	33.3	13.3	26.7	20.0	40.0
Point-VLA (ours)	96.7	92.5	94.3	86.7	95.0	90.0	92.5
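
The averages and gaps quoted above can be checked directly from the per-task numbers in the table. The short script below recomputes them; `Decimal` is used only to avoid floating-point rounding surprises at one decimal place, and the variable names are chosen for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task success rates (%) copied from the table, in column order.
RESULTS = {
    "Text VLA":       ["30.0", "57.5", "43.3", "10.0", "30.0", "23.3"],
    "Interleave-VLA": ["60.0", "86.7", "33.3", "13.3", "26.7", "20.0"],
    "Point-VLA":      ["96.7", "92.5", "94.3", "86.7", "95.0", "90.0"],
}

def mean_1dp(values):
    """Mean of decimal strings, rounded half-up to one decimal place."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

averages = {method: mean_1dp(vals) for method, vals in RESULTS.items()}
gains = {method: averages["Point-VLA"] - avg
         for method, avg in averages.items() if method != "Point-VLA"}

for method, avg in averages.items():
    print(f"{method}: {avg}")
for method, gain in gains.items():
    print(f"Point-VLA vs {method}: +{gain} pts")
```

The recomputed values match the Average column (32.4, 40.0, 92.5) and the reported gaps of +60.1 and +52.5 points.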

VLM Localization vs VLA Execution Gap

A key finding reveals the language-action gap:

VLMs localize targets with 60–70% accuracy via text.
But text-only VLAs convert this into only ~25% manipulation success.
Point-VLA bridges this gap by encoding spatial targets as visual cues, bypassing the lossy language bottleneck.

Text Compatibility & Data Scaling

Text compatible — Co-training on both modalities does not degrade text-mode performance.
Scalable — Performance continues to improve with more visually grounded training data.

Text-mode performance is preserved after co-training.

Performance scales with more visually grounded data.

BibTeX

@article{yu2024point,
  title={Point What You Mean: Visually Grounded Instruction Policy},
  author={Yu, Hang and Zhao, Juntu and Liu, Yufeng and Li, Kaiyu and Ma, Cheng and Zhang, Di and Hu, Yingdong and Chen, Guang and Xie, Junyuan and Guo, Junliang and Zhao, Junqiao and Gao, Yang},
  journal={arXiv preprint arXiv:2512.18933},
  year={2025}
}