AIRoA · Robot Learning
A new language for action — not words, but shareable latent thoughts another policy can decode and verify through action prediction.
Order request: green socks → handkerchief → yellow socks
Language is shareable: a reasoning trace is good when a different model can read it and recover the answer.
But a subtask like “put the bottle in box 1” spans many action chunks, which makes it too coarse to say what to do right now. So what plays its role? A shared latent at the granularity of control, which another policy can decode and verify through action prediction.
Top: in a VLM, another model can consume the trace to recover the answer. Bottom: Continuous Reasoning gives VLA the same property — a shared Gaussian latent a second policy must consume to predict the actions.
We argue the “language” of VLA should be a continuous internal interface with three properties — and we instantiate each with a concrete mechanism.
If a reasoning trace is genuinely good, another model instance should benefit from it, not only the model that produced it.
The representation should live in a common latent space that can be transmitted and consumed across instances, not exist as an opaque byproduct of one forward pass.
It should match the temporal granularity of actions — above motor fluctuations, below free-form semantic language — close enough to guide control, not just describe it.
Not every latent is reasoning. Continuous Reasoning makes thoughts verifiable: another model instance must decode the same latent and validate it through action prediction.
A latent can improve one policy while still being a private shortcut. We instead require the thought to travel: the student produces continuous thoughts, maps them into a shared Gaussian code, and an EMA teacher must decode that same code during training. The interface is rewarded only when it supports target action prediction outside the model instance that produced it.
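The training loop described above can be sketched in a few lines. This is a minimal illustration, not the model's actual implementation: the linear heads, the parameter names (`w_mu`, `w_logvar`, `w_act`), the EMA decay value, and the KL term toward a standard normal prior are all assumptions made for the example. What it does show is the structure stated in the text: the student maps continuous thoughts to a Gaussian code, and the action loss is computed from the EMA teacher's decoding of that same code.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(thought_dim, code_dim, action_dim):
    """Hypothetical linear heads: thoughts -> Gaussian code -> actions."""
    return {
        "w_mu": rng.normal(0.0, 0.1, (thought_dim, code_dim)),
        "w_logvar": rng.normal(0.0, 0.1, (thought_dim, code_dim)),
        "w_act": rng.normal(0.0, 0.1, (code_dim, action_dim)),
    }

def encode_code(params, thoughts, noise):
    """Map continuous thoughts to a sample from a diagonal Gaussian code."""
    mu = thoughts @ params["w_mu"]
    logvar = thoughts @ params["w_logvar"]
    code = mu + np.exp(0.5 * logvar) * noise  # reparameterized sample
    return code, mu, logvar

def decode_actions(params, code):
    """Predict actions from the shared code."""
    return code @ params["w_act"]

def ema_update(teacher, student, decay=0.99):
    """Teacher weights track an exponential moving average of the student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def self_verification_loss(student, teacher, thoughts, target_actions, noise):
    """The student emits the code; the teacher must consume it to hit the targets."""
    code, mu, logvar = encode_code(student, thoughts, noise)
    teacher_pred = decode_actions(teacher, code)
    action_loss = np.mean((teacher_pred - target_actions) ** 2)
    # Assumption: pull the code toward N(0, I) so both instances share one space.
    kl = 0.5 * np.mean(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return action_loss, kl

# Toy usage with random data standing in for real thoughts and actions.
student = init_params(thought_dim=8, code_dim=4, action_dim=3)
teacher = {k: v.copy() for k, v in student.items()}
thoughts = rng.normal(size=(5, 8))
targets = rng.normal(size=(5, 3))
noise = rng.normal(size=(5, 4))
action_loss, kl = self_verification_loss(student, teacher, thoughts, targets, noise)
teacher = ema_update(teacher, student)
```

In a real training setup the gradient of the action loss would flow back through the code into the student while the teacher stays frozen between EMA updates; plain NumPy has no autodiff, so the sketch only computes the forward quantities.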
Gains appear where reasoning should matter most: spatial re-anchoring and task-level retargeting. On LIBERO-PRO, Continuous Reasoning raises the success rate averaged across suites from 58.0% to 64.0%, with the clearest improvements on position and task perturbations rather than only appearance or wording shifts.
| Method | TX-G2 (bimanual): Cutlery | TX-G2: Bowl | TX-G2: Clothes | TX-G2: Dish | HSR (mobile): Bottle→Box | HSR: Bottles→Table | HSR: Box | HSR: Mug |
|---|---|---|---|---|---|---|---|---|
| π0.5 | 5.0±3.2 | 47.5±6.8 | 83.3±4.8 | 60.0±7.2 | 83.3±9.6 | 45.8±8.5 | 41.7±14.0 | 66.7±9.6 |
| CR (Ours) | 22.5±6.4 | 70.0±7.2 | 95.0±2.7 | 87.5±4.5 | 83.3±10.8 | 83.3±6.8 | 58.3±12.3 | 75.0±10.2 |
Mean subtask success (%) ± std. Same 150k fine-tuning budget for both.
Platforms. TX-G2 (bimanual, 3 cameras, 10 Hz) — the policy must also choose which arm to use. HSR (mobile manipulation with locomotion, 2 Hz).
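The headline gains of 40.4% (TX-G2) and 26.3% (HSR) can be reproduced from the per-task means in the table, which also makes clear that they are relative improvements over the π0.5 mean rather than percentage-point differences. The short check below assumes the first four task columns belong to TX-G2 and the last four to HSR, a split that is consistent with the reported figures; the variable and function names are illustrative.

```python
# Per-task mean subtask success (%) copied from the table above.
TX_G2 = {
    "pi0.5": [5.0, 47.5, 83.3, 60.0],   # Cutlery, Bowl, Clothes, Dish
    "CR":    [22.5, 70.0, 95.0, 87.5],
}
HSR = {
    "pi0.5": [83.3, 45.8, 41.7, 66.7],  # Bottle→Box, Bottles→Table, Box, Mug
    "CR":    [83.3, 83.3, 58.3, 75.0],
}

def mean(values):
    return sum(values) / len(values)

def relative_gain(results):
    """Return (baseline mean, CR mean, relative gain in %) for one platform."""
    base = mean(results["pi0.5"])
    ours = mean(results["CR"])
    return base, ours, 100.0 * (ours - base) / base

for name, results in [("TX-G2", TX_G2), ("HSR", HSR)]:
    base, ours, gain = relative_gain(results)
    print(f"{name}: relative gain {gain:.1f}%, absolute gain {ours - base:.1f} points")
```

The relative gains round to 40.4% and 26.3%; the corresponding absolute differences in mean subtask success are 19.8 and 15.6 percentage points.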
TX-G2 — a bimanual manipulator
A fixed-base bimanual platform with three cameras, queried at 10 Hz. Objects can appear on either side of the workspace, so the policy must also decide which arm to use.




HSR — a mobile manipulator
Unlike a fixed-base arm, HSR moves through the room as it works: every rollout interleaves locomotion with manipulation — approach, navigate, grasp, carry, and place — queried at 2 Hz. Below are four autonomous rollouts, played at 3×.
Removing Gaussian latent structuring, chunk-causal masking, continuous thoughts, or self-verification lowers success on the perturbations tied most directly to reasoning: position and task. The largest drop comes from removing the shared Gaussian latent, which supports the idea that the interface itself matters.
LIBERO-PRO success rate (%), averaged across the four suites.
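Chunk-causal masking is one of the ablated components, but its exact form is not spelled out here. A common reading, assumed in the sketch below, is a block-causal attention mask: tokens attend freely within their own action chunk and to all earlier chunks, but never to later ones. The function name and the fixed chunk length are hypothetical.

```python
import numpy as np

def chunk_causal_mask(num_chunks, chunk_len):
    """Boolean mask where entry [q, k] is True if query token q may attend to key token k.

    Assumption: a token sees every token in its own chunk and in earlier chunks.
    """
    chunk_ids = np.repeat(np.arange(num_chunks), chunk_len)
    return chunk_ids[:, None] >= chunk_ids[None, :]

mask = chunk_causal_mask(num_chunks=2, chunk_len=2)
print(mask.astype(int))
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 1]
#  [1 1 1 1]]
```

Compared with a token-level causal mask, this keeps each chunk internally bidirectional while preserving temporal order between chunks.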
Beyond scalar success rates, we project the learned reasoning trajectories with PCA. By comparing instruction pairs from the same initial scene, differences reflect task-dependent reasoning rather than scene variation.
Reasoning trajectories on paired LIBERO-PRO scenes. Each instruction is evaluated with five rollouts from the same initial scene. Trajectories converge near shared control phases (e.g. similar grasp geometry) and separate where the tasks demand different strategies — evidence that reasoning reorganizes around task phase and object-specific control demands, not scene identity.
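The projection step can be sketched as follows. The key design choice is to fit a single PCA basis on the pooled latents of both instructions, so that the two sets of trajectories share one coordinate system and can be compared directly. The data here is synthetic noise standing in for real reasoning latents, and the latent size, rollout length, and task labels are placeholders.

```python
import numpy as np

def fit_pca(points, n_components=2):
    """Fit PCA via SVD on centered points of shape (n_samples, latent_dim)."""
    mean = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(trajectories, mean, components):
    """Project latents of shape (..., latent_dim) onto the PCA basis."""
    return (trajectories - mean) @ components.T

rng = np.random.default_rng(0)
latent_dim, steps, n_rollouts = 16, 40, 5

# Synthetic stand-in: five rollouts per instruction from the same initial scene.
rollouts = {
    "task_a": rng.normal(size=(n_rollouts, steps, latent_dim)),
    "task_b": rng.normal(size=(n_rollouts, steps, latent_dim)) + 1.0,
}

# One shared basis for both instructions.
pooled = np.concatenate([r.reshape(-1, latent_dim) for r in rollouts.values()])
mean, components = fit_pca(pooled)
projected = {task: project(r, mean, components) for task, r in rollouts.items()}
```

Each entry of `projected` has shape `(n_rollouts, steps, 2)` and can be drawn as one 2D curve per rollout, colored by instruction.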
On the TX-G2 subtask “pick up the green socks,” we compare three matched variants: target on the left, target on the right, and target that starts on the left and is thrown to the right mid-episode — a displacement that never appears in training data.
Online re-anchoring. The perturbed rollout (middle) starts on the left-target reasoning pattern, but its final reasoning state migrates toward the right-target configuration — consistent with re-anchoring the reasoning interface after the target moves, rather than rigidly replaying the original plan.
Dynamic object injection. The workspace starts empty and a human throws objects in one by one. The reasoning latent stays inert under distractors (e.g. unseen green fruit), then sharply transitions once the true target (green socks) appears, proceeding through distinct pickup (blue) and placement (brown) phases.
Robustness under live human intervention
The robot must place items into the basket in a fixed order (green socks, then handkerchief, then yellow socks). During every rollout a person continuously interferes by moving objects, adding clutter, and throwing new items into the scene. Only the three target items and the basket ever appear in training; everything else is unseen. Continuous Reasoning stays stable and finishes the job.
Four uncut autonomous rollouts, played at 3×. Every object except the three targets and the basket is a distractor unseen during training.
Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. A single reasoning step can span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure.
Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. We instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions.
Empirically, Continuous Reasoning improves robustness on LIBERO-PRO and performs strongly on both HSR and TX-G2 (an AgiBot G2-compatible variant), raising mean real-robot subtask success by 40.4% on TX-G2 and 26.3% on HSR, both measured relative to the π0.5 mean. Reasoning in VLA is less about producing extra tokens than about learning a shareable and verifiable internal language for action.
@misc{wu2026continuousreasoning,
title = {Continuous Reasoning for Vision-Language-Action},
author = {Wu, Yueh-Hua and Matsushima, Tatsuya and Ota, Kei},
year = {2026},
eprint = {TBA},
archivePrefix = {arXiv},
primaryClass = {cs.RO}
}