AIRoA · Robot Learning

Continuous Reasoning for Vision-Language-Action

A new language for action — not words, but shareable latent thoughts another policy can decode and verify through action prediction.

Yueh-Hua (Kris) Wu*, Tatsuya Matsushima, Kei Ota
AIRoA  ·  *Corresponding author

Order request: green socks, handkerchief, yellow socks

What should play the role of language for vision-language-action models?

In language models

Language is shareable: a reasoning trace is good when a different model can read it and recover the answer.

For robots

But a subtask like “put the bottle in box 1” spans many action chunks, which makes it too coarse to say what to do right now. So what plays the role of language? A shared latent, at the granularity of control, that another policy can decode and verify through action prediction.

Continuous Reasoning motivation: language as a shareable reasoning medium in VLMs, and its analogue for VLA

Top: in a VLM, another model can consume the trace to recover the answer. Bottom: Continuous Reasoning gives VLA the same property — a shared Gaussian latent a second policy must consume to predict the actions.

Three Properties of a Good Reasoning Medium

We argue the “language” of VLA should be a continuous internal interface with three properties — and we instantiate each with a concrete mechanism.

Property 1

Reusable

If a reasoning trace is genuinely good, another model instance should benefit from it, not only the model that produced it.

→ Self-verification: an EMA teacher consumes the same thought.

Property 2

Shareable

The representation should live in a common latent space that can be transmitted and consumed across instances, not exist as an opaque byproduct of one forward pass.

→ Shared Gaussian code: a WAE structures the thought space.

Property 3

Abstraction-aligned

It should match the temporal granularity of actions — above motor fluctuations, below free-form semantic language — close enough to guide control, not just describe it.

→ Chunk-level control: reasoning stays close to action generation.
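
One way to read "a WAE structures the thought space" is the MMD variant of a Wasserstein autoencoder, where encoded thoughts are pushed toward a standard Gaussian prior by a kernel two-sample penalty. The sketch below is an assumption about that mechanism, not the authors' implementation: the function names, the RBF kernel, and the bandwidth heuristic are all hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth**2))


def mmd_to_gaussian_prior(codes, rng, bandwidth=None):
    """Biased MMD^2 estimate between latent codes and samples from N(0, I).

    codes: array of shape (batch, latent_dim), the encoded thoughts.
    A small value means the batch of codes looks like the Gaussian prior.
    """
    if bandwidth is None:
        # Assumed heuristic: scale the kernel width with the latent dimension.
        bandwidth = np.sqrt(codes.shape[1])
    prior = rng.standard_normal(codes.shape)
    k_cc = rbf_kernel(codes, codes, bandwidth).mean()
    k_pp = rbf_kernel(prior, prior, bandwidth).mean()
    k_cp = rbf_kernel(codes, prior, bandwidth).mean()
    return k_cc + k_pp - 2.0 * k_cp


rng = np.random.default_rng(0)
gaussian_codes = rng.standard_normal((256, 8))
print(mmd_to_gaussian_prior(gaussian_codes, rng))        # small: codes match the prior
print(mmd_to_gaussian_prior(gaussian_codes + 3.0, rng))  # larger: codes are off-prior
```

In training, a penalty of this kind would be added to the action loss so that every model instance reads and writes thoughts in the same Gaussian-shaped space.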

How It Works

The core idea

Not every latent is reasoning. Continuous Reasoning makes thoughts verifiable: another model instance must decode the same latent and validate it through action prediction.

Continuous Reasoning architecture

VLA-A writes continuous thoughts into a shared Gaussian latent code. VLA-B receives that same code and must use its decoded interface to predict the target action field, turning reuse by another model instance into the verification signal.

Why it works

A latent can improve one policy while still being a private shortcut. We instead require the thought to travel: the student produces continuous thoughts, maps them into a shared Gaussian code, and an EMA teacher must decode that same code during training. The interface is rewarded only when it supports target action prediction outside the model instance that produced it.
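
The student and teacher roles described above can be illustrated with a toy sketch. It assumes linear encoders and decoders, a mean-squared-error action loss, and a decay of 0.999; none of these come from the paper, and the names `ema_update`, `self_verification_loss`, `enc`, and `dec` are hypothetical.

```python
import numpy as np


def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def self_verification_loss(student, teacher, obs, target_actions):
    """The student writes the code; the teacher must decode it into the actions.

    In a real training loop the gradient of this loss would flow only into the
    student, while the teacher is refreshed with ema_update (assumption).
    """
    code = obs @ student["enc"]        # student-produced shared code
    predicted = code @ teacher["dec"]  # teacher consumes that same code
    return float(np.mean((predicted - target_actions) ** 2))


rng = np.random.default_rng(0)
student = {"enc": rng.standard_normal((4, 3)), "dec": rng.standard_normal((3, 2))}
teacher = {name: weights.copy() for name, weights in student.items()}

obs = rng.standard_normal((5, 4))
target_actions = rng.standard_normal((5, 2))
print(self_verification_loss(student, teacher, obs, target_actions))
teacher = ema_update(teacher, student)
```

The point of the design is that the loss is computed on the teacher's prediction, so a code that only the student can interpret earns no reward.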

Under the hood

Interface: Continuous thoughts
Shared code: Gaussian latent (WAE)
Verification: EMA teacher
Actions: Chunk-causal flow matching
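
The page does not define "chunk-causal" precisely. One common reading, assumed here, is a block-causal attention mask: tokens attend freely within their own action chunk and to all earlier chunks, but never to later ones. The function name and arguments below are hypothetical.

```python
import numpy as np


def chunk_causal_mask(num_chunks, chunk_len):
    """Boolean mask where mask[i, j] is True if query i may attend to key j.

    A query may see every token in its own chunk and in earlier chunks.
    """
    chunk_id = np.repeat(np.arange(num_chunks), chunk_len)
    return chunk_id[:, None] >= chunk_id[None, :]


mask = chunk_causal_mask(num_chunks=3, chunk_len=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Compared with a token-level causal mask, this keeps each chunk internally bidirectional while preserving temporal order between chunks.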

Results

+40.4% TX-G2 real-robot
+26.3% HSR real-robot
Relative improvement in mean real-robot subtask success over π0.5, under the same 150k fine-tuning budget.

Gains appear where reasoning should matter most: spatial re-anchoring and task-level retargeting. On LIBERO-PRO, Continuous Reasoning raises the average suite mean from 58.0 to 64.0, with the clearest improvements on position and task perturbations rather than only appearance or wording shifts.

LIBERO-PRO average suite mean: 58.0 → 64.0
Position perturbation (action re-anchoring): 26.8 → 39.3
Task perturbation (goal-level transfer): 24.8 → 37.1
Per-task real-robot subtask success vs. π0.5

Method      TX-G2 (bimanual)                            HSR (mobile)
            Cutlery    Bowl       Clothes    Dish       Bottle→Box  Bottles→Table  Box         Mug
π0.5        5.0±3.2    47.5±6.8   83.3±4.8   60.0±7.2   83.3±9.6    45.8±8.5       41.7±14.0   66.7±9.6
CR (Ours)   22.5±6.4   70.0±7.2   95.0±2.7   87.5±4.5   83.3±10.8   83.3±6.8       58.3±12.3   75.0±10.2

Mean subtask success (%) ± std. Same 150k fine-tuning budget for both.
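
The headline relative improvements can be reproduced from the per-task means. The short check below assumes the headline figure is the relative change in the unweighted mean over each platform's four tasks; the variable and function names are illustrative.

```python
# Per-task mean subtask success (%) copied from the results above.
pi05 = {
    "TX-G2": [5.0, 47.5, 83.3, 60.0],
    "HSR": [83.3, 45.8, 41.7, 66.7],
}
cr = {
    "TX-G2": [22.5, 70.0, 95.0, 87.5],
    "HSR": [83.3, 83.3, 58.3, 75.0],
}


def relative_gain(baseline, ours):
    """Relative improvement (%) of the mean of `ours` over the mean of `baseline`."""
    base_mean = sum(baseline) / len(baseline)
    ours_mean = sum(ours) / len(ours)
    return 100.0 * (ours_mean - base_mean) / base_mean


for platform in pi05:
    print(platform, round(relative_gain(pi05[platform], cr[platform]), 1))
# TX-G2 40.4
# HSR 26.3
```

The gains are relative, not absolute: on TX-G2 the mean rises from 48.95% to 68.75%, which is 19.8 percentage points.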

TX-G2 and HSR platforms

Platforms. TX-G2 (bimanual, 3 cameras, 10 Hz) — the policy must also choose which arm to use. HSR (mobile manipulation with locomotion, 2 Hz).

TX-G2 — a bimanual manipulator

A fixed-base bimanual platform with three cameras, queried at 10 Hz. Objects can appear on either side of the workspace, so the policy must also decide which arm to use.

Cutlery Transfer
Bowl Stacking
Clothes Sorting
Dish Racking

HSR — a mobile manipulator

Unlike a fixed-base arm, HSR moves through the room as it works: every rollout interleaves locomotion with manipulation — approach, navigate, grasp, carry, and place — queried at 2 Hz. Below are four autonomous rollouts, played at 3×.

Every Ingredient Matters

Removing Gaussian latent structuring, chunk-causal masking, continuous thoughts, or self-verification lowers success on the perturbations tied most directly to reasoning: position and task. The largest drop comes from removing the shared Gaussian latent, which supports the idea that the interface itself matters.

Variant                    Position perturbation   Task perturbation
Full CR                    39.3                    37.1
w/o chunk-causal mask      32.5                    29.2
w/o continuous thoughts    32.5                    28.6
w/o self-verification      32.4                    28.1
w/o Gaussian latent        30.2                    24.9

LIBERO-PRO success rate (%), averaged across the four suites.
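
As a quick consistency check on the ablation numbers, the snippet below computes the drop from Full CR for each removed component and reports the largest one per perturbation type. The dictionary layout and the `largest_drop` helper are illustrative only.

```python
# LIBERO-PRO success rates (%) from the ablation above.
full = {"position": 39.3, "task": 37.1}
ablations = {
    "w/o chunk-causal mask": {"position": 32.5, "task": 29.2},
    "w/o continuous thoughts": {"position": 32.5, "task": 28.6},
    "w/o self-verification": {"position": 32.4, "task": 28.1},
    "w/o Gaussian latent": {"position": 30.2, "task": 24.9},
}


def largest_drop(perturbation):
    """Return the ablation with the biggest loss versus Full CR, in points."""
    drops = {
        name: full[perturbation] - scores[perturbation]
        for name, scores in ablations.items()
    }
    worst = max(drops, key=drops.get)
    return worst, round(drops[worst], 1)


print(largest_drop("position"))  # ('w/o Gaussian latent', 9.1)
print(largest_drop("task"))      # ('w/o Gaussian latent', 12.2)
```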

What the Reasoning Actually Encodes

Beyond scalar success rates, we project the learned reasoning trajectories with PCA. Because we compare instruction pairs from the same initial scene, differences reflect task-dependent reasoning rather than scene variation.

PCA of continuous reasoning trajectories

Reasoning trajectories on paired LIBERO-PRO scenes. Each instruction is evaluated with five rollouts from the same initial scene. Trajectories converge near shared control phases (e.g. similar grasp geometry) and separate where the tasks demand different strategies — evidence that reasoning reorganizes around task phase and object-specific control demands, not scene identity.
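
A projection of this kind can be sketched with plain NumPy. The example assumes the reasoning latents are stored as an array of shape (rollouts, steps, latent_dim) and that a single PCA basis is fit on all rollouts jointly so the trajectories share one coordinate system; both are assumptions, and the random data only stands in for real latents.

```python
import numpy as np


def pca_project(trajectories, n_components=2):
    """Project latent trajectories onto their top principal components.

    trajectories: array of shape (num_rollouts, num_steps, latent_dim).
    Returns an array of shape (num_rollouts, num_steps, n_components).
    """
    flat = trajectories.reshape(-1, trajectories.shape[-1])
    centered = flat - flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return projected.reshape(*trajectories.shape[:-1], n_components)


rng = np.random.default_rng(0)
latents = rng.standard_normal((10, 20, 16))  # placeholder for logged thoughts
points = pca_project(latents)
print(points.shape)  # (10, 20, 2)
```

Fitting one shared basis matters for the comparison: with a separate PCA per instruction, apparent convergence or separation between trajectories could be an artifact of different axes.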

Reasoning Re-anchors Online

On the TX-G2 subtask “pick up the green socks,” we compare three matched variants: target on the left, target on the right, and target that starts on the left and is thrown to the right mid-episode — a displacement that never appears in training data.

Online re-anchoring of reasoning

Online re-anchoring. The perturbed rollout (middle) starts on the left-target reasoning pattern, but its final reasoning state migrates toward the right-target configuration — consistent with re-anchoring the reasoning interface after the target moves, rather than rigidly replaying the original plan.

Dynamic object injection. The workspace starts empty and a human throws objects in one by one. The reasoning latent stays inert under distractors (e.g. unseen green fruit), then sharply transitions once the true target (green socks) appears, proceeding through distinct pickup (blue) and placement (brown) phases.

Robustness under live human intervention

Manipulation stays stable, even as the scene changes.

Green socks → Handkerchief → Yellow socks

The robot must place items into the basket in this fixed order. During every rollout a person continuously interferes—moving objects, adding clutter, and throwing new items into the scene. Only the three target items and the basket ever appear in training; everything else is unseen. Continuous Reasoning stays stable and finishes the job.

Four uncut autonomous rollouts, played at 3×. Every object except the three targets and the basket is a distractor unseen during training.

Paper Abstract

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. A single reasoning step can span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure.

Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. We instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions.

Empirically, Continuous Reasoning improves robustness on LIBERO-PRO and performs strongly on both HSR and TX-G2 (an AgiBot G2-compatible variant), raising mean real-robot subtask success over π0.5 by 40.4% on TX-G2 and 26.3% on HSR. Reasoning in VLA is less about producing extra tokens than about learning a shareable and verifiable internal language for action.

BibTeX

@misc{wu2026continuousreasoning,
  title         = {Continuous Reasoning for Vision-Language-Action},
  author        = {Wu, Yueh-Hua and Matsushima, Tatsuya and Ota, Kei},
  year          = {2026},
  eprint        = {TBA},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}