Definition
Action chunking is a technique in robot imitation learning where a policy predicts a fixed-length sequence of future actions (typically tens of timesteps, up to 100 in the original ACT work) in a single forward pass, rather than generating one action per inference step. By outputting an entire trajectory chunk at once, the policy produces temporally smooth, coordinated motions and avoids the jittery, inconsistent behavior that plagues single-step prediction.
The most widely adopted implementation is Action Chunking with Transformers (ACT), introduced by Tony Zhao et al. in 2023. ACT combines a Conditional Variational Autoencoder (CVAE) with a transformer encoder-decoder architecture. The CVAE captures the distribution of possible action sequences conditioned on the current observation, while the transformer attends to both visual features (from camera images) and proprioceptive state (joint positions, gripper state) to decode an entire action chunk.
ACT was specifically designed for bimanual manipulation tasks using the ALOHA teleoperation system, where coordinating two robot arms demands smooth, temporally coherent trajectories. It has since become one of the two dominant policy architectures in the robot learning community, alongside Diffusion Policy.
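The core idea can be illustrated as a difference in policy interface. The following toy sketch (random actions, not a trained model) contrasts per-step prediction with chunked prediction; the observation and action dimensions are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def single_step_policy(obs):
    # Per-step policy: one observation in, one action out.
    return rng.normal(size=7)            # e.g. 7-DoF joint targets

def chunked_policy(obs, chunk_size=20):
    # Action-chunking policy: one observation in, a whole trajectory
    # segment of chunk_size future actions out, in a single call.
    return rng.normal(size=(chunk_size, 7))

obs = np.zeros(14)                       # toy proprioceptive state
actions = chunked_policy(obs)            # shape (20, 7)
```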
How It Works
During training, ACT takes an expert demonstration and encodes the full action sequence through the CVAE encoder to produce a latent variable z. This latent is concatenated with the current observation and fed to the transformer decoder, which predicts the action sequence. The training loss combines action reconstruction (L1 or L2 loss between predicted and ground-truth actions) with a KL-divergence term that regularizes the latent space.
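The training objective above can be sketched numerically. This minimal NumPy version treats the encoder and decoder as already having produced their outputs, and uses the L1 reconstruction variant; the KL weight of 10 follows the value reported in the ACT paper:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def act_loss(pred_actions, gt_actions, mu, logvar, kl_weight=10.0):
    # L1 reconstruction over the whole action chunk, plus the KL term
    # that regularizes the CVAE latent toward the standard-normal prior.
    recon = np.mean(np.abs(pred_actions - gt_actions))
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)
```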
At inference time, the CVAE encoder is discarded. The latent z is sampled from the prior (a standard Gaussian), concatenated with the current observation, and the transformer decoder generates the action chunk. The robot executes the first k actions from the chunk, then queries the policy again. This overlapping execution creates a receding-horizon control scheme.
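The receding-horizon loop can be sketched as follows. The decoder here is a stand-in that returns random actions; in a real system it would be the trained transformer conditioned on z and the observation, and `env_step` would command the robot and return the next observation:

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 14  # e.g. two 6-DoF arms plus two grippers

def decode_chunk(obs, chunk_size=100):
    # Stand-in for the trained decoder: sample the latent z from the
    # standard-normal prior, then decode a full action chunk.
    z = rng.normal(size=32)
    return rng.normal(size=(chunk_size, ACTION_DIM))

def control_loop(env_step, obs, horizon=300, k=50, chunk_size=100):
    # Receding-horizon execution: predict a chunk, execute only the
    # first k actions, then re-query the policy on the new observation.
    executed = []
    while len(executed) < horizon:
        chunk = decode_chunk(obs, chunk_size)
        for action in chunk[:k]:
            obs = env_step(action)
            executed.append(action)
            if len(executed) >= horizon:
                break
    return executed
```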
A key insight is the temporal ensemble technique: rather than executing all actions from a single chunk, the system maintains a weighted average over overlapping predictions from consecutive chunks. This smooths transitions between chunks and further reduces jitter, particularly for contact-rich tasks like insertion or folding.
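The weighted average can be sketched directly from the paper's exponential scheme, where the prediction from chunk i receives weight exp(-m·i) with i = 0 denoting the oldest chunk still covering the current timestep:

```python
import numpy as np

def temporal_ensemble(preds_for_t, m=0.01):
    # preds_for_t: the actions that every still-overlapping chunk
    # predicted for the current timestep, ordered oldest first.
    # Prediction i gets weight exp(-m * i); smaller m gives a more
    # uniform average, incorporating new observations faster.
    preds = np.asarray(preds_for_t)
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()
    return (preds * w[:, None]).sum(axis=0)
```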
Key Variants
- Standard ACT — The original CVAE + transformer architecture with L1 action loss and KL regularization. Uses joint-space action targets and images from wrist- and workspace-mounted cameras.
- ACT + DAgger — Combines ACT with Dataset Aggregation: after initial training, the policy is deployed while a human provides corrective demonstrations on states the policy visits. This iterative process addresses the distribution shift problem of pure behavior cloning.
- Temporal Ensemble ACT — Applies exponentially weighted averaging over overlapping action chunks during execution. The weighting parameter controls how much influence older predictions retain versus newer ones.
- ACT in LeRobot — Hugging Face's LeRobot framework provides a standardized ACT implementation with configurable chunk sizes, vision backbones (ResNet, ViT), and training hyperparameters. This has become the de facto reference implementation.
Comparison with Alternatives
ACT vs. Diffusion Policy: ACT produces deterministic action chunks (conditioned on a sampled latent) in a single forward pass, making it fast at inference (~4ms per chunk on a GPU). Diffusion Policy uses iterative denoising (10-100 steps) to generate actions, which is slower but naturally handles multimodal action distributions. For tasks with a single clear strategy, ACT tends to be simpler and faster. For tasks where multiple valid approaches exist (e.g., reaching around an obstacle from either side), Diffusion Policy's multimodal nature gives it an advantage.
ACT vs. BC-RNN: Recurrent behavior cloning (BC-RNN) generates actions one step at a time using an LSTM or GRU. This is prone to temporal inconsistency because small prediction errors compound across steps. ACT sidesteps this by predicting entire chunks, making it significantly more robust for multi-step manipulation sequences.
ACT vs. VLA models: Vision-Language-Action models like RT-2 or OpenVLA incorporate language understanding and can generalize across tasks described in natural language. ACT is a specialist: it excels at a single task with minimal data but does not understand language instructions. The two approaches are complementary rather than competing.
Practical Requirements
Data: ACT is remarkably data-efficient. For simple single-arm tasks (pick and place, pushing), 20-50 teleoperation demonstrations are often sufficient. Bimanual tasks (folding, insertion, handovers) typically require 50-200 demonstrations. Data quality matters more than quantity: consistent demonstrations with minimal pauses produce better policies.
Compute: Training an ACT policy takes 1-4 hours on a single consumer GPU (RTX 3090 / 4090) for typical datasets of 50-200 demonstrations. Inference runs at 200+ Hz on a GPU, far exceeding the typical 10-50 Hz robot control rate, so ACT adds negligible latency to the control loop.
Hardware: ACT was designed for position-controlled robot arms. It works best with low-cost arms like ALOHA (ViperX 300), Koch v1.1, or SO-100 where joint position commands are the native interface. Force-controlled arms require additional care, as ACT does not natively output torques or impedance parameters.
Code Example: Training ACT with LeRobot
```bash
# Install LeRobot
pip install lerobot

# Train an ACT policy on your dataset
# (flag names vary across LeRobot versions; check `--help` for yours)
python lerobot/scripts/train.py \
  --policy.type=act \
  --dataset.repo_id=your_hf_username/your_dataset \
  --training.num_epochs=2000 \
  --policy.chunk_size=100 \
  --policy.n_action_steps=100 \
  --policy.input_normalization_modes='{"observation.images.top":"mean_std","observation.state":"mean_std"}' \
  --output_dir=outputs/act_pick_place

# Evaluate the trained policy
python lerobot/scripts/eval.py \
  --policy.path=outputs/act_pick_place/checkpoints/last/pretrained_model \
  --env.type=real_world
```
Choosing the Right Chunk Size
Chunk size is the most impactful hyperparameter in ACT, and the right value depends heavily on the task dynamics:
- Short chunks (8-20 actions) — Best for reactive tasks where the environment changes frequently: tasks with dynamic objects, force-sensitive operations, or tasks near obstacles. Short chunks allow the policy to react to new observations more frequently. The tradeoff is increased jitter between chunks.
- Medium chunks (20-50 actions) — The sweet spot for most tabletop manipulation: pick-and-place, stacking, simple assembly. Long enough for smooth trajectory segments but short enough to correct course when needed.
- Long chunks (50-100 actions) — Best for tasks with long, smooth motion segments: wiping, spreading, drawing, or bimanual coordination where both arms must follow synchronized trajectories. The original ACT paper used chunk_size=100 at 50 Hz, equivalent to 2 seconds of action. Long chunks produce the smoothest motions but cannot react to unexpected changes within a chunk.
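Because reactivity is bounded by the chunk's duration in wall-clock time, it helps to convert chunk size into seconds at the robot's control rate (a trivial helper, shown here only to make the arithmetic explicit):

```python
def chunk_duration_s(chunk_size, control_hz):
    # Wall-clock duration of motion covered by one action chunk.
    return chunk_size / control_hz

print(chunk_duration_s(100, 50))  # 2.0 s: the original ACT setting
print(chunk_duration_s(20, 30))   # ~0.67 s: a more reactive setting
```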
The temporal ensemble coefficient (m) controls how overlapping predictions are weighted: a prediction receives weight exp(-m·i), with i = 0 for the oldest chunk that still covers the current timestep. Smaller m averages nearly uniformly across chunks, incorporating new observations faster; larger m concentrates weight on the oldest predictions, holding on to earlier plans longer. Start with m = 0.01, the value used in the original work, and adjust if you observe jerky transitions between chunks.
See Also
- Data Services — High-quality teleoperation data collection for ACT training
- Data Platform — Manage LeRobot-formatted datasets for ACT and Diffusion Policy
- Robot Leasing — Access ALOHA, OpenArm 101, and DK1 for bimanual and single-arm policy training
- Hardware Catalog — Low-cost arms and teleoperation rigs for ACT research
- Contact SVRC — Schedule a consultation on ACT vs Diffusion Policy for your task
Key Papers
- Zhao, T., Kumar, V., Levine, S., & Finn, C. (2023). "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS 2023. The original ACT paper, introducing CVAE-based action chunking and the temporal ensemble technique for bimanual ALOHA tasks.
- Aldaco, J. et al. (2024). "ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation." Improves the robustness and ergonomics of the ALOHA platform on which ACT was originally trained.
- Cadene, R. et al. (2024). "LeRobot: State-of-the-art Machine Learning for Real-World Robotics in a Hugging Face Repository." Provides the reference open-source ACT implementation used by most practitioners.
Related Terms
- Diffusion Policy — Iterative denoising approach to action generation with multimodal support
- Behavior Cloning — The foundational supervised imitation learning method ACT builds upon
- Imitation Learning — The broader paradigm of learning from demonstrations
- Teleoperation — How demonstration data is collected for ACT training
- Embodied AI — The field of AI systems that act in the physical world
Apply This at SVRC
Silicon Valley Robotics Center provides the full stack for ACT-based policy training: ALOHA teleoperation hardware for data collection, GPU workstations for training, and real robot cells for evaluation. Our data services team can collect high-quality demonstrations for your specific manipulation tasks.