CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning
Lei Shi,
Andreas Bulling
Proc. IEEE International Conference on Robot and Human Interactive Communication (RO-MAN),
2026.
Abstract
We propose CLAD, a Constrained Latent Action Diffusion model for vision-language procedure planning, the challenging task of predicting a sequence of actions that leads from a start state to an intended goal state. Although procedure planning is critical for robot skill learning and for assistive robots, it has been largely neglected so far, and existing methods have not leveraged semantic information for action generation. In contrast, CLAD exploits the fact that the latent space of diffusion models trained for procedure planning contains rich semantic information. Our method uses a Variational Autoencoder (VAE) to learn latent representations of actions and observations and integrates them as constraints into a diffusion process. These latent constraints steer the diffusion model to generate better actions in the procedural plan. We report extensive experiments on four datasets, three covering human procedure planning and one covering robot learning, and show that our method outperforms state-of-the-art methods by a large margin. We demonstrate that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.
@inproceedings{shi26_roman,
title = {{{CLAD}}: {{Constrained Latent Action Diffusion}} for {{Vision-Language Procedure Planning}}},
author = {Shi, Lei and Bulling, Andreas},
year = {2026},
booktitle = {Proc. IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)}
}