Zero-Shot Offline Imitation Learning via Optimal Transport
Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.
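The occupancy-matching idea in the abstract can be sketched as an optimal-transport problem. The display below is only an illustration in our own notation (the symbols ρ_π, ρ_E, γ, and V are not taken from the paper), showing how a goal-conditioned value function could be lifted into a ground cost between occupancies:

```latex
% Illustrative sketch only; symbol names are ours, not the paper's.
% \rho_\pi : state occupancy induced by the policy (estimated by rolling out a learned world model)
% \rho_E   : occupancy of the expert demonstration (its states viewed as goals)
% V(s, g)  : goal-conditioned value function, negated to serve as a transport cost
\min_{\pi}\; W_c\!\left(\rho_\pi, \rho_E\right)
  \;=\; \min_{\pi}\; \min_{\gamma \in \Pi(\rho_\pi,\, \rho_E)}
        \mathbb{E}_{(s,\, g) \sim \gamma}\!\left[\, c(s, g) \,\right],
\qquad c(s, g) = -\,V(s, g).
```

In practice, the transport plan γ between the finitely many predicted and demonstrated states could be computed with a standard OT solver; the exact formulation used in the paper may differ.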
| Author(s): | Rupf, Thomas and Bagatella, Marco and Gürtler, Nico and Frey, Jonas and Martius, Georg |
| Book Title: | Proceedings of the 42nd International Conference on Machine Learning (ICML) |
| Volume: | 267 |
| Pages: | 52345--52381 |
| Year: | 2025 |
| Month: | July |
| Series: | Proceedings of Machine Learning Research |
| Editors: | Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry |
| Publisher: | PMLR |
| BibTeX Type: | Conference Paper (inproceedings) |
| Event Name: | International Conference on Machine Learning |
| Event Place: | Vancouver Convention Center, Vancouver, Canada |
| State: | Published |
| URL: | https://proceedings.mlr.press/v267/rupf25a.html |
| Eprint: | arXiv:2410.08751 |
BibTeX
```bibtex
@inproceedings{rupf2024:ZILOT,
title = {Zero-Shot Offline Imitation Learning via Optimal Transport},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
abstract = {Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.},
volume = {267},
pages = {52345--52381},
series = {Proceedings of Machine Learning Research},
editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
publisher = {PMLR},
month = jul,
year = {2025},
author = {Rupf, Thomas and Bagatella, Marco and G{\"u}rtler, Nico and Frey, Jonas and Martius, Georg},
eprint = {arXiv:2410.08751},
url = {https://proceedings.mlr.press/v267/rupf25a.html},
}
```