ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning

TMLR 2025

1Delft University of Technology, 2RWTH Aachen University

Abstract

Human teaching effort is a significant bottleneck for the broader applicability of interactive imitation learning. To reduce the number of required queries, existing methods employ active learning to query the human teacher only in uncertain, risky, or novel situations. However, during these queries, the novice's planned actions are not utilized, even though they contain valuable information about the novice's capabilities and the corresponding uncertainty levels. To address this, we allow the novice to say: "I plan to do this, but I am uncertain". We introduce the Active Skill-level Data Aggregation (ASkDAgger) framework, which leverages teacher feedback on the novice plan in three key ways:

  1. S-Aware Gating (SAG), which adjusts the gating threshold to track sensitivity, specificity, or a minimum success rate;
  2. Foresight Interactive Experience Replay (FIER), which recasts valid and relabeled novice action plans into demonstrations; and
  3. Prioritized Interactive Experience Replay (PIER), which prioritizes replay based on uncertainty, novice success, and demonstration age.
Together, these components balance query frequency with failure incidence, reduce the number of required demonstration annotations, improve generalization, and speed up adaptation to changing domains. We validate the effectiveness of ASkDAgger through language-conditioned manipulation tasks in both simulation and real-world environments. Code, data, and videos are available at this project page.

ASkDAgger

The Active Skill-level Data Aggregation (ASkDAgger) framework consists of three main components: S-Aware Gating (SAG), Foresight Interactive Experience Replay (FIER), and Prioritized Interactive Experience Replay (PIER). In this interactive imitation learning framework, we allow the novice to say: "I plan to do this, but I am uncertain." The uncertainty gating threshold is set by SAG to track a user-specified metric: sensitivity, specificity, or a minimum system success rate. This facilitates the trade-off between queries and failures. Teacher feedback is obtained with FIER, which yields demonstrations through validation, relabeling, or annotation. Lastly, PIER prioritizes replay based on novice success, uncertainty, and demonstration age.
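To make the interplay of these components concrete, below is a minimal Python sketch of one interaction episode. The `novice`, `teacher`, and `env` interfaces and the specific priority formula are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of an ASkDAgger-style interaction episode (illustrative only).
import random
import time


def pier_priority(uncertainty, novice_succeeded, demo_time, now):
    """Illustrative PIER-style priority from uncertainty, novice success, and age."""
    success_term = 0.5 if novice_succeeded else 1.0   # replay failures more often
    recency_term = 1.0 / (1.0 + (now - demo_time))    # favor recent demonstrations
    return uncertainty * success_term + recency_term


class PrioritizedBuffer:
    """Replay buffer that recomputes priorities at sampling time."""

    def __init__(self):
        self.entries = []  # (demonstration, uncertainty, novice_success, timestamp)

    def add(self, demo, uncertainty, novice_success):
        self.entries.append((demo, uncertainty, novice_success, time.time()))

    def sample(self, k):
        if not self.entries:
            return []
        now = time.time()
        weights = [pier_priority(u, s, t, now) for _, u, s, t in self.entries]
        demos = [d for d, _, _, _ in self.entries]
        return random.choices(demos, weights=weights, k=min(k, len(demos)))


def askdagger_episode(novice, teacher, env, buffer, gate):
    """One interaction episode: gate on uncertainty, collect feedback, execute."""
    obs, command = env.reset()
    plan, uncertainty = novice.plan(obs, command)   # "I plan to do this ..."
    if uncertainty > gate.threshold:                # "... but I am uncertain."
        # FIER: the teacher validates the plan, relabels it, or annotates a new one.
        feedback = teacher.review(obs, command, plan)
        buffer.add(feedback.demonstration, uncertainty, feedback.plan_is_valid)
        plan = feedback.plan_to_execute
    success = env.execute(plan)
    gate.update(uncertainty, success)               # SAG adjusts its threshold
    return success
```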

Since ASkDAgger relies on the novice communicating its planned actions for teacher feedback, the method is most practical for moderate feedback frequencies. ASkDAgger therefore targets mid- to high-level control tasks rather than end-to-end policy learning. It is most applicable in scenarios where a robot has access to predefined parameterizable skills such as grasping, walking, pushing, door opening, screwing, or inserting. In such cases, the robot novice needs to learn the parameters and affordances of these skills given a user-specified command. When querying the teacher, the robot novice can specify which skill it plans to use, along with the parameterization of that skill. If the teacher deems the novice's plan invalid, they can provide a demonstration by annotating the appropriate skill and its parameters. For example, a pick skill can be parameterized by a Cartesian pick position and orientation.
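As an illustration, such a skill-level query and teacher annotation could be represented along the following lines; the `SkillQuery` class, its fields, and the numeric values are hypothetical placeholders rather than ASkDAgger's actual interface.

```python
# Hypothetical representation of a skill-level query and a teacher correction.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SkillQuery:
    """Novice proposal sent to the teacher: a skill name plus its parameters."""
    skill: str                                      # e.g. "pick", "place", "insert"
    position: Tuple[float, float, float]            # Cartesian position (m)
    orientation: Tuple[float, float, float, float]  # quaternion (x, y, z, w)
    uncertainty: float                              # novice's uncertainty in this plan


# The novice proposes a pick and reports that it is uncertain.
proposal = SkillQuery(skill="pick",
                      position=(0.45, -0.10, 0.02),
                      orientation=(0.0, 0.0, 0.0, 1.0),
                      uncertainty=0.8)

# If the plan is invalid, the teacher annotates a corrected skill and parameters,
# which becomes a demonstration; if the plan is valid, the proposal itself does.
correction = SkillQuery(skill="pick",
                        position=(0.52, 0.08, 0.02),
                        orientation=(0.0, 0.0, 0.0, 1.0),
                        uncertainty=0.0)
```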

[Figure: Overview of the ASkDAgger framework.]

Experimental Evaluation

We evaluated ASkDAgger and its components in four sets of experiments. First, we performed active dataset aggregation on the MNIST dataset using TorchUncertainty to validate SAG extensively. Second, we interactively trained CLIPort agents on simulated language-conditioned tabletop manipulation tasks. Third, we conducted experiments on a real-world assembly setup to demonstrate that our claims extend beyond simulation. Finally, we showcased ASkDAgger's applicability by integrating it with built-in primitive actions on a Spot robot to perform a sorting task.

MNIST Dataset Aggregation

To show that SAG balances query count and system failures by tracking a user-specified metric value, we conducted experiments in which we interactively trained digit classification models on the MNIST dataset. We selected this setup due to its low computational requirements, enabling extensive ablations and easy reproducibility. Since we focus on the SAG component, we applied ASkDAgger without demonstration collection via relabeling and without replay prioritization. To validate whether SAG tracks the desired metric, we performed interactive training for nine different values of the desired sensitivity, specificity, and minimum system success rate. The code and data from these experiments are available on GitHub.
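The sketch below conveys the idea of tracking a desired sensitivity by adapting the uncertainty gating threshold on a stream of classification outcomes. The proportional threshold update and the synthetic uncertainty distributions are assumptions made for illustration; the actual SAG update may differ.

```python
# Toy sensitivity-tracking gate in the spirit of SAG (illustrative update rule).
import random


class SensitivityTrackingGate:
    def __init__(self, target_sensitivity, threshold=0.5, step=0.01):
        self.target = target_sensitivity
        self.threshold = threshold
        self.step = step
        self.true_pos = 0    # invalid novice plan, query issued
        self.false_neg = 0   # invalid novice plan, no query (an unflagged failure)

    def query(self, uncertainty):
        return uncertainty > self.threshold

    def update(self, queried, plan_was_valid):
        if not plan_was_valid:
            if queried:
                self.true_pos += 1
            else:
                self.false_neg += 1
        if self.true_pos + self.false_neg == 0:
            return
        sensitivity = self.true_pos / (self.true_pos + self.false_neg)
        # Query less when sensitivity exceeds the target, query more when below.
        self.threshold += self.step if sensitivity > self.target else -self.step
        self.threshold = min(max(self.threshold, 0.0), 1.0)


# Synthetic stream: invalid plans tend to come with higher uncertainty.
gate = SensitivityTrackingGate(target_sensitivity=0.9)
for _ in range(5000):
    plan_valid = random.random() < 0.7
    uncertainty = random.betavariate(2, 5) if plan_valid else random.betavariate(5, 2)
    gate.update(gate.query(uncertainty), plan_valid)
print(f"final gating threshold: {gate.threshold:.2f}")
```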

[Figure: SAG tracking results on MNIST: sensitivity, specificity, query rate, and success rate for nine desired metric values.]

The results of these experiments are summarized above. The sensitivity and specificity plots in A and B show that SAG successfully tracks the desired levels for all nine values of the desired sensitivity, specificity or minimum system success rate. In success-aware mode, C shows that when the novice success rate is low, SAG issues enough queries to maintain the desired system success rate. As the novice success rate increases, the query rate decreases, reaching a minimum once the novice success rate exceeds the desired system success rate. The query rate plots also indicate that each mode requires a different query pattern to track its respective metric. The success rate plots show that, in all modes, the novice ultimately learns to perform the task.

CLIPort Benchmark Tasks

We also conducted experiments using ASkDAgger to train CLIPort agents interactively. CLIPort is a language-conditioned imitation-learning agent that leverages the CLIP foundation model and sample-efficient Transporter Networks for vision-based manipulation. We selected this setup because it allows novices to communicate their actions by indicating planned pick-and-place locations on an image alongside a language command, making it well-suited for ASkDAgger. We compared ASkDAgger against an active DAgger baseline that lacks both PIER and FIER, as well as against SafeDAgger and ThriftyDAgger, two other DAgger approaches that incorporate active learning. In addition, we performed ablations of ASkDAgger without PIER and without FIER to isolate the effects of the individual components. The code and data from these experiments are available on GitHub.

[Figure: Cumulative rewards of checkpoint evaluations on CLIPort tasks with seen and unseen objects.]

The cumulative rewards for evaluating checkpoints on tasks with seen and unseen objects are shown above. ASkDAgger exhibits a clear improvement across all unseen tasks. This performance gain stems from the composition of the demonstration dataset, shown below. For the active DAgger baselines, all demonstrations consist of annotation tuples, whereas ASkDAgger collects many through validation and relabeling. These relabeled demonstrations contribute to ASkDAgger's superior performance on unseen tasks: agents sometimes obtained demonstrations by relabeling novice failures, where the intended pick was a distractor from the unseen set. Moreover, ASkDAgger requires significantly fewer teacher annotations to learn the tasks.
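To illustrate how such a failure can still yield a useful demonstration, the hypothetical sketch below recasts a plan that picked a distractor into a demonstration for the command it did satisfy. The command template, the `Demonstration` fields, and the `relabel_failure` helper are placeholders, not the FIER implementation.

```python
# Illustrative hindsight-style relabeling of a novice failure into a demonstration.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Demonstration:
    observation: object
    command: str
    pick_pixel: Tuple[int, int]
    place_pixel: Tuple[int, int]


def relabel_failure(observation, plan, picked_object, placed_location):
    """Recast an unsuccessful plan as a demonstration for the command it satisfied."""
    new_command = f"put the {picked_object} in the {placed_location}"
    return Demonstration(observation=observation,
                         command=new_command,
                         pick_pixel=plan["pick"],
                         place_pixel=plan["place"])


# The command asked for a red block, but the novice picked an unseen distractor;
# relabeling still produces a demonstration involving that unseen object.
demo = relabel_failure(observation=None,
                       plan={"pick": (120, 85), "place": (200, 140)},
                       picked_object="yellow pentagon",
                       placed_location="brown box")
print(demo.command)
```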

[Figure: Composition of the collected demonstration dataset (annotation, validation, and relabeling demonstrations).]

Real-World Experiments

We conducted experiments on a real-world assembly task to demonstrate that our claims extend beyond simulation and showcase ASkDAgger's applicability in real-world settings. This task is a simplified version of a diesel engine assembly using 3D-printed models. The procedure is shown below. The setup includes a Franka Panda robot equipped with an in-hand RealSense D405 RGB-D camera and a Franka hand with custom-printed fingers for grasping bolts. The objective is to pick bolts from a holder and insert them into specific locations on the engine block. We use pick-and-place primitives that rely on 2D Cartesian positions, assuming a fixed height for picking and placing. The task involves four bolt colors (red, yellow, green, and black) and seven insertion locations. The bolts are randomly ordered and placed in a holder. The human operator interacts with the robot via an interface that allows command input through speech or text. In our experiments, we generate random commands in the form: "Insert the [color] bolt at location number [location number]."
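The sketch below illustrates this command generation; the uniform sampling over colors and locations is an assumption for illustration.

```python
# Random generation of templated commands for the bolt-insertion task (sketch).
import random

COLORS = ["red", "yellow", "green", "black"]
LOCATIONS = list(range(1, 8))  # seven insertion locations


def random_command():
    """Sample a command of the form used in the real-world experiments."""
    color = random.choice(COLORS)
    location = random.choice(LOCATIONS)
    return f"Insert the {color} bolt at location number {location}."


print(random_command())
```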

To further demonstrate ASkDAgger's applicability in real-world scenarios, we integrated it with Spot’s built-in primitive skills to perform a sorting task. Since ASkDAgger is designed to work with any robot with one or more skills, we selected Spot for its built-in grasping and walking capabilities. The task involves sorting objects into paper and organic waste bins, as shown below.