Active Inverse Reward Design
Reward design, the problem of selecting an appropriate reward function for an AI system, is both critically important, as it encodes the task the system should perform, and challenging, as it requires reasoning about and understanding the agent's environment in detail. As a result, system designers often iterate on the reward function in a trial-and-error process to get their desired behavior. We propose structuring this process as a series of reward design queries, where we actively select the set of reward functions available to the designer. We query with two types of sets: discrete queries, where the system designer chooses from a small set of reward functions, and feature queries, where the system queries the designer for weights on a small subset of features. After each query, we use inverse reward design (IRD) (Hadfield-Menell et al., 2017) to update the distribution over the true reward function from the observed proxy reward function chosen by the designer. Compared to vanilla IRD, we find that our approach not only decreases the uncertainty about the true reward, but also greatly improves performance in unseen environments while only querying for reward functions in a single training environment.
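To make the query-selection loop concrete, here is a minimal sketch, not the paper's implementation: it assumes a toy training environment where trajectories are summarized by hand-picked feature vectors, a small hypothetical candidate set for the true reward, and a Boltzmann-rational designer (the IRD observation model, where each proxy in a discrete query is chosen with probability proportional to the exponentiated true value of the behavior it induces). The active step picks the discrete query that minimizes expected posterior entropy over the true reward.

```python
import itertools
import math

# Hypothetical trajectory feature vectors in the single training environment.
features = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]

# Hypothetical candidate true reward weights (the support of our belief).
candidates = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def induced_value(w_proxy, w_true):
    """True value of the trajectory an agent optimizing w_proxy would pick."""
    best = max(features, key=lambda phi: dot(w_proxy, phi))
    return dot(w_true, best)

def choice_likelihoods(query, w_true, beta=5.0):
    """IRD-style observation model: the designer picks each proxy in the
    query with probability proportional to exp(beta * induced true value)."""
    scores = [math.exp(beta * induced_value(w, w_true)) for w in query]
    z = sum(scores)
    return [s / z for s in scores]

def posterior_update(prior, query, chosen_idx, beta=5.0):
    """Bayes update over candidate true rewards after observing the choice."""
    post = [p * choice_likelihoods(query, w_true, beta)[chosen_idx]
            for p, w_true in zip(prior, candidates)]
    z = sum(post)
    return [p / z for p in post]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_posterior_entropy(prior, query, beta=5.0):
    """Average entropy of the updated belief, weighting each possible
    designer answer by its marginal probability under the prior."""
    total = 0.0
    for i in range(len(query)):
        marginal = sum(p * choice_likelihoods(query, w, beta)[i]
                       for p, w in zip(prior, candidates))
        total += marginal * entropy(posterior_update(prior, query, i, beta))
    return total

# Actively select the size-2 discrete query that most reduces expected
# uncertainty about the true reward.
prior = [1 / 3] * 3
queries = list(itertools.combinations(candidates, 2))
best_query = min(queries, key=lambda q: expected_posterior_entropy(prior, q))
```

A feature query would replace the discrete candidate set with weights elicited on a small subset of features, but the belief update over the true reward follows the same Bayesian pattern.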