Swarm Behavior Cloning

Abstract

In sequential decision-making environments, the primary approaches for training agents are Reinforcement Learning (RL) and Imitation Learning (IL). Unlike RL, which relies on modeling a reward function, IL leverages expert demonstrations, where an expert policy $\pi_e$ (e.g., a human) provides the desired behavior. Formally, a dataset $D$ of state-action pairs is provided: $D = \{(s, a = \pi_e(s))\}$. A common technique within IL is Behavior Cloning (BC), where a policy $\pi(s) = a$ is learned through supervised learning on $D$. Further improvements can be achieved by using an ensemble of $N$ individually trained BC policies, denoted as $E = \{\pi_i(s)\}_{1 \leq i \leq N}$. The ensemble's action $a$ for a given state $s$ is the aggregated output of the $N$ actions: $a = \frac{1}{N} \sum_i \pi_i(s)$. This paper addresses the issue of increasing action differences -- the observation that discrepancies between the $N$ predicted actions grow in states that are underrepresented in the training data. Large action differences can result in suboptimal aggregated actions. To address this, we propose a method that fosters greater alignment among the policies while preserving the diversity of their computations. This approach reduces action differences and ensures that the ensemble retains its inherent strengths, such as robustness and varied decision-making. We evaluate our approach across eight diverse environments, demonstrating a notable decrease in action differences and significant improvements in overall performance, as measured by mean episode returns.
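To make the aggregation step concrete, the following is a minimal sketch (not the paper's implementation) of how an ensemble of BC policies might be queried at inference time, assuming continuous actions. The `policies` list and the pairwise-distance disagreement measure are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

# Minimal sketch of ensemble behavior cloning at inference time.
# Assumes N independently trained BC policies (hypothetical `policies`),
# each a callable mapping a state to a continuous action vector.

def ensemble_action(policies, state):
    """Aggregate the ensemble's action as the mean of the N predictions."""
    actions = np.stack([pi(state) for pi in policies])  # shape: (N, action_dim)
    return actions.mean(axis=0)

def action_difference(policies, state):
    """One possible disagreement measure (an assumption here): the mean
    pairwise L2 distance between the N predicted actions, which tends to
    grow in states underrepresented in the training data."""
    actions = np.stack([pi(state) for pi in policies])
    n = len(actions)
    dists = [np.linalg.norm(actions[i] - actions[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) if dists else 0.0
```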
