DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).

Irina Higgins * 1, Arka Pal * 1, Andrei Rusu 1, Loic Matthey 1, Christopher Burgess 1, Alexander Pritzel 1, Matthew Botvinick 1, Charles Blundell 1, Alexander Lerchner 1

arXiv:1707.08475v2 [stat.ML] 6 Jun 2018

* Equal contribution. 1 DeepMind, 6 Pancras Square, Kings Cross, London, N1C 4AG, UK. Correspondence to: Irina Higgins <irinah@google.com>, Arka Pal <arkap@google.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

1. Introduction

Autonomous agents can learn how to maximise future expected rewards by choosing how to act based on incoming sensory observations via reinforcement learning (RL). Early RL approaches did not scale well to environments with large state spaces and high-dimensional raw observations (Sutton & Barto, 1998). A commonly used workaround was to embed the observations in a lower-dimensional space, typically via hand-crafted and/or privileged-information features. Recently, the advent of deep learning and its successful combination with RL has enabled end-to-end learning of such embeddings directly from raw inputs, sparking success in a wide variety of previously challenging RL domains (Mnih et al., 2015; 2016; Jaderberg et al., 2017). Despite the seemingly universal efficacy of deep RL, however, fundamental issues remain. These include data inefficiency, the reactive nature and general brittleness of learnt policies to changes in input data distribution, and lack of model interpretability (Garnelo et al., 2016; Lake et al., 2016). This paper focuses on one of these outstanding issues: the ability of RL agents to deal with changes to the input distribution, a form of transfer learning known as domain adaptation (Bengio et al., 2013).

In domain adaptation scenarios, an agent trained on a particular input distribution with a specified reward structure (termed the source domain) is placed in a setting where the input distribution is modified but the reward structure remains largely intact (the target domain). We aim to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain. Here, a policy is considered robust if it generalises with minimal drop in performance to the target domain without extra fine-tuning.

Past attempts to build RL agents with strong domain adaptation performance highlighted the importance of learning good internal representations of raw observations (Finn et al., 2015; Raffin et al., 2017; Pan & Yang, 2009; Barreto et al., 2016; Littman et al., 2001). Typically, these approaches tried to align the source and target domain representations by utilising observation and reward signals from both domains (Tzeng et al., 2016; Daftry et al., 2016; Parisotto et al., 2015; Guez et al., 2012; Talvitie & Singh, 2007; Niekum et al., 2013; Gupta et al., 2017; Finn et al., 2017; Rajendran et al., 2017). In many scenarios, such as robotics, this reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain (Finn et al., 2017; Rusu et al., 2016). Furthermore, the target domain may simply not be known in advance. On the other hand, policies learnt exclusively on the source domain using existing deep RL approaches that have few constraints on the nature of the learnt representations often overfit to the source input distribution, resulting in poor domain adaptation performance (Lake et al., 2016; Rusu et al., 2016).

We propose tackling both of these issues by focusing instead on learning representations which capture an underlying low-dimensional factorised representation of the world and are therefore not task or domain specific. Many naturalistic domains such as video game environments, simulations and our own world are well described in terms of such a structure. Examples of such factors of variation are object properties like colour, scale, or position; other examples correspond to general environmental factors, such as geometry and lighting. We think of these factors as a set of high-level parameters that can be used by a world graphics engine to generate a particular natural visual scene (Kulkarni et al., 2015). Learning how to project raw observations into such a factorised description of the world is addressed by the large body of literature on disentangled representation learning (Schmidhuber, 1992; Desjardins et al., 2012; Cohen & Welling, 2014; 2015; Kulkarni et al., 2015; Hinton et al., 2011; Rippel & Adams, 2013; Reed et al., 2014; Yang et al., 2015; Goroshin et al., 2015; Cheung et al., 2015; Whitney et al., 2016; Karaletsos et al., 2016; Chen et al., 2016; Higgins et al., 2017). Disentangled representations are defined as interpretable, factorised latent representations where either a single latent or a group of latent units are sensitive to changes in single ground truth factors of variation used to generate the visual world, while being invariant to changes in other factors (Bengio et al., 2013). The theoretical utility of disentangled representations for supervised and reinforcement learning has been described before (Bengio et al., 2013; Higgins et al., 2017; Ridgeway, 2016); however, to our knowledge, it has not been empirically validated to date.

We demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios by introducing DARLA (DisentAngled Representation Learning Agent), a new RL agent capable of learning a robust policy on the source domain that achieves significantly better out-of-the-box performance in domain adaptation scenarios compared to various baselines. DARLA relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment's generative factors. Crucially, DARLA does not require target domain data to form its representations. Our approach utilises a three stage pipeline: 1) learning to see, 2) learning to act, 3) transfer. During the first stage, DARLA develops its vision, learning to parse the world in terms of basic visual concepts, such as objects, positions and colours, by utilising a stream of raw unlabelled observations – not unlike human babies in their first few months of life (Leat et al., 2009; Candy et al., 2009). In the second stage, the agent utilises this disentangled visual representation to learn a robust source policy. In stage three, we demonstrate that the DARLA source policy is more robust to domain shifts, leading to a significantly smaller drop in performance in the target domain even when no further policy finetuning is allowed (median 270.3% improvement). These effects hold consistently across a number of different RL environments (DeepMind Lab and Jaco/MuJoCo: Beattie et al., 2016; Todorov et al., 2012) and algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015; 2016; Blundell et al., 2016).

Figure 1. Schematic representation of DARLA. Yellow represents the denoising autoencoder part of the model, blue represents the β-VAE part of the model, and grey represents the policy learning part of the model.

2. Framework

2.1. Domain adaptation in Reinforcement Learning

We now formalise domain adaptation scenarios in a reinforcement learning (RL) setting. We denote the source and target domains as D_S and D_T, respectively. Each domain corresponds to an MDP defined as a tuple D_S ≡ (S_S, A_S, T_S, R_S) or D_T ≡ (S_T, A_T, T_T, R_T) (we assume a shared fixed discount factor γ), each with its own state space S, action space A, transition function T and reward function R.[1] In domain adaptation scenarios the states S of the source and the target domains can be quite different, while the action spaces A are shared and the transitions T and reward functions R have structural similarity.

[1] For further background on the notation relating to the RL paradigm, see Section A.1 in the Supplementary Materials.

For example, consider a domain adaptation scenario for the Jaco robotic arm, where the MuJoCo (Todorov et al., 2012) simulation of the arm is the source domain, and the real world setting is the target domain. The state spaces (raw pixels) of the source and the target domains differ significantly due to the perceptual-reality gap (Rusu et al., 2016); that is to say, S_S ≠ S_T. Both domains, however, share action spaces (A_S = A_T), since the policy learns to control the same set of actuators within the arm. Finally, the source and target domain transition and reward functions share structural similarity (T_S ≈ T_T and R_S ≈ R_T), since in both domains transitions between states are governed by the physics of the world and the performance on the task depends on the relative position of the arm's end effectors (i.e. fingertips) with respect to an object of interest.
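The domain adaptation formalism of Sec. 2.1 can be sketched concretely as follows. This is a toy illustration with hypothetical names (not code from the paper): the source and target domains share an action space and draw their generative factors from the same high-level factor space Ŝ, but through different sampling processes G_S and G_T that admit different room/object conjunctions.

```python
import random

# Shared high-level factor space S_hat: all room/object conjunctions
ROOMS = ["blue", "red"]
OBJECTS = ["orange", "apple"]
S_HAT = [(room, obj) for room in ROOMS for obj in OBJECTS]

ACTIONS = ["left", "right", "forward", "pick_up"]  # A_S = A_T (shared)

# Sampling processes G_S and G_T admit different subsets of conjunctions
G_SOURCE = [("blue", "orange"), ("red", "apple")]
G_TARGET = [("red", "orange"), ("blue", "apple")]  # conjunctions reversed


def sample_factors(process, rng):
    """s_hat ~ G(S_hat): draw the generative factors of one observation.

    A simulator Sim(s_hat) would then render these factors to raw pixels,
    giving the agent observation s^o.
    """
    return rng.choice(process)


rng = random.Random(0)
source_factors = sample_factors(G_SOURCE, rng)
target_factors = sample_factors(G_TARGET, rng)

# Factor *values* are shared across domains; only their conjunctions differ
assert set(G_SOURCE) <= set(S_HAT) and set(G_TARGET) <= set(S_HAT)
assert set(G_SOURCE).isdisjoint(set(G_TARGET))
```

Note how the union of the two processes covers the full factor space Ŝ while their supports are disjoint: every factor value seen in the target domain was already seen in the source domain, only in a different conjunction, which is exactly the interpolation regime probed in the DeepMind Lab experiments of Sec. 3.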

2.2. DARLA

In order to describe our proposed DARLA framework, we assume that there exists a set M of MDPs that is the set of all natural world MDPs, and each MDP D_i is sampled from M. We define M in terms of the state space Ŝ that contains all possible conjunctions of high-level factors of variation necessary to generate any naturalistic observation in any D_i ∈ M. A natural world MDP D_i is then one whose state space S corresponds to some subset of Ŝ. In simple terms, we assume that there exists some shared underlying structure between the MDPs D_i sampled from M. We contend that this is a reasonable assumption that permits inclusion of many interesting problems, including being able to characterise our own reality (Lake et al., 2016).

We now introduce notation for two state space variables that may in principle be used interchangeably within the source and target domain MDPs D_S and D_T – the agent observation state space S^o, and the agent's internal latent state space S^z.[2] S^o_i in D_i consists of raw (pixel) observations s^o_i generated by the true world simulator from a sampled set of data generative factors ŝ_i, i.e. s^o_i ∼ Sim(ŝ_i). ŝ_i is sampled by some distribution or process G_i on Ŝ, i.e. ŝ_i ∼ G_i(Ŝ).

[2] Note that we do not assume these to be Markovian, i.e. it is not necessarily the case that p(s^o_{t+1} | s^o_t) = p(s^o_{t+1} | s^o_t, s^o_{t-1}, ..., s^o_1), and similarly for s^z. The index t here corresponds to time.

Using the newly introduced notation, domain adaptation scenarios can be described as having different sampling processes G_S and G_T such that ŝ_S ∼ G_S(Ŝ) and ŝ_T ∼ G_T(Ŝ) for the source and target domains respectively, and then using these to generate different agent observation states s^o_S ∼ Sim(ŝ_S) and s^o_T ∼ Sim(ŝ_T). Intuitively, consider a source domain where oranges appear in blue rooms and apples appear in red rooms, and a target domain where the object/room conjunctions are reversed and oranges appear in red rooms and apples appear in blue rooms. While the true data generative factors of variation Ŝ remain the same - room colour (blue or red) and object type (apples and oranges) - the particular source and target distributions G_S and G_T differ.

Typically, deep RL agents (e.g. Mnih et al., 2015; 2016) operating in an MDP D_i ∈ M learn an end-to-end mapping from raw (pixel) observations s^o_i ∈ S^o_i to actions a_i ∈ A_i (either directly or via a value function Q_i(s^o_i, a_i) from which actions can be derived). In the process of doing so, the agent implicitly learns a function F : S^o_i → S^z_i that maps the typically high-dimensional raw observations s^o_i to typically low-dimensional latent states s^z_i, followed by a policy function π_i : S^z_i → A_i that maps the latent states s^z_i to actions a_i ∈ A_i. In the context of domain adaptation, if the agent learns a naive latent state mapping function F_S : S^o_S → S^z_S on the source domain using reward signals to shape the representation learning, it is likely that F_S will overfit to the source domain and will not generalise well to the target domain. Returning to our intuitive example, imagine an agent that has learnt a policy to pick up oranges and avoid apples on the source domain. Such a source policy π_S is likely to be based on an entangled latent state space S^z_S of object/room conjunctions: oranges/blue → good, apples/red → bad, since this is arguably the most efficient representation for maximising expected rewards on the source task in the absence of extra supervision signals suggesting otherwise. A source policy π_S(a|s^z_S; θ) based on such an entangled latent representation s^z_S will not generalise well to the target domain without further fine-tuning, since F_S(s^o_S) ≠ F_S(s^o_T) and therefore crucially S^z_S ≠ S^z_T.

On the other hand, since both ŝ_S ∼ G_S(Ŝ) and ŝ_T ∼ G_T(Ŝ) are sampled from the same natural world state space Ŝ for the source and target domains respectively, it should be possible to learn a latent state mapping function F̂ : S^o → S^z_Ŝ, which projects the agent observation state space S^o to a latent state space S^z_Ŝ expressed in terms of factorised data generative factors that are representative of the natural world, i.e. S^z_Ŝ ≈ Ŝ. Consider again our intuitive example, where F̂ maps agent observations (s^o_S: orange in a blue room) to a factorised or disentangled representation expressed in terms of the data generative factors (s^z_Ŝ: room type = blue; object type = orange). Such a disentangled latent state mapping function should then directly generalise to both the source and the target domains, so that F̂(s^o_S) = F̂(s^o_T) = s^z_Ŝ. Since S^z_Ŝ is a disentangled representation of object and room attributes, the source policy π_S can learn a decision boundary that ignores the irrelevant room attributes: oranges → good, apples → bad. Such a policy would then generalise well to the target domain out of the box, since π_S(a|F̂(s^o_S); θ) = π_T(a|F̂(s^o_T); θ) = π_T(a|s^z_Ŝ; θ). Hence, DARLA is based on the idea that a good quality F̂ learnt exclusively on the source domain D_S ∈ M will zero-shot-generalise to all target domains D_i ∈ M, and therefore the source policy π(a|S^z_Ŝ; θ) will also generalise to all target domains D_i ∈ M out of the box.

Next we describe each of the stages of the DARLA pipeline that allow it to learn source policies π_S that are robust to domain adaptation scenarios, despite being trained with no knowledge of the target domains (see Fig. 1 for a graphical representation of these steps):

1) Learn to see (unsupervised learning of F_U) – the task of inferring a factorised set of generative factors S^z_Ŝ = Ŝ from observations S^o is the goal of the extensive disentangled factor learning literature (e.g. Chen et al., 2016; Higgins et al., 2017). Hence, in stage one we learn a mapping F_U : S^o_U → S^z_U, where S^z_U ≈ S^z_Ŝ (U stands for 'unsupervised'), using an unsupervised model for learning disentangled factors that utilises observations collected by an agent with a random policy π_U from a visual pre-training

MDP D_U ∈ M. Note that we require sufficient variability of factors and their conjunctions in D_U in order to have S^z_U ≈ S^z_Ŝ;

2) Learn to act (reinforcement learning of π_S in the source domain D_S utilising the previously learned F_U) – an agent that has learnt to see the world in stage one in terms of the natural data generative factors is now exposed to a source domain D_S ∈ M. The agent is tasked with learning the source policy π_S(a|s^z_S; θ), where s^z_S = F_U(s^o_S) ≈ s^z_Ŝ, via a standard reinforcement learning algorithm. Crucially, we do not allow F_U to be modified (e.g. by gradient updates) during this phase;

3) Transfer (to a target domain D_T) – in the final step, we test how well the policy π_S learnt on the source domain generalises to the target domain D_T ∈ M in a zero-shot domain adaptation setting, i.e. the agent is evaluated on the target domain without retraining. We compare the performance of policies learnt with a disentangled latent state S^z_Ŝ to various baselines where the latent state mapping function F_U projects agent observations s^o to entangled latent state representations s^z.

2.3. Learning disentangled representations

In order to learn F_U, DARLA utilises β-VAE (Higgins et al., 2017), a state-of-the-art unsupervised model for automated discovery of factorised latent representations from raw image data. β-VAE is a modification of the variational autoencoder framework (Kingma & Welling, 2014; Rezende et al., 2014) that controls the nature of the learnt latent representations by introducing an adjustable hyperparameter β to balance reconstruction accuracy with latent channel capacity and independence constraints. It maximises the objective:

L(θ, φ; x, z, β) = E_{q_φ(z|x)}[log p_θ(x|z)] − β D_KL(q_φ(z|x) || p(z))    (1)

where φ, θ parametrise the distributions of the encoder and the decoder respectively. Well-chosen values of β - usually larger than one (β > 1) - typically result in more disentangled latent representations z by limiting the capacity of the latent information channel, and hence encouraging a more efficient factorised encoding through the increased pressure to match the isotropic unit Gaussian prior p(z) (Higgins et al., 2017).

2.3.1. Perceptual Similarity Loss

The cost of increasing β is that crucial information about the scene may be discarded in the latent representation z, particularly if that information takes up a small proportion of the observations x in pixel space. We encountered this issue in some of our tasks, as discussed in Section 3.1. The shortcomings of calculating the log-likelihood term E_{q_φ(z|x)}[log p_θ(x|z)] on a per-pixel basis are known and have been addressed in the past by calculating the reconstruction cost in an abstract, high-level feature space given by another neural network model, such as a GAN (Goodfellow et al., 2014) or a pre-trained AlexNet (Krizhevsky et al., 2012; Larsen et al., 2016; Dosovitskiy & Brox, 2016; Warde-Farley & Bengio, 2017). In practice we found that pre-training a denoising autoencoder (DAE) (Vincent et al., 2010) on data from the visual pre-training MDP D_U ∈ M worked best as the source of reconstruction targets for β-VAE to match (see Fig. 1 for the model architecture and Sec. A.3.1 in Supplementary Materials for implementation details). The new β-VAE_DAE model was trained according to Eq. 2:

L(θ, φ; x, z, β) = −E_{q_φ(z|x)} ||J(x̂) − J(x)||²₂ − β D_KL(q_φ(z|x) || p(z))    (2)

where x̂ ∼ p_θ(x|z) and J : R^{W×H×C} → R^N is the function that maps images from pixel space with dimensionality W × H × C to a high-level feature space with dimensionality N, given by a stack of pre-trained DAE layers up to a certain layer depth. Note that by replacing the pixel based reconstruction loss in Eq. 1 with the high-level feature reconstruction loss in Eq. 2 we are no longer optimising the variational lower bound, and β-VAE_DAE with β = 1 loses its equivalence to the Variational Autoencoder (VAE) framework as proposed by (Kingma & Welling, 2014; Rezende et al., 2014). In this setting, the only way to interpret β is as a mixing coefficient that balances the capacity of the latent channel z of β-VAE_DAE against the pressure to match the high-level features within the pre-trained DAE.

2.4. Reinforcement Learning Algorithms

We used various RL algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015; 2016; Blundell et al., 2016) to learn the source policy π_S during stage two of the pipeline, using the latent states s^z acquired by the β-VAE based models during stage one of the DARLA pipeline.

Deep Q Network (DQN) (Mnih et al., 2015) is a variant of the Q-learning algorithm (Watkins, 1989) that utilises deep learning. It uses a neural network to parametrise an approximation for the action-value function Q(s, a; θ) using parameters θ.

Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) is an asynchronous implementation of the advantage actor-critic paradigm (Sutton & Barto, 1998; Degris & Sutton, 2012), where separate threads run in parallel and perform updates to shared parameters. The different threads each hold their own instance of the environment and have different exploration policies, thereby decorrelating parameter updates without the need for experience replay. Therefore, A3C is an online algorithm, whereas DQN learns its policy offline, resulting in different learning dynamics between the two algorithms.

Model-Free Episodic Control (EC) (Blundell et al., 2016) was proposed as a complementary learning system to the other RL algorithms described above. The EC algorithm relies on near-determinism of state transitions and rewards in RL environments; in settings where this holds, it can exploit these properties to memorise which action led to high returns in similar situations in the past. Since in its simplest form EC relies on a lookup table, it learns good policies much faster than value-function-approximation based deep RL algorithms like DQN trained via gradient descent - at the cost of generality (i.e. potentially poor performance in non-deterministic environments).

We also compared our approach to that of UNREAL (Jaderberg et al., 2017), a recently proposed RL algorithm which also attempts to utilise unsupervised data in the environment. The UNREAL agent takes as a base an LSTM A3C agent (Mnih et al., 2016) and augments it with a number of unsupervised auxiliary tasks that make use of the rich perceptual data available to the agent besides the (sometimes very sparse) extrinsic reward signals. This auxiliary learning tends to improve the representation learnt by the agent. See Sec. A.6 in Supplementary Materials for further details of the algorithms above.

Figure 2. A: DeepMind Lab (Beattie et al., 2016) transfer task setup. Different conjunctions of {room, object1, object2} were used during different parts of the domain adaptation curriculum. During stage one, D_U (shown in yellow), we used a minimal set spanning all objects and all rooms whereby each object is seen in each room. Note there is no extrinsic reward signal or notion of 'task' in this phase. During stage two, D_S (shown in green), the RL agents were taught to pick up cans and balloons and avoid hats and cakes. The objects were always presented in the pairs hat/can and cake/balloon. The agent never saw the hat/can pair in the pink room. This novel room/object conjunction was presented as the target domain adaptation condition D_T (shown in red), where the ability of the agent to transfer knowledge of the objects' value to a novel environment was tested. B: β-VAE reconstructions (bottom row) using frames from DeepMind Lab (top row). Due to the increased β > 1 necessary to disentangle the data generative factors of variation, the model lost information about objects. See Fig. 3 for a model appropriately capturing objects. C: left – sample frames from MuJoCo simulation environments used for vision (phase 1, S_U) and source policy training (phase 2, S_S); middle – sim2sim domain adaptation test (phase 3, S_T); and right – sim2real domain adaptation test (phase 3, S_T).

3. Tasks

We evaluate the performance of DARLA on different task and environment setups that probe subtly different aspects of domain adaptation. As a reminder, in Sec. 2.2 we defined Ŝ as a state space that contains all possible conjunctions of high-level factors of variation necessary to generate any naturalistic observation in any D_i ∈ M. During domain adaptation scenarios agent observation states are generated according to s^o_S ∼ Sim_S(ŝ_S) and s^o_T ∼ Sim_T(ŝ_T) for the source and target domains respectively, where ŝ_S and ŝ_T are sampled by some distributions or processes G_S and G_T according to ŝ_S ∼ G_S(Ŝ) and ŝ_T ∼ G_T(Ŝ).

We use DeepMind Lab (Beattie et al., 2016) to test a version of the domain adaptation setup where the source and target domain observation simulators are equal (Sim_S = Sim_T), but the processes used to sample ŝ_S and ŝ_T are different (G_S ≠ G_T). We use the Jaco arm with a matching MuJoCo simulation environment (Todorov et al., 2012) in two domain adaptation scenarios: simulation to simulation (sim2sim) and simulation to reality (sim2real). The sim2sim domain adaptation setup is relatively similar to DeepMind Lab, i.e. the source and target domains differ in terms of the processes G_S and G_T. However, there is a significant point of difference. In DeepMind Lab, all values of factors in the target domain, ŝ_T, are previously seen in the source domain; however, ŝ_S ≠ ŝ_T as the conjunctions of these factor values are different. In sim2sim, by contrast, novel factor values are experienced in the target domain (this accordingly also leads to novel factor conjunctions). Hence, DeepMind Lab may be considered to be assessing domain interpolation performance, whereas sim2sim tests domain extrapolation.

The sim2real setup, on the other hand, is based on identical processes (G_S = G_T), but different observation simulators (Sim_S ≠ Sim_T) corresponding to the MuJoCo simulation and the real world, which results in the so-called 'perceptual-reality gap' (Rusu et al., 2016). More details of the tasks are given below.

3.1. DeepMind Lab

DeepMind Lab is a first person 3D game environment with rich visuals and realistic physics. We used a standard seek-avoid object gathering setup, where a room is initialised with an equal number of randomly placed objects of two different types. One of the object varieties is 'good' (its collection is rewarded +1), while the other is 'bad' (its collection is punished -1). The full state space Ŝ consisted of all conjunctions of two room types (pink and green, based on the colour of the walls) and four object types (hat, can, cake and balloon) (see Fig. 2A). The source domain D_S contained environments with hats/cans presented in the green room, and balloons/cakes presented in either the green or the pink room. The target domain D_T contained hats/cans presented in the pink room. In both domains cans and balloons were the rewarded objects.

1) Learn to see: we used β-VAE_DAE to learn the disentangled latent state representation s^z that includes both the room and the object generative factors of variation within DeepMind Lab. We had to use the high-level feature space of a pre-trained DAE within the β-VAE_DAE framework (see Section 2.3.1), instead of the pixel space of vanilla β-VAE, because we found that objects failed to reconstruct when using the values of β necessary to disentangle the generative factors of variation within DeepMind Lab (see Fig. 2B).

β-VAE_DAE was trained on observations s^o_U collected by an RL agent with a simple wall-avoiding policy π_U (otherwise the training data was dominated by close-up images of walls). In order to enable the model to learn F(s^o_U) ≈ Ŝ, it is important to expose the agent to at least a minimal set of environments that span the range of values for each factor, and where no extraneous correlations are added between different factors [3] (see Fig. 2A, yellow). See Section A.3.1 in Supplementary Materials for details of β-VAE_DAE training.

[3] In our setup of the DeepMind Lab domain adaptation task, the object and environment factors are supposed to be independent. In order to ensure that β-VAE_DAE learns a factorised representation that reflects this ground truth independence, we present observations of all possible conjunctions of room and individual object types.

2) Learn to act: the agent was trained with the algorithms detailed in Section 2.4 on a seek-avoid task using the source domain (D_S) conjunctions of object/room shown in Fig. 2A (green). The pre-trained β-VAE_DAE from stage one was used as the 'vision' part of the various RL algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015; 2016; Blundell et al., 2016) to learn a source policy π_S that picks up balloons and avoids cakes in both the green and the pink rooms, and picks up cans and avoids hats in the green rooms. See Section A.3.1 in Supplementary Materials for more details of the various versions of DARLA we have tried, each based on a different base RL algorithm.

3) Transfer: we tested the ability of DARLA to transfer the seek-avoid policy π_S it had learnt on the source domain in stage two, using the domain adaptation condition D_T illustrated in Figure 2A (red). The agent had to continue picking up cans and avoiding hats in the pink room, even though these objects had only been seen in the green room during source policy training. The optimal policy π_T is one that maintains the reward polarity from the source domain (cans are good and hats are bad). For further details, see Appendix A.2.1.

3.2. Jaco Arm and MuJoCo

We used frames from an RGB camera facing a robotic Jaco arm, or a matching rendered camera view from a MuJoCo physics simulation environment (Todorov et al., 2012), to investigate the performance of DARLA in two domain adaptation scenarios: 1) simulation to simulation (sim2sim), and 2) simulation to reality (sim2real). The sim2real setup is of particular importance, since the progress that deep RL has brought to control tasks in simulation (Schulman et al., 2015; Mnih et al., 2016; Levine & Abbeel, 2014; Heess et al., 2015; Lillicrap et al., 2015; Schulman et al., 2016) has not yet translated as well to reality, despite various attempts (Tobin et al., 2017; Tzeng et al., 2016; Daftry et al., 2016; Finn et al., 2015; Rusu et al., 2016). Solving control problems in reality is hard due to sparse reward signals, expensive data acquisition and the attendant danger of breaking the robot (or its human minders) during exploration.

In both sim2sim and sim2real, we trained the agent to perform an object reaching policy where the goal is to place the end effector as close to the object as possible. While conceptually the reaching task is simple, it is a hard control problem since it requires correct inference of the arm and object positions and velocities from raw visual inputs.

1) Learn to see: β-VAE was trained on observations collected in MuJoCo simulations with the same factors of variation as in D_S. In order to enable the model to learn F(s^o_U) ≈ ŝ, a reaching policy was applied to phantom objects placed in random positions, therefore ensuring that the agent learnt the independent nature of the arm position and the object position (see Fig. 2C, left);

2) Learn to act: a feedforward-A3C based agent with the vision module pre-trained in stage one was taught a source reaching policy π_S towards the real object in simulation (see Fig. 2C (left) for an example frame, and Sec. A.4 in Supplementary Materials for a fuller description of the agent). In the source domain D_S the agent was trained on a distribution of camera angles and positions. The colour of the tabletop on which the arm rests and the object colour were both sampled anew every episode.

3) Transfer: sim2sim: in the target domain, D_T, the agent was faced with a new distribution of camera angles and positions with little overlap with the source domain distributions, as well as a completely held out set of object colours (see Fig. 2C, middle). sim2real: in the target domain D_T the camera position and angle, as well as the tabletop colour and object colour, were sampled from the same distributions as seen in the source domain D_S, but the target domain D_T was now the real world. Many details present in the real world, such as shadows, specularity and multiple light sources, are not modelled in the simulation; the physics engine is also not a perfect model of reality. Thus sim2real tests the ability of the agent to cross the perceptual-reality gap and generalise its source policy π_S to the real world (see Fig. 2C, right). For further details, see Appendix A.2.2.

the physics engine is also not a perfect model of reality. Thus sim2real tests the ability of the agent to cross the perceptual-reality gap and generalise its source policy π_S to the real world (see Fig. 2C, right). For further details, see Appendix A.2.2.

4. Results

We evaluated the robustness of DARLA's policy π_S learnt on the source domain to various shifts in the input data distribution. In particular, we used domain adaptation scenarios based on the DeepMind Lab seek-avoid task and the Jaco arm reaching task described in Sec. 3. On each task we compared DARLA's performance to that of various baselines. We evaluated the importance of learning 'good' vision during stage one of the pipeline, i.e. one that maps the input observations s_o to disentangled representations s_z ≈ ŝ. In order to do this, we ran the DARLA pipeline with different vision models: the encoders of a disentangled β-VAE⁴ (the original DARLA), an entangled β-VAE (DARLA_ENT), and a denoising autoencoder (DARLA_DAE). Apart from the nature of the learnt representations s_z, DARLA and all versions of its baselines were equivalent throughout the three stages of our proposed pipeline in terms of architecture and the observed data distribution (see Sec. A.3 in Supplementary Materials for more details).

Figs. 3-4 display the degree of disentanglement learnt by the vision modules of DARLA and DARLA_ENT on DeepMind Lab and MuJoCo. DARLA's vision learnt to independently represent environment variables (such as room colour-scheme and geometry) and object-related variables (change of object type, size, rotation) on DeepMind Lab (Fig. 3, left). Disentangling was also evident in MuJoCo. Fig. 4, left, shows that DARLA's single latent units z_i learnt to represent different aspects of the Jaco arm, the object, and the camera. By contrast, in the representations learnt by DARLA_ENT, each latent is responsible for changes to both the environment and objects (Fig. 3, right) in DeepMind Lab, or a mixture of camera, object and/or arm movements (Fig. 4, right) in MuJoCo.

Figure 3. Plot of traversals of various latents of an entangled and a disentangled version of β-VAE_DAE using frames from DeepMind Lab (Beattie et al., 2016).

Figure 4. Plot of traversals of β-VAE on MuJoCo. Using a disentangled β-VAE model, single latents directly control for the factors responsible for the object or arm placements.

The table in Fig. 5 shows the average performance (across different seeds), in terms of rewards per episode, of the various agents on the target domain with no fine-tuning of the source policy π_S. It can be seen that DARLA is able to zero-shot-generalise significantly better than DARLA_ENT or DARLA_DAE, highlighting the importance of learning a disentangled representation s_z ≈ ŝ during the unsupervised stage one of the DARLA pipeline. In particular, this also demonstrates that the improved domain transfer performance is not simply a function of increased exposure to training observations, as both DARLA_ENT and DARLA_DAE were exposed to the same data. The results are mostly consistent across target domains, and in most cases DARLA is significantly better than the second-best-performing agent. This holds in the sim2real task⁵, where being able to perform zero-shot policy transfer is highly valuable due to the particular difficulties of gathering data in the real world.

Table 1. Transfer performance.

VISION TYPE     | DEEPMIND LAB                                  | JACO (A3C)
                | DQN          | A3C          | EC           | SIM2SIM       | SIM2REAL
BASELINE AGENT  | 1.86 ± 3.91  | 5.32 ± 3.36  | -0.41 ± 4.21 | 97.64 ± 9.02  | 94.56 ± 3.55
UNREAL          | -            | 4.13 ± 3.95  | -            | -             | -
DARLA_FT        | 13.36 ± 5.8  | 1.4 ± 2.16   | -            | 86.59 ± 5.53  | 99.25 ± 2.3
DARLA_ENT       | 3.45 ± 4.47  | 15.66 ± 5.19 | 5.69 ± 3.73  | 84.77 ± 4.42  | 59.99 ± 15.05
DARLA_DAE       | 7.83 ± 4.47  | 6.74 ± 2.81  | 5.59 ± 3.37  | 85.15 ± 7.43  | 100.72 ± 4.7
DARLA           | 10.25 ± 5.46 | 19.7 ± 5.43  | 11.41 ± 3.52 | 100.85 ± 2.92 | 108.2 ± 5.97

DARLA's performance is significantly different from all baselines under Welch's unequal-variances t-test with p < 0.01 (N ∈ [60, 150]).

Figure 5. Table: zero-shot performance (average reward per episode) of the source policy π_S in target domains within DeepMind Lab and Jaco/MuJoCo environments. Baseline agent refers to vanilla DQN/A3C/EC (DeepMind Lab) or A3C (Jaco) agents. See main text for more detailed model descriptions. Figure: correlation between zero-shot transfer performance on the DeepMind Lab task obtained by EC-based DARLA and the level of disentanglement as measured by the transfer/disentanglement score (r = 0.6, p < 0.001).

DARLA's performance is particularly surprising as it actually preserves less information about the raw observations s_o than DARLA_ENT and DARLA_DAE. This is due to the nature of the β-VAE and how it achieves disentangling; the disentangled model utilised a significantly higher value of the hyperparameter β than the entangled model (see Appendix A.3 for further details), which constrains the capacity of the latent channel. Indeed, DARLA's β-VAE only utilises 8 of its possible 32 Gaussian latents to store observation-specific information for MuJoCo/Jaco (and 20 in DeepMind Lab), whereas DARLA_ENT utilises all 32 for both environments (as does DARLA_DAE).

Furthermore, we examined what happens if DARLA's vision (i.e. the encoder of the disentangled β-VAE) is allowed to be fine-tuned via gradient updates while learning the source policy during stage two of the pipeline. This is denoted by DARLA_FT in the table in Fig. 5. We see that it exhibits significantly worse performance than that of DARLA in zero-shot domain adaptation using an A3C-based agent in all tasks. This suggests that a favourable initialisation does not make up for subsequent overfitting to the source domain for the on-policy A3C. However, the off-policy DQN-based fine-tuned agent performs very well. We leave further investigation of this curious effect for future work.

Finally, we compared the performance of DARLA to an UNREAL (Jaderberg et al., 2017) agent with the same architecture. Despite also exploiting the unsupervised data available in the source domain, UNREAL performed worse than baseline A3C on the DeepMind Lab domain adaptation task. This further demonstrates that use of unsupervised data is not in itself a panacea for transfer performance; it must be utilised in a careful and structured manner conducive to learning disentangled latent states s_z ≈ ŝ.

In order to quantitatively evaluate our hypothesis that disentangled representations are essential for DARLA's performance in domain adaptation scenarios, we trained various DARLAs with different degrees of learnt disentanglement in s_z by varying β (of the β-VAE) during stage one of the pipeline. We then calculated the correlation between the performance of the EC-based DARLA on the DeepMind Lab domain adaptation task and the transfer metric, which approximately measures the quality of disentanglement of DARLA's latent representations s_z (see Sec. A.5.2 in Supplementary Materials). This is shown in the chart in Fig. 5; as can be seen, there is a strong positive correlation between the level of disentanglement and DARLA's zero-shot domain transfer performance (r = 0.6, p < 0.001).

Having shown the robust utility of disentangled representations in agents for domain adaptation, we note that there is evidence that they can provide an important additional benefit. We found significantly improved speed of learning of π_S on the source domain itself, as a function of how disentangled the model was. The gain in data efficiency from disentangled representations for source policy learning is not the main focus of this paper, so we leave it out of the main text; however, we provide results and discussion in Section A.7 in Supplementary Materials.

5. Conclusion

We have demonstrated the benefits of using disentangled representations in a deep RL setting for domain adaptation. In particular, we have proposed DARLA, a multi-stage RL agent. DARLA first learns a visual system that encodes the observations it receives from the environment as disentangled representations, in a completely unsupervised manner. It then uses these representations to learn a robust source policy that is capable of zero-shot domain adaptation.

We have demonstrated the efficacy of this approach in a range of domains and task setups: a 3D naturalistic first-person environment (DeepMind Lab), a simulated graphics and physics engine (MuJoCo), and crossing the simulation to reality gap (MuJoCo to Jaco sim2real). We have also shown that the effect of disentangling is consistent across very different RL algorithms (DQN, A3C, EC), achieving significant improvements over the baseline algorithms (median 2.7 times improvement in zero-shot transfer across tasks and algorithms). To the best of our knowledge, this is the first comprehensive empirical demonstration of the strength of disentangled representations for domain adaptation in a deep RL setting.

⁴ In this section of the paper, we use the term β-VAE to refer to a standard β-VAE for the MuJoCo experiments, and a β-VAE_DAE for the DeepMind Lab experiments (as described in stage 1 of Sec. 3.1).
⁵ See https://youtu.be/sZqrWFl0wQ4 for example sim2sim and sim2real zero-shot transfer policies of DARLA and baseline A3C agents.

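As a minimal illustration of the multi-stage structure described in this paper (learn to see, then learn to act on top of a frozen vision module), the sketch below uses numpy stand-ins: a fixed random linear map plays the role of the pre-trained β-VAE encoder F_U, and a single softmax layer stands in for the policy head. Shapes mirror the DeepMind Lab setup (84x84x3 observations, 32 latents, 8 actions); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenEncoder:
    """Stand-in for the pre-trained vision module F_U (stage 1).

    A fixed random linear map replaces the frozen convolutional
    beta-VAE encoder; its output plays the role of s_z = F_U(s_o).
    """
    def __init__(self, obs_dim, latent_dim):
        self.W = rng.normal(0.0, 0.01, (latent_dim, obs_dim))  # frozen

    def encode(self, obs):
        return self.W @ obs

class PolicyHead:
    """Stand-in for the policy layers trained in stage 2."""
    def __init__(self, latent_dim, n_actions):
        self.W = rng.normal(0.0, 0.01, (n_actions, latent_dim))

    def act_probs(self, s_z):
        logits = self.W @ s_z
        e = np.exp(logits - logits.max())
        return e / e.sum()  # softmax over discrete actions

# Stage 3 (transfer): the same frozen encoder is reused on target-domain
# observations, so the policy sees a (hopefully) domain-invariant s_z.
enc = FrozenEncoder(obs_dim=84 * 84 * 3, latent_dim=32)
pi = PolicyHead(latent_dim=32, n_actions=8)
obs = rng.random(84 * 84 * 3)           # flattened observation s_o
probs = pi.act_probs(enc.encode(obs))   # action distribution pi(a|s_z)
print(probs.shape)
```

The key design point mirrored here is that only `PolicyHead` would receive gradient updates in stage two; `FrozenEncoder` stays fixed, which the paper argues prevents the vision module from overfitting to the source domain.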
References

Abadi, Martin, Agarwal, Ashish, Barham, Paul, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Preliminary White Paper, 2015.

Barreto, André, Munos, Rémi, Schaul, Tom, and Silver, David. Successor features for transfer in reinforcement learning. CoRR, abs/1606.05312, 2016. URL http://arxiv.org/abs/1606.05312.

Beattie, Charles, Leibo, Joel Z., Teplyashin, Denis, Ward, Tom, Wainwright, Marcus, et al. DeepMind Lab. arXiv, 2016.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013.

Blundell, Charles, Uria, Benigno, Pritzel, Alexander, Li, Yazhe, Ruderman, Avraham, Leibo, Joel Z, Rae, Jack, Wierstra, Daan, and Hassabis, Demis. Model-free episodic control. arXiv, 2016.

Candy, T. Rowan, Wang, Jingyun, and Ravikumar, Sowmya. Retinal image quality and postnatal visual experience during infancy. Optom Vis Sci, 86(6):556–571, 2009.

Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv, 2016.

Cheung, Brian, Levezey, Jesse A., Bansal, Arjun K., and Olshausen, Bruno A. Discovering hidden factors of variation in deep networks. In Proceedings of the International Conference on Learning Representations, Workshop Track, 2015.

Cohen, T. and Welling, M. Transformation properties of learned visual representations. In ICLR, 2015.

Cohen, Taco and Welling, Max. Learning the irreducible representations of commutative Lie groups. arXiv, 2014.

Daftry, Shreyansh, Bagnell, J. Andrew, and Hebert, Martial. Learning transferable policies for monocular reactive MAV control. International Symposium on Experimental Robotics, 2016.

Degris, Thomas, Pilarski, Patrick M., and Sutton, Richard S. Model-free reinforcement learning with continuous action in practice. American Control Conference (ACC), pp. 2177–2182, 2012.

Desjardins, G., Courville, A., and Bengio, Y. Disentangling factors of variation via generative entangling. arXiv, 2012.

Dosovitskiy, Alexey and Brox, Thomas. Generating images with perceptual similarity metrics based on deep networks. arXiv, 2016.

Finn, Chelsea, Tan, Xin Yu, Duan, Yan, Darrell, Trevor, Levine, Sergey, and Abbeel, Pieter. Deep spatial autoencoders for visuomotor learning. arXiv, 2015.

Finn, Chelsea, Yu, Tianhe, Fu, Justin, Abbeel, Pieter, and Levine, Sergey. Generalizing skills with semi-supervised reinforcement learning. ICLR, 2017.

Garnelo, Marta, Arulkumaran, Kai, and Shanahan, Murray. Towards deep symbolic reinforcement learning. arXiv, 2016.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. NIPS, pp. 2672–2680, 2014.

Goroshin, Ross, Mathieu, Michael, and LeCun, Yann. Learning to linearize under uncertainty. NIPS, 2015.

Guez, Arthur, Silver, David, and Dayan, Peter. Efficient Bayes-adaptive reinforcement learning using sample-based search. NIPS, 2012.

Gupta, Abhishek, Devin, Coline, Liu, YuXuan, Abbeel, Pieter, and Levine, Sergey. Learning invariant feature spaces to transfer skills with reinforcement learning. ICLR, 2017.

Heess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Timothy P., Erez, Tom, and Tassa, Yuval. Learning continuous control policies by stochastic value gradients. NIPS, 2015.

Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Hinton, G., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. International Conference on Artificial Neural Networks, 2011.

Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2017.

Karaletsos, Theofanis, Belongie, Serge, and Rätsch, Gunnar. Bayesian representation learning with oracle constraints. ICLR, 2016.

Kingma, D. P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ICLR, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

Kulkarni, Tejas, Whitney, William, Kohli, Pushmeet, and Tenenbaum, Joshua. Deep convolutional inverse graphics network. NIPS, 2015.

Lake, Brenden M., Ullman, Tomer D., Tenenbaum, Joshua B., and Gershman, Samuel J. Building machines that learn and think like people. arXiv, 2016.

Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. ICML, 2016.

Leat, Susan J., Yadav, Naveen K., and Irving, Elizabeth L. Development of visual acuity and contrast sensitivity in children. Journal of Optometry, 2009.

Levine, Sergey and Abbeel, Pieter. Learning neural network policies with guided policy search under unknown dynamics. NIPS, 2014.

Lillicrap, Timothy P., Hunt, Jonathan J., Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. CoRR, 2015.

Littman, Michael L., Sutton, Richard S., and Singh, Satinder. Predictive representations of state. NIPS, 2001.

Marr, D. Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London, pp. 23–81, 1971.

McClelland, James L, McNaughton, Bruce L, and O'Reilly, Randall C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102:419, 1995.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., et al. Human-level control through deep reinforcement learning. Nature, 2015.

Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. ICML, 2016. URL https://arxiv.org/pdf/1602.01783.pdf.

Niekum, Scott, Chitta, Sachin, Barto, Andrew G, Marthi, Bhaskara, and Osentoski, Sarah. Incremental semantically grounded learning from demonstration. Robotics: Science and Systems, 2013.

Norman, Kenneth A and O'Reilly, Randall C. Modeling hippocampal and neocortical contributions to recognition memory: a complementary-learning-systems approach. Psychological Review, 110:611, 2003.

Pan, Sinno Jialin and Yang, Qiang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2009.

Parisotto, Emilio, Ba, Jimmy, and Salakhutdinov, Ruslan. Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR, 2015.

Pathak, Deepak, Krähenbühl, Philipp, Donahue, Jeff, Darrell, Trevor, and Efros, Alexei A. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016. URL http://arxiv.org/abs/1604.07379.

Peng, J. Efficient dynamic programming-based learning for control. PhD thesis, Northeastern University, Boston, 1993.

Peng, Jing and Williams, Ronald J. Incremental multi-step Q-learning. Machine Learning, 22:283–290, 1996.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.

Raffin, Antonin, Höfer, Sebastian, Jonschkowski, Rico, Brock, Oliver, and Stulp, Freek. Unsupervised learning of state representations for multiple tasks. ICLR, 2017.

Rajendran, Janarthanan, Lakshminarayanan, Aravind, Khapra, Mitesh M., P, Prasanna, and Ravindran, Balaraman. Attend, adapt and transfer: Attentive deep architecture for adaptive transfer from multiple sources in the same domain. ICLR, 2017.

Reed, Scott, Sohn, Kihyuk, Zhang, Yuting, and Lee, Honglak. Learning to disentangle factors of variation with manifold interaction. ICML, 2014.

Rezende, Danilo J., Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv, 2014.

Ridgeway, Karl. A survey of inductive biases for factorial representation-learning. arXiv, 2016. URL http://arxiv.org/abs/1612.05299.

Rippel, Oren and Adams, Ryan Prescott. High-dimensional probability estimation with deep density models. arXiv, 2013.

Rusu, Andrei A., Vecerik, Matej, Rothörl, Thomas, Heess, Nicolas, Pascanu, Razvan, and Hadsell, Raia. Sim-to-real robot learning from pixels with progressive nets. arXiv, 2016.

Schmidhuber, Jürgen. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–869, 1992.

Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and Abbeel, Pieter. Trust region policy optimization. ICML, 2015.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. ICLR, 2016.

Sutherland, Robert J and Rudy, Jerry W. Configural association theory: The role of the hippocampal formation in learning, memory, and amnesia. Psychobiology, 17:129–144, 1989.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Talvitie, Erik and Singh, Satinder. An experts algorithm for transfer learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

Tobin, Josh, Fong, Rachel, Ray, Alex, Schneider, Jonas, Zaremba, Wojciech, and Abbeel, Pieter. Domain randomization for transferring deep neural networks from simulation to the real world. arXiv, 2017.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. IROS, 2012.

Tulving, Endel, Hayman, CA, and Macdonald, Carol A. Long-lasting perceptual priming and semantic learning in amnesia: a case experiment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17:595, 1991.

Tzeng, Eric, Devin, Coline, Hoffman, Judy, Finn, Chelsea, Abbeel, Pieter, Levine, Sergey, Saenko, Kate, and Darrell, Trevor. Adapting deep visuomotor representations with weak pairwise constraints. WAFR, 2016.

Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. NIPS, 2010.

Warde-Farley, David and Bengio, Yoshua. Improving generative adversarial networks with denoising feature matching. ICLR, 2017.

Watkins, Christopher John Cornish Hellaby. Learning from delayed rewards. PhD thesis, University of Cambridge, Cambridge, UK, 1989.

Whitney, William F., Chang, Michael, Kulkarni, Tejas, and Tenenbaum, Joshua B. Understanding visual concepts with continuation learning. arXiv, 2016. URL http://arxiv.org/pdf/1602.06822.pdf.

Yang, Jimei, Reed, Scott, Yang, Ming-Hsuan, and Lee, Honglak. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. NIPS, 2015.

A. Supplementary Materials

A.1. The Reinforcement Learning Paradigm

The reinforcement learning (RL) paradigm consists of an agent receiving a sequence of observations s_o_t, which are some function of environment states s_t ∈ S and may be accompanied by rewards r_{t+1} ∈ R conditional on the actions a_t ∈ A chosen at each time step t (Sutton & Barto, 1998). We assume that these interactions can be modelled as a Markov Decision Process (MDP) (Puterman, 1994) defined as a tuple D ≡ (S, A, T, R, γ). T = p(s_{t+1}|s_t, a_t) is a transition function that models the distribution of all possible next states given that action a_t is taken in state s_t, for all s_t ∈ S and a_t ∈ A. Each transition s_t → s_{t+1} under action a_t may be accompanied by a reward signal r_{t+1}(s_t, a_t, s_{t+1}). The goal of the agent is to learn a policy π(a_t|s_t), a probability distribution over actions a_t ∈ A, that maximises the expected return, i.e. the discounted sum of future rewards R_t = E[Σ_{τ=1}^{T−t} γ^{τ−1} r_{t+τ}], where T is the time step at which each episode ends and γ ∈ [0, 1) is the discount factor that progressively down-weights future rewards. Given a policy π(a|s), one can define the value function V_π(s) = E[R_t|s_t = s, π], which is the expected return from state s following policy π. The action-value function Q_π(s, a) = E[R_t|s_t = s, a_t = a, π] is the expected return for taking action a in state s at time t, and then following policy π from time t + 1 onward.

A.2. Further task details

A.2.1. DeepMind Lab

As described in Sec. 3.1, in each source episode of DeepMind Lab the agent was presented with one of three possible room/object-type conjunctions, chosen at random. These are marked D_S in Fig. 2. The setup was a seek-avoid style task, where one of the two object types in the room gave a reward of +1 and the other gave a reward of -1. The agent was allowed to pick up objects for 60 seconds, after which the episode would terminate and a new one would begin; if the agent was able to pick up all the 'good' objects in less than 60 seconds, a new episode began immediately. The agent was spawned in a random location in the room at the start of each new episode. During transfer, the agent was placed into the held-out conjunction of object types and room background; see D_T in Fig. 2. Visual pre-training was performed in other conjunctions of object type and room background, denoted D_U in Fig. 2. The observation size of frames in the DeepMind Lab task was 84x84x3 (HxWxC).

A.2.2. MuJoCo/Jaco Arm Experiments

As described in Sec. 3.2, the source task consisted of an agent learning to control a simulated arm in order to reach toward an object. A shaping reward was used, with a maximum value of 1 when the centre of the object fell between the pinch and grip sites of the end effector, or within a 10cm distance of the two. Distances on the x and y dimensions counted double compared to distances on the z dimension. During each episode the object was placed at a random drop point within a 40x40cm area, and the arm was set to a random initial start position high above the work-space, independent of the object's position. Each episode lasted for 150 steps, or 7.5 seconds, with a control step of 50ms. Observations s_o^U were sampled

randomly across episodes. Overall, 4 million frames of dimensions 64x64x3 (HxWxC) were used for this stage of the curriculum. For each episode the camera position and orientation were randomly sampled from an isotropic normal distribution centred around the approximate position and orientation of the real camera, with standard deviation 0.01. No precise measurements were used to match the two. Work-space table colour was sampled uniformly between −5% and +5% around the midpoint, independently for each RGB channel; object colours were sampled uniformly at random in RGB space, rejecting colours which fell within a ball around 10 held-out intensities (radius 10% of range); the latter were only used for simulated transfer experiments, i.e. D_T in the sim2sim experiments. Additionally, Gaussian noise with standard deviation 0.01 was added to the observations s_o^T in the sim2sim task.

For the real Jaco arm and its MuJoCo simulation counterpart, each of the nine joints could independently take 11 different actions (a linear discretisation of the continuous velocity action space). In simulation, Gaussian noise with standard deviation 0.1 was added to each discrete velocity output; delays in the real setup between observations and action execution were simulated by randomly mixing velocity outputs from two previous steps instead of emitting the last output directly. Speed ranges were between −50% and 50% of the Jaco arm's top speed on joints 1 through 6 starting at the base, while the fingers could use the full range. For safety reasons, the speed ranges were reduced by a factor of 0.3 while evaluating agents on the Jaco arm, without significant performance degradation.

A.3. Vision model details

A.3.1. Denoising Autoencoder for β-VAE

A denoising autoencoder (DAE) was used as a model to provide the feature space for the β-VAE reconstruction loss to be computed over (for motivation, see Sec. 2.3.1). The DAE was trained with occlusion-style masking noise in the vein of (Pathak et al., 2016), with the aim for the DAE to learn a semantic representation of the input frames. Concretely, two values were independently sampled from U[0, W] and two from U[0, H], where W and H were the width and height of the input frames. These four values determined the corners of the rectangular mask applied; all pixels that fell within the mask were set to zero.

The DAE architecture consisted of four convolutional layers, each with kernel size 4 and stride 2 in both the height and width dimensions. The number of filters learnt for each layer was {32, 32, 64, 64} respectively. The bottleneck layer consisted of a fully connected layer of 100 neurons. This was followed by four deconvolutional layers, again with kernel sizes 4, strides 2, and {64, 64, 32, 32} filters. The padding algorithm used was 'SAME' in TensorFlow (Abadi et al., 2015). ReLU non-linearities were used throughout.

The model was trained with a loss given by the L2 distance of the outputs from the original, un-noised inputs. The optimiser used was Adam (Kingma & Ba, 2014) with a learning rate of 1e-3.

A.3.2. β-VAE with Perceptual Similarity Loss

After training a DAE, as detailed in the previous section⁶, a β-VAE_DAE was trained with the perceptual similarity loss given by Eq. 2, repeated here:

L(θ, φ; x, z, β) = E_{q_φ(z|x)} ||J(x̂) − J(x)||²₂ − β D_KL(q_φ(z|x) || p(z))    (3)

Specifically, the input was passed through the β-VAE and a sampled⁷ reconstruction was passed through the pre-trained DAE up to a designated layer. The L2 distance of this representation from the representation of the original input passed through the same layers of the DAE was then computed, and this formed the training loss for the β-VAE part of the β-VAE_DAE⁸. The DAE weights remained frozen throughout.

The β-VAE architecture consisted of an encoder of four convolutional layers, each with kernel size 4 and stride 2 in the height and width dimensions. The number of filters learnt for each layer was {32, 32, 64, 64} respectively. This was followed by a fully connected layer of 256 neurons. The latent layer comprised 64 neurons parametrising 32 (marginally) independent Gaussian distributions. The decoder architecture was simply the reverse of the encoder, utilising deconvolutional layers. The decoder used was Gaussian, so that the number of output channels was 2C, where C was the number of channels of the input frames. The padding algorithm used was 'SAME' in TensorFlow. ReLU non-linearities were used throughout.

The model was trained with the loss given by Eq. 3. Specifically, the disentangled model used for DARLA was trained with a β hyperparameter value of 1, and the last deconvolutional layer of the DAE was used to compute the perceptual similarity loss. The entangled model used for DARLA_ENT was trained with a β hyperparameter value of 0.1, with the last deconvolutional layer of the DAE likewise used to compute the perceptual similarity loss.

The optimiser used was Adam with a learning rate of 1e-4.

A.3.3. β-VAE

For the MuJoCo/Jaco tasks, a standard β-VAE was used rather than the β-VAE_DAE used for DeepMind Lab. The architecture of the VAE encoder, decoder and the latent size were exactly as described in the previous section A.3.2. β for the disentangled β-VAE in DARLA was 175. β for the entangled model DARLA_ENT was 1, corresponding to the standard VAE of (Kingma & Welling, 2014).

The optimiser used was Adam with a learning rate of 1e-4.

A.3.4. Denoising Autoencoder for Baseline

For the baseline model DARLA_DAE, we trained a denoising autoencoder with occlusion-style masking noise as described in Appendix Section A.3.1. The architecture used matched exactly that of the β-VAE described in Appendix Section A.3.2; however, all stochastic nodes were replaced with deterministic neurons.

The optimiser used was Adam with a learning rate of 1e-4.

⁶ In principle, the β-VAE_DAE could also have been trained end-to-end in one pass, but we did not experiment with this.
⁷ It is more typical to use the mean of the reconstruction distribution, but this does not induce any pressure on the Gaussians parametrising the decoder to reduce their variances. Hence full samples were used instead.
⁸ The representations were taken after passing through the layer but before passing through the following non-linearity. We also briefly experimented with taking the L2 loss post-activation but did not find a significant difference.
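As a minimal numerical sketch of the β-VAE_DAE training loss of Eq. 3 (Sec. A.3.2), the snippet below uses a fixed random ReLU projection as a stand-in for the frozen DAE feature map J, and a diagonal Gaussian posterior. It is written as a quantity to be minimised (perceptual reconstruction error plus β-weighted KL); shapes and hyperparameters are illustrative placeholders, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(1)

def dae_features(x, W):
    """Stand-in for J(.): frozen DAE activations at the chosen layer."""
    return np.maximum(W @ x, 0.0)  # ReLU features; W stays frozen

def beta_vae_dae_loss(x, x_recon_sample, mu, log_var, W, beta):
    # Perceptual similarity term of Eq. 3: L2 distance in DAE feature
    # space, computed on a *sampled* reconstruction (Sec. A.3.2).
    recon = np.sum((dae_features(x_recon_sample, W) - dae_features(x, W)) ** 2)
    # KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl

x = rng.random(64)                            # flattened input frame
x_recon = x + rng.normal(0.0, 0.05, 64)       # imperfect sampled reconstruction
mu, log_var = rng.normal(0.0, 0.1, 8), np.zeros(8)
W = rng.normal(0.0, 0.1, (16, 64))            # frozen 'DAE layer' weights
loss = beta_vae_dae_loss(x, x_recon, mu, log_var, W, beta=1.0)
print(loss >= 0.0)
```

Raising `beta` penalises posteriors that deviate from the unit Gaussian prior more heavily, which is the capacity constraint the main text credits with producing disentangled latents.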

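The occlusion-style masking noise used to train the DAE (Sec. A.3.1) can be sketched as follows. The source samples the four corner values from continuous uniform distributions; this sketch discretises them to pixel indices for array slicing, which is an implementation assumption rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def occlusion_mask(frame):
    """Apply occlusion-style masking noise (Sec. A.3.1).

    Two values are sampled from U[0, W] and two from U[0, H]; the four
    values delimit a rectangle, and all pixels inside it are zeroed.
    """
    h, w = frame.shape[:2]
    x0, x1 = sorted(rng.integers(0, w + 1, size=2))  # discretised U[0, W]
    y0, y1 = sorted(rng.integers(0, h + 1, size=2))  # discretised U[0, H]
    noisy = frame.copy()
    noisy[y0:y1, x0:x1] = 0.0
    return noisy

frame = rng.random((84, 84, 3))  # DeepMind Lab-sized observation
noisy = occlusion_mask(frame)
print(noisy.shape == frame.shape)
```

The DAE is then trained to map `noisy` back to `frame` under an L2 loss, which encourages features that can inpaint the missing region from surrounding context.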
A.4. Reinforcement Learning Algorithm Details

A.4.1. DeepMind Lab

The action space in the DeepMind Lab task consisted of 8 discrete actions.

DQN: in DQN, the convolutional (or 'vision') part of the Q-net was replaced with the encoder of the β-VAE_DAE from stage 1 and frozen. DQN takes four consecutive frames as input in order to capture some aspect of environment dynamics in the agent's state. In order to match this in our setup with a pre-trained vision stack F_U, we passed each observation frame s^o_{1..4} through the pre-trained model, s^z_{1..4} = F_U(s^o_{1..4}), and then concatenated the outputs together to form the k-dimensional (where k = 4|s^z|) input to the policy network. In this case the size of s^z was 64 for DARLA, as well as for DARLA_ENT, DARLA_DAE and DARLA_FT. On top of the frozen convolutional stack, two 'policy' layers of 512 neurons each were used, with a final linear layer of 8 neurons corresponding to the size of the action space in the DeepMind Lab task. ReLU non-linearities were used throughout. All other hyperparameters were as reported in (Mnih et al., 2015).

A3C: in A3C, as with DQN, the convolutional part of the network that is shared between the policy net and the value net was replaced with the encoder of the β-VAE_DAE in DeepMind Lab tasks. All other hyperparameters were as reported in (Mnih et al., 2016).

Episodic Control: for the Episodic Controller-based DARLA we used mostly the same hyperparameters as in the original paper by (Blundell et al., 2016). We explored the following hyperparameter settings: number of nearest neighbours ∈ {10, 50}; return horizon ∈ {100, 400, 800, 1800, 500000}; kernel type ∈ {inverse, gaussian}; kernel width ∈ {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.99}; and we tried training EC both with and without Peng's Q(λ) (Peng, 1993). In practice we found that none of the explored hyperparameter choices significantly influenced the results of our experiments. The final hyperparameters used for all experiments reported in the paper were: number of nearest neighbours: 10; return horizon: 400; kernel type: inverse; kernel width: 1e-6; and no Peng's Q(λ).

UNREAL: we used a vanilla version of UNREAL, with parameters as reported in (Jaderberg et al., 2017).

A.4.2. MuJoCo/Jaco Arm Experiments

For the real Jaco arm and its MuJoCo simulation, each of the nine joints could independently take 11 different actions (a linear discretisation of the continuous velocity action space). Therefore the action space size was 99.

DARLA for MuJoCo/Jaco was based on feedforward A3C (Mnih et al., 2016). We closely followed the simulation training setup of (Rusu et al., 2016) for feed-forward networks using raw visual input only. In place of the usual conv-stack, however, we used the encoder of the β-VAE as described in Appendix A.3.3. This was followed by a linear layer with 512 units, a ReLU non-linearity, and a collection of 9 linear and softmax layers for the 9 independent policy outputs, as well as a single linear output layer for the value function.

A.5. Disentanglement Evaluation

A.5.1. Visual Heuristic Details

In order to choose the optimal value of β for the β-VAE_DAE models and to evaluate the fitness of the representations s^z_U learnt in stage 1 of our pipeline (in terms of disentanglement achieved), we used the visual inspection heuristic described in (Higgins et al., 2017). The heuristic involved clustering trained β-VAE based models by the number of informative latents (estimated as the number of latents z_i with average inferred standard deviation below 0.75). For each cluster we examined the degree of learnt disentanglement by running inference on a number of seed images, then traversing each latent unit z_i one at a time over three standard deviations away from its average inferred mean while keeping all other latents z_{\i} fixed to their inferred values. This allowed us to visually examine whether each individual latent unit z_i learnt to control a single interpretable factor of variation in the data. A similar heuristic has been the de rigueur method for exhibiting disentanglement in the disentanglement literature (Chen et al., 2016; Kulkarni et al., 2015).

Figure 6. Traversals of the latent corresponding to room background for models with different transfer metric scores (shown top: 0.457, 0.196, 0.065). Note that in the entangled model, many other objects appear and the blue hat changes shape in addition to the background changing. For the model with a middling transfer score, both the object type and the background alter; whereas for the disentangled model, very little apart from the background changes.

A.5.2. Transfer Metric Details

In the case of DeepMind Lab, we were able to use the ground truth labels corresponding to the two factors of variation of the object type and the background to design a proxy to the disentanglement metric proposed in (Higgins et al., 2017). The procedure used

consisted of the following steps:

1) Train the model under consideration on observations s^o_U to learn F_U, as described in stage 1 of the DARLA pipeline.

2) Learn a linear model L : S^z_V → M × N from the representations s^z_V = F_V(s^o_V), where M ∈ {0, 1} corresponds to the set of possible rooms and N ∈ {0, 1, 2, 3} corresponds to the set of possible objects[9]. Therefore we are learning a low-VC-dimension classifier to predict the room and the object class from the latent representation of the model. Crucially, the linear model L is trained on only a subset of the Cartesian product M × N, e.g. on {{0, 0}, {0, 3}, {1, 1}, {1, 2}}. In practice, we utilised a softmax classifier each for M and N and trained these using backpropagation with a cross-entropy loss, keeping the unsupervised model (and therefore F_U) fixed.

3) The trained linear model L's accuracy is evaluated on the held-out subset of the Cartesian product M × N.

[9] For the purposes of this metric, we utilised rooms with only single objects, which we denote by the subscript V, e.g. the observation set S^o_V.

Although the above procedure only measures disentangling up to linearity, and only does so for the latents of object type and room background, we nevertheless found that the metric was highly correlated with disentanglement as determined via visual inspection (see Fig. 6).

A.6. Background on RL Algorithms

In this Appendix, we provide background on the different RL algorithms that the DARLA framework was tested on in this paper.

A.6.1. DQN

Deep Q-Network (DQN) (Mnih et al., 2015) is a variant of the Q-learning algorithm (Watkins, 1989) that utilises deep learning. It uses a neural network to parametrise an approximation of the action-value function Q(s, a; θ) with parameters θ. These parameters are updated by minimising the mean-squared error of a 1-step lookahead loss L_Q = E[(r_t + γ max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ))²], where θ⁻ are the parameters of a frozen network; optimisation is performed with respect to θ, with θ⁻ being synced to θ at regular intervals.

A.6.2. A3C

Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) is an asynchronous implementation of the advantage actor-critic paradigm (Sutton & Barto, 1998; Degris & Sutton, 2012), in which separate threads run in parallel and perform updates to shared parameters. The threads each hold their own instance of the environment and have different exploration policies, thereby decorrelating parameter updates without the need for experience replay.

A3C uses neural networks with parameters θ to approximate both the policy π(a|s; θ) and the value V^π(s; θ), trained with an n-step look-ahead loss (Peng & Williams, 1996). The algorithm is trained using an advantage actor-critic loss function with an entropy regularisation penalty: L_A3C ≈ L_VR + L_π − E_{s∼π}[αH(π(a|s; θ))], where H is the entropy. The parameter updates are performed after every t_max actions or when a terminal state is reached. Here L_VR = E_{s∼π}[(R_{t:t+n} + γ^n V(s_{t+n+1}; θ) − V(s_t; θ))²] and L_π = E_{s∼π}[log π(a|s; θ)(Q^π(s, a; θ) − V^π(s; θ))]. Unlike DQN, A3C uses an LSTM core to encode its history and therefore has a longer-term memory, permitting it to perform better in partially observed environments. In the version of A3C used in this paper for the DeepMind Lab task, the policy net additionally takes the last action a_{t−1} and last reward r_{t−1} as inputs along with the observation s^o_t, as introduced in (Jaderberg et al., 2017).

A.6.3. UNREAL

The UNREAL agent (Jaderberg et al., 2017) takes as a base an LSTM A3C agent (Mnih et al., 2016) and augments it with a number of unsupervised auxiliary tasks that make use of the rich perceptual data available to the agent besides the (sometimes very sparse) extrinsic reward signals. This auxiliary learning tends to improve the representation learnt by the agent. While training the base agent, its observations, rewards and actions are stored in a replay buffer, which is used by the auxiliary learning tasks. The tasks include: 1) pixel control - the agent learns how to control the environment by training auxiliary policies to maximally change pixel intensities in different parts of the input; 2) reward prediction - given a replay buffer of observations within a short time period of an extrinsic reward, the agent has to predict the reward obtained during the next unobserved timestep using a sequence of three preceding steps; 3) value function replay - extra training of the value function to promote faster value iteration.

A.6.4. Episodic Control

In its simplest form, EC is a lookup table of states and actions denoted as Q^EC(s, a). In each state, EC picks the action with the highest Q^EC value. At the end of each episode, Q^EC(s_t, a_t) is set to R_t if (s_t, a_t) ∉ Q^EC, where R_t is the discounted return. Otherwise Q^EC(s_t, a_t) = max{Q^EC(s_t, a_t), R_t}. In order to generalise its policy to novel states that are not in Q^EC, EC uses a non-parametric nearest-neighbours search Q^EC(s, a) = (1/k) Σ_{i=1}^{k} Q^EC(s^i, a), where s^i, i = 1, ..., k are the k states with the smallest distance to the novel state s. Like DQN, EC takes a concatenation of four frames as input.

The EC algorithm is proposed as a model of fast hippocampal instance-based learning in the brain (Marr, 1971; Sutherland & Rudy, 1989), while the deep RL algorithms described above are more analogous to slow cortical learning that relies on generalised statistical summaries of the input distribution (McClelland et al., 1995; Norman & O'Reilly, 2003; Tulving et al., 1991).

A.7. Source Task Performance Results

The focus of this paper is primarily on zero-shot domain adaptation performance. However, it is also interesting to analyse the effect of the DARLA approach on source domain policy performance. In order to compare the models' behaviour on the source task, we examined the training curves (see Figures 7-10) and noted in particular their:

1. Asymptotic task performance, i.e. the rewards per episode at the point where π_S has converged for the agent under consideration.

2. Data efficiency, i.e. how quickly the training curve was able to achieve convergence.
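The tabular update and nearest-neighbour lookup of the episodic controller described in Appendix A.6.4 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: states are plain coordinate tuples, distances are computed naively rather than with an approximate nearest-neighbour index, and the kernel weighting used in the full agent is omitted.

```python
import math

class EpisodicController:
    """Minimal sketch of tabular episodic control (Blundell et al., 2016)."""

    def __init__(self, actions, k=10):
        self.q = {}            # (state, action) -> best discounted return seen
        self.actions = actions
        self.k = k

    def update(self, state, action, discounted_return):
        key = (state, action)
        if key not in self.q:
            self.q[key] = discounted_return
        else:
            # Q_EC(s, a) <- max(Q_EC(s, a), R_t)
            self.q[key] = max(self.q[key], discounted_return)

    def estimate(self, state, action):
        key = (state, action)
        if key in self.q:
            return self.q[key]
        # Novel state: average Q_EC over the k nearest stored states
        # for this action (non-parametric generalisation).
        stored = [(s, v) for (s, a), v in self.q.items() if a == action]
        if not stored:
            return 0.0
        stored.sort(key=lambda sv: math.dist(sv[0], state))
        nearest = stored[: self.k]
        return sum(v for _, v in nearest) / len(nearest)

    def act(self, state):
        # Greedy action selection over the (estimated) Q_EC values.
        return max(self.actions, key=lambda a: self.estimate(state, a))
```

The max-update means each table entry tracks the best return ever obtained from that state-action pair, which is what gives EC its fast, instance-based learning.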

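The transfer metric procedure of Appendix A.5.2 can be sketched end to end with a toy encoder. Everything below the probes is an assumption for illustration: `encode` is a hypothetical stand-in for a perfectly disentangled F_U that places room identity m and object identity n in separate latent units, and the probes are plain softmax regressions trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(z, labels, n_classes, steps=500, lr=0.5):
    # Softmax classifier trained with cross-entropy; the encoder that
    # produced z stays fixed, mirroring step 2 of the procedure.
    W = np.zeros((z.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        p = softmax(z @ W)
        W -= lr * z.T @ (p - onehot) / len(z)
    return W

def probe_accuracy(W, z, labels):
    return float(np.mean((z @ W).argmax(axis=1) == labels))

def encode(m, n):
    # Hypothetical disentangled encoder: room m and object n land in
    # separate latents; the trailing 1.0 acts as a bias feature.
    return np.array([m, n, 1.0]) + np.append(0.05 * rng.standard_normal(2), 0.0)

train_pairs = [(0, 0), (0, 3), (1, 1), (1, 2)]   # observed subset of M x N
test_pairs = [(0, 1), (0, 2), (1, 0), (1, 3)]    # held-out combinations

z_tr = np.stack([encode(m, n) for m, n in train_pairs for _ in range(50)])
m_tr = np.repeat([m for m, _ in train_pairs], 50)
n_tr = np.repeat([n for _, n in train_pairs], 50)

W_room = train_linear_probe(z_tr, m_tr, n_classes=2)
W_obj = train_linear_probe(z_tr, n_tr, n_classes=4)

z_te = np.stack([encode(m, n) for m, n in test_pairs for _ in range(50)])
m_te = np.repeat([m for m, _ in test_pairs], 50)
n_te = np.repeat([n for _, n in test_pairs], 50)

# Transfer score: mean held-out accuracy of the two probes (step 3).
transfer_score = 0.5 * (probe_accuracy(W_room, z_te, m_te)
                        + probe_accuracy(W_obj, z_te, n_te))
```

An entangled encoder that mixes m and n across latents would let the probes exploit the room-object correlations in the training subset, dropping accuracy on the held-out combinations and hence the transfer score.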
Figure 7. Source task training curves for DQN. Curves show average and standard deviation over 20 random seeds.

Figure 8. Source task performance training curves for A3C and UNREAL. DARLA shows accelerated learning of the task compared to other architectures. Results show average and standard deviation over 20 random seeds, each using 16 workers.

Figure 9. Source task training curves for EC. Results show average and standard deviation over 20 random seeds.

Figure 10. Training curves for various baselines on the source MuJoCo reaching task.

We note the following consistent trends across the results:

1. Using DARLA provided an initial boost in learning performance, which depended on the degree of disentanglement of the representation. This was particularly observable in A3C; see Fig. 8.

2. Baseline algorithms where F could be fine-tuned to the source task were able to achieve higher asymptotic performance. This was particularly notable for DQN and A3C in DeepMind Lab (see Figs. 7 and 8). However, in both those cases, DARLA was able to learn very reasonable policies on the source task which were on the order of 20% lower than the fine-tuned models; arguably a worthwhile sacrifice for the subsequent median 270% improvement in target domain performance noted in the main text.

3. Allowing DARLA to fine-tune its vision module (DARLA_FT) boosted its source task learning speed, and allowed the agent to asymptote at the same level as the baseline algorithms. As discussed in the main text, this comes at the cost of significantly reduced domain transfer performance on A3C. For DQN, however, fine-tuning appears to offer the best of both worlds.

4. Perhaps most relevantly for this paper, even if solely examining source task performance, DARLA outperforms both DARLA_ENT and DARLA_DAE on both asymptotic performance and data efficiency; this suggests that disentangled representations have wider applicability in RL beyond the zero-shot domain adaptation that is the focus of this paper.
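For concreteness, the way the n-step A3C losses defined in Appendix A.6.2 combine can be sketched as below. This is a hedged NumPy illustration for a single rollout, not the agents' implementation: in a real agent the advantage is treated as a constant when differentiating the policy term, and gradients flow through a neural network rather than this plain function.

```python
import numpy as np

def a3c_losses(rewards, values, bootstrap_value, log_pi_taken, entropy,
               gamma=0.99, alpha=0.01):
    # Sketch of the A3C objective for one rollout of length t_max.
    # `log_pi_taken` holds log pi(a_t|s_t; theta) for the actions taken;
    # `bootstrap_value` is V(s_{t+n+1}; theta) closing the n-step return.
    n = len(rewards)
    returns = np.empty(n)
    running = bootstrap_value
    for t in reversed(range(n)):
        running = rewards[t] + gamma * running  # R_{t:t+n} + gamma^n * V
        returns[t] = running
    advantages = returns - values                # estimate of Q^pi - V^pi
    value_loss = np.mean((returns - values) ** 2)       # L_VR
    policy_loss = -np.mean(log_pi_taken * advantages)   # minimisation form of L_pi
    # L_A3C ~= L_VR + L_pi - alpha * H(pi)
    return value_loss + policy_loss - alpha * entropy
```

The entropy term is subtracted so that minimising the total loss encourages higher-entropy (more exploratory) policies.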