Asynchronous Methods for Deep Reinforcement Learning

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

Volodymyr Mnih¹ VMNIH@GOOGLE.COM
Adrià Puigdomènech Badia¹ ADRIAP@GOOGLE.COM
Mehdi Mirza¹,² MIRZAMOM@IRO.UMONTREAL.CA
Alex Graves¹ GRAVESA@GOOGLE.COM
Tim Harley¹ THARLEY@GOOGLE.COM
Timothy P. Lillicrap¹ COUNTZERO@GOOGLE.COM
David Silver¹ DAVIDSILVER@GOOGLE.COM
Koray Kavukcuoglu¹ KORAYK@GOOGLE.COM

¹ Google DeepMind
² Montreal Institute for Learning Algorithms (MILA), University of Montreal

arXiv:1602.01783v2 [cs.LG] 16 Jun 2016

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

1. Introduction

Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction; and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less

time than previous GPU-based algorithms, using far less resource than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents makes it the most general and successful reinforcement learning agent to date.

2. Related Work

The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

3. Reinforcement Learning Background

We consider the standard reinforcement learning setting where an agent interacts with an environment $\mathcal{E}$ over a number of discrete time steps. At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from some set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. In return, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$. The process continues until the agent reaches a terminal state after which the process restarts. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated return from time step $t$ with discount factor $\gamma \in (0, 1]$. The goal of the agent is to maximize the expected return from each state $s_t$.

The action value $Q^\pi(s, a) = \mathbb{E}[R_t \mid s_t = s, a]$ is the expected return for selecting action $a$ in state $s$ and following policy $\pi$. The optimal value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ gives the maximum action value for state $s$ and action $a$ achievable by any policy. Similarly, the value of state $s$ under policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}[R_t \mid s_t = s]$ and is simply the expected return for following policy $\pi$ from state $s$.

In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. The updates to $\theta$ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: $Q^*(s, a) \approx Q(s, a; \theta)$. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$th loss function is defined as

$$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$$

where $s'$ is the state encountered after state $s$.

We refer to the above method as one-step Q-learning because it updates the action value $Q(s, a)$ toward the one-step return $r + \gamma \max_{a'} Q(s', a'; \theta)$. One drawback of using one-step methods is that obtaining a reward $r$ only directly affects the value of the state-action pair $s, a$ that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value $Q(s, a)$. This can make the learning process slow since many updates are required to propagate a reward to the relevant preceding states and actions.
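The discounted return and the one-step Q-learning loss defined above can be illustrated with a minimal sketch. It assumes the action values for the next state are supplied as a plain list of floats rather than produced by a neural network; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite list of rewards."""
    ret = 0.0
    # Work backwards so each step applies one extra factor of gamma.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def one_step_target(reward, next_q_values, gamma):
    """One-step return r + gamma * max_a' Q(s', a').

    next_q_values stands in for Q(s', a'; theta_{i-1}) over all actions a'.
    """
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_sa, target):
    """Squared difference between the target and the estimate Q(s, a; theta_i)."""
    return (target - q_sa) ** 2


# Hypothetical numbers: current estimate Q(s, a) = 1.0, reward 0.5,
# and two actions available in the next state s'.
target = one_step_target(reward=0.5, next_q_values=[2.0, 4.0], gamma=0.5)
loss = squared_td_loss(q_sa=1.0, target=target)
print(target, loss)  # 2.5 2.25
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.5
```

In this sketch the loss is computed for a single transition; the expectation in the loss above would in practice be approximated by averaging over sampled transitions.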

One way of propagating rewards faster is by using n-step returns (Watkins, 1989; Peng & Williams, 1996). In n-step Q-learning, $Q(s, a)$ is updated toward the n-step return defined as $r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \max_a \gamma^n Q(s_{t+n}, a)$. This results in a single reward $r$ directly affecting the values of $n$ preceding state-action pairs. This makes the process of propagating rewards to relevant state-action pairs potentially much more efficient.

In contrast to value-based methods, policy-based model-free methods directly parameterize the policy $\pi(a \mid s; \theta)$ and update the parameters $\theta$ by performing, typically approximate, gradient ascent on $\mathbb{E}[R_t]$. One example of such a method is the REINFORCE family of algorithms due to Williams (1992). Standard REINFORCE updates the policy parameters $\theta$ in the direction $\nabla_\theta \log \pi(a_t \mid s_t; \theta) R_t$, which is an unbiased estimate of $\nabla_\theta \mathbb{E}[R_t]$. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state $b_t(s_t)$, known as a baseline (Williams, 1992), from the return. The resulting gradient is $\nabla_\theta \log \pi(a_t \mid s_t; \theta)(R_t - b_t(s_t))$.

A learned estimate of the value function is commonly used as the baseline $b_t(s_t) \approx V^\pi(s_t)$, leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity $R_t - b_t$ used to scale the policy gradient can be seen as an estimate of the advantage of action $a_t$ in state $s_t$, or $A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$, because $R_t$ is an estimate of $Q^\pi(a_t, s_t)$ and $b_t$ is an estimate of $V^\pi(s_t)$. This approach can be viewed as an actor-critic architecture where the policy $\pi$ is the actor and the baseline $b_t$ is the critic (Sutton & Barto, 1998; Degris et al., 2012).

4. Asynchronous RL Framework

We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.

First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.

Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.

In addition to stabilizing learning, using multiple parallel actor-learners has multiple practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning and advantage actor-critic.

Asynchronous one-step Q-learning: Pseudocode for our variant of Q-learning, which we call Asynchronous one-step Q-learning, is shown in Algorithm 1. Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. We use a shared and slowly changing target network in computing the Q-learning loss, as was proposed in the DQN training method. We also accumulate gradients over multiple timesteps before they are applied, which is similar to using minibatches. This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency.

Algorithm 1 Asynchronous one-step Q-learning - pseudocode for each actor-learner thread.
    // Assume global shared $\theta$, $\theta^-$, and counter $T = 0$.
    Initialize thread step counter $t \leftarrow 0$
    Initialize target network weights $\theta^- \leftarrow \theta$
    Initialize network gradients $d\theta \leftarrow 0$
    Get initial state $s$
    repeat
        Take action $a$ with $\epsilon$-greedy policy based on $Q(s, a; \theta)$
        Receive new state $s'$ and reward $r$
        $y = r$ for terminal $s'$
        $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ for non-terminal $s'$
        Accumulate gradients wrt $\theta$: $d\theta \leftarrow d\theta + \frac{\partial (y - Q(s, a; \theta))^2}{\partial \theta}$
        $s = s'$
        $T \leftarrow T + 1$ and $t \leftarrow t + 1$
        if $T \bmod I_{target} == 0$ then
            Update the target network $\theta^- \leftarrow \theta$
        end if
        if $t \bmod I_{AsyncUpdate} == 0$ or $s$ is terminal then
            Perform asynchronous update of $\theta$ using $d\theta$.
            Clear gradients $d\theta \leftarrow 0$.
        end if
    until $T > T_{max}$

Finally, we found that giving each thread a different exploration policy helps improve robustness. Adding diversity to exploration in this manner also generally improves performance through better exploration. While there are many possible ways of making the exploration policies differ we experiment with using $\epsilon$-greedy exploration with $\epsilon$ periodically sampled from some distribution by each thread.

Asynchronous one-step Sarsa: The asynchronous one-step Sarsa algorithm is the same as asynchronous one-step Q-learning as given in Algorithm 1 except that it uses a different target value for $Q(s, a)$. The target value used by one-step Sarsa is $r + \gamma Q(s', a'; \theta^-)$ where $a'$ is the action taken in state $s'$ (Rummery & Niranjan, 1994; Sutton & Barto, 1998). We again use a target network and updates accumulated over multiple timesteps to stabilize learning.

Asynchronous n-step Q-learning: Pseudocode for our variant of multi-step Q-learning is shown in Supplementary Algorithm S2. The algorithm is somewhat unusual because it operates in the forward view by explicitly computing n-step returns, as opposed to the more common backward view used by techniques like eligibility traces (Sutton & Barto, 1998). We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to $t_{max}$ steps or until a terminal state is reached. This process results in the agent receiving up to $t_{max}$ rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return resulting in a one-step update for the last state, a two-step update for the second last state, and so on for a total of up to $t_{max}$ updates. The accumulated updates are applied in a single gradient step.

Asynchronous advantage actor-critic: The algorithm, which we call asynchronous advantage actor-critic (A3C), maintains a policy $\pi(a_t \mid s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. Like our variant of n-step Q-learning, our variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{max}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') A(s_t, a_t; \theta, \theta_v)$ where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function given by $\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$, where $k$ can vary from state to state and is upper-bounded by $t_{max}$. The pseudocode for the algorithm is presented in Supplementary Algorithm S3.

As with the value-based methods we rely on parallel actor-learners and accumulated updates for improving training stability. Note that while the parameters $\theta$ of the policy and $\theta_v$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy $\pi(a_t \mid s_t; \theta)$ and one linear output for the value function $V(s_t; \theta_v)$, with all non-output layers shared.

We also found that adding the entropy of the policy $\pi$ to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991), who found that it was particularly helpful on tasks requiring hierarchical behavior. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form $\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')(R_t - V(s_t; \theta_v)) + \beta \nabla_{\theta'} H(\pi(s_t; \theta'))$, where $H$ is the entropy. The hyperparameter $\beta$ controls the strength of the entropy regularization term.

Optimization: We investigated three different optimization algorithms in our asynchronous framework – SGD with momentum, RMSProp (Tieleman & Hinton, 2012) without shared statistics, and RMSProp with shared statistics. We used the standard non-centered RMSProp update given by

$$g = \alpha g + (1 - \alpha)\Delta\theta^2 \quad \text{and} \quad \theta \leftarrow \theta - \eta \frac{\Delta\theta}{\sqrt{g + \epsilon}}, \qquad (1)$$

where all operations are performed elementwise. A comparison on a subset of Atari 2600 games showed that a variant of RMSProp where statistics $g$ are shared across threads is considerably more robust than the other two methods. Full details of the methods and comparisons are included in Supplementary Section 7.

5. Experiments

We use four different platforms for assessing the properties of the proposed framework. We perform most of our experiments using the Arcade Learning Environment (Bellemare et al., 2012), which provides a simulator for Atari 2600 games. This is one of the most commonly used benchmark environments for RL algorithms. We use the Atari domain to compare against state of the art results (Van Hasselt et al., 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al., 2015; Mnih et al., 2015), as well as to carry out a detailed stability and scalability analysis of the proposed methods. We performed further comparisons using the TORCS 3D car racing simulator (Wymann et al., 2013). We also use

two additional domains to evaluate only the A3C algorithm – MuJoCo and Labyrinth. MuJoCo (Todorov, 2015) is a physics simulator for evaluating agents on continuous motor control tasks with contact dynamics. Labyrinth is a new 3D environment where the agent must learn to find rewards in randomly generated mazes from a visual input. The precise details of our experimental setup can be found in Supplementary Section 8.

5.1. Atari 2600 Games

We first present results on a subset of Atari 2600 games to demonstrate the training speed of the new methods. Figure 1 compares the learning speed of the DQN algorithm trained on an Nvidia K40 GPU with the asynchronous methods trained using 16 CPU cores on five Atari 2600 games. The results show that all four asynchronous methods we presented can successfully train neural network controllers on the Atari domain. The asynchronous methods tend to learn faster than DQN, with significantly faster learning on some games, while training on only 16 CPU cores. Additionally, the results suggest that n-step methods learn faster than one-step methods on some games. Overall, the policy-based advantage actor-critic method significantly outperforms all three value-based methods.

[Figure 1 plots: score against training time (hours) on Beamrider, Breakout, Pong, Q*bert, and Space Invaders for DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]

Figure 1. Learning speed comparison for DQN and the new asynchronous algorithms on five Atari 2600 games. DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores. The plots are averaged over 5 runs. In the case of DQN the runs were for different seeds with fixed hyperparameters. For asynchronous methods we average over the best 5 models from 50 experiments with learning rates sampled from $LogUniform(10^{-4}, 10^{-2})$ and all other hyperparameters fixed.

We then evaluated asynchronous advantage actor-critic on 57 Atari games. In order to compare with the state of the art in Atari game playing, we largely followed the training and evaluation protocol of (Van Hasselt et al., 2015). Specifically, we tuned hyperparameters (learning rate and amount of gradient norm clipping) using a search on six Atari games (Beamrider, Breakout, Pong, Q*bert, Seaquest and Space Invaders) and then fixed all hyperparameters for all 57 games. We trained both a feedforward agent with the same architecture as (Mnih et al., 2015; Nair et al., 2015; Van Hasselt et al., 2015) as well as a recurrent agent with an additional 256 LSTM cells after the final hidden layer. We additionally used the final network weights for evaluation to make the results more comparable to the original results from (Bellemare et al., 2012). We trained our agents for four days using 16 CPU cores, while the other agents were trained for 8 to 10 days on Nvidia K40 GPUs. Table 1 shows the average and median human-normalized scores obtained by our agents trained by asynchronous advantage actor-critic (A3C) as well as the current state of the art. Supplementary Table S3 shows the scores on all games. A3C significantly improves on the state-of-the-art average score over 57 games in half the training time of the other methods while using only 16 CPU cores and no GPU. Furthermore, after just one day of training, A3C matches the average human-normalized score of Dueling Double DQN and almost reaches the median human-normalized score of Gorila. We note that many of the improvements that are presented in Double DQN (Van Hasselt et al., 2015) and Dueling Double DQN (Wang et al., 2015) can be incorporated into the 1-step Q and n-step Q methods presented in this work with similar potential improvements.

Method             Training Time           Mean      Median
DQN                8 days on GPU           121.9%    47.5%
Gorila             4 days, 100 machines    215.2%    71.3%
D-DQN              8 days on GPU           332.9%    110.9%
Dueling D-DQN      8 days on GPU           343.8%    117.1%
Prioritized DQN    8 days on GPU           463.6%    127.6%
A3C, FF            1 day on CPU            344.1%    68.2%
A3C, FF            4 days on CPU           496.8%    116.6%
A3C, LSTM          4 days on CPU           623.0%    112.6%

Table 1. Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric. Supplementary Table S3 shows the raw scores for all games.

5.2. TORCS Car Racing Simulator

We also compared the four asynchronous methods on the TORCS 3D car racing game (Wymann et al., 2013). TORCS not only has more realistic graphics than Atari 2600 games, but also requires the agent to learn the dynamics of the car it is controlling. At each step, an agent received only a visual input in the form of an RGB image

of the current frame as well as a reward proportional to the agent's velocity along the center of the track at the agent's current position. We used the same neural network architecture as the one used in the Atari experiments specified in Supplementary Section 8. We performed experiments using four different settings – the agent controlling a slow car with and without opponent bots, and the agent controlling a fast car with and without opponent bots. Full results can be found in Supplementary Figure S6. A3C was the best performing agent, reaching between roughly 75% and 90% of the score obtained by a human tester on all four game configurations in about 12 hours of training. A video showing the learned driving behavior of the A3C agent can be found at https://youtu.be/0xo1Ldx3L5Q.

5.3. Continuous Action Control Using the MuJoCo Physics Simulator

We also examined a set of tasks where the action space is continuous. In particular, we looked at a set of rigid body physics domains with contact dynamics where the tasks include many examples of manipulation and locomotion. These tasks were simulated using the MuJoCo physics engine. We evaluated only the asynchronous advantage actor-critic algorithm since, unlike the value-based methods, it is easily extended to continuous actions. In all problems, using either the physical state or pixels as input, asynchronous advantage actor-critic found good solutions in less than 24 hours of training and typically in under a few hours. Some successful policies learned by our agent can be seen in the following video: https://youtu.be/Ajjc08-iPx8. Further details about this experiment can be found in Supplementary Section 9.

5.4. Labyrinth

We performed an additional set of experiments with A3C on a new 3D environment called Labyrinth. The specific task we considered involved the agent learning to find rewards in randomly generated mazes. At the beginning of each episode the agent was placed in a new randomly generated maze consisting of rooms and corridors. Each maze contained two types of objects that the agent was rewarded for finding – apples and portals. Picking up an apple led to a reward of 1. Entering a portal led to a reward of 10 after which the agent was respawned in a new random location in the maze and all previously collected apples were regenerated. An episode terminated after 60 seconds after which a new episode would begin. The aim of the agent is to collect as many points as possible in the time limit and the optimal strategy involves first finding the portal and then repeatedly going back to it after each respawn. This task is much more challenging than the TORCS driving domain because the agent is faced with a new maze in each episode and must learn a general strategy for exploring random mazes.

We trained an A3C LSTM agent on this task using only 84 × 84 RGB images as input. The final average score of around 50 indicates that the agent learned a reasonable strategy for exploring random 3D mazes using only a visual input. A video showing one of the agents exploring previously unseen mazes is included at https://youtu.be/nMR5mjCFZCw.

5.5. Scalability and Data Efficiency

We analyzed the effectiveness of our proposed framework by looking at how the training time and data efficiency change with the number of parallel actor-learners. When using multiple workers in parallel and updating a shared model, one would expect that in an ideal case, for a given task and algorithm, the number of training steps to achieve a certain score would remain the same with varying numbers of workers. Therefore, the advantage would be solely due to the ability of the system to consume more data in the same amount of wall-clock time and possibly improved exploration. Table 2 shows the training speedup achieved by using increasing numbers of parallel actor-learners averaged over seven Atari games. These results show that all four methods achieve substantial speedups from using multiple worker threads, with 16 threads leading to at least an order of magnitude speedup. This confirms that our proposed framework scales well with the number of parallel workers, making efficient use of resources.

                Number of threads
Method          1      2      4      8      16
1-step Q        1.0    3.0    6.3    13.3   24.1
1-step SARSA    1.0    2.8    5.9    13.1   22.1
n-step Q        1.0    2.7    5.9    10.7   17.2
A3C             1.0    2.1    3.7    6.9    12.5

Table 2. The average training speedup for each method and number of threads averaged over seven Atari games. To compute the training speedup on a single game we measured the time required to reach a fixed reference score using each method and number of threads. The speedup from using n threads on a game was defined as the time required to reach a fixed reference score using one thread divided by the time required to reach the reference score using n threads. The table shows the speedups averaged over seven Atari games (Beamrider, Breakout, Enduro, Pong, Q*bert, Seaquest, and Space Invaders).

Somewhat surprisingly, asynchronous one-step Q-learning and Sarsa algorithms exhibit superlinear speedups that cannot be explained by purely computational gains. We observe that one-step methods (one-step Q and one-step Sarsa) often require less data to achieve a particular score when using more parallel actor-learners. We believe this is due to a positive effect of multiple threads in reducing the bias in one-step methods. These effects are shown more clearly in Figure 3, which shows plots of the average score against the total number of training frames for different

numbers of actor-learners and training methods on five Atari games, and Figure 4, which shows plots of the average score against wall-clock time.

[Figure 2: five scatter-plot panels (A3C on Beamrider, Breakout, Pong, Q*bert, Space Invaders). x-axis: learning rate, from 10^-4 to 10^-2; y-axis: score.]

Figure 2. Scatter plots of scores obtained by asynchronous advantage actor-critic on five games (Beamrider, Breakout, Pong, Q*bert, Space Invaders) for 50 different learning rates and random initializations. On each game, there is a wide range of learning rates for which all random initializations achieve good scores. This shows that A3C is quite robust to learning rates and initial random weights.

5.6. Robustness and Stability

Finally, we analyzed the stability and robustness of the four proposed asynchronous algorithms. For each of the four algorithms we trained models on five games (Breakout, Beamrider, Pong, Q*bert, Space Invaders) using 50 different learning rates and random initializations. Figure 2 shows scatter plots of the resulting scores for A3C, while Supplementary Figure S11 shows plots for the other three methods. There is usually a range of learning rates for each method and game combination that leads to good scores, indicating that all methods are quite robust to the choice of learning rate and random initialization. The fact that there are virtually no points with scores of 0 in regions with good learning rates indicates that the methods are stable and do not collapse or diverge once they are learning.

6. Conclusions and Discussion

We have presented asynchronous versions of four standard reinforcement learning algorithms and showed that they are able to train neural network controllers on a variety of domains in a stable manner. Our results show that in our proposed framework stable training of neural networks through reinforcement learning is possible with both value-based and policy-based methods, off-policy as well as on-policy methods, and in discrete as well as continuous domains. When trained on the Atari domain using 16 CPU cores, the proposed asynchronous algorithms train faster than DQN trained on an Nvidia K40 GPU, with A3C surpassing the current state-of-the-art in half the training time.

One of our main findings is that using parallel actor-learners to update a shared model had a stabilizing effect on the learning process of the three value-based methods we considered. While this shows that stable online Q-learning is possible without experience replay, which was used for this purpose in DQN, it does not mean that experience replay is not useful. Incorporating experience replay into the asynchronous reinforcement learning framework could substantially improve the data efficiency of these methods by reusing old data. This could in turn lead to much faster training times in domains like TORCS where interacting with the environment is more expensive than updating the model for the architecture we used.

Combining other existing reinforcement learning methods or recent advances in deep reinforcement learning with our asynchronous framework presents many possibilities for immediate improvements to the methods we presented. While our n-step methods operate in the forward view (Sutton & Barto, 1998) by using corrected n-step returns directly as targets, it has been more common to use the backward view to implicitly combine different returns through eligibility traces (Watkins, 1989; Sutton & Barto, 1998; Peng & Williams, 1996). The asynchronous advantage actor-critic method could potentially be improved by using other ways of estimating the advantage function, such as the generalized advantage estimation of (Schulman et al., 2015b). All of the value-based methods we investigated could benefit from different ways of reducing the over-estimation bias of Q-values (Van Hasselt et al., 2015; Bellemare et al., 2016). Yet another, more speculative, direction is to try to combine the recent work on true online temporal difference methods (van Seijen et al., 2015) with nonlinear function approximation.

In addition to these algorithmic improvements, a number of complementary improvements to the neural network architecture are possible. The dueling architecture of (Wang et al., 2015) has been shown to produce more accurate estimates of Q-values by including separate streams for the state value and advantage in the network. The spatial softmax proposed by (Levine et al., 2015) could improve both value-based and policy-based methods by making it easier for the network to represent feature coordinates.

ACKNOWLEDGMENTS

We thank Thomas Degris, Remi Munos, Marc Lanctot, Sasha Vezhnevets and Joseph Modayil for many helpful discussions, suggestions and comments on the paper. We also thank the DeepMind evaluation team for setting up the environments used to evaluate the agents in the paper.

[Figure 3: a 3 × 5 grid of line plots. Rows: 1-step Q, n-step Q, A3C. Columns: Beamrider, Breakout, Pong, Q*bert, Space Invaders. Curves: 1, 2, 4, 8 and 16 threads. x-axis: training epochs (0 to 40); y-axis: score.]

Figure 3. Data efficiency comparison of different numbers of actor-learners for three asynchronous methods on five Atari games. The x-axis shows the total number of training epochs where an epoch corresponds to four million frames (across all threads). The y-axis shows the average score. Each curve shows the average over the three best learning rates. Single step methods show increased data efficiency from more parallel workers. Results for Sarsa are shown in Supplementary Figure S9.

[Figure 4: a 3 × 5 grid of line plots. Rows: 1-step Q, n-step Q, A3C. Columns: Beamrider, Breakout, Pong, Q*bert, Space Invaders. Curves: 1, 2, 4, 8 and 16 threads. x-axis: training time in hours (0 to 14); y-axis: score.]

Figure 4. Training speed comparison of different numbers of actor-learners on five Atari games. The x-axis shows training time in hours while the y-axis shows the average score. Each curve shows the average over the three best learning rates. All asynchronous methods show significant speedups from using greater numbers of parallel actor-learners. Results for Sarsa are shown in Supplementary Figure S10.

References

Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012.
Bellemare, Marc G., Ostrovski, Georg, Guez, Arthur, Thomas, Philip S., and Munos, Rémi. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
Bertsekas, Dimitri P. Distributed dynamic programming. Automatic Control, IEEE Transactions on, 27(3):610–616, 1982.
Chavez, Kevin, Ong, Hao Yi, and Hong, Augustus. Distributed deep q-learning. Technical report, Stanford University, June 2015.
Degris, Thomas, Pilarski, Patrick M, and Sutton, Richard S. Model-free reinforcement learning with continuous action in practice. In American Control Conference (ACC), 2012, pp. 2177–2182. IEEE, 2012.
Grounds, Matthew and Kudenko, Daniel. Parallel reinforcement learning with linear function approximation. In Proceedings of the 5th, 6th and 7th European Conference on Adaptive and Learning Agents and Multi-agent Systems: Adaptation and Multi-agent Learning, pp. 60–74. Springer-Verlag, 2008.
Koutník, Jan, Schmidhuber, Jürgen, and Gomez, Faustino. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 conference on Genetic and evolutionary computation, pp. 541–548. ACM, 2014.
Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
Li, Yuxi and Schuurmans, Dale. Mapreduce for parallel reinforcement learning. In Recent Advances in Reinforcement Learning - 9th European Workshop, EWRL 2011, Athens, Greece, September 9-11, 2011, Revised Selected Papers, pp. 309–320, 2011.
Lillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop. 2013.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015. URL http://dx.doi.org/10.1038/nature14236.
Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, Maria, Alessandro De, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, Legg, Shane, Mnih, Volodymyr, Kavukcuoglu, Koray, and Silver, David. Massively parallel methods for deep reinforcement learning. In ICML Deep Learning Workshop. 2015.
Peng, Jing and Williams, Ronald J. Incremental multi-step q-learning. Machine Learning, 22(1-3):283–290, 1996.
Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.
Riedmiller, Martin. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pp. 317–328. Springer Berlin Heidelberg, 2005.
Rummery, Gavin A and Niranjan, Mahesan. On-line q-learning using connectionist systems. 1994.
Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015a.
Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
Sutton, R. and Barto, A. Reinforcement Learning: an Introduction. MIT Press, 1998.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Todorov, E. MuJoCo: Modeling, Simulation and Visualization of Multi-Joint Dynamics with Contact (ed 1.0). Roboti Publishing, 2015.

Tomassini, Marco. Parallel and distributed evolutionary algorithms: A review. Technical report, 1999.
Tsitsiklis, John N. Asynchronous stochastic approximation and q-learning. Machine Learning, 16(3):185–202, 1994.
Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461, 2015.
van Seijen, H., Rupam Mahmood, A., Pilarski, P. M., Machado, M. C., and Sutton, R. S. True Online Temporal-Difference Learning. ArXiv e-prints, December 2015.
Wang, Z., de Freitas, N., and Lanctot, M. Dueling Network Architectures for Deep Reinforcement Learning. ArXiv e-prints, November 2015.
Watkins, Christopher John Cornish Hellaby. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
Williams, Ronald J and Peng, Jing. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. Torcs: The open racing car simulator, v1.3.5, 2013.

Supplementary Material for "Asynchronous Methods for Deep Reinforcement Learning"

June 17, 2016

7. Optimization Details

We investigated two different optimization algorithms with our asynchronous framework: stochastic gradient descent and RMSProp. Our implementations of these algorithms do not use any locking in order to maximize throughput when using a large number of threads.

Momentum SGD: The implementation of SGD in an asynchronous setting is relatively straightforward and well studied (Recht et al., 2011). Let $\theta$ be the parameter vector that is shared across all threads and let $\Delta\theta_i$ be the accumulated gradients of the loss with respect to parameters $\theta$ computed by thread number $i$. Each thread $i$ independently applies the standard momentum SGD update $m_i = \alpha m_i + (1 - \alpha)\Delta\theta_i$ followed by $\theta \leftarrow \theta - \eta m_i$ with learning rate $\eta$, momentum $\alpha$ and without any locks. Note that in this setting, each thread maintains its own separate gradient and momentum vector.

RMSProp: While RMSProp (Tieleman & Hinton, 2012) has been widely used in the deep learning literature, it has not been extensively studied in the asynchronous optimization setting. The standard non-centered RMSProp update is given by

$$g = \alpha g + (1 - \alpha)\Delta\theta^2 \tag{S2}$$

$$\theta \leftarrow \theta - \eta \frac{\Delta\theta}{\sqrt{g + \epsilon}}, \tag{S3}$$

where all operations are performed elementwise. In order to apply RMSProp in the asynchronous optimization setting one must decide whether the moving average of elementwise squared gradients $g$ is shared or per-thread. We experimented with two versions of the algorithm. In one version, which we refer to as RMSProp, each thread maintains its own $g$ shown in Equation S2. In the other version, which we call Shared RMSProp, the vector $g$ is shared among threads and is updated asynchronously and without locking. Sharing statistics among threads also reduces memory requirements by using one fewer copy of the parameter vector per thread.
We compared these three asynchronous optimization algorithms in terms of their sensitivity to different learning rates and random network initializations. Figure S5 shows a comparison of the methods for two different reinforcement learning methods (Async n-step Q and Async Advantage Actor-Critic) on four different games (Breakout, Beamrider, Seaquest and Space Invaders). Each curve shows the scores for 50 experiments that correspond to 50 different random learning rates and initializations. The x-axis shows the rank of the model after sorting in descending order by final average score and the y-axis shows the final average score achieved by the corresponding model. In this representation, the algorithm that performs better would achieve higher maximum rewards on the y-axis and the algorithm that is most robust would have its slope closest to horizontal, thus maximizing the area under the curve. RMSProp with shared statistics tends to be more robust than RMSProp with per-thread statistics, which is in turn more robust than Momentum SGD.
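The Shared RMSProp variant described above can be illustrated with a minimal single-process sketch. The class name `SharedRMSProp`, its method `apply`, and the default value of `epsilon` are assumptions made for this example; this section does not state the value of $\epsilon$, and real lock-free multi-threaded updates are not reproduced here.

```python
import math


class SharedRMSProp:
    """Non-centered RMSProp (Equations S2 and S3) with one statistics vector g.

    Hypothetical illustration: every worker calls apply() on the same
    instance, so g is shared rather than kept per thread.
    """

    def __init__(self, num_params, learning_rate, decay=0.99, epsilon=1e-8):
        self.g = [0.0] * num_params      # shared moving average of squared gradients
        self.learning_rate = learning_rate  # eta
        self.decay = decay                  # alpha in Equation S2
        self.epsilon = epsilon              # assumed value, not taken from the text

    def apply(self, theta, grad):
        """Update the shared parameters theta in place from one worker's gradient."""
        for j, grad_j in enumerate(grad):
            # Equation S2: g = alpha * g + (1 - alpha) * grad^2 (elementwise)
            self.g[j] = self.decay * self.g[j] + (1.0 - self.decay) * grad_j * grad_j
            # Equation S3: theta <- theta - eta * grad / sqrt(g + epsilon)
            theta[j] -= self.learning_rate * grad_j / math.sqrt(self.g[j] + self.epsilon)
```

In the per-thread RMSProp variant each worker would instead hold its own `SharedRMSProp`-like object (and therefore its own `g`) while still writing to the same `theta`.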

8. Experimental Setup

The experiments performed on a subset of Atari games (Figures 1, 3, 4 and Table 2) as well as the TORCS experiments (Figure S6) used the following setup. Each experiment used 16 actor-learner threads running on a single machine and no GPUs. All methods performed updates after every 5 actions ($t_{max} = 5$ and $I_{Update} = 5$) and shared RMSProp was used for optimization. The three asynchronous value-based methods used a shared target network that was updated every 40000 frames. The Atari experiments used the same input preprocessing as (Mnih et al., 2015) and an action repeat of 4.

The agents used the network architecture from (Mnih et al., 2013). The network used a convolutional layer with 16 filters of size 8 × 8 with stride 4, followed by a convolutional layer with 32 filters of size 4 × 4 with stride 2, followed by a fully connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The value-based methods had a single linear output unit for each action representing the action-value. The model used by actor-critic agents had two sets of outputs: a softmax output with one entry per action representing the probability of selecting the action, and a single linear output representing the value function.

All experiments used a discount of γ = 0.99 and an RMSProp decay factor of α = 0.99. The value-based methods sampled the exploration rate $\epsilon$ from a distribution taking three values $\epsilon_1, \epsilon_2, \epsilon_3$ with probabilities 0.4, 0.3, 0.3. The values of $\epsilon_1, \epsilon_2, \epsilon_3$ were annealed from 1 to 0.1, 0.01, 0.5 respectively over the first four million frames. Advantage actor-critic used entropy regularization with a weight β = 0.01 for all Atari and TORCS experiments. We performed a set of 50 experiments for five Atari games and every TORCS level, each using a different random initialization and initial learning rate.
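The exploration schedule for the value-based methods can be sketched as follows. The final values, probabilities and four-million-frame horizon come from the text; the linear shape of the annealing and the function names `sample_final_epsilon` and `annealed_epsilon` are assumptions made for this illustration, since the text does not specify them.

```python
import random

# Final exploration rates and their sampling probabilities, as stated in the text.
FINAL_EPSILONS = (0.1, 0.01, 0.5)
PROBABILITIES = (0.4, 0.3, 0.3)
ANNEAL_FRAMES = 4_000_000  # "the first four million frames"


def sample_final_epsilon(rng):
    """Draw one of the three final exploration rates with the stated probabilities."""
    return rng.choices(FINAL_EPSILONS, weights=PROBABILITIES, k=1)[0]


def annealed_epsilon(final_epsilon, frame):
    """Anneal epsilon from 1.0 down to final_epsilon.

    Assumption: the schedule is linear in the frame count and constant
    once ANNEAL_FRAMES frames have elapsed.
    """
    fraction = min(max(frame / ANNEAL_FRAMES, 0.0), 1.0)
    return 1.0 + fraction * (final_epsilon - 1.0)
```

For example, `annealed_epsilon(0.5, 2_000_000)` is halfway between 1 and 0.5 under the assumed linear schedule.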
The initial learning rate was sampled from a $\mathrm{LogUniform}(10^{-4}, 10^{-2})$ distribution and annealed to 0 over the course of training. Note that in comparisons to prior work (Tables 1 and S3) we followed standard evaluation protocol and used fixed hyperparameters.

9. Continuous Action Control Using the MuJoCo Physics Simulator

To apply the asynchronous advantage actor-critic algorithm to the MuJoCo tasks the necessary setup is nearly identical to that used in the discrete action domains, so here we enumerate only the differences required for the continuous action domains. The essential elements for many of the tasks (i.e. the physics models and task objectives) are nearly identical to the tasks examined in (Lillicrap et al., 2015). However, the rewards and thus performance are not comparable for most of the tasks due to changes made by the developers of MuJoCo which altered the contact model.

For all the domains we attempted to learn the task using the physical state as input. The physical state consisted of the joint positions and velocities as well as the target position if the task required a target. In addition, for three of the tasks (pendulum, pointmass2D, and gripper) we also examined training directly from RGB pixel inputs. In the low dimensional physical state case, the inputs are mapped to a hidden state using one hidden layer with 200 ReLU units. In the cases where we used pixels, the input was passed through two layers of spatial convolutions without any non-linearity or pooling. In either case, the output of the encoder layers was fed to a single layer of 128 LSTM cells.

The most important difference in the architecture is in the output layer of the policy network. Unlike the discrete action domain where the action output is a Softmax, here the two outputs of the policy network are two real number vectors which we treat as the mean vector $\mu$ and scalar variance $\sigma^2$ of a multidimensional normal distribution with a spherical covariance.
To act, the input is passed through the model to the output layer where we sample from the normal distribution determined by $\mu$ and $\sigma^2$. In practice, $\mu$ is modeled by a linear layer and $\sigma^2$ by a SoftPlus operation, $\log(1 + \exp(x))$, as the activation computed as a function of the output of a linear layer. In our experiments with continuous control problems the policy network and value network do not share any parameters, though this detail is unlikely to be crucial. Finally, since the episodes were typically at most several hundred time steps long, we did not use any bootstrapping in the policy or value function updates and batched each episode into a single update.

As in the discrete action case, we included an entropy cost which encouraged exploration. In the continuous

case we used a cost on the differential entropy of the normal distribution defined by the output of the actor network, $-\frac{1}{2}(\log(2\pi\sigma^2) + 1)$, with a constant multiplier of $10^{-4}$ for this cost across all of the tasks examined. The asynchronous advantage actor-critic algorithm finds solutions for all the domains. Figure S8 shows learning curves against wall-clock time, and demonstrates that most of the domains from states can be solved within a few hours. All of the experiments, including those done from pixel based observations, were run on CPU. Even in the case of solving the domains directly from pixel inputs we found that it was possible to reliably discover solutions within 24 hours. Figure S7 shows scatter plots of the top scores against the sampled learning rates. In most of the domains there is a large range of learning rates that consistently achieve good performance on the task.

Algorithm S2 Asynchronous n-step Q-learning - pseudocode for each actor-learner thread.

// Assume global shared parameter vector $\theta$.
// Assume global shared target parameter vector $\theta^-$.
// Assume global shared counter $T = 0$.
Initialize thread step counter $t \leftarrow 1$
Initialize target network parameters $\theta^- \leftarrow \theta$
Initialize thread-specific parameters $\theta' = \theta$
Initialize network gradients $d\theta \leftarrow 0$
repeat
    Clear gradients $d\theta \leftarrow 0$
    Synchronize thread-specific parameters $\theta' = \theta$
    $t_{start} = t$
    Get state $s_t$
    repeat
        Take action $a_t$ according to the $\epsilon$-greedy policy based on $Q(s_t, a; \theta')$
        Receive reward $r_t$ and new state $s_{t+1}$
        $t \leftarrow t + 1$
        $T \leftarrow T + 1$
    until terminal $s_t$ or $t - t_{start} == t_{max}$
    $R = 0$ for terminal $s_t$; $R = \max_a Q(s_t, a; \theta^-)$ for non-terminal $s_t$
    for $i \in \{t - 1, \ldots, t_{start}\}$ do
        $R \leftarrow r_i + \gamma R$
        Accumulate gradients wrt $\theta'$: $d\theta \leftarrow d\theta + \frac{\partial (R - Q(s_i, a_i; \theta'))^2}{\partial \theta'}$
    end for
    Perform asynchronous update of $\theta$ using $d\theta$.
    if $T \bmod I_{target} == 0$ then
        $\theta^- \leftarrow \theta$
    end if
until $T > T_{max}$
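The backward recursion $R \leftarrow r_i + \gamma R$ in Algorithm S2 can be made concrete with a short sketch. It only computes the n-step targets for one rollout and leaves the network and gradient step out; the function names `bootstrap_value` and `n_step_targets` are hypothetical and chosen for this example.

```python
def bootstrap_value(target_q_values, terminal):
    """Initial R: 0 for a terminal state, else max_a Q(s_t, a; theta^-).

    target_q_values is assumed to hold the target network's action values
    for the last state of the rollout.
    """
    return 0.0 if terminal else max(target_q_values)


def n_step_targets(rewards, initial_return, gamma=0.99):
    """Return the targets R_i for i = t_start .. t-1 of one rollout.

    The loop runs from the last step to the first, as in Algorithm S2,
    so each target reuses the return already computed for the next step.
    """
    ret = initial_return
    targets = []
    for reward in reversed(rewards):
        ret = reward + gamma * ret
        targets.append(ret)
    targets.reverse()  # restore chronological order
    return targets
```

Each target `targets[i]` would then be regressed against $Q(s_i, a_i; \theta')$ with the squared error used in the accumulation step.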

Algorithm S3 Asynchronous advantage actor-critic - pseudocode for each actor-learner thread.

// Assume global shared parameter vectors $\theta$ and $\theta_v$ and global shared counter $T = 0$
// Assume thread-specific parameter vectors $\theta'$ and $\theta'_v$
Initialize thread step counter $t \leftarrow 1$
repeat
    Reset gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$.
    Synchronize thread-specific parameters $\theta' = \theta$ and $\theta'_v = \theta_v$
    $t_{start} = t$
    Get state $s_t$
    repeat
        Perform $a_t$ according to policy $\pi(a_t | s_t; \theta')$
        Receive reward $r_t$ and new state $s_{t+1}$
        $t \leftarrow t + 1$
        $T \leftarrow T + 1$
    until terminal $s_t$ or $t - t_{start} == t_{max}$
    $R = 0$ for terminal $s_t$; $R = V(s_t, \theta'_v)$ for non-terminal $s_t$ // Bootstrap from last state
    for $i \in \{t - 1, \ldots, t_{start}\}$ do
        $R \leftarrow r_i + \gamma R$
        Accumulate gradients wrt $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i | s_i; \theta')(R - V(s_i; \theta'_v))$
        Accumulate gradients wrt $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial (R - V(s_i; \theta'_v))^2 / \partial \theta'_v$
    end for
    Perform asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$.
until $T > T_{max}$
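As a rough illustration of the two accumulation steps in Algorithm S3, the sketch below computes the per-step gradient signals at the network outputs for a softmax policy. It is an assumption of this example that gradients are taken with respect to the logits and the value output rather than the parameters $\theta'$ and $\theta'_v$ (backpropagation through the network is left out), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a3c_output_gradients(logits, actions, rewards, values, initial_return, gamma=0.99):
    """Per-step gradient signals at the network outputs for one rollout.

    policy_grads[i] is the gradient of log pi(a_i|s_i) * (R_i - V_i) with
    respect to the logits at step i, treating the advantage as a constant.
    value_grads[i] is the derivative of (R_i - V_i)^2 with respect to V_i.
    initial_return is 0 for a terminal last state, else V of the last state.
    """
    ret = initial_return
    policy_grads = [None] * len(rewards)
    value_grads = [None] * len(rewards)
    for i in reversed(range(len(rewards))):
        ret = rewards[i] + gamma * ret           # R <- r_i + gamma * R
        advantage = ret - values[i]              # R - V(s_i)
        probs = softmax(logits[i])
        # d log softmax(a_i) / d logits = onehot(a_i) - probs
        policy_grads[i] = [
            ((1.0 if a == actions[i] else 0.0) - p) * advantage
            for a, p in enumerate(probs)
        ]
        value_grads[i] = -2.0 * advantage
    return policy_grads, value_grads
```

The policy term is an ascent direction on the expected return while the value term is the gradient of a squared error to be minimized, so a training loop would apply them with opposite signs.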

[Figure S5: a 2 × 4 grid of line plots. Rows: n-step Q, A3C. Columns: Breakout, Beamrider, Seaquest, Space Invaders. Curves: SGD, RMSProp, Shared RMSProp. x-axis: model rank (1 to 50); y-axis: score.]

Figure S5. Comparison of three different optimization methods (Momentum SGD, RMSProp, Shared RMSProp) tested using two different algorithms (Async n-step Q and Async Advantage Actor-Critic) on four different Atari games (Breakout, Beamrider, Seaquest and Space Invaders). Each curve shows the final scores for 50 experiments sorted in descending order that covers a search over 50 random initializations and learning rates. The top row shows results using Async n-step Q algorithm and bottom row shows results with Async Advantage Actor-Critic. Each individual graph shows results for one of the four games and three different optimization methods. Shared RMSProp tends to be more robust to different learning rates and random initializations than Momentum SGD and RMSProp without sharing.

[Figure S6: four line plots (Slow car, no bots; Slow car, bots; Fast car, no bots; Fast car, bots). Curves: Async 1-step Q, Async SARSA, Async n-step Q, Async actor-critic, Human tester. x-axis: training time in hours (0 to 40); y-axis: score.]

Figure S6. Comparison of algorithms on the TORCS car racing simulator. Four different configurations of car speed and opponent presence or absence are shown. In each plot, all four algorithms (one-step Q, one-step Sarsa, n-step Q and Advantage Actor-Critic) are compared on score vs training time in wall clock hours. Multi-step algorithms achieve better policies much faster than one-step algorithms on all four levels. The curves show averages over the 5 best runs from 50 experiments with learning rates sampled from $\mathrm{LogUniform}(10^{-4}, 10^{-2})$ and all other hyperparameters fixed.

Figure S7. Performance for the MuJoCo continuous action domains. Scatter plot of the best score obtained against learning rates sampled from $\mathrm{LogUniform}(10^{-5}, 10^{-1})$. For nearly all of the tasks there is a wide range of learning rates that lead to good performance on the task.

Figure S8. Score per episode vs wall-clock time plots for the MuJoCo domains. Each plot shows error bars for the top 5 experiments.

[Figure S9: five line plots for 1-step SARSA (Beamrider, Breakout, Pong, Q*bert, Space Invaders). Curves: 1, 2, 4, 8 and 16 threads. x-axis: training epochs (0 to 40); y-axis: score.]

Figure S9. Data efficiency comparison of different numbers of actor-learners for one-step Sarsa on five Atari games. The x-axis shows the total number of training epochs where an epoch corresponds to four million frames (across all threads). The y-axis shows the average score. Each curve shows the average of the three best performing agents from a search over 50 random learning rates. Sarsa shows increased data efficiency with increased numbers of parallel workers.

[Figure S10: five panels (Beamrider, Breakout, Pong, Q*bert, Space Invaders), each plotting score against training time in hours for 1-step SARSA with 1, 2, 4, 8, and 16 threads.]
Figure S10. Training speed comparison of different numbers of actor-learners for one-step Sarsa on five Atari games. The x-axis shows training time in hours while the y-axis shows the average score. Each curve shows the average of the three best performing agents from a search over 50 random learning rates. Sarsa shows significant speedups from using greater numbers of parallel actor-learners.
[Figure S11: three rows of panels (1-step Q, 1-step SARSA, n-step Q) across five games (Beamrider, Breakout, Pong, Q*bert, Space Invaders), each plotting score against learning rate on a logarithmic axis from 10^{-4} to 10^{-2}.]
Figure S11. Scatter plots of scores obtained by one-step Q, one-step Sarsa, and n-step Q on five games (Beamrider, Breakout, Pong, Q*bert, Space Invaders) for 50 different learning rates and random initializations. All algorithms exhibit some level of robustness to the choice of learning rate.

| Game | DQN | Gorila | Double | Dueling | Prioritized | A3C FF, 1 day | A3C FF | A3C LSTM |
|---|---|---|---|---|---|---|---|---|
| Alien | 570.2 | 813.5 | 1033.4 | 1486.5 | 900.5 | 182.1 | 518.4 | 945.3 |
| Amidar | 133.4 | 189.2 | 169.1 | 172.7 | 218.4 | 283.9 | 263.9 | 173.0 |
| Assault | 3332.3 | 1195.8 | 6060.8 | 3994.8 | 7748.5 | 3746.1 | 5474.9 | 14497.9 |
| Asterix | 124.5 | 3324.7 | 16837.0 | 15840.0 | 31907.5 | 6723.0 | 22140.5 | 17244.5 |
| Asteroids | 697.1 | 933.6 | 1193.2 | 2035.4 | 1654.0 | 3009.4 | 4474.5 | 5093.1 |
| Atlantis | 76108.0 | 629166.5 | 319688.0 | 445360.0 | 593642.0 | 772392.0 | 911091.0 | 875822.0 |
| Bank Heist | 176.3 | 399.4 | 886.0 | 1129.3 | 816.8 | 946.0 | 970.1 | 932.8 |
| Battle Zone | 17560.0 | 19938.0 | 24740.0 | 31320.0 | 29100.0 | 11340.0 | 12950.0 | 20760.0 |
| Beam Rider | 8672.4 | 3822.1 | 17417.2 | 14591.3 | 26172.7 | 13235.9 | 22707.9 | 24622.2 |
| Berzerk | | | 1011.1 | 910.6 | 1165.6 | 1433.4 | 817.9 | 862.2 |
| Bowling | 41.2 | 54.0 | 69.6 | 65.7 | 65.8 | 36.2 | 35.1 | 41.8 |
| Boxing | 25.8 | 74.2 | 73.5 | 77.3 | 68.6 | 33.7 | 59.8 | 37.3 |
| Breakout | 303.9 | 313.0 | 368.9 | 411.6 | 371.6 | 551.6 | 681.9 | 766.8 |
| Centipede | 3773.1 | 6296.9 | 3853.5 | 4881.0 | 3421.9 | 3306.5 | 3755.8 | 1997.0 |
| Chopper Command | 3046.0 | 3191.8 | 3495.0 | 3784.0 | 6604.0 | 4669.0 | 7021.0 | 10150.0 |
| Crazy Climber | 50992.0 | 65451.0 | 113782.0 | 124566.0 | 131086.0 | 101624.0 | 112646.0 | 138518.0 |
| Defender | | | 27510.0 | 33996.0 | 21093.5 | 36242.5 | 56533.0 | 233021.5 |
| Demon Attack | 12835.2 | 14880.1 | 69803.4 | 56322.8 | 73185.8 | 84997.5 | 113308.4 | 115201.9 |
| Double Dunk | -21.6 | -11.3 | -0.3 | -0.8 | 2.7 | 0.1 | -0.1 | 0.1 |
| Enduro | 475.6 | 71.0 | 1216.6 | 2077.4 | 1884.4 | -82.2 | -82.5 | -82.5 |
| Fishing Derby | -2.3 | 4.6 | 3.2 | -4.1 | 9.2 | 13.6 | 18.8 | 22.6 |
| Freeway | 25.8 | 10.2 | 28.8 | 0.2 | 27.9 | 0.1 | 0.1 | 0.1 |
| Frostbite | 157.4 | 426.6 | 1448.1 | 2332.4 | 2930.2 | 180.1 | 190.5 | 197.6 |
| Gopher | 2731.8 | 4373.0 | 15253.0 | 20051.4 | 57783.8 | 8442.8 | 10022.8 | 17106.8 |
| Gravitar | 216.5 | 538.4 | 200.5 | 297.0 | 218.0 | 269.5 | 303.5 | 320.0 |
| H.E.R.O. | 12952.5 | 8963.4 | 14892.5 | 15207.9 | 20506.4 | 28765.8 | 32464.1 | 28889.5 |
| Ice Hockey | -3.8 | -1.7 | -2.5 | -1.3 | -1.0 | -4.7 | -2.8 | -1.7 |
| James Bond | 348.5 | 444.0 | 573.0 | 835.5 | 3511.5 | 351.5 | 541.0 | 613.0 |
| Kangaroo | 2696.0 | 1431.0 | 11204.0 | 10334.0 | 10241.0 | 106.0 | 94.0 | 125.0 |
| Krull | 3864.0 | 6363.1 | 6796.1 | 8051.6 | 7406.5 | 8066.6 | 5560.0 | 5911.4 |
| Kung-Fu Master | 11875.0 | 20620.0 | 30207.0 | 24288.0 | 31244.0 | 3046.0 | 28819.0 | 40835.0 |
| Montezuma’s Revenge | 50.0 | 84.0 | 42.0 | 22.0 | 13.0 | 53.0 | 67.0 | 41.0 |
| Ms. Pacman | 763.5 | 1263.0 | 1241.3 | 2250.6 | 1824.6 | 594.4 | 653.7 | 850.7 |
| Name This Game | 5439.9 | 9238.5 | 8960.3 | 11185.1 | 11836.1 | 5614.0 | 10476.1 | 12093.7 |
| Phoenix | | | 12366.5 | 20410.5 | 27430.1 | 28181.8 | 52894.1 | 74786.7 |
| Pit Fall | | | -186.7 | -46.9 | -14.8 | -123.0 | -78.5 | -135.7 |
| Pong | 16.2 | 16.7 | 19.1 | 18.8 | 18.9 | 11.4 | 5.6 | 10.7 |
| Private Eye | 298.2 | 2598.6 | -575.5 | 292.6 | 179.0 | 194.4 | 206.9 | 421.1 |
| Q*Bert | 4589.8 | 7089.8 | 11020.8 | 14175.8 | 11277.0 | 13752.3 | 15148.8 | 21307.5 |
| River Raid | 4065.3 | 5310.3 | 10838.4 | 16569.4 | 18184.4 | 10001.2 | 12201.8 | 6591.9 |
| Road Runner | 9264.0 | 43079.8 | 43156.0 | 58549.0 | 56990.0 | 31769.0 | 34216.0 | 73949.0 |
| Robotank | 58.5 | 61.8 | 59.1 | 62.0 | 55.4 | 2.3 | 32.8 | 2.6 |
| Seaquest | 2793.9 | 10145.9 | 14498.0 | 37361.6 | 39096.7 | 2300.2 | 2355.4 | 1326.1 |
| Skiing | | | -11490.4 | -11928.0 | -10852.8 | -13700.0 | -10911.1 | -14863.8 |
| Solaris | | | 810.0 | 1768.4 | 2238.2 | 1884.8 | 1956.0 | 1936.4 |
| Space Invaders | 1449.7 | 1183.3 | 2628.7 | 5993.1 | 9063.0 | 2214.7 | 15730.5 | 23846.0 |
| Star Gunner | 34081.0 | 14919.2 | 58365.0 | 90804.0 | 51959.0 | 64393.0 | 138218.0 | 164766.0 |
| Surround | | | 1.9 | 4.0 | -0.9 | -9.6 | -9.7 | -8.3 |
| Tennis | -2.3 | -0.7 | -7.8 | 4.4 | -2.0 | -10.2 | -6.3 | -6.4 |
| Time Pilot | 5640.0 | 8267.8 | 6608.0 | 6601.0 | 7448.0 | 5825.0 | 12679.0 | 27202.0 |
| Tutankham | 32.4 | 118.5 | 92.2 | 48.0 | 33.6 | 26.1 | 156.3 | 144.2 |
| Up and Down | 3311.3 | 8747.7 | 19086.9 | 24759.2 | 29443.7 | 54525.4 | 74705.7 | 105728.7 |
| Venture | 54.0 | 523.4 | 21.0 | 200.0 | 244.0 | 19.0 | 23.0 | 25.0 |
| Video Pinball | 20228.1 | 112093.4 | 367823.7 | 110976.2 | 374886.9 | 185852.6 | 331628.1 | 470310.5 |
| Wizard of Wor | 246.0 | 10431.0 | 6201.0 | 7054.0 | 7451.0 | 5278.0 | 17244.0 | 18082.0 |
| Yars Revenge | | | 6270.6 | 25976.5 | 5965.1 | 7270.8 | 7157.5 | 5615.5 |
| Zaxxon | 831.0 | 6159.4 | 8593.0 | 10164.0 | 9501.0 | 2659.0 | 24622.0 | 23519.0 |

Table S3. Raw scores for the human start condition (30 minutes emulator time). DQN scores taken from (Nair et al., 2015). Double DQN scores taken from (Van Hasselt et al., 2015), Dueling scores from (Wang et al., 2015) and Prioritized scores taken from (Schaul et al., 2015).