Distributional Reinforcement Learning with Quantile Regression

In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions,rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare,Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN , including the related distributional algorithm C 51.
展开查看详情

1. Distributional Reinforcement Learning with Quantile Regression Will Dabney Mark Rowland Marc G. Bellemare R´emi Munos DeepMind University of Cambridge∗ Google Brain DeepMind arXiv:1710.10044v1 [cs.AI] 27 Oct 2017 Abstract the-art on the suite of benchmark Atari 2600 games (Belle- mare et al. 2013). In reinforcement learning an agent interacts with the environ- One of the theoretical contributions of the C 51 work was ment by taking actions and observing the next state and re- a proof that the distributional Bellman operator is a contrac- ward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the ob- tion in a maximal form of the Wasserstein metric between served long-term return. Traditionally, reinforcement learn- probability distributions. In this context, the Wasserstein ing algorithms average over this randomness to estimate the metric is particularly interesting because it does not suffer value function. In this paper, we build on recent work ad- from disjoint-support issues (Arjovsky, Chintala, and Bot- vocating a distributional approach to reinforcement learning tou 2017) which arise when performing Bellman updates. in which the distribution over returns is modeled explicitly Unfortunately, this result does not directly lead to a practical instead of only estimating the mean. That is, we examine algorithm: as noted by the authors, and further developed by methods of learning the value distribution instead of the value Bellemare et al. (2017), the Wasserstein metric, viewed as a function. We give results that close a number of gaps between loss, cannot generally be minimized using stochastic gradi- the theoretical and algorithmic results given by Bellemare, ent methods. Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present This negative result left open the question as to whether it a novel distributional reinforcement learning algorithm con- is possible to devise an online distributional reinforcement sistent with our theoretical formulation. Finally, we evaluate learning algorithm which takes advantage of the contraction this new algorithm on the Atari 2600 games, observing that result. Instead, the C 51 algorithm first performs a heuristic it significantly outperforms many of the recent improvements projection step, followed by the minimization of a KL di- on DQN, including the related distributional algorithm C 51. vergence between projected Bellman update and prediction. The work therefore leaves a theory-practice gap in our un- derstanding of distributional reinforcement learning, which Introduction makes it difficult to explain the good performance of C 51. In reinforcement learning, the value of an action a in state s Thus, the existence of a distributional algorithm that oper- describes the expected return, or discounted sum of rewards, ates end-to-end on the Wasserstein metric remains an open obtained from beginning in that state, choosing action a, and question. subsequently following a prescribed policy. Because know- In this paper, we answer this question affirmatively. By ing this value for the optimal policy is sufficient to act opti- appealing to the theory of quantile regression (Koenker mally, it is the object modelled by classic value-based meth- 2005), we show that there exists an algorithm, applicable in ods such as SARSA (Rummery and Niranjan 1994) and Q- a stochastic approximation setting, which can perform distri- Learning (Watkins and Dayan 1992), which use Bellman’s butional reinforcement learning over the Wasserstein metric. equation (Bellman 1957) to efficiently reason about value. Our method relies on the following techniques: Recently, Bellemare, Dabney, and Munos (2017) showed • We “transpose” the parametrization from C 51: whereas that the distribution of the random returns, whose expecta- the former uses N fixed locations for its approxima- tion constitutes the aforementioned value, can be described tion distribution and adjusts their probabilities, we assign by the distributional analogue of Bellman’s equation, echo- fixed, uniform probabilities to N adjustable locations; ing previous results in risk-sensitive reinforcement learning • We show that quantile regression may be used to stochas- (Heger 1994; Morimura et al. 2010; Chow et al. 2015). In tically adjust the distributions’ locations so as to minimize this previous work, however, the authors argued for the use- the Wasserstein distance to a target distribution. fulness in modeling this value distribution in and of itself. Their claim was asserted by exhibiting a distributional rein- • We formally prove contraction mapping results for our forcement learning algorithm, C 51, which achieved state-of- overall algorithm, and use these results to conclude that our method performs distributional RL end-to-end under ∗ Contributed during an internship at DeepMind. the Wasserstein metric, as desired.

2. The main interest of the original distributional algorithm was its state-of-the-art performance, despite still acting by maximizing expectations. One might naturally expect that a q1 T ⇡Z DKL ( T ⇡ ZkZ) direct minimization of the Wasserstein metric, rather than its heuristic approximation, may yield even better results. We q2 T ⇡Z derive the Q-Learning analogue for our method (QR - DQN), ⇡ apply it to the same suite of Atari 2600 games, and find that it T Z achieves even better performance. By using a smoothed ver- z1 z2 4z 4z 4z 4z sion of quantile regression, Huber quantile regression, we gain an impressive 33% median score increment over the al- Figure 1: Projection used by C 51 assigns mass inversely ready state-of-the-art C 51. proportional to distance from nearest support. Update mini- mizes KL between projected target and estimate. Distributional RL We model the agent-environment interactions by a Markov decision process (MDP) (X , A, R, P, γ) (Puterman 1994), Similarly, the value distribution can be computed through with X and A the state and action spaces, R the random dynamic programming using a distributional Bellman oper- variable reward function, P (x |x, a) the probability of tran- ator (Bellemare, Dabney, and Munos 2017), sitioning from state x to state x after taking action a, and D T π Z(x, a) := R(x, a) + γZ(x , a ), (3) γ ∈ [0, 1) the discount factor. A policy π(·|x) maps each state x ∈ X to a distribution over A. x ∼ P (·|x, a), a ∼ π(·|x ), ∞ For a fixed policy π, the return, Z π = t=0 γ t Rt , is a D where Y := U denotes equality of probability laws, that is random variable representing the sum of discounted rewards the random variable Y is distributed according to the same observed along one trajectory of states while following π. law as U . Standard RL algorithms estimate the expected value of Z π , The C 51 algorithm models Z π (x, a) using a discrete dis- the value function, tribution supported on a “comb” of fixed locations z1 ≤ ∞ · · · ≤ zN uniformly spaced over a predetermined interval. V π (x) := E [Z π (x)] = E γ t R(xt , at ) | x0 = x . The parameters of that distribution are the probabilities qi , t=0 expressed as logits, associated with each location zi . Given (1) a current value distribution, the C 51 algorithm applies a pro- Similarly, many RL algorithms estimate the action-value jection step Φ to map the target T π Z onto its finite ele- function, ment support, followed by a Kullback-Leibler (KL) mini- ∞ mization step (see Figure 1). C 51 achieved state-of-the-art Qπ (x, a) := E [Z π (x, a)] = E γ t R(xt , at ) , (2) performance on Atari 2600 games, but did so with a clear disconnect with the theoretical results of Bellemare, Dab- t=0 ney, and Munos (2017). We now review these results before xt ∼ P (·|xt−1 , at−1 ), at ∼ π(·|xt ), x0 = x, a0 = a. extending them to the case of approximate distributions. The -greedy policy on Qπ chooses actions uniformly at random with probability and otherwise according to The Wasserstein Metric arg maxa Qπ (x, a). The p-Wasserstein metric Wp , for p ∈ [1, ∞], also known In distributional RL the distribution over returns (i.e. the as the Mallows metric (Bickel and Freedman 1981) or the probability law of Z π ), plays the central role and replaces Earth Mover’s Distance (EMD) when p = 1 (Levina and the value function. We will refer to the value distribution Bickel 2001), is an integral probability metric between dis- by its random variable. When we say that the value func- tributions. The p-Wasserstein distance is characterized as the tion is the mean of the value distribution we are saying Lp metric on inverse cumulative distribution functions (in- that the value function is the expected value, taken over verse CDFs) (M¨uller 1997). That is, the p-Wasserstein met- all sources of intrinsic randomness (Goldstein, Misra, and ric between distributions U and Y is given by,1 Courtage 1981), of the value distribution. This should high- 1 1/p light that the value distribution is not designed to capture Wp (U, Y ) = |FY−1 (ω) − FU−1 (ω)|p dω , (4) the uncertainty in the estimate of the value function (Dear- 0 den, Friedman, and Russell 1998; Engel, Mannor, and Meir where for a random variable Y , the inverse CDF FY−1 of Y 2005), that is the parametric uncertainty, but rather the ran- is defined by domness in the returns intrinsic to the MDP. Temporal difference (TD) methods significantly speed up FY−1 (ω) := inf{y ∈ R : ω ≤ FY (y)} , (5) the learning process by incrementally improving an estimate where FY (y) = P r(Y ≤ y) is the CDF of Y . Figure 2 il- of Qπ using dynamic programming through the Bellman op- lustrates the 1-Wasserstein distance as the area between two erator (Bellman 1957), CDFs. T π Q(x, a) = E [R(x, a)] + γEP,π [Q(x , a )] . 1 For p = ∞, W∞ (Y, U ) = supω∈[0,1] |FY−1 (ω) − FU−1 (ω)|.

3. Recently, the Wasserstein metric has been the focus of in- ⌧4 = 1 creased research due to its appealing properties of respect- Z2Z ing the underlying metric distances between outcomes (Ar- ⌧ˆ4 q4 ⇧W 1 Z 2 Z Q Probability Space jovsky, Chintala, and Bottou 2017; Bellemare et al. 2017). ⌧3 Unlike the Kullback-Leibler divergence, the Wasserstein ⌧ˆ3 q3 metric is a true probability metric and considers both the probability of and the distance between various outcome ⌧2 events. These properties make the Wasserstein well-suited to ⌧ˆ2 q2 domains where an underlying similarity in outcome is more ⌧1 important than exactly matching likelihoods. ⌧ˆ1 q1 Convergence of Distributional Bellman Operator 1 In the context of distributional RL, let Z be the space of ⌧0 = 0 z1 = FZ (ˆ ⌧1 ) z2 z3 z4 action-value distributions with finite moments: Space of Returns Z = {Z : X × A → P(R)| Figure 2: 1-Wasserstein minimizing projection onto N = 4 E [|Z(x, a)|p ] < ∞, ∀(x, a), p ≥ 1}. uniformly weighted Diracs. Shaded regions sum to form the Then, for two action-value distributions Z1 , Z2 ∈ Z, we will 1-Wasserstein error. use the maximal form of the Wasserstein metric introduced by (Bellemare, Dabney, and Munos 2017), al. (2010) parameterize a value distribution with the mean d¯p (Z1 , Z2 ) := sup Wp (Z1 (x, a), Z2 (x, a)). (6) and scale of a Gaussian or Laplace distribution, and min- x,a imize the KL divergence between the target T π Z and the It was shown that d¯p is a metric over value distributions. Fur- prediction Z. They demonstrate that value distributions thermore, the distributional Bellman operator T π is a con- learned in this way are sufficient to perform risk-sensitive Q- traction in d¯p , a result that we now recall. Learning. However, any theoretical guarantees derived from their method can only be asymptotic; the Bellman operator Lemma 1 (Lemma 3, Bellemare, Dabney, and Munos 2017). is at best a non-expansion in KL divergence. T π is a γ-contraction: for any two Z1 , Z2 ∈ Z, d¯p (T π Z1 , T π Z2 ) ≤ γ d¯p (Z1 , Z2 ). Approximately Minimizing Wasserstein Recall that C 51 approximates the distribution at each Lemma 1 tells us that d¯p is a useful metric for studying state by attaching variable (parametrized) probabilities the behaviour of distributional reinforcement learning algo- q1 , . . . , qN to fixed locations z1 ≤ · · · ≤ zN . Our approach rithms, in particular to show their convergence to the fixed is to “transpose” this parametrization by considering fixed point Z π . Moreover, the lemma suggests that an effective probabilities but variable locations. Specifically, we take way in practice to learn value distributions is to attempt to uniform weights, so that qi = 1/N for each i = 1, . . . , N . minimize the Wasserstein distance between a distribution Z Effectively, our new approximation aims to estimate and its Bellman update T π Z, analogous to the way that TD- quantiles of the target distribution. Accordingly, we will call learning attempts to iteratively minimize the L2 distance be- it a quantile distribution, and let ZQ be the space of quan- tween Q and T Q. tile distributions for fixed N . We will denote the cumulative Unfortunately, another result shows that we cannot in gen- probabilities associated with such a distribution (that is, the eral minimize the Wasserstein metric (viewed as a loss) us- discrete values taken on by the CDF) by τ1 , . . . , τN , so that ing stochastic gradient descent. τi = Ni for i = 1, . . . , N . We will also write τ0 = 0 to Theorem 1 (Theorem 1, Bellemare et al. 2017). Let Yˆm := simplify notation. m m 1 i=1 δYi be the empirical distribution derived from sam- Formally, let θ : X ×A → RN be some parametric model. ples Y1 , . . . , Ym drawn from a Bernoulli distribution B. Let A quantile distribution Zθ ∈ ZQ maps each state-action Bµ be a Bernoulli distribution parametrized by µ, the prob- pair (x, a) to a uniform probability distribution supported ability of the variable taking the value 1. Then the minimum on {θi (x, a)}. That is, of the expected sample loss is in general different from the N minimum of the true Wasserstein loss; that is, 1 Zθ (x, a) := N δθi (x,a) , (7) arg min E Wp (Yˆm , Bµ ) = arg min Wp (B, Bµ ). i=1 µ Y1:m µ where δz denotes a Dirac at z ∈ R. This issue becomes salient in a practical context, where Compared to the original parametrization, the benefits of the value distribution must be approximated. Crucially, a parameterized quantile distribution are threefold. First, (1) the C 51 algorithm is not guaranteed to minimize any p- we are not restricted to prespecified bounds on the support, Wasserstein metric. This gap between theory and practice or a uniform resolution, potentially leading to significantly in distributional RL is not restricted to C 51. Morimura et more accurate predictions when the range of returns vary

4.greatly across states. This also (2) lets us do away with the Proposition 1. Let Zθ be a quantile distribution, and Zˆm unwieldy projection step present in C 51, as there are no is- the empirical distribution composed of m samples from Z. sues of disjoint supports. Together, these obviate the need Then for all p ≥ 1, there exists a Z such that for domain knowledge about the bounds of the return dis- tribution when applying the algorithm to new tasks. Finally, arg min E[Wp (Zˆm , Zθ )] = arg min Wp (Z, Zθ ). (3) this reparametrization allows us to minimize the Wasser- However, there is a method, more widely used in eco- stein loss, without suffering from biased gradients, specifi- nomics than machine learning, for unbiased stochastic ap- cally, using quantile regression. proximation of the quantile function. Quantile regression, The Quantile Approximation and conditional quantile regression, are methods for ap- proximating the quantile functions of distributions and con- It is well-known that in reinforcement learning, the use of ditional distributions respectively (Koenker 2005). These function approximation may result in instabilities in the methods have been used in a variety of settings where learning process (Tsitsiklis and Van Roy 1997). Specifically, outcomes have intrinsic randomness (Koenker and Hallock the Bellman update projected onto the approximation space 2001); from food expenditure as a function of household in- may no longer be a contraction. In our case, we analyze the come (Engel 1857), to studying value-at-risk in economic distributional Bellman update, projected onto a parameter- models (Taylor 1999). ized quantile distribution, and prove that the combined op- The quantile regression loss, for quantile τ ∈ [0, 1], is erator is a contraction. an asymmetric convex loss function that penalizes overesti- Quantile Projection We are interested in quantifying the mation errors with weight τ and underestimation errors with projection of an arbitrary value distribution Z ∈ Z onto ZQ , weight 1−τ . For a distribution Z, and a given quantile τ , the that is value of the quantile function FZ−1 (τ ) may be characterized ΠW1 Z := arg min W1 (Z, Zθ ), as the minimizer of the quantile regression loss Zθ ∈ZQ Let Y be a distribution with bounded first moment and U LτQR (θ) := EZ∼Z ˆ [ρτ (Zˆ − θ)] , where a uniform distribution over N Diracs as in (7), with support ρτ (u) = u(τ − δ{u<0} ), ∀u ∈ R. (8) {θ1 , . . . , θN }. Then More generally, by Lemma 2 we have that the minimizing N τi values of {θ1 , . . . , θN } for W1 (Z, Zθ ) are those that mini- W1 (Y, U ) = |FY−1 (ω) − θi |dω. mize the following objective: i=1 τi−1 N EZ∼Z ˆ [ρτˆi (Zˆ − θi )] Lemma 2. For any τ, τ ∈ [0, 1] with τ < τ and cumulative i=1 distribution function F with inverse F −1 , the set of θ ∈ R minimizing In particular, this loss gives unbiased sample gradients. τ As a result, we can find the minimizing {θ1 , . . . , θN } by −1 |F (ω) − θ|dω , stochastic gradient descent. τ is given by Quantile Huber Loss The quantile regression loss is not smooth at zero; as u → 0+ , the gradient of Equation 8 stays τ +τ constant. We hypothesized that this could limit performance θ ∈ R F (θ) = . 2 when using non-linear function approximation. To this end, we also consider a modified quantile loss, called the quantile In particular, if F −1 is the inverse CDF, then F −1 ((τ + Huber loss.3 This quantile regression loss acts as an asym- τ )/2) is always a valid minimizer, and if F −1 is continuous metric squared loss in an interval [−κ, κ] around zero and at (τ + τ )/2, then F −1 ((τ + τ )/2) is the unique minimizer. reverts to a standard quantile loss outside this interval. These quantile midpoints will be denoted by τˆi = τi−12+τi The Huber loss is given by (Huber 1964), for 1 ≤ i ≤ N . Therefore, by Lemma 2, the values 1 2 for {θ1 , θ1 , . . . , θN } that minimize W1 (Y, U ) are given by 2u , if |u| ≤ κ Lκ (u) = 1 . (9) θi = FY−1 (ˆ τi ). Figure 2 shows an example of the quantile κ(|u| − 2 κ), otherwise projection ΠW1 Z minimizing the 1-Wasserstein distance to Z.2 The quantile Huber loss is then simply the asymmetric vari- ant of the Huber loss, Quantile Regression ρκτ (u) = |τ − δ{u<0} |Lκ (u). (10) The original proof of Theorem 1 only states the existence of a distribution whose gradients are biased. As a result, we For notational simplicity we will denote ρ0τ = ρτ , that is, it might hope that our quantile parametrization leads to unbi- will revert to the standard quantile regression loss. ased gradients. Unfortunately, this is not true. 3 Our quantile Huber loss is related to, but distinct from that of 2 We save proofs for the appendix due to space limitations. Aravkin et al. (2014).

5.Combining Projection and Bellman Update Quantile Regression DQN We are now in a position to prove our main result, which Q-Learning is an off-policy reinforcement learning algo- states that the combination of the projection implied by rithm built around directly learning the optimal action-value quantile regression with the Bellman operator is a contrac- function using the Bellman optimality operator (Watkins and tion. The result is in ∞-Wasserstein metric, i.e. the size of Dayan 1992), the largest gap between the two CDFs. Proposition 2. Let ΠW1 be the quantile projection defined T Q(x, a) = E [R(x, a)] + γ E max Q(x , a ) . x ∼P a as above, and when applied to value distributions gives the projection for each state-value distribution. For any two The distributional variant of this is to estimate a state- value distributions Z1 , Z2 ∈ Z for an MDP with countable action value distribution and apply a distributional Bellman state and action spaces, optimality operator, d¯∞ (ΠW1 T π Z1 , ΠW1 T π Z2 ) ≤ γ d¯∞ (Z1 , Z2 ). (11) T Z(x, a) = R(x, a) + γZ(x , a ), (13) We therefore conclude that the combined operator x ∼ P (·|x, a), a = arg maxa Ez∼Z(x ,a ) [z] . ΠW1 T π has a unique fixed point Zˆ π , and the repeated appli- Notice in particular that the action used for the next state is cation of this operator, or its stochastic approximation, con- the greedy action with respect to the mean of the next state- verges to Zˆ π . Because d¯p ≤ d¯∞ , we conclude that conver- action value distribution. gence occurs for all p ∈ [1, ∞]. Interestingly, the contraction For a concrete algorithm we will build on the DQN archi- property does not directly hold for p < ∞; see Lemma 5 in tecture (Mnih et al. 2015). We focus on the minimal changes the appendix. necessary to form a distributional version of DQN. Specifi- cally, we require three modifications to DQN. First, we use a Distributional RL using Quantile Regression nearly identical neural network architecture as DQN, only We can now form a complete algorithmic approach to distri- changing the output layer to be of size |A| × N , where butional RL consistent with our theoretical results. That is, N is a hyper-parameter giving the number of quantile tar- approximating the value distribution with a parameterized gets. Second, we replace the Huber loss used by DQN4 , quantile distribution over the set of quantile midpoints, de- Lκ (rt + γ maxa Q(xt+1 , a ) − Q(xt , at )) with κ = 1, fined by Lemma 2. Then, training the location parameters with a quantile Huber loss (full loss given by Algorithm 1). using quantile regression (Equation 8). Finally, we replace RMSProp (Tieleman and Hinton 2012) with Adam (Kingma and Ba 2015). We call this new algo- Quantile Regression Temporal Difference Learning rithm quantile regression DQN (QR - DQN). Recall the standard TD update for evaluating a policy π, V (x) ← V (x) + α(r + γV (x ) − V (x)), Algorithm 1 Quantile Regression Q-Learning a ∼ π(·|x), r ∼ R(x, a), x ∼ P (·|x, a). Require: N, κ TD allows us to update the estimated value function with input x, a, r, x , γ ∈ [0, 1) a single unbiased sample following π. Quantile regression # Compute distributional Bellman target also allows us to improve the estimate of the quantile func- Q(x , a ) := j qj θj (x , a ) tion for some target distribution, Y (x), by observing sam- a∗ ← arg maxa Q(x, a ) ples y ∼ Y (x) and minimizing Equation 8. T θj ← r + γθj (x , a∗ ), ∀j Furthermore, we have shown that by estimating the quan- # Compute quantile regression loss (Equation 10) N tile function for well-chosen values of τ ∈ (0, 1) we can ob- output κ i=1 Ej ρτˆi (T θj − θi (x, a)) tain an approximation with minimal 1-Wasserstein distance from the original (Lemma 2). Finally, we can combine this with the distributional Bellman operator to give a target dis- Unlike C 51, QR - DQN does not require projection onto the tribution for quantile regression. This gives us the quantile approximating distribution’s support, instead it is able to ex- regression temporal difference learning (QRTD) algorithm, pand or contract the values arbitrarily to cover the true range summarized simply by the update, of return values. As an additional advantage, this means that QR - DQN does not require the additional hyper-parameter θi (x) ← θi (x) + α(ˆτi − δ{r+γz <θi (x))} ), (12) giving the bounds of the support required by C 51. The only a ∼ π(·|x), r ∼ R(x, a), x ∼ P (·|x, a), z ∼ Zθ (x ), additional hyper-parameter of QR - DQN not shared by DQN is where Zθ is a quantile distribution as in (7), and θi (x) is the the number of quantiles N , which controls with what reso- estimated value of FZ−1 π (x) (ˆ τi ) in state x. It is important to lution we approximate the value distribution. As we increase note that this update is for each value of τˆi and is defined N , QR - DQN goes from DQN to increasingly able to estimate for a single sample from the next state value distribution. the upper and lower quantiles of the value distribution. It In general it is better to draw many samples of z ∼ Z(x ) becomes increasingly capable of distinguishing low proba- and minimize the expected update. A natural approach in bility events at either end of the cumulative distribution over this case, which we use in practice, is to compute the update returns. for all pairs of (θi (x), θj (x )). Next, we turn to a control 4 DQN uses gradient clipping of the squared error that makes it algorithm and the use of non-linear function approximation. equivalent to a Huber loss with κ = 1.

6. 2 V (xS )] (a) (b) ⇡ ZM (c) C C (xS ), Z(xS )) [VM C (xS ) Z✓ (d) ⇡ FZ(xS ) Z(xS ) (e) W1 (ZM ⇡ xS xG 0 1 2 2 2 0 0 0 0 0 0 Returns Returns Episodes Figure 3: (a) Two-room windy gridworld, with wind magnitude shown along bottom row. Policy trajectory shown by blue path, with additional cycles caused by randomness shown by dashed line. (b, c) (Cumulative) Value distribution at start state xS , π estimated by MC, ZM C , and by QRTD , Zθ . (d, e) Value function (distribution) approximation errors for TD (0) and QRTD . Experimental Results In Figure 3 we show the approximation errors at xS for both algorithms with respect to the number of episodes. In (d) In the introduction we claimed that learning the distribution we evaluated, for both TD(0) and QRTD, the squared error, over returns had distinct advantages over learning the value π 2 function alone. We have now given theoretically justified al- (VM C − V (xS )) , and in (e) we show the 1-Wasserstein π metric for QRTD, W1 (ZM C (xS ), Z(xS )), where V (xS ) and gorithms for performing distributional reinforcement learn- Z(xS ) are the expected returns and value distribution at ing, QRTD for policy evaluation and QR - DQN for control. In state xS estimated by the algorithm. As expected both al- this section we will empirically validate that the proposed gorithms converge correctly in mean, and QRTD minimizes distributional reinforcement learning algorithms: (1) learn π the 1-Wasserstein distance to ZM C. the true distribution over returns, (2) show increased robust- ness during training, and (3) significantly improve sample complexity and final performance over baseline algorithms. Evaluation on Atari 2600 We now provide experimental results that demonstrate the Value Distribution Approximation Error We begin our practical advantages of minimizing the Wasserstein metric experimental results by demonstrating that QRTD actually end-to-end, in contrast to the C 51 approach. We use the 57 learns an approximate value distribution that minimizes the Atari 2600 games from the Arcade Learning Environment 1-Wasserstein to the ground truth distribution over returns. (ALE) (Bellemare et al. 2013). Both C 51 and QR - DQN build Although our theoretical results already establish conver- on the standard DQN architecture, and we expect both to gence of the former to the latter, the empirical performance benefit from recent improvements to DQN such as the du- helps to round out our understanding. eling architectures (Wang et al. 2016) and prioritized replay We use a variant of the classic windy gridworld domain (Schaul et al. 2016). However, in our evaluations we com- (Sutton and Barto 1998), modified to have two rooms and pare the pure versions of C 51 and QR - DQN without these randomness in the transitions. Figure 3(a) shows our ver- additions. We present results for both a strict quantile loss, sion of the domain, where we have combined the transition κ = 0 (QR - DQN-0), and with a Huber quantile loss with stochasticity, wind, and the doorway to produce a multi- κ = 1 (QR - DQN-1). modal distribution over returns when anywhere in the first We performed hyper-parameter tuning over a set of five room. Each state transition has probability 0.1 of moving in training games and evaluated on the full set of 57 games a random direction, otherwise the transition is affected by using these best settings (α = 0.00005, ADAM = 0.01/32, wind moving the agent northward. The reward function is and N = 200).5 As with DQN we use a target network when zero until reaching the goal state xG , which terminates the computing the distributional Bellman update. We also allow episode and gives a reward of 1.0. The discount factor is to decay at the same rate as in DQN, but to a lower value γ = 0.99. of 0.01, as is common in recent work (Bellemare, Dabney, We compute the ground truth value distribution for opti- and Munos 2017; Wang et al. 2016; van Hasselt, Guez, and mal policy π, learned by policy iteration, at each state by per- Silver 2016). forming 1K Monte-Carlo (MC) rollouts and recording the Out training procedure follows that of Mnih et al. observed returns as an empirical distribution, shown in Fig- (2015)’s, and we present results under two evaluation pro- ure 3(b). Next, we ran both TD(0) and QRTD while following tocols: best agent performance and online performance. In π for 10K episodes. Each episode begins in the designated both evaluation protocols we consider performance over all start state (xS ). Both algorithms started with a learning rate 57 Atari 2600 games, and transform raw scores into human- of α = 0.1. For QRTD we used N = 32 and drop α by half normalized scores (van Hasselt, Guez, and Silver 2016). every 2K episodes. π Let ZM C (xS ) be the MC estimated distribution over re- 5 We swept over α in (10−3 , 5 × 10−4 , 10−4 , 5 × 10−5 , 10−5 ); π turns from the start state xS , similarly VM C (xS ) its mean. ADAM in (0.01/32, 0.005/32, 0.001/32); N (10, 50, 100, 200)

7. 50% 50% 50% 50% 40% 40% 40% 40% 30% 30% 30% 30% 20% 20% 20% 20% 10% Figure 4: Online evaluation results, in human-normalized scores, over 57 Atari 2600 games for 200 million training samples. (Left) Testing performance for one seed, showing median over games. (Right) Training performance, averaged over three seeds, showing percentiles (10, 20, 30, 40, and 50) over games. Mean Median >human >DQN percentile (10th, 20th, 30th, 40th, and 50th). The upper per- DQN 228% 79% 24 0 centiles show a similar trend but are omitted here for visual DDQN 307% 118% 33 43 clarity, as their scale dwarfs the more informative lower half. D UEL . 373% 151% 37 50 From this, we can infer a few interesting results. (1) Early P RIOR . 434% 124% 39 48 in learning, most algorithms perform worse than random for P R . D UEL . 592% 172% 39 44 at least 10% of games. (2) QRTD gives similar improvements C 51 701% 178% 40 50 to sample complexity as prioritized replay, while also im- QR - DQN -0 881% 199% 38 52 proving final performance. (3) Even at 200 million frames, QR - DQN -1 915% 211% 41 54 there are 10% of games where all algorithms reach less than 10% of human. This final point in particular shows us that all of our recent advances continue to be severely limited on Table 1: Mean and median of best scores across 57 Atari a small subset of the Atari 2600 games. 2600 games, measured as percentages of human baseline (Nair et al. 2015). Conclusions The importance of the distribution over returns in reinforce- Best agent performance To provide comparable results ment learning has been (re)discovered and highlighted many with existing work we report test evaluation results un- times by now. In Bellemare, Dabney, and Munos (2017) the der the best agent protocol. Every one million training idea was taken a step further, and argued to be a central part frames, learning is frozen and the agent is evaluated for of approximate reinforcement learning. However, the paper 500K frames while recording the average return. Evalua- left open the question of whether there exists an algorithm tion episodes begin with up to 30 random no-ops (Mnih which could bridge the gap between Wasserstein-metric the- et al. 2015), and the agent uses a lower exploration rate ory and practical concerns. ( = 0.001). As training progresses we keep track of the In this paper we have closed this gap with both theoreti- best agent performance achieved thus far. cal contributions and a new algorithm which achieves state- Table 1 gives the best agent performance, at 200 million of-the-art performance in Atari 2600. There remain many frames trained, for QR - DQN, C 51, DQN, Double DQN (van promising directions for future work. Most exciting will be Hasselt, Guez, and Silver 2016), Prioritized replay (Schaul to expand on the promise of a richer policy class, made pos- et al. 2016), and Dueling architecture (Wang et al. 2016). We sible by action-value distributions. We have mentioned a few see that QR - DQN outperforms all previous agents in mean examples of such policies, often used for risk-sensitive deci- and median human-normalized score. sion making. However, there are many more possible deci- sion policies that consider the action-value distributions as a Online performance In this evaluation protocol (Fig- whole. ure 4) we track the average return attained during each test- Additionally, QR - DQN is likely to benefit from the im- ing (left) and training (right) iteration. For the testing perfor- provements on DQN made in recent years. For instance, due mance we use a single seed for each algorithm, but show on- to the similarity in loss functions and Bellman operators line performance with no form of early stopping. For train- we might expect that QR - DQN suffers from similar over- ing performance, values are averages over three seeds. In- estimation biases to those that Double DQN was designed stead of reporting only median performance, we look at the to address (van Hasselt, Guez, and Silver 2016). A natu- distribution of human-normalized scores over the full set of ral next step would be to combine QR - DQN with the non- games. Each bar represents the score distribution at a fixed distributional methods found in Table 1.

8. Acknowledgements Heger, M. 1994. Consideration of Risk in Reinforcement Learning. In Proceedings of the 11th International Confer- The authors acknowledge the vital contributions of their col- ence on Machine Learning, 105–111. leagues at DeepMind. Special thanks to Tom Schaul, Au- drunas Gruslys, Charles Blundell, and Benigno Uria for their Huber, P. J. 1964. Robust Estimation of a Location Param- early suggestions and discussions on the topic of quantile eter. The Annals of Mathematical Statistics 35(1):73–101. regression. Additionally, we are grateful for feedback from Kingma, D., and Ba, J. 2015. Adam: A Method for Stochas- David Silver, Yee Whye Teh, Georg Ostrovski, Joseph Mo- tic Optimization. Proceedings of the International Confer- dayil, Matt Hoffman, Hado van Hasselt, Ian Osband, Mo- ence on Learning Representations. hammad Azar, Tom Stepleton, Olivier Pietquin, Bilal Piot; Koenker, R., and Hallock, K. 2001. Quantile Regression: An and a second acknowledgement in particular of Tom Schaul Introduction. Journal of Economic Perspectives 15(4):43– for his detailed review of an previous draft. 56. Koenker, R. 2005. Quantile Regression. Cambridge Univer- References sity Press. Aravkin, A. Y.; Kambadur, A.; Lozano, A. C.; and Luss, R. Levina, E., and Bickel, P. 2001. The Earth Mover’s Distance 2014. Sparse Quantile Huber Regression for Efficient and is the Mallows Distance: Some Insights from Statistics. In Robust Estimation. arXiv. The 8th IEEE International Conference on Computer Vision Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasser- (ICCV). IEEE. stein Generative Adversarial Networks. In Proceedings of Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Ve- the 34th International Conference on Machine Learning ness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fid- (ICML). jeland, A. K.; Ostrovski, G.; et al. 2015. Human-level Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. Control through Deep Reinforcement Learning. Nature 2013. The Arcade Learning Environment: An Evaluation 518(7540):529–533. Platform for General Agents. Journal of Artificial Intelli- Morimura, T.; Hachiya, H.; Sugiyama, M.; Tanaka, T.; and gence Research 47:253–279. Kashima, H. 2010. Parametric Return Density Estimation Bellemare, M. G.; Danihelka, I.; Dabney, W.; Mohamed, S.; for Reinforcement Learning. In Proceedings of the Confer- Lakshminarayanan, B.; Hoyer, S.; and Munos, R. 2017. The ence on Uncertainty in Artificial Intelligence (UAI). Cramer Distance as a Solution to Biased Wasserstein Gradi- M¨uller, A. 1997. Integral Probability Metrics and their Gen- ents. arXiv. erating Classes of Functions. Advances in Applied Proba- Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A bility 29(2):429–443. Distributional Perspective on Reinforcement Learning. Pro- Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, ceedings of the 34th International Conference on Machine R.; De Maria, A.; Panneershelvam, V.; Suleyman, M.; Beat- Learning (ICML). tie, C.; and Petersen, S. e. a. 2015. Massively Parallel Meth- Bellman, R. E. 1957. Dynamic Programming. Princeton, ods for Deep Reinforcement Learning. In ICML Workshop NJ: Princeton University Press. on Deep Learning. Bickel, P. J., and Freedman, D. A. 1981. Some Asymptotic Puterman, M. L. 1994. Markov Decision Processes: Dis- Theory for the Bootstrap. The Annals of Statistics 1196– crete stochastic dynamic programming. John Wiley & Sons, 1217. Inc. Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Rummery, G. A., and Niranjan, M. 1994. On-line Q- Risk-Sensitive and Robust Decision-Making: a CVaR Op- learning using Connectionist Systems. Technical report, timization Approach. In Advances in Neural Information Cambridge University Engineering Department. Processing Systems (NIPS), 1522–1530. Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2016. Dearden, R.; Friedman, N.; and Russell, S. 1998. Bayesian Prioritized Experience Replay. In Proceedings of the Inter- Q-learning. In Proceedings of the National Conference on national Conference on Learning Representations (ICLR). Artificial Intelligence. Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learn- Engel, Y.; Mannor, S.; and Meir, R. 2005. Reinforcement ing: An Introduction. MIT Press. Learning with Gaussian Processes. In Proceedings of the Taylor, J. W. 1999. A Quantile Regression Approach to Esti- International Conference on Machine Learning (ICML). mating the Distribution of Multiperiod Returns. The Journal Engel, E. 1857. Die Productions-und Consum- of Derivatives 7(1):64–78. tionsverh¨altnisse des K¨onigreichs Sachsen. Zeitschrift des Tieleman, T., and Hinton, G. 2012. Lecture 6.5: Rmsprop. Statistischen Bureaus des K¨oniglich S¨achsischen Ministeri- COURSERA: Neural Networks for Machine Learning 4(2). ums des Innern 8:1–54. Tsitsiklis, J. N., and Van Roy, B. 1997. An Analysis of Goldstein, S.; Misra, B.; and Courtage, M. 1981. On Intrin- Temporal-Difference Learning with Function Approxima- sic Randomness of Dynamical Systems. Journal of Statisti- tion. IEEE Transactions on Automatic Control 42(5):674– cal Physics 25(1):111–126. 690.

9.van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep Rein- Appendix forcement Learning with Double Q-Learning. In Proceed- Proofs ings of the AAAI Conference on Artificial Intelligence. Lemma 2. For any τ, τ ∈ [0, 1] with τ < τ and cumulative Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H. v.; Lanctot, M.; distribution function F with inverse F −1 , the set of θ ∈ R and de Freitas, N. 2016. Dueling Network Architectures minimizing for Deep Reinforcement Learning. In Proceedings of the τ International Conference on Machine Learning (ICML). |F −1 (ω) − θ|dω , Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine τ Learning 8(3):279–292. is given by τ +τ θ ∈ R F (θ) = . 2 In particular, if F −1 is the inverse CDF, then F −1 ((τ + τ )/2) is always a valid minimizer, and if F −1 is continuous at (τ + τ )/2, then F −1 ((τ + τ )/2) is the unique minimizer. Proof. For any ω ∈ [0, 1], the function θ → |F −1 (ω) − θ| is convex, and has subgradient given by if θ < F −1 (ω)  1 θ → [−1, 1] if θ = F −1 (ω) −1 if θ > F −1 (ω) ,  τ so the function θ → τ |F −1 (ω) − θ|dω is also convex, and has subgradient given by F (θ) τ θ→ −1dω + 1dω . τ F (θ) Setting this subgradient equal to 0 yields (τ + τ ) − 2F (θ) = 0 , (14) and since F ◦ F −1 is the identity map on [0, 1], it is clear that θ = F −1 ((τ + τ )/2) satisfies Equation 14. Note that in fact any θ such that F (θ) = (τ + τ )/2 yields a subgradient of 0, which leads to a multitude of minimizers if F −1 is not continuous at (τ + τ )/2. Proposition 1. Let Zθ be a quantile distribution, and Zˆm the empirical distribution composed of m samples from Z. Then for all p ≥ 1, there exists a Z such that arg min E[Wp (Zˆm , Zθ )] = arg min Wp (Z, Zθ ). N Proof. Write Zθ = i=1 N1 δθi , with θ1 ≤ · · · ≤ θN . We take Z to be of the same form as Zθ . Specifically, consider Z given by N 1 Z= δi , i=1 N supported on the set {1, . . . , N }, and take m = N . Then clearly the unique minimizing Zθ for Wp (Z, Zθ ) is given by taking Zθ = Z. However, consider the gradient with respect to θ1 for the objective E[Wp (ZˆN , Zθ )] . We have ∇θ1 E[Wp (ZˆN , Zθ )]|θ1 =1 = E[∇θ1 Wp (ZˆN , Zθ )|θ1 =1 ] .

10.In the event that the sample distribution ZˆN has an atom at 1, then the optimal transport plan pairs the atom of Zθ at θ1 = 1 with this atom of ZˆN , and gradient with respect to θ1 of Wp (ZˆN , Zθ ) is 0. If the sample distribution ZˆN does not contain an atom at 1, then the left-most atom of ZˆN is greater than 1 (since Z is supported on {1, . . . , N }. In this case, the gradient on θ1 is negative. Since this happens with non-zero probability, we conclude that ∇θ1 E[Wp (ZˆN , Zθ )]|θ1 =1 < 0 , and therefore Zθ = Z cannot be the minimizer of E[Wp (ZˆN , Zθ )]. Proposition 2. Let ΠW1 be the quantile projection defined as above, and when applied to value distributions gives the projection for each state-value distribution. For any two value distributions Z1 , Z2 ∈ Z for an MDP with countable state and action spaces, d¯∞ (ΠW1 T π Z1 , ΠW1 T π Z2 ) ≤ γ d¯∞ (Z1 , Z2 ). (11) Proof. We assume that instantaneous rewards given a state- action pair are deterministic; the general case is a straight- forward generalization. Further, since the operator T π is a γ-contraction in d∞ , it is sufficient to prove the claim in the Figure 5: Initial MDP and value distribution Z (top), and case γ = 1. In addition, since Wasserstein distances are in- transformed MDP and value distribution Z (bottom). variant under translation of the support of distributions, it is sufficient to deal with the case where r(x, a) ≡ 0 for all (x, a) ∈ X × A. The proof then proceeds by first reducing may appeal to Lemma 3 to obtain the following: to the case where every value distribution consists only of single Diracs, and then dealing with this reduced case using d∞ (ΠW1 (T π Z)(x , a ), ΠW1 (T π Y )(x , a )) Lemma 3. N 1 ≤ sup |θj (xi , ai ) − ψj (xi , ai )| We write Z(x, a) = k=1 N δθk (x,a) and Y (x, a) = i=1∈I N 1 n j=1,...,N k=1 N δψk (x,a) , for some functions θ, ψ : X × A → R . Let (x, a) be a state-action pair, and let ((xi , ai ))i∈I be all = sup d∞ (Z(xi , ai ), Y (xi , ai )) (15) i=1∈I the state-action pairs that are accessible from (x , a ) in a single transition, where I is a (finite or countable) index- Now note that by construction, (T π Z)(x , a ) (re- ing set. Write pi for the probability of transitioning from spectively, (T π Y )(x , a )) has the same distribution as (x , a ) to (xi , ai ), for each i ∈ I. We now construct a new (T π Z)(x , a ) (respectively, (T π Y )(x , a )), and so MDP and new value distributions for this MDP in which all distributions are given by single Diracs, with a view d∞ (ΠW1 (T π Z)(x , a ), ΠW1 (T π Y )(x , a )) to applying Lemma 3. The new MDP is of the following form. We take the state-action pair (x , a ), and define new = d∞ (ΠW1 (T π Z)(x , a ), ΠW1 (T π Y )(x , a )) . states, actions, transitions, and a policy π, so that the state- Therefore, substituting this into the Inequality 15, we obtain action pairs accessible from (x , a ) in this new MDP are given by ((xji , aji )i∈I )Nj=1 , and the probability of reaching d∞ (ΠW1 (T π Z)(x , a ), ΠW1 (T π Y )(x , a )) the state-action pair (xji , aji ) is pi /n. Further, we define new ≤ sup d∞ (Z(xi , ai ), Y (xi , ai )) . i∈I value distributions Z, Y as follows. For each i ∈ I and j = 1, . . . , N , we set: Taking suprema over the initial state (x , a ) then yields the result. Z(xji , aji ) = δθj (xi ,ai ) Supporting results Y (xji , aji ) = δψj (xi ,ai ) . Lemma 3. Consider an MDP with countable state and ac- The construction is illustrated in Figure 5. tion spaces. Let Z, Y be value distributions such that each Since, by Lemma 4, the d∞ distance between the 1- state-action distribution Z(x, a), Y (x, a) is given by a sin- Wasserstein projections of two real-valued distributions is gle Dirac. Consider the particular case where rewards are the max over the difference of a certain set of quantiles, we identically 0 and γ = 1, and let τ ∈ [0, 1]. Denote by Πτ

11.the projection operator that maps a probability distribution Lemma 4. For any two probability distributions ν1 , ν2 over onto a Dirac delta located at its τ th quantile. Then the real numbers, and the Wasserstein projection operator ΠW1 that projects distributions onto support of size n, we d∞ (Πτ T π Z, Πτ T π Y ) ≤ d∞ (Z, Y ) have that Proof. Let Z(x, a) = δθ(x,a) and Y (x, a) = δψ(x,a) for d∞ (ΠW1 ν1 , ΠW1 ν2 ) each state-action pair (x, a) ∈ X × A, for some functions ψ, θ : X × A → R. Let (x , a ) be a state-action pair, and let 2i − 1 2i − 1 = max Fν−1 − Fν−1 . ((xi , ai ))i∈I be all the state-action pairs that are accessible i=1,...,n 1 2n 2 2n from (x , a ) in a single transition, with I a (finite or count- Proof. By the discussion surrounding Lemma 2, we have ably infinite) indexing set. To lighten notation, we write θi n that ΠW1 νk = i=1 n1 δFν−1 ( 2i−1 ) for k = 1, 2. Therefore, for θ(xi , ai ) and ψi for ψ(xi , ai ). Further, let the probability k 2n the optimal coupling between ΠW1 ν1 and ΠW1 ν2 must be of transitioning from (x , a ) to (xi , ai ) be pi , for all i ∈ I. given by Fν−1 ( 2i−1 −1 2i−1 2n ) → Fν2 ( 2n ) for each i = 1, . . . , n. Then we have 1 This immediately leads to the expression of the lemma. (T π Z)(x , a ) = pi δθi (16) i∈I Further theoretical results π (T Y )(x , a ) = pi δψi . (17) Lemma 5. The projected Bellman operator ΠW1 T π is in i∈I general not a non-expansion in dp , for p ∈ [1, ∞). th Proof. Consider the case where the number of Dirac deltas Now consider the τ quantile of each of these distributions, for τ ∈ [0, 1] arbitrary. Let u ∈ I be such that θu is equal to in each distribution, N , is equal to 2, and let γ = 1. We con- this quantile of (T π Z)(x , a ), and let v ∈ I such that ψv is sider an MDP with a single initial state, x, and two terminal equal to this quantile of (T π Y )(x , a ). Now note that states, x1 and x2 . We take the action space of the MDP to be trivial, and therefore omit it in the notation that follows. d∞ (Πτ T π Z(x , a ), Πτ T π Y (x , a )) = |θu − ψv | Let the MDP have a 2/3 probability of transitioning from We now show that x to x1 , and 1/3 probability of transitioning from x to x2 . |θu − ψv | > |θi − ψi | ∀i ∈ I (18) We take all rewards in the MDP to be identically 0. Further, consider two value distributions, Z and Y , given by: is impossible, from which it will follow that 1 1 1 1 d∞ (Πτ T π Z(x , a ), Πτ T π Y (x , a )) ≤ d∞ (Z, Y ) , Z(x1 ) = δ0 + δ2 , Y (x1 ) = δ1 + δ2 , 2 2 2 2 and the result then follows by taking maxima over state- 1 1 1 1 Z(x2 ) = δ3 + δ5 , Y (x2 ) = δ4 + δ5 , action pairs (x , a ). To demonstrate the impossibility of 2 2 2 2 (18), without loss of generality we take θu ≤ ψv . Z(x) = δ0 , Y (x) = δ0 . We now introduce the following partitions of the indexing Then note that we have set I. Define: 1/p 1 1 I≤θu = {i ∈ I|θi ≤ θu } , dp (Z(x1 ), Y (x1 )) = |1 − 0| = 1/p , 2 2 I>θu = {i ∈ I|θi > θu } , 1/p I<ψv = {i ∈ I|ψi < ψv } , 1 1 dp (Z(x2 ), Y (x2 )) = |4 − 3| = 1/p , 2 2 I≥ψv = {i ∈ I|ψi ≥ ψv } , dp (Z(x), Y (x)) = 0 , and observe that we clearly have the following disjoint unions: and so 1 I = I≤θu ∪ I>θu , dp (Z, Y ) = . 21/p I = I<ψv ∪ I≥ψv . We now consider the projected backup for these two value distributions at the state x. We first compute the full backup: If (18) is to hold, then we must have I≤θu ∩ I≥ψv = ∅. Therefore, we must have I≤θu ⊆ I<ψv . But if this is the 1 1 1 1 (T π Z)(x) = δ0 + δ2 + δ3 + δ5 , case, then since θu is the τ th quantile of (T π Z)(x , a ), we 3 3 6 6 must have 1 1 1 1 (T π Y )(x) = δ1 + δ2 + δ4 + δ5 . pi ≥ τ , 3 3 6 6 i∈I≤θu Appealing to Lemma 2, we note that when projected these and so consequently distributions onto two equally-weighted Diracs, the loca- tions of these Diracs correspond to the 25% and 75% quan- pi ≥ τ , tiles of the original distributions. We therefore have i∈I<ψv 1 1 (ΠW1 T π Z)(x) = δ0 + δ3 , from which we conclude that the τ th quantile of 2 2 (T π Y )(x , a ) is less than ψv , a contradiction. Therefore π 1 1 (ΠW1 T Y )(x) = δ1 + δ4 , (18) cannot hold, completing the proof. 2 2

12.and we therefore obtain 1/p 1 d1 (ΠW1 T π Z, ΠW1 T π Y ) = (|1 − 0|p + |4 − 3|p ) 2 Table 2: Notation used in the paper 1 =1 > 1/p = d1 (Z, Y ) , Symbol Description of usage 2 completing the proof. Reinforcement Learning M MDP (X , A, R, P , γ) Notation X State space of MDP Human-normalized scores are given by (van Hasselt, Guez, A Action space of MDP and Silver 2016), R, Rt Reward function, random variable reward agent − random P Transition probabilities, P (x |x, a) score = , γ Discount factor, γ ∈ [0, 1) human − random x, xt ∈ X States where agent, human and random represent the per-game a, a∗ , b ∈ A Actions raw scores for the agent, human baseline, and random agent r, rt ∈ R Rewards baseline. π Policy Tπ (dist.) Bellman operator T (dist.) Bellman optimality operator V π, V Value function, state-value function Qπ , Q Action-value function α Step-size parameter, learning rate Exploration rate, -greedy ADAM Adam parameter κ Huber-loss parameter Lκ Huber-loss with parameter κ Distributional Reinforcement Learning Zπ, Z Random return, value distribution π ZM C Monte-Carlo value distribution under policy π Z Space of value distributions Zˆ π Fixed point of convergence for ΠW1 T π z∼Z Instantiated return sample p Metric order Wp p-Wasserstein metric Lp Metric order p d¯p maximal form of Wasserstein Φ Projection used by C 51 ΠW1 1-Wasserstein projection ρτ Quantile regression loss ρκτ Huber quantile loss q1 , . . . , qN Probabilities, parameterized probabilities τ0 , τ1 , . . . , τN Cumulative probabilities with τ0 := 0 τˆ1 , . . . , τˆN Midpoint quantile targets ω Sample from unit interval δz Dirac function at z ∈ R θ Parameterized function B Bernoulli distribution Bµ Parameterized Bernoulli distribution ZQ Space of quantile (value) distributions Zθ Parameterized quantile (value) distribution Y Random variable over R Y1 , . . . , Ym Random variable samples Yˆm Empirical distribution from m-Diracs

13. DQN QR-DQN-0 C51 QR-DQN-1 Figure 6: Online training curves for DQN, C 51, and QR - DQN on 57 Atari 2600 games. Curves are averages over three seeds, smoothed over a sliding window of 5 iterations, and error bands give standard deviations.

14. GAMES RANDOM HUMAN DQN PRIOR . DUEL . C 51 QR - DQN -0 QR - DQN -1 Alien 227.8 7,127.7 1,620.0 3,941.0 3,166 9,983 4,871 Amidar 5.8 1,719.5 978.0 2,296.8 1,735 2,726 1,641 Assault 222.4 742.0 4,280.4 11,477.0 7,203 19,961 22,012 Asterix 210.0 8,503.3 4,359.0 375,080.0 406,211 454,461 261,025 Asteroids 719.1 47,388.7 1,364.5 1,192.7 1,516 2,335 4,226 Atlantis 12,850.0 29,028.1 279,987.0 395,762.0 841,075 1,046,625 971,850 Bank Heist 14.2 753.1 455.0 1,503.1 976 1,245 1,249 Battle Zone 2,360.0 37,187.5 29,900.0 35,520.0 28,742 35,580 39,268 Beam Rider 363.9 16,926.5 8,627.5 30,276.5 14,074 24,919 34,821 Berzerk 123.7 2,630.4 585.6 3,409.0 1,645 34,798 3,117 Bowling 23.1 160.7 50.4 46.7 81.8 85.3 77.2 Boxing 0.1 12.1 88.0 98.9 97.8 99.8 99.9 Breakout 1.7 30.5 385.5 366.0 748 766 742 Centipede 2,090.9 12,017.0 4,657.7 7,687.5 9,646 9,163 12,447 Chopper Command 811.0 7,387.8 6,126.0 13,185.0 15,600 7,138 14,667 Crazy Climber 10,780.5 35,829.4 110,763.0 162,224.0 179,877 181,233 161,196 Defender 2,874.5 18,688.9 23,633.0 41,324.5 47,092 42,120 47,887 Demon Attack 152.1 1,971.0 12,149.4 72,878.6 130,955 117,577 121,551 Double Dunk -18.6 -16.4 -6.6 -12.5 2.5 12.3 21.9 Enduro 0.0 860.5 729.0 2,306.4 3,454 2,357 2,355 Fishing Derby -91.7 -38.7 -4.9 41.3 8.9 37.4 39.0 Freeway 0.0 29.6 30.8 33.0 33.9 34.0 34.0 Frostbite 65.2 4,334.7 797.4 7,413.0 3,965 4,839 4,384 Gopher 257.6 2,412.5 8,777.4 104,368.2 33,641 118,050 113,585 Gravitar 173.0 3,351.4 473.0 238.0 440 546 995 H.E.R.O. 1,027.0 30,826.4 20,437.8 21,036.5 38,874 21,785 21,395 Ice Hockey -11.2 0.9 -1.9 -0.4 -3.5 -3.6 -1.7 James Bond 29.0 302.8 768.5 812.0 1,909 1,028 4,703 Kangaroo 52.0 3,035.0 7,259.0 1,792.0 12,853 14,780 15,356 Krull 1,598.0 2,665.5 8,422.3 10,374.4 9,735 11,139 11,447 Kung-Fu Master 258.5 22,736.3 26,059.0 48,375.0 48,192 71,514 76,642 Montezuma’s Revenge 0.0 4,753.3 0.0 0.0 0.0 75.0 0.0 Ms. Pac-Man 307.3 6,951.6 3,085.6 3,327.3 3,415 5,822 5,821 Name This Game 2,292.3 8,049.0 8,207.8 15,572.5 12,542 17,557 21,890 Phoenix 761.4 7,242.6 8,485.2 70,324.3 17,490 65,767 16,585 Pitfall! -229.4 6,463.7 -286.1 0.0 0.0 0.0 0.0 Pong -20.7 14.6 19.5 20.9 20.9 21.0 21.0 Private Eye 24.9 69,571.3 146.7 206.0 15,095 146 350 Q*Bert 163.9 13,455.0 13,117.3 18,760.3 23,784 26,646 572,510 River Raid 1,338.5 17,118.0 7,377.6 20,607.6 17,322 9,336 17,571 Road Runner 11.5 7,845.0 39,544.0 62,151.0 55,839 67,780 64,262 Robotank 2.2 11.9 63.9 27.5 52.3 61.1 59.4 Seaquest 68.4 42,054.7 5,860.6 931.6 266,434 2,680 8,268 Skiing -17,098.1 -4,336.9 -13,062.3 -19,949.9 -13,901 -9,163 -9,324 Solaris 1,236.3 12,326.7 3,482.8 133.4 8,342 2,522 6,740 Space Invaders 148.0 1,668.7 1,692.3 15,311.5 5,747 21,039 20,972 Star Gunner 664.0 10,250.0 54,282.0 125,117.0 49,095 70,055 77,495 Surround -10.0 6.5 -5.6 1.2 6.8 9.7 8.2 Tennis -23.8 -8.3 12.2 0.0 23.1 23.7 23.6 Time Pilot 3,568.0 5,229.2 4,870.0 7,553.0 8,329 9,344 10,345 Tutankham 11.4 167.6 68.1 245.9 280 312 297 Up and Down 533.4 11,693.2 9,989.9 33,879.1 15,612 53,585 71,260 Venture 0.0 1,187.5 163.0 48.0 1,520 0.0 43.9 Video Pinball 16,256.9 17,667.9 196,760.4 479,197.0 949,604 701,779 705,662 Wizard Of Wor 563.5 4,756.5 2,704.0 12,352.0 9,300 26,844 25,061 Yars’ Revenge 3,092.9 54,576.9 18,098.9 69,618.1 35,050 32,605 26,447 Zaxxon 32.5 9,173.3 5,363.0 13,886.0 10,513 7,200 13,112 Figure 7: Raw scores across all games, starting with 30 no-op actions. Reference values from Wang et al. (2016) and Bellemare, Dabney, and Munos (2017).