# Neural GPUs Learning Algorithms

Learning an algorithm from examples is a fundamental problem that has been widely studied. It has been addressed using neural networks too, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal NTMs have a weakness that is caused by their sequential nature: they are not parallel and are are hard to train due to their large depth when unfolded. We present a neural network architecture to address this problem: the Neural GPU. It is based on a type of convolutional gated recurrent unit and, like the NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly parallel which makes it easier to train and efficient to run. An essential property of algorithms is their ability to handle inputs of arbitrary size. We show that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances. We verified it on a number of tasks including long addition and long multiplication of numbers represented in binary. We train the Neural GPU on numbers with up-to 20 bits and observe no errors whatsoever while testing it, even on much longer numbers.To achieve these results we introduce a technique for training deep recurrent networks: parameter sharing relaxation. We also found a small amount of dropout and gradient noise to have a large positive effect on learning and generalization.

3.Published as a conference paper at ICLR 2016 In this representation the triples of bits from (x, y, x + y), e.g., (1, 1, 0) (0, 0, 1) (0, 1, 1) (1, 0, 1) as in the figure above, form a regular language. To learn binary addition in this representation it therefore suffices to find a regular expression or an automaton that accepts this language, which can be done with a variant of Anguin’s algorithm (Angluin, 1987). But only few interesting functions have regular representations, as for example long multiplication does not (Blumensath & Gr¨adel, 2000). It is therefore desirable to learn long binary addition without alignment, for example when x and y are provided one after another. This is the representation we use in the present paper. Input (x, y) 1 0 0 1 + 1 0 1 0 Desired Output (x + y) 0 1 1 1 2 T HE N EURAL GPU Before we introduce the Neural GPU, let us recall the architecture of a Gated Recurrent Unit (GRU) (Cho et al., 2014). A GRU is similar to an LSTM, but its input and state are the same size, which makes it easier for us to generalize it later; a highway network could have also been used (Srivastava et al., 2015), but it lacks the reset gate. GRUs have shown performance similar to LSTMs on a number of tasks (Chung et al., 2014; Greff et al., 2015). A GRU takes an input vector x and a current state vector s, and outputs: GRU(x, s) = u ⊙ s + (1 − u) ⊙ tanh(W x + U (r ⊙ s) + B), where u = σ(W ′ x + U ′ s + B ′ ) and r = σ(W ′′ x + U ′′ s + B ′′ ). In the equations above, W, W ′ , W ′′ , U, U ′ , U ′′ are matrices and B, B ′ , B ′′ are bias vectors; these are the parameters that will be learned. We write W x for a matrix-vector multiplication and r ⊙ s for elementwise vector multiplication. The vectors u and r are called gates since their elements are in [0, 1] — u is the update gate and r is the reset gate. In recurrent neural networks a unit like GRU is applied at every step and the result is both passed as new state and used to compute the output. In a Neural GPU we do not process a new input in every step. Instead, all inputs are written into the starting state s0 . This state has 2-dimensional structure: it consists of w × h vectors of m numbers, i.e., it is a 3-dimensional tensor of shape [w, h, m]. This mental image evolves in time in a way defined by a convolutional gated recurrent unit: CGRU(s) = u ⊙ s + (1 − u) ⊙ tanh(U ∗ (r ⊙ s) + B), where u = σ(U ′ ∗ s + B ′ ) and r = σ(U ′′ ∗ s + B ′′ ). U ∗ s above denotes the convolution of a kernel bank U with the mental image s. A kernel bank is a 4-dimensional tensor of shape [kw , kh , m, m], i.e., it contains kw · kh · m2 parameters, where kw and kh are kernel width and height. It is applied to a mental image s of shape [w, h, m] which results in another mental image U ∗ s of the same shape defined by: ⌊kw /2⌋ ⌊kh /2⌋ m U ∗ s[x, y, i] = s[x + u, y + v, c] · U [u, v, c, i]. u=⌊−kw /2⌋ v=⌊−kh /2⌋ c=1 In the equation above the index x + u might sometimes be negative or larger than the size of s, and in such cases we assume the value is 0. This corresponds to the standard convolution operator used in convolutional neural networks with zero padding on both sides and stride 1. Using the standard operator has the advantage that it is heavily optimized (see Section 4 for Neural GPU performance). New work on faster convolutions, e.g., Lavin & Gray (2015), can be directly used in a Neural GPU. Knowing how a CGRU gate works, the definition of a l-layer Neural GPU is simple, as depicted in Figure 1. The given sequence i = (i1 , . . . , in ) of n discrete symbols from {0, . . . , I} is first em- bedded into the mental image s0 by concatenating the vectors obtained from an embedding lookup of the input symbols into its first column. More precisely, we create the starting mental image s0 of shape [w, n, m] by using an embedding matrix E of shape [I, m] and setting s0 [0, k, :] = E[ik ] (in python notation) for all k = 1 . . . n (here i1 , . . . , in is the input). All other elements of s0 are set to 0. Then, we apply l different CGRU gates in turn for n steps to produce the final mental image sfin : st+1 = CGRUl (CGRUl−1 . . . CGRU1 (st ) . . .) and sfin = sn . 3