Learning Phrase Representations using RNN Encoder–Decoder

In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixedlength vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

1. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation Kyunghyun Cho Bart van Merri¨enboer Caglar Gulcehre Dzmitry Bahdanau Universit´e de Montr´eal Jacobs University, Germany firstname.lastname@umontreal.ca d.bahdanau@jacobs-university.de Fethi Bougares Holger Schwenk Yoshua Bengio Universit´e du Maine, France Universit´e de Montr´eal, CIFAR Senior Fellow firstname.lastname@lium.univ-lemans.fr find.me@on.the.web Abstract Along this line of research on using neural net- works for SMT, this paper focuses on a novel neu- In this paper, we propose a novel neu- ral network architecture that can be used as a part ral network model called RNN Encoder– of the conventional phrase-based SMT system. Decoder that consists of two recurrent The proposed neural network architecture, which arXiv:1406.1078v3 [cs.CL] 3 Sep 2014 neural networks (RNN). One RNN en- we will refer to as an RNN Encoder–Decoder, con- codes a sequence of symbols into a fixed- sists of two recurrent neural networks (RNN) that length vector representation, and the other act as an encoder and a decoder pair. The en- decodes the representation into another se- coder maps a variable-length source sequence to a quence of symbols. The encoder and de- fixed-length vector, and the decoder maps the vec- coder of the proposed model are jointly tor representation back to a variable-length target trained to maximize the conditional prob- sequence. The two networks are trained jointly to ability of a target sequence given a source maximize the conditional probability of the target sequence. The performance of a statisti- sequence given a source sequence. Additionally, cal machine translation system is empiri- we propose to use a rather sophisticated hidden cally found to improve by using the con- unit in order to improve both the memory capacity ditional probabilities of phrase pairs com- and the ease of training. puted by the RNN Encoder–Decoder as an additional feature in the existing log-linear The proposed RNN Encoder–Decoder with a model. Qualitatively, we show that the novel hidden unit is empirically evaluated on the proposed model learns a semantically and task of translating from English to French. We syntactically meaningful representation of train the model to learn the translation probabil- linguistic phrases. ity of an English phrase to a corresponding French phrase. The model is then used as a part of a stan- 1 Introduction dard phrase-based SMT system by scoring each phrase pair in the phrase table. The empirical eval- Deep neural networks have shown great success in uation reveals that this approach of scoring phrase various applications such as objection recognition pairs with an RNN Encoder–Decoder improves (see, e.g., (Krizhevsky et al., 2012)) and speech the translation performance. recognition (see, e.g., (Dahl et al., 2012)). Fur- thermore, many recent works showed that neu- We qualitatively analyze the trained RNN ral networks can be successfully used in a num- Encoder–Decoder by comparing its phrase scores ber of tasks in natural language processing (NLP). with those given by the existing translation model. These include, but are not limited to, language The qualitative analysis shows that the RNN modeling (Bengio et al., 2003), paraphrase detec- Encoder–Decoder is better at capturing the lin- tion (Socher et al., 2011) and word embedding ex- guistic regularities in the phrase table, indirectly traction (Mikolov et al., 2013). In the field of sta- explaining the quantitative improvements in the tistical machine translation (SMT), deep neural overall translation performance. The further anal- networks have begun to show promising results. ysis of the model reveals that the RNN Encoder– (Schwenk, 2012) summarizes a successful usage Decoder learns a continuous space representation of feedforward neural networks in the framework of a phrase that preserves both the semantic and of phrase-based SMT system. syntactic structure of the phrase.

2.2 RNN Encoder–Decoder Decoder 2.1 Preliminary: Recurrent Neural Networks yT' y2 y1 A recurrent neural network (RNN) is a neural net- work that consists of a hidden state h and an optional output y which operates on a variable- length sequence x = (x1 , . . . , xT ). At each time step t, the hidden state h t of the RNN is updated c by h t =f h t−1 , xt , (1) where f is a non-linear activation func- x1 x2 xT tion. f may be as simple as an element- Encoder wise logistic sigmoid function and as com- plex as a long short-term memory (LSTM) Figure 1: An illustration of the proposed RNN unit (Hochreiter and Schmidhuber, 1997). Encoder–Decoder. An RNN can learn a probability distribution over a sequence by being trained to predict the should note that the input and output sequence next symbol in a sequence. In that case, the output lengths T and T may differ. at each timestep t is the conditional distribution The encoder is an RNN that reads each symbol p(xt | xt−1 , . . . , x1 ). For example, a multinomial of an input sequence x sequentially. As it reads distribution (1-of-K coding) can be output using a each symbol, the hidden state of the RNN changes softmax activation function according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence sym- bol), the hidden state of the RNN is a summary c exp wj h t of the whole input sequence. p(xt,j = 1 | xt−1 , . . . , x1 ) = K , The decoder of the proposed model is another j =1 exp wj h t (2) RNN which is trained to generate the output se- quence by predicting the next symbol yt given the for all possible symbols j = 1, . . . , K, where wj hidden state h t . However, unlike the RNN de- are the rows of a weight matrix W. By combining scribed in Sec. 2.1, both yt and h t are also con- these probabilities, we can compute the probabil- ditioned on yt−1 and on the summary c of the input ity of the sequence x using sequence. Hence, the hidden state of the decoder at time t is computed by, T p(x) = p(xt | xt−1 , . . . , x1 ). (3) h t =f h t−1 , yt−1 , c , t=1 and similarly, the conditional distribution of the From this learned distribution, it is straightfor- next symbol is ward to sample a new sequence by iteratively sam- pling a symbol at each time step. P (yt |yt−1 , yt−2 , . . . , y1 , c) = g h t , yt−1 , c . 2.2 RNN Encoder–Decoder for given activation functions f and g (the latter In this paper, we propose a novel neural network must produce valid probabilities, e.g. with a soft- architecture that learns to encode a variable-length max). sequence into a fixed-length vector representation See Fig. 1 for a graphical depiction of the pro- and to decode a given fixed-length vector rep- posed model architecture. resentation back into a variable-length sequence. The two components of the proposed RNN From a probabilistic perspective, this new model Encoder–Decoder are jointly trained to maximize is a general method to learn the conditional dis- the conditional log-likelihood tribution over a variable-length sequence condi- N 1 tioned on yet another variable-length sequence, max log pθ (yn | xn ), (4) θ N e.g. p(y1 , . . . , yT | x1 , . . . , xT ), where one n=1

3.where θ is the set of the model parameters and each (xn , yn ) is an (input sequence, output se- z quence) pair from the training set. In our case, as the output of the decoder, starting from the in- ~ put, is differentiable, we can use a gradient-based h r h x algorithm to estimate the model parameters. Once the RNN Encoder–Decoder is trained, the Figure 2: An illustration of the proposed hidden model can be used in two ways. One way is to use activation function. The update gate z selects the model to generate a target sequence given an whether the hidden state is to be updated with input sequence. On the other hand, the model can a new hidden state h.˜ The reset gate r decides be used to score a given pair of input and output whether the previous hidden state is ignored. See sequences, where the score is simply a probability Eqs. (5)–(8) for the detailed equations of r, z, h pθ (y | x) from Eqs. (3) and (4). ˜ and h. 2.3 Hidden Unit that Adaptively Remembers and Forgets only. This effectively allows the hidden state to drop any information that is found to be irrelevant In addition to a novel model architecture, we also later in the future, thus, allowing a more compact propose a new type of hidden unit (f in Eq. (1)) representation. that has been motivated by the LSTM unit but is On the other hand, the update gate controls how much simpler to compute and implement.1 Fig. 2 much information from the previous hidden state shows the graphical depiction of the proposed hid- will carry over to the current hidden state. This den unit. acts similarly to the memory cell in the LSTM Let us describe how the activation of the j-th network and helps the RNN to remember long- hidden unit is computed. First, the reset gate rj is term information. Furthermore, this may be con- computed by sidered an adaptive variant of a leaky-integration rj = σ [Wr x]j + Ur h , (5) unit (Bengio et al., 2013). t−1 j As each hidden unit has separate reset and up- where σ is the logistic sigmoid function, and [.]j date gates, each hidden unit will learn to capture denotes the j-th element of a vector. x and ht−1 dependencies over different time scales. Those are the input and the previous hidden state, respec- units that learn to capture short-term dependencies tively. Wr and Ur are weight matrices which are will tend to have reset gates that are frequently ac- learned. tive, but those that capture longer-term dependen- Similarly, the update gate zj is computed by cies will have update gates that are mostly active. In our preliminary experiments, we found that zj = σ [Wz x]j + Uz h t−1 j . (6) it is crucial to use this new unit with gating units. We were not able to get meaningful result with an The actual activation of the proposed unit hj is oft-used tanh unit without any gating. then computed by t t−1 ˜t, 3 Statistical Machine Translation hj = zj hj + (1 − zj )h j (7) In a commonly used statistical machine translation where system (SMT), the goal of the system (decoder, ˜ t = φ [Wx] + U r h h t−1 . (8) specifically) is to find a translation f given a source j j j sentence e, which maximizes In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the pre- p(f | e) ∝ p(e | f )p(f ), vious hidden state and reset with the current input 1 The LSTM unit, which has shown impressive results in where the first term at the right hand side is called several applications such as speech recognition, has a mem- translation model and the latter language model ory cell and four gating units that adaptively control the in- (see, e.g., (Koehn, 2005)). In practice, however, formation flow inside the unit, compared to only two gating units in the proposed hidden unit. For details on LSTM net- most SMT systems model log p(f | e) as a log- works, see, e.g., (Graves, 2012). linear model with additional features and corre-

4.sponding weights: pairs in the original corpus. With a fixed capacity of the RNN Encoder–Decoder, we try to ensure N that most of the capacity of the model is focused log p(f | e) = wn fn (f , e) + log Z(e), (9) n=1 toward learning linguistic regularities, i.e., distin- guishing between plausible and implausible trans- where fn and wn are the n-th feature and weight, lations, or learning the “manifold” (region of prob- respectively. Z(e) is a normalization constant that ability concentration) of plausible translations. does not depend on the weights. The weights are Once the RNN Encoder–Decoder is trained, we often optimized to maximize the BLEU score on a add a new score for each phrase pair to the exist- development set. ing phrase table. This allows the new scores to en- In the phrase-based SMT framework ter into the existing tuning algorithm with minimal introduced in (Koehn et al., 2003) and additional overhead in computation. (Marcu and Wong, 2002), the translation model As Schwenk pointed out in (Schwenk, 2012), log p(e | f ) is factorized into the translation it is possible to completely replace the existing probabilities of matching phrases in the source phrase table with the proposed RNN Encoder– and target sentences.2 These probabilities are Decoder. In that case, for a given source phrase, once again considered additional features in the the RNN Encoder–Decoder will need to generate log-linear model (see Eq. (9)) and are weighted a list of (good) target phrases. This requires, how- accordingly to maximize the BLEU score. ever, an expensive sampling procedure to be per- Since the neural net language model was pro- formed repeatedly. In this paper, thus, we only posed in (Bengio et al., 2003), neural networks consider rescoring the phrase pairs in the phrase have been used widely in SMT systems. In table. many cases, neural networks have been used to rescore translation hypotheses (n-best lists) (see, 3.2 Related Approaches: Neural Networks in e.g., (Schwenk et al., 2006)). Recently, however, Machine Translation there has been interest in training neural networks to score the translated sentence (or phrase pairs) Before presenting the empirical results, we discuss using a representation of the source sentence as a number of recent works that have proposed to an additional input. See, e.g., (Schwenk, 2012), use neural networks in the context of SMT. (Son et al., 2012) and (Zou et al., 2013). Schwenk in (Schwenk, 2012) proposed a simi- lar approach of scoring phrase pairs. Instead of the 3.1 Scoring Phrase Pairs with RNN RNN-based neural network, he used a feedforward Encoder–Decoder neural network that has fixed-size inputs (7 words Here we propose to train the RNN Encoder– in his case, with zero-padding for shorter phrases) Decoder (see Sec. 2.2) on a table of phrase pairs and fixed-size outputs (7 words in the target lan- and use its scores as additional features in the log- guage). When it is used specifically for scoring linear model in Eq. (9) when tuning the SMT de- phrases for the SMT system, the maximum phrase coder. length is often chosen to be small. However, as the When we train the RNN Encoder–Decoder, we length of phrases increases or as we apply neural ignore the (normalized) frequencies of each phrase networks to other variable-length sequence data, pair in the original corpora. This measure was it is important that the neural network can han- taken in order (1) to reduce the computational ex- dle variable-length input and output. The pro- pense of randomly selecting phrase pairs from a posed RNN Encoder–Decoder is well-suited for large phrase table according to the normalized fre- these applications. quencies and (2) to ensure that the RNN Encoder– Similar to (Schwenk, 2012), Devlin et al. Decoder does not simply learn to rank the phrase (Devlin et al., 2014) proposed to use a feedfor- pairs according to their numbers of occurrences. ward neural network to model a translation model, One underlying reason for this choice was that the however, by predicting one word in a target phrase existing translation probability in the phrase ta- at a time. They reported an impressive improve- ble already reflects the frequencies of the phrase ment, but their approach still requires the maxi- 2 Without loss of generality, from here on, we refer to mum length of the input phrase (or context words) p(e | f ) for each phrase pair as a translation model as well to be fixed a priori.

5. Although it is not exactly a neural network they 4.1 Data and Baseline System train, the authors of (Zou et al., 2013) proposed Large amounts of resources are available to build to learn a bilingual embedding of words/phrases. an English/French SMT system in the framework They use the learned embedding to compute the of the WMT’14 translation task. The bilingual distance between a pair of phrases which is used corpora include Europarl (61M words), news com- as an additional score of the phrase pair in an SMT mentary (5.5M), UN (421M), and two crawled system. corpora of 90M and 780M words respectively. In (Chandar et al., 2014), a feedforward neural The last two corpora are quite noisy. To train network was trained to learn a mapping from a the French language model, about 712M words of bag-of-words representation of an input phrase to crawled newspaper material is available in addi- an output phrase. This is closely related to both the tion to the target side of the bitexts. All the word proposed RNN Encoder–Decoder and the model counts refer to French words after tokenization. proposed in (Schwenk, 2012), except that their in- It is commonly acknowledged that training sta- put representation of a phrase is a bag-of-words. tistical models on the concatenation of all this A similar approach of using bag-of-words repre- data does not necessarily lead to optimal per- sentations was proposed in (Gao et al., 2013) as formance, and results in extremely large mod- well. Earlier, a similar encoder–decoder model us- els which are difficult to handle. Instead, one ing two recursive neural networks was proposed should focus on the most relevant subset of the in (Socher et al., 2011), but their model was re- data for a given task. We have done so by stricted to a monolingual setting, i.e. the model applying the data selection method proposed in reconstructs an input sentence. More recently, an- (Moore and Lewis, 2010), and its extension to bi- other encoder–decoder model using an RNN was texts (Axelrod et al., 2011). By these means we proposed in (Auli et al., 2013), where the de- selected a subset of 418M words out of more coder is conditioned on a representation of either than 2G words for language modeling and a a source sentence or a source context. subset of 348M out of 850M words for train- One important difference between the pro- ing the RNN Encoder–Decoder. We used the posed RNN Encoder–Decoder and the approaches test set newstest2012 and 2013 for data in (Zou et al., 2013) and (Chandar et al., 2014) is selection and weight tuning with MERT, and that the order of the words in source and tar- newstest2014 as our test set. Each set has get phrases is taken into account. The RNN more than 70 thousand words and a single refer- Encoder–Decoder naturally distinguishes between ence translation. sequences that have the same words but in a differ- For training the neural networks, including the ent order, whereas the aforementioned approaches proposed RNN Encoder–Decoder, we limited the effectively ignore order information. source and target vocabulary to the most frequent The closest approach related to the proposed 15,000 words for both English and French. This RNN Encoder–Decoder is the Recurrent Contin- covers approximately 93% of the dataset. All the uous Translation Model (Model 2) proposed in out-of-vocabulary words were mapped to a special (Kalchbrenner and Blunsom, 2013). In their pa- token ([UNK]). per, they proposed a similar model that consists The baseline phrase-based SMT system was of an encoder and decoder. The difference with built using Moses with default settings. This sys- our model is that they used a convolutional n-gram tem achieves a BLEU score of 30.64 and 33.3 on model (CGM) for the encoder and the hybrid of the development and test sets, respectively (see Ta- an inverse CGM and a recurrent neural network ble 1). for the decoder. They, however, evaluated their model on rescoring the n-best list proposed by the 4.1.1 RNN Encoder–Decoder conventional SMT system and computing the per- plexity of the gold standard translations. The RNN Encoder–Decoder used in the experi- ment had 1000 hidden units with the proposed 4 Experiments gates at the encoder and at the decoder. The in- put matrix between each input symbol x t and the We evaluate our approach on the English/French hidden unit is approximated with two lower-rank translation task of the WMT’14 workshop. matrices, and the output matrix is approximated

6. BLEU tem add up or are redundant. Models dev test We trained the CSLM model on 7-grams Baseline 30.64 33.30 from the target corpus. Each input word RNN 31.20 33.87 was projected into the embedding space R512 , CSLM + RNN 31.48 34.64 and they were concatenated to form a 3072- CSLM + RNN + WP 31.50 34.54 dimensional vector. The concatenated vector was fed through two rectified layers (of size 1536 and Table 1: BLEU scores computed on the develop- 1024) (Glorot et al., 2011). The output layer was ment and test sets using different combinations of a simple softmax layer (see Eq. (2)). All the approaches. WP denotes a word penalty, where weight parameters were initialized uniformly be- we penalizes the number of unknown words to tween −0.01 and 0.01, and the model was trained neural networks. until the validation perplexity did not improve for 10 epochs. After training, the language model similarly. We used rank-100 matrices, equivalent achieved a perplexity of 45.80. The validation set to learning an embedding of dimension 100 for was a random selection of 0.1% of the corpus. The each word. The activation function used for h ˜ in model was used to score partial translations dur- Eq. (8) is a hyperbolic tangent function. The com- ing the decoding process, which generally leads to putation from the hidden state in the decoder to higher gains in BLEU score than n-best list rescor- the output is implemented as a deep neural net- ing (Vaswani et al., 2013). work (Pascanu et al., 2014) with a single interme- To address the computational complexity of diate layer having 500 maxout units each pooling using a CSLM in the decoder a buffer was 2 inputs (Goodfellow et al., 2013). used to aggregate n-grams during the stack- All the weight parameters in the RNN Encoder– search performed by the decoder. Only when Decoder were initialized by sampling from an the buffer is full, or a stack is about to isotropic zero-mean (white) Gaussian distribution be pruned, the n-grams are scored by the with its standard deviation fixed to 0.01, except CSLM. This allows us to perform fast matrix- for the recurrent weight parameters. For the re- matrix multiplication on GPU using Theano current weight matrices, we first sampled from a (Bergstra et al., 2010; Bastien et al., 2012). white Gaussian distribution and used its left singu- lar vectors matrix, following (Saxe et al., 2014). 0 We used Adadelta and stochastic gradient −2 descent to train the RNN Encoder–Decoder with hyperparameters = 10−6 and ρ = −4 TM Scores (log) 0.95 (Zeiler, 2012). At each update, we used 64 −6 randomly selected phrase pairs from a phrase ta- −8 ble (which was created from 348M words). The −10 model was trained for approximately three days. Details of the architecture used in the experi- −12 ments are explained in more depth in the supple- −14 −60 −50 −40 −30 −20 −10 0 mentary material. RNN Scores (log) 4.1.2 Neural Language Model Figure 3: The visualization of phrase pairs accord- In order to assess the effectiveness of scoring ing to their scores (log-probabilities) by the RNN phrase pairs with the proposed RNN Encoder– Encoder–Decoder and the translation model. Decoder, we also tried a more traditional approach of using a neural network for learning a target 4.2 Quantitative Analysis language model (CSLM) (Schwenk, 2007). Espe- We tried the following combinations: cially, the comparison between the SMT system using CSLM and that using the proposed approach 1. Baseline configuration of phrase scoring by RNN Encoder–Decoder will 2. Baseline + RNN clarify whether the contributions from multiple 3. Baseline + CSLM + RNN neural networks in different parts of the SMT sys- 4. Baseline + CSLM + RNN + Word penalty

7. Source Translation Model RNN Encoder–Decoder at the end of the [a la fin de la] [´r la fin des ann´ees] [ˆetre sup- [`a la fin du] [`a la fin des] [`a la fin de la] prim´es a` la fin de la] for the first time [r c pour la premir¨ere fois] [´et´e donn´es pour [pour la premi`ere fois] [pour la premi`ere fois ,] la premi`ere fois] [´et´e comm´emor´ee pour la [pour la premi`ere fois que] premi`ere fois] in the United States [? aux ?tats-Unis et] [´et´e ouvertes aux Etats-´ [aux Etats-Unis et] [des Etats-Unis et] [des and ´ Unis et] [´et´e constat´ees aux Etats-Unis et] ´ Etats-Unis et] , as well as [?s , qu’] [?s , ainsi que] [?re aussi bien que] [, ainsi qu’] [, ainsi que] [, ainsi que les] one of the most [?t ?l’ un des plus] [?l’ un des plus] [ˆetre retenue [l’ un des] [le] [un des] comme un de ses plus] (a) Long, frequent source phrases Source Translation Model RNN Encoder–Decoder , Minister of Commu- [Secr´etaire aux communications et aux trans- [Secr´etaire aux communications et aux trans- nications and Trans- ports :] [Secr´etaire aux communications et aux ports] [Secr´etaire aux communications et aux port transports] transports :] did not comply with [vestimentaire , ne correspondaient pas a` des] [n’ ont pas respect´e les] [n’ e´ tait pas conforme the [susmentionn´ee n’ e´ tait pas conforme aux] aux] [n’ ont pas respect´e la] [pr´esent´ees n’ e´ taient pas conformes a` la] parts of the world . [ c gions du monde .] [r´egions du monde con- [parties du monde .] [les parties du monde .] sid´er´ees .] [r´egion du monde consid´er´ee .] [des parties du monde .] the past few days . [le petit texte .] [cours des tout derniers jours .] [ces derniers jours .] [les derniers jours .] [cours [les tout derniers jours .] des derniers jours .] on Friday and Satur- [vendredi et samedi a` la] [vendredi et samedi a` ] [le vendredi et le samedi] [le vendredi et samedi] day [se d´eroulera vendredi et samedi ,] [vendredi et samedi] (b) Long, rare source phrases Table 2: The top scoring target phrases for a small set of source phrases according to the translation model (direct translation probability) and by the RNN Encoder–Decoder. Source phrases were randomly selected from phrases with 4 or more words. ? denotes an incomplete (partial) character. r is a Cyrillic letter ghe. The results are presented in Table 1. As ex- were not able to achieve better performance on the pected, adding features computed by neural net- test set, but only on the development set. works consistently improves the performance over the baseline performance. 4.3 Qualitative Analysis The best performance was achieved when we In order to understand where the performance im- used both CSLM and the phrase scores from the provement comes from, we analyze the phrase pair RNN Encoder–Decoder. This suggests that the scores computed by the RNN Encoder–Decoder contributions of the CSLM and the RNN Encoder– against the corresponding p(f | e) from the trans- Decoder are not too correlated and that one can lation model. Since the existing translation model expect better results by improving each method in- relies solely on the statistics of the phrase pairs in dependently. Furthermore, we tried penalizing the the corpus, we expect its scores to be better esti- number of words that are unknown to the neural mated for the frequent phrases but badly estimated networks (i.e. words which are not in the short- for rare phrases. Also, as we mentioned earlier list). We do so by simply adding the number of in Sec. 3.1, we further expect the RNN Encoder– unknown words as an additional feature the log- Decoder which was trained without any frequency linear model in Eq. (9).3 However, in this case we information to score the phrase pairs based rather 3 To understand the effect of the penalty, consider the set on the linguistic regularities than on the statistics of all words in the 15,000 large shortlist, SL. All words xi ∈ / of their occurrences in the corpus. SL are replaced by a special token [UNK] before being scored by the neural networks. Hence, the conditional probability of We focus on those pairs whose source phrase is any xit ∈ / SL is actually given by the model as long (more than 3 words per source phrase) and p (xt = [UNK] | x<t ) = p (xt ∈ / SL | x<t ) As a result, the probability of words not in the shortlist is always overestimated. It is possible to address this issue by = p xjt | x<t ≥ p xit | x<t , backing off to an existing model that contain non-shortlisted j xt ∈SL / words (see (Schwenk, 2007)) In this paper, however, we opt for introducing a word penalty instead, which counteracts the where x<t is a shorthand notation for xt−1 , . . . , x1 . word probability overestimation.

8. Source Samples from RNN Encoder–Decoder at the end of the [`a la fin de la] (×11) for the first time [pour la premi`ere fois] (×24) [pour la premi`ere fois que] (×2) in the United States and ´ [aux Etats-Unis ´ et] (×6) [dans les Etats-Unis et] (×4) , as well as [, ainsi que] [,] [ainsi que] [, ainsi qu’] [et UNK] one of the most [l’ un des plus] (×9) [l’ un des] (×5) [l’ une des plus] (×2) (a) Long, frequent source phrases Source Samples from RNN Encoder–Decoder , Minister of Communica- [ , ministre des communications et le transport] (×13) tions and Transport did not comply with the [n’ tait pas conforme aux] [n’ a pas respect l’] (×2) [n’ a pas respect la] (×3) parts of the world . [arts du monde .] (×11) [des arts du monde .] (×7) the past few days . [quelques jours .] (×5) [les derniers jours .] (×5) [ces derniers jours .] (×2) on Friday and Saturday [vendredi et samedi] (×5) [le vendredi et samedi] (×7) [le vendredi et le samedi] (×4) (b) Long, rare source phrases Table 3: Samples generated from the RNN Encoder–Decoder for each source phrase used in Table 2. We show the top-5 target phrases out of 50 samples. They are sorted by the RNN Encoder–Decoder scores. Figure 4: 2–D embedding of the learned word representation. The left one shows the full embedding space, while the right one shows a zoomed-in view of one region (color–coded). For more plots, see the supplementary material. frequent. For each such source phrase, we look other phrase pairs that were scored radically dif- at the target phrases that have been scored high ferent (see Fig. 3). This could arise from the either by the translation probability p(f | e) or proposed approach of training the RNN Encoder– by the RNN Encoder–Decoder. Similarly, we per- Decoder on a set of unique phrase pairs, discour- form the same procedure with those pairs whose aging the RNN Encoder–Decoder from learning source phrase is long but rare in the corpus. simply the frequencies of the phrase pairs from the Table 2 lists the top-3 target phrases per source corpus, as explained earlier. phrase favored either by the translation model Furthermore, in Table 3, we show for each of or by the RNN Encoder–Decoder. The source the source phrases in Table 2, the generated sam- phrases were randomly chosen among long ones ples from the RNN Encoder–Decoder. For each having more than 4 or 5 words. source phrase, we generated 50 samples and show In most cases, the choices of the target phrases the top-five phrases accordingly to their scores. by the RNN Encoder–Decoder are closer to ac- We can see that the RNN Encoder–Decoder is tual or literal translations. We can observe that the able to propose well-formed target phrases with- RNN Encoder–Decoder prefers shorter phrases in out looking at the actual phrase table. Importantly, general. the generated phrases do not overlap completely Interestingly, many phrase pairs were scored with the target phrases from the phrase table. This similarly by both the translation model and the encourages us to further investigate the possibility RNN Encoder–Decoder, but there were as many of replacing the whole or a part of the phrase table

9.Figure 5: 2–D embedding of the learned phrase representation. The top left one shows the full represen- tation space (5000 randomly selected points), while the other three figures show the zoomed-in view of specific regions (color–coded). with the proposed RNN Encoder–Decoder in the with each other (see the zoomed-in plots in Fig. 4). future. The proposed RNN Encoder–Decoder naturally generates a continuous-space representation of a 4.4 Word and Phrase Representations phrase. The representation (c in Fig. 1) in this Since the proposed RNN Encoder–Decoder is not case is a 1000-dimensional vector. Similarly to the specifically designed only for the task of machine word representations, we visualize the representa- translation, here we briefly look at the properties tions of the phrases that consists of four or more of the trained model. words using the Barnes-Hut-SNE in Fig. 5. It has been known for some time that From the visualization, it is clear that the RNN continuous space language models using Encoder–Decoder captures both semantic and syn- neural networks are able to learn seman- tactic structures of the phrases. For instance, in tically meaningful embeddings (See, e.g., the bottom-left plot, most of the phrases are about (Bengio et al., 2003; Mikolov et al., 2013)). Since the duration of time, while those phrases that are the proposed RNN Encoder–Decoder also projects syntactically similar are clustered together. The to and maps back from a sequence of words into bottom-right plot shows the cluster of phrases that a continuous space vector, we expect to see a are semantically similar (countries or regions). On similar property with the proposed model as well. the other hand, the top-right plot shows the phrases The left plot in Fig. 4 shows the 2–D embedding that are syntactically similar. of the words using the word embedding matrix 5 Conclusion learned by the RNN Encoder–Decoder. The pro- jection was done by the recently proposed Barnes- In this paper, we proposed a new neural network Hut-SNE (van der Maaten, 2013). We can clearly architecture, called an RNN Encoder–Decoder see that semantically similar words are clustered that is able to learn the mapping from a sequence

10.of an arbitrary length to another sequence, possi- sion under the project MateCat, and by DARPA bly from a different set, of an arbitrary length. The under the BOLT project. proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a References source sequence. Along with the new architecture, [Auli et al.2013] Michael Auli, Michel Galley, Chris we proposed a novel hidden unit that includes a re- Quirk, and Geoffrey Zweig. 2013. Joint language set gate and an update gate that adaptively control and translation modeling with recurrent neural net- works. In Proceedings of the ACL Conference on how much each hidden unit remembers or forgets Empirical Methods in Natural Language Processing while reading/generating a sequence. (EMNLP), pages 1044–1054. We evaluated the proposed model with the task [Axelrod et al.2011] Amittai Axelrod, Xiaodong He, of statistical machine translation, where we used and Jianfeng Gao. 2011. Domain adaptation via the RNN Encoder–Decoder to score each phrase pseudo in-domain data selection. In Proceedings of pair in the phrase table. Qualitatively, we were the ACL Conference on Empirical Methods in Natu- able to show that the new model is able to cap- ral Language Processing (EMNLP), pages 355–362. ture linguistic regularities in the phrase pairs well [Bastien et al.2012] Fr´ed´eric Bastien, Pascal Lamblin, and also that the RNN Encoder–Decoder is able to Razvan Pascanu, James Bergstra, Ian J. Goodfellow, propose well-formed target phrases. Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed im- The scores by the RNN Encoder–Decoder were provements. Deep Learning and Unsupervised Fea- found to improve the overall translation perfor- ture Learning NIPS 2012 Workshop. mance in terms of BLEU scores. Also, we found that the contribution by the RNN Encoder– [Bengio et al.2003] Yoshua Bengio, R´ejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neu- Decoder is rather orthogonal to the existing ap- ral probabilistic language model. J. Mach. Learn. proach of using neural networks in the SMT sys- Res., 3:1137–1155, March. tem, so that we can improve further the perfor- [Bengio et al.2013] Y. Bengio, N. Boulanger- mance by using, for instance, the RNN Encoder– Lewandowski, and R. Pascanu. 2013. Advances Decoder and the neural net language model to- in optimizing recurrent networks. In Proceedings gether. of the 38th International Conference on Acoustics, Our qualitative analysis of the trained model Speech, and Signal Processing (ICASSP 2013), May. shows that it indeed captures the linguistic regu- larities in multiple levels i.e. at the word level as [Bergstra et al.2010] James Bergstra, Olivier Breuleux, well as phrase level. This suggests that there may Fr´ed´eric Bastien, Pascal Lamblin, Razvan Pascanu, be more natural language related applications that Guillaume Desjardins, Joseph Turian, David Warde- Farley, and Yoshua Bengio. 2010. Theano: a CPU may benefit from the proposed RNN Encoder– and GPU math expression compiler. In Proceedings Decoder. of the Python for Scientific Computing Conference The proposed architecture has large potential (SciPy), June. Oral Presentation. for further improvement and analysis. One ap- [Chandar et al.2014] Sarath Chandar, Stanislas Lauly, proach that was not investigated here is to re- Hugo Larochelle, Mitesh Khapra, Balaraman Ravin- place the whole, or a part of the phrase table by dran, Vikas Raykar, and Amrita Saha. 2014. An au- letting the RNN Encoder–Decoder propose target toencoder approach to learning bilingual word repre- sentations. arXiv:1402.1454 [cs.CL], Febru- phrases. Also, noting that the proposed model is ary. not limited to being used with written language, it will be an important future research to apply the [Dahl et al.2012] George E. Dahl, Dong Yu, Li Deng, proposed architecture to other applications such as and Alex Acero. 2012. Context-dependent pre- trained deep neural networks for large vocabulary speech transcription. speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):33–42. Acknowledgments [Devlin et al.2014] Jacob Devlin, Rabih Zbib, KC, BM, CG, DB and YB would like to thank Zhongqiang Huang, Thomas Lamar, Richard Schwartz, , and John Makhoul. 2014. Fast and NSERC, Calcul Qu´ebec, Compute Canada, the robust neural network joint models for statistical Canada Research Chairs and CIFAR. FB and HS machine translation. In Proceedings of the ACL were partially funded by the European Commis- 2014 Conference, ACL ’14, pages 1370–1380.

11.[Gao et al.2013] Jianfeng Gao, Xiaodong He, Wen tau [Pascanu et al.2014] R. Pascanu, C. Gulcehre, K. Cho, Yih, and Li Deng. 2013. Learning semantic repre- and Y. Bengio. 2014. How to construct deep recur- sentations for the phrase translation model. Techni- rent neural networks. In Proceedings of the Second cal report, Microsoft Research. International Conference on Learning Representa- tions (ICLR 2014), April. [Glorot et al.2011] X. Glorot, A. Bordes, and Y. Ben- gio. 2011. Deep sparse rectifier neural networks. In [Saxe et al.2014] Andrew M. Saxe, James L. McClel- AISTATS’2011. land, and Surya Ganguli. 2014. Exact solutions to the nonlinear dynamics of learning in deep lin- [Goodfellow et al.2013] Ian J. Goodfellow, David ear neural networks. In Proceedings of the Second Warde-Farley, Mehdi Mirza, Aaron Courville, and International Conference on Learning Representa- Yoshua Bengio. 2013. Maxout networks. In tions (ICLR 2014), April. ICML’2013. [Schwenk et al.2006] Holger Schwenk, Marta R. Costa- Juss`a, and Jos´e A. R. Fonollosa. 2006. Continuous [Graves2012] Alex Graves. 2012. Supervised Se- space language models for the iwslt 2006 task. In quence Labelling with Recurrent Neural Networks. IWSLT, pages 166–173. Studies in Computational Intelligence. Springer. [Schwenk2007] Holger Schwenk. 2007. Continuous [Hochreiter and Schmidhuber1997] S. Hochreiter and space language models. Comput. Speech Lang., J. Schmidhuber. 1997. Long short-term memory. 21(3):492–518, July. Neural Computation, 9(8):1735–1780. [Schwenk2012] Holger Schwenk. 2012. Continuous [Kalchbrenner and Blunsom2013] Nal Kalchbrenner space translation models for phrase-based statisti- and Phil Blunsom. 2013. Two recurrent continuous cal machine translation. In Martin Kay and Chris- translation models. In Proceedings of the ACL Con- tian Boitet, editors, Proceedings of the 24th Inter- ference on Empirical Methods in Natural Language national Conference on Computational Linguistics Processing (EMNLP), pages 1700–1709. (COLIN), pages 1071–1080. [Koehn et al.2003] Philipp Koehn, Franz Josef Och, [Socher et al.2011] Richard Socher, Eric H. Huang, Jef- and Daniel Marcu. 2003. Statistical phrase-based frey Pennington, Andrew Y. Ng, and Christopher D. translation. In Proceedings of the 2003 Conference Manning. 2011. Dynamic pooling and unfolding of the North American Chapter of the Association recursive autoencoders for paraphrase detection. In for Computational Linguistics on Human Language Advances in Neural Information Processing Systems Technology - Volume 1, NAACL ’03, pages 48–54. 24. [Son et al.2012] Le Hai Son, Alexandre Allauzen, and [Koehn2005] P. Koehn. 2005. Europarl: A parallel cor- Franc¸ois Yvon. 2012. Continuous space transla- pus for statistical machine translation. In Machine tion models with neural networks. In Proceedings of Translation Summit X, pages 79–86, Phuket, Thai- the 2012 Conference of the North American Chap- land. ter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, [Krizhevsky et al.2012] Alex Krizhevsky, Ilya pages 39–48, Stroudsburg, PA, USA. Sutskever, and Geoffrey Hinton. 2012. Ima- geNet classification with deep convolutional neural [van der Maaten2013] Laurens van der Maaten. 2013. networks. In Advances in Neural Information Barnes-hut-sne. In Proceedings of the First Inter- Processing Systems 25 (NIPS’2012). national Conference on Learning Representations (ICLR 2013), May. [Marcu and Wong2002] Daniel Marcu and William Wong. 2002. A phrase-based, joint probability [Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, model for statistical machine translation. In Pro- Victoria Fossum, and David Chiang. 2013. De- ceedings of the ACL-02 Conference on Empirical coding with large-scale neural language models im- Methods in Natural Language Processing - Volume proves translation. Proceedings of the Conference 10, EMNLP ’02, pages 133–139. on Empirical Methods in Natural Language Pro- cessing, pages 1387–1392. [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: Kai Chen, Greg Corrado, and Jeff Dean. 2013. Dis- an adaptive learning rate method. Technical report, tributed representations of words and phrases and arXiv 1212.5701. their compositionality. In Advances in Neural Infor- mation Processing Systems 26, pages 3111–3119. [Zou et al.2013] Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. [Moore and Lewis2010] Robert C. Moore and William 2013. Bilingual word embeddings for phrase-based Lewis. 2010. Intelligent selection of language machine translation. In Proceedings of the ACL model training data. In Proceedings of the ACL Conference on Empirical Methods in Natural 2010 Conference Short Papers, ACLShort ’10, Language Processing (EMNLP), pages 1393–1398. pages 220–224, Stroudsburg, PA, USA.

12.A RNN Encoder–Decoder In this document, we describe in detail the architecture of the RNN Encoder–Decoder used in the exper- iments. Let us denote an source phrase by X = (x1 , x2 , . . . , xN ) and a target phrase by Y = (y1 , y2 , . . . , yM ). Each phrase is a sequence of K-dimensional one-hot vectors, such that only one element of the vector is 1 and all the others are 0. The index of the active (1) element indicates the word represented by the vector. A.1 Encoder Each word of the source phrase is embedded in a 500-dimensional vector space: e(xi ) ∈ R500 . e(x) is used in Sec. 4.4 to visualize the words. The hidden state of an encoder consists of 1000 hidden units, and each one of them at time t is computed by t t−1 ˜t, hj = zj hj + (1 − zj )hj where ˜ t = tanh [We(xt )] + U r h h t−1 , j j j zj =σ [Wz e(xt )]j + Uz h t−1 j , rj =σ [Wr e(xt )]j + Ur h t−1 j . σ and are a logistic sigmoid function and an element-wise multiplication, respectively. To make the 0 equations uncluttered, we omit biases. The initial hidden state hj is fixed to 0. Once the hidden state at the N step (the end of the source phrase) is computed, the representation of the source phrase c is N c = tanh Vh . A.1.1 Decoder The decoder starts by initializing the hidden state with 0 h = tanh V c , where we will use · to distinguish parameters of the decoder from those of the encoder. The hidden state at time t of the decoder is computed by t t−1 t h j = z jh j + (1 − z j )h˜ j , where t h˜ j = tanh W e(yt−1 ) j +rj Uh t−1 + Cc , z j =σ W z e(yt−1 ) j + U zh t−1 j + [Cz c]j , r j =σ W r e(yt−1 ) j + U rh t−1 j + [Cr c]j , and e(y0 ) is an all-zero vector. Similarly to the case of the encoder, e(y) is an embedding of a target word. Unlike the encoder which simply encodes the source phrase, the decoder is learned to generate a target phrase. At each time t, the decoder computes the probability of generating j-th word by exp gj s t p(yt,j = 1 | yt−1 , . . . , y1 , X) = K , j =1 exp gj s t

13.where the i-element of s t is t t t si = max s 2i−1 , s 2i and t t s = Oh h + Oy yt−1 + Oc c. t In short, the si is a so-called maxout unit. For the computational efficiency, instead of a single-matrix output weight G, we use a product of two matrices such that G = Gl Gr , where Gl ∈ RK×500 and Gr ∈ R500×1000 . B Word and Phrase Representations Here, we show enlarged plots of the word and phrase representations in Figs. 4–5.

14.Figure 6: 2–D embedding of the learned word representation. The top left one shows the full embedding space, while the other three figures show the zoomed-in view of specific regions (color–coded).

15.Figure 7: 2–D embedding of the learned phrase representation. The top left one shows the full representation space (1000 randomly selected points), while the other three figures show the zoomed-in view of specific regions (color–coded).