Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input

Non-autoregressive translation (NAT) models, which remove the dependence on previous target tokens from the inputs of the decoder, achieve significant inference speedup but at the cost of inferior accuracy compared to autoregressive translation (AT) models. Previous work shows that the quality of the decoder inputs is important and largely determines model accuracy. In this paper, we propose two methods to enhance the decoder inputs so as to improve NAT models. The first one directly leverages a phrase table generated by conventional SMT approaches to translate source tokens to target tokens, which are then fed into the decoder as inputs. The second one transforms source-side word embeddings to target-side word embeddings through sentence-level alignment and word-level adversary learning, and then feeds the transformed word embeddings into the decoder as inputs. Experimental results show that our method outperforms the NAT baseline (Gu et al. 2017) by 5.11 BLEU on the WMT14 English-German task and 4.72 BLEU on the WMT16 English-Romanian task.

Junliang Guo†∗, Xu Tan‡, Di He§, Tao Qin‡, Linli Xu† and Tie-Yan Liu‡
† Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China
‡ Microsoft Research
§ Key Laboratory of Machine Perception (MOE), School of EECS, Peking University
arXiv:1812.09664v1 [cs.CL] 23 Dec 2018
∗ The work was done when the first author was an intern at Microsoft Research Asia.
Copyright © 2019, Association for the Advancement of Artificial Intelligence. All rights reserved.

1 Introduction

The neural network based encoder-decoder framework has achieved very promising performance for machine translation, and different network architectures have been proposed, including RNNs (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014; Cho et al. 2014a; Wu et al. 2016), CNNs (Gehring et al. 2017), and the self-attention based Transformer (Vaswani et al. 2017). All those models translate a source sentence in an autoregressive manner, i.e., they generate a target sentence word by word from left to right (Wu et al. 2018), and the generation of the t-th token $y_t$ depends on the previously generated tokens $y_{1:t-1}$:

$$y_t = D(y_{1:t-1}, E(x)), \tag{1}$$

where $E(\cdot)$ and $D(\cdot)$ denote the encoder and decoder of the model respectively, $x$ is the source sentence and $E(x)$ is the output of the encoder, i.e., the set of hidden representations in the top layer of the encoder.

Since AT models generate target tokens sequentially, the inference speed becomes a bottleneck for real-world translation systems, in which fast response and low latency are expected. To speed up the inference of machine translation, non-autoregressive models (Gu et al. 2017) have been proposed, which generate all target tokens independently and simultaneously. Instead of using previously generated tokens as in AT models, NAT models take other global signals derived from the source sentence as input. Specifically, the Non-AutoRegressive Transformer (NART) (Gu et al. 2017) takes a copy of the source sentence $x$ as the decoder input, and the copy process is guided by fertilities (Brown et al. 1993), which represent how many times each source token will be copied; after that, all target tokens are predicted simultaneously:

$$y_t = D(\hat{x}, E(x)), \tag{2}$$

where $\hat{x} = (\hat{x}_1, ..., \hat{x}_{T_y})$ is the copied source sentence and $T_y$ is the length of the target sentence $y$.

While NAT models significantly reduce the inference latency, they suffer from accuracy degradation compared with their autoregressive counterparts. We notice that the encoder of AT models and that of NAT models are the same; the differences lie in the decoder. In AT models, the generation of the t-th token $y_t$ is conditioned on the previously generated tokens $y_{1:t-1}$, which provide strong target-side context information. In contrast, as NART models generate tokens in parallel, there is no such target-side information available. Although the fertilities are learned to cover target-side information in NART (Gu et al. 2017), the information contained in the copied source tokens $\hat{x}$ guided by fertilities is indirect and weak, because the copied tokens are still in the domain of the source language, while the inputs of the decoder of AT models are target-side tokens $y_{1:t-1}$. Consequently, the decoder of a NAT model has to handle the translation task conditioned on less and weaker information than its AT counterpart, which leads to inferior accuracy. As verified by our study (see Figure 2 and Table 3), NART performs poorly on long sentences, which need stronger target-side conditional information for correct translation than short sentences.

In this paper, we aim to enhance the decoder inputs of NAT models so as to reduce the difficulty of the task that the decoder needs to handle. Our basic idea is to directly feed target-side tokens as the inputs of the decoder. We propose two concrete methods to generate the decoder input $\hat{y} = (\hat{y}_1, ..., \hat{y}_{T_y})$, which contains coarse target-side information. The first one is based on a phrase table, and explicitly translates source tokens into target-side tokens through a pre-trained phrase table. The second one linearly maps the embeddings of source tokens into the target-side embedding space, and the mapped embeddings are then fed into the decoder. The mapping is learned in an end-to-end manner by minimizing the L2 distance between the mapped source embeddings and the target embeddings at the sentence level, as well as the adversary loss between the mapped source embeddings and the target embeddings at the word level.

With target-side information as inputs, the decoder works as follows:

$$y_t = D(\hat{y}, E(x)), \tag{3}$$

where $\hat{y}$ is the enhanced decoder input provided by our methods. The decoder can now generate all $y_t$'s in parallel conditioned on the global information $\hat{y}$, which is closer to the target tokens $y_{1:t-1}$ used in the AT model. In this way, the difficulty of the task for the decoder is largely reduced.

We conduct experiments on three tasks to verify the proposed method. On the WMT14 English-German, WMT16 English-Romanian and IWSLT14 German-English translation tasks, our model outperforms all compared non-autoregressive baseline models. Specifically, we obtain BLEU scores of 24.28 and 34.51, which outperform the non-autoregressive baseline (19.17 and 29.79 reported in Gu et al. (2017)) on the WMT14 En-De and WMT16 En-Ro tasks.

2 Background

2.1 Autoregressive Neural Machine Translation

Deep neural networks with the encoder-decoder framework have achieved great success on machine translation, with different choices of architectures such as recurrent neural networks (RNNs) (Bahdanau, Cho, and Bengio 2014; Cho et al. 2014b), convolutional neural networks (CNNs) (Gehring et al. 2017), as well as the self-attention based Transformer (Vaswani et al. 2017; He et al. 2018). Early RNN-based models have an inherently sequential architecture that prevents them from being parallelized during training and inference, which is partially solved by CNNs and self-attention based models (Kalchbrenner et al. 2016; Gehring et al. 2017; Shen et al. 2018; Vaswani et al. 2017; Song et al. 2018). Since the entire target translation is exposed to the model at training time, each input token of the decoder is the previous ground truth token, and the whole training can be parallelized given well-designed CNNs or self-attention models. However, the autoregressive nature still creates a bottleneck at the inference stage, since without ground truth, the prediction of each target token has to condition on previously predicted tokens. See Table 1 for a comparison between models about whether they are parallelizable.

| Models | | Training | Inference |
|---|---|---|---|
| AT models | RNNs based | × | × |
| AT models | CNNs based | √ | × |
| AT models | Self-Attention based | √ | × |
| NAT models | | √ | √ |

Table 1: Comparison between Autoregressive Translation (AT) and Non-Autoregressive Translation (NAT) models about whether they are parallelizable in different stages.

2.2 Non-Autoregressive Neural Machine Translation

We generally denote the decoder input as $z = (z_1, ..., z_{T_y})$ to be consistent in the rest of our paper, which represents $\hat{x}$ and $\hat{y}$ in Equations (2) and (3). The recently proposed non-autoregressive model NART (Gu et al. 2017) breaks the inference bottleneck by exposing all decoder inputs to the network simultaneously. The generation of $z$ is guided by a fertility prediction function, which predicts how many target tokens each source token translates to; the source tokens are then repeatedly copied according to their fertilities to form the decoder input $z$. Given $z$, the conditional probability $P(y|x)$ is defined as:

$$P(y|x, z) = \prod_{t=1}^{T_y} P(y_t|z, x) = \prod_{t=1}^{T_y} P(y_t|z, E(x; \theta_{enc}); \theta_{dec}), \tag{4}$$

where $T_y$ is the length of the target sentence, which equals the sum of all fertility numbers, and $\theta_{enc}$ and $\theta_{dec}$ denote the parameters of the encoder and decoder. The negative log-likelihood loss function for the NAT model becomes:

$$L_{neg}(x, y; \theta_{enc}, \theta_{dec}) = -\sum_{t=1}^{T_y} \log P(y_t|z, x). \tag{5}$$

Although non-autoregressive models can achieve a 15× speedup compared to autoregressive models, they also suffer from accuracy degradation. Since the conditional dependencies within the target sentence ($y_t$ depends on $y_{1:t-1}$) are removed from the decoder input, the decoder is unable to leverage the inherent sentence structure for prediction. Hence the decoder has to figure out such target-side information by itself, using only the source-side information during training, which is a much more challenging task than that of its autoregressive counterpart. From our study, we find that the NART model fails to handle target sentence generation well. It usually generates repetitive and semantically incoherent sentences with missing words, as shown in Table 3. Therefore, strong conditional signals should be introduced as the decoder input to help the model learn better internal dependencies within a sentence.

3 Methodology

As discussed in Section 1, to improve the accuracy of NAT models, we need to enhance the inputs of the decoder. We introduce our model, Enhanced Non-Autoregressive Transformer (ENAT), in this section. We design two kinds of enhanced inputs: one is token-level enhancement based on phrase-table lookup, and the other is embedding-level enhancement based on embedding mapping. An illustration of the phrase-table lookup and embedding mapping can be found in Figure 1.

[Figure 1: two panels, "Phrase-Table Lookup" and "Embedding Mapping". Each panel shows a Transformer encoder and decoder taking the source input embedding and the decoder input embedding. In the first panel the decoder input z is generated from the source input x by a phrase-table lookup (a non-differentiable operation). In the second panel the decoder input is the source embedding mapped by W, trained with sentence-level alignment and word-level adversary learning against the golden target sequence y (differentiable operations).]

Figure 1: The architecture of our model. A concrete description of fine-grained modules can be found in Section 4.2.

3.1 Phrase-Table Lookup

Previous NAT models take tokens in the source language as decoder inputs, which makes the decoding task difficult. Considering that AT models take (already generated) target tokens as inputs, a straightforward idea to enhance decoder inputs is to also feed tokens in the target language into the decoder of NAT models. Given a source sentence, a simple method to get target tokens is to translate the source tokens to target tokens using a phrase table, which brings negligible latency in inference.

To implement this idea, we pre-train a phrase table on the bilingual training corpus using Moses (Koehn et al. 2007), an open-source statistical machine translation (SMT) toolkit. We then greedily segment the source sentence into $T_p$ phrases and translate the phrases one by one according to the phrase table. The details are as follows. We first calculate the maximum length $L$ among all the phrases contained in the phrase table. For the i-th source word $x_i$, we first check whether the phrase $x_{i:i+L}$ has a translation in the phrase table; if not, we check $x_{i:i+L-1}$, and so on. If there exists a phrase translation for $x_{i:i+L-j}$, we translate it and then check the translation starting at $x_{i+L-j+1}$ following the same strategy. This procedure only brings 0.14 ms latency per sentence on average over the newstest2014 test set on an Intel Xeon E5-2690 CPU, which is negligible compared with the whole inference latency (e.g., 25 to 200+ ms) of the NAT model, as shown in Table 2.

Note that to reduce inference latency, we only search the phrase table to obtain a coarse phrase-to-phrase translation, without utilizing the full SMT procedure (including language model scoring and tree-based searching). During inference, we generate $z$ by the phrase-table lookup and skip phrases that do not have translations.

3.2 Embedding Mapping

As the phrase table is pre-trained by an SMT system, it cannot be updated or optimized during NAT model training, and it may lead to poor translation quality if the table is not very accurate. Therefore, we propose the embedding mapping approach, which first linearly maps the source token embeddings to target embeddings and feeds them into the decoder as inputs. This linear mapping can be trained end-to-end together with NAT models.

To be concrete, given the source sentence $x = (x_1, ..., x_{T_x})$ and its corresponding embedding matrix $E_x \in \mathbb{R}^{T_x \times d}$, where $d$ is the dimensionality of embeddings, we transform $E_x$ into the target-side embedding space by a linear mapping function $f_G$:

$$E_{\tilde{z}} = f_G(E_x; W) = E_x W, \tag{6}$$

where $W \in \mathbb{R}^{d \times d}$ is the projection matrix to be learned and $E_{\tilde{z}} \in \mathbb{R}^{T_x \times d}$ is the decoder input candidate, which has the same number of tokens as the source sentence $x$. We then reconstruct $E_{\tilde{z}} \in \mathbb{R}^{T_x \times d}$ into the final decoder input $E_z \in \mathbb{R}^{T_y \times d}$, whose length is identical to the length of the target sentence, by a simple method which will be introduced in the next section. Intuitively, $E_z$ should contain coarse target-side information, namely the translation of the corresponding source tokens in the embedding space, although in a similar order to the source tokens. To ensure that the projection matrix $W$ can be learned end-to-end with the NAT model, we regularize the learning of $W$ with sentence-level alignment and word-level adversary learning.

Since we already have the sentence-level alignment from the training set, we can minimize the L2 distance between the mapped source embeddings and the ground truth target embeddings at the sentence level:

$$L_{align}(x, y) = \| f_G(e(x)) - e(y) \|_2, \tag{7}$$

where $e(x) = \frac{1}{T_x} \sum_{i=1}^{T_x} e(x_i)$ is the embedding of the source sentence $x$, which is simply the average of the embeddings of all source tokens. $e(y)$ is the embedding of the target sentence $y$, which is defined in the same way.

As the regularization in Equation (7) only ensures a coarse alignment between the sentence embeddings, each of which is simply the average of its word embeddings, it misses the fine-grained token-level alignment. Therefore, we propose word-level adversary learning, considering that we do not have a supervision signal for the word-level mapping. Specifically, we use a Generative Adversarial Network (GAN) (Goodfellow et al. 2014) to regularize the projection matrix $W$; GANs are widely used in NLP tasks such as unsupervised word translation (Conneau et al. 2017) and text generation (Yu et al. 2017). The discriminator $f_D$ takes an embedding as input and outputs a confidence score between 0 and 1 to differentiate the embeddings mapped from source tokens, i.e., $E_z$, from the ground truth embeddings of the target tokens, i.e., $E_y$, during training. The linear mapping function $f_G$ acts as the generator, whose goal is to provide a plausible $E_z$ that is indistinguishable from $E_y$ in the embedding space, so as to fool the discriminator. We implement the discriminator as a two-layer multi-layer perceptron (MLP). Although other architectures such as CNNs can also be chosen, we find that the simple MLP achieves fairly good performance.

Formally, given the linear mapping function $f_G(\cdot; W)$, i.e., the generator, and the discriminator $f_D(\cdot; \theta_D)$, the adversarial training objective $L_{adv}$ can be written as:

$$L_{adv}(x, y) = \min_{W} \max_{\theta_D} V_{word}(f_G, f_D), \tag{8}$$

where $V_{word}$ is the word-level value function which encourages every word in $z$ and $y$ to be distinguishable:

$$V_{word}(f_G, f_D) = \mathbb{E}_{e(y_i) \sim E_y}[\log f_D(e(y_i))] + \mathbb{E}_{e(x_j) \sim E_x}[\log(1 - f_D(f_G(e(x_j))))], \tag{9}$$

where $e(x_j)$ and $e(y_i)$ indicate the embeddings of the j-th source token and the i-th target token respectively. In conclusion, for each training pair $(x, y)$, along with the original negative log-likelihood loss $L_{neg}(x, y)$ defined in Equation (5), the total loss function of our model is:

$$\min_{\Theta} \max_{\theta_D} L(x, y) = L_{neg}(x, y; \theta_{enc}, \theta_{dec}) + \mu L_{align}(x, y; W) + \lambda L_{adv}(x, y; \theta_D, W), \tag{10}$$

where $\Theta = (\theta_{enc}, \theta_{dec}, W)$ and $\theta_D$ consist of all parameters that need to be learned, while $\mu$ and $\lambda$ are hyper-parameters that control the weights of the different losses.

3.3 Discussion

The approach of phrase-table lookup is simple and efficient. It achieves considerable performance in experiments by providing direct token-level enhancements, when the phrase table is good enough. However, when the training data is messy and noisy, the generated phrase table might be of low quality and consequently hurt NAT model training. We observe that the phrase table trained by Moses can obtain fairly good performance on small and clean datasets such as IWSLT14, but very poor performance on big and noisy datasets such as WMT14. See Section 5.3 for more details. In contrast, the approach of embedding mapping learns to adjust the mapping function together with the training of NAT models, resulting in more stable results.

As for the two components proposed in embedding mapping, the sentence-level alignment $L_{align}$ leverages bilingual supervision, which can guide the learning of the mapping function well but lacks the fine-grained word-level mapping signal; the word-level adversary loss $L_{adv}$ can provide complementary information to $L_{align}$. Our ablation study in Section 5.3 (see Table 5) verifies the benefit of combining the two loss functions.

4 Experimental Setup

4.1 Datasets

We evaluate our model on three widely used public machine translation datasets: IWSLT14 De-En, WMT14 En-De and WMT16 En-Ro, which have 153K/4.5M/2.9M bilingual sentence pairs in their respective training sets. For WMT14 tasks, newstest2013 and newstest2014 are used as the validation and test set respectively. For the WMT16 En-Ro task, newsdev2016 is the validation set and newstest2016 is used as the test set. For IWSLT14 De-En, we use 7K sentence pairs split from the training set as the validation set and use the concatenation of dev2010, tst2010, tst2011 and tst2012 as the test set, which is widely used in prior works (Ranzato et al. 2015; Bahdanau et al. 2016). All the data are tokenized and segmented into subword tokens using byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2015), and we share the source and target vocabulary and embeddings in each language pair. The phrase table is extracted from each training set by Moses (Koehn et al. 2007), and we follow the default hyper-parameters in the toolkit.

4.2 Model Configurations

We follow the same encoder and decoder architecture as Transformer (Vaswani et al. 2017). The encoder is composed of multi-head attention modules and feed-forward networks, which are all fully parallelizable. In order to make the decoding process parallelizable, we cannot use target tokens as decoder input because such strong signals are unavailable during inference. Instead, we use the input introduced in Section 3. There exists the problem of a length mismatch between the decoder input $z$ and the target sentence, which is solved by a simple and efficient method. Given the decoder input candidate $\tilde{z} = (\tilde{z}_1, ..., \tilde{z}_{T_{\tilde{z}}})$, which is provided either by phrase-table lookup or by Equation (6), the j-th element of the decoder input $z = (z_1, ..., z_{T_y})$ is computed as $z_j = \sum_i w_{ij} \cdot e(\tilde{z}_i)$, where $w_{ij} = \exp(-(j - j'(i))^2 / \tau)$, $j'(i) = i \cdot \frac{T_y}{T_{\tilde{z}}}$, and $\tau$ is a hyper-parameter controlling the sharpness of the function, which is set to 0.3 in all tasks.

We also use multi-head self-attention and encoder-to-decoder attention, as well as feed-forward networks for the decoder, as used in Transformer (Vaswani et al. 2017). Considering that the enhanced decoder input has the same word order as the source sentence, we add multi-head positional attention to rearrange the local word order within a sentence, as used in NART (Gu et al. 2017). Therefore, the three kinds of attention along with residual connections (He et al. 2016) and layer normalization (Ba, Kiros, and Hinton 2016) constitute our model.

To enable a fair comparison, we use the same network architectures as in NART (Gu et al. 2017). Specifically, for the WMT14 and WMT16 datasets, we use the default hyper-parameters of the base model described in Vaswani et al. (2017), whose encoder and decoder both have 6 layers, the size of hidden states and embeddings is set to 512, and the number of heads is set to 8. As IWSLT14 is a smaller dataset, we choose a smaller architecture as well, which consists of a 5-layer encoder and a 5-layer decoder. The size of hidden states and embeddings is set to 256, and the number of heads is set to 4.

4.3 Training and Inference

We follow the optimizer settings in Vaswani et al. (2017). Models on WMT/IWSLT tasks are trained on 8/1 NVIDIA M40 GPUs respectively. We set $\mu = 0.1$ and $\lambda = 1.0$ in Equation (10) for all tasks to ensure that $L_{neg}$, $L_{align}$ and $L_{adv}$ are on the same scale. We implement our model in TensorFlow (Abadi et al. 2016). We provide a detailed description of the knowledge distillation and the inference stage below.

Sequence-Level Knowledge Distillation. During training, we apply the same knowledge distillation method used in (Kim and Rush 2016; Gu et al. 2017; Li et al. 2019). We first train an autoregressive teacher model which has the same architecture as the non-autoregressive student model, and collect the translations of each source sentence in the training set by beam search, which are then used as the ground truth for training the student. By doing so, we provide less noisy and more deterministic training data, which makes the NAT model easier to learn (Kim and Rush 2016; Ott et al. 2018; Gong et al. 2019). Specifically, we pre-train the state-of-the-art Transformer (Vaswani et al. 2017) architecture as the autoregressive teacher model, and the beam size during decoding is set to 4.

Inference. During inference, we do not know the target length $T_y$. Therefore we first calculate the average ratio between target and source sentence length in the training set, denoted as $\alpha$, and then predict the target length within the range $[\lfloor \alpha \cdot T_{\tilde{z}} \rceil - B, \lfloor \alpha \cdot T_{\tilde{z}} \rceil + B]$, where $\lfloor \cdot \rceil$ denotes the rounding operation and $B$ is half of the searching window. This length prediction method relies on the intuition that the lengths of the source sentence and the target sentence are similar. $B = 0$ indicates the greedy output that only generates a single translation result for a source sentence. When $B \geq 1$, there are multiple translations for one source sentence, so we utilize the autoregressive teacher model to rescore them and select the final translation. During inference, $\alpha$ is set to 1.1 for English-to-Others tasks and 0.9 for Others-to-English tasks, and we try both $B = 0$ and $B = 4$, which result in 1 and 9 candidates respectively. We use BLEU scores (Papineni et al. 2002) as the evaluation metric (see footnote 4).

Footnote 4: We report tokenized and case-sensitive BLEU scores for WMT14 En-De and WMT16 En-Ro to keep consistent with NART (Gu et al. 2017), as well as tokenized and case-insensitive scores for IWSLT14 De-En, which is common practice in the literature (Wu et al. 2016; Vaswani et al. 2017).

As for efficiency, the decoder input $z$ is obtained through table lookup or a multiplication between dense matrices, which brings negligible additional latency. The teacher model rescoring procedure introduced above is fully parallelizable, as it is identical to the teacher forcing training process in autoregressive models, and thus does not increase the latency much. We analyze the inference latency per sentence and demonstrate the efficiency of our model in experiments.

5 Results

5.1 Translation Quality and Inference Latency

We compare our model with non-autoregressive baselines including NART (Gu et al. 2017), a semi-non-autoregressive model Latent Transformer (LT) (Kaiser et al. 2018) which incorporates an autoregressive module into NART, as well as Iterative Refinement NAT (IR-NAT) (Lee, Mansimov, and Cho 2018) which trains extra decoders to iteratively refine the translation output; for IR-NAT we list the "Adaptive" results reported in their paper. We also compare with strong autoregressive baselines that are based on LSTM (Wu et al. 2016; Bahdanau et al. 2016) and self-attention (Vaswani et al. 2017). We also list the translation quality obtained purely by lookup from the phrase table, denoted as Phrase-Table Lookup, which serves as the decoder input of the ENAT phrase-table lookup model. For inference latency, the average per-sentence decoding latency on the WMT14 En-De task over the newstest2014 test set is also reported, measured on a single NVIDIA P100 GPU to keep consistent with NART (Gu et al. 2017). Results are shown in Table 2.

Across datasets, our model achieves state-of-the-art performance among all non-autoregressive baselines. Specifically, our model outperforms NART with rescoring 10 candidates by 4.26 to 5.62 BLEU on different tasks. Compared to autoregressive models, our model is only 1.1 BLEU behind its Transformer teacher on the En-Ro task, and we also outperform the state-of-the-art LSTM-based baseline (Wu et al. 2016) on the IWSLT14 De-En task. The promising results demonstrate that the proposed method makes the decoder easier to learn by providing a strong input close to the target tokens, and thus results in a better model. For inference latency, NART needs to first predict the fertilities of the source sentence before the translation process, which is slower than the phrase-table lookup procedure and the matrix multiplication in our method. Moreover, our method outperforms NART with rescoring 100 candidates on all tasks while being nearly 5 times faster, which also demonstrates the advantages of the enhanced decoder input.

Translation Quality w.r.t. Different Lengths. We compare the translation quality between AT (Vaswani et al. 2017), NART (Gu et al. 2017) and our method with regard to different sentence lengths. We conduct the analysis on the WMT14 En-De test set and divide the sentence pairs into different length buckets according to the length of the reference sentence. The results are shown in Figure 2. It can be seen that as sentence length increases, the accuracy of the NART model drops quickly and the gap between the AT and NART models also enlarges. Our method achieves larger improvements on longer sentences, which demonstrates that NART performs worse on long sentences due to the weak decoder input, while our enhanced decoder input provides strong conditional information for the decoder, resulting in larger accuracy improvements on these sentences.
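The greedy longest-match segmentation described in Section 3.1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the phrase table is modeled as a plain dictionary from source-token tuples to one target-token list (a real Moses phrase table stores several scored candidates per phrase), the name `greedy_phrase_lookup` and the toy entries are hypothetical, and "maximum length L" is interpreted as the largest number of tokens in any source phrase.

```python
from typing import Dict, List, Tuple

# Assumed toy representation: source phrase (tuple of tokens) -> target tokens.
PhraseTable = Dict[Tuple[str, ...], List[str]]


def greedy_phrase_lookup(source: List[str], table: PhraseTable) -> List[str]:
    """Translate `source` by greedy longest-match lookup in `table`.

    At each position the longest phrase with an entry is translated and the
    scan continues right after it; tokens without any entry are skipped.
    """
    if not table:
        return []
    max_len = max(len(phrase) for phrase in table)  # L in Section 3.1
    output: List[str] = []
    i = 0
    while i < len(source):
        # Try the longest possible phrase first, then shorter ones.
        for n in range(min(max_len, len(source) - i), 0, -1):
            phrase = tuple(source[i:i + n])
            if phrase in table:
                output.extend(table[phrase])
                i += n
                break
        else:
            i += 1  # no phrase starts here: skip this token
    return output


# Toy table and sentence (made up for this example).
toy_table: PhraseTable = {
    ("ich", "freue", "mich"): ["i", "look", "forward"],
    ("auf",): ["to"],
    ("die", "reise"): ["the", "trip"],
}
sentence = "ich freue mich auf die reise mit ihnen".split()
print(greedy_phrase_lookup(sentence, toy_table))
```

Because no language model scoring or search is involved, the cost is a handful of dictionary lookups per token, which is in line with the sub-millisecond latency reported in Section 3.1.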

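The length handling of Sections 4.2 and 4.3 can likewise be sketched in a few lines. The sketch below enumerates the candidate target lengths around the rounded value of α times the candidate length, and applies the weighting $w_{ij} = \exp(-(j - j'(i))^2/\tau)$ exactly as stated, without normalizing the weights. The function names, the round-half-up convention, the clipping of lengths below 1, and the use of plain Python lists for embeddings are assumptions made for illustration.

```python
import math
from typing import List


def candidate_lengths(cand_len: int, alpha: float, half_window: int) -> List[int]:
    """Target lengths in [round(alpha * cand_len) - B, round(alpha * cand_len) + B]."""
    center = math.floor(alpha * cand_len + 0.5)  # round half up (assumed convention)
    lengths = range(center - half_window, center + half_window + 1)
    return [t for t in lengths if t >= 1]  # dropping non-positive lengths is an assumption


def soft_copy(cand: List[List[float]], target_len: int, tau: float = 0.3) -> List[List[float]]:
    """Map T_z candidate embeddings to `target_len` decoder input embeddings.

    z_j = sum_i w_ij * e(z~_i), with w_ij = exp(-(j - j'(i))^2 / tau) and
    j'(i) = i * T_y / T_z, using 1-based positions as in Section 4.2.
    """
    if not cand:
        raise ValueError("candidate sequence must not be empty")
    t_z, dim = len(cand), len(cand[0])
    decoder_input: List[List[float]] = []
    for j in range(1, target_len + 1):
        row = [0.0] * dim
        for i in range(1, t_z + 1):
            j_prime = i * target_len / t_z
            weight = math.exp(-((j - j_prime) ** 2) / tau)
            for k in range(dim):
                row[k] += weight * cand[i - 1][k]
        decoder_input.append(row)
    return decoder_input


# Toy run: 5 candidate embeddings of size 2, alpha = 1.1, B = 4.
toy_cand = [[float(i), 1.0] for i in range(5)]
lengths = candidate_lengths(len(toy_cand), alpha=1.1, half_window=4)
inputs_per_length = {t: soft_copy(toy_cand, t) for t in lengths}
print(lengths, [len(v) for v in inputs_per_length.values()])
```

With B = 4 the sketch produces 9 candidate lengths, matching the number of rescored candidates in Section 4.3; each candidate would then be decoded in parallel and rescored by the teacher model.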
| Models | WMT14 En→De | WMT14 De→En | WMT16 En→Ro | IWSLT14 De→En | Latency | Speedup |
|---|---|---|---|---|---|---|
| LSTM-based S2S (Wu et al. 2016) | 24.60 | / | / | 28.53† | / | / |
| Transformer (Vaswani et al. 2017) | 27.41† | 31.29† | 35.61† | 32.55† | 607 ms | 1.00× |
| LT (Kaiser et al. 2018) | 19.80 | / | / | / | 105 ms | 5.78× |
| LT (rescoring 10 candidates) | 21.00 | / | / | / | / | / |
| LT (rescoring 100 candidates) | 22.50 | / | / | / | / | / |
| NART (Gu et al. 2017) | 17.69 | 21.47 | 27.29 | 22.95† | 39 ms | 15.6× |
| NART (rescoring 10 candidates) | 18.66 | 22.41 | 29.02 | 25.05† | 79 ms | 7.68× |
| NART (rescoring 100 candidates) | 19.17 | 23.20 | 29.79 | / | 257 ms | 2.36× |
| IR-NAT (Lee, Mansimov, and Cho 2018) | 21.54 | 25.42 | 29.66 | / | 254† ms | 2.39× |
| Phrase-Table Lookup | 6.03 | 11.24 | 9.16 | 15.69 | / | / |
| ENAT Phrase-Table Lookup | 20.26 | 23.23 | 29.85 | 25.09 | 25 ms | 24.3× |
| ENAT Phrase-Table Lookup (rescoring 9 candidates) | 23.22 | 26.67 | 34.04 | 28.60 | 50 ms | 12.1× |
| ENAT Embedding Mapping | 20.65 | 23.02 | 30.08 | 24.13 | 24 ms | 25.3× |
| ENAT Embedding Mapping (rescoring 9 candidates) | 24.28 | 26.10 | 34.51 | 27.30 | 49 ms | 12.4× |

Table 2: BLEU scores on WMT14 En-De, WMT14 De-En, WMT16 En-Ro and IWSLT14 De-En tasks. "/" indicates that the corresponding result is not reported and "†" means results are produced by ourselves. We also list the inference latency compared with previous works. ENAT with rescoring 9 candidates indicates results when B = 4; otherwise B = 0.

[Figure 2: line plot titled "BLEU scores over different sentence length buckets". x-axis: Length Buckets ([0,10), [10,20), [20,30), [30,40), >40); y-axis: BLEU; series: NART, Ours, AT.]

Figure 2: The BLEU scores comparison between AT, NART, and our method over sentences in different length buckets on newstest2014. Best viewed in color.

5.2 Case Study

We conduct several case studies on the IWSLT14 De-En task to intuitively demonstrate the superiority of our model, listed in Table 3.

As we claimed in Section 1, the NART model tends to repetitively translate the same words or phrases and sometimes misses meaningful words, and it performs poorly when translating long sentences. In the first case, NART fails to translate a long sentence due to the weak signal provided by the decoder input, while both of our models successfully translate the last half of the sentence thanks to the strong information carried in our decoder input. As for the second case, NART translates "to you" twice and misses "all of", which results in a wrong translation, while our model again achieves better translation results.

5.3 Method Analysis

Phrase-Table Lookup vs. Embedding Mapping. We have proposed two different approaches to provide decoder input with enhanced quality, and we compare the two approaches in this subsection.

According to Table 2, the phrase-table lookup achieves better BLEU scores on the IWSLT14 De-En and WMT14 De-En tasks, and the embedding mapping performs better on the other two tasks. We find that the performance of the first approach is related to the quality of the phrase table, which can be judged by the BLEU score of the phrase-to-phrase translation. As IWSLT14 De-En is a cleaner and smaller dataset, the pre-trained phrase table tends to have good quality (with BLEU score 15.69 as shown in Table 2), and it is therefore able to provide an accurate enough signal to the decoder. Although the WMT14 En-De and WMT16 En-Ro datasets are much larger, their phrase tables are of low quality (with BLEU score 6.03 on WMT14 En-De and 9.16 on WMT16 En-Ro), which may provide noisy signals, such as missing too many tokens, and misguide the learning procedure. Therefore, our embedding mapping outperforms the phrase-table lookup on these tasks by providing implicit guidance and allowing the model to adjust the decoder input through end-to-end learning.

Varying the Quality of Decoder Input. We study how the quality of the decoder input influences the performance of the NAT model. We mainly analyze the phrase-table lookup approach, as it is easy to change the quality of its decoder input with a word table. After obtaining the phrase table by Moses from the training data, we further extract a word table from the phrase table following the word alignments. Then we can utilize word-table lookup by the extracted word table as the decoder input z, which provides relatively weaker signals compared with the phrase-table lookup. We measure the BLEU score directly between the phrase/word-table lookup and the reference, as well as between the NAT model out-

7. hier ist ein foto, das ich am nrdlichen ende der baffin-inseln aufnahm, als ich mit inuits auf die narwhal-jagd ging. Source: und dieser mann, olaya, erzhlte mir eine wunderbare geschichte seines grovaters. this is a photograph i took at the northern tip of baffin island when i went narwhal hunting with some inuit people, Target: and this man, olayuk, told me a marvelous story of his grandfather. here’s a photograph i took up at the northern end of the fin islands when i went to the narwhal hunt, Teacher: and this man, olaya, told me a wonderful story of his grandfather. here’s a photograph that i took up the north end of of the baffin fin when i with iuits went to the narwhal hunt, NART: and this guy guy, ollaya. & lt; em & gt; & lt; / em & gt; so here’s a photo which i the northern end the detected when i was sitting on on the went. PT: and this man , told me a wonderful story his’s. here’s a photograph i took up at the end of the baffin islands i went to the nnarwhal hunting hunt, ENAT Phrase: and this man, olaaya told me a wonderful story of his grandfather. here’s a photograph that i took on the north of the end of the baffin islands, when i went to nuits on the narhal hunt, ENAT Embedding: and this man, olaya, told me a wonderful story of his grandfather. Source: ich freue mich auf die gesprche mit ihnen allen! Target: i look forward to talking with all of you. Teacher: i’m happy to talk to you all! NART: i’m looking to the talking to to you you. PT: i look forward to the conversations with you all! ENAT Phrase: i’m looking forward to the conversations with all of you. ENAT Embedding: i’m looking forward to the conversations to all of you. Table 3: Case studies on IWSLT14 De-En task. ENAT Phrase and ENAT Embedding denotes the proposed phrase-table lookup and embedding mapping methods respectively. PT indicates the phrase-table lookup results, which serves as the decoder input to ENAT Phrase method. 
We collect the results of NART with rescoring 10 candidates and set B = 4 while inference for our methods to confirm a fair comparison. Approach Decoder Input NAT Result Lalign slightly outperforms the word-level adversary learn- Word-Table Lookup 3.54 19.16 ing Ladv . However, adding Ladv to Lalign improves the BLEU Phrase-Table Lookup 6.03 20.33 score to 24.13, which illustrates that the complimentary in- formation provided by two loss functions is indispensable. Table 4: The BLEU scores when varying the quality of de- coder input on WMT14 En-De task. We set B = 0 in the 6 Conclusion inference for the NAT result. We targeted at improving accuracy of non-autoregressive translation models and proposed two methods to enhance the Lalign Ladv BLEU score decoder inputs of NAT models: one based on a phrase table √ √ and the other one based on word embeddings. Our methods 24.13 √ outperform the baseline on all tasks by BLEU scores ranging √ 23.53 from 3.47 to 5.02. 23.74 In the future, we will extend this study from several as- pects. First, we will test our methods on more language Table 5: Ablation study of the embedding mapping approach pairs and larger scale datasets. Second, we will explore bet- on IWSLT14 De-En task. We set B = 0 while inference. ter methods to utilize the phrase table. For example, we may sample multiple candidate target tokens (instead of using the one with largest probability in this work) for each source to- puts and the reference in WMT14 En-De test set, listed in ken and feed all the candidates into the decoder. Third, it is Table 4. The quality of the word-table lookup is relatively interesting to investigate better methods (beyond the phrase poor compared with the phrase-table lookup. Under this cir- table and word embedding based methods in this work) to cumstance, the signal provided by the decoder input will be enhance the decoder inputs and further improve translation weaker, and thus influence the accuracy of NAT model. 
accuracy for NAT models. Ablation Study on Embedding Mapping We conduct an ablation study in this subsection to study the different components in the embedding mapping approach, i.e., the Acknowledgements sentence-level alignment and word-level adversary learn- This research was supported by the National Natural Sci- ing. Results are shown in Table 5. Sentence-level alignment ence Foundation of China (No. 61673364, No. 91746301)

and the Fundamental Research Funds for the Central Universities (WK2150110008).

References

[Abadi et al. 2016] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, 265–283.
[Ba, Kiros, and Hinton 2016] Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
[Bahdanau et al. 2016] Bahdanau, D.; Brakel, P.; Xu, K.; Goyal, A.; Lowe, R.; Pineau, J.; Courville, A.; and Bengio, Y. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Brown et al. 1993] Brown, P. F.; Pietra, V. J. D.; Pietra, S. A. D.; and Mercer, R. L. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19(2):263–311.
[Cho et al. 2014a] Cho, K.; Van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
[Cho et al. 2014b] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014b. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[Conneau et al. 2017] Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
[Gehring et al. 2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
[Gong et al. 2019] Gong, C.; Tan, X.; He, D.; and Qin, T. 2019. Sentence-wise smooth regularization for sequence to sequence learning. In AAAI.
[Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
[Gu et al. 2017] Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O.; and Socher, R. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
[He et al. 2018] He, T.; Tan, X.; Xia, Y.; He, D.; Qin, T.; Chen, Z.; and Liu, T.-Y. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In NIPS.
[Kaiser et al. 2018] Kaiser, Ł.; Roy, A.; Vaswani, A.; Parmar, N.; Bengio, S.; Uszkoreit, J.; and Shazeer, N. 2018. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
[Kalchbrenner et al. 2016] Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A. v. d.; Graves, A.; and Kavukcuoglu, K. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
[Kim and Rush 2016] Kim, Y., and Rush, A. M. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
[Koehn et al. 2007] Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, 177–180.
[Lee, Mansimov, and Cho 2018] Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
[Li et al. 2019] Li, Z.; He, D.; Tian, F.; Qin, T.; Wang, L.; and Liu, T.-Y. 2019. Hint-based training for non-autoregressive translation.
[Ott et al. 2018] Ott, M.; Auli, M.; Grangier, D.; and Ranzato, M. 2018. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047.
[Papineni et al. 2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318.
[Ranzato et al. 2015] Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
[Sennrich, Haddow, and Birch 2015] Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[Shen et al. 2018] Shen, Y.; Tan, X.; He, D.; Qin, T.; and Liu, T.-Y. 2018. Dense information flow for neural machine translation. In NAACL.
[Song et al. 2018] Song, K.; Tan, X.; He, D.; Lu, J.; Qin, T.; and Liu, T.-Y. 2018. Double path networks for sequence to sequence learning. In COLING.
[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
[Wu et al. 2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[Wu et al. 2018] Wu, L.; Tan, X.; He, D.; Tian, F.; Qin, T.; Lai, J.; and Liu, T. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. In EMNLP.
[Yu et al. 2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, 2852–2858.
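
As a supplementary consistency check on Table 2, the Speedup column appears to be the ratio of the autoregressive Transformer latency (607 ms) to each model's latency. The minimal sketch below recomputes that ratio from the latencies reported in Table 2. The function name `speedup` and the dictionary layout are illustrative assumptions and are not part of the original work.

```python
# Latencies in milliseconds, copied from Table 2.
TRANSFORMER_LATENCY_MS = 607.0  # autoregressive baseline (speedup 1.00x)

LATENCY_MS = {
    "LT": 105.0,
    "NART": 39.0,
    "NART (rescoring 10 candidates)": 79.0,
    "NART (rescoring 100 candidates)": 257.0,
    "IR-NAT": 254.0,
    "ENAT Phrase-Table Lookup": 25.0,
    "ENAT Phrase-Table Lookup (rescoring 9 candidates)": 50.0,
    "ENAT Embedding Mapping": 24.0,
    "ENAT Embedding Mapping (rescoring 9 candidates)": 49.0,
}


def speedup(latency_ms, baseline_ms=TRANSFORMER_LATENCY_MS):
    """Return the baseline latency divided by the model latency.

    Hypothetical helper: assumes speedup is defined as a plain latency ratio.
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return baseline_ms / latency_ms


if __name__ == "__main__":
    # Print each model with its latency and the recomputed speedup.
    for name, ms in LATENCY_MS.items():
        print(f"{name:52s} {ms:5.0f} ms  {speedup(ms):6.2f}x")
```

Under this assumption, the recomputed ratios match the Speedup column of Table 2 after rounding to three significant figures, which supports reading the column as latency relative to the Transformer baseline.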