Unsupervised Neural Machine Translation with SMT

Without a real bilingual corpus available, unsupervised Neural Machine Translation (NMT) typically requires pseudo parallel data generated with the back-translation method for model training. However, due to weak supervision, the pseudo data inevitably contain noise and errors that are accumulated and reinforced in the subsequent training process, leading to bad translation performance. To address this issue, we introduce phrase-based Statistical Machine Translation (SMT) models, which are robust to noisy data, as posterior regularizations to guide the training of unsupervised NMT models in the iterative back-translation process. Our method starts from SMT models built with pre-trained language models and word-level translation tables inferred from cross-lingual embeddings. Then SMT and NMT models are optimized jointly and boost each other incrementally in a unified EM framework. In this way, (1) the negative effect caused by errors in the iterative back-translation process can be alleviated in time by SMT filtering noise from its phrase tables; meanwhile, (2) NMT can compensate for the deficiency of fluency inherent in SMT. Experiments conducted on en-fr and en-de translation tasks show that our method outperforms the strong baseline and achieves new state-of-the-art unsupervised machine translation performance.

Unsupervised Neural Machine Translation with SMT as Posterior Regularization

Shuo Ren†∗, Zhirui Zhang‡, Shujie Liu§, Ming Zhou§, Shuai Ma†
† SKLSDE Lab, Beihang University
† Beijing Advanced Innovation Center for Big Data and Brain Computing, China
‡ University of Science and Technology of China, Hefei, China
§ Microsoft Research Asia
† {shuoren,mashuai}@buaa.edu.cn ‡ zrustc11@gmail.com § {shujliu,mingzhou}@microsoft.com

arXiv:1901.04112v1 [cs.CL] 14 Jan 2019

∗ The first two authors contributed equally to this work. This work is supported in part by NSFC U1636210 and 61421003, and Shenzhen Institute of Computing Sciences.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Recent years have witnessed the rise and success of Neural Machine Translation (NMT) (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015; Wu et al. 2016; Vaswani et al. 2017; Hassan et al. 2018). However, NMT relies heavily on large in-domain parallel data, resulting in poor performance on low-resource language pairs (Koehn and Knowles 2017). For some low-resource pairs without any bilingual corpus, how to train NMT models with only a monolingual corpus is a popular and interesting topic.

Existing methods for unsupervised machine translation (Artetxe et al. 2017; Lample, Denoyer, and Ranzato 2017; Yang et al. 2018; Lample et al. 2018) are mainly modifications of the encoder-decoder schema. In their work, source sentences in two languages are mapped into the same latent space with a shared encoder, which is expected to be the internal information representation irrelevant to the languages themselves. From that representation, target sentences are generated by a shared decoder or by different decoders. Some of them also use denoising auto-encoders (Vincent et al. 2010) and adversarial training. Despite the differences in structures and training methods, they reach a consensus to use the pseudo parallel data generated iteratively with the back-translation method (Sennrich, Haddow, and Birch 2016; Zhang et al. 2018a) to train their unsupervised NMT models, i.e. they use monolingual data in the target language and a target-to-source translation model to generate source sentences, then use the pseudo parallel data of generated sources and real targets to train the source-to-target model, and vice versa.

However, since the pseudo data are generated by unsupervised models, random errors and noise are inevitably introduced, such as redundant or unaligned words deviating from the meaning of source sentences. Due to the lack of supervision, those infrequent errors will be accumulated and reinforced by NMT models into frequent patterns during the training iterations, leading to bad translation performance. For instance, in Figure 1, the French word “malade” is mistakenly translated into the English word “ill-fated” in the first training sample. With strong abilities to identify and memorize patterns, NMT models mistakenly translate this word into “ill-fated” when “old” (similar to “grandmother” in the first training sample) occurs in the test. Even so, there are also many good translation patterns (such as “malade” → “ill” or “sick” in the second and third training samples), which could have been extracted in time to guide the NMT models into the correct training direction. The extraction and guidance can be well carried out by Statistical Machine Translation (SMT). As is pointed out by Khayrallah and Koehn (2018), SMT performs better than NMT in tackling noisy data by constructing a strong phrase table with good and frequent translation patterns and filtering out infrequent errors and noise. This gives the motivation that if we incorporate SMT in the training process, unsupervised NMT could benefit from the robustness of SMT to noisy data.

Figure 1: The effect of noisy training data. The first training sample contains the noise (“malade” in French means “ill”, not “ill-fated”), leading to the wrong test result (sys).

In this paper, we propose to leverage SMT to denoise and guide the training of unsupervised NMT models in the iterative back-translation process. Different from previous work (He et al. 2016; Tang et al. 2016; Wang et al. 2017) introducing SMT into NMT by changing model structures in supervised scenarios, we adopt the framework of posterior regularization (Ganchev et al. 2010) to leave model structures unchanged. Our method starts from initial SMT models built with pre-trained language models and word-level translation tables inferred from cross-lingual embeddings. Then SMT models and NMT models are trained jointly in a unified Expectation Maximization (EM) training framework. In each iteration, as desired distributions, SMT models are expected to correct NMT models in time with denoised pseudo data generated in a constrained search space of reliable translation patterns. Based on that, enhanced NMT models can generate better pseudo data for SMT to extract phrases of higher quality, so that they can benefit from each other incrementally. In this way, infrequent errors in NMT models can be eliminated with the constraints exerted by SMT features, while NMT can compensate for the deficiency in smoothness inherent in SMT models. Experiments conducted on en-fr and en-de translation tasks show that our method significantly outperforms the strong baseline (Lample et al. 2018) and achieves the new state-of-the-art translation performance in unsupervised machine translation.

2 Background

2.1 Neural Machine Translation

Given a source sentence $x = (x_1, x_2, ..., x_l)$ and a target one $y = (y_1, y_2, ..., y_m)$, Neural Machine Translation (NMT) directly models the word-level translation probability with parameters $\theta$ as:

$$p(y_i \mid x, y_{<i}; \theta) = \mathrm{softmax}(g(h_{y_i}, h_{y_{<i}}, c_i; \theta)) \tag{1}$$

in which $g(\cdot)$ denotes a non-linear function extracting features to predict the target word $y_i$ from the decoder states ($h_{y_i}$ and $h_{y_{<i}}$) and the context vector $c_i$ calculated with the encoder and attention mechanism. Then the sentence-level translation probability $p(y \mid x; \theta)$ is calculated by $p(y \mid x; \theta) = \prod_{i=1}^{m} p(y_i \mid x, y_{<i}; \theta)$. As for training, given a parallel corpus $\{(x_n, y_n)\}_{n=1}^{N}$, the objective function is to maximize $\log p(y_n \mid x_n; \theta)$ over the whole training set.

2.2 Phrase-based Statistical Machine Translation

The current approach of Statistical Machine Translation (SMT) is typically based on the log-linear model proposed by Och and Ney (2002). According to it, the translation probability from sentence $x$ to sentence $y$ is formulated as:

$$p(y \mid x; \lambda_1^M) = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(x, y)\right]}{\sum_{\tilde{y}} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(x, \tilde{y})\right]} \tag{2}$$

where $h_m(x, y) = \log \phi_m(x, y)$ denotes the $m$-th feature. In phrase-based SMT (PBSMT) (Koehn, Och, and Marcu 2003), the sentence pair is segmented into a sequence of phrases $\bar{x}_1^I$ and $\bar{y}_1^J$, where $I$ and $J$ are the counts of phrases. During training, given a bilingual corpus, PBSMT first infers word alignment, based on which phrase pairs are derived and stored in the phrase table, as well as translation probabilities. Other features such as a distortion model can also be learned with the extracted phrase pairs. The feature weights $\lambda_1^M$ can be optimized by MERT (Och 2003) with a validation set. During decoding, PBSMT generates translation candidates $\tilde{y}$ bottom up via the CKY algorithm, ranked with scores given by the log-linear model in Eq.(2).

2.3 Posterior Regularization

Posterior regularization (Ganchev et al. 2010) is a framework for structured, weakly supervised learning, which incorporates indirect supervision from a desired distribution $q(y)$ via constraints on the posterior distribution $p(y \mid x_n; \theta)$ imposed by a Kullback-Leibler (KL) divergence as follows:

$$\mathcal{F}(q; \theta) = \mathcal{L}(\theta) - \min_{q \in Q} \sum_{n=1}^{N} \mathrm{KL}(q(y) \,\|\, p(y \mid x_n; \theta)) \tag{3}$$

where $\mathcal{L}(\theta)$ is the original likelihood of model $p(y \mid x; \theta)$, and $Q$ is a constraint posterior set satisfying:

$$Q = \{q(y) : \mathbb{E}_q[\phi(x, y)] \le b\} \tag{4}$$

in which constraint features $\phi(x, y)$ are bounded by $b$. To maximize $\mathcal{F}(q; \theta)$, Ganchev et al. (2010) propose an EM framework (McLachlan and Krishnan 2007) as:

$$\begin{aligned} \mathrm{E}:\ & q^{t+1} = \arg\min_{q \in Q} \mathrm{KL}(q(y) \,\|\, p(y \mid x_n; \theta^{t})) \\ \mathrm{M}:\ & \theta^{t+1} = \arg\max_{\theta} \mathcal{L}(\theta) + \mathbb{E}_{q^{t+1}}[\log p(y \mid x_n; \theta)] \end{aligned} \tag{5}$$

However, there may be a problem, as pointed out by Zhang et al. (2017), that it is hard to set a reasonable bound $b$ if we directly apply posterior regularization to NMT. To solve this problem, we follow their practice of representing the desired distribution $q(y)$ as the log-linear model described in Eq.(2). In this way, SMT models directly act as the posterior regularization to constrain NMT models $p(y \mid x_n; \theta^{t})$.

3 Method

3.1 Overview

Due to the lack of supervision, noise and infrequent errors in the pseudo data generated by unsupervised NMT models will be accumulated and reinforced in the iterative back-translation process (shown in the shadow area in Figure 2).
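
The two ingredients from Section 2 that the method builds on, the log-linear distribution of Eq.(2) and the KL term of Eq.(3), can be made concrete with a small numeric sketch. This is only an illustration under stated assumptions: the sum over all candidates in Eq.(2) is restricted to a short, hypothetical candidate list, and the feature values, weights, and NMT probabilities below are invented for the example rather than taken from the paper.

```python
import math


def log_linear_distribution(candidate_features, weights):
    """Eq.(2) over a finite candidate list: softmax of sum_m lambda_m * h_m(x, y).

    candidate_features[c][m] is the log-feature h_m for candidate c (hypothetical values).
    """
    scores = [sum(w * h for w, h in zip(weights, feats)) for feats in candidate_features]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(q, p):
    """KL(q || p) for two distributions over the same candidate list."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Hypothetical log-features (e.g. translation and language-model scores) for 3 candidates.
features = [[-1.0, -0.5], [-2.0, -1.5], [-4.0, -3.0]]
weights = [1.0, 0.5]                      # hypothetical feature weights lambda_m
q = log_linear_distribution(features, weights)   # desired distribution (SMT side)
p = [0.5, 0.2, 0.3]                       # hypothetical NMT posterior over the same candidates

print("q =", [round(v, 4) for v in q])
print("KL(q || p) =", round(kl_divergence(q, p), 4))
```

The KL value shrinks toward zero as the NMT posterior `p` moves toward `q`, which is the direction the regularization term in Eq.(3) pushes the model.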

Figure 2: Method overview. The whole procedure mainly consists of two parts as the left and the right.

To address this issue, we introduce SMT as posterior regularization (the red frame above that) to denoise and guide the training of NMT, so that the noise is eliminated in time.

The whole procedure of our method mainly consists of two parts shown in the left and right of Figure 2. Given a language pair X-Y, for model initialization, we build two initial SMT models with language models pre-trained using monolingual data, and word translation tables inferred from cross-lingual embeddings according to the approach in 3.2. Then the initial SMT models will generate pseudo data to warm up two NMT models. Note that the NMT models are trained using not only the pseudo data generated by SMT models, but also those generated by reverse NMT models with the iterative back-translation method. After that, the NMT-generated pseudo data are fed to SMT models. As posterior regularization (PR), SMT models filter out noise and infrequent errors in time by constructing strong phrase tables with good and frequent translation patterns, and then generate denoised pseudo data to guide the subsequent NMT training. Benefiting from that, NMT then produces better pseudo data for SMT to extract phrases of higher quality, meanwhile compensating for the deficiency in smoothness inherent in SMT via back-translation. Those two steps are unified in the EM framework described in 3.3, where NMT and SMT models are trained jointly and boost each other incrementally until final convergence.

3.2 Initialization

Our initial SMT models are built with word-based phrase tables and two pre-trained language models via Moses (footnote 1). For the word translation table, we first train word embeddings using monolingual corpora for two languages respectively. Based on that, we adopt the method proposed by Artetxe et al. (2018) to obtain respective cross-lingual embeddings $\{e_{x_i}\}_{i=1}^{S}$ and $\{e_{y_j}\}_{j=1}^{T}$, where $S$ and $T$ are vocabulary sizes. Then the word translation probability from word $x_i$ to $y_j$ is:

$$p(y_j \mid x_i) = \frac{\exp[\lambda \cos(e_{x_i}, e_{y_j})]}{\sum_{k} \exp[\lambda \cos(e_{x_i}, e_{y_k})]} \tag{6}$$

where $\lambda$ is a hyper-parameter to control the peakiness of the distribution. The calculation of $p(x_i \mid y_j)$ is similar to Eq.(6). Based on the above, we choose the top-$k$ translation candidates for each word in our initial phrase table. We only use two features in our initial phrase tables, i.e. translation probabilities and inverse translation probabilities.

Footnote 1: https://github.com/moses-smt/mosesdecoder

3.3 Unsupervised NMT with SMT as PR

As is mentioned in 3.1, SMT plays a role in denoising and is leveraged as posterior regularization for NMT. Therefore, we replace the posterior regularization term $q(y)$ in Eq.(3) with the SMT models ($x \to y$) and ($y \to x$) in Figure 2, which will be denoted by $\overrightarrow{p}_s(y \mid x)$ and $\overleftarrow{p}_s(x \mid y)$. Likewise, the NMT models ($x \to y$) and ($y \to x$) will be denoted by $\overrightarrow{p}_n(y \mid x; \theta_{x \to y})$ and $\overleftarrow{p}_n(x \mid y; \theta_{x \leftarrow y})$, where $\theta_{x \to y}$ and $\theta_{x \leftarrow y}$ are parameters. Then, given monolingual corpora $\{x_i\}_{i=1}^{M}$ and $\{y_j\}_{j=1}^{N}$, we formulate the training objective as:

$$\begin{aligned} \mathcal{J}(\theta_{x \to y}, \theta_{x \leftarrow y}, \overrightarrow{p}_s, \overleftarrow{p}_s) = {} & \bar{\mathcal{L}}(\theta_{x \to y}, \theta_{x \leftarrow y}) \\ & - \min_{\overrightarrow{p}_s} \sum_{i=1}^{M} \mathrm{KL}(\overrightarrow{p}_s(y \mid x_i) \,\|\, \overrightarrow{p}_n(y \mid x_i; \theta_{x \to y})) \\ & - \min_{\overleftarrow{p}_s} \sum_{j=1}^{N} \mathrm{KL}(\overleftarrow{p}_s(x \mid y_j) \,\|\, \overleftarrow{p}_n(x \mid y_j; \theta_{x \leftarrow y})) \end{aligned} \tag{7}$$

where $\bar{\mathcal{L}}(\theta_{x \to y}, \theta_{x \leftarrow y})$ corresponds to the training objective of iterative back-translation for NMT models, which is:

$$\begin{aligned} \bar{\mathcal{L}}(\theta_{x \to y}, \theta_{x \leftarrow y}) = {} & \sum_{i=1}^{M} \mathbb{E}_{y \sim \overrightarrow{p}_n(y \mid x_i; \theta_{x \to y})}[\log \overleftarrow{p}_n(x_i \mid y; \theta_{x \leftarrow y})] \\ & + \sum_{j=1}^{N} \mathbb{E}_{x \sim \overleftarrow{p}_n(x \mid y_j; \theta_{x \leftarrow y})}[\log \overrightarrow{p}_n(y_j \mid x; \theta_{x \to y})] \end{aligned} \tag{8}$$
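
Returning briefly to the initialization in 3.2, the word translation table of Eq.(6) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the released implementation: the two-dimensional embeddings are toy values invented for the example, the function name `word_translation_table` is hypothetical, the value of `lam` is only illustrative, and the top-k probabilities are left unnormalized after truncation because the text does not say whether they are renormalized.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def word_translation_table(src_emb, tgt_emb, lam, top_k):
    """For each source word, apply Eq.(6) over the target vocabulary and keep the top-k entries."""
    table = {}
    for src_word, e_x in src_emb.items():
        sims = {tgt_word: cosine(e_x, e_y) for tgt_word, e_y in tgt_emb.items()}
        max_sim = max(sims.values())  # shift by the maximum for numerical stability
        exps = {w: math.exp(lam * (s - max_sim)) for w, s in sims.items()}
        normalizer = sum(exps.values())
        probs = {w: v / normalizer for w, v in exps.items()}
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        table[src_word] = ranked[:top_k]  # assumption: no renormalization after truncation
    return table


# Toy cross-lingual embeddings (hypothetical values, already mapped into a shared space).
src_emb = {"malade": [0.9, 0.1], "vieux": [0.1, 0.9]}
tgt_emb = {"ill": [0.85, 0.15], "sick": [0.8, 0.3], "old": [0.05, 0.95]}

table = word_translation_table(src_emb, tgt_emb, lam=20.0, top_k=2)
for src_word, candidates in table.items():
    print(src_word, [(w, round(p, 4)) for w, p in candidates])
```

A larger `lam` concentrates the probability mass on the nearest neighbour, which is the "peakiness" that the hyper-parameter $\lambda$ controls; the inverse table $p(x_i \mid y_j)$ would be obtained by swapping the two embedding dictionaries.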

4.and two Kullback-Leibler divergence (KL) terms denote the Algorithm 1: Unsupervised NMT with SMT as PR posterior regularizations for two NMT models respectively. Input: Monolingual data X = {xi }M i=1 and Y = {yj }j=1 N Based on that, the training processes of iterative back- Output: Parameters of two NMT models: θx→y , θx←y translation for NMT and SMT models as posterior regular- 1 Train language models lx and ly using X and Y ization are unified into a single objective J . Then, we mod- 2 Infer word translation tables txy and tyx as in 3.2 ulate the EM algorithm in Eq.(5) to optimize it as follows: 3 t := 0 while not convergence do Sample data {xt } ∈ X and {yt } ∈ Y E:← p−s t+1 = arg max J (θx→y , θx←y , → − ps , ← p−s ) 4 ← − 5 // E-step: p s if t = 0 then = arg min KL(← p−s (x|yj )||← p− t n (x|yj ; θx←y )) 6 ps 0 and ← Initialize − → p−s 0 using lx , ly , txy and tyx ← − p s else → − ps t+1 = arg max J (θx→y , θx←y , → − ps , ← p−s ) 7 Generate pseudo data {(xt , yt+ )} and {(x+ t , yt )} → − ps using models − p→ t ←−t n and pn respectively = arg min KL(→ − ps (y|xi )||− p→ Train − t → ps t and ← p−s t using (xt , yt+ ) and (x+ n (y|xi ; θx→y )) 8 t , yt ) → − ps 9 // M-step: t+1 M : θx←y = arg max J (θx→y , θx←y , → − ps , ← p−s ) 10 Generate denoised pseudo data {(xt , yt∗ )} and θx←y {(x∗t , yt )} using − → ps t and ← p−s t = arg max{E← ←− t+1 [log pn (x|yj ; θx←y )] 11 Train −p→ t ←−t ∗ ∗ n and pn using {(xt , yt )} and {(xt , yt )} p− s + + θx←y 12 Generate pseudo data {(xt , yt )} and {(xt , yt )} using − p→ ←−t [log ← p− t n and pn respectively + E− p→ n (xi |y; θx←y )]} Train pn and ← −→ p− t n (y|xi ;θx→y ) t t + ∗ 13 n using {(xt , yt )} ∪ {(xt , yt )} and t+1 θx→y = arg max J (θ x→y ,θx←y ,→− p ,← s p−)s {(xt , yt+ )} ∪ {(x∗t , y)} θx→y 14 t := t + 1 = arg max E→ − → ps t+1 [log pn (y|xi ; θx→y )] − 15 return θx→y , θx←y θx→y + E← p− t n (x|yj ;θx←y ) [log − p→ n (yj |x; θx→y )] (9) This step corresponds to lines 14 to 17 in Algorithm 1. 
A Briefly speaking, in the E-step, we optimize the desired difficulty here is the exponential search space of the trans- distributions represented by SMT to minimize the KL dis- lation candidates. To address it, we leverage the sampling tance between SMT models and NMT models. In the M- method (Shen et al. 2015) and simply generate the top target step, we optimize NMT models using the pseudo data gen- sentence for approximation in our experiments. Note that in erated by SMT models and the corresponding reverse NMT the 11th line, NMT models are trained using the denoised models to fit the desired distributions and meanwhile per- pseudo data generated by SMT models only, while in the form back-translation iterations. We will give the specific 13th line, the mixed data of those and the pseudo data gen- equation for updating parameters in 3.4. erated by the reverse NMT models are used. The intention 3.4 Training Algorithm here is to first use the denoised pseudo data to correct the NMT models established before, and then apply iterative We combine the model initialization and the whole training back-translation to boost NMT models under the guide of procedure into Algorithm 1 as follows. the denoised data. NMT also makes up for the deficiency in According to Eq.(9), in the E-step, we need to mini- smoothness of SMT in this step. In this way, SMT and NMT mize the gap between SMT models and NMT models. How- models can benefit from each other in the EM iterations. ever, this step cannot be done by traditional gradient descent methods. Approximately, we train SMT models using the pseudo data generated by the corresponding NMT models to 4 Experiments fit the mode of NMT posterior distributions. Thus the KL di- vergence between them is diminished. 
This step corresponds 4.1 Setup to the the 7th and 8th lines in Algorithm 1, meaning SMT Dataset In our experiments, we consider two language extracts good and frequent translation patterns from the data pairs, English-French and English-German. For each lan- generated by current NMT models to finish denoising. guage, we use 50 million monolingual sentences in In the M-step, we optimize two NMT models with gradi- NewsCrawl, a monolingual dataset from WMT, which is the ent descent methods. We formulate the updating for θx←y in same as the previous work (Artetxe et al. 2017; Lample et al. Eq.(10), to which that for θx→y is similar. 2018). For the convenience of comparison, we use newstest 2014 as the test set for the English-French pair, and newstest ∇θx←y J (θx→y , θx←y , → − ps , ← p−s ) 2014 as well as newstest 2016 for the English-German pair. = Ex∼← p− s (x|yj ) ∇θx←y log ←p−n (x|yj ; θx←y ) (10) Preprocess We use Moses scripts for word tokenization +E − → ∇ y∼pn (y|xi ;θx→y ) θx←y log ←p−(x |y; θ n i x←y ) and truecasing. In model initialization, we use the public im-

5. de-en en-de de-en en-de Method fr-en en-fr (2014) (2014) (2016) (2016) (Artetxe et al. 2017) 15.56 15.13 10.21 6.89 - - (Lample, Denoyer, and Ranzato 2017) 14.31 15.05 - - 13.33 9.64 (Yang et al. 2018) 15.58 16.97 - - 14.62 10.86 (Lample et al. 2018), NMT 24.18 25.41 - - 21.00 17.16 (Lample et al. 2018), PBSMT 27.16 28.11 - - 22.68 17.77 (Lample et al. 2018), NMT+PBSMT 26.29 27.12 - - 22.06 17.52 (Lample et al. 2018), PBSMT+NMT 27.68 27.60 - - 25.19 20.23 Our Method 28.79 29.21 20.04 16.43 25.92 21.07 (+ R2L regularization) 28.92 29.53 20.43 16.97 26.32 21.65 Table 1: Comparison with previous methods. plementation of word2vec2 to train monolingual word em- Results and Discussion The comparison results are re- beddings of each language, and vecmap3 to obtain cross- ported in Table 1. The BLEU scores are calculated by multi- lingual embeddings of both language pairs. For NMT, we bleu.pl. From the table, we find that our method significantly use the modified version of the public implementation4 of outperforms all the baselines even the strong one (Lample et Transformer (Vaswani et al. 2017). We share the vocabu- al. 2018). We elaborate the reasons as follows. lary space of 50,000 BPE codes (Sennrich, Haddow, and (1) Our proposed method significantly improves the per- Birch 2015) for source and target languages. For each lan- formance over the “NMT” and “PBSMT” of (Lample et al. guage pair, we train two independent NMT models for dif- 2018). This is because unsupervised NMT methods suffer ferent translation directions (i.e., source to target and target from the noise problem while PBSMT is inherently defi- to source) with shared embedding layers of source and target cient in fluency just as the case study in 4.5 shows. Our sides. For SMT, we use the Moses implementation of PB- method can compensate for the deficiencies of them by SMT systems with Salm (Johnson et al. 2007), which can combining the training processes of them. 
(2) Notice that denoise and reduce the size of phrase tables. And we use the “NMT+PBSMT” performs even worse than pure “PBSMT”, default features defined in Moses for our PBSMT models. which may be caused by accumulated errors in the iterations Our code is released in https://github.com/ of NMT models. Due to the lack of timely denoising meth- Imagist-Shuo/UNMT-SPR. ods, infrequent errors and noises are repeated and reinforced as frequent ones by unsupervised NMT, so that even PBSMT 4.2 Comparison could not distinguish them from good patterns in the last iteration. (3) The performance gained by “PBSMT+NMT” Baselines Our proposed method is compared with four verifies combining data of high quality into NMT training baselines of unsupervised machine translation listed in the could be a better choice. But the simple combination in their upper area of Table 1, among which the fourth baseline con- method is not able to make the best of both models. In their tains several methods. Given a language pair, the first two method, NMT and SMT models are trained independently baselines (Artetxe et al. 2017; Lample, Denoyer, and Ran- so that the bad patterns within the models themselves cannot zato 2017) use a shared encoder and different decoders for be well removed due to weak supervision. In contrast, our the two languages. The third baseline (Yang et al. 2018) uses proposed method integrates the training of NMT and SMT different encoders and decoders, and introduces a weight models in a unified EM framework where they can boost sharing mechanism. The fourth baseline (Lample et al. each other incrementally. The noises and errors generated 2018) uses a shared encoder and decoder in their NMT sys- by NMT models can be reduced in time by SMT as poste- tems. As for the training method, the second and third base- rior regularization, while NMT can compensate for the de- lines use adversarial training. All of the four baselines use ficiency of smoothness inherent in SMT models. 
Therefore, denoising auto-encoder and iterative back-translation. our proposed method still outperforms ”PBSMT+NMT”. Note that the fourth baseline contains four methods. Apart from SMT as posterior regularization, our frame- “NMT” means unsupervised NMT models, while “PBSMT” work can be easily extended to incorporate other poste- denotes unsupervised SMT models with the back-translation rior regularization methods without changing model struc- method performed by SMT. “NMT+PBSMT” and “PB- tures, such as the target-bidirectional agreement regular- SMT+NMT” simply combine the best pseudo data that the ization (Zhang et al. 2018b). This regularization can help former generates into the final iteration of the latter. Dif- deal with the problem of exposure bias in supervised NMT, ferent from our proposed method, the training processes of where another ”reversed” NMT model is trained using data NMT and SMT models in their methods are independent. of reversed sentences from left to right. Then the ”reversed” NMT model is leveraged to generate pseudo data for training 2 https://github.com/tmikolov/word2vec the original NMT model. Specifically, we introduce the R2L 3 https://github.com/artetxem/vecmap regularization after the final training iteration of NMT mod- 4 https://github.com/tensorflow/tensor2tensor els (i.e., NMT2 in Table 2). With this extension, we achieve

6. Steps fr-en en-fr de-en en-de ave E-step (SMT0) 15.34 11.74 11.03 8.14 11.56 M-step (NMT0) 24.06 24.82 16.29 12.88 +7.95 E-step (SMT1) 26.49 27.64 17.34 14.81 +2.06 M-step (NMT1) 28.29 29.02 19.61 16.02 +1.67 E-step (SMT2) 28.64 29.21 19.87 16.29 +0.23 M-step (NMT2) 28.79 29.17 20.04 16.43 +0.11 Table 2: Test BLEU on newstest 2014 in different steps. Figure 4: Test of initial models with various hyper-params. higher performance (+R2L regularization in Table 1). brevity, we let S = T = V in our experiments. The re- 4.3 Model Evolution sults are illustrated in Figure 4. From this figure, we find We conduct several EM iterations in our experiments, and that k and V have much bigger impacts on the initial model record the test BLEU scores on newstest 2014 after each E- SMT0 than λ. With the value of λ increasing, the perfor- step (SMT) and M-step (NMT) in Table 2. We have tried mance of SMT0 gradually improves but starts to decline a more steps but the models do converge after three EM iter- bit after around 20. This is because the larger λ will make the ations. For the convenience of comparison, in the last col- distribution in Eq.(6) sharper, severely restricting the search umn of the table, we also list the average improvement of spaces of SMT models. Similarly, the performance of SMT0 four translation models after each step. From the table, first, improves in accord with the value of k or V going up. But we find NMT and SMT models improve incrementally after the improvement stops after certain thresholds (about 80 of k each iteration, which accords with our proposed motivation. and 50000 of V ). The reason may be the useful information Note that the improvements between adjacent NMT steps provided by word-translation tables is saturated after those. are exactly contributions made by SMT as posterior regular- We also tried other initialization methods in our experi- ization. 
Second, the models improve the most in the first EM ments, such as directly using the pseudo parallel data con- iteration and nearly converge at the third EM iteration. structed from word-by-word translation to warm up NMT Additionally, we also compare the translation perfor- models. We compare NMT0 models warmed up with this mance on sentences of different lengths as iteration steps method (without SMT0) to NMT0 in our proposed method progress. We group the sentences in the fr-en test set by (with SMT0) in the following table, which stresses the ne- length as shown by the three curves in Figure 3. Then, we cessity of SMT0 and the importance of good initialization. record the BLEU scores of different groups after each step. From the figure, we find the models converge much slower Initialization Method fr-en en-fr de-en en-de on longer sentences, which indicates that it is easier for the NMT0 without SMT0 12.29 12.46 7.32 4.81 models to learn shorter sentences. NMT0 with SMT0 24.06 24.82 16.29 12.88 Table 3: The necessity of SMT0 in model initialization. The numbers in this table are BLEU scores on newstest 2014. 4.5 Case Study To further demonstrate the effectiveness of our method, we select some cases from translation results (fr-en) and com- pare the translations generated by models of different train- ing steps. The results are listed in Table 4. In the first case, Figure 3: Test BLEU on sentences grouped by length. which is exactly the example in the Introduction, the word “malade” in French is wrongly translated into “ill-fated” in English by NMT0. As we can see, this error has been cor- rected in NMT1 after the guidance of SMT1. In the second 4.4 Discussion on Initialization case, apart from the wrongly aligned word “bˆatisse-l`a” to In this subsection, we delve into the initialization stage “canopy-back business” by NMT1, there is also a redundant which is crucial to our method. In that stage, there are three phrase “plenty of” generated by it. 
Those errors are both cor- hyper parameters described in 3.2 that should be taken into rected after the regularization of SMT1. In the third case, account, i.e., the peakiness controller λ, the vocabulary size we also reach the same conclusion that NMT1 can bene- S or T , and the number of translation candidates k for each fit from SMT1 and rectify the mistake on “rendu visite a` ”. word. Since the performance of initialization can be eval- There is also an interesting phenomenon from case three of uated by SMT0, we adjust the hyper-parameters and mea- NMT adhering to “from” which makes the sentence more sure the fr-en test BLEU of SMT0 models accordingly. For fluent, even though this word is missed by SMT models. In a

7. Source J’ai eu des relations difficiles avec lui jusqu’`a ce qu’il devienne vieux, malade. SMT0 I’ve gotten of difficult relations with him until he will become old, sick. NMT0 I’ve had difficult relations with him until he’s become old, ill-fated. SMT1 I’ve had difficult relationships with him until he became old, sick. NMT1 I had difficult relations with him until he became old and sick. Reference I had a difficult relationship with him until he became old and ill. Source Le fonds d’investissement qui e´ tait propri´etaire de cette bˆatisse-l`a avait des choix a` faire. SMT0 The owner of this underlinebuilding, so had to make a choice of which was an investment fund. NMT0 The investment fund that was an owner of that canopy-back business had plenty of choice to do. SMT1 The investment fund that was the owner of this building just had to make choices. NMT1 The investment fund that was the owner of this building had choices to make. Reference The investment fund that owned the building had to make a choice. M. Dutton a rendu visite a` Mme Plibersek pour garantir qu’aucun dollar du plan de sauvetage ne sera d´epens´e Source en bureaucratie suppl´ementaire. SMT0 Mr Dutton paid a visit to Ms Plibersek to guarantee that the greenback no rescue plan of not be spent in extra bureaucracy. NMT0 Mr Dutton said Ms Plibersek’visit to guarantee any dollar from the rescue plan will be spent in extra bureaucracy. SMT1 Mr Dutton was visiting Ms Plibersek to guarantee that no dollar rescue plan will be spent on additional bureaucracy. NMT1 Mr Dutton paid a visit to Ms Plibersek to guarantee that no dollar from the rescue plan will be spent on extra bureaucracy. Mr Dutton called on Ms Plibersek to guarantee that not one dollar out of the rescue package would be spent on Reference additional bureaucracy. Table 4: Cases of translation results from French to English in newstest 2014. The models of SMT0, NMT0, SMT1 and NMT1 are corresponding to the steps in Table 2. 
word, the above analysis verifies that noise and errors in unsupervised NMT models can be eliminated in a timely manner by SMT models acting as posterior regularization in our method.

From these cases, we find that SMT can also benefit from NMT models. Even though the meanings of the key words can be captured by SMT, the outputs of SMT0 are not fluent, especially in the second case. This problem is relieved in SMT1, after SMT is fed with more fluent pseudo data generated by NMT0, which validates that SMT and NMT can incrementally boost each other with our method.

5 Related Work

Previous unsupervised neural machine translation systems (Artetxe et al. 2017; Lample, Denoyer, and Ranzato 2017; Yang et al. 2018) are mainly modifications of the current encoder-decoder structure. To constrain the encoder outputs for the two languages into the same latent space, Artetxe et al. (2017) and Lample et al. (2017) use a shared encoder, while Yang et al. (2018) use a weight sharing mechanism. Denoising auto-encoders (Vincent et al. 2010) and adversarial training methods are also leveraged to improve the ability of encoders. Besides, iterative back-translation is applied to generate pseudo parallel data for cross-lingual training.

After that, Lample et al. (2018) summarize three principles for unsupervised machine translation, namely initialization, language modeling and iterative back-translation, and propose some effective methods with simplified training procedures. Four methods are leveraged in their work, including unsupervised NMT, unsupervised PBSMT and two combinations of them. Our method differs from theirs. In their methods, SMT and NMT are treated as independent models, so they suffer from their respective deficiencies and cannot benefit from each other during training. In contrast, we combine them into a unified EM training framework and enable them to improve jointly and boost each other incrementally, where NMT models are responsible for smoothing and fluency, while SMT models are responsible for denoising and guiding NMT models.

Moreover, there has been some work exploiting SMT features to improve supervised NMT. In He et al. (2016), the probability calculated by NMT is integrated as a feature into a log-linear model. After that, Tang et al. (2016) and Wang et al. (2017) leverage gate mechanisms to introduce a phrase table or candidates provided by SMT into NMT models. Different from them, we leave the model structures unchanged via the framework of posterior regularization. Zhang et al. (2017) also integrate more prior knowledge into the training of NMT with the help of posterior regularization. The major difference is that we introduce the successful practice of iterative back-translation into this framework with a unified EM training algorithm, where SMT and NMT models can benefit from each other. Additionally, in unsupervised scenarios, our SMT features are learned from scratch and improved incrementally, rather than pre-trained from real bilingual data and fixed during the whole procedure.

6 Conclusion

In this paper, we introduce SMT models as posterior regularization to denoise and guide unsupervised NMT models, exploiting their ability to construct more reliable phrase tables and to eliminate the infrequent and bad patterns generated in the back-translation iterations of NMT. We unify SMT and NMT models within the EM training algorithm, where they can be trained jointly and benefit from each other incrementally. In the experiments conducted on en-fr and en-de language pairs, our method significantly outperforms previous methods and achieves new state-of-the-art performance in unsupervised machine translation, which demonstrates the effectiveness of our method. In the future, we may delve into the initialization stage, which is crucial to the final performance of the proposed method.

References

[2017] Artetxe, M.; Labaka, G.; Agirre, E.; and Cho, K. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

[2018] Artetxe, M.; Labaka, G.; and Agirre, E. 2018. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 5012–5019.

[2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[2010] Ganchev, K.; Gillenwater, J.; Taskar, B.; et al. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research 11(Jul):2001–2049.

[2018] Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C.; Huang, X.; Junczys-Dowmunt, M.; Lewis, W.; Li, M.; et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

[2016] He, W.; He, Z.; Wu, H.; and Wang, H. 2016. Improved neural machine translation with SMT features. In AAAI, 151–157.

[2007] Johnson, H.; Martin, J.; Foster, G.; and Kuhn, R. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

[2018] Khayrallah, H., and Koehn, P. 2018. On the impact of various types of noise on neural machine translation. arXiv preprint arXiv:1805.12282.

[2017] Koehn, P., and Knowles, R. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

[2003] Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 48–54. Association for Computational Linguistics.

[2018] Lample, G.; Ott, M.; Conneau, A.; Denoyer, L.; and Ranzato, M. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

[2017] Lample, G.; Denoyer, L.; and Ranzato, M. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

[2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

[2007] McLachlan, G., and Krishnan, T. 2007. The EM algorithm and extensions, volume 382. John Wiley & Sons.

[2002] Och, F. J., and Ney, H. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 295–302. Association for Computational Linguistics.

[2003] Och, F. J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, 160–167. Association for Computational Linguistics.

[2015] Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

[2016] Sennrich, R.; Haddow, B.; and Birch, A. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 86–96.

[2015] Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

[2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

[2016] Tang, Y.; Meng, F.; Lu, Z.; Li, H.; and Yu, P. L. 2016. Neural machine translation with external phrase memory. arXiv preprint arXiv:1606.01792.

[2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010.

[2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11(Dec):3371–3408.

[2017] Wang, X.; Lu, Z.; Tu, Z.; Li, H.; Xiong, D.; and Zhang, M. 2017. Neural machine translation advised by statistical machine translation. In AAAI, 3330–3336.

[2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

[2018] Yang, Z.; Chen, W.; Wang, F.; and Xu, B. 2018. Unsupervised neural machine translation with weight sharing. arXiv preprint arXiv:1804.09057.

[2017] Zhang, J.; Liu, Y.; Luan, H.; Xu, J.; and Sun, M. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 1514–1523.

[2018a] Zhang, Z.; Liu, S.; Li, M.; Zhou, M.; and Chen, E. 2018a. Joint training for neural machine translation models with monolingual data. In AAAI.

[2018b] Zhang, Z.; Wu, S.; Liu, S.; Li, M.; Zhou, M.; and Chen, E. 2018b. Regularizing neural machine translation by target-bidirectional agreement. CoRR abs/1808.04064.