Training and Pre-training for Neural Machine Translation

Speaker bio: Shujian Huang, Ph.D., is an Associate Professor and doctoral advisor in the Department of Computer Science and Technology, Nanjing University. He received his B.Eng. and Ph.D. degrees from Nanjing University in 2006 and 2012, respectively. His main research interests include machine translation, computer-aided translation, text analysis and understanding, and knowledge discovery. He has served as a PC/SPC member for conferences such as ACL, AAAI, IJCAI, NAACL, and EMNLP, as Program Committee Chair of CCMT 2019, as Machine Translation Area Chair of NLPCC 2016, and as Evaluation Committee Chair of CWMT 2017 and 2018. He has also served as an executive member of the Youth Working Committee of the Chinese Information Processing Society of China (CIPS) and as deputy director of the CIPS Machine Translation Technical Committee. In 2017 he was supported by the Jiangsu Provincial Natural Science Foundation for Excellent Young Scholars and the Jiangsu Young Science and Technology Talent Support Program. In 2019, a Ph.D. student he co-advised received the Excellent Ph.D. Student Award of the Chinese Association for Artificial Intelligence, and he received the CIPSC Distinguished Service Award; in 2020 he received the CCF-NLPCC Young Outstanding Scientist Award.

Abstract: Neural machine translation systems based on deep learning typically have very large numbers of parameters and require compute-intensive training on large-scale data. On the one hand, training at this scale is prone to over-fitting and related problems, making the process unstable; this talk introduces a knowledge-distillation method based on dynamically updated checkpoints that improves training stability. On the other hand, models pre-trained on large-scale unlabeled data can supply machine translation models with useful prior knowledge, but directly fine-tuning the pre-trained parameters often suffers from catastrophic forgetting; the talk will also cover work on exploiting pre-trained models for knowledge-base question answering and machine translation. The related papers were published at NAACL 2019, ACL 2019, and AAAI 2020.


1. Training and Pre-training for Machine Translation Models. Shujian Huang, Associate Professor, Doctoral Advisor. National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University

2. Outline • Introduction to neural machine translation • Training translation models (NAACL 2019) • Pre-trained models and machine translation (ACL 2019, AAAI 2020)

3. Thanks to the collaborating faculty and students! • Online Distilling from Checkpoints for Neural Machine Translation (NAACL 2019) Hao-Ran Wei, Shujian Huang*, Boxing Chen, Ran Wang, Xin-Yu Dai and Jiajun Chen • Learning Representation Mapping for Relation Detection in Knowledge Base Question Answering (ACL 2019) Peng Wu, Shujian Huang, Rongxiang Weng, Zaixiang Zheng, Jianbing Zhang, Xiaohui Yan and Jiajun Chen • Acquiring Knowledge from Pre-trained Model to Neural Machine Translation (AAAI 2020) Rongxiang Weng, Heng Yu, Shujian Huang, Shanbo Cheng, Weihua Luo • Nanjing University: Hao-Ran Wei (M.S. student), Ran Wang (M.S. student), Peng Wu (M.S. student), Zaixiang Zheng (Ph.D. student), Rongxiang Weng (M.S. student), Xin-Yu Dai (Professor), Jiajun Chen (Professor), Jianbing Zhang (Associate Professor) • Alibaba: Boxing Chen, Heng Yu, Shanbo Cheng, Weihua Luo • Huawei: Xiaohui Yan

4. Neural Machine Translation • Translating from a word sequence to a word sequence (sequence2sequence) – Simply treat the sentence as a sequence of words – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ... (Cho et al., 2014)

5. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ... → ... Xi Jinping hosted the meeting ... (Cho et al., 2014)

6. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Bi-directional RNN + Attention: attention provides a variable context for each target word, whereas in the plain encoder-decoder every target word is generated from the same context C (an information bottleneck) (Cho et al., 2014) (Bahdanau et al., 2015)

7. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Use the attention mechanism to gather information dynamically – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ...

At each step, a variable-length alignment vector a_t, whose size equals the number of source-side time steps, is derived by comparing the current target hidden state h_t with each source hidden state h̄_s:

a_t(s) = align(h_t, h̄_s) = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))

and the context vector is the weighted sum c_t = Σ_s a_t(s) h̄_s. Here score is a content-based function for which three alternatives are considered:

score(h_t, h̄_s) = h_tᵀ h̄_s (dot), h_tᵀ W_a h̄_s (general), or W_a [h_t; h̄_s] (concat)

(Thang Luong et al., 2015) (Bahdanau et al., 2015)
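The alignment and context computation on this slide can be sketched in a few lines of NumPy (a minimal illustration of the dot and general scoring variants; the function and argument names are illustrative, not from the talk):

```python
import numpy as np

def attention(h_t, H_s, mode="dot", W_a=None):
    """Luong-style attention: weights a_t over source states and context c_t.

    h_t : (d,)   current target hidden state
    H_s : (n, d) source hidden states (one row per source position)
    """
    if mode == "dot":        # score(h_t, h_s) = h_t . h_s
        scores = H_s @ h_t
    elif mode == "general":  # score(h_t, h_s) = h_t^T W_a h_s
        scores = H_s @ (W_a.T @ h_t)
    else:
        raise ValueError(f"unknown mode: {mode}")
    e = np.exp(scores - scores.max())   # softmax, shifted for stability
    a_t = e / e.sum()                   # alignment weights, sum to 1
    c_t = a_t @ H_s                     # context: weighted sum of source states
    return a_t, c_t
```

The returned `a_t` is exactly the softmax over scores in the equation above, and `c_t` is the weighted sum that the decoder consumes at this time step.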

8. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Use the attention mechanism to gather information dynamically – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ... → Xi Jinping (Bahdanau et al., 2015)

9. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Use the attention mechanism to gather information dynamically – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ... → Xi Jinping hosted (Bahdanau et al., 2015)

10. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Use the attention mechanism to gather information dynamically – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ... → Xi Jinping hosted the (Bahdanau et al., 2015)

11. Neural Machine Translation • Translating from a word sequence to a word sequence – Simply treat the sentence as a sequence of words – Use the attention mechanism to gather information dynamically – Example: 习近平主持召开中央全面深化改革领导小组会议 → 习近平 主持 召开 中央 ... 会议 ... → ... Xi Jinping hosted the meeting ... (Bahdanau et al., 2015)

12. Encoder-Decoder with Attention https://medium.com/analytics-vidhya/transformer-vs-rnn-and-cnn-18eeefa3602b

13. Self-Attention Networks (Transformer) (Vaswani et al., 2017) https://medium.com/analytics-vidhya/transformer-vs-rnn-and-cnn-18eeefa3602b

14. Training Models with Large-scale Parameters • Using large-scale compute resources to learn models with large numbers of parameters from large-scale data

Model                              Parameters        Training data / storage
Transformer-Base (6+6 layers)          65,000,000    ~4,500,000 sentence pairs
Transformer-Large (6+6 layers)        213,000,000    ~4,500,000 sentence pairs
BERT-Base (12 layers)                 110,000,000
BERT-Large (24 layers)                340,000,000
GPT (12 layers)                       117,000,000
GPT-2 (48 layers)                   1,500,000,000
GPT-3 (96 layers)               175,000,000,000    500,000,000,000 characters; over 700 GB just to store the parameters
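The 700 GB figure for GPT-3 follows from simple arithmetic, assuming each parameter is stored as a 4-byte fp32 value:

```python
# Back-of-the-envelope: 175 billion fp32 parameters at 4 bytes each.
params = 175_000_000_000
gigabytes = params * 4 / 10**9   # decimal gigabytes
print(gigabytes)  # 700.0
```

Half-precision storage would halve this, but even then the parameters alone occupy hundreds of gigabytes.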

15. Some Problems in Learning Large-scale Parameters • Large-scale data vs. relatively limited real-time compute – Updating on the full data at once is infeasible – Stochastic gradient descent: update gradients on small batches of data – Training is not very stable, since each update depends on the particular batch – Improving training stability (NAACL 2019) • Large-scale parameters vs. relatively limited data – Hard to guarantee that all parameters are learned effectively – Pre-training: train the parameters in advance on other data or tasks – The data distributions of the two stages differ, and catastrophic forgetting may occur – Adapting pre-trained models (ACL 2019, AAAI 2020)

16. Improving Training Stability with Online Knowledge Distillation

17. A problem that often arises during training: • performance on the training set and the validation set diverges

18. Common approaches: • Regularization/Dropout – label smoothing – weight decay – ... These are agnostic to the specific training process. (Regularization techniques for fine-tuning in neural machine translation, Barone et al., EMNLP 2017)

19. Common approaches: • Checkpoints – checkpoint smoothing (parameter averaging) – checkpoint ensemble (model ensembling) – ... These cannot change the training process itself. (Regularization techniques for fine-tuning in neural machine translation, Barone et al., EMNLP 2017)

20. Related Work • Knowledge Distillation – Learning from labels: fit p(y|x; θ) to the reference ŷ – Learning from model decisions (distillation): fit p(y|x; θ) to a teacher model's distribution q(y|x; θ̃) – The two knowledge sources can be used together – Typically used for model compression, but can also be viewed as a form of model regularization
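The two knowledge sources on this slide can be combined into a single loss. Below is a minimal NumPy sketch (the interpolation weight `alpha` and temperature `T` are illustrative defaults, not values from the talk):

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, gold, alpha=0.5, T=1.0):
    """Interpolate hard-label cross-entropy with soft-target cross-entropy.

    student_logits, teacher_logits : (V,) unnormalized scores over vocabulary
    gold  : index of the reference word (learning from labels)
    alpha : weight on the distillation term (learning from teacher decisions)
    """
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    p = softmax(student_logits / T)     # student distribution p(y|x; theta)
    q = softmax(teacher_logits / T)     # teacher distribution q(y|x; theta~)
    ce_hard = -np.log(p[gold])          # cross-entropy against the label
    ce_soft = -(q * np.log(p)).sum()    # cross-entropy against soft targets
    return (1 - alpha) * ce_hard + alpha * ce_soft
```

With `alpha = 0` this reduces to ordinary maximum-likelihood training; with `alpha = 1` the student learns only from the teacher's decisions.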

21. Our Approach • Knowledge distillation from checkpoints – It can change the learning process – The student learns from a dynamically evolving teacher model

22. Schematic of ODC

23. Algorithm

24. Objective Function • Combine the original training objective with the knowledge-distillation objective
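A toy sketch of the overall training scheme described on the preceding slides (everything here is a stand-in for illustration: a scalar "model", synthetic "batches", a fake dev score; this is not the authors' implementation). The point is the control flow: the best checkpoint so far acts as the teacher, the teacher is refreshed whenever the dev score improves, and each update optimizes the combined objective (1 - lam) * NLL + lam * KD:

```python
import random

def evaluate(model):
    """Stand-in dev metric: higher is better, peaking at model == 1.0."""
    return -abs(model - 1.0)

def train_odc(steps=200, lam=0.5, lr=0.05, eval_every=10):
    random.seed(0)
    model, teacher, best_score = 0.0, None, float("-inf")
    for step in range(1, steps + 1):
        batch = 1.0 + random.uniform(-0.1, 0.1)   # noisy training signal
        grad = 2 * (model - batch)                # gradient of the toy NLL term
        if teacher is not None:                   # add the distillation term
            grad = (1 - lam) * grad + lam * 2 * (model - teacher)
        model -= lr * grad                        # one SGD step
        if step % eval_every == 0:
            score = evaluate(model)
            if score > best_score:                # new best checkpoint found:
                best_score = score                # it becomes the teacher
                teacher = model
    return model
```

Because the teacher is re-snapshotted online, the distillation target improves as training proceeds, unlike a fixed pre-trained teacher.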

25. Choosing a Better Teacher Model • Knowledge distillation usually works better when the teacher model performs better • checkpoint ensemble – best k checkpoints • checkpoint smoothing – best k checkpoints, last k checkpoints • exponential moving average (EMA) of the model parameters (Tarvainen and Valpola, 2017):

θ'_t = α θ'_{t-1} + (1 - α) θ_t    (9)

where t is the update step, θ denotes the parameters of the training model and θ' the EMA parameters; the decay weight α is set close to 1.0, typically in the multiple-nines range, e.g. 0.999 or 0.9999.
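The EMA update above is straightforward to sketch (illustrative code; scalar lists stand in for real model tensors):

```python
# EMA of parameters: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t,
# with the decay weight alpha close to 1.0 (e.g. 0.999 or 0.9999).
def ema_update(ema_params, params, alpha=0.999):
    return [alpha * e + (1 - alpha) * p for e, p in zip(ema_params, params)]

# With a constant training parameter of 1.0, the EMA drifts slowly toward it:
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0], alpha=0.99)
print(ema[0])
```

The closed form of this recurrence is θ'_n = 1 - α^n for θ'_0 = 0 and a constant target of 1, which is why a larger α gives a smoother, more slowly moving teacher.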

26. Experiments • Machine translation: – NIST Chinese-English – WMT17 Chinese-English – Low-resource: IWSLT15 English-Vietnamese, WMT17 English-Turkish • Reading comprehension: – BiDAF++: 76.83 -> 77.40

27. Experiments on the NIST Chinese-English Data

SYSTEMS                                NIST03  NIST04  NIST05  NIST06  Average    Δ
RNNSearch (Zhang et al., 2018b)         36.59   39.57   35.56   35.29     -       -
Transformer-base (Yang et al., 2018)    42.23   42.17   41.02     -       -       -
baseline                                43.78   44.26   40.97   38.93   41.39     -
baseline + LKS                          44.12   44.87   41.59   39.22   41.89   +0.50
baseline + BKS                          44.23   44.98   41.62   39.74   42.11   +0.73
baseline + BKE                          44.30   45.01   41.86   40.05   42.31   +0.92
ODC                                     45.33   45.18   42.60   39.67   42.48   +1.10
ODC + LKS                               45.05   45.49   42.99   40.48   42.99   +1.60
ODC + BKS                               45.35   45.49   43.21   39.96   42.89   +1.50
ODC + BKE                               45.34   45.92   43.35   40.30   43.19   +1.80
ODC-EMA                                 45.52   45.72   43.01   40.65   43.13   +1.74

Table 1: Case-insensitive BLEU scores of Chinese-English translation on NIST datasets. "Average" means average scores on NIST04, 05 and 06. LKS/BKS denote last-k/best-k smoothing; BKE denotes best-k ensemble.

28. WMT Datasets • Chinese-English, English-Vietnamese, English-Turkish

SYSTEM                   newsdev2017   newstest2017
Zhang et al. (2018c)          -            23.01
baseline                    21.96          23.37
ODC                         22.24          24.22

Table 2: Case-sensitive BLEU scores on WMT17 Chinese-English translation.

SYSTEMS                    EN2VI   EN2TR
Zhang et al. (2018a)         -     12.11
tensor2tensor              28.43     -
baseline                   28.56   12.20
baseline w/ grid-search    29.01   12.51
ODC                        29.47   12.92
ODC w/ grid-search         29.59   13.18

Table 3: Case-sensitive BLEU scores on two low-resource translation tasks, IWSLT15 English-Vietnamese (EN2VI) and WMT17 English-Turkish (EN2TR).

– Due to the limited amount of training data, models on the low-resource tasks are more likely to over-fit, so a grid search is run over dropout (0.2, 0.3, 0.4) and weight-decay coefficients (10^-1, 10^-2, 10^-3). The results show trends similar to the NIST experiments: using the best checkpoint as the teacher helps the translation model, and ODC outperforms both the baseline and the baseline with grid parameter search, while decoding with a single model rather than an ensemble of k models.

29. Dev loss and BLEU curves over training (English-Vietnamese)
