ULMFiT可以更有效地解决各种NLP问题，包括：

Jane发布于2018/12/13

4.tuning method, discriminative fine-tuning3 . the LR, cut is the iteration when we switch from Instead of using the same learning rate for all increasing to decreasing the LR, p is the fraction of layers of the model, discriminative fine-tuning al- the number of iterations we have increased or will lows us to tune each layer with different learning decrease the LR respectively, ratio specifies how rates. For context, the regular stochastic gradient much smaller the lowest LR is from the maximum descent (SGD) update of a model’s parameters θ at LR ηmax , and ηt is the learning rate at iteration t. time step t looks like the following (Ruder, 2016): We generally use cut f rac = 0.1, ratio = 32 and ηmax = 0.01. θt = θt−1 − η · ∇θ J(θ) (1) STLR modifies triangular learning rates (Smith, 2017) with a short increase and a long decay pe- where η is the learning rate and ∇θ J(θ) is the gra- riod, which we found key for good performance.5 dient with regard to the model’s objective func- In Section 5, we compare against aggressive co- tion. For discriminative fine-tuning, we split the sine annealing, a similar schedule that has recently parameters θ into {θ1 , . . . , θL } where θl contains been used to achieve state-of-the-art performance the parameters of the model at the l-th layer and in CV (Loshchilov and Hutter, 2017).6 L is the number of layers of the model. Similarly, we obtain {η 1 , . . . , η L } where η l is the learning rate of the l-th layer. The SGD update with discriminative fine- tuning is then the following: θtl = θt−1 l − η l · ∇θl J(θ) (2) We empirically found it to work well to first choose the learning rate η L of the last layer by fine-tuning only the last layer and using η l−1 = η l /2.6 as the learning rate for lower layers. Slanted triangular learning rates For adapting its parameters to task-specific features, we would like the model to quickly converge to a suitable Figure 2: The slanted triangular learning rate region of the parameter space in the beginning schedule used for ULMFiT as a function of the of training and then refine its parameters. Using number of training iterations. the same learning rate (LR) or an annealed learn- ing rate throughout training is not the best way 3.3 Target task classifier fine-tuning to achieve this behaviour. Instead, we propose slanted triangular learning rates (STLR), which Finally, for fine-tuning the classifier, we augment first linearly increases the learning rate and then the pretrained language model with two additional linearly decays it according to the following up- linear blocks. Following standard practice for date schedule, which can be seen in Figure 2: CV classifiers, each block uses batch normaliza- tion (Ioffe and Szegedy, 2015) and dropout, with cut = T · cut f rac ReLU activations for the intermediate layer and a softmax activation that outputs a probability dis- t/cut, if t < cut p= tribution over target classes at the last layer. Note 1 − cut·(1/cut f rac−1) , otherwise (3) t−cut that the parameters in these task-specific classi- 1 + p · (ratio − 1) fier layers are the only ones that are learned from ηt = ηmax · scratch. The first linear layer takes as the input the ratio pooled last hidden layer states. where T is the number of training iterations4 , cut f rac is the fraction of iterations we increase Concat pooling The signal in text classification 3 tasks is often contained in a few words, which may An unrelated method of the same name exists for deep 5 Boltzmann machines (Salakhutdinov and Hinton, 2009). We also credit personal communication with the author. 4 6 In other words, the number of epochs times the number While Loshchilov and Hutter (2017) use multiple anneal- of updates per epoch. ing cycles, we generally found one cycle to work best.

6. Model Test Model Test CoVe (McCann et al., 2017) 8.2 CoVe (McCann et al., 2017) 4.2 TREC-6 IMDb oh-LSTM (Johnson and Zhang, 2016) 5.9 TBCNN (Mou et al., 2015) 4.0 Virtual (Miyato et al., 2016) 5.9 LSTM-CNN (Zhou et al., 2016) 3.9 ULMFiT (ours) 4.6 ULMFiT (ours) 3.6 Table 2: Test error rates (%) on two text classification datasets used by McCann et al. (2017). AG DBpedia Yelp-bi Yelp-full Char-level CNN (Zhang et al., 2015) 9.51 1.55 4.88 37.95 CNN (Johnson and Zhang, 2016) 6.57 0.84 2.90 32.39 DPCNN (Johnson and Zhang, 2017) 6.87 0.88 2.64 30.58 ULMFiT (ours) 5.01 0.80 2.16 29.98 Table 3: Test error rates (%) on text classification datasets used by Johnson and Zhang (2017). Topic classification For topic classification, we used in (Merity et al., 2017a). evaluate on the large-scale AG news and DBpedia ontology datasets created by Zhang et al. (2015). Baselines and comparison models For each task, we compare against the current state-of-the- Pre-processing We use the same pre-processing art. For the IMDb and TREC-6 datasets, we com- as in earlier work (Johnson and Zhang, 2017; Mc- pare against CoVe (McCann et al., 2017), a state- Cann et al., 2017). In addition, to allow the lan- of-the-art transfer learning method for NLP. For guage model to capture aspects that might be rel- the AG, Yelp, and DBpedia datasets, we com- evant for classification, we add special tokens for pare against the state-of-the-art text categorization upper-case words, elongation, and repetition. method by Johnson and Zhang (2017). Hyperparameters We are interested in a model 4.2 Results that performs robustly across a diverse set of tasks. To this end, if not mentioned otherwise, we use the For consistency, we report all results as error rates same set of hyperparameters across tasks, which (lower is better). We show the test error rates we tune on the IMDb validation set. We use on the IMDb and TREC-6 datasets used by Mc- the AWD-LSTM language model (Merity et al., Cann et al. (2017) in Table 2. Our method outper- 2017a) with an embedding size of 400, 3 layers, forms both CoVe, a state-of-the-art transfer learn- 1150 hidden activations per layer, and a BPTT ing method based on hypercolumns, as well as the batch size of 70. We apply dropout of 0.4 to state-of-the-art on both datasets. On IMDb, we layers, 0.3 to RNN layers, 0.4 to input embed- reduce the error dramatically by 43.9% and 22% ding layers, 0.05 to embedding layers, and weight with regard to CoVe and the state-of-the-art re- dropout of 0.5 to the RNN hidden-to-hidden ma- spectively. This is promising as the existing state- trix. The classifier has a hidden layer of size 50. of-the-art requires complex architectures (Peters We use Adam with β1 = 0.7 instead of the de- et al., 2018), multiple forms of attention (McCann fault β1 = 0.9 and β2 = 0.99, similar to (Dozat et al., 2017) and sophisticated embedding schemes and Manning, 2017). We use a batch size of 64, (Johnson and Zhang, 2016), while our method em- a base learning rate of 0.004 and 0.01 for fine- ploys a regular LSTM with dropout. We note tuning the LM and the classifier respectively, and that the language model fine-tuning approach of tune the number of epochs on the validation set of Dai and Le (2015) only achieves an error of 7.64 each task7 . We otherwise use the same practices vs. 4.6 for our method on IMDb, demonstrating the benefit of transferring knowledge from a large 7 On small datasets such as TREC-6, we fine-tune the LM ImageNet-like corpus using our fine-tuning tech- only for 15 epochs without overfitting, while we can fine-tune longer on larger datasets. We found 50 epochs to be a good niques. IMDb in particular is reflective of real- default for fine-tuning the classifier. world datasets: Its documents are generally a few

7.Figure 3: Validation error rates for supervised and semi-supervised ULMFiT vs. training from scratch with different numbers of training examples on IMDb, TREC-6, and AG (from left to right). paragraphs long—similar to emails (e.g for legal Pretraining IMDb TREC-6 AG discovery) and online comments (e.g for commu- Without pretraining 5.63 10.67 5.52 nity management); and sentiment analysis is simi- With pretraining 5.00 5.69 5.38 lar to many commercial applications, e.g. product response tracking and support email routing. Table 4: Validation error rates for ULMFiT with On TREC-6, our improvement—similar as the and without pretraining. improvements of state-of-the-art approaches—is not statistically significant, due to the small size of the 500-examples test set. Nevertheless, the com- a task with a small number of labels. We evalu- petitive performance on TREC-6 demonstrates ate ULMFiT on different numbers of labeled ex- that our model performs well across different amples in two settings: only labeled examples are dataset sizes and can deal with examples that range used for LM fine-tuning (‘supervised’); and all from single sentences—in the case of TREC-6— task data is available and can be used to fine-tune to several paragraphs for IMDb. Note that despite the LM (‘semi-supervised’). We compare ULM- pretraining on more than two orders of magnitude FiT to training from scratch—which is necessary less data than the 7 million sentence pairs used by for hypercolumn-based approaches. We split off McCann et al. (2017), we consistently outperform balanced fractions of the training data, keep the their approach on both datasets. validation set fixed, and use the same hyperparam- We show the test error rates on the larger AG, eters as before. We show the results in Figure 3. DBpedia, Yelp-bi, and Yelp-full datasets in Table On IMDb and AG, supervised ULMFiT with 3. Our method again outperforms the state-of- only 100 labeled examples matches the perfor- the-art significantly. On AG, we observe a simi- mance of training from scratch with 10× and 20× larly dramatic error reduction by 23.7% compared more data respectively, clearly demonstrating the to the state-of-the-art. On DBpedia, Yelp-bi, and benefit of general-domain LM pretraining. If we Yelp-full, we reduce the error by 4.8%, 18.2%, allow ULMFiT to also utilize unlabeled exam- 2.0% respectively. ples (50k for IMDb, 100k for AG), at 100 labeled examples, we match the performance of training 5 Analysis from scratch with 50× and 100× more data on AG In order to assess the impact of each contribution, and IMDb respectively. On TREC-6, ULMFiT we perform a series of analyses and ablations. We significantly improves upon training from scratch; run experiments on three corpora, IMDb, TREC- as examples are shorter and fewer, supervised and 6, and AG that are representative of different tasks, semi-supervised ULMFiT achieve similar results. genres, and sizes. For all experiments, we split off 10% of the training set and report error rates on Impact of pretraining We compare using no this validation set with unidirectional LMs. We pretraining with pretraining on WikiText-103 fine-tune the classifier for 50 epochs and train all (Merity et al., 2017b) in Table 4. Pretraining is methods but ULMFiT with early stopping. most useful for small and medium-sized datasets, which are most common in commercial applica- Low-shot learning One of the main benefits of tions. However, even for large datasets, pretrain- transfer learning is being able to train a model for ing improves performance.

8. LM IMDb TREC-6 AG Classifier fine-tuning IMDb TREC-6 AG Vanilla LM 5.98 7.41 5.76 From scratch 9.93 13.36 6.81 AWD-LSTM LM 5.00 5.69 5.38 Full 6.87 6.86 5.81 Full + discr 5.57 6.21 5.62 Table 5: Validation error rates for ULMFiT with a Last 6.49 16.09 8.38 vanilla LM and the AWD-LSTM LM. Chain-thaw 5.39 6.71 5.90 Freez 6.37 6.86 5.81 LM fine-tuning IMDb TREC-6 AG Freez + discr 5.39 5.86 6.04 Freez + stlr 5.04 6.02 5.35 No LM fine-tuning 6.99 6.38 6.09 Freez + cos 5.70 6.38 5.29 Full 5.86 6.54 5.61 Freez + discr + stlr 5.00 5.69 5.38 Full + discr 5.55 6.36 5.47 Full + discr + stlr 5.00 5.69 5.38 Table 7: Validation error rates for ULMFiT with different methods to fine-tune the classifier. Table 6: Validation error rates for ULMFiT with different variations of LM fine-tuning. of 0.001 and 0.0001 for the last and all other layers respectively for ‘Chain-thaw’ as in (Felbo et al., Impact of LM quality In order to gauge the im- 2017), and a learning rate of 0.001 otherwise. We portance of choosing an appropriate LM, we com- show the results in Table 7. pare a vanilla LM with the same hyperparame- Fine-tuning the classifier significantly improves ters without any dropout8 with the AWD-LSTM over training from scratch, particularly on the LM with tuned dropout parameters in Table 5. small TREC-6. ‘Last’, the standard fine-tuning Using our fine-tuning techniques, even a regular method in CV, severely underfits and is never LM reaches surprisingly good performance on the able to lower the training error to 0. ‘Chain- larger datasets. On the smaller TREC-6, a vanilla thaw’ achieves competitive performance on the LM without dropout runs the risk of overfitting, smaller datasets, but is outperformed significantly which decreases performance. on the large AG. ‘Freez’ provides similar per- Impact of LM fine-tuning We compare no fine- formance as ‘Full’. ‘Discr’ consistently boosts tuning against fine-tuning the full model (Erhan the performance of ‘Full’ and ‘Freez’, except et al., 2010) (‘Full’), the most commonly used for the large AG. Cosine annealing is competi- fine-tuning method, with and without discrimi- tive with slanted triangular learning rates on large native fine-tuning (‘Discr’) and slanted triangular data, but under-performs on smaller datasets. Fi- learning rates (‘Stlr’) in Table 6. Fine-tuning the nally, full ULMFiT classifier fine-tuning (bottom LM is most beneficial for larger datasets. ‘Discr’ row) achieves the best performance on IMDB and and ‘Stlr’ improve performance across all three TREC-6 and competitive performance on AG. Im- datasets and are necessary on the smaller TREC-6, portantly, ULMFiT is the only method that shows where regular fine-tuning is not beneficial. excellent performance across the board—and is therefore the only universal method. Impact of classifier fine-tuning We compare training from scratch, fine-tuning the full model Classifier fine-tuning behavior While our re- (‘Full’), only fine-tuning the last layer (‘Last’) sults demonstrate that how we fine-tune the clas- (Donahue et al., 2014), ‘Chain-thaw’ (Felbo et al., sifier makes a significant difference, fine-tuning 2017), and gradual unfreezing (‘Freez’). We fur- for inductive transfer is currently under-explored thermore assess the importance of discriminative in NLP as it mostly has been thought to be un- fine-tuning (‘Discr’) and slanted triangular learn- helpful (Mou et al., 2016). To better understand ing rates (‘Stlr’). We compare the latter to an the fine-tuning behavior of our model, we compare alternative, aggressive cosine annealing schedule the validation error of the classifier fine-tuned with (‘Cos’) (Loshchilov and Hutter, 2017). We use a ULMFiT and ‘Full’ during training in Figure 4. learning rate η L = 0.01 for ‘Discr’, learning rates On all datasets, fine-tuning the full model leads 8 To avoid overfitting, we only train the vanilla LM classi- to the lowest error comparatively early in train- fier for 5 epochs and keep dropout of 0.4 in the classifier. ing, e.g. already after the first epoch on IMDb.