Dictionary-Guided Editing Networks for Paraphrase Generation

An intuitive way for a human to write paraphrase sentences is to replace words or phrases in the original sentence with their corresponding synonyms and make necessary changes to ensure the new sentences are fluent and grammatically correct. We propose a novel approach to modeling this process with dictionary-guided editing networks, which effectively conduct rewriting on the source sentence to generate paraphrase sentences. The model jointly learns the selection of the appropriate word level and phrase level paraphrase pairs in the context of the original sentence from an off-the-shelf dictionary as well as the generation of fluent natural language sentences. Specifically, the system retrieves a set of word level and phrase level paraphrased pairs derived from the Paraphrase Database (PPDB) for the original sentence, which is used to guide the decision of which words might be deleted or inserted with the soft attention mechanism under the sequence-to-sequence framework. We conduct experiments on two benchmark datasets for paraphrase generation, namely the MSCOCO and Quora datasets. The evaluation results demonstrate that our dictionary-guided editing networks outperform the baseline methods.

Shaohan Huang†, Yu Wu‡, Furu Wei†, Ming Zhou†
† Microsoft Research, Beijing, China
‡ State Key Lab of Software Development Environment, Beihang University, Beijing, China
{shaohanh, fuwei, mingzhou}@microsoft.com, wuyu@buaa.edu.cn

arXiv:1806.08077v1 [cs.CL] 21 Jun 2018
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Paraphrase generation aims to generate restatements of the meaning of a text or passage using other words. It is a fundamental task in natural language processing with many applications in information retrieval, question answering, dialogue, and conversation systems. Existing work on paraphrase generation focuses on generating paraphrase sentences from scratch. For example, Quirk, Brockett, and Dolan (2004) propose generating paraphrases with statistical machine translation models. Recently, neural network based generative models under the sequence-to-sequence framework have also been used for paraphrase generation (Prakash et al. 2016; Gupta et al. 2017).

However, an intuitive way for a human to write paraphrase sentences is to replace words or phrases in the original sentence with their corresponding synonyms and make necessary changes to ensure the new sentences are fluent and grammatically correct. Figure 1 shows an example. Given the input sentence “What are the best ways to overcome boredom?”, we can first replace “overcome” with the word-level paraphrase “get rid of”, and then make small changes over the new sentence to ensure it is grammatically correct and fluent. Certainly, it should be emphasized that the selection of context-relevant paraphrase pairs from an off-the-shelf dictionary with respect to the original sentence is also important for a good revision. This process demonstrates that humans usually write paraphrase sentences by editing the input sentence, which motivates us to develop models for paraphrase generation through editing.

[Figure 1: The dictionary-guided editing networks model first retrieves a group of paraphrased pairs and then generates a paraphrase using the original sentence as a prototype. Panel (a) Retrieve: for the original sentence “What are the best ways to overcome boredom ?”, the retrieved pairs and their scores are 0.4 (overcome, get rid of), 0.3 (be overcome, be resolved), 0.1 (best, nicest), 0.1 (the best ways, the most suitable ways), and 0.1 (best ways, the most efficient ways). Panel (b) Generate: the paraphrase sentence is “What is the most efficient way of getting rid of boredom ?”]

We are inspired by Guu et al.'s (2017) pioneering work on a new paradigm to generate sentences. Specifically, they propose a new generative model of sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Unlike randomly sampling the edit vector to generate a new sentence, we can leverage the off-the-shelf word level and phrase level paraphrase pairs (e.g. synonyms) to construct the editing vector, where the deletion of words from the original sentence and the insertion of words into the target sentence can be explicitly modeled.

In this paper, we propose dictionary-guided editing networks for paraphrase generation, which effectively conduct rewriting on the source sentence to generate paraphrase sentences. The model jointly learns the selection of the appropriate word level and phrase level paraphrase pairs in the context of the original sentence from an off-the-shelf dictionary as well as the generation of fluent natural language sentences. The system retrieves a set of word level and phrase level paraphrased pairs derived from the Paraphrase Database (PPDB) for the original sentence, which are used to guide the decision on
which words might be deleted or inserted with the soft attention mechanism under the sequence-to-sequence framework.

We conduct experiments on the benchmark MSCOCO and Quora datasets for paraphrase generation. The evaluation results demonstrate that the dictionary-guided editing networks outperform existing sequence-to-sequence generation baselines and achieve state-of-the-art results.

The rest of this paper is organized as follows: Section 2 gives a brief overview of the recent history of paraphrase generation and presents a description of text editing methods. In Section 3 we show the detailed design of our dictionary-guided editing network model. In Section 4 we conduct paraphrase generation experiments on two datasets and present the evaluation results. Section 5 concludes this paper and outlines future work.

Related Work

Paraphrase generation aims to generate a semantically equivalent sentence with different expressions. Prior approaches can be categorized into knowledge-based approaches and statistical machine translation (SMT) based approaches. Knowledge-based approaches primarily rely on hand-crafted rules and dictionaries that enjoy high precision but are hard to scale up. The pioneers of this approach are Kozlowski et al. (2003), who first pair simple semantic structures with their syntactic realization and then generate paraphrases using such predicate/argument structures. A famous paraphrase generation system is designed by Hassan et al. (2007), where paraphrases are generated by word substitutions and the substitution table is obtained by leveraging several external resources, such as WordNet and the Microsoft Encarta encyclopedia. Subsequently, Madnani and Dorr (2010) propose a knowledge-driven method using hand-crafted rules or automatically learned complex paraphrase patterns (Zhao et al. 2009). SMT based paraphrase generation is proposed by Quirk, Brockett, and Dolan (2004), where an SMT model is trained on large volumes of sentence pairs extracted from clustered news articles. Zhao et al. (2008) combine multiple resources to learn phrase-based paraphrase tables and corresponding feature functions to devise a log-linear SMT model. To leverage the power of multiple machine translation engines, a multi-pivot approach is proposed by Zhao et al. (2010) to obtain plenty of paraphrase candidates. These candidates are then used by selection-based and decoding-based methods to produce high-quality paraphrases.

Recently, deep learning-based approaches have been introduced for paraphrase generation and achieved great success. Prakash et al. (2016) employ residual recurrent neural networks for paraphrase generation, which is one of the first major works that uses a deep learning model for this task. Gupta et al. (2017) propose a combination of a variational autoencoder (VAE) and a sequence-to-sequence model to generate paraphrases. We also investigate deep learning for paraphrase generation, and we are the first to utilize an editing mechanism for this task.

Finally, our work is in the spirit of prototype editing methods for natural language generation (Guu et al. 2017), which propose a generative model that first samples a prototype sentence from training data and then edits it into a new sentence. We utilize the original sentence as a prototype and learn the edit vector from a paraphrase dataset (PPDB) (Ganitkevitch, Van Durme, and Callison-Burch 2013). Li et al. (2018) introduce a simple approach for style transfer. It can be viewed as taking the content words, obtained by deleting phrases associated with the original attribute value, as a prototype, and combining them with a new phrase for the target attribute to generate the final output. Cao et al. (2018) employ existing summaries as soft templates, and rerank these soft templates by considering the current document. Finally, a summary is generated with a seq2seq framework augmented with the templates. Our work can be seen as an extension of editing methods for paraphrase generation. The stark difference is that our model is capable of leveraging an external dictionary in editing, which ensures that the expression changes do not affect the original semantics.

Methodology

Problem Definition

We assume access to a lexical or phrasal paraphrase dictionary $D = \{(o_i, p_i)\}_{i=1}^{N}$, where $o_i$ is an original word or phrase and $p_i$ is the word-level or phrase-level paraphrase of $o_i$. Besides, we have a parallel data set $P = \{(x_i, y_i)\}_{i=1}^{L}$, where $(x_i, y_i)$ is a paraphrase pair. Our goal is to learn a paraphrase generator with the use of $D$ and $P$, so as to precisely paraphrase a new sentence $x$ with $y$.

The overview of our model is shown in Figure 2. We first retrieve a set of word level or phrase level paraphrased pairs $E = \{(o_i, p_i)\}_{i=1}^{M}$, where $(o_i, p_i) \in D$, for the original sentence $x$. Secondly, we implement a neural encoder to convert each word or phrase in $E$ into a vector, which is used in the soft attention mechanism. Finally, we learn the dictionary-guided editing networks model to generate the paraphrase sentence $y$.

Retrieve

Our model relies on the observation that humans usually write paraphrase sentences by replacing words or phrases in the original sentence with their corresponding synonyms. Therefore, the first step of our method is to retrieve a set of lexical or phrasal paraphrased pairs from the dictionary for the original sentence. For example, for the original sentence $x$ “What are the best ways to overcome boredom”, we would find some paraphrased pairs such as (“overcome”, “get rid of”), (“the best ways”, “the most suitable ways”), and (“the best ways”, “the most efficient ways”).

Our system retrieves word level and phrase level paraphrased pairs derived from the Paraphrase Database (PPDB) (Pavlick et al. 2015). PPDB is an automatically extracted database containing millions of paraphrases in different languages. It contains three types of paraphrases: lexical (single word to single word), phrasal (multiword to single/multiword), and syntactic (paraphrase rules containing non-terminal symbols). We use PPDB with the lexical and phrasal types as the raw paraphrased dictionary $D$.

We leverage Lucene (https://lucene.apache.org/) to index the paraphrased dictionary
$D$ and use the default ranking function in Lucene during the search phase. Specifically, we index all $o_i$ in $D$, and retrieve the top $10 \times M$ paraphrased pairs for the original sentence $x$. Then, we rank the $10 \times M$ candidates by combining the TF-IDF weighted word overlap and the PPDB score. The ranking score is formulated as:

$$\mathrm{score}_r = \sum_{w \in o_i \cap x} \mathrm{tf}_w \cdot \mathrm{idf}_w + \mathrm{score}(o_i, p_i) \tag{1}$$

where $\mathrm{score}(o_i, p_i)$ is the PPDB score for the pair $(o_i, p_i)$, which is computed by a regression model (Pavlick et al. 2015) in PPDB. We obtain a set of word level or phrase level paraphrased pairs $E$ as the local dictionary for the original sentence $x$.

[Figure 2: Architecture of dictionary-guided editing networks. At each step of the decoder, we implement the soft attention mechanism to guide the decision for word deletion or insertion.]

Dictionary Encoder

After finding a group of paraphrased pairs $E = \{(o_i, p_i)\}_{i=1}^{M}$ for the original sentence $x$, we use a neural dictionary encoder to convert $E$ into representation vectors. In the case of single word paraphrased pairs, a good representation vector would be the word vector of $o_i$ or $p_i$. For multiple words, $o_i$ or $p_i$ is represented as the sum of the individual word vectors (Gupta et al. 2017):

$$o_i^r = \sum_{w \in o_i} \Phi(w) \tag{2}$$

$$p_i^r = \sum_{w \in p_i} \Phi(w) \tag{3}$$

where $\Phi(w)$ is the word vector for word $w$, $o_i^r$ is the representation vector of $o_i$, and $p_i^r$ is the representation vector of $p_i$.

For each paraphrased pair in $E$, we employ the same encoding method and obtain $2 \times M$ vectors $E' = \{(o_i^r, p_i^r)\}_{i=1}^{M}$. In the next section, we describe how the paraphrased dictionary is used to generate a paraphrase.

Dictionary-Guided Editing

We propose a dictionary-guided editing networks model where the paraphrased dictionary is used to guide the decision for words that might be deleted or inserted with the soft attention mechanism under the sequence-to-sequence framework. Our model takes as input the original sentence $x$ and the representation vectors $E' = \{(o_i^r, p_i^r)\}_{i=1}^{M}$.

For the original sentence $x$, we first regard the output of the BiRNN as the representation of $x$ and use the standard attention model (Luong, Pham, and Manning 2015) to capture original-side information.

For the representation vectors $E'$, we adopt the soft attention mechanism, which is introduced to better utilize paraphrased dictionary information. The soft attention mechanism is used to guide the decision for word deletion or insertion in each step of the decoder.

For the $t$-th time step of the decoder, $h_t$ denotes its hidden state. The goal is to derive a context vector $c'_t$ that captures paraphrased dictionary side information to guide the decoder. We employ a concatenation layer to combine $h_t$, $c_t$ and $c'_t$ as follows:

$$\tilde{h}_t = \tanh(W_c \cdot (h_t \oplus c_t \oplus c'_t)) \tag{4}$$

where $\oplus$ denotes concatenation and $W_c$ is a parameter. The vector $c_t$ is the standard attention context for the source side, computed as the weighted average of the original hidden states.

We then compute the context vector $c'_t$. In the paraphrased pairs $E = \{(o_i, p_i)\}_{i=1}^{M}$, $o_i$ might be the word that will be deleted and $p_i$ might be inserted. In order to better guide our decoder on which word might be deleted or inserted, we employ two soft attentions to compute the $o_i$-side and $p_i$-side context vectors respectively.

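The two dictionary-side attentions introduced above, whose exact form is given in Equations 5 to 9, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the additive scoring form follows Equations 7 and 9, while the helper names and the choice to pass one parameter set per side (`W_o`, `v_o` for the delete side and `W_p`, `v_p` for the insert side) are assumptions. The text writes the parameters as W_α and v for both sides, so a caller can pass the same values twice if they are meant to be shared.

```python
import math


def concat(u, v):
    """Vector concatenation, written as ⊕ in the equations."""
    return list(u) + list(v)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in W]


def additive_score(h_t, e, W, v):
    """score(h_t, e) = v^T tanh(W [h_t ⊕ e]), as in Equations 7 and 9."""
    hidden = [math.tanh(z) for z in matvec(W, concat(h_t, e))]
    return sum(v_k * z_k for v_k, z_k in zip(v, hidden))


def softmax(scores):
    """Normalize scores into alignment weights (Equations 6 and 8)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Weighted average of the vectors under the given weights."""
    dim = len(vectors[0])
    return [sum(a * vec[k] for a, vec in zip(weights, vectors)) for k in range(dim)]


def dictionary_context(h_t, o_vecs, p_vecs, W_o, v_o, W_p, v_p):
    """Equation 5: concatenate the o-side (delete) and p-side (insert) contexts."""
    a_del = softmax([additive_score(h_t, o, W_o, v_o) for o in o_vecs])
    a_ins = softmax([additive_score(h_t, p, W_p, v_p) for p in p_vecs])
    c_prime = concat(weighted_sum(a_del, o_vecs), weighted_sum(a_ins, p_vecs))
    return c_prime, a_del, a_ins
```

With a decoder state of size d and dictionary vectors of size k, each W has shape (attention size) × (d + k) and each v has length equal to the attention size; the returned context has length 2k.
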
Context vector $c'_t$ is computed as the weighted average of $o_i^r$ and $p_i^r$ as follows:

$$c'_t = \sum_{i=1}^{M} a_{t,i} \cdot o_i^r \oplus \sum_{i=1}^{M} a'_{t,i} \cdot p_i^r \tag{5}$$

Here $a_t$ and $a'_t$ are alignment vectors, whose size equals $M$. $a_{t,i}$ is formulated as:

$$a_{t,i} = \frac{\exp(\mathrm{score}(h_t, o_i^r))}{\sum_{j=1}^{M} \exp(\mathrm{score}(h_t, o_j^r))} \tag{6}$$

$$\mathrm{score}(h_t, o_i^r) = v^\top \tanh(W_\alpha [h_t \oplus o_i^r]) \tag{7}$$

where $W_\alpha$ and $v$ are parameters. The $p_i^r$-side alignment vector $a'_{t,i}$ is formulated as:

$$a'_{t,i} = \frac{\exp(\mathrm{score}(h_t, p_i^r))}{\sum_{j=1}^{M} \exp(\mathrm{score}(h_t, p_j^r))} \tag{8}$$

$$\mathrm{score}(h_t, p_i^r) = v^\top \tanh(W_\alpha [h_t \oplus p_i^r]) \tag{9}$$

where $W_\alpha$ and $v$ are the attention parameters.

A softmax layer is introduced to compute the probability distribution of the $t$-th word:

$$y_t = \mathrm{softmax}(W_y [y_{t-1} \oplus \tilde{h}_t \oplus c_t \oplus c'_t] + b_y) \tag{10}$$

where $W_y$ and $b_y$ are two parameters.

For the generative model, the learning goal is to maximize the probability of the actual paraphrase $y^*$. We learn our model by minimizing the negative log-likelihood (NLL):

$$J = -\log p(y^* \mid x, E') \tag{11}$$

The mini-batched Adam (Kingma and Ba 2014) algorithm is used to optimize the objective function. In order to avoid overfitting, we adopt dropout layers between different LSTM layers, following Zaremba, Sutskever, and Vinyals (2014).

Experiments

Datasets

We present the performance of our model on two benchmark datasets, namely the MSCOCO and Quora datasets.

MSCOCO (Lin et al. 2014) is a large-scale captioning dataset which contains human annotated captions of over 120K images (http://cocodataset.org/). This dataset was used previously to evaluate paraphrase generation methods (Prakash et al. 2016; Gupta et al. 2017). In the MSCOCO dataset, each image has five captions from five different annotators. Annotators describe the most obvious object or action in an image, which makes this dataset very suitable for the paraphrase generation task. This dataset comes with separate subsets for training and validation: Train 2014 contains over 82K images and Val 2014 contains over 40K images. From the five captions accompanying each image, we randomly omit one caption and use the other four as training instances to create paraphrase pairs. In order to compare our results with previous work (Prakash et al. 2016; Gupta et al. 2017), 20K instances are randomly selected from the data for testing, 10K instances for validation, and the remaining data of over 320K instances for training.

The Quora dataset is related to the problem of identifying duplicate questions (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs). It consists of over 400K potential question duplicate pairs. The non-duplicate pairs are related questions or have similar topics, which are not truly semantically equivalent, so we use the true examples of duplicate pairs as the paraphrase generation dataset. There are a total of 155K such questions. In order to compare our results with previous work (Gupta et al. 2017), we evaluate our model with 145K training instances, 5K validation instances and 4K test instances.

Evaluation Metric

To automatically evaluate the performance of paraphrase generation models, we use the well-known evaluation metrics for comparing parallel corpora: BLEU (Papineni et al. 2002) and METEOR (Lavie and Agarwal 2007). We used the evaluation software available at https://github.com/jhclark/multeval. Previous work has shown that these metrics can perform well for paraphrase detection (Madnani, Tetreault, and Chodorow 2012) and correlate well with human judgments in paraphrase generation (Wubben, Van Den Bosch, and Krahmer 2010). BLEU considers exact matching between reference paraphrases and system generated paraphrases by considering n-gram overlaps. METEOR uses stemming and synonymy in WordNet to improve and smoothen this measure. We report our p-values at 95% Confidence Intervals (CI).

Implementation Details

We leverage PPDB to build our paraphrased dictionary index, using the retrieval strategy introduced above. The Paraphrase Database (PPDB) (http://paraphrase.org) is divided into six sizes, from S up to XXXL. We build our paraphrased dictionary index using the L size PPDB. PPDB contains five types of entailment relations and we extract paraphrased pairs with the equivalent entailment relation to ensure the quality of our paraphrased dictionary.

We use NLTK (Bird and Loper 2004) to tokenize the sentences and keep words that appear more than 10 times in our vocabulary. Following the data preprocessing method in previous work (Prakash et al. 2016; Gupta et al. 2017), we reduce captions to 15 words (by removing the words beyond the first 15) for the MSCOCO dataset, and sentences whose lengths are greater than 30 words are filtered out in the Quora dataset. The max length of phrases in PPDB is set to 7 and the size $M$ of the paraphrased dictionary is 10.

We use a one-hot vector approach to represent the words in all models. The training hyper-parameters are selected based on the results of the validation set. The dimensions of word embeddings are set to 300 and hidden vectors are set to 512 in the sequence encoder and decoder.

The dimensions of the attention vector are also set to 512 and the dropout rate is set to 0.5 for regularization. The mini-batched Adam (Kingma and Ba 2014) algorithm is used to optimize the objective function. The batch size and base learning rate are set to 64 and 0.001, respectively.

Baselines

We compare our method with the following baseline methods for paraphrase generation:

Seq2Seq: We use the standard sequence-to-sequence with attention model (Bahdanau, Cho, and Bengio 2015), as implemented in OpenNMT (Klein et al. 2017). All the settings are the same as in our system.

Residual LSTM: Residual LSTM is a stacked residual LSTM network under the sequence-to-sequence framework proposed by Prakash et al. (2016). It adds residual connections between LSTM layers to help retain essential words in the generated paraphrases.

VAE-SVG: VAE-SVG is the current state-of-the-art paraphrase method on the MSCOCO dataset (Gupta et al. 2017). It combines the variational autoencoder (VAE) and the sequence-to-sequence model to generate paraphrases.

VAE-SVG-eq: It is the current state-of-the-art paraphrase method on the Quora dataset (Gupta et al. 2017). Different from the VAE-SVG model, the encoder of the original sentence is the same on both sides, i.e. the encoder side and the decoder side, in this variation.

Evaluation Results

As shown in Table 1, we compare our dictionary-guided editing networks model with several state-of-the-art methods on the MSCOCO dataset. The results demonstrate that our model consistently improves performance over other models for both greedy search and beam search. We get further improvement in both metrics through beam search, though these improvements are not as significant as for Seq2Seq. This could be because the paraphrased dictionary provides some information for generating paraphrases, which could keep our model from outputting paraphrases that have only a few common terms. For MSCOCO, the comparison between two models is significant at 95% CI if the difference in their scores is more than 0.2 in BLEU and 0.1 in METEOR.

In Table 2, we report BLEU and METEOR results for the Quora dataset. For this dataset, we compare the results of our approach with existing approaches under greedy search and beam search. The results demonstrate that our proposed model outperforms other models under greedy (non-beam) search. For greedy search, the dictionary-guided editing networks model gives a 1.3 point improvement in BLEU and a 3.7 point improvement in METEOR over the strongest baseline. For a beam size of 10, our model outperforms other models on the Quora dataset except the VAE-SVG-eq model, for which beam search gives an improvement of about 11 absolute points in the BLEU score (26.2 to 37.1). For the Quora dataset, beam search does not give such a significant improvement in our model. Comparison between two models is significant at 95% CI if the difference in their scores is more than 0.2 in BLEU and 0.1 in METEOR for the Quora dataset.

Table 1: Results on MSCOCO dataset. Higher BLEU and METEOR score is better. Scores of the methods marked with * are taken from (Gupta et al. 2017).

Model            Beam size   BLEU   METEOR
Seq2Seq          1           29.9   24.7
VAE-SVG*         1           39.2   29.2
VAE-SVG-eq*      1           37.3   28.5
Our method       1           40.3   30.1
Seq2Seq*         10          33.4   25.2
Residual LSTM*   10          37.0   27.0
VAE-SVG*         10          41.3   30.9
VAE-SVG-eq*      10          39.6   30.2
Our method       10          42.6   31.3

Table 2: Results on Quora dataset. Higher BLEU and METEOR score is better. Scores of the methods marked with * are taken from (Gupta et al. 2017).

Model            Beam size   BLEU   METEOR
Seq2Seq          1           25.9   25.8
Residual LSTM    1           26.3   26.2
VAE-SVG*         1           25.0   25.1
VAE-SVG-eq*      1           26.2   25.7
Our method       1           27.6   29.9
Seq2Seq          10          27.9   29.3
Residual LSTM    10          27.4   28.9
VAE-SVG-eq*      10          37.1   32.0
Our method       10          28.4   30.6

Discussion

In Figure 3, we show the visualization of dictionary-guided attention in the decoder. Each column in the diagram corresponds to the weights of the decoder and items in the paraphrased dictionary.

Figure 3 shows two examples, from the MSCOCO and Quora datasets respectively. Each example has five paraphrased pairs in the dictionary. The delete attention and insert attention scores are represented by gray scales and are column-wise normalized as described in Equations 6 and 8. As described, the editing attention mechanism learns soft alignment scores between the paraphrased dictionary and generated words. These scores are used to guide the decision for word deletion or insertion in the decoder.

In the first example, the generated paraphrase is “two cats are playing in a living room with a television .”. We find that the pair (“a tv”, “a television”) has larger attention scores where the decoder generates the word “television”. This demonstrates that our paraphrased dictionary has more effect on generating words which might be deleted or inserted. As we can see in the second example, the model learns alignments when the decoder generates “earn money”.

In Table 3, we show some generated paraphrase examples on the MSCOCO and Quora datasets. In this table, red denotes paraphrased dictionary pairs which might be used to
guide paraphrase generation and blue denotes phrases which are found in the paraphrased dictionary. As we can see, our model is able to replace some words or phrases in the original sentence based on the dictionary and makes necessary changes to ensure the new sentence is grammatically correct and fluent.

[Figure 3: Visualization of dictionary-guided attention in the decoder. Each column in the diagram corresponds to the weights of the decoder and items in the paraphrased dictionary.]

Table 3: Example paraphrases generated using the dictionary-guided editing networks on MSCOCO and Quora datasets.

Source:      these two cats are playing in a room that has a large tv and a laptop computer .
Reference:   a cat being lazy and a cat being nozy in a living room with tv and a laptop displaying the same things .
Generated:   two cats are playing in a living room with a television .
Dictionary:  (a tv, a television); (a large and, a great and); (playing a, to play a)

Source:      a large passenger airplane flying through the air .
Reference:   an airplane that is , either , landing or just taking off .
Generated:   a large jetliner flying through a blue sky .
Dictionary:  (the airplane, the aeroplane); (airplane, jetliner); (a large, a great)

Source:      what are ways i can make money online ?
Reference:   can i earn money online ?
Generated:   how can i earn money online easily ?
Dictionary:  (make money, earn money); (indicate in what ways, indicate how); (making money, earn money)

Source:      can you offer me any advice on how to lose weight ?
Reference:   how can i efficiently lose weight ?
Generated:   can you give me some advice on losing weight ?
Dictionary:  (offer advice, provide advice); (offer advice, give advice); (you lost weight, you 've lost weight)

Conclusion

In this paper, we present a dictionary-guided editing networks model for generating paraphrase sentences through editing the original sentence. It can effectively leverage word level and phrase level paraphrase pairs from an off-the-shelf dictionary. The system jointly learns the selection of the appropriate word level and phrase level paraphrase pairs in the context of the original sentence from the Paraphrase Database (PPDB) as well as the generation of fluent natural language sentences. Experiments on the Quora and MSCOCO datasets demonstrate that the dictionary-guided editing networks significantly improve on the existing generative models that generate paraphrases from scratch. The dictionary-guided editing networks can also be applied to other text generation tasks, such as text style transfer, where we can use word and phrase level style mapping dictionaries to facilitate sentence level style transfer.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
Bird, S., and Loper, E. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, 31. Association for Computational Linguistics.
Cao, Z.; Li, W.; Wei, F.; and Li, S. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. Association for Computational Linguistics.
Ganitkevitch, J.; Van Durme, B.; and Callison-Burch, C. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 758–764.
Gupta, A.; Agarwal, A.; Singh, P.; and Rai, P. 2017. A deep generative framework for paraphrase generation. arXiv preprint arXiv:1709.05074.
Guu, K.; Hashimoto, T. B.; Oren, Y.; and Liang, P. 2017. Generating sentences by editing prototypes. TACL.
Hassan, S.; Csomai, A.; Banea, C.; Sinha, R.; and Mihalcea, R. 2007. UNT: SubFinder: Combining knowledge sources for automatic lexical substitution. In Proceedings of the 4th International Workshop on Semantic Evaluations, 410–413. Association for Computational Linguistics.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. M. 2017.
OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.
Kozlowski, R.; McCoy, K. F.; and Vijay-Shanker, K. 2003. Generation of single-sentence paraphrases from predicate/argument structure using lexico-grammatical resources. In Proceedings of the second international workshop on Paraphrasing - Volume 16, 1–8. Association for Computational Linguistics.
Lavie, A., and Agarwal, A. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, 228–231. Association for Computational Linguistics.
Li, J.; Jia, R.; He, H.; and Liang, P. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. NAACL.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision, 740–755. Springer.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Madnani, N., and Dorr, B. J. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics 36(3):341–387.
Madnani, N.; Tetreault, J.; and Chodorow, M. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 182–190. Association for Computational Linguistics.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
Pavlick, E.; Rastogi, P.; Ganitkevitch, J.; Van Durme, B.; and Callison-Burch, C. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, 425–430.
Prakash, A.; Hasan, S. A.; Lee, K.; Datla, V.; Qadir, A.; Liu, J.; and Farri, O. 2016. Neural paraphrase generation with stacked residual LSTM networks. arXiv preprint arXiv:1610.03098.
Quirk, C.; Brockett, C.; and Dolan, B. 2004. Monolingual machine translation for paraphrase generation.
Wubben, S.; Van Den Bosch, A.; and Krahmer, E. 2010. Paraphrase generation as monolingual translation: Data and evaluation. In Proceedings of the 6th International Natural Language Generation Conference, 203–207. Association for Computational Linguistics.
Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Zhao, S.; Niu, C.; Zhou, M.; Liu, T.; and Li, S. 2008. Combining multiple resources to improve SMT-based paraphrasing model. Proceedings of ACL-08: HLT 1021–1029.
Zhao, S.; Lan, X.; Liu, T.; and Li, S. 2009. Application-driven statistical paraphrase generation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, 834–842. Association for Computational Linguistics.
Zhao, S.; Wang, H.; Lan, X.; and Liu, T. 2010. Leveraging multiple MT engines for paraphrase generation. In Proceedings of the 23rd International Conference on Computational Linguistics, 1326–1334. Association for Computational Linguistics.