Response Generation by Context-aware Prototype Editing

Open domain response generation has achieved remarkable progress in recent years, but sometimes yields short and uninformative responses. We propose a new paradigm for response generation, namely response generation by editing, which significantly increases the diversity and informativeness of the generation results. Our assumption is that a plausible response can be generated by slightly revising an existing response prototype. The prototype is retrieved from a pre-defined index and provides a good starting point for generation because it is grammatical and informative. We design a response editing model, where an edit vector is formed by considering differences between a prototype context and a current context, and then the edit vector is fed to a decoder to revise the prototype response for the current context. Experimental results on a large scale dataset demonstrate that the response editing model outperforms generative and retrieval-based models on various aspects.

Yu Wu†, Furu Wei‡, Shaohan Huang‡, Zhoujun Li†, Ming Zhou‡
† State Key Lab of Software Development Environment, Beihang University, Beijing, China
‡ Microsoft Research, Beijing, China
{wuyu,lizj} {fuwei, shaohanh, mingzhou}
arXiv:1806.07042v2 [cs.CL] 5 Jul 2018

1 Introduction

In recent years, non-task oriented chatbots, focused on responding to humans intelligently on a variety of topics, have drawn much attention from both academia and industry. Existing approaches can be categorized into generation-based methods (Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016; Sordoni et al., 2015; Xing et al., 2017; Serban et al., 2017; Xing et al., 2018), which generate a response from scratch, and retrieval-based methods (Hu et al., 2014; Lowe et al., 2015; Yan et al., 2016; Zhou et al., 2016; Wu et al., 2017), which select a response from an existing corpus. Since retrieval-based approaches are severely constrained by a pre-defined index, generative approaches have become popular in recent years. Traditional generation-based approaches, however, do not easily generate long, diverse and informative responses, which is referred to as the "safe response" problem (Li et al., 2016a).

To address this issue, we propose a new paradigm, prototype-then-edit, for response generation. Our motivations are threefold: 1) We observe that most responses can be represented as transformations of other responses, so it is unnecessary to generate a response from scratch. 2) Human-written responses, called "prototypes" in this paper, are informative, diverse and grammatical, and do not suffer from being short and generic. Hence, generating responses by editing such prototypes can avoid the "safe response" problem. 3) In order to adapt to the current context, a slight revision should be made to the prototype, in which irrelevant words are removed and appropriate words are inserted.

Table 1: An example of context-aware prototype editing. Underlined words mean they do not appear in the original context, while words with strikethrough mean they are not in the prototype context. Words in bold represent they are modified in the revised response.

Context:             My friends and I went to some vegan place for dessert yesterday.
Prototype context:   My friends and I had Tofu and vegetables at a vegan place nearby yesterday.
Prototype response:  Raw green vegetables are very beneficial for your health.
Revised response:    Desserts are very bad for your health.

Inspired by this idea, we formulate the response generation process as follows. Given a conversational context \(c\), we first retrieve a similar context \(c'\) and its associated response \(r'\) from a pre-defined index, and then form an edit vector by analyzing the differences between \(c\) and \(c'\). After that, we revise the prototype conditioned on the edit vector. We further illustrate how our idea works with an example in Table 1. The major difference between \(c\) and \(c'\) is what the speaker eats, so the phrase "raw green vegetables" in \(r'\) should be replaced by "desserts" in order to reply to the current context \(c\). We hope that the decoder language model can remember the collocation of "desserts" and "bad for health", so as to replace "beneficial" with "bad" in the revised response. This paradigm can be seen as an ensemble of a retrieval-based approach and a generation-based approach, which not only inherits the fluency and informativeness of retrieval results, but also enjoys the flexibility of generation results.

In practice, there is a problem that must be solved, which is how to revise the prototype in a conditional setting¹. Prior work (Guu et al., 2017) has figured out how to edit a prototype in an unconditional setting, which cannot be adopted for the response generation task directly. Our idea is that differences between responses strongly correlate with differences in their contexts, meaning that if a word in the prototype context is changed, its related words in the response are probably modified in the editing. We realize this idea by designing a context-aware editing model that is built upon an encoder-decoder model augmented with an edit vector. The edit vector is computed as the weighted sum of insertion word embeddings (words in the current context but not in the prototype context) and deletion word embeddings (words in the prototype context but not in the current context). Larger weights mean that the editing model should pay more attention to the corresponding words in revision. For instance, we wish "dessert", "Tofu" and "vegetables" to get larger weights than "and" in Table 1. The encoder learns the prototype representation with a gated recurrent unit (GRU), and feeds the representation to a decoder together with the edit vector. The decoder is a GRU language model that takes the concatenation of the last-step word embedding and the edit vector as input, and predicts the next word with an attention mechanism.

¹ In the conditional setting, the editor must consider the context (dialogue history) in addition to the response itself during revision. In the unconditional setting, the editor considers only the sentence itself.

Our experiments are conducted on a large scale Chinese conversation corpus comprised of 10 million context-response pairs. Given a response, we find its neighbors based on lexical similarity to build the prototype-target parallel data for training. At test time, we retrieve a similar context and adapt its response to the current context by editing. The experiments show that our method outperforms traditional generative models on relevance, diversity and originality. We further find that the revised response achieves better relevance than its prototype, demonstrating that the editing process is not trivial.

Our contributions are listed as follows: 1) this is the first work to propose a prototype-then-edit paradigm for response generation, which is significantly different from approaches that generate responses from scratch; 2) we elaborate a simple but effective context-aware editing model for response generation; 3) we empirically verify the effectiveness of our method.

2 Related Work

Research on chatbots goes back to the 1960s when ELIZA was designed (Weizenbaum, 1966), which uses a huge amount of hand-crafted templates and rules. Recently, researchers have paid more and more attention to data-driven approaches (Ritter et al., 2011; Ji et al., 2014) due to their superior scalability. Most of these methods are classified as retrieval-based methods (Ji et al., 2014; Yan et al., 2016) and generation methods (Li et al., 2016b, 2017; Mou et al., 2016; Zhou et al., 2017). The former aim to select a relevant response using a matching model, while the latter generate a response with natural language generative models.

Prior work on retrieval-based methods mainly focuses on the matching model architecture for single turn conversation (Hu et al., 2014) and multi-turn conversation (Lowe et al., 2015; Zhou et al., 2016; Wu et al., 2017). Among studies of generative methods, a large body of work aims to mitigate the "safe response" issue from different perspectives. Most work builds models under a sequence to sequence framework (Sutskever et al., 2014), and introduces other elements, such as latent variables (Serban et al., 2017), topic information (Xing et al., 2017), dynamic vocabulary (Wu et al., 2018) and other contents (Yao et al., 2017; Mou et al., 2016) to increase response diversity. Furthermore, the reranking technique (Li et al., 2016a), reinforcement learning technique (Li et al., 2016b) and adversarial learning technique (Li et al., 2017; Xu et al., 2017) have also been applied to response generation. Apart from work on "safe response", there is a growing body of literature on automatic response evaluation (Tao et al., 2018; Lowe et al., 2017), style transfer (Fu et al., 2018; Wang et al., 2017) and emotional response generation (Zhou et al., 2017). Retrieval results have been used for dialogue generation in (Song et al., 2016), where the decoder takes representations of retrieval results and the original query as inputs. Compared to our work, it neither edits a prototype nor utilizes the difference between two contexts. In general, previous work generates a response from scratch either left-to-right or conditioned on a latent vector, whereas our approach is the first to generate a response by editing a prototype, opening a new direction for response generation.

Recently, some studies have explored natural language generation by editing (Guu et al., 2017; Su et al., 2018; Liao et al., 2018). A typical approach follows a writing-then-edit paradigm, which utilizes one decoder to generate a draft from scratch and uses another decoder to revise the draft (Xia et al., 2017). The other approach follows a retrieval-then-edit paradigm, which uses a Seq2Seq model to edit a prototype retrieved from a corpus (Guu et al., 2017; Li et al., 2018; Cao et al., 2018). To the best of our knowledge, we are the first to explicitly leverage context differences to edit prototypes, and this approach can be applied to other Seq2Seq tasks, such as summarization, paraphrase generation and machine translation.

3 Background

Before introducing our approach, we first briefly describe the state-of-the-art natural language editing method (Guu et al., 2017). Given a sentence pair \((x, x')\), our goal is to obtain sentence \(x\) by editing the prototype \(x'\). The general framework is built upon a Seq2Seq model with an attention mechanism, which takes \(x'\) and \(x\) as source sequence and target sequence respectively. The main difference is that the generative probability of a vanilla Seq2Seq model is \(p(x \mid x')\) whereas the probability of the edit model is \(p(x \mid x', z)\), where \(z\) is an edit vector sampled from a pre-defined distribution as in a variational auto-encoder. In the training phase, the parameter of the distribution is conditional on the context differences. We first define \(I = \{w \mid w \in x \wedge w \notin x'\}\) as an insertion word set, where \(w\) is a word added to the prototype, and \(D = \{w' \mid w' \in x' \wedge w' \notin x\}\) as a deletion word set, where \(w'\) is a word deleted from the prototype. Subsequently, we compute an insertion vector \(i = \sum_{w \in I} \Psi(w)\) and a deletion vector \(d = \sum_{w' \in D} \Psi(w')\) by a summation over word embeddings in the two corresponding sets, where \(\Psi(\cdot)\) maps a word to its embedding. Then, the edit vector \(z\) is sampled from a distribution whose parameters are governed by the concatenation of \(i\) and \(d\). Finally, the edit vector and the output of the encoder are fed to the decoder to generate \(x\).

For response generation, which is a conditional setting of text editing, an interesting question arises: how to generate the edit by considering contexts. We introduce our motivation and model in detail in the next section.

4 Approach

4.1 Model Overview

Suppose that we have a data set \(\mathcal{D} = \{(C_i, R_i)\}_{i=1}^{N}\). \(\forall i\), \((C_i, R_i)\) comprises a context \(C_i = (c_{i,1}, \ldots, c_{i,l})\) and its response \(R_i = (r_{i,1}, \ldots, r_{i,l})\), where \(c_{i,j}\) is the j-th word of the context \(C_i\) and \(r_{i,j}\) is the j-th word of the response \(R_i\). It should be noted that \(C_i\) can be either a single turn input or a multiple turn input. As the first step, we assume \(C_i\) is a single turn input in this work, and leave the verification of the same technology for multi-turn response generation to future work. Our full model is shown in Figure 1, consisting of a prototype selector \(S\) and a context-aware neural editor \(E\). Given a new conversational context \(C\), we first use \(S\) to retrieve a context-response pair \((C_i, R_i) \in \mathcal{D}\). Then, the editor \(E\) calculates an edit vector \(z_i = f(C_i, C)\) to encode the information about the differences between \(C_i\) and \(C\). Finally, we generate a response according to the probability \(p(R \mid z_i, R_i)\). In the following, we elaborate how to design the selector \(S\) and the editor \(E\).

4.2 Prototype Selector

A good prototype selector \(S\) plays an important role in the prototype-then-edit paradigm. We use
different strategies to select prototypes for training and testing. In testing, as we described above, we retrieve a context-response pair \((C', R')\) from a pre-defined index for context \(C\) according to the similarity of \(C\) and \(C'\). Here, we employ Lucene to construct the index and use its built-in algorithm to compute the context similarity.

Figure 1: Architecture of our model, consisting of (a) the prototype selector and (b) the neural editor.

Now we turn to the training phase. \(\forall i\), \((C_i, R_i)\), our goal is to maximize the generative probability of \(R_i\) by selecting a prototype \((C'_i, R'_i) \in \mathcal{D}\). As we already know the ground-truth response \(R_i\), we first retrieve thirty prototypes \(\{(C'_{i,j}, R'_{i,j})\}_{j=1}^{30}\) based on response similarity instead of context similarity, and then keep the prototypes whose Jaccard similarity to \(R_i\) is in the range \([0.3, 0.7]\). Here, we use Lucene to index all responses, and retrieve the top 30 similar responses along with their corresponding contexts for \(R_i\). The Jaccard similarity measures text similarity from a bag-of-words view, and is formulated as

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \tag{1} \]

where \(A\) and \(B\) are two bags of words and \(|\cdot|\) denotes the number of elements in a collection. Each context-response pair is processed with the above procedure, so we obtain an enormous number of quadruples \(\{(C_i, R_i, C'_{i,j}, R'_{i,j})_{j=0}^{M_i}\}_{i=1}^{N}\) after this step. The motivation behind filtering out instances with Jaccard similarity < 0.3 is that a neural editor model performs well only if a prototype is lexically similar (Guu et al., 2017) to its ground-truth. Besides, we hope the editor does not copy the prototype, so we discard instances where the prototype and ground-truth are nearly identical (i.e. Jaccard similarity > 0.7). We do not use context similarity to construct parallel data for training, because similar contexts may correspond to totally different responses, the so-called one-to-many phenomenon in dialogue generation, which impedes editor training due to the large lexical gap. According to our preliminary experiments, the editor always generates nonsensical responses if the training data is constructed by context similarity.

4.3 Context-Aware Neural Editor

A context-aware neural editor aims to revise a prototype to adapt it to the current context. Formally, given a quadruple \((C, R, C', R')\) (we omit subscripts for simplicity), a context-aware neural editor first forms an edit vector \(z\) using \(C\) and \(C'\), and then updates the parameters of the generative model by maximizing the probability \(p(R \mid z, R')\). For testing, we directly generate a response after obtaining the edit vector. In the following, we introduce how to obtain the edit vector and learn the generative model in detail.

4.3.1 Edit Vector Generation

For an unconditional sentence editing setting (Guu et al., 2017), an edit vector is randomly sampled from a distribution because how to edit the sentence is not constrained. In contrast, we should take both \(C\) and \(C'\) into consideration when we revise a prototype response \(R'\). Formally, \(R'\) is first transformed to hidden vectors \(\{h_k \mid h_k = \overrightarrow{h}_k \oplus \overleftarrow{h}_k\}_{k=1}^{n_j}\) through a biGRU parameterized as Equation (2),

\[ \overrightarrow{h}_j = f_{GRU}(\overrightarrow{h}_{j-1}, r_j); \quad \overleftarrow{h}_j = f_{GRU}(\overleftarrow{h}_{j+1}, r_j), \tag{2} \]

where \(r_j\) is the j-th word of \(R'\).

Then we compute a context diff-vector \(\mathit{diff}_c\) by an attention mechanism defined as follows

\[ \mathit{diff}_c = \sum_{w \in I} \beta_w \Psi(w) \oplus \sum_{w' \in D} \gamma_{w'} \Psi(w'), \tag{3} \]

where \(\oplus\) is a concatenation operation, \(I = \{w \mid w \in C \wedge w \notin C'\}\) is an insertion word set, and \(D = \{w' \mid w' \in C' \wedge w' \notin C\}\) is a deletion word set. \(\mathit{diff}_c\) explicitly encodes insertion words and deletion words from \(C\) to \(C'\). \(\beta_w\) is the weight of an insertion word \(w\), which is computed by

\[ \beta_w = \frac{\exp(e_w)}{\sum_{w \in I} \exp(e_w)}, \tag{4} \]

\[ e_w = v_\beta^{\top} \tanh(W_\beta [\Psi(w) \oplus h_l]), \tag{5} \]

where \(v_\beta\) and \(W_\beta\) are parameters, and \(h_l\) is the last hidden state of the encoder. \(\gamma_{w'}\) is obtained with a similar process:

\[ \gamma_{w'} = \frac{\exp(e_{w'})}{\sum_{w' \in D} \exp(e_{w'})}, \tag{6} \]

\[ e_{w'} = v_\gamma^{\top} \tanh(W_\gamma [\Psi(w') \oplus h_l]). \tag{7} \]

We assume that different words influence the editing process unequally, so we take a weighted average of insertion words and deletion words to form an edit in Equation 3. Table 1 illustrates our motivation as well: "desserts" is much more important than "the" in the editing process. Then we compute the edit vector \(z\) by the following transformation

\[ z = \tanh(W \cdot \mathit{diff}_c + b), \tag{8} \]

where \(W\) and \(b\) are two parameters. Equation 8 can be regarded as a mapping from context differences to response differences.

It should be noted that there are several alternative approaches to compute \(\mathit{diff}_c\) and \(z\) for this task, such as applying memory networks, latent variables, and other complex network architectures. Here, we just use a simple method, but it yields interesting results on this task. We further illustrate our experimental findings in the next section.

4.3.2 Prototype Editing

We build our prototype editing model upon a Seq2Seq model with an attention mechanism, which integrates the edit vector into the decoder. The decoder takes \(\{h_k\}_{k=1}^{n_j}\) as an input and generates a response by a GRU language model with attention. The hidden state of the decoder is acquired by

\[ h'_j = f_{GRU}(h'_{j-1}, r_{j-1} \oplus z_i), \tag{9} \]

where the input of the j-th time step is the last-step hidden state and the concatenation of the (j−1)-th word embedding and the edit vector obtained in Equation 8. Then we compute a context vector \(c_i\), which is a linear combination of \(\{h_1, \ldots, h_t\}\):

\[ c_i = \sum_{j=1}^{t} \alpha_{i,j} h_j, \tag{10} \]

where \(\alpha_{i,j}\) is given by

\[ \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{t} \exp(e_{i,k})}, \tag{11} \]

\[ e_{i,j} = v^{\top} \tanh(W_\alpha [h_j \oplus h'_i]), \tag{12} \]

where \(v\) and \(W_\alpha\) are parameters. The generative probability distribution is given by

\[ s(r_i) = \mathrm{softmax}(W_p [r_{i-1} \oplus h'_{i-1} \oplus c_i] + b_p), \tag{13} \]

where \(W_p\) and \(b_p\) are two parameters. Equations 11 and 13 are the attention mechanism (Bahdanau et al., 2015), which mitigates the long-term dependency issue of the original Seq2Seq model. We append the edit vector to every input embedding of the decoder in Equation 9, so the edit information can be utilized in the entire generation process.

Since our model is differentiable, we learn our response generation model by minimizing the negative log likelihood of \(\mathcal{D}\)

\[ L = - \sum_{i=1}^{N} \sum_{j=1}^{l} \log p(r_{i,j} \mid z_i, R'_i, r_{i,k<j}). \tag{14} \]

We implement our model in PyTorch. We employ the Adam algorithm (Kingma and Ba, 2015) to optimize the objective function with a batch size of 128. We set the initial learning rate to 0.001 and reduce it by half if the perplexity on validation begins to increase. We stop training if the perplexity on validation keeps increasing in two successive epochs. The word embedding size and edit vector size are 512, and both the encoder and decoder are 1-layer GRUs whose hidden vector size is 1024. The message and response vocabulary sizes are 30000, and words not covered by the vocabulary are represented by a placeholder $UNK$.
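The data flow of Sections 4.2 and 4.3.1 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy embedding table `embed`, the toy parameters `W` and `b`, and the callables `score_ins` and `score_del` (stand-ins for the learned scores of Equations 5 and 7) are assumptions introduced here for illustration, and the GRU encoder and decoder are omitted.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two token sequences viewed as word sets (Equation 1)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keep_prototype(response, prototype, low=0.3, high=0.7):
    """Training-time filter of Section 4.2: keep prototypes with similarity in [low, high]."""
    return low <= jaccard(response, prototype) <= high


def edit_word_sets(context, proto_context):
    """Insertion set I (in C but not in C') and deletion set D (in C' but not in C)."""
    c, cp = set(context), set(proto_context)
    return sorted(c - cp), sorted(cp - c)


def softmax(scores):
    """Numerically stable softmax over a list of floats (Equations 4 and 6)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attended_sum(words, embed, score):
    """Attention-weighted sum of word embeddings; `score` is a hypothetical stand-in
    for the learned score e_w, which in the paper also depends on the encoder state."""
    dim = len(next(iter(embed.values())))
    if not words:
        return [0.0] * dim
    weights = softmax([score(w) for w in words])
    return [sum(wt * embed[w][k] for wt, w in zip(weights, words)) for k in range(dim)]


def edit_vector(context, proto_context, embed, score_ins, score_del, W, b):
    """z = tanh(W * diff_c + b) (Equation 8), with diff_c as in Equation 3."""
    ins, dele = edit_word_sets(context, proto_context)
    # List concatenation plays the role of the concatenation operator in Equation 3.
    diff_c = attended_sum(ins, embed, score_ins) + attended_sum(dele, embed, score_del)
    return [math.tanh(sum(w_rk * x for w_rk, x in zip(row, diff_c)) + b_r)
            for row, b_r in zip(W, b)]


# Toy run on the Table 1 example (lower-cased, whitespace-tokenized).
context = "my friends and i went to some vegan place for dessert yesterday".split()
proto_context = "my friends and i had tofu and vegetables at a vegan place nearby yesterday".split()
revised = "desserts are very bad for your health".split()
prototype = "raw green vegetables are very beneficial for your health".split()

insertions, deletions = edit_word_sets(context, proto_context)
print(insertions, deletions)
print(jaccard(revised, prototype), keep_prototype(revised, prototype))
```

On the Table 1 example, the insertion set contains "dessert" and the deletion set contains "tofu" and "vegetables". The two responses share 5 of 11 distinct words, so their Jaccard similarity is 5/11 (about 0.45), which falls inside the retained range [0.3, 0.7]. A constant score function gives uniform attention weights; the learned scores are what would let content words outweigh function words.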

5 Experiment

5.1 Experiment setting

Our editing model requires a large corpus to find lexically similar prototypes. We crawl over 20 million human-human context-response pairs (context only contains 1 turn) from Douban Group, which is a popular forum in China. After removing duplicated pairs and utterances longer than 30 words, we split the data into 19,623,374 pairs for training, 10,000 pairs for validation and 10,000 pairs for testing. The average lengths of contexts and responses are 11.64 and 12.33 respectively. We index pairs in the training dataset for prototype retrieval, and our retrieval strategy is described in Section 4.2. We finally obtain 42,690,275 \((C_i, R_i, C'_i, R'_i)\) quadruples. For a fair comparison, we randomly sample 19,623,374 quadruples for training.

5.2 Baselines

S2SA: We adopt Seq2Seq with attention (Bahdanau et al., 2015) as a baseline model. We use a PyTorch implementation, OpenNMT (Klein et al., 2017), in the experiment.

S2SA-MMI: We employ the bidirectional-MMI decoder as in (Li et al., 2016a). The hyperparameter λ is set to 0.5 according to the paper's suggestion. 200 candidates are sampled from beam search for reranking.

CVAE: The conditional variational auto-encoder is a popular method of increasing the diversity of response generation (Zhao et al., 2017). We use the published NeuralDialog-CVAE code, and make small adaptations for our single turn scenario.

Retrieval: We compare our model with retrieval-based methods to show the effect of editing. Here, we use two retrieval strategies. In the first one, denoted as Retrieval-default, we directly use the top-1 result given by Lucene, which ranks candidates by context similarity. In the second one, we first retrieve 20 responses, and then employ a dual-LSTM model (Lowe et al., 2015) to compute the matching degree between the current context and the response candidates. The matching model is implemented with the same setting as in (Lowe et al., 2015), and is trained on the training data set where negative instances are randomly sampled with a ratio of pos : neg = 1 : 9. We call this baseline Retrieval-Rerank.

Correspondingly, we design three variants of our model: Edit-default, Edit-1-Rerank and Edit-N-Rerank. Edit-default and Edit-1-Rerank edit the top-1 response yielded by Retrieval-default and Retrieval-Rerank respectively. Edit-N-Rerank edits all 20 responses returned by Lucene and then reranks the revised results with the dual-LSTM model.

The word embedding size, hidden vector size and attention vector size of the baselines and our models are the same. All generative models use beam search to yield responses, where the beam size is 20 except for S2SA-MMI. For all models, we remove $UNK$ from the target vocabulary, because it always causes fluency issues in evaluation.

5.3 Evaluation Metrics

How to evaluate response generation systematically is still an open problem. We evaluate our model on four criteria: fluency, relevance, diversity and originality. We employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) (Liu et al., 2016) to evaluate response relevance, which are better correlated with human judgment than BLEU. Following (Li et al., 2016a), we evaluate the response diversity based on the ratios of distinct unigrams and bigrams in generated responses, denoted as Distinct-1 and Distinct-2. In this paper, we define a new metric, originality, as the ratio of generated responses that do not appear in the training set. Here, "appear" means we can find exactly the same response in our training data set. We randomly select 1,000 contexts from the test set, and ask three native speakers to annotate response fluency. We conduct 3-scale rating: +2, +1 and 0. +2: The response is fluent and grammatically correct. +1: There are a few grammatical errors in the response but readers could understand it. 0: The response is totally grammatically broken, making it difficult to understand.

5.4 Evaluation Results

Table 2 shows the evaluation results on the Chinese dataset. Our methods are better than retrieval-based methods on embedding based metrics, which means revised responses are more relevant to the ground-truth in the semantic space. Our model just slightly revises the prototype response, so
improvements on automatic metrics are not that large but are statistically significant (t-test, p-value < 0.01). Two factors cause Edit-1-Rerank to be worse than Retrieval-Rerank. 1) The rerank algorithm is biased toward long responses, which poses a challenge for the editing model. 2) Despite better prototype responses, the context of the top-1 response is always greatly different from the current context, leading to a large insertion word set and a large deletion word set, which also obstructs the revision process. In terms of diversity, our methods drop on Distinct-1 and Distinct-2 in comparison with retrieval-based methods, because the editing model often deletes special words in pursuit of better relevance. Retrieval-Rerank is better than Retrieval-default, indicating that it is necessary to rerank responses by measuring context-response similarity with a matching model.

Table 2: Automatic evaluation results. Numbers in bold mean that improvement from the model on that metric is statistically significant over the baseline methods (t-test, p-value < 0.01).

                    Relevance                  Diversity               Originality  Fluency
Model               Average  Extrema  Greedy   Distinct-1  Distinct-2  Not appear   Avg. Score
S2SA                0.346    0.180    0.350    0.032       0.087       0.208        1.90
S2SA-MMI            0.379    0.189    0.385    0.039       0.127       0.297        1.86
CVAE                0.360    0.183    0.363    0.062       0.178       0.745        1.71
Retrieval-default   0.288    0.130    0.309    0.098       0.549       0.000        1.95
Edit-default        0.297    0.150    0.327    0.071       0.300       0.796        1.78
Retrieval-Rerank    0.380    0.191    0.381    0.067       0.460       0.000        1.96
Edit-1-Rerank       0.367    0.185    0.371    0.077       0.296       0.794        1.79
Edit-N-Rerank       0.386    0.203    0.389    0.068       0.280       0.860        1.78

Our methods significantly outperform generative baselines in terms of diversity since response prototypes are diverse and informative. This demonstrates that the prototype-then-edit paradigm is capable of addressing the safe response problem. Edit-N-Rerank is better than generative baselines on relevance but Edit-default is not, indicating that a good prototype selector is quite important to our editing model. In terms of originality, about 86% of revised responses do not appear in the training set, which surpasses S2SA, S2SA-MMI and CVAE. This is mainly because baseline methods are more likely to generate safe responses that frequently appear in the training data, while our model tends to modify an existing response, which avoids the duplication issue. In terms of fluency, retrieval-based approaches achieve the best results, and S2SA comes second. Safe responses enjoy high fluency scores, which is why S2SA and S2SA-MMI perform well on this metric. The edit model obtains an average score of 1.79. That is an acceptable fluency score for a dialogue engine, and most generated responses are grammatically correct.

5.5 Discussions

5.5.1 Editing Type Analysis

It is interesting to explore the semantic gap between the prototype and the revised response. We ask annotators to conduct 4-scale rating on 500 randomly sampled prototype-response pairs given by Edit-default and Edit-N-Rerank respectively. The 4-scale is defined as: identical, paraphrase, on the same topic and unrelated.

Figure 2: Editing Ratio. The difference between paraphrase and on the same topic is that a response in the "on the same topic" category introduces or deletes contents relative to its prototype response.

Figure 2 provides the ratio of the four editing types defined above. For both methods, only 2% of edits are exactly the same as the prototype, which means our model does not degrade to a copy model. Surprisingly, 30% of revised responses are unrelated to their prototypes. The key factor for this phenomenon is that the neural
editor will rewrite the prototype when it is hard to insert the insertion words into the prototype. The ratio of “on the same topic” responses given by Edit-N-rerank is larger than that of Edit-default, suggesting that “on the same topic” responses might be more relevant from the view of an LSTM-based reranker.

5.5.2 Case Study

We give three examples in Table 3 to show how our model works. The first case illustrates the effect of word insertion. Our editing model enriches a short response by inserting words from the context, which makes the conversation informative and coherent. The second case gives an example of word deletion, where the phrase “braised pork rice” is removed because it does not fit the current context. The phrase “braised pork rice” appears only in the prototype context and not in the current context, so it is in the deletion word set D, which keeps the decoder from generating it. In the third case, our model forms a relevant response by deleting some words from the prototype while inserting other words into it. The current context is talking about “clean tattoo”, but the prototype discusses “clean hair”, leading to an irrelevant response. After the word substitution, the revised response becomes appropriate for the current context.

According to our observation, word deletion and replacement operations have a positive effect on relevance. Specifically, the word deletion operation is able to remove words that do not fit the current context, which makes a response coherent with the current context. Similarly, the word replacement operation adds context-related words to the prototype while removing inappropriate ones.

Table 3: Case Study. We show examples yielded by Edit-default and Edit-Rerank. Chinese utterances are translated to English here.

Insertion case
  Context:             身 在 国外 寂寞 无聊 就 化妆
                       When I feel bored and lonely abroad, I like makeup.
  Prototype context:   无聊 就 玩
                       If I feel bored, I will play something.
  Prototype response:  嗯
                       OK.
  Revised response:    嗯 我 也 喜欢 化妆
                       OK. I love makeup too.

Deletion case
  Context:             我 比较 常 吃 辛拉面
                       I often eat spicy noodles.
  Prototype context:   我 在 台湾 时候 常 吃 辛拉面 和 卤肉 饭。
                       When I lived in Taiwan, I often ate spicy noodles and braised pork rice.
  Prototype response:  我 也 喜欢 卤肉饭。
                       I love braised pork rice as well.
  Revised response:    我 也 喜欢。
                       I love it as well. (In Chinese, the model just deletes the phrase “braised pork rice” without adding any word.)

Replacement case
  Context:             纹身 有没有 办法 全部 弄 干净
                       Is there any way to get all tattoos clean?
  Prototype context:   把 药 抹 头发 上 能 把 头发 弄 干净 么
                       Is it possible to clean your hair by wiping medicine on your hair?
  Prototype response:  抹 完 真的 头发 干净 很多
                       After wiping it, hair gets much cleaner.
  Revised response:    抹完 纹身 就 会 掉 很多
                       After wiping it, most of the tattoos will be cleaned.

6 Conclusion and Future Work

We present a new paradigm, prototype-then-edit, for open domain response generation, which enables a generation-based chatbot to leverage retrieved results. We propose a simple but effective model to edit context-aware responses by taking context differences into consideration. Experiment results on a large-scale dataset show that our model improves the prototype relevance, and outperforms other generative baselines in terms of both relevance and diversity. In the future, we will extend our model to multi-turn response generation, which considers complex relationships among context utterances. We will also investigate how to learn the prototype selector and the neural editor in an end-to-end manner.
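To make the deletion case in Table 3 concrete, the following is a minimal sketch of how insertion and deletion word sets could be derived from the difference between the two contexts. It assumes plain set difference over whitespace-separated tokens: words only in the current context are insertion words, and words only in the prototype context form the deletion word set D. The function name `edit_word_sets` is hypothetical and is not taken from the paper, the tokens are copied from Table 3 with the final punctuation mark dropped, and the sketch does not cover how the edit vector itself is computed from these sets.

```python
def edit_word_sets(current_context, prototype_context):
    """Return (insertion_words, deletion_words) for two tokenized contexts.

    Assumption (not confirmed by the paper's code): both sets are plain
    set differences, kept in order of first appearance.
    """
    current_vocab = set(current_context)
    prototype_vocab = set(prototype_context)
    # Words in the current context that the prototype context lacks.
    insertion = [w for w in dict.fromkeys(current_context)
                 if w not in prototype_vocab]
    # Words in the prototype context that the current context lacks.
    deletion = [w for w in dict.fromkeys(prototype_context)
                if w not in current_vocab]
    return insertion, deletion


# Deletion case from Table 3 (final punctuation dropped for illustration).
current = "我 比较 常 吃 辛拉面".split()
prototype = "我 在 台湾 时候 常 吃 辛拉面 和 卤肉 饭".split()

insertion_words, deletion_words = edit_word_sets(current, prototype)
print("insertion:", insertion_words)
print("deletion:", deletion_words)
```

Under these assumptions the tokens for “braised pork rice” (卤肉, 饭) fall into the deletion set, which matches the explanation in the case study of why the decoder does not generate the phrase.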
