Close to Human Quality TTS with Transformer

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty in modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by a self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).

Naihan Li∗1,4, Shujie Liu2, Yanqing Liu3, Sheng Zhao3, Ming Liu1,4, Ming Zhou2
1 University of Electronic Science and Technology of China
2 Microsoft Research Asia
3 Microsoft STC Asia
4 CETC Big Data Research Institute Co., Ltd, Guizhou, China
{shujliu,yanqliu,szhao,mingzhou}
arXiv:1809.08895v2 [cs.CL] 13 Nov 2018
∗ Work done during internship at Microsoft STC Asia.
Copyright © 2019, Association for the Advancement of Artificial Intelligence. All rights reserved.

1 Introduction

Text to speech (TTS) is a very important task for user interaction, aiming to synthesize intelligible and natural audio that is indistinguishable from human recordings. Traditional TTS systems have two components: a front-end and a back-end. The front-end is responsible for text analysis and linguistic feature extraction, such as word segmentation, part-of-speech tagging, multi-word disambiguation and prosodic structure prediction; the back-end performs speech synthesis based on the linguistic features from the front-end, and covers speech acoustic parameter modeling, prosody modeling and speech generation. In the past decades, concatenative and parametric speech synthesis systems were the mainstream techniques. However, both of them have complex pipelines, and defining good linguistic features is often time-consuming and language specific, which requires a lot of resources and manpower. Besides, synthesized audio often has glitches or instability in prosody and pronunciation compared to human speech, and thus sounds unnatural.

Recently, with the rapid development of neural networks, end-to-end generative text-to-speech models, such as Tacotron (Wang et al. 2017) and Tacotron2 (Shen et al. 2017), have been proposed to simplify the traditional speech synthesis pipeline by replacing the production of these linguistic and acoustic features with a single neural network. Tacotron and Tacotron2 first generate mel spectrograms directly from text, then synthesize the audio with a vocoder such as the Griffin Lim algorithm (Griffin and Lim 1984) or WaveNet (Van Den Oord et al. 2016). With the end-to-end neural network, the quality of synthesized audio is greatly improved and is even comparable with human recordings on some datasets.

The end-to-end neural TTS models contain two components, an encoder and a decoder. Given the input sequence (of words or phonemes), the encoder maps it into a semantic space and generates a sequence of encoder hidden states, and the decoder, taking these hidden states as context information through an attention mechanism, constructs the decoder hidden states and then outputs the mel frames. For both encoder and decoder, recurrent neural networks (RNNs) are usually used, such as LSTM (Hochreiter and Schmidhuber 1997) and GRU (Cho et al. 2014).

However, RNNs can only consume the input and generate the output sequentially, since the previous hidden state and the current input are both required to build the current hidden state. This sequential processing limits parallelization in both training and inference. For the same reason, for a given frame, information from many steps earlier may have been biased after multiple recurrent steps. To deal with these two problems, Transformer (Vaswani et al. 2017) was proposed to replace the RNNs in NMT models.

Inspired by this idea, in this paper we combine the advantages of Tacotron2 and Transformer to propose a novel end-to-end TTS model, in which the multi-head attention mechanism is introduced to replace the RNN structures in the encoder and decoder, as well as the vanilla attention network. The self-attention mechanism removes the sequential dependency on the previous hidden state, which improves parallelization and relieves the long-distance dependency problem. Compared with the vanilla attention between the encoder and decoder, multi-head attention can build the context vector from different aspects using different attention heads. With phoneme sequences as input, our Transformer TTS network generates mel spectrograms, and employs WaveNet as the vocoder to synthesize audio. We conduct experiments with a 25-hour professional speech dataset, and the audio quality is evaluated by human testers. Evaluation results show that our proposed model outperforms the original Tacotron2 by a gap of 0.048 in CMOS, and achieves performance (4.39 in MOS) similar to human recordings (4.44 in MOS). Besides, our Transformer TTS model speeds up the training process by about 4.25 times compared with Tacotron2. Audio samples can be accessed on https://neuraltts.

2 Background

In this section, we first introduce the sequence-to-sequence model, followed by a brief description of Tacotron2 and Transformer, which are the two preliminaries of our work.

2.1 Sequence to Sequence Model

A sequence-to-sequence model (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014) converts an input sequence $(x_1, x_2, ..., x_T)$ into an output sequence $(y_1, y_2, ..., y_{T'})$, and each predicted $y_t$ is conditioned on all previously predicted outputs $y_1, ..., y_{t-1}$. In most cases, these two sequences are of different lengths ($T \neq T'$). In NMT, this conversion translates the input sentence in one language into the output sentence in another language, based on a conditional probability $p(y_1, ..., y_{T'} | x_1, ..., x_T)$:

$$h_t = \mathrm{encoder}(h_{t-1}, x_t) \tag{1}$$

$$s_t = \mathrm{decoder}(s_{t-1}, y_{t-1}, c_t) \tag{2}$$

where $c_t$ is the context vector calculated by an attention mechanism:

$$c_t = \mathrm{attention}(s_{t-1}, h) \tag{3}$$

thus $p(y_1, ..., y_{T'} | x_1, ..., x_T)$ can be computed by

$$p(y_1, ..., y_{T'} | x_1, ..., x_T) = \prod_{t=1}^{T'} p(y_t | y_{<t}, x) \tag{4}$$

and

$$p(y_t | y_{<t}, x) = \mathrm{softmax}(f(s_t)) \tag{5}$$

where $f(\cdot)$ is a fully connected layer. For translation tasks, this softmax function is taken over all dimensions of $f(s_t)$ and gives the probability of each word in the vocabulary. However, in the TTS task, the softmax function is not required and the hidden states $s$ calculated by the decoder are consumed directly by a linear projection to obtain the desired spectrogram frames.

2.2 Tacotron2

Tacotron2 is a neural network architecture for speech synthesis directly from text, as shown in Fig. 1. The embedding sequence of the input is first processed with a 3-layer CNN to extract a longer-term context, and then fed into the encoder, which is a bi-directional LSTM. The previous mel spectrogram frame (the predicted one in inference, or the golden one in training) is first processed with a 2-layer fully connected network (decoder pre-net), whose output is concatenated with the previous context vector and followed by a 2-layer LSTM. The output is used to calculate the new context vector at this time step, which is concatenated with the output of the 2-layer LSTM to predict the mel spectrogram and the stop token with two different linear projections respectively. Finally, the predicted mel spectrogram is fed into a 5-layer CNN with residual connections to refine the mel spectrogram.

Figure 1: System architecture of Tacotron2.

2.3 Transformer for NMT

Transformer (Vaswani et al. 2017), shown in Fig. 2, is a sequence-to-sequence network based solely on attention mechanisms, dispensing with recurrences and convolutions entirely. In recent works, Transformer has shown extraordinary results, outperforming many RNN-based models in NMT. It consists of two components, an encoder and a decoder, both of which are built from stacks of several identical blocks. Each encoder block contains two subnetworks: a multi-head attention and a feed forward network, while each decoder block contains an extra masked multi-head attention compared to the encoder block. Both encoder and decoder blocks have residual connections and layer normalizations.
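The encoder block described in Sec. 2.3 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in this paper: the function names (`multi_head_self_attention`, `encoder_block`, `init_params`), the random weights, the omission of biases and of learnable layer-norm gains, and the placement of layer normalization after each residual sum are all choices made here for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector (no learnable gain/bias here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over x with shape (T, d_model)."""
    T, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split(t):
        # (T, d_model) -> (num_heads, T, d_head)
        return t.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Every position attends to every other position in one matrix product,
    # so the path length between any two positions is 1.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, T, d_head)
    merged = heads.transpose(1, 0, 2).reshape(T, d_model)
    return merged @ w_o, weights


def encoder_block(x, params, num_heads):
    """One encoder block: self-attention and feed forward, each with residual + layer norm."""
    attn_out, weights = multi_head_self_attention(
        x, params["w_q"], params["w_k"], params["w_v"], params["w_o"], num_heads
    )
    x = layer_norm(x + attn_out)
    ffn_out = np.maximum(0.0, x @ params["w_1"]) @ params["w_2"]  # ReLU feed forward
    return layer_norm(x + ffn_out), weights


def init_params(d_model, d_ff, seed=0):
    """Hypothetical random weights, for illustration only."""
    rng = np.random.default_rng(seed)
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_1": (d_model, d_ff), "w_2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}
```

Calling `encoder_block(x, init_params(16, 32), num_heads=4)` on an input `x` of shape `(5, 16)` returns an output of the same shape together with one `(5, 5)` attention matrix per head. A decoder block would add a causal mask to `scores` in its self-attention and a second attention over the encoder outputs.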

Figure 2: System architecture of Transformer.

3 Neural TTS with Transformer

Compared to RNN-based models, using Transformer in neural TTS has two advantages. First, it enables parallel training by removing recurrent connections, as the frames of a decoder input sequence can be provided in parallel. Second, self-attention provides an opportunity to inject the global context of the whole sequence into each input frame, building long-range dependencies directly. Transformer shortens the length of the paths that forward and backward signals have to traverse between any combination of positions in the input and output sequences down to 1. This helps a lot in a neural TTS model, for example with the prosody of synthesized waves, which depends not only on several neighboring words but also on sentence-level semantics.

In this section we introduce the architecture of our Transformer TTS model and analyze the function of each part. The overall structure diagram is shown in Fig. 3.

3.1 Text-to-Phoneme Converter

English pronunciation has certain regularities. For example, there are two kinds of syllables in English: open and closed. The letter "a" is often pronounced as /eı/ when it is in an open syllable, while it is pronounced as /æ/ or /a:/ in closed syllables. We can rely on the neural network to learn such a regularity in the training process. However, it is difficult to learn all the regularities when, as is often the case, the training data is insufficient, and some exceptions have too few occurrences for neural networks to learn. So we build a rule system and implement it as a text-to-phoneme converter, which can cover the vast majority of cases.

3.2 Scaled Positional Encoding

Transformer contains no recurrence and no convolution, so if we shuffle the input sequence of the encoder or decoder, we will get the same output. To take the order of the sequence into consideration, information about the relative or absolute position of frames is injected by triangle positional embeddings, shown in Eqs. 6 and 7:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{6}$$

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{7}$$

where $pos$ is the time step index, $2i$ and $2i+1$ are the channel indices and $d_{model}$ is the vector dimension of each frame. In NMT, the embeddings for both source and target language are from language spaces, so the scales of these embeddings are similar. This condition doesn't hold in the TTS scenario, since the source domain is text while the target domain is mel spectrograms; hence using fixed positional embeddings may impose heavy constraints on both the encoder and decoder pre-nets (which will be described in Sec. 3.3 and 3.4). We employ these triangle positional embeddings with a trainable weight, so that the embeddings can adaptively fit the scales of both the encoder and decoder pre-nets' outputs, as shown in Eq. 8:

$$x_i = \mathrm{prenet}(\mathrm{phoneme}_i) + \alpha PE(i) \tag{8}$$

where $\alpha$ is the trainable weight.

3.3 Encoder Pre-net

In Tacotron2, a 3-layer CNN is applied to the input text embeddings, which can model the longer-term context in the input character sequence. In our Transformer TTS model, we input the phoneme sequence into the same network, which is called the "encoder pre-net". Each phoneme has a trainable embedding of 512 dims, and the output of each convolution layer has 512 channels, followed by batch normalization and ReLU activation, and a dropout layer as well. In addition, we add a linear projection after the final ReLU activation, since the output range of ReLU is $[0, +\infty)$, while each dimension of the triangle positional embeddings is in $[-1, 1]$. Adding 0-centered positional information onto non-negative embeddings results in a fluctuation that is not centered on the origin and harms model performance, which will be demonstrated in our experiment. Hence we add a linear projection for center consistency.

3.4 Decoder Pre-net

The mel spectrogram is first consumed by a neural network composed of two fully connected layers (each with 256 hidden units) with ReLU activation, named the "decoder pre-net", and it plays an important role in the TTS system. Phonemes have trainable embeddings, thus their subspace is adaptive,
while that of mel spectrograms is fixed. We infer that the decoder pre-net is responsible for projecting mel spectrograms into the same subspace as phoneme embeddings, so that the similarity of a ⟨phoneme, mel frame⟩ pair can be measured and the attention mechanism can work. Besides, 2 fully connected layers without non-linear activation were also tried, but no reasonable attention matrix aligning the hidden states of encoder and decoder could be generated. In another experiment, the hidden size was enlarged from 256 to 512; however, that did not yield a significant improvement and needed more steps to converge. Accordingly, we conjecture that mel spectrograms have a compact and low-dimensional subspace that 256 hidden units are good enough to fit. This conjecture is also supported by our experiment in Sec. 4.6, where the final positional embedding scale of the decoder is smaller than that of the encoder. An additional linear projection is also added, as in the encoder pre-net, not only for center consistency but also to obtain the same dimension as the triangle positional embeddings.

Figure 3: System architecture of our model.

3.5 Encoder

In Tacotron2, the encoder is a bi-directional RNN. We replace it with the Transformer encoder described in Sec. 2.3. Compared to the original bi-directional RNN, multi-head attention splits one attention into several subspaces so that it can model the frame relationships in multiple different aspects, and it directly builds the long-time dependency between any two frames, so each of them considers the global context of the whole sequence. This is crucial for the prosody of synthesized audio, especially when the sentence is long, as generated samples sound smoother and more natural in our experiments. In addition, employing multi-head attention instead of the original bi-directional RNN enables parallel computing, which improves training speed.

3.6 Decoder

In Tacotron2, the decoder is a 2-layer RNN with location-sensitive attention (Chorowski et al. 2015). We replace it with the Transformer decoder described in Sec. 2.3. Employing the Transformer decoder makes two main differences: adding self-attention, which brings advantages similar to those described in Sec. 3.5, and using multi-head attention instead of the location-sensitive attention. The multi-head attention can integrate the encoder hidden states from multiple perspectives and generate better context vectors. By taking the attention matrix of previous decoder time steps into consideration, the location-sensitive attention used in Tacotron2 can encourage the model to generate consistent attention results. We tried to modify the dot-product-based multi-head attention to be location sensitive, but that doubles the training time and easily runs out of memory.

3.7 Mel Linear, Stop Linear and Post-net

As in Tacotron2, we use two different linear projections to predict the mel spectrogram and the stop token respectively, and use a 5-layer CNN to produce a residual to refine the reconstruction of the mel spectrogram. It is worth mentioning that, for the stop linear, there is only one positive sample at the end of each sequence, which means "stop", while there are hundreds of negative samples for the other frames. This imbalance may result in unstoppable inference. We impose a positive weight (5.0 ∼ 8.0) on the tail positive stop token when calculating the binary cross entropy loss, and this problem was efficiently solved.

4 Experiment

In this section, we conduct experiments to test our proposed Transformer TTS model with 25 hours of professional speech pairs, and the audio quality is evaluated by human testers in MOS and CMOS.

4.1 Training Setup

We use 4 Nvidia Tesla P100 GPUs to train our model with an internal US English female dataset, which contains 25 hours of professional speech (17584 ⟨text, wave⟩ pairs, with a few overly long waves removed). 50ms of silence at the head and 100ms of silence at the tail are kept for each wave. Since the lengths of training samples vary greatly, a fixed batch size will either run
out of memory when long samples are added into a batch with a large size, or waste parallel computing power when short samples are grouped into a batch with a small size. Therefore, we use a dynamic batch size where the maximum total number of mel spectrogram frames is fixed and one batch should contain as many samples as possible. As a result, there are on average 16 samples in a single batch per GPU. We tried training on a single GPU, but the procedure was quite unstable or even failed, in which case the synthesized audio sounded like raving and was incomprehensible. Even when training did not fail, the synthesized waves were of bad quality with weird prosody, or even had severe problems such as missing phonemes. Thus we enable multi-GPU training to enlarge the batch size, which effectively solves those problems.

4.2 Text-to-Phoneme Conversion and Pre-process

Tacotron2 uses character sequences as input, while our model is trained on pre-normalized phoneme sequences. Word and syllable boundaries and punctuation are also included as special markers. The processing pipeline that produces the training phoneme sequences consists of sentence separation, text normalization, word segmentation and finally obtaining the pronunciation. With text-to-phoneme conversion, mispronunciation problems are greatly reduced, especially for pronunciations that rarely occur in our training set.

4.3 WaveNet Settings

We train a WaveNet conditioned on mel spectrograms with the same internal US English female dataset, and use it as the vocoder for all models in this paper. The sample rate of the ground truth audio is 16000 and the frame rate (frames per second) of the ground truth mel spectrogram is 80. Our autoregressive WaveNet contains 2 QRNN layers and 20 dilated layers, and the sizes of all residual channels and dilation channels are 256. Each frame of the QRNN's final output is copied 200 times to have the same temporal resolution as the audio samples and to serve as the condition of the 20 dilated layers.

4.4 Training Time Comparison

Our model can be trained in parallel since there is no recurrent connection between frames. In our experiment, the time consumed by a single training step for our model is ∼0.4s, which is 4.25 times faster than that of Tacotron2 (∼1.7s) with equal batch size (16 samples per batch). However, since our model has almost twice as many parameters as Tacotron2, it still takes ∼3 days to converge, compared to ∼4.5 days for Tacotron2.

4.5 Evaluation

We randomly select 38 fixed examples with various lengths (no overlap with the training set) from our internal dataset as the evaluation set. We evaluate the mean opinion score (MOS) on these 38 sentences generated by different models (including recordings), so that we keep the text content consistent, exclude other interference factors and examine only audio quality. For higher accuracy, we split the whole MOS test into several small tests, each containing one group from our best model, one group from a comparative model and one group of recordings. Those MOS tests are rigorous and reliable, as each audio is listened to by at least 20 testers, who are all native English speakers (compared to Tacotron2's 8 testers in Shen et al. (2017)), and each tester listens to fewer than 30 audios.

We train a Tacotron2 model with our internal US English female dataset as the baseline (also using phonemes as input), and it obtains a MOS equal to that of our model. Therefore we test the comparison mean opinion score (CMOS) between samples generated by Tacotron2 and our model for a finer contrast. In the CMOS test, testers listen to two audios (generated by Tacotron2 and our model with the same text) each time and evaluate how the latter feels compared to the former using a score in [−3, 3] with intervals of 1. The order of the two audios changes randomly so testers don't know their sources. Our model wins by a gap of 0.048, and detailed results are shown in Table 1.

Table 1: MOS comparison among our model, our Tacotron2 and recordings.
System         MOS            CMOS
Tacotron2      4.39 ± 0.05    0
Our Model      4.39 ± 0.05    0.048
Ground Truth   4.44 ± 0.05    -

Figure 4: Mel spectrogram comparison. Our model (6-layer) does better in reconstructing details as marked in red rectangles, while Tacotron2 and our 3-layer model blur the texture especially in high frequency region. Best viewed in color.

We also select mel spectrograms generated by our model and Tacotron2 respectively with the same text, and compare them together with the ground truth, as shown in columns 1, 2 and 3 of Fig. 4. As we can see, our model does better in reconstructing details as marked in red rectangles, while Tacotron2 leaves out the detailed texture in the high frequency region.

4.6 Ablation Studies

In this section, we study the detailed modifications of the network architecture, and conduct several experiments to show our improvements.

Re-centering Pre-net's Output. As described in Sec. 3.3 and 3.4, we re-project both the encoder and decoder pre-nets' outputs so that their center is consistent with the positional embeddings. For comparison, we train a variant with no linear projection in the encoder pre-net and with a fully connected layer with ReLU activation in the decoder pre-net. The results imply that center-consistent positional embedding performs slightly better, as shown in Table 2.

Table 2: MOS comparison of whether re-centering pre-net's output.
Re-center      MOS
No             4.32 ± 0.05
Yes            4.36 ± 0.05
Ground Truth   4.43 ± 0.05

Different Positional Encoding Methods. We inject positional information into both the encoder's and the decoder's input sequences as in Eq. 8. Fig. 5 shows that the final positional embedding scales of the encoder and decoder are different, and Table 3 shows that the model with a trainable scale performs slightly better. We think that the trainable scale relaxes the constraint on the encoder and decoder pre-nets, making positional information more adaptive for different embedding spaces.

Figure 5: PE scale of encoder and decoder.

Table 3: MOS comparison of scaled and original PE.
PE Type        MOS
Original       4.37 ± 0.05
Scaled         4.40 ± 0.05
Ground Truth   4.41 ± 0.04

We also tried adding absolute position embeddings (each position has a trainable embedding) to the sequence, which also works but has some severe problems such as missing phonemes when the sequences become long. That is because long samples are relatively rare in the training set, so the embeddings for large indexes can hardly be trained and thus the position information is not accurate for the rear frames of a long sample.

Model with Different Hyper-Parameter. Both the encoder and decoder of the original Transformer are composed of 6 layers, and each multi-head attention has 8 heads. We compare performance and training speed with different layer and head numbers, as shown in Tables 4, 5 and 6. We find that reducing layers and reducing heads both improve the training speed but, on the other hand, harm model performance to different degrees.

Table 4: Ablation studies in different layer numbers.
Layer Number   MOS
3-layer        4.33 ± 0.06
6-layer        4.41 ± 0.05
Ground Truth   4.44 ± 0.05

Table 5: Ablation studies in different head numbers.
Head Number    MOS
4-head         4.39 ± 0.05
8-head         4.44 ± 0.05
Ground Truth   4.47 ± 0.05

We notice that in both the 3-layer and 6-layer models, only the alignments from certain heads of the first 2 layers are interpretable diagonal lines, which show the approximate correspondence between input and output sequences, while those of the following layers are disorganized. Even so, more layers can still lower the loss, refine the synthesized mel spectrogram and improve audio quality. The reason is that, with residual connections between different layers, our model fits the target transformation in a Taylor-expansion way: the first, low-order terms account for most of the function, while the subsequent ones refine it. Hence adding more layers makes the synthesized wave more natural, since it does better in processing spectrogram details (shown in column 4, Fig. 4). Fewer heads can slightly reduce the training time cost since there are fewer products to compute per layer, but also harm the performance.

5 Related Work

Traditional speech synthesis methods can be categorized into two classes: concatenative systems and parametric systems. Concatenative TTS systems (Hunt and Black 1996; Black and Taylor 1997) split original waves into small units,
and stitch them together with algorithms such as Viterbi (Viterbi 1967), followed by signal processing methods (Charpentier and Stella 1986; Verhelst and Roelands 1993) to generate new waves. Parametric TTS systems (Tokuda et al. 2000; Zen, Tokuda, and Black 2009; Ze, Senior, and Schuster 2013; Tokuda et al. 2013) convert speech waves into spectrograms, and acoustic parameters, such as fundamental frequency and duration, are used to synthesize new audio.

Table 6: Comparison of time consumed (in seconds) per training step for different layer and head numbers.
           3-layer   6-layer
4-head     -         0.44
8-head     0.29      0.50

Traditional speech synthesis methods require extensive domain expertise and may contain brittle design choices. Char2Wav (Sotelo et al. 2017) integrates the front-end and the back-end as one seq2seq (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014) model and learns the whole process in an end-to-end way, predicting acoustic parameters followed by a SampleRNN (Mehri et al. 2016) as the vocoder. However, acoustic parameters are still an intermediate representation of the audio, so Char2Wav is not truly an end-to-end TTS model, and its seq2seq and SampleRNN models need to be pre-trained separately. In contrast, Tacotron, proposed by Wang et al. (2017), is an end-to-end generative text-to-speech model, which can be trained on ⟨text, spectrogram⟩ pairs directly from scratch, and synthesizes speech from the generated spectrograms with the Griffin Lim algorithm (Griffin and Lim 1984). Based on Tacotron, Tacotron2 (Shen et al. 2017), a unified and entirely neural model, generates mel spectrograms with a Tacotron-style neural network and then synthesizes speech with a modified WaveNet (Van Den Oord et al. 2016). WaveNet is an autoregressive generative model for waveform synthesis, composed of stacks of dilated convolutional layers; it processes raw audio of very high temporal resolution (e.g., 24,000 sample rate), while suffering from a very large time cost in inference. This problem is solved by Parallel WaveNet (Oord et al. 2017), which is based on the inverse autoregressive flow (IAF) (Kingma et al. 2016) and reaches 1000× real time. Recently, ClariNet (Ping, Peng, and Chen 2018), a fully convolutional text-to-wave neural architecture, was proposed to enable fast end-to-end training from scratch. Moreover, VoiceLoop (Taigman et al. 2018) is an alternative neural TTS method that mimics a person's voice based on samples captured in the wild, such as audio of public speeches, even with inaccurate automatic transcripts.

On the other hand, Transformer (Vaswani et al. 2017) was proposed for neural machine translation (NMT) and achieves state-of-the-art results. Previous NMT models are dominated by RNN-based (Bahdanau, Cho, and Bengio 2014) or CNN-based (e.g. ConvS2S (Gehring et al. 2017), ByteNet (Kalchbrenner et al. 2016)) neural networks. For RNN-based models, both training and inference are sequential for each sample, while CNN-based models enable parallel training. Both RNN-based and CNN-based models have difficulty learning dependencies between distant positions, since RNNs have to traverse a long path and CNNs have to stack many convolutional layers to get a large receptive field, while Transformer solves this using self-attention in both its encoder and decoder. The ability of self-attention is also shown in SAGAN (Zhang et al. 2018), where original GANs without self-attention fail to capture geometric or structural patterns that occur consistently in some classes (for example, dogs are often drawn without clearly defined separate feet). By adding self-attention, these failure cases are greatly reduced. Besides, multi-head attention was proposed to obtain different relations in multiple subspaces. Recently, Transformer has been applied to automatic speech recognition (ASR) (Zhou et al. 2018a; Zhou et al. 2018b), proving its ability in acoustic modeling beyond natural language processing.

6 Conclusion and Future Work

We propose a neural TTS model based on Tacotron2 and Transformer, and make some modifications to adapt Transformer to the neural TTS task. Our model generates audio samples whose quality is very close to human recordings, and enables parallel training and the learning of long-distance dependencies, so that training is sped up and the audio prosody is much smoother. We find that batch size is crucial for training stability, and that more layers can refine the detail of generated mel spectrograms, especially in high frequency regions, and thus improve model performance.

Even though Transformer has enabled parallel training, the autoregressive model still suffers from two problems: slow inference and exploration bias. Slow inference is due to the dependency on previous frames when inferring the current frame, so that inference is sequential, while exploration bias comes from autoregressive error accumulation. We may solve them both at once by building a non-autoregressive model, which is also our current research in progress.

References

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[Black and Taylor 1997] Black, A. W., and Taylor, P. 1997. Automatically clustering similar units for unit selection in speech synthesis. In Fifth European Conference on Speech Communication and Technology.
[Charpentier and Stella 1986] Charpentier, F., and Stella, M. 1986. Diphone synthesis using an overlap-add technique for speech waveforms concatenation. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'86., volume 11, 2015–2018. IEEE.
[Cho et al. 2014] Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Computer Science.

[Chorowski et al. 2015] Chorowski, J. K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; and Bengio, Y. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, 577–585.

[Gehring et al. 2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

[Griffin and Lim 1984] Griffin, D., and Lim, J. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2):236–243.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

[Hunt and Black 1996] Hunt, A. J., and Black, A. W. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, volume 1, 373–376. IEEE.

[Kalchbrenner et al. 2016] Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A. v. d.; Graves, A.; and Kavukcuoglu, K. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

[Kingma et al. 2016] Kingma, D. P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; and Welling, M. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 4743–4751.

[Mehri et al. 2016] Mehri, S.; Kumar, K.; Gulrajani, I.; Kumar, R.; Jain, S.; Sotelo, J.; Courville, A.; and Bengio, Y. 2016. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837.

[Oord et al. 2017] Oord, A. v. d.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; Driessche, G. v. d.; Lockhart, E.; Cobo, L. C.; Stimberg, F.; et al. 2017. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.

[Ping, Peng, and Chen 2018] Ping, W.; Peng, K.; and Chen, J. 2018. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.

[Shen et al. 2017] Shen, J.; Pang, R.; Weiss, R. J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. 2017. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884.

[Sotelo et al. 2017] Sotelo, J.; Mehri, S.; Kumar, K.; Santos, J. F.; Kastner, K.; Courville, A.; and Bengio, Y. 2017. Char2Wav: End-to-end speech synthesis. ICLR 2017 workshop.

[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

[Taigman et al. 2018] Taigman, Y.; Wolf, L.; Polyak, A.; and Nachmani, E. 2018. VoiceLoop: Voice fitting and synthesis via a phonological loop. In International Conference on Learning Representations.

[Tokuda et al. 2000] Tokuda, K.; Yoshimura, T.; Masuko, T.; Kobayashi, T.; and Kitamura, T. 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on, volume 3, 1315–1318. IEEE.

[Tokuda et al. 2013] Tokuda, K.; Nankaku, Y.; Toda, T.; Zen, H.; Yamagishi, J.; and Oura, K. 2013. Speech synthesis based on hidden Markov models. Proceedings of the IEEE 101(5):1234–1252.

[Van Den Oord et al. 2016] Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. In SSW, 125.

[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

[Verhelst and Roelands 1993] Verhelst, W., and Roelands, M. 1993. An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on, volume 2, 554–557. IEEE.

[Viterbi 1967] Viterbi, A. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2):260–269.

[Wang et al. 2017] Wang, Y.; Skerry-Ryan, R.; Stanton, D.; Wu, Y.; Weiss, R. J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. 2017. Tacotron: A fully end-to-end text-to-speech synthesis model. arXiv preprint.

[Ze, Senior, and Schuster 2013] Ze, H.; Senior, A.; and Schuster, M. 2013. Statistical parametric speech synthesis using deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 7962–7966. IEEE.

[Zen, Tokuda, and Black 2009] Zen, H.; Tokuda, K.; and Black, A. W. 2009. Statistical parametric speech synthesis. Speech Communication 51(11):1039–1064.

[Zhang et al. 2018] Zhang, H.; Goodfellow, I.; Metaxas, D.; and Odena, A. 2018. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.

[Zhou et al. 2018a] Zhou, S.; Dong, L.; Xu, S.; and Xu, B. 2018a. A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese. arXiv preprint arXiv:1805.06239.

[Zhou et al. 2018b] Zhou, S.; Dong, L.; Xu, S.; and Xu, B. 2018b. Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese. arXiv preprint arXiv:1804.10752.