Mutually-Constrained Monotonic Multihead Attention for Online ASR
Jaeyun Song, Hajin Shim, Eunho Yang
March 26, 2021

Despite decoding in real time, Monotonic Multihead Attention (MMA) shows performance comparable to state-of-the-art offline methods in machine translation and automatic speech recognition (ASR) tasks. However, the latency of MMA is still a major issue in ASR, and it has to be combined with a technique that reduces latency at inference time, such as head-synchronous beam search decoding, which forces all non-activated heads to activate after a small fixed delay from the first head activation. In this paper, we remove the discrepancy between the training and test phases by considering, in the training of MMA, the interactions across multiple heads that will occur at test time. Specifically, we derive the expected alignments of monotonic attention while considering the boundaries of the other heads and reflect them in the learning process. We validate our proposed method on two standard benchmark datasets for ASR and show that our approach, MMA with mutually-constrained heads from the training stage, provides better performance than the baselines.

Online automatic speech recognition (ASR), which recognizes incomplete speech immediately as humans do, is emerging as a core element of diverse ASR-based services such as teleconferencing, AI secretaries, and AI booking services. In particular, as the contactless service market grows rapidly due to the recent global outbreak of COVID-19, the importance of providing more natural services by reducing latency is also growing. However, online ASR models [1, 2] targeting real-time inference naturally raise concerns about performance degradation compared to traditional DNN-HMM hybrid models with pre-segmented alignments or offline models based on the Transformer [3, 4], which is the current state of the art in many sequence-to-sequence tasks.

To overcome this performance-delay trade-off, several attempts have been made to learn or find monotonic alignments between source and target via attention mechanisms [5, 6, 7, 8, 9]. In particular, Monotonic Attention (MA) and Monotonic Chunkwise Attention (MoChA) [8, 9] learn alignments in an end-to-end manner by computing differentiable expected alignments in the training phase and show performance comparable to models using offline attention. Very recently, motivated by the success of the Transformer architecture in ASR [4], direct attempts to make it online by applying these alignment-learning strategies to the Transformer, rather than to traditional RNN-based models, are emerging [10, 11, 12, 13].
Among them, Monotonic Multihead Attention (MMA) [14] converts each of the multiple heads in the Transformer into an MA head and exploits the diversity of alignments across heads. To resolve the issue that MMA has to wait for all heads at every decoding step, HeadDrop [15] drops heads stochastically in the training stage. [15] also proposed head-synchronous beam search decoding (HSD), which limits the difference in selection time between the heads in the same layer, but only in the inference phase, resulting in a discrepancy between training and inference.

In this paper, we propose an algorithm, called "Mutually-Constrained Monotonic Multihead Attention" (MCMMA), that enables each head to learn its alignments jointly with the other heads by modifying the expected alignments so that the constrained alignments of the test time are consistently brought into the training time. By bridging the gap between the training and test stages, MCMMA effectively improves performance.

We first review the main components our model is based on: monotonic attention, monotonic multihead attention, and HeadDrop with head-synchronous beam search decoding, in Subsections 2.1, 2.2, and 2.3, respectively.

MA [8] is an attention-based encoder-decoder RNN model that learns monotonic alignments in an end-to-end manner. The encoder maps the input sequence $x = (x_1, \ldots, x_T)$ to encoder states $h = (h_1, \ldots, h_T)$. At the $i$-th output step, the decoder sequentially inspects encoder states, starting from the one selected at the previous step, and decides whether to take each of them to produce the current output. The probability $p_{i,j}$ of selecting $h_j$ for the $i$-th output is computed as

$$p_{i,j} = \sigma\big(\mathrm{MonotonicEnergy}(s_{i-1}, h_j)\big),$$

where $s_{i-1}$ is the decoder state of the $(i-1)$-th output step and $\sigma$ is the sigmoid function. If $h_j$ is selected, the RNN decoder takes it as the context $c_i = h_j$, together with the previous decoder state $s_{i-1}$ and output $y_{i-1}$, to compute the current state. To make the alignment learnable in the training phase, the hard-selected context above is replaced by the expected context $c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j$, the weighted sum of $h$ with the expected alignment $\alpha$ computed as

$$\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}). \qquad (1)$$

MoChA [9] extends MA by performing soft attention over fixed-length chunks of encoder states preceding the position chosen by the MA mechanism.

MMA [14] applies the MA mechanism to the Transformer [3] by making each head of the decoder-encoder attention learn monotonic alignments as in MA, and it borrows the scaled dot-product operation of the Transformer. Although MMA leads to considerable improvements in online machine translation, its latency is still high, since the model has to wait until all heads have selected their contexts at every decoding step. Thus, the authors of [14] proposed an additional regularization that minimizes the variance of the expected alignments across heads to reduce latency. Nevertheless, this approach does not model the dependency between heads explicitly.

HeadDrop [15] stochastically drops each head so that each individual head learns alignments correctly. This approach improves the boundary coverage and streamability of MMA [15]. Head-synchronous decoding (HSD) [15] is an inference algorithm in which the leftmost head forces slow heads, which fail to choose any frame within a waiting-time threshold $\varepsilon$, to choose the rightmost of the selected frames. However, HSD considers the alignments of the other heads only at the test phase.
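To make Eq. (1) concrete before moving on to our method, the following is a minimal NumPy sketch of how the expected alignment can be computed for a single head from the selection probabilities $p_{i,j}$. The function name and the sequential loop are illustrative choices of ours, not the authors' implementation, which in practice typically evaluates the recurrence in parallel with cumulative sums and products and clips probabilities for numerical stability.

```python
import numpy as np

def expected_alignment(p):
    """Expected monotonic alignment of Eq. (1) for one head.

    p: array of shape (I, T) with selection probabilities p[i, j].
    Returns alpha of shape (I, T), where
      alpha[i, j] = p[i, j] * sum_{k<=j} alpha[i-1, k] * prod_{l=k}^{j-1} (1 - p[i, l]).
    """
    I, T = p.shape
    alpha = np.zeros((I, T))
    prev = np.zeros(T)
    prev[0] = 1.0  # before the first output, all alignment mass sits on the first frame
    for i in range(I):
        q = 1.0 - p[i]
        acc = 0.0  # acc_j = sum_{k<=j} prev[k] * prod_{l=k}^{j-1} (1 - p[i, l])
        for j in range(T):
            # recursive update: acc_j = acc_{j-1} * (1 - p[i, j-1]) + prev[j]
            acc = (acc * q[j - 1] if j > 0 else 0.0) + prev[j]
            alpha[i, j] = p[i, j] * acc
        prev = alpha[i]
    return alpha
```

The context of output step $i$ is then the weighted sum of the encoder states with `alpha[i]`, exactly as described above.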
In this section, we propose an algorithm that learns the alignments of the MA heads under a constraint on the difference between the heads' alignments within each decoder layer. To reduce latency, our approach forces the MMA heads to select input frames within a limited range, not only in the test phase but also in the training phase, by estimating differentiable expected alignments of the MA heads under these constraints. With the newly estimated expected alignments, we follow the overall training and test procedure of [15].

Before we address the details, we define two functions for convenience in deriving the expectations of the constrained alignments. Let $\varepsilon$ be the waiting threshold for the training stage, and let $\alpha^m_{i,T+1}$ be the probability that the $m$-th head does not select any frame, where $T$ is the length of the input sequence. We define the function $A^m_{i,j}$, which is equal to $\alpha^m_{i,j}$ when it takes the expected alignment $\alpha$ as an argument (see Eq. (1)):

$$A^m_{i,j}(\alpha) = p^m_{i,j} \sum_{k=1}^{j} \alpha^m_{i-1,k} \prod_{l=k}^{j-1} (1 - p^m_{i,l}). \qquad (2)$$

The probability $B^m_{i,j}$ that the $m$-th head does not choose any frame up to $h_j$ for the $i$-th output can likewise be represented as a function of $\alpha$:

$$B^m_{i,j}(\alpha) = \sum_{k=1}^{j} \alpha^m_{i-1,k} \prod_{l=k}^{j} (1 - p^m_{i,l}). \qquad (3)$$

Before we consider the interdependent alignments of all heads under the constraints, we first consider the simple situation in which each head must choose a frame within the waiting threshold $\varepsilon$ of the frame it selected at the previous step. We define another function $C$ for the expected alignments as

$$C^m_{i,j}(\gamma) = p^m_{i,j} \sum_{k=j-\varepsilon+1}^{j} \gamma^m_{i-1,k} \prod_{l=k}^{j-1} (1 - p^m_{i,l}) + \gamma^m_{i-1,j-\varepsilon} \prod_{l=j-\varepsilon}^{j-1} (1 - p^m_{i,l}), \qquad (4)$$

where $\varepsilon + 1 \le j \le T$ and $A^m_{0,j} = 0$, $B^m_{0,j} = 1$ for all $m, j$. The first term on the RHS is the probability that the $m$-th head selects $h_j$ when it selected a frame later than $h_{j-\varepsilon}$ at the previous step. The second term is the probability that the $m$-th head chooses $h_j$ when it does not select any frame between the last selected frame $h_{j-\varepsilon}$ and the frame right before the right bound, $h_{j-1}$. Thus, Eq. (4) gives the probability that the $m$-th head chooses $h_j$ when predicting $y_i$, so that $\gamma^m_{i,j} = C^m_{i,j}(\gamma)$. However, instead of computing $\gamma$ autoregressively, we can replace it with $\hat{\gamma} = C(\alpha)$, where $\alpha$ is computed in parallel by Eq. (1). This modification has a limitation, however: word pieces and characters have various lengths, so constraining the distance from a head's own previous boundary might be harmful. Alternatively, we impose the constraint on the difference among heads, as in [15].

We now present our main method, Mutually-Constrained MMA (MCMMA), which estimates the expected alignment while considering the interdependency among the MA heads. Similarly to Eq. (4), we define the function $D$ as

$$D^m_{i,j}(\delta) = A^m_{i,j}(\delta) \prod_{n \neq m} B^n_{i,j-\varepsilon}(\delta) + B^m_{i,j-1}(\delta) \Big( \prod_{n \neq m} B^n_{i,j-\varepsilon-1}(\delta) - \prod_{n \neq m} B^n_{i,j-\varepsilon}(\delta) \Big). \qquad (5)$$

The first term is the probability that the $m$-th head selects $h_j$ when the other heads do not choose any frame up to $h_{j-\varepsilon}$. The second term is the probability that the $m$-th head chooses $h_j$ when at least one of the other heads selects $h_{j-\varepsilon}$ and the time limit is over. Thus, Eq. (5) gives the probability that the $m$-th head chooses $h_j$ when predicting $y_i$. Note that the probability of selecting $h_j$ is zero if at least one of the other heads has chosen $h_o$ with $o < j - \varepsilon$. To avoid training MMA autoregressively, we replace $\delta = D(\delta)$ with $\hat{\delta} = D(\alpha)$, where $\alpha$ is again computed in parallel by Eq. (1). The overall procedures of MMA and MCMMA at inference are illustrated in Fig. 1. In the training phase, we form the context as a weighted sum of the encoder states with these expected alignments.
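To make the structure of these constrained expectations concrete, below is a minimal NumPy sketch of the simpler same-head case, i.e., the non-autoregressive $\hat{\gamma} = C(\alpha)$ of Eq. (4) for a single head at one output step. The function name, the 0-indexed layout, and the handling of the first $\varepsilon$ frames (where no forcing can occur) are our assumptions rather than the authors' implementation; the mutually-constrained expectation of Eq. (5) follows the same two-term pattern, with the same-head history replaced by products of the other heads' $B$ terms and the loops batched over heads in practice.

```python
import numpy as np

def constrained_alignment_step(p, alpha_prev, eps):
    """Non-autoregressive gamma_hat_{i,:} = C_{i,:}(alpha) for one head (Eq. (4) style).

    p          : (T,) selection probabilities p_{i,j} at output step i.
    alpha_prev : (T,) unconstrained expected alignment alpha_{i-1,j} from Eq. (1).
    eps        : waiting threshold (in frames).
    """
    T = p.shape[0]
    q = 1.0 - p
    gamma = np.zeros(T)
    for j in range(T):
        # first term: voluntary selection of frame j, previous boundary later than j - eps
        lo = max(0, j - eps + 1)  # assumption: no constraint applies within the first eps frames
        s = 0.0
        for k in range(lo, j + 1):
            s += alpha_prev[k] * np.prod(q[k:j])  # frames k..j-1 rejected
        gamma[j] = p[j] * s
        # second term: forced selection when the previous boundary was exactly at j - eps
        if j - eps >= 0:
            gamma[j] += alpha_prev[j - eps] * np.prod(q[j - eps:j])
    return gamma
```

During training, such expectations replace $\alpha$ when forming the context, while test-time decoding still uses the hard, synchronized selections.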
Our architecture follows [15] for fairness. The model is Transformer-based [3]. The encoder is composed of 3 CNN blocks that process the audio signal and 12 layers of multi-head self-attention (SAN) with dimension $d_{model}$ for queries, keys, and values, and $H$ heads. Each CNN block comprises a 2D-CNN followed by max-pooling with stride 2, and ReLU activation is used after every CNN. The decoder has a 1D-CNN for positional encoding [16] and 6 attention layers. Each of the lower $D_{lm}$ attention layers has only a multi-head SAN followed by an FFN, while the upper $(6 - D_{lm})$ layers are stacks of SAN, MMA, and FFN. We also adopt chunkwise multihead attention (CA) [15], which provides additional heads for each MA head to consider multiple views of the input sequence. We refer the reader to [15] for further details.

We experiment on the Librispeech 100-hour subset [17] and AISHELL-1 [18]. We implement our models on top of the code of [15] and use the same setup as [15], including input feature extraction and the overall experimental configuration, for fairness. We build a 10k vocabulary with Byte Pair Encoding (BPE) and train with the Adam optimizer [19] and Noam learning rate scheduling [3]. We also adopt the chunk-hopping mechanism [20] with (past size, current size, future size) = (64, 128, 64) to make the encoder streamable. For inference, we use a pre-trained language model (LM), a 4-layer LSTM with 1024 units, with an LM weight of 0.5, a length penalty of 2, and a beam size of 10. Following [15], the objective is the negative log-likelihood combined with the CTC loss with an interpolation weight $\lambda_{ctc} = 0.3$, and we average the top-10 models saved at the end of every epoch for the final evaluation. We utilize SpecAugment [21] for Librispeech and speed perturbation [22] for AISHELL-1. For AISHELL-1, instead of choosing the right bound, we select the most probable frame between the leftmost frame and the right bound from the training stage onward; we also use 2 CNN blocks in the encoder and apply max-pooling with stride 2 after the second CNN block and after the fourth and eighth SAN layers.

Boundary coverage and streamability [15] are metrics for evaluating whether a model is streamable. However, they do not suit the MMA mechanism well, since each output is predicted only when the last head completes its selection. Instead, we utilize the relative latency (ReL): the difference between the boundaries chosen by the model and those chosen by a reference model, averaged over all output tokens. We note that ReL is a natural extension of an existing latency metric: [23] provides an utterance-level latency that coincides with ReL when the boundaries produced by the reference model are replaced with the gold boundaries in the definition of relative latency. However, acquiring the gold boundaries is complicated, so we utilize the boundaries of MMA without HSD as the reference boundaries.

We present the results of our approach and the baselines in Table 1. We train our model with $\varepsilon = 10$ on Librispeech and $\varepsilon = 12$ on AISHELL-1, and evaluate it with $\varepsilon = 8$ to match the setting of [15]. Our model shows better performance than the baselines, including HeadDrop [15]. In particular, we reduce the WER by 2.2% compared with HeadDrop [15] on test-other of Librispeech. These results show that training alignments jointly with the other heads' selection times improves performance. One interesting and unexpected observation in Table 1 is that the WER of the Transformer is higher than that of the online models (except for MMA) on test-clean. We conjecture that online attention mechanisms are beneficial for exploiting locality, since they strongly force the model to attend to small chunks of the input sequence from the training phase.

We provide trade-off curves between quality and relative latency in Fig. 2 by adjusting $\varepsilon \in \{6, 8, 10, 12\}$ for Librispeech and $\varepsilon \in \{4, 8, 12\}$ for AISHELL-1 at inference time. To express relative latency in time units, we multiply the frame-level relative latency by 80 ms, since the frame reduction factor is 8 and the frame shift is 10 ms.
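As a small illustration of this evaluation (the exact aggregation formula for ReL is not reproduced above, so the token-level averaging below is our assumption, and all names are hypothetical), the frame-level boundary differences against the reference model can be averaged and converted to milliseconds as follows:

```python
def relative_latency_ms(boundaries, ref_boundaries, frame_shift_ms=10.0, reduction=8):
    """Average boundary difference against a reference model, converted to milliseconds.

    boundaries, ref_boundaries: per-utterance lists of boundary indices in
    encoder frames (i.e., after the 8x reduction), one boundary per output token.
    """
    diffs = [b - r
             for hyp, ref in zip(boundaries, ref_boundaries)
             for b, r in zip(hyp, ref)]
    frame_level = sum(diffs) / len(diffs)            # frame-level relative latency
    return frame_level * reduction * frame_shift_ms  # 8 x 10 ms = 80 ms per encoder frame
```

Here the reference boundaries would be those of MMA without HSD, as described above.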
Our model outperforms the baselines and is still faster than MMA without HSD, even though there are small increases in relative latency compared to HeadDrop, except for the case with an extremely small test $\varepsilon$. The performance degradation with a small $\varepsilon$ occurs because the accessible input information is very limited and training with a small $\varepsilon$ severely restricts head diversity. Thus, this result suggests that practitioners should avoid choosing a small $\varepsilon$.

We have proposed a method that learns alignments while considering the other heads' alignments, by modifying the expected alignments so that all heads of each layer select an input frame within a fixed-size window. Our approach improves performance with only a small increase in latency by effectively regularizing the intra-layer difference of boundaries from the training phase.

References

[1] Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks.
[2] Sequence transduction with recurrent neural networks.
[3] Attention is all you need.
[4] Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition.
[5] Attention-based models for speech recognition.
[6] An online sequence-to-sequence model using partial conditioning.
[7] Triggered attention for end-to-end speech recognition.
[8] Online and linear-time attention by enforcing monotonic alignments.
[9] Monotonic chunkwise attention.
[10] Streaming automatic speech recognition with the Transformer model.
[11] CIF: Continuous integrate-and-fire for end-to-end speech recognition.
[12] Streaming Transformer ASR with blockwise synchronous inference.
[13] Transformer-based online CTC/attention end-to-end speech recognition architecture.
[14] Monotonic multihead attention.
[15] Enhancing monotonic multihead attention for streaming ASR.
[16] Transformers with convolutional context for ASR.
[17] Librispeech: An ASR corpus based on public domain audio books.
[18] AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline.
[19] Adam: A method for stochastic optimization.
[20] Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping.
[21] SpecAugment: A simple data augmentation method for automatic speech recognition.
[22] Audio augmentation for speech recognition.
[23] Minimum latency training strategies for streaming sequence-to-sequence ASR.
[24] RWTH ASR systems for Librispeech: Hybrid vs attention.