key: cord-0438119-h4ykqvwt
authors: Dai, Wenliang; Cahyawijaya, Samuel; Yu, Tiezheng; Barezi, Elham J.; Xu, Peng; Yiu, Cheuk Tung Shadow; Frieske, Rita; Lovenia, Holy; Winata, Genta Indra; Chen, Qifeng; Ma, Xiaojuan; Shi, Bertram E.; Fung, Pascale
title: CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition
date: 2022-01-11
journal: nan
DOI: nan
sha: 933e905e70823493197df43aa0fc7264b01a432f
doc_id: 438119
cord_uid: h4ykqvwt
★ These authors contributed equally.
† The work was done when the author was studying at The Hong Kong University of Science and Technology.

With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform the corresponding actions, which eases driving and improves safety. However, data scarcity in low-resource languages hinders the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model achieves considerable quality on the clean test set, speech recognition on the noisy data is still inferior and remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.

Research on intelligent transportation systems covers a broad spectrum, including monitoring driver inattention (Haghani et al., 2021), traffic monitoring for both surveillance and traffic management (Buch et al., 2011), and video-based lane tracking for smart driving assistance (Tang et al., 2021). More recently, the area of self-driving (or autonomous driving) has developed rapidly, attracting a lot of attention from both academia (Van Brummelen et al., 2018; Zhou et al., 2019) and industry. Autonomous driving aims to notably improve driving convenience and safety by automating some of the driving tasks, helping drivers to focus more on driving itself. The rapid growth of artificial intelligence (AI) and deep neural networks has fueled the development of autonomous driving. However, there is still a long way to go before autonomous driving systems reach the highest level of automation and can handle diverse road environments, noise conditions, and driver languages and abilities (Zhou et al., 2019). Voice controllable systems (VCS), which rely on Automatic Speech Recognition (ASR) methods, have made remarkable progress with the advances in deep learning and big data. They allow drivers to use simple voice commands to handle complex operations, which is a paramount future demand of advanced driving vehicles (Zhou et al., 2019).
However, one of the largest issues in building such an ASR system, especially for low-resource languages, is the lack of data (Winata et al., 2020a; Winata et al., 2020b; Lovenia et al., 2021), which is crucial for achieving commendable speech recognition quality in a VCS. On the other hand, the audio-visual speech recognition (AVSR) task is proposed to leverage visual data of the speakers to support ASR (Afouras et al., 2018a; Xia et al., 2020). It has been shown that visual information is beneficial to ASR, especially when the audio itself is noisy or even unavailable. To further push the boundary of this research area and mitigate the aforementioned problems, in this paper, we collect a multimodal dataset in the Cantonese language, called Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), which includes both visual and acoustic signals of drivers reading various in-car commands. Specifically, it contains 200 unique commands recorded by 30 native Cantonese speakers, forming a dataset of 4,984 samples. Furthermore, we augment it with 10 common in-car background noises to simulate real environments, which increases both the scale and the applicability of the dataset. This dataset can benefit both Cantonese and multilingual AVSR research. The visual information provided in the CI-AVSR dataset can further enhance the quality of the ASR task. Various research works using audio-visual information (Zadeh et al., 2019; Dai et al., 2020; Dai et al., 2021) have shown the advantage of utilizing multiple modalities over a single modality. More specifically, recent works in audio-visual speech recognition (Zhou et al., 2019; Wang et al., 2020; Alam et al., 2020; Zhu et al., 2021) have shown that audio-visual information can produce more precise and reliable speech recognition quality compared to using audio data alone, and can further help avoid hidden voice commands that stealthily control such intelligent systems (Carlini et al., 2016). Moreover, since the in-car voice can be quite noisy, using the visual information helps to enhance the audio and further improve the overall ASR performance. Overall, our contributions are threefold: 1) we collect and will release the first AVSR dataset for in-car command recognition in the Cantonese language, which could benefit future research and applications; 2) we augment the data with 10 common background noises to simulate real in-car environments and increase the scale; and 3) we provide detailed statistics of the dataset and train two baseline models to validate its effectiveness.

In the past few years, many multimodal datasets have been released to facilitate the study of in-car speech recognition, driver behavior, emotion recognition, etc. Jenness et al. (2008) conduct an empirical study of drivers' use of voice control systems (VCS) and of measures that could be used to evaluate the possible distraction caused by using these systems while driving. To this end, they collect 30 minutes of driving data on two US roads to explore the performance of the vehicle voice control interface. The data cover various situations, including highways, heavy arterials, and residential and commercial streets. On the other hand, Ivanecký and Mehlhase (2012) introduce a speech dataset in German for addressing the ASR task for physically disabled drivers. This dataset is recorded by 10 speakers, and each of them is asked to record 2 × 30 commands.
Furthermore, to include environmental noises, the speakers conduct the data recording at a distance of 20 to 30 centimeters from the microphone. For the AVSR task, Afouras et al. (2018a) propose a large-scale dataset named LRS2-BBC, which is collected from the British Broadcasting Corporation (BBC). The dataset contains thousands of hours of spoken sentences from various BBC programs that contain talking faces together with transcripts. Moreover, Afouras et al. (2018b) further introduce a multimodal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. More recently, Stappen et al. (2021) propose the MuSe-Sent challenge, which aims to predict five sentiment classes for each emotion dimension (arousal or valence) at the segment level, based on audio-visual recordings from the MuSe-CaR dataset of car review videos. The dataset is crawled from YouTube and contains 291 videos. However, to the best of our knowledge, none of the previous works focuses on in-car command AVSR for low-resource languages, such as Cantonese, which hinders the research and applications of AVSR in those languages.

Various architectures have been proposed to tackle the AVSR task. Petridis et al. (2018) use bidirectional GRUs to encode the visual and acoustic data separately and adopt a late fusion of the two modalities. They show that, even with a simple model, visual information can improve performance. Afouras et al. (2018a) propose to leverage Transformer self-attention and compare the effectiveness of two objectives: the Connectionist Temporal Classification (CTC) loss and the sequence-to-sequence language modeling loss. They show that visual information can further reduce the word error rate by up to 1.2% in a clean environment and 23.2% in a noisy environment on a very large-scale dataset, indicating that the visual information of lip movement is especially beneficial when the audio is noisy. In recent years, attention-based models have been increasingly used to solve the AVSR task. Xu et al. (2020) propose a double visual awareness multi-modality speech recognition (AE-MSR) network that uses the visual information in two steps: first to enhance the noisy audio data, and then as a second modality, using an Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is potentially more effective than the Transformer for long sequences. More recently, Ma et al. (2021a) introduce a novel architecture that combines convolutional layers with the Transformer, achieving state-of-the-art performance on common benchmarks. In this paper, we build two multimodal baseline models to demonstrate the validity of our collected dataset.

In this section, we first describe the data collection pipeline for CI-AVSR, which includes the command template creation, the data recording interface, the recording equipment and format, and the data recording strategy. Then, we provide detailed statistics of the CI-AVSR dataset. Finally, we introduce a data augmentation method to increase the data scale and simulate in-car environments. Before the data recording phase, we first prepare a template of commands for the speakers to read, in which the commands should be diverse and cover common scenarios. We classify the command templates into four general categories: 1) navigation; 2) music playing; 3) weather inquiry; and 4) others.
This categorization strategy allows us to cover all the common scenarios of in-car voice commands while maintaining the diversity of the commands and simplifying the command creation process. In addition to the first three categories, there are other frequently used in-car commands. However, these commands cannot be created by following the aforementioned procedure. We categorize these commands as others, including commands that ask the system to turn on or off some functionality in the car (e.g., air-conditioner, radio, light, etc.), ask the system to broadcast or send messages, and inquire about the status of the car (e.g., how much electricity or gas remains). To ensure the created commands conform to spoken Cantonese, we hire two human experts who are native Cantonese speakers. First, each expert is asked to design command candidates for all categories (command patterns and named entities for the first three, and complete commands for the last). Then, we swap their results and ask each expert to check the other's commands for grammar and correctness. Finally, we assemble all the commands and ask the two experts to filter out command patterns with high similarity to increase the diversity. Statistics of these preliminary commands are shown in Table 2. One drawback of creating commands in this way is that many commands share the same pattern with only small differences in the named entities, which reduces the overall diversity of the commands. To mitigate this problem, we uniformly sample ∼30% of the commands from the first three categories while keeping all the commands from the others category. Finally, after sampling, we collect 200 commands in total, of which 160 are from the first three categories and 40 are from the others category (the full command list will be released along with our dataset at https://github.com/HLTCHKUST/CI-AVSR). We show the distribution of command length in Figure 1.

Data Recording Interface. To reduce the difficulty of data recording, we build a dedicated website so that speakers can conduct the recording on their own computer (equipped with a camera and a microphone) at their own place (this is especially important during the COVID-19 pandemic, as people usually work from home and there is a government regulation on the number of people allowed to gather). The interface of the website is shown in Figure 3. In the top left part, we show a basic introduction for the speakers as well as a tutorial that teaches them how to use the website and how to record. The control panel is on the bottom left, which shows the current command to record, the buttons to perform actions (start/stop the recording and save/download the recorded data), a live preview window showing the real-time camera view, and a playback window to play the last recorded video. Once the speakers are satisfied with the recording of the current command, they can proceed to the next one until all the commands are recorded. On the right-hand side, we show an overview of all the commands that the speakers need to record, which also shows the current progress.

Recording Equipment and Format. Although the previously mentioned setting is convenient for distributing data collection jobs and reducing the cost, the speakers may have different types of cameras and microphones. To mitigate this problem, we ask the speakers to use a camera with at least 720p (1280 × 720 px) resolution, which gives us sufficient room for further processing. After the recording, we crop all videos to 640 × 480 with the person at the center. For the recorded audio, we convert all files to the same format: 16 kHz sampling rate, mono channel, encoded as 16-bit pulse-code modulation (PCM), producing a total bit rate of 256 kbps.
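To make this format standardization concrete, the following is a minimal post-processing sketch using FFmpeg invoked from Python; the file names and the exact FFmpeg invocation are illustrative assumptions rather than the authors' actual pipeline.

```python
import subprocess

def standardize_recording(src: str, video_out: str, audio_out: str) -> None:
    """Hypothetical post-processing sketch: center-crop the video to 640x480 and
    export the audio as 16 kHz, mono, 16-bit PCM (256 kbps)."""
    # Center-crop the video frames to 640x480 (assumes the speaker sits near the center).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", "crop=640:480", "-an", video_out],
        check=True,
    )
    # Extract the audio track as 16 kHz mono 16-bit PCM in a WAV container.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vn", "-ar", "16000", "-ac", "1",
         "-c:a", "pcm_s16le", audio_out],
        check=True,
    )

# Example usage (file names are placeholders):
# standardize_recording("raw_clip.mp4", "clip_640x480.mp4", "clip_16k.wav")
```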
Data Recording Strategy. We divide the whole data recording process into two stages. In the first stage, we perform a preliminary data collection session to test the robustness of our system and the correctness of the created commands. To this end, we hire 10 native Cantonese speakers (5 males and 5 females) and ask each of them to record 100 commands. Based on their feedback, we improve the user experience of the website, provide more detailed instructions in the tutorial, and add a more robust auto-saving strategy to the website to avoid data loss. In the second stage, we expand the data collection scale by hiring 20 native Cantonese speakers (10 males and 10 females) and asking each of them to record the full list of 200 commands. Across the two stages, 5,000 samples are collected. We perform a manual check on each of the data samples and filter out 16 samples with low quality. Therefore, we have 4,984 data samples in the final dataset, with a total duration of 30,049 seconds. The distribution of sample duration is shown in Figure 2, in which the last column on the right represents samples longer than 11 seconds. Finally, as illustrated in Table 3, we split the data into training, validation, and test sets by speaker while maintaining a balanced gender distribution in each split.

Split   #Male (Dur.)     #Female (Dur.)   Total Dur.
Train   10 (10,803s)     10 (11,813s)     22,616s
Valid   2 (1,849s)       2 (1,829s)       3,678s
Test    3 (1,902s)       3 (1,843s)       3,745s

Table 3: Statistics of the train/valid/test splits of our recorded dataset. We split the data by speaker id, i.e., speakers in one set do not appear in another. Here, Dur. denotes duration and s represents seconds.

Figure 3: The web interface for conducting data recording. It helps us to distribute the work and reduce the time cost. Details are explained in Section 3.2.

To simulate in-car environments and increase the data scale, we augment each sample from the collected dataset by combining it with 10 different in-car background noises that are commonly heard in the daily usage of cars (the sounds are downloaded from https://freesound.org/), including alarm, horn, background music, ignition, hail, rain, windscreen wiper, road ambience, door opening and closing, and people talking. For each type of noise, we use five variants to increase the diversity and uniformly sample one when applying it to the data. The volume of the noise is adjusted by human experts so that the original commands are still recognizable and all the sounds are at the same level of loudness. Therefore, the resulting augmented dataset is 10 times as large as the original clean one and more in line with actual in-car scenarios, which could potentially benefit the generalization and applicability of the trained model.
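As an illustration of this augmentation, a minimal mixing sketch is shown below. The directory layout, gain value, and helper names are assumptions for clarity; in the actual dataset the per-noise volumes were tuned manually by human experts.

```python
import random
import numpy as np
import soundfile as sf

# The ten noise categories described above; five variants per category are assumed
# to be stored as mono 16 kHz WAV files under a hypothetical "noises/" folder.
NOISE_TYPES = ["alarm", "horn", "music", "ignition", "hail", "rain",
               "windscreen_wiper", "road_ambience", "door", "talking"]

def mix_with_noise(clean_path: str, noise_path: str, out_path: str,
                   noise_gain: float = 0.3) -> None:
    """Overlay one background noise onto a clean command recording.
    noise_gain stands in for the manually tuned per-noise volume."""
    speech, sr = sf.read(clean_path)
    noise, noise_sr = sf.read(noise_path)
    assert sr == noise_sr, "resample the noise to 16 kHz beforehand"
    # Loop or truncate the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    mixed = np.clip(speech + noise_gain * noise, -1.0, 1.0)
    sf.write(out_path, mixed, sr, subtype="PCM_16")

def augment_sample(clean_path: str) -> None:
    """Create one augmented copy per noise type, uniformly sampling one of five variants."""
    for noise_type in NOISE_TYPES:
        variant = random.randint(1, 5)
        noise_path = f"noises/{noise_type}_{variant}.wav"
        mix_with_noise(clean_path, noise_path,
                       clean_path.replace(".wav", f"_{noise_type}.wav"))
```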
In this section, we aim to demonstrate the validity of our collected multimodal Cantonese data and the effectiveness of using the visual signals to help speech recognition.

Conformer-based multimodal ASR model. Following Ma et al. (2021b), for the audio input, we use a modified ResNet-18 (He et al., 2016) with all 2D-convolution layers changed to 1D, followed by a Convolution-augmented Transformer (Conformer) (Gulati et al., 2020) encoder to further process the representations. The visual input is likewise first processed by a ResNet-18, with the first layer replaced by a 3D convolutional layer, and then by another Conformer encoder to obtain the final visual representations. The audio and visual representations are fused by a Multi-Layer Perceptron (MLP), which consists of two linear layers with batch normalization and a ReLU activation in between. Finally, we apply a GPT-based (Radford et al., 2019) four-layer decoder to decode the audio-video information into text.

Wav2Vec 2.0-based multimodal ASR model. The pre-trained Wav2Vec 2.0 model (Baevski et al., 2020) gives strong speech representations and shows excellent performance on downstream tasks. We further extend it with a video encoder to allow audio-visual speech recognition. We adopt a ResNet-18 model pre-trained on ImageNet (Russakovsky et al., 2015) to get the visual representations from each video frame. Moreover, we incorporate another 1D-convolutional layer with a kernel size of 3 to add interaction between frame-level features. To fuse the representations from the two modalities, we align the audio frame rate with the video frame rate and sum the representations from both modalities. Lastly, we apply a linear transformation to convert the fused representations into predicted tokens.

Data Preprocessing. In our experiments, we preprocess the multimodal data to obtain a standardized input format for the audio and the image data. For the audio data, we keep the 16 kHz mono-channel format and normalize the values to zero mean and unit standard deviation. For the image data, we extract 25 image frames per second from the video using FFMPEG (Tomar, 2006). To reduce the computational cost, we extract only the lip part of each image with the face landmark detection module (Kazemi and Sullivan, 2014) from the DLIB library (King, 2009), using a face landmark detection model that is pre-trained on the iBUG 300-W face dataset (Sagonas et al., 2016). For the Wav2Vec 2.0 model, we fine-tune the whole model's parameters using a learning rate of 5e-5 and a batch size of 16 by minimizing the CTC loss (Graves et al., 2006) of the output with respect to the label.
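The lip-region extraction in the preprocessing step above can be sketched with dlib as follows; the predictor file name, crop margin, and fallback behavior are illustrative assumptions rather than the exact settings used for CI-AVSR.

```python
from typing import Optional

import cv2
import dlib
import numpy as np

# 68-point predictor trained on iBUG 300-W; the file name below is the one commonly
# distributed with dlib's examples and is assumed here.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip_region(frame_bgr: np.ndarray, margin: int = 12) -> Optional[np.ndarray]:
    """Return a crop around the mouth landmarks (points 48-67), or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    mouth = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    x0, y0 = mouth.min(axis=0) - margin
    x1, y1 = mouth.max(axis=0) + margin
    h, w = gray.shape
    x0, y0 = max(int(x0), 0), max(int(y0), 0)
    x1, y1 = min(int(x1), w), min(int(y1), h)
    return frame_bgr[y0:y1, x0:x1]
```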
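Similarly, the summation-style fusion described for the Wav2Vec 2.0-based baseline can be sketched as below. The feature dimensions, the linear interpolation used to align the two frame rates, and the module names are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SumFusionASRHead(nn.Module):
    """Sketch of summation fusion: project per-frame visual features, add temporal
    interaction with a kernel-3 1D convolution, align the visual stream to the audio
    frame rate, sum the two streams, and map the result to per-frame token logits
    for CTC. Dimensions and module names are illustrative assumptions."""
    def __init__(self, audio_dim: int = 768, video_dim: int = 512, vocab_size: int = 3000):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, audio_dim)
        self.temporal_conv = nn.Conv1d(audio_dim, audio_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(audio_dim, vocab_size)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_a, audio_dim), e.g. wav2vec 2.0 outputs (~50 features/s)
        # video_feats: (B, T_v, video_dim), per-frame ResNet-18 features at 25 fps
        v = self.video_proj(video_feats).transpose(1, 2)       # (B, D, T_v)
        v = self.temporal_conv(v)                              # frame-level interaction
        v = F.interpolate(v, size=audio_feats.size(1),
                          mode="linear", align_corners=False)  # align to the audio rate
        fused = audio_feats + v.transpose(1, 2)                # element-wise sum fusion
        return self.head(fused)                                # (B, T_a, vocab) logits
```

One appeal of summing time-aligned streams, as opposed to concatenation or cross-attention, is that the fusion adds almost no parameters beyond the projection, which is convenient when one modality comes from a large pre-trained encoder.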
Evaluation Details. We use the character error rate (CER) (Wang et al., 2013) rather than the word error rate (WER) as the evaluation metric because the Cantonese language is character-based. In detail, the CER is calculated by adding the numbers of substituted, inserted, and deleted characters together and then dividing the sum by the total number of characters. For the multimodal Conformer model, the output transcripts are generated in an auto-regressive manner with a beam search size of 4 and a length penalty of 1. For the multimodal Wav2Vec 2.0 model, the output transcription is generated using CTC decoding.

We train the Conformer-based and Wav2Vec-based models only on the clean training set of our data with two settings: 1) audio-only; and 2) multimodal (audio and video). As shown in Table 4, we evaluate their performance on both the clean and the augmented (noisy) test sets. Compared to the audio-only setting, models trained on multimodal data achieve a lower CER by a large margin on both clean and noisy data. Moreover, similar to prior work (Afouras et al., 2018a; Xu et al., 2020; Ma et al., 2021a), we find that the visual signal contributes more when the audio is noisy. The rich visual data provide complementary and supplementary information to the audio data for the models to generate transcripts. However, the performance of the Conformer-based models is far from satisfactory. We conjecture that the model overfits the training set since our clean training data is limited, which could be mitigated by training on the augmented data. On the other hand, the pre-trained Wav2Vec-based model has strong generalization ability and shows promising results after fine-tuning. Additionally, to ablate the effect of each background noise, in Table 5 we show the test results for each of them using the Wav2Vec-based model. The model is more sensitive to alarm and horn than to the other noises, and the visual information helps more in these situations.

Table 5: The character error rate of the Wav2Vec 2.0 model (trained on the clean training set only) on the augmented noisy data. We report its performance on each type of noise and the average over them. Avg (0 to 9): 14.34% / 7.99%.

In this paper, we propose a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for audio-visual speech recognition of in-car commands. It consists of 200 unique commands with 8.3 hours of recorded data. Furthermore, we augment the dataset with 10 commonly heard background sounds to simulate real scenarios, resulting in more than 80 hours of data. We evaluate the collected data with two baseline models, showing the effectiveness of AVSR. When testing on the augmented data with background noises, we observe a clear performance drop, which we believe is a challenging and interesting future research direction to tackle.
Deep audio-visual speech recognition
Lrs3-ted: a large-scale dataset for visual speech recognition
Survey on deep neural networks in speech and vision systems
wav2vec 2.0: A framework for self-supervised learning of speech representations
A review of computer vision techniques for the analysis of urban traffic
Hidden voice commands
Modality-transferable emotion embeddings for low-resource multimodal emotion recognition
Multimodal end-to-end sparse model for emotion recognition
Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks
Conformer: Convolution-augmented transformer for speech recognition
Structural anatomy and temporal trends of road accident research: Full-scope analyses of the field
Deep residual learning for image recognition
An in-car speech recognition system for disabled drivers
Use of advanced in-vehicle technology by young and older early adopters
One millisecond face alignment with an ensemble of regression trees
Dlib-ml: A machine learning toolkit
Adam: A method for stochastic optimization
Ascend: A spontaneous chinese-english dataset for code-switching in multi-turn conversation
End-to-end audio-visual speech recognition with conformers
End-to-end audio-visual speech recognition with conformers
End-to-end audiovisual speech recognition
Language models are unsupervised multitask learners
Imagenet large scale visual recognition challenge
300 faces in-the-wild challenge: database and results
The multimodal sentiment analysis in car reviews (muse-car) dataset: Collection, insights and improvements
A review of lane detection methods based on deep learning
Converting video formats with ffmpeg
Autonomous vehicle perception: The technology of today and tomorrow
A new word language model evaluation metric for character based languages
SIEVE: Secure In-Vehicle automatic speech recognition systems
Lightweight and efficient end-to-end speech recognition using low-rank transformer
Meta-transfer learning for code-switched speech recognition
Audiovisual speech recognition: A review and forecast
Discriminative multi-modality speech recognition
Factorized multimodal transformer for multimodal sequential learning
Hidden voice commands: Attacks and defenses on the vcs of autonomous driving cars
Deep audio-visual learning: A survey