title: VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge
authors: Nagrani, Arsha; Chung, Joon Son; Huh, Jaesung; Brown, Andrew; Coto, Ernesto; Xie, Weidi; McLaren, Mitchell; Reynolds, Douglas A; Zisserman, Andrew
date: 2020-12-12

We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020. The goal of this challenge was to assess how well current speaker recognition technology is able to diarise and recognize speakers in unconstrained or 'in the wild' data. It consisted of: (i) a publicly available speaker recognition and diarisation dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2020. This paper outlines the challenge, and describes the baselines, methods used, and results. We conclude with a discussion of the progress over the first installment of the challenge.

In 2019 we introduced the VoxCeleb Speaker Recognition Challenge [1] (VoxSRC), a new series of speaker recognition challenges that are intended to be hosted annually. The primary goals of VoxSRC are to: (i) explore and promote new research in speaker recognition 'in the wild'; (ii) measure and calibrate the performance of the current state of technology through public evaluation tools; and (iii) provide open-source data freely accessible to all in the research community. While speech technologies have developed rapidly during the last few decades (with a large focus on automatic speech recognition and speaker verification), speaker recognition and diarisation under noisy and unconstrained conditions are still extremely challenging. Applications of speaker recognition are many and varied, ranging from authentication in high-security systems and forensic tests, to high-fidelity search for persons in large corpora of speech data. For such systems to be deployed in the real world, it is crucial that they work under unconstrained conditions, with noisy, varied and sometimes very short and fleeting speech segments.

After a successful challenge and workshop in 2019, there was a constructive discussion on the limitations of VoxSRC2019 and ideas for extensions that would improve the quality of the challenge (Sec. 6 of [1]). These were (i) the use of more metrics than simply EER, (ii) the addition of new tasks, and (iii) given the large number of submissions leading to saturation on the VoxSRC19 test set, a more challenging test set. We are pleased to note that VoxSRC2020 successfully incorporated all three of these suggestions. The single biggest change for the challenge this year was the addition of a brand new task, speaker diarisation, for which we also released a new dataset called VoxConverse [2]. For speaker verification, an additional self-supervised track was added, and new metrics were introduced for both tasks (namely DCF for speaker verification, and JER and DER for speaker diarisation). Finally, we created an extremely challenging test set for speaker verification by incorporating out-of-domain data from movie material [3, 4].

In this paper, we describe the details of the evaluation task, the datasets provided, the challenge evaluation results and subsequent discussion. Further details can be found on the challenge website.

There were two tasks in this challenge, speaker verification and speaker diarisation.
Speaker verification is the task of determining whether a given pair of speech utterances are from the same speaker or not, while speaker diarisation aims to break up multi-speaker audio into homogeneous single-speaker segments, effectively solving 'who spoke when'. Within the task of speaker verification we had three different tracks, each constraining the data allowed for training models, though with a common test set and evaluation metrics. The challenge consisted of the following four tracks:

1. Speaker Verification - Closed
2. Speaker Verification - Open
3. Speaker Verification - Self-supervised (Closed)
4. Speaker Diarisation - Open

The first two tracks were identical to those in VoxSRC2019 [1] (see also Sec. 2.2). Tracks 3 and 4 were new in VoxSRC2020. Inspired by recent successes in self-supervised learning [5, 6, 7, 8], we introduced Track 3, where participants could not use any speaker labels during training; however, they were also allowed to use the visual modality (faces) from the videos. We also introduced a speaker diarisation track (Track 4), as a new task this year. The open and closed training conditions refer to the training data allowed, and are described in Sec. 2.2.

The VoxCeleb datasets were still the primary datasets for the speaker verification tracks (Tracks 1-3). For VoxSRC2020 we also used a new set of speaker recognition segments from movie material, called VoxMovies [3], to create more challenging validation and test sets. For speaker diarisation (Track 4), we introduced a new diarisation dataset from YouTube called VoxConverse [2].

The VoxCeleb datasets [9, 10, 11] consist of speech segments from unconstrained YouTube videos for several thousand individuals, and were created using an automatic pipeline. For a full description of the pipeline and an overview of the datasets, see [9].

Train set (Closed and Open Conditions): The closed training condition required that participants train only on the VoxCeleb2 dev dataset [10], which contains 1,092,009 utterances from 5,994 speakers. For the open training condition, participants could use the VoxCeleb datasets and any other data, except for the challenge's test data.

Val and Test sets: We provided a challenging validation set to participants to examine the performance of their models before uploading results to the evaluation server, in addition to the actual test set, which was released a month before the challenge results were due. Unlike the validation set, the test set was blind, i.e. the speech segments were released but with no annotations. The test data was released strictly for reporting of results alone; participants were not allowed to use this data in any way to train or tune systems. The validation set consisted of trial pairs of speech from the identities in the VoxCeleb1 dataset, while the test set consisted of disjoint identities not present in either VoxCeleb1 or VoxCeleb2. Each trial pair consisted of two single-speaker audio segments of variable length.

In order to make a challenging validation and test set, we obtained some out-of-domain data for the same identities for which we had YouTube interview data. These more challenging audio segments were sourced from the VoxMovies dataset [3], which contains speech segments from movie clips [4]. The VoxCeleb-sourced segments, while valuable, are collected entirely from interviews on YouTube, and are limited in terms of linguistic content, emotion and background noise. On the other hand, the VoxMovies segments are sourced from an entirely different domain, i.e.
movies, and contain speech covering many different emotions, accents, and varied background noise for the same identities. This out-of-domain data offers a significant challenge to state-of-the-art speaker recognition systems, as shown in [3]. The statistics of the val and test sets can be found in Table 1. The val and test data were checked for errors using a combination of automatic and manual techniques, using the same procedure described in [9] and following an identical procedure to VoxSRC19 [1]. In accordance with the feedback from last year, the challenge did not have same-session trials (e.g. segments from the same interview) in the test and validation sets.

The VoxConverse [2] dataset is an audio-visual speaker diarisation dataset which includes 526 videos from YouTube. These videos are mostly from debates, talk shows and news segments. It has multi-speaker, variable-length audio segments, with some overlap, and with challenging background conditions. Inspired by other audio-visual dataset creation pipelines such as VoxCeleb [11] or VGGSound [12], it is generated from an automatic audio-visual speaker diarisation method using active speaker detection [13], audio-visual source separation [14] and speaker verification. Only the audio files are used for this challenge. Please refer to [2] for a more detailed description.

Train set: Since the diarisation track is in the open training condition, participants are allowed to use any public or internal datasets except for the test data. We provide a dev set for training and validation consisting of 216 wav files covering 1,216 minutes. The average number of speakers is 4.5 and the average overlap percentage of speech per video is 3.8%.

Val and Test sets: We encouraged participants to use the VoxConverse dev set to validate their models. The VoxConverse test set contains 310 wav files, which total 56 hours. It is more challenging than the VoxConverse dev set since both the average number of speakers and the video duration are higher, and the proportion of the audio track that is speech is lower. Details for both sets are described in Table 2 of [2].

We released a validation toolkit (https://github.com/a-nagrani/VoxSRC2020) for both speaker verification and speaker diarisation. Participants were encouraged to evaluate their models using this public code on the validation set of each track.

Speaker verification. For the speaker verification tracks (Tracks 1-3), we reported two metrics, the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF). EER is a popular metric for evaluating the performance of speaker verification: it is the error rate at the operating threshold for which the false acceptance rate (FAR) and false rejection rate (FRR) are equal. The detection cost at a threshold $\theta$ is computed as $C_{Det}(\theta) = C_{miss} \cdot P_{miss}(\theta) \cdot P_{tar} + C_{fa} \cdot P_{fa}(\theta) \cdot (1 - P_{tar})$, and minDCF is its minimum over $\theta$. This is the same as the primary metric of the NIST SRE 2018 evaluation [22]. We set $C_{miss} = C_{fa} = 1$ and $P_{tar} = 0.05$ in our cost function. For Tracks 1 and 2, the primary metric was minDCF and the final ranking was determined by this score alone. For Track 3, the primary metric was EER. For both metrics, a lower score is better.

Speaker diarisation. For Track 4, we adopted two diarisation metrics, the Diarisation Error Rate (DER) and the Jaccard Error Rate (JER). DER is used as the primary evaluation metric in this track. DER is a standard evaluation metric for speaker diarisation: it is the sum of the speaker error, false alarm speech and missed speech. We applied a forgiveness collar of 0.25 sec, and overlapping speech was not ignored.
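As a concrete illustration of the two verification metrics above (and before turning to the second diarisation metric), the following minimal NumPy sketch computes EER and minDCF from a list of trial scores and target/non-target labels. It is not the official evaluation toolkit: it simply sweeps every score as a candidate threshold, and it assumes the usual NIST-style normalisation of the detection cost by the default cost min(C_miss P_tar, C_fa (1 - P_tar)), which is not spelled out above.

```python
import numpy as np

def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Compute EER and normalised minDCF from trial scores and 0/1 labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    labels = labels[np.argsort(scores)]        # sort trials by score, ascending

    n_tar = labels.sum()
    n_non = len(labels) - n_tar

    # Rejecting the i lowest-scoring trials gives one operating point:
    # P_miss rises from 0 to 1 while P_fa falls from 1 to 0.
    p_miss = np.concatenate(([0.0], np.cumsum(labels) / n_tar))
    p_fa = np.concatenate(([1.0], 1.0 - np.cumsum(1 - labels) / n_non))

    # EER: the operating point where the two error rates are closest to equal.
    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[idx] + p_fa[idx]) / 2

    # Detection cost C_Det at every threshold, minimised and normalised.
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    min_dcf = c_det.min() / min(c_miss * p_target, c_fa * (1 - p_target))
    return eer, min_dcf
```

For example, eer_and_min_dcf([0.1, 0.9, 0.4, 0.8], [0, 1, 0, 1]) returns (0.0, 0.0), since these toy scores separate the two target trials from the two non-target trials perfectly.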
We also reported the Jaccard Error Rate (JER), a metric introduced for the DIHARD II challenge that is based on the Jaccard index. The Jaccard index is a similarity measure typically used to evaluate the output of image segmentation systems, and is defined as the ratio between the intersection and union of two segmentations. To compute the Jaccard error rate, an optimal mapping between reference and system speakers is determined, and for each pair the Jaccard index of their segmentations is computed. The Jaccard error rate is then 1 minus the average of these scores. For more details please consult Section 3 of the DIHARD challenge report [23].

We provided baselines (with open-sourced code) for all tracks to help new participants get started. For the fully-supervised speaker verification tracks (Tracks 1 and 2), we provided a baseline consisting of a Fast ResNet-34 backbone trained on 40-dimensional mel spectrograms. The architecture and training procedures are described in detail in [24]. The baseline achieved a minDCF of 0.477 and an EER of 7.68% on the test set. For the self-supervised track, we trained a baseline model with a Fast ResNet-34 backbone, using contrastive learning and data augmentation (additive noise and room impulse responses). Several similar techniques have already been introduced [8, 25]. This model achieved a minDCF of 0.877 and an EER of 19.07% on the test set. For Track 4, we used the baseline system from the second DIHARD challenge [23]. This baseline is adopted from the JHU submission to the first DIHARD challenge [26], which exploited standard clustering-based speaker diarisation: speaker embeddings are extracted with a sliding-window approach, followed by probabilistic linear discriminant analysis (PLDA) scoring and agglomerative hierarchical clustering (AHC). We did not apply speech enhancement as preprocessing, resulting in 21.75% DER and 51.89% JER on the challenge test set.

Similar to the previous year, the challenge was hosted via CodaLab with two phases: "Challenge workshop" and "Permanent". Participants could make only one submission per day, in order to prevent overfitting on the challenge test set. Submissions for the "Challenge workshop" phase were accepted until the 16th of October, 2020. For the last 48 hours before the final deadline, the leaderboard was made anonymous. All teams that participated in the challenge were required to submit a challenge report describing their system by the 23rd of October, 2020. The workshop was held on the 30th of October, 2020 in conjunction with Interspeech 2020.

In total, 272 submissions were made across all four tracks of the challenge. The top three performances for each track are shown in Tables 2 and 3. Please refer to the challenge website for more detailed results.

Speaker verification. Interestingly, the winners of all three speaker verification tracks were the same team [17]. For the fully-supervised tracks, the winner explored 6 variants of ECAPA-TDNN systems [27] and 4 variants of the ResNet34 [28] architecture. For the open track, the winner used the VoxCeleb1 dev set as well as additional speech data from the train-other-500 set of the LibriSpeech dataset [29] and a subset of the DeepMine corpus [30] (babble, noise).
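Both the verification baselines above and, as described next, the top submissions rely heavily on waveform-level data augmentation with additive noise and simulated reverberation. The sketch below shows the two basic operations. It is an illustration rather than any participant's actual code, and it assumes the noise clip and room impulse response are already loaded as floating-point arrays at the same sample rate as the speech.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix a noise clip into the speech signal at a chosen SNR (in dB)."""
    if len(noise) < len(speech):               # loop short noise clips
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve the speech with a room impulse response and rescale its peak."""
    wet = fftconvolve(speech, rir, mode="full")[:len(speech)]
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-10))
```

In practice such augmentations are applied on the fly during training, with the noise clip, SNR and impulse response drawn at random for each example.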
Several data augmentation techniques were adopted for both the open and closed tracks, including additional noise samples from the MUSAN corpus [31], reverberation with RIR filters [32], and the SoX and FFmpeg libraries for adjusting the tempo and compression of speech. Large-margin fine-tuning with an AAM-softmax layer [33] and quality-aware score calibration were also applied, which led to improvements of 3% and 8% respectively in terms of EER on the VoxSRC-20 test set. The final submission scored 0.177 minDCF in Track 1 and 0.174 in Track 2. The second-placed team for both Tracks 1 and 2 [16] used a fusion of ResNeXt [34], Res2Net [35] and dual path network [36] models for the speaker network. They additionally explored a wide range of AAM-softmax margin values, output embedding dimensions and network sizes. Score normalisation was used to boost performance. This team achieved a minDCF of 0.196 in Track 1 and 0.194 in Track 2. We note that, similarly to last year, the gap between the winning methods of Tracks 1 and 2 is not large (0.177 vs 0.174 minDCF), despite the fact that Track 2 (the open condition) allows additional training data.

In the self-supervised track, both the first and second place used a similar training framework. In both cases they (1) trained the network using contrastive learning, (2) generated pseudo-labels based on the model from the first stage, and (3) trained the network in a supervised way using these pseudo-labels. The first-place team [17] exploited Momentum Contrast (MoCo) [37] for the first stage, followed by iterative clustering using both efficient mini-batch k-means and Agglomerative Hierarchical Clustering (AHC) to generate pseudo-speaker labels. A large ECAPA-TDNN was then trained with these labels using a sub-center AAM-softmax [38] layer. This was a similar technique to that used by the winners [17] in their fully-supervised track submissions. The second-placed team exploited a contrastive learning framework similar to [39, 40], employed k-means with an extra purification step for pseudo-labels, and then trained the network for classification using a cross-entropy loss in the final stage. The first and second place achieved EERs of 7.21% and 12.42% on the challenge test set, respectively.

Speaker diarisation. Although this was the first appearance of a speaker diarisation track in a VoxSRC challenge, 43 submissions from 17 different teams were made. The performances of the three highest-scoring teams are shown in Table 3. The winner [21] of this track exploited various novel techniques in their system. The audio input was first processed by a conformer-based [41] continuous speech separation (CSS) technique, resulting in two separated channels. The individual channels were then fed into a Res2Net-based [35] speaker embedding extractor, which was trained with an AM-softmax [42] loss. This was followed by Agglomerative Hierarchical Clustering (AHC) with leakage filtering. Moreover, the outputs from multiple systems were fused using a modified version of the voting-based algorithm DOVER [43]. The winner achieved 6.23% DER on our test set. The second-placed team [20] adopted an LSTM-based speech enhancement technique, a ResNet152-based speaker embedding extractor and Variational Bayes Hidden Markov Model (VB-HMM) clustering. They also applied global speaker embedding re-clustering and an LSTM-based overlapping speech detector [44] trained on the AMI corpus [45] for post-processing.
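As with the baseline, most systems in this track are built around the same clustering recipe: speaker embeddings are extracted over a sliding window and then grouped into speakers, for example with agglomerative hierarchical clustering. The sketch below illustrates that recipe in its simplest form; embed_fn is a placeholder for any pretrained speaker embedding extractor, and the voice activity detection, overlap handling, resegmentation and fusion steps that the systems above add on top are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def diarise(waveform, sr, embed_fn, win=1.5, hop=0.75, threshold=0.6):
    """Minimal clustering-based diarisation: returns (start, end, speaker) tuples."""
    # 1. Embed overlapping windows slid over the recording.
    win_n, hop_n = int(win * sr), int(hop * sr)
    starts = list(range(0, max(1, len(waveform) - win_n + 1), hop_n))
    emb = np.stack([embed_fn(waveform[s:s + win_n]) for s in starts])
    if len(emb) < 2:                       # too short to cluster: one speaker
        return [(0.0, len(waveform) / sr, 1)]

    # 2. Agglomerative clustering of the window embeddings on cosine distance.
    tree = linkage(pdist(emb, metric="cosine"), method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")

    # 3. Merge consecutive windows with the same label into speaker segments.
    segments = []
    for s, lab in zip(starts, labels):
        t0, t1 = s / sr, (s + win_n) / sr
        if segments and segments[-1][2] == lab and t0 <= segments[-1][1]:
            segments[-1] = (segments[-1][0], t1, lab)
        else:
            segments.append((t0, t1, lab))
    return [(round(a, 2), round(b, 2), int(c)) for a, b, c in segments]
```

The distance threshold implicitly decides how many speakers are found; the top systems replace this single heuristic with speech separation, leakage filtering, overlap detection and system fusion, as described above.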
The second-placed system described above achieved 8.12% DER and 18.35% JER on our test set, ranking first in terms of JER.

Due to COVID-19, and in line with Interspeech 2020, the VoxSRC 2020 workshop was held entirely virtually as a Zoom webinar. Once again the workshop was free of charge for anybody to attend. The number of attendees peaked at over 150 during the event, with a constant attendance of over 100 for the duration of the workshop. The workshop consisted of two keynote presentations: the first, from Dr Daniel Garcia-Romero, gave a detailed summary of the recent history of speaker verification methods, titled "X-vectors: Neural Speech Embeddings for Speaker Recognition"; while the second, from Professor Shinji Watanabe, discussed methods for speaker diarisation, titled "Tackling Multispeaker Conversation Processing based on Speaker Diarisation and Multispeaker Speech Recognition". The latter was in line with the introduction of a diarisation challenge track (Track 4). Additionally, there were announcements of the winners of each challenge track, and short presentations from the winners where they gave an overview of their methods. After each presentation, the speakers answered questions live from attendees. Over 50 live questions were asked of the keynote speakers and challenge winners. All slides and recorded videos from the workshop are available at http://www.robots.ox.ac.uk/~vgg/data/voxceleb/interspeech2020.html. The workshop was kindly sponsored by Naver Corporation.

Tracks 1, 2 and 3 are focused on speaker recognition, which has been explored by the NIST-SRE (Speaker Recognition Evaluation) series [22, 46, 47], held since 1996 to measure state-of-the-art speaker recognition systems. Researchers from both academia and industry are encouraged to participate in NIST; however, unlike NIST, all training data for VoxSRC is released publicly to the research community, even to those not participating in the challenge. Other challenges on speaker verification focus on noisy conditions [48] or the far-field condition [49]. Track 4 is complementary to several existing audio speaker diarisation challenges. The DIHARD challenges [23, 50] are potentially the most popular. They evaluate state-of-the-art systems on extreme, "hard" conditions. Both the dev and test sets cover various background conditions, such as audiobooks, broadcast interviews, and restaurants. The third installment of the challenge [51] will be concluded in early 2021. Unlike VoxSRC, the challenge does not provide explicit training data, and hence any public or private data can be used for training models. Additionally, the DIHARD challenge applies no forgiveness collar during evaluation and also has two separate diarisation tracks, one with oracle VAD and another with system VAD. Another popular challenge is the CHIME-6 challenge, where participants perform both speaker diarisation and speech recognition for multi-speaker conversations held in kitchen, dining and living room areas. The challenge data was recorded using binaural microphones and 4-channel microphone arrays, and the number of participants is fixed for each session. More details are provided in [52].

The workshop had wider attendance this year, potentially due to the virtual format. Additionally, all talks were pre-recorded and made accessible on the website, providing future access.
While we hope future workshops will be in person, to encourage open access we will endeavor to record and livestream presentations during future workshops.

Participation in the challenge was lowest for the self-supervised track, potentially because this is still a new area for speaker verification. We also note that the best performance for the self-supervised track (0.345 minDCF) is still far behind the fully supervised tracks (0.177 minDCF) on the same test set. We further note that all methods used audio only, with the visual modality not being utilised at all. This year the self-supervised track was closed (participants could only train on the VoxCeleb2 dev set), but in future years we may introduce additional tracks to determine if self-supervised methods can outperform fully supervised ones given sufficient training data (following the trend in computer vision [37, 39] and other areas).

While the VoxSRC2020 test set was larger than the VoxSRC2019 test set, with more challenging audio samples included from movie material, the VoxSRC2019 test set was still included in its entirety as a subset of this year's test set, allowing us to easily measure the performance of all submissions made this year on the 2019 test set alone. Table 4 shows the performance of the top-2 submissions from VoxSRC2020 on both the 2019 and the 2020 test sets (bottom two rows), demonstrating that the 2020 test set is far more challenging. We also compare performance on the 2019 test set with that of the winner of VoxSRC2019. The top-2 winners of the challenge this year significantly outperformed 2019's winner, demonstrating the vast improvement in speaker verification performance over one year. Somewhat surprisingly, unlike the results shown in Table 2, team xx205's performance is better than team JTBD's performance on the 2019 test set.

Table 4: EER (%) on the VoxSRC2019 and VoxSRC2020 test sets.
System                      VoxSRC2019 test    VoxSRC2020 test
VoxSRC2019 winner [53]      1.42               -
JTBD (VoxSRC2020) [17]      0.80               3.73
xx205 (VoxSRC2020) [16]     0.75               3.81
Acknowledgements. We also thank Rajan from Elancer and his team, http://elancerits.com/, for their huge assistance with diarisation annotation for VoxConverse.

References
Voxsrc 2019: The first voxceleb speaker recognition challenge
Playing a part: Speaker verification at the movies
Condensed movies: Story based retrieval with contextual embeddings
Self-supervised speaker embeddings
Momentum contrast speaker representation learning
Disentangled speech embeddings using crossmodal self-supervision
Augmentation adversarial training for unsupervised speaker recognition
Voxceleb: Large-scale speaker verification in the wild
Voxceleb2: Deep speaker recognition
VoxCeleb: a large-scale speaker identification dataset
Vggsound: A large-scale audio-visual dataset
Out of time: automated lip sync in the wild
The conversation: Deep audio-visual speech enhancement
ID R&D system description to voxceleb speaker recognition challenge 2020
The xx205 system for the voxceleb speaker recognition challenge 2020
The IDLAB voxceleb speaker recognition challenge 2020 system description
The DKU-DukeECE systems for voxceleb speaker recognition challenge 2020
The upc speaker verification system submitted to voxceleb speaker recognition challenge 2020 (voxsrc-20)
Analysis of the but diarization system for voxconverse challenge
Microsoft speaker diarization system for the voxceleb speaker recognition challenge 2020
NIST 2018 Speaker Recognition Evaluation Plan
The second dihard diarization challenge: Dataset, task, and baselines
In defence of metric learning for speaker recognition
Semi-supervised contrastive learning with generalized contrastive loss and its application to speaker recognition
Diarization is hard: Some experiences and lessons learned for the jhu team in the inaugural dihard challenge
Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification
Deep residual learning for image recognition
Librispeech: an asr corpus based on public domain audio books
Deepmine speech processing database: Text-dependent and independent speaker verification and speech recognition in persian and english
Musan: A music, speech, and noise corpus
A study on data augmentation of reverberant speech for robust speech recognition
Sub-center arcface: Boosting face recognition by largescale noisy web faces
Aggregated residual transformations for deep neural networks
Res2net: A new multi-scale backbone architecture
Dual path networks
Momentum contrast for unsupervised visual representation learning
Sub-center arcface: Boosting face recognition by largescale noisy web faces
A simple framework for contrastive learning of visual representations
A framework for contrastive self-supervised learning and designing a new approach
Conformer: Convolution-augmented transformer for speech recognition
Additive margin softmax for face verification
Dover: A method for combining diarization outputs
Pyannote.audio: neural building blocks for speaker diarization
Unleashing the killer corpus: experiences in creating the multi-everything ami meeting corpus
The 2016 nist speaker recognition evaluation
The 2019 nist speaker recognition evaluation cts challenge
The voices from a distance challenge 2019 evaluation plan
The interspeech 2020 far-field speaker verification challenge
First dihard challenge evaluation plan
Third dihard challenge evaluation plan
Chime-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings
BUT system description to voxceleb speaker recognition challenge 2019