title: FAIR4Cov: Fused Audio Instance and Representation for COVID-19 Detection
authors: Truong, Tuan; Lenga, Matthias; Serrurier, Antoine; Mohammadi, Sadegh
date: 2022-04-22

Audio-based classification techniques on body sounds have long been studied to support diagnostic decisions, particularly in pulmonary diseases. In response to the urgency of the COVID-19 pandemic, a growing number of models have been developed to identify COVID-19 patients based on acoustic input. Most models focus on cough because the dry cough is the best-known symptom of COVID-19. However, other body sounds, such as breath and speech, have also been revealed to correlate with COVID-19. In this work, rather than relying on a specific body sound, we propose Fused Audio Instance and Representation for COVID-19 Detection (FAIR4Cov). It relies on constructing a joint feature vector obtained from a plurality of body sounds in waveform and spectrogram representation. The core component of FAIR4Cov is a self-attention fusion unit that is trained to establish the relation of multiple body sounds and audio representations and integrate it into a compact feature vector. We set up our experiments on different combinations of body sounds using only waveform, only spectrogram, and a joint representation of waveform and spectrogram. Our findings show that the use of self-attention to combine extracted features from cough, breath, and speech sounds leads to the best performance, with an Area Under the Receiver Operating Characteristic Curve (AUC) score of 0.8658, a sensitivity of 0.8057, and a specificity of 0.7958. This AUC is 0.0227 higher than that of the models trained on spectrograms only and 0.0847 higher than that of the models trained on waveforms only. The results demonstrate that combining the spectrogram with the waveform representation helps to enrich the extracted features and outperforms the models with a single representation.

Our body produces innumerable sounds every day, but most of the time we do not pay enough attention to them. Body sounds are known to reveal an individual's state of health. A slight change in the physical state can affect the responsible organ and consequently produce irregular sound patterns. For example, snoring is a common sound produced when certain parts of the body, such as the tongue or pharynx, block the airflow as we breathe during sleep. Snoring alone is in general not considered symptomatic, but if it is coupled with breathing pauses, it can be a symptom of obstructive sleep apnea. More generally, body sounds can be used extensively to support diagnostic decisions. In particular, auscultation is a medical procedure in which a clinician listens to the internal sounds of the body using a stethoscope. Organs and systems such as the heart, lungs, and gastrointestinal system are usually listened to in order to detect abnormal patterns. In respiratory diseases such as pneumonia, auscultation can be performed to look for crackles or percussion dullness, an indication of fluid in the lungs. Hence, body sound analysis is part of automated diagnostic applications in, for example, respiratory diseases [1, 2, 3, 4], Parkinson's disease [5], and sleep apnea [6].
Although detecting irregular internal sounds might be insufficient for a conclusive diagnostic decision, it serves as a hallmark that can be combined with the results of other clinical tests obtained from different diagnostic tools to reach a conclusion. In this article, we study an audio-based approach to detect Coronavirus Disease 2019 (COVID-19), a disease caused by Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2). SARS-CoV-2 most heavily infects the respiratory tract [7]. Therefore, infected individuals express flu-like symptoms, which can often be mistaken for a cold or flu. Complications are also typically related to pulmonary disorders, such as pneumonia or acute respiratory distress syndrome. The best diagnostic approach is viral testing, often done using nucleic acid tests such as polymerase chain reaction (PCR) to detect viral RNA fragments. Although a gold standard, PCR tests return the result after approximately 4-6 hours, excluding the delivery time, and can take up to 24 or 48 hours. As the ultimate goal of management strategies is to break the infection chain by quickly identifying suspected cases for immediate isolation or quarantine, a test with such a long waiting time as PCR is not optimal. In addition, PCR testing requires qualified staff and well-equipped facilities to operate, which are hardly accessible in remote and low-income regions. An alternative test, known as the antigen test, can return results in less than 30 minutes by identifying viral proteins with specific antibodies. Although antigen tests are highly suitable for mass testing, they are less sensitive. Meanwhile, since SARS-CoV-2 infects mainly the respiratory system, it induces changes in body sounds by either modifying them, e.g., dysphonia, or creating them, e.g., cough or breath sounds. Several studies show that these changes are specific to COVID-19. For example, a study by [8] finds abnormal breathing sounds in all COVID-19 patients. The irregular sounds include crackles, asymmetrical vocal resonance, and indistinguishable murmurs. In a different study of vocal changes in COVID-19 individuals [9], the authors validate the hypothesis that vocal fold oscillations are correlated with COVID-19, inducing not only changes in voice but also the inability to speak normally. Body sounds therefore have the potential to serve standalone or in parallel with antigen tests to detect COVID-19.

There are several advantages of using body sounds for screening COVID-19. First, because PCR testing capacities are limited, screening with body sounds, alone or in conjunction with antigen tests, can help prioritize who is eligible for PCR tests. If anyone with flu-like symptoms can order a PCR test, demand will soon exceed the testing capacity. Only suspected cases indicated by body sound screening would then proceed with PCR tests. Body sound screening can rapidly identify suspected cases without asking them to quarantine while waiting for PCR results. Second, similar to antigen tests, body sound screening is fast, affordable, and conveniently conducted without medical professionals. The cost of running body sound screening can even be lower than that of antigen tests because it can be installed as software or a mobile application on any device and uses the device microphone. Users do not need to buy additional support kits and can use their device to record, analyze, and monitor their status an unlimited number of times. This is particularly useful in regions or countries where testing capacities are scarce, inaccessible, or expensive.
Lastly, compared to antigen tests, it does not lead to (medical) waste because no physical products are manufactured, which alleviates the burden on the environment. Given these advantages, the potential of body sounds for screening COVID-19 is enormous. However, a fully developed screening system using body sounds is not yet available. Current research on COVID-19 detection considering multiple body sounds often focuses on individual sounds and does not consider their interaction [10, 11]. We hypothesize, on the contrary, that the effects of COVID-19 may occur in different body sounds, or in different combinations of them, for different individuals. One or more body sounds may be affected while the others remain intact. It is thus sensible not to rely on a single body sound but rather on a combination of several body sounds. We propose to combine the most meaningful body sounds that are indicative of COVID-19, expressed in terms of fusion rules within the detection algorithm. Our hypothesis is stated as follows: the cough, breath, and speech sounds contain biomarkers that are indicative of COVID-19 and can be combined using an appropriate fusion rule to maximize the chances of correct detection. To this end, we propose self-attention as a fusion rule to combine features extracted from cough, breath, and speech sounds. We use waveforms and spectrograms as input to our model. A waveform represents an audio signal in the time domain, whereas a spectrogram is a representation in the time-frequency domain. Our main contributions in this work are summarized as follows:

• We demonstrate that cough, breath, and speech sounds can be leveraged to detect COVID-19 in a multi-instance audio classification approach based on self-attention fusion. Our experimental results indicate that combining multiple audio instances exceeds the performance of single-instance baselines.

• We experimentally show that an audio-based classification approach can benefit from combining waveform and spectrogram representations of the input signals. In other words, inputting the time- and frequency-domain dual representations to the network allows for a richer latent feature space, which finally improves the overall classification performance.

• We integrate the above contributions into FAIR4Cov, a classification approach that combines multiple instances of body sounds in waveform and spectrogram representations to classify COVID-19 negative and positive individuals. This approach can be extended to other respiratory diseases beyond COVID-19.

We briefly present the related work in body sound analysis for pulmonary diseases with a primary emphasis on the COVID-19 use case. Before COVID-19, there was already a well-established line of research on body sound analysis for pulmonary disorders such as tuberculosis, pneumonia, or Chronic Obstructive Pulmonary Disease (COPD). Due to the urgency of the pandemic, this field of research has expanded and seen growing interest in newly developed techniques and collected datasets. The majority of studies are centered on traditional machine learning techniques, building a classifier on extracted audio features of cough or respiratory sounds. Botha et al. [4] investigate multichannel lung sounds of 50 subjects from a multimedia respiratory database [13] to classify COPD. The authors develop a Deep Belief Network using features extracted with the Hilbert-Huang transform [14]. The model achieves 93.67% accuracy, 91% sensitivity, and 96.33% specificity.
Xu et al. [15] propose a multi-instance learning framework to process raw cough recordings and detect multiple pulmonary disorders, including asthma and COPD. The presented framework achieves an F1-score of more than 0.8 in classifying healthy vs. unhealthy, obstructive vs. non-obstructive, and COPD vs. asthma. A recent and detailed review of disease classification from cough sounds can be found in [16].

Unlike datasets of other pulmonary diseases, there are large corpora of COVID-19-related audio recordings collected through crowdsourcing. Voluntary participants submit recordings of their body sounds to a mobile app or website and provide metadata such as their COVID-19 status and comorbidities. Such large datasets enable researchers to develop COVID-19 detection algorithms as well as to benchmark their research work. To our knowledge, the largest crowdsourcing datasets are COUGHVID [17], Coswara [18], and Covid-19 Sounds [19]. COUGHVID comprises more than 20000 cough recordings, while Coswara and Covid-19 Sounds consist of cough, breath, and vocal sounds from more than 2000 and 30000 participants, respectively. In terms of technical development, a few studies follow traditional machine learning approaches with extracted audio features [20, 21, 22, 10]. The most common audio features are still MFCC, log Mel spectrogram, ZCR, and kurtosis. Fakhry et al. [20] propose an ensemble network of a ResNet50 and an MLP on MFCC and Mel spectrograms of cough recordings to classify COVID-19 individuals. The proposed solution claims an AUC of 0.99 on the COUGHVID dataset. In a similar approach, the study by [21] benchmarks 15 audio features in the time and frequency domains for the COVID-19 detection task using cough and breath sounds. Their findings indicate that spectral features slightly outperform cepstral features in the classification task, and the best models are obtained using an SVM and a Random Forest classifier, with AUCs of 0.8768 and 0.8778, respectively. Several studies adopt Deep Learning approaches by training CNNs on spectrograms or waveforms instead of extracted audio features [23, 24, 25, 11]. Rao et al. [23] present a VGG13 network [26] on spectrograms with a combined cross-entropy and focal loss. The approach achieves an AUC of 0.78 on the COUGHVID dataset. Xia et al. [24] provide an analysis of combined cough, breath, and speech sounds using a simple VGG-ish model. The study introduces the combination of the features of various body sounds to improve classification performance. The best configuration reaches an AUC of 0.75 and a sensitivity and specificity of 0.70. Other studies attempt pretraining on an external dataset or on the same dataset without labels. The pretrained model is later finetuned on the target dataset with labels [27, 28, 25]. Harvill et al. [27] pretrain on all samples of the COUGHVID dataset using autoregressive predictive coding with a Long Short-Term Memory network. The Mel spectrogram is split into several frames, and the model attempts to predict the next frame given the previous frames. The pretrained model is later finetuned on the DiCOVA dataset [29] and achieves an AUC of 0.95. Similarly, Pinkas et al. [25] pretrain a transformer-based architecture to predict the next frame of the spectrogram and transfer the pretrained features to a set of RNN expert classifiers. The final prediction is the average of the scores produced by all expert classifiers. The proposed training scheme reaches a sensitivity of 0.78 on a private dataset collected by the authors.
Xue and Salim [28] propose using contrastive learning in a self-supervised pretraining phase. The contrastive pairs are created by randomly masking the inputs. The model is pretrained on the Coswara dataset without labels and finetuned on Covid-19 Sounds in the downstream task. The proposed technique achieves an AUC of 0.90 in the COVID-19 negative vs. positive classification task.

In previous research, cough sounds have often been studied more than other body sounds. This is reasonable because dry cough is a known symptom of COVID-19. However, different body sounds, either used together with cough or individually, are reported to reach a performance comparable to or better than cough sounds. For example, Suppakitjanusant et al. [11] compare two separately trained models on cough and speech and show that speech outperforms cough in classifying COVID-19 patients. Xia et al. [24] also achieve their highest performance by concatenating features of cough, breath, and speech. Unlike research works that usually study each body sound independently [11] or combine them by voting on prediction scores [10], we explore fusion rules that combine them at the feature level. In other words, we train a network that learns a joint feature vector of all body sounds. Hence, the joint feature vector is optimized to implicitly reflect the relative importance of each body sound towards the final prediction. Although our work falls along the lines of [24], we investigate a more complex fusion rule than simply concatenating features. We use self-attention [30], which captures the dependencies among body sounds in a joint feature vector. Self-attention is used not only as a layer in the transformer architecture but can also be used to aggregate features [31]. In addition, instead of using handcrafted audio features, we train our model on waveform and spectrogram representations, therefore creating more robust features compared to previous methods. We evaluate our approach on the Coswara dataset and achieve state-of-the-art results. We report the average performance of the models obtained from cross-validation on a split test set (Section 4.2). However, we emphasize that there is no unique test set generated for the Coswara dataset and the data size was still growing at the time we conducted our experiments.

We begin this section by first summarizing self-attention [30], which is an attention mechanism used as the fusion rule and as a layer in the backbone networks of our approach. We then describe the proposed FAIR4Cov approach with its components in detail and how they interact with each other to extract the features of body sounds.

Self-attention [30] was originally developed for language models. A sequence in language models consists of many tokens (e.g., words) that the model needs to memorize to synthesize the global information of that sequence. However, memorizing a long sequence is not always possible, and the model is likely to forget the tokens that appear early in the sequence. Self-attention therefore seeks to find a set of highly important tokens in the sequence and diverts the focus of the model onto them. These tokens are chosen because they are highly similar in content. Instead of memorizing the whole sequence, the model just needs to memorize these tokens because they carry the same (important) message repeatedly along the sequence. Let I be an input sequence of n tokens of dimension d.
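For reference, the computation spelled out in the next paragraph is the standard scaled dot-product self-attention of [30]; written in the notation used here, it reads

\[
Q = I W_Q, \qquad K = I W_K, \qquad V = I W_V, \qquad
W_a = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \qquad
\mathrm{Attention}(Q, K, V) = W_a V .
\]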
The fundamental components of a self-attention layer are the query (Q), key (K), and value (V), which are projections of the input sequence I with weights W_Q, W_K, and W_V. An n × n self-attention matrix W_a is then computed by taking the dot product of each query token with the n keys. Each row i of W_a denotes the similarity scores of query token i with the n keys, including itself. The dot product is scaled by √d to stabilize the gradients, hence the name scaled dot product. Next, a softmax function is applied across each row of W_a to normalize the scores between 0 and 1. The final output is the product of W_a and V. The output has the same n tokens, but each new token is the sum of the tokens in V weighted by the corresponding row of W_a. In other words, every new token i is constructed based on the similarity of query token i with the other tokens. Each self-attention layer can comprise several heads running in parallel, which is called multi-headed self-attention [30]. The intuition behind multiple heads is that each head pays attention to a different property of the input tokens. Assuming h is the number of heads, the self-attention layer outputs h different sequences, where each token has length d/h. The output tokens are then concatenated across the sequences, forming the final output of shape n × d.

We present in this section our proposed architecture. Let D = {(x^(i), y^(i))} be the dataset, where x^(i) collects the audio recordings of the i-th subject and y^(i) denotes the corresponding COVID-19 label. Our objective is to derive a representative feature vector for the c body sounds per subject across waveform and spectrogram inputs. We denote by x^(i) = [x^(i)_1, ..., x^(i)_{2c}] the aggregated input instance related to the i-th subject, where the first c entries are the waveform representations of the c body sounds and the entries c+1, ..., 2c are the corresponding spectrogram representations. The FAIR4Cov approach takes the input x^(i) of the i-th subject and returns a joint feature vector z^(i) that aggregates the information across multiple body sounds and representations, as shown in the following equation:

z^(i) = φ( g_w(x^(i)_1), ..., g_w(x^(i)_c), g_s(x^(i)_{c+1}), ..., g_s(x^(i)_{2c}) ).

Here g_w and g_s denote neural networks that extract features from the waveform and spectrogram inputs, and φ is the attention-based fusion unit. Figure 1 shows an overview of the FAIR4Cov approach and the main components along the pipeline. The feature extractors and the fusion unit are instrumental components of our proposed approach and are further detailed in the next sections.

Feature extractors are neural networks responsible for learning representative features for each body sound. As the input consists of waveforms and spectrograms, two neural networks g_w and g_s are trained in parallel to handle both representations. In each network, the weights are shared across the input channels 1, .., c and c+1, ..., 2c, respectively. We choose g_w to be a pretrained wav2vec [33] and g_s to be DeiT-S/16, a Vision Transformer (ViT) model [34]. DeiT-S/16 and wav2vec are transformer-based models that achieve state-of-the-art results in vision and speech tasks, respectively.

wav2vec. The wav2vec network [33] was developed to process audio for the speech-to-text translation task. It comprises both convolutional and self-attention layers and is pretrained on a large audio corpus in an unsupervised fashion. Therefore, we take advantage of the pretrained wav2vec features and design a finetuning unit to effectively leverage them on our target dataset.

We denote by f^(i)_k, with k = 1, .., c, the feature vector extracted for the k-th body sound of the i-th subject, combining its waveform and spectrogram representations. The fusion unit φ combines the f^(i)_k, k = 1, .., c, into a single vector z^(i) by using a multi-headed self-attention layer (MSA) and an MLP h. The output of the MSA for each subject i is a new set of feature vectors, which are subsequently aggregated by the MLP h into the joint feature vector z^(i).
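To make the fusion step concrete, the following is a minimal PyTorch sketch of an attention-based fusion unit in the spirit of φ. The embedding size, the number of heads, and the flatten-then-MLP aggregation are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion unit: multi-headed self-attention (MSA) over the
    per-instance feature vectors f_1, ..., f_c, followed by an MLP h that
    aggregates them into a single joint vector z."""

    def __init__(self, d: int = 128, num_heads: int = 4, num_instances: int = 7):
        super().__init__()
        self.msa = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads,
                                         batch_first=True)
        self.mlp = nn.Sequential(                      # the MLP h (sizes assumed)
            nn.Linear(num_instances * d, d),
            nn.GELU(),
            nn.Linear(d, d),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, c, d) -- one d-dimensional feature vector per body-sound instance
        attended, _ = self.msa(f, f, f)                # self-attention across instances
        z = self.mlp(attended.flatten(start_dim=1))    # aggregate into a joint vector
        return z                                       # (batch, d)

# Example: 7 body-sound instances per subject, batch of 8 subjects
phi = AttentionFusion()
z = phi(torch.randn(8, 7, 128))   # -> shape (8, 128)
```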
There is no restriction on the duration of the recordings, so participants can decide when they want to start and stop recording. The first step in the preprocessing pipeline is to remove the leading and trailing silence. We observe that long recordings (>20 seconds) mainly contain silence and that the portion in which people actually cough, breathe, or speak lasts only 3-10 seconds. Hence, we automatically trim the silence and keep only the clip with detectable amplitude. The next important step is to remove corrupted files. We define corrupted files as those that contain no sound, only noise, or a different sound type than the one reported in the label. First, we remove recordings whose duration is less than 1 second because they do not contain any detectable sound. Second, similar to the approach of [24], we use a pretrained model called YAMNet to systematically remove recordings where the detected sound does not match the provided label. YAMNet is a model pretrained on YouTube audio to classify 521 sound events, including cough, speech, and breath. If most of the predicted events in a recording are not cough, speech, or breath, we remove all the recordings associated with this participant. In addition, we decide not to use shallow cough and shallow breath in our experiments because the quality of such recordings is low and they can be misdetected as noise. Altogether, 710 participants were discarded from the initially curated dataset through this process, and the 1359 remaining participants were considered for our analyses. Out of them, 1136 people (83.6%) are COVID-19 negative and 223 people are COVID-19 positive. Each participant has exactly 7 recordings, which amounts to 9513 recordings used in our experiments. We provide an overview of the resulting dataset in Table 1. For audio processing and transformation, we use Torchaudio, a library for audio and signal processing built on PyTorch.

Hyperparameters. Table 4 shows the complete hyperparameter settings of our experiments. Most hyperparameters are identical across architectures, representations, and fusion rules. For example, we train all models for 30 epochs without early stopping, and the best checkpoint is saved based on the best AUC obtained on the validation fold. The loss function that we use is binary cross-entropy (BCE), and we optimize this loss with AdamW (Adam with weight decay) [36]. Concerning the learning rate, we fix a base learning rate of 0.0001 for all experiments and adjust the learning rate scheduler and weight decay conditional on the architecture or fusion rule. The weight decay factor is set between 0.1 and 0.001.

Evaluation. Our primary metric for model selection is AUC. During training, we save the checkpoint with the highest performance based on AUC. During validation, we use the ROC curve underlying the AUC to compute the optimal threshold, and we take this threshold to compute other metrics such as sensitivity and specificity on the test set. We report the AUC scores in the main paper and provide the sensitivity and specificity in the Appendix.
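As a concrete example of this evaluation step, the snippet below derives an operating threshold from the validation ROC curve using Youden's J statistic and then reports sensitivity and specificity at that threshold. The paper does not state which criterion defines the optimal threshold, so Youden's J is an assumption here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def pick_threshold(y_true, y_score):
    """Choose the threshold maximising Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmax(tpr - fpr)
    return thresholds[best]

def evaluate(y_true, y_score, threshold):
    """AUC plus sensitivity/specificity at a fixed operating threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# The threshold is chosen on the validation fold and reused on the test set:
# threshold = pick_threshold(y_val, p_val)
# metrics = evaluate(y_test, p_test, threshold)
```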
Table 5 shows the performance of the models trained on a single body sound instance. The input to the model is either a waveform (BA1) or a spectrogram (BA2). The results reveal that the models trained on spectrograms perform substantially better than those trained on waveforms: the average AUC scores are 0.6127 ± 0.0751 for the waveform models and 0.7549 ± 0.0215 for the spectrogram models.

Table 5: AUC scores of the baseline experiments for spectrogram (DeiT-S/16) and waveform (wav2vec) models. The bold scores denote the highest performance between spectrogram and waveform.

Table 6 shows the results for the different combinations of body sounds. As can be seen in Table 6, the AUC scores are not equal across all combinations of body sounds. Thus, it is valid to ask whether there is a preferable combination of body sounds that leads to the best predictive outcome. However, neither our experimental results nor the literature allow a conclusive choice of the best combination or selection rule. Instead, we argue that no single body sound is significantly better than the others and that the performance correlates rather with the number of body sounds in the combination. To illustrate this point, we examine how the performance of our model changes with the number of body sounds included in the combination. In addition to the number of body sounds in each combination, the varying duration of each instance can have an influence on the results. In this study, we truncate each recording to 4 seconds. However, a body sound such as cough could last less than 4 seconds, and the rest of the audio may contain only breathing. A finer analysis taking this aspect into consideration should be conducted in a follow-up study.

We analyze the effect of the dual representation of spectrogram and waveform in the absence of body sound fusion by conducting an ablation study similar to the FAIR4Cov framework but with a single body sound as input. As there is no body sound fusion in this setting, the features extracted from the two representations are concatenated, flattened, and then projected onto a 128-dimensional vector by an MLP layer. Similar to the baseline experiment, we present the AUC scores of the seven models trained on the 7 body sound instances in Table 7. Overall, the average AUC score of 0.7519 ± 0.0137 is on par with that of the DeiT-S/16 model (BA2) in Table 5.

Table 7: Baseline performance (in AUC) of FAIR4Cov on a single body sound.

The fusion rule for body sounds relies on self-attention. One of the interesting properties of self-attention is scaling, which is discussed in the work of Dosovitskiy et al. [34]. The authors note that the performance of transformer-based models can be scaled up in response to an increase in patch resolution or number of blocks. This contrasts with convolutional networks, in which the accuracy can saturate at a certain level of complexity. This scaling property explains why adding more body sounds leads to a steady increase in AUC scores: adding more body sounds means adding more tokens and establishing stronger dependencies among them. When only two or three instances of body sound are adopted, the effect of body sound fusion is less significant. The combinations with at most three instances, i.e., cough-breath, fast and normal counting, and /a-e-o/ vowel utterance, achieve AUC scores in the range of 0.75-0.79, which is on par with or slightly better than the performance of the models on a single instance (Table 7). This happens because the number of instances is insufficient to establish long-range dependencies. As more body sounds are added, these dependencies are captured, and the performance of the models with fusion units starts to improve substantially. Likewise, the joint feature vector embeds more information when the dual representation is adopted. When the number of instances in the combination is small, i.e., less than three, the gain due to the dual representation is not noticeable. However, starting from five instances, the gap between FAIR4Cov and DeiT-S/16 becomes wider in favor of FAIR4Cov. We attribute this gain to the resonance of the extra information given by the dual representation and the number of body sounds, efficiently captured by the self-attention fusion rule.
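For completeness, the following sketch shows how a clip can be turned into the waveform plus log-Mel spectrogram pair that forms the dual representation discussed above; the Mel parameters (64 bins, 1024-point FFT, hop length of 256) are illustrative assumptions and not the settings used in the paper.

```python
import torch
import torchaudio

def dual_representation(waveform: torch.Tensor, sample_rate: int = 16000):
    """Return the raw waveform (time domain) together with a log-Mel
    spectrogram (time-frequency domain) of the same clip."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=64
    )
    to_db = torchaudio.transforms.AmplitudeToDB()
    spectrogram = to_db(mel(waveform))   # shape: (channels, n_mels, frames)
    return waveform, spectrogram

# Example: a 4-second mono clip at 16 kHz
wave = torch.randn(1, 4 * 16000)
wave, spec = dual_representation(wave)   # wave feeds g_w, spec feeds g_s
```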
In this article, we study Deep Learning approaches to detect COVID-19 using body sounds. To this end, we propose FAIR4Cov, a multi-instance audio classification approach with attention-based fusion on waveform and spectrogram representations. We demonstrate the effectiveness of our approach by conducting extensive experiments on the Coswara dataset. The results show that the fusion of body sounds using self-attention helps extract richer features for the classification of COVID-19 negative and positive patients. In addition, we perform an in-depth analysis of the influence of the fusion rule on performance. We find that the scaling property of self-attention shows great efficiency when more instances of body sounds as well as more representations are adopted. The best setting, with a combination of cough, breath, and speech sounds in waveform and spectrogram representation, results in an AUC score of 0.8658, a sensitivity of 0.8057, and a specificity of 0.7958 on our test set. The sensitivity of our model exceeds 0.75, the required threshold for COVID-19 screening tests [37].

References

[1] Diagnosis of pneumonia from sounds collected using low cost cell phones.
[2] COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings.
[3] Detection of tuberculosis by automatic cough sound analysis.
[4] Deep Learning on Computerized Analysis of Chronic Obstructive Pulmonary Disease.
[5] PDVocal: Towards Privacy-preserving Parkinson's Disease Detection using Non-speech Body Sounds.
[6] Apnea and heart rate detection from tracheal body sounds for the diagnosis of sleep-related breathing disorders.
[7] Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response.
[8] The respiratory sound features of COVID-19 patients fill gaps between clinical data and screening methods (preprint).
[9] Detection of Covid-19 Through the Analysis of Vocal Fold Oscillations.
[10] Artificial intelligence enabled preliminary diagnosis for COVID-19 from voice cues and questionnaires.
[11] Identifying individuals with recent COVID-19 through voice classification using deep learning.
[12] Automatic cough classification for tuberculosis screening in a real-world environment.
[13] Multimedia Respiratory Database (RespiratoryDatabase@TR): Auscultation Sounds and Chest X-rays.
[14] The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis.
[15] Listen2Cough: Leveraging End-to-End Deep Learning Cough Detection Model to Enhance Lung Health Assessment Using Passively Sensed Audio.
[16] Past and Trends in Cough Sound Acquisition, Automatic Detection and Automatic Classification: A Comparative Review.
[17] The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms.
[18] Coswara - A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis.
[19] Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data.
[20] Virufy: A Multi-Branch Deep Learning Network for Automated Detection of COVID-19.
[21] Audio feature ranking for sound-based COVID-19 patient detection.
[22] COVID-19 cough classification using machine learning and global smartphone recordings.
[23] Deep Learning with hyper-parameter tuning for COVID-19 Cough Detection.
[24] Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track.
[25] SARS-CoV-2 Detection From Voice.
[26] Very Deep Convolutional Networks for Large-Scale Image Recognition.
[27] Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation.
[28] Exploring Self-Supervised Representation Ensembles for COVID-19 Cough Classification.
[29] DiCOVA Challenge: Dataset, task, and baseline system for COVID-19 diagnosis using acoustics.
[30] Attention is All you Need. Advances in Neural Information Processing Systems.
[31] How Transferable are Self-supervised Features in Medical Image Classification Tasks?
[32] Signal estimation from modified short-time Fourier transform.
[33] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
[34] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[35] Training data-efficient image transformers & distillation through attention.
[36] Decoupled Weight Decay Regularization.
[37] Comparative sensitivity evaluation for 122 CE-marked rapid diagnostic tests for SARS-CoV-2 antigen.

Appendix A. Self-attention fusion with only waveform inputs