key: cord-0439706-ji2049p8
authors: Ismail, Mahmoud Al; Deshmukh, Soham; Singh, Rita
title: Detection of COVID-19 through the analysis of vocal fold oscillations
date: 2020-10-21
journal: nan
DOI: nan
sha: 528388a33d7ba32ffcb853be062f2eda12b866b9
doc_id: 439706
cord_uid: ji2049p8

Phonation, or the vibration of the vocal folds, is the primary source of vocalization in the production of voiced sounds by humans. It is a complex bio-mechanical process that is highly sensitive to changes in the speaker's respiratory parameters. Since most symptomatic cases of COVID-19 present with moderate to severe impairment of respiratory functions, we hypothesize that signatures of COVID-19 may be observable by examining the vibrations of the vocal folds. Our goal is to validate this hypothesis, and to quantitatively characterize the changes observed to enable the detection of COVID-19 from voice. For this, we use a dynamical system model for the oscillation of the vocal folds, and solve it using our recently developed ADLES algorithm to yield vocal fold oscillation patterns directly from recorded speech. Experimental results on a clinically curated dataset of COVID-19 positive and negative subjects reveal characteristic patterns of vocal fold oscillations that are correlated with COVID-19. We show that these are prominent and discriminative enough that even simple classifiers such as logistic regression yield high detection accuracies using just the recordings of isolated extended vowels.

The vibration of the vocal folds is the primary source of voicing (or phonation) in humans [1]. The membranes that comprise the vocal folds are partially tethered by the muscles, cartilage and ligaments surrounding them, allowing them to open and close the glottal area, and to vibrate in response to the passage of air through the glottis. As a result of their structure and physical placement in the larynx, they have characteristic eigen-modes of vibration, or eigen-frequencies at which they can independently vibrate. These are a function of the biophysical properties of the vocal folds, such as their length, thickness and elasticity. During phonation, the vibrations of the two vocal fold membranes synchronize or lock at one of their many eigen-frequencies. Both the oscillations of the vocal folds during phonation and this entrainment (or synchrony during vibration) result from an intricate balance of aerodynamic forces across the glottis. These forces are directly dependent on the respiratory functions of the speaker, among other factors [2], and are highly sensitive to changes in them. The oscillation patterns of the vocal folds, the symmetry of their motion as the glottis opens and closes, and the frequencies at which they synchronize (or the extent of their synchrony) can all be very easily compromised by fine fluctuations in the airflow dynamics of the upper respiratory tract, or even by slight impairments of any of the laryngeal muscles. Disturbances in any of these factors can cause the vocal folds to vibrate in an asymmetrical and asynchronous fashion, and to fail to lock due to unstable eigen-modes.

Clinical observations of symptomatic patients of COVID-19 have so far revealed that this virus moderately or often seriously impairs the functions of the lower and mid respiratory tract, including that of the lungs, airways and musculature of the respiratory tract. Patients who are symptomatic and have tested positive for COVID-19 as the underlying cause have not only reported changes in their voice, but also a general inability to produce voice normally. This leads us to hypothesize that the vocal folds of these persons are likely to exhibit anomalies in their oscillation patterns during phonation, and that these can be used to detect COVID-19 from voice. The goal of this paper is to validate this hypothesis.

As of now, literature on detecting COVID-19 from voice, coughs and other respiratory sounds is recent and sparse [3]. One study [4] has attempted to detect COVID-19 by analyzing the speech envelope, pitch, cepstral peak prominence and the formant center-frequencies. This study observes high-rank eigen-values tending toward relatively lower energy in post-COVID-19 cases, but does not provide strict interpretations. Researchers have also used crowd-sourced data [5, 6] with data-driven end-to-end deep learning methods for this purpose. However, the data remain scarce, and deep learning models are prone to over-fitting: there is no guarantee that the network will specifically learn only COVID-19 related characteristics, and not speaker-specific characteristics.

A controlled medical study that is of special relevance to our work is reported by Huang et al. [7], which uses stethoscope data from lung auscultation to analyze the breathing patterns of COVID-19 patients. In this study, recorded audio signals were analyzed by six independent physicians. All COVID-19 patients were observed to have abnormal breath sounds like crackles, asymmetrical vocal resonances and indistinguishable murmurs. These results were reported to be consistent with CT scans of the 9th intercostal cross-section of the corresponding patient. The study found concrete evidence of the association of abnormal breath sounds and asymmetries in vocal resonances with COVID-19 infection. This study suggests that COVID-19 affects the source signal that excites the vocal tract, which implicates abnormalities in vocal fold oscillations. While it supports our hypothesis that observing vocal fold oscillations may yield information relevant to detection of COVID-19, it is infeasible to make such direct observations of patient symptoms (using a stethoscope, or using high-speed videography of vocal fold motion) at scale for widespread diagnostic purposes.

In our work, we use the much more scalable and accessible approach of computationally deducing the oscillations of the vocal folds directly from recorded speech signals. The algorithmic details of this approach are given in Sec. 2. Experiments on clinically curated data reveal the presence of clear bio-markers of COVID-19 in the vocal fold oscillation patterns, in the estimated glottal flow, and in the residuals obtained. In Sec. 3 we discuss these, and analyze their usefulness in detecting COVID-19 using multiple classifiers.

Of the several mathematical models of phonation proposed in the past decades [8, 9, 10, 11, 12, 13], the 1-mass asymmetric body-cover model [8] is of particular interest to us due to its ability to capture asymmetry in the oscillation of left and right vocal folds. We briefly describe this model below. Fig. 1 shows a schematic diagram of the vocal folds. As they vibrate, the horizontal displacements of the left and right vocal folds ($x_l$ and $x_r$) are measured with reference to the center of the glottis (central dashed line). $x_0$ represents displacements at rest. The model measures the displacements at the location (yellow dots) where the folds are half their maximum thickness (τ). The length of the vocal folds d is normal to the plane of the figure and not shown.

The asymmetric 1-mass body-cover model is described by the set of coupled non-linear differential equations:

where α is the coupling coefficient between the supra- and sub-glottal pressure, β incorporates mass, spring and damping coefficients of the vocal folds, and ∆ is an asymmetry coefficient. For a male adult with normal voice, their values (calculated from actual videographic measurements) average to around α ≈ 0.25, β ≈ 0.32 and ∆ ≈ 0.
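The behavior of a model of this kind can be explored numerically. The sketch below is not taken from the paper: it assumes the coupled van der Pol-type form that is commonly written for this family of models (cf. [9], [14]), namely $\ddot{x}_r + \beta(1 + x_r^2)\dot{x}_r + x_r - \frac{\Delta}{2}x_r = \alpha(\dot{x}_r + \dot{x}_l)$ and $\ddot{x}_l + \beta(1 + x_l^2)\dot{x}_l + x_l + \frac{\Delta}{2}x_l = \alpha(\dot{x}_r + \dot{x}_l)$, in dimensionless time. The function names, initial displacements, step size and the fixed-step fourth-order Runge-Kutta integrator are all illustrative choices.

```python
def vocal_fold_rhs(state, alpha, beta, delta):
    """Time derivative of (x_r, v_r, x_l, v_l) for the ASSUMED model form."""
    x_r, v_r, x_l, v_l = state
    coupling = alpha * (v_r + v_l)  # term shared by both folds
    a_r = coupling - beta * (1.0 + x_r * x_r) * v_r - (1.0 - delta / 2.0) * x_r
    a_l = coupling - beta * (1.0 + x_l * x_l) * v_l - (1.0 + delta / 2.0) * x_l
    return (v_r, a_r, v_l, a_l)


def rk4_step(state, dt, alpha, beta, delta):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))

    k1 = vocal_fold_rhs(state, alpha, beta, delta)
    k2 = vocal_fold_rhs(shifted(k1, dt / 2.0), alpha, beta, delta)
    k3 = vocal_fold_rhs(shifted(k2, dt / 2.0), alpha, beta, delta)
    k4 = vocal_fold_rhs(shifted(k3, dt), alpha, beta, delta)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(alpha, beta, delta, c_r=0.1, c_l=0.1, dt=0.01, steps=5000):
    """Return the trajectory as a list of (x_r, v_r, x_l, v_l) tuples.

    The folds start at displacements c_r and c_l with zero velocity
    (hypothetical values chosen only for illustration).
    """
    state = (c_r, 0.0, c_l, 0.0)
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, dt, alpha, beta, delta)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    # Parameter values quoted in the text for a normal adult male voice.
    normal = simulate(alpha=0.25, beta=0.32, delta=0.0)
    # A hypothetical asymmetric case, for comparison.
    asymmetric = simulate(alpha=0.25, beta=0.32, delta=0.5)
    print("max |x_r - x_l|, delta = 0.0:", max(abs(s[0] - s[2]) for s in normal))
    print("max |x_r - x_l|, delta = 0.5:", max(abs(s[0] - s[2]) for s in asymmetric))
```

Plotting displacement against velocity for each fold (for example `s[0]` against `s[1]`) gives a phase space trajectory under these assumptions.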

The solution of the dynamical system above yields the displacement, velocity and acceleration of the vocal folds as a set of time series. The time series corresponding to $x_r$ and $x_l$ represent the oscillations of the vocal folds. To obtain these, the forward problem of estimating the time series must be jointly solved with the inverse problem of estimating the parameters of the dynamical system themselves. In [14], we introduced the ADLES algorithm that achieves this by minimizing the error between the glottal flow waveform obtained by inverse filtering, and the vocal fold oscillations predicted by the model as its parameter space is sampled. This joint estimation algorithm is briefly explained in the section below.

During phonation, the vocal tract (of length L) acts as a filter that modulates the pressure wave produced by the airflow through the glottis: $F : p_0(t) \to p_L(t)$. The pressure at the glottis, $p_0(t)$, can be deduced from $p_L(t)$, the pressure sensed by a microphone close to the lips, through inverse filtering: $p_0(t) = F^{-1}(p_L(t))$. If $A(0)$ represents the cross-sectional area of the vocal channel at the glottis, then the volume velocity of airflow at the glottis, $u_0(t)$, can be deduced from $p_0(t)$ as $u_0^m(t) = \frac{A(0)}{\rho c}\, p_0(t)$, where c is the speed of sound and ρ is the ambient air density. The superscript m denotes that $u_0^m(t)$ is estimated from the pressure wave measured by a microphone near the mouth.

The volume velocity $u_0(t)$ can also be estimated from the solution to the model in Eqs. 1 and 2: $u_0(t) = \tilde{c}\, d\, (2x_0 + x_l(t) + x_r(t))$, where d is the length of the vocal folds and $\tilde{c}$ is the air particle velocity at the midpoint of the vocal fold. We derive our model parameters such that the glottal flow $u_0(t)$ predicted by the model matches the measured flow $u_0^m(t)$ as closely as possible. We define the residual $R(t) = u_0(t) - u_0^m(t)$ as the difference between the predicted and actual glottal flows, and the residual energy as $E = \int_0^T R(t)^2\, dt$ (3), where T is the duration of the recording.

We estimate our model parameters to minimize the residual energy E subject to Eqns. 1 and 2, and boundary constraints:
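As a concrete illustration of the quantities being compared, the following sketch computes the measured flow from a glottal pressure sequence, the model flow from the fold displacements, and a discrete approximation of the residual energy. It assumes uniformly sampled sequences and takes the energy to be the sum of squared residuals times the sampling interval. The function names are hypothetical, and the default air density and speed of sound are typical room-temperature values rather than values given in the text.

```python
def measured_flow(p0, area, rho=1.2, c=343.0):
    """Volume velocity from glottal pressure: u_m = A(0) / (rho * c) * p0.

    rho (kg/m^3) and c (m/s) default to typical values for air at room
    temperature; these defaults are assumptions.
    """
    scale = area / (rho * c)
    return [scale * p for p in p0]


def model_flow(x_l, x_r, x0, c_tilde, d):
    """Volume velocity from the model: u = c_tilde * d * (2*x0 + x_l + x_r)."""
    return [c_tilde * d * (2.0 * x0 + xl + xr) for xl, xr in zip(x_l, x_r)]


def residual_energy(u_model, u_measured, dt):
    """Return (residual sequence, discrete residual energy).

    The energy is approximated by a Riemann sum of the squared residual.
    """
    residual = [um - ux for um, ux in zip(u_model, u_measured)]
    energy = sum(r * r for r in residual) * dt
    return residual, energy


if __name__ == "__main__":
    # Toy numbers only, to show the call pattern.
    u = model_flow(x_l=[1.0, 2.0], x_r=[1.0, 2.0], x0=0.5, c_tilde=2.0, d=3.0)
    r, e = residual_energy(u, [17.0, 29.0], dt=0.5)
    print(u, r, e)
```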

$x_r(0) = C_r$, $x_l(0) = C_l$, $\dot{x}_r(0) = 0$, $\dot{x}_l(0) = 0$ (4)
where $C_r$ and $C_l$ are constants. To solve the above functional least squares problem, we define the Lagrangian:

where $E_r$ encodes the constraint of Eq. 1:

and $E_l$ is similarly obtained from Eq. 2. $\lambda_l$, $\lambda_r$, $\mu_r$, $\mu_l$, $\nu_r$ and $\nu_l$ are Lagrange multipliers. Differentiating L w.r.t. the model parameters and simplifying, we get, for $\lambda_r$:

and a similar pair of equations for $\lambda_l$ as well. At the end of the recording we also have:

Substituting into the Lagrangian and simplifying we get the derivatives of L w.r.t. the model parameters:

Using gradient descent to optimize objective (3), we get the following update rules:

where δ is the step size and k refers to the k-th iteration.
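The idea of fitting α, β and ∆ by descending the residual energy can be illustrated with a much cruder procedure than the one derived above. The sketch below is not the ADLES derivation: it replaces the Lagrangian-based derivatives with forward finite differences, adds step halving so that the energy never increases, and uses $x_l + x_r$ directly as the flow (dropping the constant scale and offset). It again assumes the coupled van der Pol-type model form; all names and numerical settings are hypothetical.

```python
def simulate_flow(params, steps=400, dt=0.05, c_r=0.1, c_l=0.1):
    """Integrate the ASSUMED model with RK4 and return x_l(t) + x_r(t)."""
    alpha, beta, delta = params

    def rhs(s):
        x_r, v_r, x_l, v_l = s
        drive = alpha * (v_r + v_l)
        a_r = drive - beta * (1.0 + x_r * x_r) * v_r - (1.0 - delta / 2.0) * x_r
        a_l = drive - beta * (1.0 + x_l * x_l) * v_l - (1.0 + delta / 2.0) * x_l
        return (v_r, a_r, v_l, a_l)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    state = (c_r, 0.0, c_l, 0.0)
    flow = []
    for _ in range(steps):
        flow.append(state[0] + state[2])
        k1 = rhs(state)
        k2 = rhs(shifted(state, k1, dt / 2.0))
        k3 = rhs(shifted(state, k2, dt / 2.0))
        k4 = rhs(shifted(state, k3, dt))
        state = tuple(
            s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return flow


def residual_energy(params, target, dt=0.05):
    """Discrete energy of the residual between model flow and target flow."""
    flow = simulate_flow(params, steps=len(target), dt=dt)
    return sum((u - m) * (u - m) for u, m in zip(flow, target)) * dt


def fit(target, start, step=0.05, iterations=20, eps=1e-4):
    """Gradient descent with finite-difference gradients and step halving."""
    params = list(start)
    energy = residual_energy(params, target)
    for _ in range(iterations):
        grad = []
        for i in range(3):
            bumped = list(params)
            bumped[i] += eps
            grad.append((residual_energy(bumped, target) - energy) / eps)
        trial_step = step
        for _ in range(10):  # halve the step until the energy decreases
            trial = [p - trial_step * g for p, g in zip(params, grad)]
            trial_energy = residual_energy(trial, target)
            if trial_energy < energy:
                params, energy = trial, trial_energy
                break
            trial_step /= 2.0
    return tuple(params), energy


if __name__ == "__main__":
    # Synthetic target generated from known parameters, then refit.
    target = simulate_flow((0.25, 0.32, 0.0))
    print(fit(target, start=(0.3, 0.4, 0.1)))
```

Finite differences need one extra simulation per parameter at every iteration, which is affordable for three parameters but gives only approximate gradients.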

The algorithm described above is used to solve for the model parameters α, β and ∆. These parameters are then substituted in the model to iteratively obtain $x_r$ and $x_l$. The time series corresponding to $x_r$ and $x_l$ comprise the vocal fold oscillations. The behavior of their trajectories is studied in the model's phase space. The behavior can also be located on a bifurcation diagram that maps the behavior types in the model's parameter space. However, we do not extend our study to bifurcation diagrams in this paper.

Data used: For our study we used a dataset collected under clinical supervision and curated by Merlin Inc., a private firm in Chile. The dataset included recordings from 512 individuals who were tested for COVID-19 and found to be either positive or negative. Of these, we chose the recordings from only those individuals who had been recorded within 7 days of being medically tested. Only 19 individuals satisfied this criterion. Of these, 10 were female and 9 were male. Five females and four males had been diagnosed with COVID-19, and the rest had tested negative. The speech signals were sampled at 8 kHz and recorded over microphones on commodity devices. Each individual was asked to utter multiple sounds, including the vowels /a/, /i/ and /u/.

Experiments performed: We performed two studies. In one, we estimated the vocal fold oscillations of the subjects in our dataset and observed the differences in the patterns of the phase space trajectories of the model. Only the recordings of extended vowels /a/, /i/ and /u/ were used for this purpose. Each recording was sectioned into segments of 50 ms duration, with an overlap of 25 ms, generating 3835 sets of oscillation time series in all. We used the value of the residual R(t) in Eq. 3 to gauge our model's sufficiency in modeling extreme asymmetry in vocal fold motion. The value of R(t) inversely relates to the accuracy with which the model is likely to estimate the vocal fold oscillations.
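The segmentation step can be sketched as follows. At a sampling rate of 8 kHz, a 50 ms segment spans 400 samples and a 25 ms overlap corresponds to a hop of 200 samples. The function name is hypothetical, and dropping a trailing partial segment is an assumption, since the text does not say how the ends of recordings were handled.

```python
def frame_signal(samples, sample_rate=8000, frame_ms=50, hop_ms=25):
    """Split a sequence of samples into overlapping fixed-length segments.

    A trailing partial segment is discarded (an assumption).
    """
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 8 kHz
    hop_len = sample_rate * hop_ms // 1000      # 200 samples at 8 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


if __name__ == "__main__":
    one_second = list(range(8000))  # placeholder for one second of audio
    frames = frame_signal(one_second)
    print(len(frames), len(frames[0]))
```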

In the second study, we used the residuals and the coefficients α, β and ∆ as features, and investigated the use of several classifiers to discriminate between COVID-19 positive and negative individuals. The classifiers tested in this binary classification task were logistic regression (LR), support vector machine with a nonlinear radial basis function kernel (NL-SVM), decision tree (DT), random forest (RF) and AdaBoost (AB). 3-fold cross-validation experiments were done using recordings of the vowels /a/, /i/ and /u/.
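Because each speaker contributes many segments, a 3-fold split is only meaningful if all segments of a speaker fall in the same fold, so that no speaker appears in both the training and the test set. A minimal sketch of such a split is given below. The round-robin assignment of speakers to folds and the function names are illustrative assumptions, and this simple scheme does not try to balance the positive and negative classes across folds.

```python
def speaker_folds(segment_speakers, n_folds=3):
    """Assign segment indices to folds so that each speaker is in one fold.

    segment_speakers[i] is the speaker identifier of segment i.
    """
    speakers = sorted(set(segment_speakers))
    fold_of = {spk: i % n_folds for i, spk in enumerate(speakers)}
    folds = [[] for _ in range(n_folds)]
    for idx, spk in enumerate(segment_speakers):
        folds[fold_of[spk]].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    speakers = ["s1", "s1", "s2", "s3", "s3", "s4"]  # hypothetical labels
    for train_idx, test_idx in train_test_splits(speaker_folds(speakers)):
        print(train_idx, test_idx)
```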

Results of Study 1: The results of the first study are shown in Figs. 2 and 3. Fig. 2 shows the phase space trajectories of the model on a displacement vs. velocity plane for each vocal fold, for COVID-19 positive and negative patients of both genders. We see a significant difference in the phase space behaviors of COVID-19 positive and negative individuals (with a very small number of outliers that need to be investigated in further studies). The phase space trajectories for COVID-19 negative individuals are limit cycles or slim toroids, indicating a greater degree of synchronization in the eigenmodes of vibration, and greater symmetry of motion. For COVID-19 positive patients, the trajectories are more complex, indicating a higher degree of both asynchrony and asymmetry, and the range of motion is reduced. The vocal folds are unable to maintain the natural self-sustained vibrations required for vocalization, and their range of motion is restricted by an order of magnitude relative to normal. Although measures of divergence such as Lyapunov exponents [15] may be used to quantify these, we have not used them yet. Fig. 3 shows a comparison of the estimated oscillations of the vocal folds to the glottal flow waveform obtained by inverse filtering. Note that in reality, the two are not the same: the former are the actual displacements of the vocal folds during phonation, while the latter is the airflow volume velocity across the glottis. Their strong correlation is, however, reflected in the example shown in Fig. 3.

Results of Study 2: The results of the second study are shown in Tables 1 and 2. In all experiments, performance was evaluated using the corresponding Receiver Operating Characteristic (ROC) curve. Tables 1 and 2 report the area under this curve (ROC-AUC) and its standard deviation (STD) for each experiment. Table 1 presents the ROC-AUC and STD obtained for the vowels /a/, /i/ and /u/. The segments used in the 3-fold cross-validation experiment were stratified so that the speakers in the training set were not included in the test set. We observe from Table 1 that all the classifiers achieve a comparable performance of ≈ 0.8 ROC-AUC. The statistical significance was tested for all classifiers, and all were found to be significant, with p-values below 1e-5. This strongly indicates that the features (residual values and vocal fold oscillation coefficients) can indeed capture the anomalous vibrations of COVID-19 patients without using sophisticated modeling techniques such as neural networks.
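For reference, the ROC-AUC of a set of classifier scores equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it in that pairwise form and summarizes per-fold values by their mean and standard deviation. Whether the reported STD is the population or the sample standard deviation is not stated in the text; the population form used here is an assumption, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 (positive) or 0 (negative); ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_and_std(values):
    """Mean and population standard deviation of per-fold values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) * (v - mean) for v in values) / len(values)
    return mean, variance ** 0.5


if __name__ == "__main__":
    # Toy scores only; these are not results from the paper.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```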

In order to gain further insight into the importance of these features, we examined the splits within the decision tree classifier specifically. We found that the residual R is consistently the most important feature, indicating that the vocal fold displacements themselves are highly discriminative for COVID-19 detection.

Table 2: Performance of logistic regression on extended vowels and their combinations.

Table 2 shows the performance of logistic regression on different vowels and their combinations. We observe that the vowel /i/ (a high front vowel) consistently yields the best performance, followed by /u/ (a high back vowel) and then /a/ (a low back vowel). This indicates that the ability to reach the higher frequency energy peaks during phonation is compromised due to COVID-19 infection.

While vocal fold oscillation patterns can be indicative of COVID-19, two caveats must be noted: a) they are likely to be useful only in symptomatic patients, and b) whether the anomalies observed are exclusive to COVID-19, as opposed to other respiratory conditions, has not been tested. We can only say that COVID-19 disrupts the entrainment of the vocal folds during phonation, and causes asymmetries in their motion, and that these characteristics can yield discriminative features that can be used to detect COVID-19 with even simple classifiers. Furthermore, it seems possible to achieve a high ROC-AUC using just a single phonated sound (e.g. the vowel /i/). We hope that the techniques presented in this paper can help facilitate future work towards a simple and cheap alternative for the rapid detection of COVID-19, using more sophisticated models to better capture pathological vocal fold oscillations.

[1] Nonlinear source-filter coupling in phonation: Theory

[2] Production and perception of voice

[3] An overview on audio, signal, speech, & language processing for COVID-19

[4] A framework for biomarkers of COVID-19 based on coordination of speech-production subsystems

[5] Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data

[6] AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app

[7] The respiratory sound features of COVID-19 patients fill gaps between clinical data and screening methods

[8] Self-entrainment of the right and left vocal fold oscillators

[9] Modeling vocal fold asymmetries with coupled van der Pol oscillators

[10] Synthesis of voiced sounds from a two-mass model of the vocal cords

[11] Computation of physiological human vocal fold parameters by mathematical optimization of a biomechanical model

[12] A finite-element model of vocal-fold vibration

[13] The physics of small-amplitude oscillation of the vocal folds

[14] Speech-based parameter estimation of an asymmetric vocal fold oscillation model and its application in discriminating vocal fold pathologies

[15] Determining Lyapunov exponents from a time series