key: cord-0281763-jxr5yrt4
authors: Kawahara, Hideki; Matsui, Toshie; Yatabe, Kohei; Sakakibara, Ken-Ichi; Tsuzaki, Minoru; Morise, Masanori; Irino, Toshio
title: Mixture of orthogonal sequences made from extended time-stretched pulses enables measurement of involuntary voice fundamental frequency response to pitch perturbation
date: 2021-04-03
journal: nan
DOI: 10.21437/interspeech.2021-2073
sha: f6c5a5aa9a1163983bcf3fc4808f89eb6c5bdf74
doc_id: 281763
cord_uid: jxr5yrt4

Auditory feedback plays an essential role in the regulation of the fundamental frequency of voiced sounds. The fundamental frequency also responds to auditory stimulation other than the speaker's voice. We propose to use this response of the fundamental frequency of sustained vowels to frequency-modulated test signals for investigating involuntary control of voice pitch. This involuntary response is difficult to identify and isolate with the conventional paradigm, which uses step-shaped pitch perturbation. We recently developed a versatile measurement method using a mixture of orthogonal sequences made from a set of extended time-stretched pulses (TSP). In this article, we extend our approach and design a set of test signals using the mixture to modulate the fundamental frequency of artificial signals. For testing the response, the experimenter presents the modulated signal aurally while the subject is voicing sustained vowels. We developed a tool for conducting this test quickly and interactively. We make the tool available as open source and also provide executable GUI-based applications. Preliminary tests revealed that the proposed method consistently provides compensatory responses with about 100 ms latency, representing involuntary control. Finally, we discuss future applications of the proposed method for objective and non-invasive auditory response measurements.

The fundamental frequency (fo; we adopt this symbol following the discussion in the forum article [1]) of sustained vowels responds to frequency modulation of aurally presented sounds [2] [3] [4]. This response, combined with our recently developed system analysis method [5, 6], provides a versatile new tool for investigating the auditory-to-speech chain. Preliminary tests illustrated that the proposed method provides accurate measurement of the involuntary response to fo perturbation of the aurally presented test signals. The measured response was compensatory to the perturbation, with a latency of around 100 ms. The goal of this article is to introduce our method and to demonstrate its feasibility. We developed an easy-to-use tool and made it available as open source so that readers can replicate and verify our results.

Without proper regulation, we are not able to keep the fundamental frequency of the voice (for example, in sustained vowels) constant [7]. Auditory feedback plays an essential role in this regulation [8] [9] [10]. Vibrato, which makes the singing voice attractive, also involves auditory feedback in production [11, 12]. Despite decades of research on voice fundamental frequency control mechanisms, it remains an active topic [13] [14] [15] [16]. Note that the target of the regulation is not the fo value. The target is the perceived pitch, which is a psychological attribute [17]. For periodic signals, the fo value is the physical correlate of the perceived pitch. In other words, we can observe the perceptual attribute, pitch, directly using the fo value of the produced voice. The regulation of voice pitch consists of voluntary and involuntary control [18, 19]. The shifted pitch paradigm [20] used in these studies has difficulty investigating this involuntary response.

The first author proposed using a pseudo-random signal [21] to perturb the fo of the fed-back voice. This made the test signal unpredictable and made it possible to derive the impulse response of the auditory-to-voice fo chain [22, 23]. This unpredictability enabled measurement of the involuntary response to pitch perturbation. However, it was difficult for others to replicate the test because it required a complex combination of hardware and software tools. The procedure also had several drawbacks due to the technology available in the 1990s. For example, we measured the response to pitch perturbation using the maximum length sequence (MLS) [21]. Selecting MLS among other TSP signals [24] [25] [26] [27] [28] was inevitable to make the test signal unpredictable. However, MLS has difficulty in measuring systems with non-linearity [26, 27]. Conventional pitch extractors were the other source of problems. They introduced non-linear and unpredictable distortions in the extracted fo trajectories.

We succeeded in making test signals that are unpredictable and do not have MLS's difficulty. Our new system analysis method uses a new extended TSP called CAPRICEP (Cascaded All-Pass filters with RandomIzed CEnter frequencies and Phase Polarities) [6]. We used CAPRICEP and developed an auditory-to-speech chain analysis system by adopting the simultaneous measurement method of linear, non-linear, and random responses [5]. Instead of using conventional pitch extractors, we developed an instantaneous frequency-based fo analysis method and removed the above-mentioned distortions. The combination of these analysis methods and substantially advanced computational power removed all the difficulties in measuring the auditory-to-speech chain response and resulted in an easy-to-use tool for conducting experiments. The tool is open source and available from the first author's GitHub repository [29].

The following section introduces the proposed method with illustrative plots of the component procedures. Then, we introduce a GUI-based application for conducting experiments based on the proposed method quickly and interactively. That section also shows preliminary test examples to illustrate how to use the tool and how to analyze the results. Finally, we discuss further applications of this method for objective and non-invasive auditory response measurements.

The associated media provides a movie showing readers how the interactive test tool for the proposed method works. The media also includes an example recording of a test session and a link to the tool so that readers can investigate details and verify our results. Figure 1 shows a schematic diagram of the experimental setting of the proposed method. The task of the subject is to keep voicing a sustained vowel at a constant pitch while exposed to the test sound through a headphone. The analysis of the response to pitch perturbation uses the fo values of both the test and the voiced sounds. A special design enabled analysis of the involuntary response to pitch perturbation.

Test signal design consists of the following four steps. The first step generates extended TSP signals (unit-TSPs) based on CAPRICEP [6]. The "CAPRICEP generator" in Fig. 1 performs this step. The second step periodically allocates unit-TSPs (tr represents the period) using a set of orthogonal series to yield a set of sequences that are orthogonal after post-processing [5]. The "Orthogonal sequence generator" in Fig. 1 performs this step. The third step mixes and smooths the orthogonal sequences to make a modulation signal for frequency modulation. The "Modulation signal generator" in Fig. 1 performs this step. The fourth step frequency-modulates carrier signals, such as a single sine wave or a signal consisting of harmonically related multiple sine waves. The "Test signal generator" in Fig. 1 performs this step.
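The fourth step can be sketched in a few lines of Python. This is not the authors' implementation (the published tool is MATLAB-based); the function name `fm_harmonic_signal`, its parameters, and the choice to express the modulation in cents on a log-frequency scale are assumptions made for illustration. The sketch converts a modulation signal into an instantaneous fo, integrates it to obtain the phase, and sums harmonically related sine components.

```python
import numpy as np

def fm_harmonic_signal(mod_cents, fs, fo=130.0, harmonics=range(1, 21)):
    """Frequency-modulate a harmonic carrier (hypothetical helper).

    mod_cents : modulation signal in cents, one value per sample
    fs        : sampling frequency in Hz
    fo        : nominal fundamental frequency in Hz
    harmonics : harmonic numbers to include, all in sine phase
    """
    harmonics = list(harmonics)
    # Modulation acts on log-frequency: 1200 cents correspond to one octave.
    inst_fo = fo * 2.0 ** (np.asarray(mod_cents, dtype=float) / 1200.0)
    # Integrate the instantaneous frequency to get the fundamental phase.
    phase = 2.0 * np.pi * np.cumsum(inst_fo) / fs
    # Every harmonic shares the same modulated phase trajectory.
    signal = sum(np.sin(k * phase) for k in harmonics)
    return signal / len(harmonics)
```

Under these assumptions, passing `harmonics=range(2, 21)` would produce a carrier without the first harmonic component, and `harmonics=[1]` a single modulated sine wave.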

The third step generates a modulation signal having the fundamental period 4×tr with and without smoothing. Filtering using time-reversed unit-TSPs followed by post-processing using the set of orthogonal series recovers periodic pulse sequences with the period tr and a pulse sequence with the period 4×tr.
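The cancellation provided by the orthogonal series can be illustrated with a toy example. The sketch below assumes the polarity patterns (+,+,+,+), (+,-,+,-), and (+,+,-,-) over four allocation periods, uses random waveforms as stand-ins for unit-TSPs, and places them without overlap. The filtering with time-reversed unit-TSPs used in the actual procedure is omitted, and all names are hypothetical.

```python
import numpy as np

# Assumed polarity patterns over one block of four allocation periods.
SIGNS = np.array([[1, 1, 1, 1],
                  [1, -1, 1, -1],
                  [1, 1, -1, -1]], dtype=float)

def mix_sequences(units, n_periods):
    """Place each stand-in pulse once per period with its polarity, then mix.

    units has shape (3, length); the result has one row per period.
    """
    if n_periods % 4 != 0:
        raise ValueError("n_periods must be a multiple of 4")
    pattern = np.tile(SIGNS, (1, n_periods // 4))  # shape (3, n_periods)
    mix = pattern.T @ units                        # shape (n_periods, length)
    return pattern, mix

def recover_units(pattern, mix):
    """Weight each period by one polarity pattern and average.

    The contributions of the other two patterns cancel exactly.
    """
    return pattern @ mix / pattern.shape[1]

rng = np.random.default_rng(0)
units = rng.standard_normal((3, 64))   # random stand-ins, not CAPRICEP pulses
pattern, mix = mix_sequences(units, n_periods=8)
recovered = recover_units(pattern, mix)
```

In this toy setting the mixture repeats every four periods, which mirrors the fundamental period of 4×tr stated above, and `recovered` matches `units` because the three patterns are mutually orthogonal.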

The fundamental frequency analyzer uses analytic signals which are tuned to the target fundamental frequency. The instantaneous frequency of the filtered signal is the temporally varying fundamental frequency. We used a six-term cosine series for the envelope of the analytic signals [30] for calculating clean instantaneous frequency. The "Fundamental frequency analyzer" in Fig. 1 does this process.

This process applies the pulse recovery and correlation-cancellation procedures described in the previous section. It yields the perturbation pulse shape from the electrically fed-back signal. It also yields the response to the perturbation from the recorded sound. The "System response analysis" in Fig. 1 does this process.

We designed the temporal distribution of the power of the unit-TSP to have a raised cosine shape [6]. The nominal duration of the unit-TSP was 400 ms. We set the allocation interval of unit-TSPs to 16384 samples (371.5 ms at a 44100 Hz sampling rate). We selected three items from the unit-TSP pool to generate one set of orthogonal sequences. Note that mixtures of three sequences made from different sets of unit-TSPs are independent. We set the total duration of the test signal to 20 s. This setting provides about 60 repetitions of measurement for calculating one impulse response. This repetition reduces the observation error to about 1/8 in terms of standard deviation.
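The numbers in this paragraph can be checked with a short calculation. The sketch assumes that observation errors are independent across repetitions, so that averaging N repetitions scales the standard deviation by 1/sqrt(N); the variable and function names are illustrative only.

```python
import math

fs = 44100                  # sampling frequency in Hz
allocation_samples = 16384  # allocation interval of unit-TSPs in samples
duration_s = 20.0           # total duration of the test signal in seconds

# Allocation interval in milliseconds (about 371.5 ms).
interval_ms = 1000.0 * allocation_samples / fs

# Number of allocation intervals that fit in the test signal.
intervals_in_signal = duration_s * fs / allocation_samples

def error_ratio(n_repetitions):
    """Standard-deviation ratio after averaging, assuming independent errors."""
    return 1.0 / math.sqrt(n_repetitions)
```

Under this assumption, roughly 54 allocation intervals fit in 20 s, which is of the same order as the quoted figure of about 60 repetitions, and `error_ratio(60)` is about 0.13, that is, close to 1/8.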

This section uses the signals generated in each procedure shown in Fig. 1 to illustrate its function.

Figure 2 illustrates the function of "Orthogonal sequence generator." The labels "Seq.1," "Seq.2," and "Seq.3" are orthogonal sequences made from three different unit-TSPs. For "Seq.1," we allocated the first unit-TSP periodically with the same polarity. For "Seq.2," we allocated the second unit-TSP periodically, inverting the polarity each time. For "Seq.3," we allocated the third unit-TSP periodically, inverting the polarity every other time. We overlapped and added the allocated unit-TSPs to yield each sequence. We mixed all sequences to make the signal "MIX" for the following process.

Figure 3 shows an example of the signal generated by "Modulation signal generator." We smoothed the mixed signal ("MIX" in Fig. 2) generated by the preceding procedure using the six-term cosine series [30]. This smoother provides more than 114 dB suppression of interfering signals outside the main lobe of the frequency response. The plot shows a portion of the signal with a length of eight allocation periods. We set the standard deviation of the modulation signal to 25 cents. This setting keeps the speed of the fo transition of the modulation signal from exceeding that observed in natural speech sounds [31]. Note that this smoothed signal modulates the fo represented in log-frequency because a set of linear differential equations approximates the fo dynamics well when using the log-frequency representation [32].

Figure 4 shows an example of the test signal consisting of twenty harmonic components generated using the "sine" phase. This is the output of "Test signal generator" in Fig. 1. The plot shows a portion one allocation interval wide. Note that the waveform deformation caused by the frequency modulation is not visible in this plot because the magnitude of the modulation is less than 7% of the fundamental period.
The headphone converts this test signal to the test sound and presents the sound to the subject. Figure 5 shows the impulse response hc[n] and the frequency gain response of the filter for fo analysis. The procedure "Fundamental frequency analyzer" uses this filter. The following equation provides the instantaneous frequency fi[n] of the filtered output y[n], where n represents the index of the discrete-time signal.

$$ f_i[n] = \frac{f_s}{2\pi} \angle\left[\frac{y[n+1]}{y[n]}\right] \qquad (1) $$

where ∠[a] represents the argument of a complex number a and fs represents the sampling frequency of the discrete-time signal. Because the fundamental frequencies of the test signal and the sustained vowel are known in advance, we avoid using conventional pitch extractors. Those extractors include pre- and post-processing procedures and introduce non-linear and unpredictable distortions. Equation 1 does not introduce such distortions and yields the fo values at the audio sampling rate. The design of the impulse response hc[n] is crucial for Eq. 1 to yield accurate instantaneous frequency values [33]. We used the six-term cosine series [30] to design the envelope of hc[n].

Figure 6 shows the analysis results of "System response analysis" in Fig. 1. The same fo analysis procedure extracted the fo values of the test signal (the electrically looped-back signal) and the acquired acoustic signal (an omnidirectional condenser microphone acquired the sound produced by a noise-canceling headphone). Filtering using time-reversed unit-TSPs and the cross-correlation canceling procedure recovers the perturbation pulse and responses from the extracted fo trajectories. Please refer to reference [5] for details of these recovery procedures. Note that the recovered pulse shapes are effectively the same (the electric signal yielded 50.38 cents and the acoustic signal yielded 50.32 cents at each peak). Figure 6 illustrates that the proposed procedure provides accurate measurement of the fo response to the perturbation. It is important to note that the perturbation does not modify the subject's auditory feedback through the natural sidetone. It enables us to measure the involuntary response to the fo perturbation without disrupting the natural auditory-to-speech chain.
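A minimal sketch of this analysis follows, assuming that Eq. 1 takes the common form fi[n] = (fs / 2π) ∠[y[n+1] / y[n]]. A Blackman window stands in for the six-term cosine series envelope of [30], whose coefficients are not given in this text, and the function name and the `n_cycles` parameter are hypothetical.

```python
import numpy as np

def instantaneous_frequency(x, fs, fc, n_cycles=4):
    """Estimate the instantaneous frequency of x around a known fo, fc (Hz).

    The complex filter is a window-shaped envelope times a complex
    exponential at fc, so it passes the fundamental component near fc.
    """
    half = int(round(n_cycles * fs / fc / 2))
    t = np.arange(-half, half + 1) / fs
    envelope = np.blackman(len(t))   # stand-in for the envelope of [30]
    hc = envelope * np.exp(2j * np.pi * fc * t)
    y = np.convolve(x, hc, mode="same")
    # angle(y[n+1] / y[n]) computed as angle(y[n+1] * conj(y[n]))
    # to avoid dividing by small values.
    return fs / (2.0 * np.pi) * np.angle(y[1:] * np.conj(y[:-1]))
```

The output has one value per sample interval, so the fo trajectory is available at the audio sampling rate without frame-based pre- or post-processing. Values near the signal edges are affected by the filter's finite length and are best discarded.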

We conducted preliminary experiments using the GUI-based test tool shown in Fig. 7. Because of the COVID-19 pandemic, the first author played both the experimenter and the subject roles. This may introduce biases. Please treat these example results as illustrative material for a proof of concept of the proposed method. We made the tool and some examples accessible to everyone so that they can replicate the tests described in this article. Please refer to the associated multimedia files. The GUI snapshot in Fig. 7 shows a response to the missing fundamental (missing the first harmonic component) test signal. The left part of the GUI is for operation, and the right part is for displaying the analysis result. The center bar graph is an input level monitor. The three panels on the right show the power level (top), the recovered responses (middle), and the final response analysis result (bottom). The title of the top panel shows the recorded file name and the test conditions. Note that the analysis does not precede data saving. Also, a separate log file records all saving and analysis operations. These are built-in mechanisms to prevent experimental misconduct.

We used a miniature omnidirectional condenser microphone (Shure MX153T/O-TQG), a noise-canceling circumaural headphone (SONY MDR-1RNC), an audio interface (ROLAND Rubix24), and a powered loudspeaker (IK Multimedia iLoud Micro Monitor). The microphone and the R-channel output of the audio interface are connected to the R and L input channels of the audio interface. We used a notebook computer (Apple MacBook Pro 13 inches with 2.7 GHz Intel Core i7 and 16 GB memory) for running the tool. The placement of the microphone followed the recommendation [35]. The sampling rate and the resolution were 44100 Hz and 24 bit. The target fo was 130 Hz, a comfortable pitch for the subject.

Before starting experiments, the experimenter calibrates the acoustic input using pink noise and the calibration panel of the GUI. A test session starts by clicking the "START" button. During the first several seconds, the subject listens to the test sound to determine the target pitch of voicing. Then, the subject starts the sustained vowel, keeping the pitch constant. The test signal lasts for twenty seconds. Clicking the "SAVE" button saves the test signal and the recorded voice; the response analysis then starts and displays the results. The attached media files include a movie showing an example test procedure. The dark yellow line shows the averaged random responses. Note that the average response is compensatory to the perturbation with a latency of around 100 ms. A test signal having only the fundamental component prevented the subject from monitoring the pitch of the produced voice. This condition made the fo randomly drift away from the target value. This behavior suggests that when using the test signal with harmonic components, the subject's auditory-to-speech chain of pitch regulation operates intact. Therefore, it is safe to state that the averaged response represents the involuntary response to the perturbation of this speech-to-speech chain. These results illustrate the feasibility of the proposed method.

Figure 9: Response example to a missing fundamental sound consisting of resolved harmonics from the 2nd to the 20th components.

Each session lasts about one minute. We tested under various conditions using different types of test signals. The next section discusses the future possibilities we found from those test sessions.

We conducted tests using a sum of harmonic components other than the fundamental component, that is, a missing fundamental signal. Figure 9 shows the summary of the results. Note that the average of the response is close to that of test signals having the fundamental component (Fig. 8). This suggests that we can conduct this acoustic-to-speech chain experiment using a loudspeaker instead of a headphone, which is very useful in practice.

We also tested using a sum of harmonic components from the eighth to the twentieth. The average response to this test signal shows very small compensatory behavior. This difference in response may reflect the difference in the pitch salience of the test signals. We speculate that the proposed procedure provides direct access to our internal pitch representation. The immediate next step is to design test signals using several phase relations between harmonic components, such as cosine, alternating [36], random, and Schroeder phase [37]. These may lead to critical tests of various pitch perception models [38]. The proposed method provides a non-invasive and quantitative assessment of auditory functions.

We designed a set of test signals using the mixture to modulate the fundamental frequency of artificial signals for testing the auditory-to-speech chain of pitch regulation. For testing the response, the experimenter presents the modulated signal aurally while the subject is voicing sustained vowels. We developed a tool for conducting this test quickly and interactively. We make it available as open source and also provide compiled GUI-based applications that run without a MATLAB license. Preliminary tests using the tool revealed that the proposed method consistently provides compensatory responses with about 100 ms latency, representing involuntary control. Finally, we discussed future applications of the proposed method for objective and non-invasive auditory response measurements.

This work was supported by JSPS (Japan Society for the Promotion of Science) Grants-in-Aid for Scientific Research Grant Numbers JP18K00147, JP18K10708, and JP19K21618.

Toward a consensus on symbolic notation of harmonics, resonances, and formants in vocalization

Voice responses to changes in pitch of voice or tone auditory feedback

ERP correlates of pitch error detection in complex tone and voice auditory feedback with missing fundamental

Vocal and neural responses to unexpected changes in voice pitch auditory feedback during register transitions

Simultaneous measurement of time-invariant linear and nonlinear, and random and extra responses using frequency domain variant of velvet noise

Cascaded all-pass filters with randomized center frequencies and phase polarity for acoustic and speech measurement and data augmentation

Principles of Voice Production

Auditory-motor mapping for pitch control in singers and nonsingers

Neural mechanisms underlying auditory feedback control of speech

Human cortical sensorimotor network underlying feedback control of vocal pitch

The role of auditory feedback in sustaining vocal vibrato

A reflex resonance model of vocal vibrato

Sensory processing: Advances in understanding structure and function of pitch-shifted auditory feedback in voice control

Modulation of vocal pitch control through high-definition transcranial direct current stimulation of the left ventral motor cortex

Relationships between vocal pitch perception and production: A developmental perspective

A causal role of the cerebellum in auditory feedback control of vocal production

An introduction to the psychology of hearing

Instructing subjects to make a voluntary response reveals the presence of two components to the audio-vocal reflex

Neural networks involved in voluntary and involuntary vocal pitch regulation in experienced singers

Voice F0 responses to pitch-shifted auditory feedback: a preliminary study

Integrated-impulse method measuring sound decay without using impulses

Interactions between speech production and perception under auditory feedback perturbations on fundamental frequencies

Effects of auditory feedback on F0 trajectory generation

Computer-generated pulse signal applied for sound measurement

Distortion immunity of MLS-derived impulse response measurements

Simultaneous measurement of impulse response and distortion with a swept-sine technique

Comparison of different impulse response measurement techniques

Impulse responses measured with MLS or Swept-Sine signals applied to architectural acoustics: an in-depth analysis of the two methods and some case studies of measurements inside theaters

GitHub repository for speech and hearing research/education tools

A new cosine series antialiasing function and its application to aliasing-free glottal source models for speech and singing synthesis

Maximum speed of pitch change and how it may relate to speech

A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour

Pitfalls in digital signal processing

The evolution of the Lombard effect: 100 years of psychoacoustic research

Recommended protocols for instrumental assessment of voice: American speech-language-hearing association expert panel to develop a protocol for instrumental assessment of vocal function

A pulse ribbon model of monaural phase perception

Synthesis of low-peak-factor signals and binary sequences with low autocorrelation

Pitch perception models