title: Cortical Processing of Arithmetic and Simple Sentences in an Auditory Attention Task
authors: Kulasingham, Joshua P.; Joshi, Neha H.; Rezaeizadeh, Mohsen; Simon, Jonathan Z.
date: 2021-05-18
journal: bioRxiv
DOI: 10.1101/2021.01.31.429030

Cortical processing of arithmetic and of language rely on both shared and task-specific neural mechanisms, which should also be dissociable from the particular sensory modality used to probe them. Here, spoken arithmetical and non-mathematical statements were employed to investigate neural processing of arithmetic, compared to general language processing, in an attention-modulated cocktail party paradigm. Magnetoencephalography (MEG) data was recorded from 22 subjects listening to audio mixtures of spoken sentences and arithmetic equations while selectively attending to one of the two speech streams. Short sentences and simple equations were presented diotically at fixed and distinct word/symbol and sentence/equation rates. Critically, this allowed neural responses to acoustics, words, and symbols to be dissociated from responses to sentences and equations. Indeed, the simultaneous neural processing of the acoustics of words and symbols was observed in auditory cortex for both streams. Neural responses to sentences and equations, however, were predominantly to the attended stream, and originated primarily from left temporal and parietal areas, respectively. Additionally, these neural responses were correlated with behavioral performance in a deviant detection task. Source-localized Temporal Response Functions revealed distinct cortical dynamics of responses to sentences in left temporal areas and equations in bilateral temporal, parietal and motor areas. Finally, the target of attention could be decoded from MEG responses, especially in left superior parietal areas. In short, the neural responses to arithmetic and language are especially well segregated during the cocktail party paradigm, and the correlation with behavior suggests that they may be linked to successful comprehension or calculation.

Significance Statement
Neural processing of arithmetic relies on dedicated, modality independent cortical networks that are distinct from those underlying language processing. Using a simultaneous cocktail party listening paradigm, we found that these separate networks segregate naturally when listeners selectively attend to one type over the other. Time-locked activity in the left temporal lobe was observed for responses to both spoken sentences and equations, but the latter additionally showed bilateral parietal activity consistent with arithmetic processing. Critically, these responses were modulated by selective attention and correlated with task behavior, consistent with reflecting high-level processing for speech comprehension or correct calculations. The response dynamics show task-related differences that were used to reliably decode the attentional target of sentences or equations.
Comprehension and manipulation of numbers and words are key aspects of human cognition and share many common features. Numerical operations may rely on language for precise calculations (Spelke and Tsivkin, 2001; Pica et al., 2004) or share logical and syntactic rules with language (Houdé and Tzourio-Mazoyer, 2003). During numerical tasks, frontal, parietal, occipital and temporal areas are activated (Menon et al., 2000; Dehaene et al., 2003, 2004; Arsalidou and Taylor, 2011; Maruyama et al., 2012; Dastjerdi et al., 2013; Harvey et al., 2013; Harvey and Dumoulin, 2017). Bilateral intraparietal sulcus (IPS) is activated by presenting numbers using Arabic or alphabetical notation (Pinel et al., 2001) or speech (Eger et al., 2003). Posterior parietal and prefrontal areas are activated for both arithmetic and language (Price, 2000; Göbel et al., 2001; Venkatraman et al., 2006; Zarnhofer et al., 2012; Bemis and Pylkkänen, 2013). However, some cortical networks activated by numerical stimuli (e.g., IPS) differ from those underlying language processing, even when the stimuli are presented using words (Amalric and Dehaene, 2016, 2019; Monti et al., 2012; Park et al., 2011). Lesion studies (Dehaene and Cohen, 1997; Varley et al., 2005; Baldo and Dronkers, 2007) further provide evidence that the neural basis of numerical processing is distinct from that of language processing (Gelman and Butterworth, 2005; Amalric and Dehaene, 2018). Beyond cortical location, these neural processes are also dynamic, with dynamic evoked responses to arithmetic arising from parietal, occipital, temporal and frontal regions (Iguchi and Hashimoto, 2000; Kou and Iwaki, 2007; Ku et al., 2010; Jasinski and Coch, 2012; Maruyama et al., 2012; Iijima and Nishitani, 2017). Arithmetic operations can even be decoded from such responses (Pinheiro-Chagas et al., 2019). Speech studies differentiate early auditory evoked components from later components reflecting linguistic and semantic processing in temporal, parietal and frontal regions (Baggio and Hagoort, 2011; Koelsch et al., 2004; Lau et al., 2008; Obleser et al., 2003, 2004). Linear models called Temporal Response Functions (TRFs), which model time-locked responses to continuous speech (Lalor and Foxe, 2010), also reveal dynamical linguistic processing (Brodbeck et al., 2018a, 2018b; Broderick et al., 2018; Di Liberto et al., 2015). To investigate cortical processing of spoken language and arithmetic, we utilize a technique pioneered by Ding et al. (2016) of presenting isochronous (fixed rate) words and sentences. There, the single syllable word rate, also the dominant acoustic rate, is tracked strongly by auditory neural responses, as expected.
However, cortical responses also strongly track the sentence rate, completely absent in the acoustics, possibly reflecting hierarchical language processing (Sheng et al., 2018).

MEG data was collected from 22 adults (average age 22.6 yrs, 10 female, 21 right handed) who were native English speakers. The participants gave informed consent and received monetary compensation. All experimental procedures were approved by the Institutional Review Board of the University of Maryland, College Park. To ensure that the subjects could satisfactorily perform the arithmetic task, only subjects who self-reported that they had taken at least one college level math course were recruited.

Monosyllabic words were synthesized with both male and female speakers using the ReadSpeaker synthesizer (https://www.readspeaker.com, 'James' and 'Kate' voices). The language stimuli consisted of 4-word sentences and the arithmetic stimuli consisted of 5-word equations (hereafter, arithmetic words are referred to as 'symbols', arithmetic sentences as 'equations', non-arithmetic words as 'words' and non-arithmetic sentences as 'sentences'). The words and symbols were modified to be of constant durations to allow for separate word, symbol, sentence, and equation rates, so that the neural response to each of these could be separated in the frequency domain. The words and symbols were constructed with fixed durations of 375 ms and 360 ms respectively, giving a word rate of 2.67 Hz, a symbol rate of 2.78 Hz, a sentence rate of 0.67 Hz, and an equation rate of 0.55 Hz. All the words and symbols were monosyllabic, and hence the syllabic rate is identical to the word/symbol rate. These rates are quite fast for spoken English, and, though intelligible, can be difficult to follow in the cocktail party conditions. Because neural signals below 0.5 Hz are very noisy, however, it was not deemed appropriate to reduce the rates further; preliminary testing showed that these rates were a suitable compromise between ease of understanding and reasonable neural signal to noise ratio. In addition, the rates were selected such that 10 equations and 12 sentences (each of which make up a single "trial") would have the same duration (18 s), allowing for precise frequency resolution at both these rates. The individual words and symbols were shortened by removing silent portions before their beginning and after their end, and then manipulated to give durations of 375 ms and 360 ms respectively, using the overlap-add resynthesis method in Praat (Boersma and Weenink, 2018). The words and symbols were respectively formed into sentences and equations (described below) and were lowpass filtered below 4 kHz using a 3rd order elliptic filter (the air-tube system used to deliver the stimulus has a lowpass transfer function with cutoff approximately 4 kHz). Finally, each stimulus was normalized to have approximately equal perceptual loudness using the MATLAB 'integratedLoudness' function.

The equations were constructed using a limited set of symbols consisting of the word 'is' (denoting '='), three operators ('plus' (+), 'less' (-) and 'times' (×)), and the eleven English monosyllabic numbers ('nil', 'one' through 'six', 'eight' through 'ten', and 'twelve'). The equations themselves consisted of a pair of monosyllabic operands (numbers) joined by an operator, an 'is' statement of equivalence, and a monosyllabic result; the result could be either the first or last symbol in the equation (e.g., 'three plus two is five' or 'five is three plus two').
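As an illustration, equations of the two formats just described can be drawn by rejection sampling over this symbol set, keeping only draws whose result is itself a monosyllabic number. The following is a minimal sketch under those assumptions, not the study's actual stimulus-generation code; all names are illustrative.

```python
import random

# Monosyllabic number words used in the stimuli, with values for checking
# that a generated equation is mathematically correct.
NUMBERS = {'nil': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
           'six': 6, 'eight': 8, 'nine': 9, 'ten': 10, 'twelve': 12}
WORD_FOR = {v: k for k, v in NUMBERS.items()}
OPERATORS = {'plus': lambda a, b: a + b,
             'less': lambda a, b: a - b,
             'times': lambda a, b: a * b}

def random_equation(rng):
    """Draw operands and an operator until the result is also a monosyllabic
    number, then place the result either last or first (the two formats)."""
    words = list(NUMBERS)
    while True:
        a, b = rng.choice(words), rng.choice(words)
        op = rng.choice(list(OPERATORS))
        result = OPERATORS[op](NUMBERS[a], NUMBERS[b])
        if result in WORD_FOR:  # well-formed, and the result stays monosyllabic
            c = WORD_FOR[result]
            if rng.random() < 0.5:
                return f'{a} {op} {b} is {c}'   # e.g., 'three plus two is five'
            return f'{c} is {a} {op} {b}'       # e.g., 'five is three plus two'

rng = random.Random(0)
print([random_equation(rng) for _ in range(3)])
```

Because only draws whose result stays in the monosyllabic set survive, small numbers appear more often, matching the imbalance noted below.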
The equations were randomly generated with repetitions allowed, in order to roughly balance the occurrences of each number (although smaller numbers are still more frequent, since there are more mathematically correct equations using only the smallest numbers). The fact that there was a limited set of symbols, and that the same symbol 'is' occurs in every equation, in either the 2nd or 4th position, are additional regularities, which contribute to additional peaks in the acoustic stimulus spectrum at the first and second harmonic of the equation rate (1.11 Hz and 1.66 Hz), as seen in Fig. 1 (and borne out by simulations). Although less than ideal, this is difficult to avoid in a paradigm restricted to mathematically well-formed equations. Hence, we do not analyze the neural responses at those harmonic frequencies, since their relative contributions from auditory vs. arithmetic processing are not simple to estimate. The sentences were also constructed with two related syntactic structures to be similar to the two equation formats: verb in second position (e.g., 'cats drink warm milk') and verb in third position (e.g., 'head chef bakes pie'); but unlike the 'is' of the arithmetic case, the verb changed with every sentence and there were no analogous harmonic peaks in the sentence case. Deviants were also constructed: deviant equations were properly structured but mathematically incorrect (e.g., 'one plus one is ten'); analogously, deviant sentences were syntactically correct but semantically nonsensical (e.g., 'big boats eat cake').

Cocktail party stimuli were constructed by adding the acoustic waveforms of the sentences and equations in a single audio channel (see Fig. 1) and presented diotically (identical for both ears). The speakers were different (male and female), in order to simplify the task of segregating diotically presented speech. The mixed speech was then normalized to have the same loudness as all the single speaker stimuli using the abovementioned algorithm.

[Figure 1. A. Stimulus waveforms. The word and symbol rhythms are clearly visible in the waveforms. The mix was presented diotically and is the linear sum of both streams. B. The frequency spectrum of the Hilbert envelope of the entire concatenated stimulus for the attend-sentences condition (432 s duration). The sentence (0.67 Hz), equation (0.55 Hz), word (2.67 Hz) and symbol (2.78 Hz) rates are indicated by colored arrows under the x-axis. Clear word and symbol rate peaks are seen in the foreground and background respectively, while the mix spectrum has both peaks. Note that there are no sentence rate or equation rate peaks in the stimulus spectrum. The appearance of harmonics of the equation rate is consistent with the limited set of math symbols used.]

The experiment was conducted in blocks: 4 single speaker blocks (2 × 2: male and female, sentences and equations) were followed by 8 cocktail party blocks (see Table 1). The order of the gender of the speaker was counterbalanced across subjects. Each block consisted of multiple trials (10 for single speaker, 6 for cocktail party, as shown in Table 1). 50% of the blocks had one deviant trial. Each trial consisted of 10 equations or 12 sentences (or both, for cocktail party conditions) and was 18 s in duration for all cases (0.360 s/symbol × 5 symbols/equation × 10 equations = 18 s; 0.375 s/word × 4 words/sentence × 12 sentences = 18 s).
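This timing design can be checked directly. A short sketch, using only the durations and counts given in the text:

```python
WORD_DUR, SYMBOL_DUR = 0.375, 0.360  # seconds per word / per symbol

word_rate, symbol_rate = 1 / WORD_DUR, 1 / SYMBOL_DUR          # 2.67 Hz, 2.78 Hz
sentence_rate, equation_rate = word_rate / 4, symbol_rate / 5  # 0.67 Hz, 0.55 Hz

trial_sentences = 12 * 4 * WORD_DUR    # 12 sentences x 4 words   = 18.0 s
trial_equations = 10 * 5 * SYMBOL_DUR  # 10 equations x 5 symbols = 18.0 s
assert abs(trial_sentences - trial_equations) < 1e-9  # both trial types: 18 s

# With an 18 s trial the frequency resolution is 1/18 Hz, so each rate of
# interest falls on an exact DFT bin (rate * 18 s is an integer cycle count).
for rate in (word_rate, symbol_rate, sentence_rate, equation_rate):
    assert abs(rate * 18 - round(rate * 18)) < 1e-9
```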
In total, the single speaker conditions had 240 sentences and 200 equations, and the cocktail party conditions had 288 sentences and 240 equations in the foreground. In deviant trials, 4 of the equations or 5 of the sentences were deviants. At the start of each block, the subject was instructed which stimulus to attend to, and was asked to press a button at the end of each trial to indicate whether a deviant was detected (right button: yes; left button: no). The subjects kept their eyes open, and a screen indicated which voice they should attend to ('Attend Male' or 'Attend Female') while the stimulus was presented diotically. After each trial, the stimulus was paused, and the screen displayed the text 'Outlier?' until the subjects pressed one of the two buttons. There was a 2-second break after the button press, after which the next trial stimulus was presented. Since the deviant detection task was challenging, especially in the cocktail party case, subjects were asked to practice detecting deviants just before they were placed inside the MEG scanner (2 trials of language, 2 trials of arithmetic, with stimuli not used during the experiment). The majority of subjects reported that it was easier to follow and detect deviants in the equations compared to the sentences. This might arise for several reasons, e.g., because the equations had a restricted set of simple numbers, or because the repetitive 'is' symbol helped keep track of equation structure. This experiment was not preregistered. The data has been made available at and the code is available at .

A 157-channel axial gradiometer whole-head MEG system (Kanazawa Institute of Technology, Nonoichi, Ishikawa, Japan) was used to record MEG data while subjects rested in the supine position in a magnetically shielded room (VAC, Hanau, Germany). The data was recorded at a sampling rate of 2 kHz with an online 500 Hz low pass filter and a 60 Hz notch filter. Saturating channels were excluded (approximately two channels on average) and the data was denoised using time-shift principal component analysis (de Cheveigné and Simon, 2007) to remove external noise, and sensor noise suppression (de Cheveigné and Simon, 2008) to suppress channel artifacts. All subsequent analyses were performed in mne-python 0.19.2 (Gramfort et al., 2013, 2014) and eelbrain 0.33. The MEG data was filtered from 0.3-40 Hz using an FIR filter (mne-python 0.19.2 default settings), downsampled to 200 Hz, and independent component analysis was used to remove artifacts such as eye blinks, heartbeats, and muscle movements.

The complex-valued spectrum of the MEG response for each sensor was computed using the Discrete Fourier Transform (DFT). The preprocessed MEG responses were separated into 4 conditions: attending math or language, in single speaker or cocktail party conditions. The male and female speaker blocks were combined for all analysis. Within each condition, the MEG responses for each trial were concatenated to form signals of duration 6 minutes for each of the single speaker conditions and 7.2 minutes for each of the cocktail party conditions. The DFT was computed for each sensor in this concatenated response, leading to a frequency resolution of 2.7×10⁻³ Hz for the single speaker conditions and 2.3×10⁻³ Hz for the cocktail party conditions. The amplitudes of the frequency spectra were averaged over all sensors and tested for significant frequency peaks (described in section 2.9).
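A minimal sketch of this frequency-domain analysis, assuming the concatenated response for one condition is a NumPy array of shape (sensors × samples) at 200 Hz; the function name and the simulated data below are illustrative, not from the study's code:

```python
import numpy as np

def amplitude_at_rates(resp, fs=200.0,
                       rates=(0.555, 0.667, 2.667, 2.778), n_side=5):
    """Sensor-averaged DFT amplitude at each rate of interest, together with
    the mean of the 5 neighboring bins on either side (the noise estimate)."""
    n = resp.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    amp = np.abs(np.fft.rfft(resp, axis=-1)).mean(axis=0)  # average over sensors
    out = {}
    for rate in rates:
        k = int(np.argmin(np.abs(freqs - rate)))  # exact bin by construction
        side = np.r_[amp[k - n_side:k], amp[k + 1:k + 1 + n_side]]
        out[rate] = (amp[k], side.mean())         # peak vs. local noise floor
    return out

# e.g., 6 minutes of simulated 157-sensor data for a single speaker condition:
rng = np.random.default_rng(0)
resp = rng.standard_normal((157, int(360 * 200)))
print(amplitude_at_rates(resp))
```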
Frequencies of interest were selected corresponding to the equation rate (0.555 Hz), the sentence rate (0.667 Hz), the symbol rate (2.778 Hz), and the word rate (2.667 Hz). Note that the duration of the signals is an exact multiple of both the symbol and the word durations, ensuring that the frequency spectrum contained an exact DFT value at each of these four rates. In addition, the neighboring 5 frequency values (width of ~0.01 Hz) on either side of these key frequencies were also selected to be used in a noise model for statistical tests.

The head shape of each subject was digitized using a Polhemus 3SPACE FASTRAK system, and head position was measured before and after the experiment using five marker coils. The marker coil locations and the digitized head shape were used to co-register the template FreeSurfer 'fsaverage' brain (Fischl, 2012) using rotation, translation and uniform scaling. A volume source space was formed by dividing the brain volume into a grid of 12 mm sized voxels. This source space was used to compute an inverse operator using minimum norm estimation (MNE) (Hämäläinen and Ilmoniemi, 1994), with a noise covariance estimated from empty room data. The complex-valued sensor distributions (with both amplitude and phase) of each of the 44 selected frequencies (4 frequencies of interest with 10 sidebands each) are given by the Fourier transform of each sensor's responses (concatenated over trials) at that frequency, in each condition. Each complex-valued sensor distribution was source-localized independently using MNE onto the volume source space, giving complex-valued source activations (Simon and Wang, 2005). The amplitudes of these complex-valued source activations were used for subsequent analysis. The sideband source distributions were averaged together to form the noise model.

The preprocessed single-trial MEG responses in each of the four conditions (excluding deviant trials) were source-localized in the time domain using MNE, similar to the method described above in the frequency domain. The MEG signals were further lowpassed below 10 Hz using an FIR filter (default settings in mne-python) and downsampled to 100 Hz for the TRF analysis. These responses were then used along with representations of the stimulus to estimate TRFs. The linear TRF model for $P$ predictors (stimulus representations) is given by

$$y(t) = \sum_{p=1}^{P} \sum_{\tau} h^{(p)}(\tau)\, x^{(p)}(t - \tau) + \varepsilon(t)$$

where $y(t)$ is the response at a neural source at time $t$, $x^{(p)}(t - \tau)$ is the time-shifted $p$-th predictor (e.g., speech envelope, word onsets, etc., as explained below) with time lag $\tau$, $h^{(p)}(\tau)$ is the value of the TRF corresponding to the $p$-th predictor at lag $\tau$, and $\varepsilon(t)$ is the residual noise. The TRF estimates the impulse response of the neural system for that predictor, and can be interpreted as the average time-locked response to continuous stimuli (Lalor and Foxe, 2010). For this analysis, several predictors were used to estimate TRFs at each neural source using the boosting algorithm (David et al., 2007), as implemented in eelbrain, thereby separating the neural response to different features. The boosting algorithm may result in overly sparse TRFs, and hence an overlapping basis of 30 ms Hamming windows (with 10 ms spacing) was used in order to allow smoothly varying responses. For the volume source space, the TRF at each voxel for a particular predictor is a vector that varies over the time lags, representing the amplitude and direction of the current dipole activity.
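A minimal sketch of such a joint TRF fit using eelbrain's boosting function, assuming the response and predictors are NDVars sharing a 100 Hz time axis; the simulated stand-ins and variable names are illustrative, and recent eelbrain versions accept per-predictor lag windows as lists:

```python
import numpy as np
from eelbrain import NDVar, UTS, boosting

# Simulated stand-ins: 36 s at 100 Hz, a single 'source' time course.
time = UTS(0, 0.01, 3600)
rng = np.random.default_rng(0)
resp = NDVar(rng.standard_normal(3600), (time,), name='response')
envelope = NDVar(rng.random(3600), (time,), name='envelope')
word_onsets = NDVar((rng.random(3600) < 0.027).astype(float), (time,), name='words')
sentence_onsets = NDVar((rng.random(3600) < 0.007).astype(float), (time,), name='sentences')

# Joint fit: envelope and word onsets get 0-350 ms lags; sentence onsets get
# 350-1850 ms, mirroring the lag windows described in the text.
res = boosting(resp, [envelope, word_onsets, sentence_onsets],
               tstart=[0.000, 0.000, 0.350], tstop=[0.350, 0.350, 1.850],
               basis=0.030, basis_window='hamming')
trfs = res.h  # one TRF per predictor; with volume sources, one per voxel as well
```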
The stimulus was transformed into 2 types of representations that were used for TRF analysis: the acoustic envelope, and impulses at the onsets of each word/symbol and sentence/equation; at all other time points, the onset representations were set to zero. In order to separate out responses to stimulus onset and offset, the first and last sentences were assigned separate onset predictors, which were not analyzed further except to note that their TRFs showed strong and sustained onset and offset responses, respectively. The remaining (middle) sentences' onsets were combined into one predictor that was used for further TRF analysis. The same procedure was followed for the equation onset predictors. For each of the two single-speaker conditions, 5 predictors were used in the TRF model: the corresponding three sentence/equation onsets (just described), word/symbol onsets and the acoustic envelope. For each of the two cocktail conditions, 10 predictors were used in the TRF model: the abovementioned 5 predictors, for each of the foreground and the background stimuli. The predictors were fit in the TRF model jointly, without any preference given to one of them over another. The TRFs for the speech envelope and the word/symbol onsets were estimated for time lags of 0-350 ms in order to limit the TRF duration to before the onset of the next word (at 375 ms) or symbol (at 360 ms). The sentence and equation TRFs were estimated starting from 350 ms to avoid onset responses, as well as lagged responses to the previous sentence. The sentence TRF was estimated until 1850 ms (350 ms past the end of the sentence) and the equation TRF was estimated until 2150 ms (350 ms past the end of the equation), in order to detect lagged responses. These sentence and equation TRFs were used to further analyze high-level arithmetic and language processing.

All decoding analyses were performed using scikit-learn (Pedregosa et al., 2011). Classifiers were trained either at each time point, using the MEG sensor topography at that time point, or at each voxel, using the temporal dynamics of the response at that voxel, with the trials separated into samples. This method was used separately for both the single speaker and the cocktail party conditions. The decoder at each voxel learns to predict the attended stimulus type (equations or sentences) using the temporal dynamics of the response at that voxel. Finally, the effect of attention was investigated using two sets of classifiers for equations and sentences at each voxel. For the attend-equations classifier, the cocktail party trials were separated into samples at the equation boundaries (10 equations per trial), and the labels were marked as '1' when math was attended to and '0' when not. The time duration T was 0-1800 ms (entire equation). For the attend-sentences classifier, the cocktail party trials were separated into samples at the sentence boundaries (12 sentences per trial) and the labels were '1' when attending to sentences and '0' otherwise. The time duration T was 0-1500 ms (entire sentence). Therefore, the attend-equations decoder at each voxel learns to predict whether the equation stimulus was attended to using the temporal dynamics of the response to the equation at that voxel (and similarly for the attend-sentences decoder). In summary, the decoders at each time point reveal the dynamics of decoding attention to equations vs. sentences from MEG sensor topographies, and the decoders at each voxel reveal the ability to decode arithmetic and language processing in specific cortical areas. The trained classifiers were tested on a separate set and the score of the decoder was computed. Logistic regression classifiers were used, with 5-fold cross-validation, within-subject for all the trials.
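A minimal sketch of one per-voxel attention decoder under these choices, using simulated stand-ins for the MEG samples (the array shapes and epoch counts are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((96, 180))  # samples x time: e.g., 1800 ms equation epochs at 100 Hz
y = rng.integers(0, 2, size=96)     # 1 = math attended, 0 = not attended

# Logistic regression on the response dynamics at one voxel, scored with
# 5-fold cross-validated AUC (chance level = 0.5).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
print(f'mean AUC: {auc:.2f}')  # ~0.5 here, since the labels are random
```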
The area under the receiver operating characteristic curve (AUC) was used to quantify the performance of the classifiers.

The amplitude spectrum for each condition was averaged across sensors, and permutation tests were used to detect significant peaks across subjects (n=22). Each frequency value in the spectrum from 0.3 to 3 Hz was tested for a significant increase over the neighboring 5 values on either side using 10000 permutations and the max-t method (Nichols and Holmes, 2002) to adjust for multiple comparisons. For these tests, we report the p-values and the t-values, and deliberately omit the degrees of freedom to avoid direct comparison between the two, since the p-values are derived entirely from the permutation distribution and not from the t-distribution. Correlation tests were also performed to investigate associations between different responses (e.g., sentence rate vs. equation rate) within each subject. Pearson correlation tests with Holm-Bonferroni correction were used on the responses at the frequencies of interest, after subtracting the average of the five neighboring bins on either side.

The source distributions for each individual were mapped onto the FreeSurfer 'fsaverage' brain, in order to facilitate group statistics. To account for individual variability and mislocalization during this mapping, the distributions were spatially smoothed using a Gaussian window with a standard deviation of 12 mm for all statistical tests. The source localized frequency responses were tested for a significant increase over the corresponding noise model formed by averaging the source localized responses of the five neighboring frequencies on either side. Nonparametric permutation tests (Nichols and Holmes, 2002) and Threshold Free Cluster Enhancement (TFCE) (Smith and Nichols, 2009) were performed to compare the response against the noise and to control for multiple comparisons. A detailed explanation of this method can be found in Brodbeck et al. (2018). Briefly, a test statistic (in this case, paired samples t-statistics between true responses and noise models) is computed for the true data and 10000 random permutations of the data labels. The TFCE algorithm is applied to these statistics, in order to enhance continuous clusters of large values, and a distribution consisting of the maximum TFCE value for each permutation is formed. Any value in the original TFCE map that exceeds the 95th percentile is considered significant at the 5% significance level. In all subsequent results, the minimum p-value and the maximum or minimum t-value across voxels are reported as pmin, tmax or tmin respectively. Note that pmin is derived from the permutation distribution and cannot be derived directly from tmax or tmin using the t-distribution (degrees of freedom are also omitted for this reason). Lateralization tests were performed by testing each voxel in the left hemisphere against the corresponding voxel in the right, using permutation tests and TFCE with paired samples t-statistics. For the attend math conditions, equations were separated by operator type ('plus' (+), 'less' (-) or 'times' (×)) to test for specific responses to each operator. No significant differences were found between the source localized responses to each operator. To test for significant effects and interactions, repeated measures ANOVAs were performed on the source localized responses at each frequency of interest after subtracting the corresponding noise model.
Nonparametric permutation tests with TFCE were used to correct for multiple comparisons, similar to the method described above. In brief, a repeated-measures ANOVA is performed at each voxel, and then, for each effect or interaction, the voxel-wise F-values from this ANOVA are passed into the TFCE algorithm, followed by permutation tests as described earlier. This method detects significant clusters in source space for each significant effect. Note that the maximum F-value in the original map within a cluster (Fmax) and the p-value of the cluster are reported (and degrees of freedom omitted), for the same reasons as those explained in the previous paragraph (i.e., p-values are derived from the permutation distribution and not the F-distribution). Several types of repeated measures ANOVAs were performed using the abovementioned method. In the single speaker case, a 2 × 2 ANOVA with factors stimulus ('language' for words/sentences or 'math' for symbols/equations) and frequency (low for sentence/equation and high for word/symbol) was performed. For the cocktail party case, a 2 × 2 × 2 ANOVA with the added factor of attention (attended or unattended) was performed. In addition, two further ANOVAs were performed to investigate hemispheric effects, using an additional factor of hemisphere for both the single speaker (2 × 2 × 2 ANOVA) and the cocktail party (2 × 2 × 2 × 2 ANOVA) conditions. To investigate significant ANOVA effects further, post-hoc t-tests across subjects were performed on the responses averaged across voxels within the relevant significant cluster. For this scalar t-test, a Holm-Bonferroni correction was applied to correct for multiple comparisons. For these tests, the t-values with degrees of freedom, corrected p-values and Cohen's d effect sizes are reported.

Behavioral responses for the deviant detection task were classified as either correct or incorrect, and the number of correct responses for each subject was correlated with the source localized response power of that subject. The noise model for each frequency of interest was subtracted from the response power before correlating with behavior. Nonparametric permutation tests with TFCE were used in a manner similar to that given above. The only difference was that the statistic used for comparison was the Pearson correlation coefficient between the two variables (behavior and response power), and the maximum correlation coefficient across voxels is reported as rmax. The TRFs were tested for significance using vector tests based on Hotelling's T² statistic (Mardia, 1975). Finally, the decoders were tested across subjects for a significant increase in decoding ability above chance (AUC = 0.5) at each time point or at each voxel. Multiple comparisons were controlled for using permutation tests and TFCE, similar to the above cases, with AUC as the test statistic.

After each trial, subjects indicated whether that trial contained a deviant by pressing a button. The single speaker conditions had higher deviant detection accuracies (equations: mean = 89.9%, std. = 10.7%; sentences: mean = 74%, std. = 13.7%) than the cocktail party conditions (equations: mean = 79.8%, std. = 13.6%; sentences: mean = 62.3%, std. = 18.9%). Subjects reported that the equations were perceptually easier to follow than the sentences, consistent with the fact that the equations were formed using a smaller set of symbols (restricted to being monosyllabic numbers to preserve the symbol rates).
The presence of 'is' in each equation may have also contributed to subjects tracking equation boundaries.

The response power spectrum was averaged over all sensors and a permutation test with the max-t method was performed to check whether the power at each frequency of interest was significantly larger than the average of the neighboring 5 frequency bins on either side (see Fig. 2). Since the word and symbol rates are present in the acoustics for both conditions, the neural responses at these rates could merely reflect acoustic processing. However, the fact that the sentence and equation rates are significant only in the corresponding attention condition suggests that these responses may dominantly reflect attention-selective high-level processes. This agrees with prior studies showing a similar effect for language. Here we show that this effect occurs even for arithmetic equations. However, arithmetic equations are also sentences, so it is unclear from this result alone if the equation peak reflects merely tracking of sentential structure and not arithmetic processing. To investigate this, we used volume source localization on the responses at the relevant frequencies to determine the cortical distribution of these responses.

The responses at the 4 frequencies of interest (word, symbol, sentence and equation rates) were source-localized using the Fourier transform sensor topographies at these frequencies (see Methods 2.6). The amplitudes of the resulting complex-valued volume source distributions were used for all subsequent analysis. For each frequency of interest, the neighboring 5 bins on either side were also source localized (using the same source model) and averaged together to form an estimate of the background noise. The response distributions for each of these frequencies were tested for a significant increase over the noise estimate using nonparametric permutation tests with paired sample t-statistics and TFCE. For both single speaker conditions, the corresponding word or symbol responses were significant (tmax = 12.85, pmin < 0.001, and tmax = 12.77, pmin < 0.001, respectively) in the regions shown in Fig. 3A, B, with the average response being strongest in bilateral auditory cortex. The word and symbol rate responses were not significantly different (tmin = -2.11, tmax = 3.63, p > 0.08), consistent with low level auditory processing. The corresponding sentence or equation responses were also significant (tmax = 9.92, pmin < 0.001, and tmax = 7.68, pmin < 0.001, respectively). The source distribution for sentence responses was predominantly in left auditory cortex and temporal lobe, whereas for equations the response was distributed over areas of bilateral temporal, parietal, and occipital lobes. Despite these visually distinct patterns, the two responses were not significantly different (tmin = -3.00, tmax = 3.36, p > 0.12), perhaps because large portions of the brain show activity synchronized to the rhythm. Both sentence and equation responses were significantly left lateralized in temporal (tmax = 6.69, pmin < 0.001) and parietal areas.

For the cocktail party conditions, similar results were obtained for both word and symbol rate responses (attend sentences: word rate: tmax = 11.9, pmin < 0.001, symbol rate: tmax = 12.8, pmin < 0.001; attend equations: word rate: tmax = 11.1, pmin < 0.001, symbol rate: tmax = 11.01, pmin < 0.001). The response was predominantly in bilateral auditory cortices, as shown in Fig. 3C, D.
The attended sentence and equation rate responses were also significant, with source distributions suggesting that the equation rate response is distinct from the sentence rate response and that the selective attention task is better able to separate these responses. Perhaps surprisingly, the unattended sentence and equation rates were also significant (tmax = 4.02, pmin = 0.005, and tmax = 5.31, pmin < 0.001, respectively) in small clusters, even though such peaks do not appear in the frequency spectrum averaged across all sensors (Fig. 2). Note, however, that some individuals did show small peaks at these rates even in the average spectrum (see points above zero for unattended rates in the insets of Fig. 2C, D). A repeated measures ANOVA was performed for the single speaker case on the abovementioned source space distributions (as shown in Fig. 3). Behavioral performance was correlated with source localized neural responses using nonparametric permutation tests with TFCE (see Methods 2.9).

TRF analysis was performed using source localized MEG time signals for each condition after excluding the deviant trials (details in Methods 2.7). TRFs were simultaneously obtained for responses to the acoustic envelopes, word/symbol onsets and sentence/equation onsets. Although stimuli with fixed and rhythmic word, symbol, sentence, and equation onsets might lend themselves to an evoked response analysis, the fact that the words (or symbols) are only separated by 375 ms (or 360 ms) may lead to high-level late responses overlapping with early auditory responses to the next word (or symbol). In contrast, computing simultaneous TRFs to envelopes and word/symbol onsets in the same model as TRFs to equation/sentence onsets regresses out auditory responses from higher-level responses, providing cleaner TRFs for sentences and equations. The obtained envelope and word/symbol TRFs were not used for further analysis, since they were dominated by acoustic responses that have been well-studied in other investigations (Brodbeck et al., 2018a, 2018b).

The volume source localized TRFs are time-varying vectors at each voxel. Activity of nearby neural populations may overlap, even if the activity is driven by different processes, due to the limited spatial resolution of MEG. However, these effects may have different current directions due to the anatomy of the cortical surface. Therefore, a test for consistent vector directions, using Hotelling's T² statistic and permutation tests with TFCE, was used to detect group differences in the direction of current flow (see Methods 2.9). The sentence and equation TRFs showed significance over several time intervals and many voxels over the duration of the TRF (T²max = 7.66, pmin < 0.001, and T²max = 5.12, pmin < 0.001, respectively; see Fig. 5). The TRFs were computed starting from 350 ms after the sentence onset to 350 ms after the end of the sentence, but because of the fixed-period presentation rate without any breaks between sentences in a trial, the TRFs from 0-350 ms are identical to the last 350 ms. The large peak at the end (and beginning) of each TRF may arise from processing of the completion of the sentence/equation, or from preparatory (or auditory) processing of the next sentence/equation, or both. This peak occurs around 60-180 ms after the start of the new sentence/equation, in the typical latency range of early auditory processing. However, spatial distributions of the peak in the equation TRFs seem to indicate patterns that are not consistent with purely auditory processing, especially for the cocktail party condition (described below).
Additionally, significant activity is seen throughout the duration of the sentence/equation that is not tied to word/symbol onsets, indicating that lower-level auditory responses have been successfully regressed out. Therefore, the large peaks at the end plausibly reflect processing of the completion of the sentence/equation (with a latency of 420-530 ms after the last word/symbol). The sentence TRF peaks were significant predominantly in the left temporal lobe, while the equation TRF peaks were significant in bilateral temporal, parietal, and motor areas. […] auditory cortex (even more so than in the single speaker case). Therefore, selective attention in the cocktail conditions seems to highlight differences between arithmetic and language processing, and possible explanations are discussed in section 4.6.

To further help differentiate between the cortical processing of equations and sentences, two types of linear decoders were trained on neural responses: 1) classifiers at each time point that learned weights based on the MEG sensor topography at that time point, and 2) classifiers at each voxel that learned weights based on the temporal dynamics of the response at that voxel. The former was used to contrast the dynamics of equation and sentence processing (Fig. 7A). For the single speaker conditions, all time points showed significant decoding ability across subjects (tmax = 11.3, pmin < 0.001), with higher prediction success (as measured by AUC) at longer latencies. For the cocktail party conditions, decoding ability was significantly above chance only at longer latencies (tmax = 6.45, pmin < 0.001). While subjects listened to the equations, the identity of the arithmetic operator (e.g., 'plus' vs. 'times' or 'less') was reliably decoded from the MEG sensor topography during the time points when the operator and the subsequent operand were presented (Fig. 7B). Note that decoding accuracy was significantly above chance for time points ~250-300 ms after the offset of the operator. This is considerably late for decoding based on mere auditory responses to acoustic features of the operator. Decoders at each voxel were also trained to differentiate attention to equations vs. sentences, based on the dynamics of the response at that voxel during the first or last words (Fig. 7C). The prediction success (as measured by AUC) was significant for large areas in the single speaker conditions, both for first words (tmax = 5.1, pmin < 0.001) and last words […]. Finally, decoders at each voxel were also trained to decode the attention condition (foreground vs. background) from the response to the entire sentence or equation (Fig. 7D). The AUC was significant in bilateral parietal areas for decoding whether arithmetic was in foreground vs. background (tmax = 5.2, pmin < 0.001), consistent with areas involved in arithmetic processing. For decoding whether language was in foreground vs. background, the AUC was significant (tmax = 5.1, pmin = 0.002) in left middle temporal areas (consistent with higher level language processing) and bilateral superior parietal areas (consistent with attention networks that are involved in discriminating auditory stimuli). Therefore, the decoding analysis is able to detect different cortical areas that may be involved in attention to language and arithmetic.

Specific cortical patterns were found to be consistent across the different analysis methods (frequency domain, TRFs, decoders), as illustrated in the schematic of Fig. 8.
Sentence responses consistently localize to left temporal areas. Equation responses consistently show bilateral parietal activity, but also vary across analysis method (e.g., motor activity in TRFs). This may be due to different mechanisms involved in equation processing, although further investigation would be needed to support this claim.

As expected, MEG responses to acoustic features source-localize to bilateral auditory cortex, and sentence rate responses source-localize to left temporal cortex, consistent with speech and language areas (Friederici, 2002, 2011; Vandenberghe et al., 2002; Hickok and Poeppel, 2007; Binder et al., 2009), similar to prior isochronous speech studies (Sheng et al., 2018). In contrast, we find equation rate responses localize to left parietal, temporal, and occipital areas. Arithmetic processing can activate IPS and parietal (Dehaene et al., 2003), angular gyrus (Göbel et al., 2001), temporal (Tang et al., 2006) and even occipital areas (Maruyama et al., 2012; Harvey and Dumoulin, 2017), perhaps due to internal visualization (Zago et al., 2001). Equation responses also localize to right temporal and parietal areas in cocktail party conditions, confirming that arithmetic processing is more bilateral than language processing (Dehaene and Cohen, 1997; Amalric and Dehaene, 2018, 2019). Critically, ANOVA analysis indicated that attended equation and sentence responses are significantly different. Unexpectedly, significant neural responses at the unattended sentence and equation rates were found in smaller temporal (consistent with language processing) and parietal (consistent with arithmetic processing) areas respectively. Some subjects may have been unable to sustain attention to the instructed stream for the entirety of this diotic stimulus and so briefly switched their attentional focus.

Equation responses were left dominant in both single speaker and cocktail party conditions. This could reflect left lateralized language processing, since equations were presented using speech. However, arithmetic processing may also show left dominance (Pinel and Dehaene, 2009), perhaps due to precise calculations (Dehaene et al., 1999; Pica et al., 2004) or arithmetic fact retrieval (Dehaene et al., 2003; Grabner et al., 2009). These fast-paced stimuli required rapid calculations, and may have resulted in increased reliance on rote memory, which activates left hemispheric areas (Campbell and Austin, 2002). Specific strategies employed for calculation may also result in left lateralization: multiplication of small numbers is often performed using rote memory (Delazer et al., 1999; Ischebeck et al., 2006; Fehr et al., 2007), while subtraction is less commonly performed using memory and shows more bilateral activation (Schmithorst and Brown, 2004; Prado et al., 2011). Addition may recruit both these networks, depending on specific strategies utilized by individuals (Arsalidou and Taylor, 2011). We found no significant differences in equation responses when separated by operation type, perhaps because of individual variation in procedural calculation or retrieval strategies within the same operation (Tschentscher and Hauk, 2014). However, operation types were successfully decoded from the overall MEG signals (Fig. 7B).

Neural responses to sentence, equation, word, and symbol rates were correlated with performance in detecting deviants, consistent with language-only isochronous studies.
Sentence responses correlated with behavior in language areas, such as left auditory cortex, superior and middle temporal lobe and angular gyrus (Price, 2000; Binder et al., 2009; Karuza et al., 2013), in both single speaker and cocktail party conditions. In contrast, equation responses correlated with behavior in cocktail party conditions in posterior parietal areas, which are known to predict competence and performance in numerical tasks (Grabner et al., 2007; Lin et al., 2012, 2019; Lasne et al., 2019). The lack of significant behavioral correlations for equation responses in single speaker conditions may be due to several subjects performing at ceiling; equations had a restricted set of only 14 unique symbols, and the presence of the 'is' symbol in every equation might be structurally useful in tracking equation boundaries. Unexpectedly, behavioral correlations were also found for background symbol rate responses in parietal and occipital areas (and for background word rate responses in a small parietal region). Some studies show that acoustic features of background speech may also be tracked (Fiedler et al., 2019; Brodbeck et al., 2020). Since background word and symbol rates are present in the stimulus acoustics, increased effort or attention could enhance both behavioral performance and auditory responses at these rates. Representations of the background could enhance attentional selectivity in challenging cocktail party environments, as suggested by Fiedler et al. (2019). However, note that sentence and equation responses were only correlated with behavior when attended. Overall, behavioral correlations in temporal and parietal regions suggest that these responses may reflect improved comprehension due to neural chunking of speech structures or successful calculations (Blanco-Elorrieta et al., 2019; Chen et al., 2020; Jin et al., 2020; Kaufeld et al., 2020; Teng et al., 2020).

TRFs for equation and sentence onsets were jointly estimated along with speech envelopes and word/symbol onsets in order to regress out auditory responses, analogous to prior work with linguistic and auditory TRFs (Brodbeck et al., 2018a, 2018b; Broderick et al., 2018). Isochronous speech studies have found slow rhythmic activity, which does not appear in our TRFs, perhaps due to implicit high-pass filtering (boosting favors sparse TRFs). Instead, we find large TRF peaks at sentence/equation boundaries. Prior studies have found late evoked responses specific to numbers and equations (Avancini et al., 2015). Large peaks appear in both sentence and equation TRFs 410-600 ms after the onset of the last word/symbol and may reflect processing of the completion of the sentence/equation. Sentence TRF peaks localize to left temporal areas, while equation TRF peaks show activity in bilateral parietal and temporal areas involved in numerical processing (Abd Hamid et al., 2011; Amalric and Dehaene, 2018), and motor areas, perhaps reflecting procedural calculation strategies (Tschentscher and Hauk, 2014). The latencies of these peaks are similar to prior arithmetic ERP studies (Iguchi and Hashimoto, 2000; Iijima and Nishitani, 2017). These sentence and equation TRF peaks may reflect several mechanisms, both shared (language processing, decision making) and separate (semantic vs. arithmetic processing); further work is needed to disentangle these mechanisms.
Finally, the cortical patterns of TRF peaks show more differences in the cocktail party than the single speaker conditions, suggesting that selective attention focuses the underlying cortical networks.

Numbers and arithmetic operations have been previously decoded from cortical responses (Eger et al., 2009; Pinheiro-Chagas et al., 2019). In this study, the attended stimulus type (sentences or equations) was reliably decoded in single speaker conditions in broad cortical regions, perhaps due to highly correlated responses across cortex for this task. In contrast, decoding accuracy during cocktail party conditions was significant in left IPS and superior parietal areas, suggesting that these regions are most important for discriminating between arithmetic and language processing. Both the attend-equations and the attend-sentences states could be decoded from bilateral superior parietal areas. This could be due to general attentional networks in fronto-parietal areas, or attentional segregation of foreground and background speech based on pitch or gender (Hill and Miller, 2010; Kristensen et al., 2013). Additionally, decoding the attend-equations state was also significant in bilateral parietal areas, consistent with arithmetic processing, while decoding the attend-sentences state was also significant in left middle temporal lobe, consistent with language processing (Hickok and Poeppel, 2007). Overall, MEG responses contain enough information to decode arithmetic vs. language processing, selective attention, and arithmetic operations.

Differences in source-localized sentence and equation responses were more prominent in the cocktail party than in the single speaker conditions for all analyses (frequency domain, TRFs and decoders). Responses to both stimuli presented simultaneously may have helped control for common auditory and pre-attentive responses. Abd Hamid et al. (2011) found fMRI activation in broader areas for spoken arithmetic with a noisy background than in quiet, perhaps due to increased effort. However, in our case, the background stimulus was not white noise, but rather meaningful non-mathematical speech. Our TRF analysis, which regresses out responses to background speech, as well as the selective attention task itself, may highlight specific cortical processes that best separate the arithmetic and language stimuli.

In summary, neural processing of spoken equations and sentences involves both overlapping and non-overlapping cortical networks. Behavioral correlations suggest that these neural responses may reflect improved comprehension and/or correct arithmetic calculations. Selective attention for equations focuses activity in temporal, parietal, occipital and motor areas, and for sentences in temporal and superior parietal areas. This cocktail party paradigm is well suited to highlight the cortical networks underlying the processing of spoken arithmetic and language.

References
- Brain Activation during Addition and Subtraction Tasks In-Noise and In-Quiet
- Origins of the brain networks for advanced mathematics in expert mathematicians
- Cortical circuits for mathematical knowledge: evidence for a major subdivision within the brain's semantic networks
- A distinct cortical network for mathematical knowledge in the human brain
- Is 2+2=4? Meta-analyses of brain areas needed for numbers and calculations
- Separating stages of arithmetic verification: An ERP study with a novel paradigm
- The balance between memory and unification in semantics: A dynamic account of the N400
- Neural correlates of arithmetic and language comprehension: A common substrate
- Basic linguistic composition recruits the left anterior temporal lobe and left angular gyrus during both listening and reading
- Where Is the Semantic System? A Critical Review and Meta-Analysis of 120 Functional Neuroimaging Studies
- Understanding Requires Tracking: Noise and Knowledge Interact in Bilingual Comprehension
- Praat: doing phonetics by computer
- Rapid Transformation from Auditory to Linguistic Representations of Continuous Speech
- Neural speech restoration at the cocktail party: Auditory cortex recovers masked speech of both attended and ignored speakers
- Neural source dynamics of brain responses to continuous stimuli: Speech processing from acoustics to comprehension
- Electrophysiological Correlates of Semantic Dissimilarity Reflect the Comprehension of Natural, Narrative Speech
- Effects of response time deadlines on adults' strategy choices for simple addition
- Attentional amplification of neural codes for number independent of other quantities along the dorsal visual stream
- The influence of linguistic information on cortical tracking of words
- Numerical processing in the human parietal cortex during experimental and natural conditions
- Estimating sparse spectro-temporal receptive fields with natural stimuli
- Denoising based on Time-Shift PCA
- Sensor noise suppression
- Sources of Mathematical Thinking: Behavioral and Brain-Imaging Evidence
- Cerebral Pathways for Calculation: Double Dissociation between Rote Verbal and Quantitative Knowledge of Arithmetic
- Arithmetic and the brain
- Three parietal circuits for number processing
- Numerical skills and aphasia
- Low-Frequency Cortical Entrainment to Speech Reflects Phoneme-Level Processing
- Characterizing Neural Entrainment to Hierarchical Linguistic Units using Electroencephalography (EEG)
- Cortical tracking of hierarchical linguistic structures in connected speech
- Attention Is Required for Knowledge-Based Sequential Grouping: Insights from the Integration of Syllables into Words
- Emergence of neural encoding of auditory objects while listening to competing speakers
- Deciphering Cortical Number Coding from Human Brain Activity Patterns
- A supramodal number representation in human intraparietal cortex
- Common brain regions underlying different arithmetic operations as revealed by conjunct fMRI-BOLD activation
- Late cortical tracking of ignored speech facilitates neural selectivity in acoustically challenging conditions
- Towards a neural basis of auditory sentence processing
- The Brain Basis of Language Processing: From Structure to Function
- Number and language: how are they related?
- The Mental Number Line and the Human Angular Gyrus
- To retrieve or to calculate? Left angular gyrus mediates the retrieval of arithmetic facts during problem solving
- Individual differences in mathematical competence predict parietal brain activation during mental calculation
- MEG and EEG data analysis with MNE-Python
- MNE software for processing MEG and EEG data
- Interpreting magnetic fields of the brain: minimum norm estimates
- A network of topographic numerosity maps in human association cortex
- Topographic Representation of Numerosity in the Human Parietal Cortex
- The cortical organization of speech processing
- Auditory Attentional Control and Selection during Cocktail Party Listening
- Neural foundations of logical and mathematical cognition
- Sequential information processing during a mental arithmetic is reflected in the time course of event-related brain potentials
- Cortical dynamics during simple calculation processes: A magnetoencephalography study
- How specifically do we learn? Imaging the learning of multiplication and subtraction
- ERPs across arithmetic operations in a delayed answer verification task
- Low-frequency neural activity reflects rule-based chunking during speech listening
- The neural correlates of statistical learning in a word segmentation task: An fMRI study
- Linguistic Structure and Meaning Organize Neural Oscillations into a Content-Specific Hierarchy
- Music, language and meaning: brain signatures of semantic processing
- Modulation of neural activities by the complexity of mental arithmetic: An MEG study
- The Interface Between Language and Attention: Prosodic Focus Marking Recruits a General Attention Network in Spoken Language Comprehension
- Spectra-temporal patterns underlying mental addition: An ERP and ERD/ERS study
- Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution
- Discriminability of numerosity-evoked fMRI activity patterns in human intra-parietal cortex reflects behavioral numerical acuity
- A cortical network for semantics: (de)constructing the N400
- Mental Addition in Bilinguals: An fMRI Study of Task-Related and Performance-Related Activation
- Neuroplasticity, bilingualism, and mental mathematics: A behavior-MEG study
- Cortical encoding of acoustic and linguistic rhythms in spoken narratives
- Assessment of Multinormality and the Robustness of Hotelling's T2 Test
- The cortical representation of simple mathematical expressions
- Dissociating Prefrontal and Parietal Cortex Activation during Arithmetic Processing
- Thought Beyond Language: Neural Dissociation of Algebra and Natural Language
- Nonparametric permutation tests for functional neuroimaging: A primer with examples
- Auditory-evoked magnetic field codes place of articulation in timing and topography around 100 milliseconds post syllable onset
- Magnetic Brain Response Mirrors Extraction of Phonological Features from Spoken Vowels
- Neural Dissociation of Number from Letter Recognition and Its Relationship to Parietal Numerical Processing
- Scikit-learn: Machine Learning in Python
- Exact and Approximate Arithmetic in an Amazonian Indigene Group
- Beyond Hemispheric Dominance: Brain Regions Underlying the Joint Lateralization of Language and Arithmetic to the Left Hemisphere
- Modulation of Parietal Activation by Semantic Distance in a Number Comparison Task
- Decoding the processing stages of mental arithmetic with magnetoencephalography
- Distinct representations of subtraction and multiplication in the neural systems for numerosity and language
- The anatomy of language: contributions from functional neuroimaging
- Empirical validation of the triple-code model of numerical processing for complex math operations using functional MRI and group Independent Component Analysis of the mental addition and subtraction of fractions
- The Cortical Maps of Hierarchical Linguistic Structures during Speech Perception
- Fully complex magnetoencephalography
- Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inference
- Language and number: a bilingual training study
- Arithmetic processing in the brain shaped by cultures
- Constrained Structure of Ancient Chinese Poetry Facilitates Speech Content Grouping
- How are things adding up? Neural differences between arithmetic operations are due to general problem solving strategies
- The Response of Left Temporal Cortex to Sentences
- Agrammatic but numerate
- Effect of Language Switching on Arithmetic: A Bilingual fMRI Study
- Neural Correlates of Simple and Complex Mental Calculation
- The Influence of verbalization on the pattern of cortical activation during mental arithmetic
- Time-domain analysis of neural tracking of hierarchical linguistic structures