Lemkhenter, Abdelhak; Favaro, Paolo: Boosting Generalization in Bio-signal Classification by Learning the Phase-Amplitude Coupling. Pattern Recognition, 2021-03-17. DOI: 10.1007/978-3-030-71278-5_6

Abstract. Various hand-crafted feature representations of bio-signals rely primarily on the amplitude or power of the signal in specific frequency bands. The phase component is often discarded, as it is more sample-specific, and thus more sensitive to noise, than the amplitude. However, in general, the phase component also carries information relevant to the underlying biological processes. In fact, in this paper we show the benefits of learning the coupling of both the phase and amplitude components of a bio-signal. We do so by introducing a novel self-supervised learning task, which we call phase swap, that detects whether bio-signals have been obtained by merging the amplitude and phase from different sources. We show in our evaluation that neural networks trained on this task generalize better across subjects and recording sessions than their fully supervised counterparts.

Bio-signals, such as Electroencephalograms and Electrocardiograms, are multivariate time-series generated by biological processes that can be used to assess seizures, sleep disorders, head injuries, memory problems and heart diseases, just to name a few [19]. Although clinicians can successfully learn to interpret such bio-signals correctly, their protocols cannot be directly converted into a set of numerical rules yielding a comparable assessment performance. Currently, the most effective way to transfer this expertise into an automated system is to gather a large number of examples of bio-signals with the corresponding labeling provided by a clinician, and to use them to train a deep neural network. However, collecting such labeling is expensive and time-consuming.
In contrast, bio-signals without labeling are readily available in large numbers. Recently, self-supervised learning (SelfSL) techniques have been proposed to limit the amount of required labeled data. These techniques define a so-called pretext task that can be used to train a neural network in a supervised manner on data without manual labeling. The pretext task is an artificial problem, where a model is trained to output which transformation was applied to the data. For instance, a model could be trained to output the probability that a time-series has been time-reversed [25]. This step is often called pre-training, and it can be carried out on large data sets since no manual labeling is required. The training of the pre-trained neural network then continues with a small learning rate on the small target data set, where labels are available. This second step is called fine-tuning, and it yields a substantial boost in performance [21]. Thus, SelfSL can be used to automatically learn physiologically relevant features from unlabelled bio-signals and improve classification performance.

SelfSL is most effective when the pretext task focuses on features that are relevant to the target task. Typical features work with the amplitude or the power of the bio-signals, but, as shown in the literature, the phase also carries information about the underlying biological processes [2, 15, 20]. Thus, in this paper, we propose a pretext task to learn the coupling between the amplitude and the phase of bio-signals, which we call phase swap (PS). The objective is to predict whether the phase of the Fourier transform of a multivariate physiological time-series segment has been swapped with the phase of another segment. We show that features learned through this task help classification tasks generalize better, regardless of the neural network architecture.
Our contributions are summarized as follows:

- We introduce phase swap, a novel self-supervised learning task to detect the coupling between the phase and the magnitude of physiological time-series;
- With phase swap, we demonstrate experimentally the importance of incorporating the phase in bio-signal classification;
- We show that the learned representation generalizes better than current state-of-the-art methods to new subjects and to new recording sessions;
- We evaluate the method on four different data sets and analyze the effect of various hyper-parameters and of the amount of available labeled data on the learned representations.

Self-supervised Learning. Self-supervised learning refers to the practice of pre-training deep learning architectures on user-defined pretext tasks. This can be done on large volumes of unlabeled data, since the annotations can be generated automatically for these tasks. This is a common practice in the Natural Language Processing literature. Examples of such works include Word2Vec [17], where the task is to predict a word from its context, and BERT [3], where the model is pre-trained as a masked language model and on the task of detecting consecutive sentences. The self-supervision framework has also been gaining popularity in Computer Vision. Pretext tasks such as solving a jigsaw puzzle [21], predicting image rotations [5] and detecting local inpainting [11] have been shown to learn useful data representations for downstream tasks. Recent work explores the potential of self-supervised learning for EEG signals [1] and for time series in general [10]. In [1], the focus is on long-term/global tasks, such as determining whether two given windows are temporally nearby or not.

Deep Learning for Bio-signals. Bio-signals include a variety of physiological measures across time, such as the Electroencephalogram (EEG), Electrocardiogram (ECG), Electromyogram (EMG) and Electrooculography (EOG).
These signals are used by clinicians in various applications, such as sleep scoring [18] or seizure detection [23]. Similarly to many other fields, bio-signal analysis has also seen the rise in popularity of deep learning methods, for both classification [7] and representation learning [1]. The literature review [22] showcases the application of deep learning methods to various EEG classification problems, such as brain-computer interfaces, emotion recognition and seizure detection. The work by Banville et al. [1] leverages self-supervised tasks based on the relative temporal positioning of pairs/triplets of EEG segments to learn a useful representation for a downstream sleep staging application.

Phase Analysis. The phase component of bio-signals has been analyzed before. Busch et al. [2] show a link between the phase of the EEG oscillations, in the alpha (8-12 Hz) and theta (4-8 Hz) frequency bands, and the subjects' ability to perceive the flash of a light. The phase of the EEG signal has also been shown to be more discriminative for determining firing patterns of neurons in response to certain types of stimuli [20]. More recent work, such as [15], highlights the potential link between the phase of the different EEG frequency bands and cognition during proactive control of task switching.

In this section, we define the phase swap operator and the corresponding SelfSL task, and present the losses used for pre-training and fine-tuning. Let $D_{i,j} = \{(x_{i,j,k}, y_{i,j,k})\}_k$ be the set of samples associated with the i-th subject during the j-th recording session. Each sample $x_{i,j,k} \in \mathbb{R}^{C \times W}$ is a multivariate physiological time-series window, where C and W are the number of channels and the window size respectively, and $y_{i,j,k}$ is the class of the k-th sample. Let $F$ and $F^{-1}$ be the Discrete Fourier Transform operator and its inverse, respectively. These operators are applied to a given vector x extracted from the bio-signals; in the case of multivariate signals, we apply them channel-wise.
For the sake of clarity, we first define the element-wise absolute value and phase operators. Let $z \in \mathbb{C}$, where $\mathbb{C}$ denotes the set of complex numbers. The absolute value, or magnitude, of z is denoted $|z|$, and the phase of z is denoted $\angle z = z/|z|$, a unit-modulus complex number. With these definitions, we have the trivial identity $z = |z| \cdot \angle z$. Given two samples $x_{i,k}$ and $x_{i,k'}$, the phase swap (PS) operator is defined as

$x^{swap}_{i,k} = F^{-1}\big(\,|F(x_{i,k})| \odot \angle F(x_{i,k'})\,\big),$

where $\odot$ is the element-wise multiplication (see Fig. 1). Note that the energy per frequency is the same for both $x^{swap}_{i,k}$ and $x_{i,k}$, and that only the phase, i.e., the synchronization between the different frequencies, changes. Examples of phase swapping between different pairs of signals are shown in Fig. 2. Notice how the shape of the oscillations changes drastically when the PS operator is applied, and no trivial shared patterns seem to emerge.

The PS pretext task is defined as a binary classification problem. A sample belongs to the positive class if it has been transformed using the PS operator; otherwise it belongs to the negative class. In all our experiments, both inputs to the PS operator are sampled from the same patient during the same recording session. Because the phase is decoupled from the amplitude in white noise, our model has no incentive to detect noise patterns. On the contrary, it is encouraged to focus on the structural patterns in the signal in order to detect whether the phase and magnitude of the segment are coupled or not.

We use the FCN architecture proposed by Wang et al. [24] as our core neural network model $E : \mathbb{R}^{C \times W} \to \mathbb{R}^{H \times W/128}$. It consists of three convolution blocks, each using a Batch Normalization layer [8] and a ReLU activation, followed by a pooling layer. The output of E is then flattened and fed to two Softmax layers, $C_{Self}$ and $C_{Sup}$, which are trained on the self-supervised and supervised tasks respectively. Instead of a global pooling layer, we use an average pooling layer with a stride of 128.
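The PS operator above can be sketched in a few lines of numpy. This is an illustrative implementation under the paper's conventions (channel-wise DFT along the time axis); the function name `phase_swap` is ours. Because both inputs are real, the swapped spectrum stays conjugate-symmetric, so the inverse FFT is real up to numerical error.

```python
import numpy as np

def phase_swap(x_a, x_b):
    """Combine the magnitude spectrum of x_a with the phase spectrum of x_b.

    x_a, x_b: real arrays of shape (C, W). The FFT is applied channel-wise
    along the last (time) axis, mirroring the channel-wise DFT in the text.
    """
    fa = np.fft.fft(x_a, axis=-1)
    fb = np.fft.fft(x_b, axis=-1)
    # |F(x_a)| multiplied element-wise by the unit-modulus phase of F(x_b)
    swapped = np.abs(fa) * np.exp(1j * np.angle(fb))
    # Conjugate symmetry of `swapped` makes the inverse transform real;
    # .real only discards ~1e-15 numerical residue.
    return np.fft.ifft(swapped, axis=-1).real
```

A positive pretext sample is `phase_swap(x, x_prime)` for two windows `x`, `x_prime` of the same patient and session; a negative sample is an untouched window. Note that `phase_swap(x, x)` returns `x` itself, consistent with the identity $z = |z| \cdot \angle z$.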
This allows us to keep the number of weights of the supervised network $C_{Sup} \circ E$ constant when the self-supervised task is defined on a different window size. The overall framework is illustrated in Fig. 3. Note that the encoder network E is the same for both tasks. The loss function for training on the SelfSL task is the cross-entropy

$\mathcal{L}_{Self} = -\sum_{i} \sum_{k} y^{Self}_{i,k} \log\big(C_{Self} \circ E(x_i)\big)_k, \quad (2)$

where $y^{Self}_{i,k}$ and $(C_{Self} \circ E(x_i))_k$ are the one-hot representation of the true SelfSL pretext label and the predicted probability vector respectively. We optimize Eq. (2) with respect to the parameters of both E and $C_{Self}$. Similarly, we define the loss function for the (supervised) fine-tuning as the cross-entropy

$\mathcal{L}_{Sup} = -\sum_{i} \sum_{k} y^{Sup}_{i,k} \log\big(C_{Sup} \circ E(x_i)\big)_k, \quad (3)$

where $y^{Sup}_{i,k}$ denotes the label for the target task. The $y^{Sup/Self}$ label matrices are in $\mathbb{R}^{N \times K_{Sup/Self}}$, where N and $K_{Sup/Self}$ are the number of samples and classes respectively. In the fine-tuning, E is initialized with the parameters obtained from the optimization of Eq. (2) and $C_{Sup}$ with random weights; both are then updated to optimize Eq. (3), but with a small learning rate.

In our experiments, we use the Expanded SleepEDF [6, 12, 18], CHB-MIT [23] and ISRUC-Sleep [13] data sets, as they contain recordings from multiple patients. This allows us to study how well the learned feature representation generalizes to new recording sessions and new patients. The Expanded SleepEDF database contains two different sleep scoring data sets:

- Sleep Cassette Study (SC) [18]: collected between 1987 and 1991 in order to study the effect of age on sleep. It includes 78 patients with 2 recording sessions each (3 recording sessions were lost due to hardware failure);
- Sleep Telemetry Study (ST) [12]: collected in 1994 to study the effect of temazepam on sleep. It includes 22 patients with 2 recording sessions each.

The third data set we use in our experiments is the CHB-MIT data set [23], recorded at the Children's Hospital Boston from pediatric patients with intractable seizures. It includes multiple recording files across 22 different patients. We retain the 18 EEG channels that are common to all recording files.
The sampling rate for all channels is 256 Hz. The target task defined on this data set is predicting whether a given segment is a seizure event or not, i.e., a binary classification problem. For all the data sets, the international 10-20 system [16] was adopted for the positioning of the EEG electrodes. The last data set we use is ISRUC-Sleep [13], for sleep scoring as a 4-way classification problem. We use the 14 channels extracted in the Matlab version of the data set. This data set consists of three subgroups: subgroups I and II contain recordings from 100 and 8 subjects with sleep disorders respectively, whereas subgroup III contains recordings from 10 healthy subjects. This allows us to test the generalization from diagnosed subjects to healthy ones.

For the SC, ST and ISRUC-Sleep data sets, we resample the signals to 102.4 Hz. This resampling simplifies the neural network architectures we use, because most window sizes can then be represented by a power of 2, e.g., a window of 2.5 s corresponds to 256 samples. We normalize each channel per recording file in all data sets to have zero mean and a standard deviation of one.

In the supervised baseline (respectively, self-supervised pre-training), we train the randomly initialized model $C_{Sup} \circ E$ (respectively, $C_{Self} \circ E$) on the labeled data set for 10 (respectively, 5) epochs using the Adam optimizer [14] with a learning rate of $10^{-3}$ and $\beta = (0.9, 0.999)$. We balance the classes present in the data set using re-sampling (there is no need to balance classes in the self-supervised learning task). In the fine-tuning, we initialize E's weights with those obtained from the SelfSL training and then train $C_{Sup} \circ E$ on the labeled data set for 10 epochs using the Adam optimizer [14], but with a learning rate of $10^{-4}$ and $\beta = (0.9, 0.999)$. As in the fully supervised training, we also balance the classes using re-sampling. In all training cases, we use a default batch size of 128.
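The per-channel normalization described above can be sketched as follows; `normalize_recording` is an illustrative name, and the small `eps` guard against flat channels is our addition (the resampling to 102.4 Hz would be done separately with a standard resampler):

```python
import numpy as np

def normalize_recording(x, eps=1e-8):
    """Normalize each channel of one recording file to zero mean and
    unit standard deviation, as described in the text.

    x: array of shape (C, T), one full recording with C channels.
    eps: guard against division by zero on flat channels (our addition).
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
```

Normalization is applied per recording file (not per window), so windows cut from the same recording share the same statistics.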
We evaluate our self-supervised framework using the following models:

- PhaseSwap: the model is pre-trained on the self-supervised task and fine-tuned on the labeled data;
- Supervised: the model is trained solely in a supervised fashion;
- Random: $C_{Sup}$ is trained on top of a frozen, randomly initialized E;
- PSFrozen: we train $C_{Sup}$ on top of the frozen weights of the model E pre-trained on the self-supervised task.

We evaluate our models on train/validation/test splits. In total, we use at most 4 sets, which we refer to as the training set, the Validation Set, and Test sets A and B. We use the same set of recordings and patients for both the training of the self-supervised and supervised tasks. For the ST, SC and ISRUC data sets, we use class re-balancing only during the supervised fine-tuning. For the CHB-MIT data set, however, the class imbalance is much more extreme: the data set consists of less than 0.4% positive samples. Because of that, we under-sample the majority class during both the self-supervised and supervised training. This prevents the self-supervised features from completely ignoring the positive class.

Unless specified otherwise, we use $W_{Self}$ = 5 s and $W_{Sup}$ = 30 s for the ISRUC, ST and SC data sets, and $W_{Self}$ = 2 s and $W_{Sup}$ = 10 s for the CHB-MIT data set, where $W_{Self}$ and $W_{Sup}$ are the window sizes for the self-supervised and supervised training respectively. For the ISRUC, ST and SC data sets, the choice of $W_{Sup}$ corresponds to the granularity of the provided labels. For the CHB-MIT data set, although labels are provided at a rate of 1 Hz, the neuroscience literature usually defines a minimal duration of around 10 s for an epileptic event in humans [4], which motivates our choice of $W_{Sup}$ = 10 s.

As an evaluation metric, we use the balanced accuracy, defined as the average of the per-class recall values:

$bACC = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{i=1}^{N} y_{i,k}\,\hat{y}_{i,k}}{\sum_{i=1}^{N} y_{i,k}},$

where K, N, y and $\hat{y}$ are respectively the number of classes, the number of samples, the one-hot representation of the true labels and the predicted labels.
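The balanced accuracy metric can be sketched directly from its definition as the mean of the per-class recalls; the function below takes integer label arrays rather than one-hot matrices, which is equivalent:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, num_classes):
    """Average of per-class recalls.

    y_true, y_pred: integer label arrays of shape (N,).
    Classes absent from y_true are skipped, since their recall is undefined.
    """
    recalls = []
    for k in range(num_classes):
        mask = (y_true == k)
        if mask.sum() == 0:
            continue
        # recall for class k: fraction of true class-k samples predicted as k
        recalls.append((y_pred[mask] == k).mean())
    return float(np.mean(recalls))
```

Unlike plain accuracy, this metric is insensitive to class imbalance, which matters for CHB-MIT where positive (seizure) samples are below 0.4% of the data.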
We explore the generalization of the self-supervised trained model by varying the number of different patients used in the training set for the SC data set. Here, $r_{train}$ is the percentage of patient identities used for training, in the Validation Set and in Test set A. In Table 1, we report the balanced accuracy on all test sets for various values of $r_{train}$. The self-supervised training was done using a window size of $W_{Self}$ = 5 s. We observe that the PhaseSwap model performs the best for all values of $r_{train}$. We also observe that the performance gap between the PhaseSwap and Supervised models is largest for the smallest values of $r_{train}$.

Using the ISRUC-Sleep data set [13], we aim to evaluate the performance of the PhaseSwap model on healthy subjects when it is trained on subjects with sleep disorders. For the self-supervised training, we use $W_{Self}$ = 5 s. The results are reported in Table 2. Note that we combined the recordings of subgroup II and the ones not used for training from subgroup I into a single test set, since they are all from subjects with sleep disorders. We observe that for both experiments, $r_{train}$ = 25% and $r_{train}$ = 50%, the PhaseSwap model outperforms the supervised baseline on both test sets. Notably, the performance gap on subgroup III is larger than 10%. This can be explained by the fact that sleep disorders can drastically change the sleep structure of the affected subjects, which in turn leads the supervised baseline to learn features that are specific to the disorders/subjects present in the training set.

The Relative Positioning (RP) task was introduced by Banville et al. [1] as a self-supervised learning method for EEG signals, which we briefly recall here. Given $x_t$ and $x_{t'}$, two samples with a window size W and starting points t and t' respectively, the RP task defines the labels

$y_{RP}(t, t') = \mathbb{1}(|t - t'| \le \tau_{pos}) - \mathbb{1}(|t - t'| > \tau_{neg}),$

where $\mathbb{1}$ is the indicator function, $|\cdot|$ denotes the absolute value, and $\tau_{pos}$ and $\tau_{neg}$ are predefined quantities. Pairs that yield $y_{RP} = 0$ are discarded.
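The RP labeling rule, as we read it from the definition above, can be sketched as a small function; `rp_label` is our illustrative name, and returning `None` stands for the discarded intermediate pairs:

```python
def rp_label(t, t_prime, tau_pos, tau_neg):
    """Relative-positioning label for two windows starting at t and t_prime.

    Returns +1 for temporally close pairs (gap <= tau_pos),
    -1 for distant pairs (gap > tau_neg), and None for pairs in the
    intermediate band, which are discarded during training.
    """
    gap = abs(t - t_prime)
    if gap <= tau_pos:
        return 1
    if gap > tau_neg:
        return -1
    return None
```

With the setting used in our comparison, $\tau_{pos} = \tau_{neg}$, the intermediate band is empty and every sampled pair receives a label.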
Next, we compare our self-supervised task to the RP task [1]. For both settings, we use $W_{Self}$ = 5 s and $r_{train}$ = 20%. For the RP task, we choose $\tau_{pos} = \tau_{neg} = 12 \times W_{Self}$. We report the balanced accuracy for all test sets on the SC data set in Table 3. We observe that our self-supervised task outperforms the RP task. This means that the features learned through the PS task allow the model to perform better on unseen data.

In this section, we evaluate our framework on the ST and CHB-MIT data sets. For the ST data set, we use $W_{Self}$ = 1.25 s, $W_{Sup}$ = 30 s and $r_{train}$ = 50%. For the CHB-MIT data set, we use $W_{Self}$ = 2 s, $W_{Sup}$ = 10 s, $r_{train}$ = 25% and 30 epochs for the supervised fine-tuning/training. As shown in Table 4, we observe that for the ST data set the features learned through the PS task produce a significant improvement, especially on Test sets A and B. For the CHB-MIT data set, the PS task fails to provide the performance gains observed on the previous data sets. We believe that this is because the PS task is too easy on this particular data set: notice how the validation accuracy is above 99%. With a trivial task, self-supervised pre-training fails to learn any meaningful feature representations. In order to make the task more challenging, we introduce a new variant, which we call PS + Masking, where we randomly zero out all but 6 randomly selected channels of each sample during the self-supervised pre-training. The model obtained through this scheme performs the best on both sets A and B and is comparable to the Supervised baseline on the validation set.

As for the reason why the PS training was trivial on this particular data set, we hypothesize that this is due to the high spatial correlation in the CHB-MIT data set samples. This data set contains a high number of homogeneous channels (all of them are EEG channels), which results in a high spatial resolution of the brain activity.
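The channel masking used in the PS + Masking variant can be sketched as follows; `mask_channels` is an illustrative name for the augmentation applied to each sample during the self-supervised pre-training:

```python
import numpy as np

def mask_channels(x, keep=6, rng=None):
    """Zero out all but `keep` randomly selected channels of a sample.

    x: array of shape (C, W), e.g. (18, W) for CHB-MIT.
    keep: number of channels left untouched (6 in the PS + Masking variant).
    """
    rng = np.random.default_rng() if rng is None else rng
    kept = rng.choice(x.shape[0], size=keep, replace=False)
    masked = np.zeros_like(x)
    masked[kept] = x[kept]
    return masked
```

Applying a fresh random mask to every sample lowers the effective spatial resolution seen by the encoder, so it can no longer solve the pretext task by detecting broken spatial coherence alone.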
At such a spatial resolution, the oscillations due to the brain activity are correlated both in space and time [9]. However, our PS operator ignores the spatial aspect of the oscillations. When applied, it often corrupts the spatial coherence of the signal, which is then easier to detect than the temporal phase-amplitude incoherence. This hypothesis is supported by the fact that random channel masking, which reduces the spatial resolution during the self-supervised training, yields a lower training accuracy, i.e., it makes the task non-trivial.

In this section, we analyze the effect of the window size $W_{Self}$ used for the self-supervised training on the final performance. We report the balanced accuracy on all our test sets for the SC data set in Table 5. For all these experiments, we use 20% of the identities in the training set. The capacity of the Supervised model $C_{Sup} \circ E$ is independent of $W_{Self}$ (see Sect. 3), and thus so is its performance. We observe that the best performing models are the ones using $W_{Self}$ = 2.5 s for the Validation Set and $W_{Self}$ = 5 s for sets A and B. We argue that the features learned by the self-supervised model are less specific for larger window sizes. The PS operator drastically changes structured parts of the time series, but barely affects pure noise segments. As discussed in Sect. 3, white noise is invariant with respect to the PS operator. With smaller window sizes, most segments are either noise or structured patterns, but as the window size grows, each window becomes a combination of the two.

In Table 6, we analyze the effect of freezing the weights of E during the supervised fine-tuning. We compare the performance of the four variants described in Sect. 4.2 on the SC data set. All variants use $W_{Self}$ = 5 s, $W_{Sup}$ = 30 s and $r_{train}$ = 20%. As expected, we observe that the PhaseSwap variant is the most performant one, since it is less restricted in terms of training procedure than PSFrozen and Random.
Moreover, PSFrozen outperforms the Random variant on all test sets and is on par with the Supervised baseline on Test set B. This confirms that the features learned during pre-training are useful for the downstream classification even when the encoder model E is frozen during fine-tuning. The last variant, Random, allows us to disentangle the contribution of the self-supervised task from the prior imposed by the choice of architecture for E. As we can see in Table 6, the performance of the PhaseSwap variant is significantly higher than that of the latter variant, confirming that the chosen self-supervised task is the main factor behind the performance gap.

Most of the experiments in this paper use the FCN architecture [24]. In this section, we illustrate that the performance boost of the PhaseSwap method does not depend on the neural network architecture. To do so, we also analyze the performance of a deeper architecture, namely the Residual Network (ResNet) proposed by Humayun et al. [7]. We report in Table 7 the balanced accuracy computed on the SC data set for two choices of $W_{Self} \in \{5\,\text{s}, 30\,\text{s}\}$ and two choices of $r_{train} \in \{20\%, 100\%^*\}$. The table also contains the performance of the FCN model trained using the PS task as a reference. We do not report the results for the RP experiment using $W_{Self}$ = 30 s, as we did not manage to make the self-supervised pre-training converge. All ResNet models were trained for 15 epochs of supervised fine-tuning. For $r_{train}$ = 20%, we observe that pre-training the ResNet on the PS task outperforms both the supervised and RP pre-training. We also observe that, for this setting, the model pre-trained with $W_{Self}$ = 30 s performs better on both the validation set and Test set B than the one pre-trained using $W_{Self}$ = 5 s. Nonetheless, the model using the simpler architecture still performs best on those sets and is comparable to the best performing one on set A.
We believe that the lower capacity of the FCN architecture prevents it from learning feature representations that are too specific to the pretext task, compared to the ones learned with the more powerful ResNet. For the setting $r_{train}$ = 100%*, the supervised ResNet is on par with a model pre-trained on the PS task with $W_{Self}$ = 30 s. Recall that $r_{train}$ = 100%* refers to the setting where all recording sessions and patients are used for the training set. Based on these results, we conclude that there is a point of diminishing returns in terms of available data, beyond which the self-supervised pre-training might even deteriorate the performance of the downstream classification task.

We have introduced the phase swap pretext task, a novel self-supervised learning approach suitable for bio-signals. This task aims to detect when bio-signals have mismatching phase and amplitude components. Since the phase and amplitude of white noise are uncorrelated, features learned with the phase swap task do not focus on noise patterns. Moreover, these features exploit signal patterns present in both the amplitude and phase domains. We have demonstrated the benefits of learning features from the phase component of bio-signals in several experiments and comparisons with competing methods. Most importantly, we find that pre-training a neural network with limited capacity on the phase swap task builds features with a strong generalization capability across subjects and recording sessions. One possible future extension of this work, as suggested by the results on the CHB-MIT data set [23], is to incorporate spatial correlations in the PS operator through the use of a spatio-temporal Fourier transform.

References

1. Self-supervised representation learning from electroencephalography signals
2. The phase of ongoing EEG oscillations predicts visual perception
3. BERT: pre-training of deep bidirectional transformers for language understanding
4. How can we identify ictal and interictal abnormal activity?
5. Unsupervised representation learning by predicting image rotations
6. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals
7. End-to-end sleep staging with raw single channel EEG using deep residual convnets
8. Batch normalization: accelerating deep network training by reducing internal covariate shift
9. Spatial and temporal structure of phase synchronization of spontaneous alpha EEG activity
10. Self-supervised learning for semi-supervised time series classification
11. Steering self-supervised feature learning beyond local pixel statistics
12. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the EEG
13. ISRUC-Sleep: a comprehensive public dataset for sleep researchers
14. Adam: a method for stochastic optimization
15. Dynamic low frequency EEG phase synchronization patterns during proactive control of task switching
16. Bioelectromagnetism: Principles and Applications of Bioelectric and Biomagnetic Fields
17. Efficient estimation of word representations in vector space
18. Age and gender affect different characteristics of slow waves in the sleep EEG
19. Advanced Biosignal Processing
20. EEG phase patterns reflect the selectivity of neural firing
21. Unsupervised learning of visual representations by solving jigsaw puzzles
22. Deep learning-based electroencephalography analysis: a systematic review
23. Application of machine learning to epileptic seizure onset detection and treatment
24. Time series classification from scratch with deep neural networks: a strong baseline
25. Learning and using the arrow of time