key: cord-0659576-rz2nlk43 authors: Baki, Pinar title: A Multimodal Approach for Automatic Mania Assessment in Bipolar Disorder date: 2021-12-17 journal: nan DOI: nan sha: 64d2974a764a4a6abede2df9da015d70fa319b5a doc_id: 659576 cord_uid: rz2nlk43

Bipolar disorder is a mental health disorder that causes mood swings ranging from depression to mania. Diagnosis of bipolar disorder is usually based on patient interviews and reports obtained from the patients' caregivers. Consequently, the diagnosis depends on the experience of the expert, and the disorder can be confused with other mental disorders. Automated processes can help provide quantitative indicators for the diagnosis of bipolar disorder and allow easier observation of patients over longer periods. Furthermore, the need for remote treatment and diagnosis became especially important during the COVID-19 pandemic. In this thesis, we create a multimodal decision system based on recordings of the patient in the acoustic, linguistic, and visual modalities. The system is trained on the Bipolar Disorder corpus. Comprehensive analyses of unimodal and multimodal systems, as well as various fusion techniques, are performed. Besides processing entire patient sessions using unimodal features, a task-level investigation of the clips is carried out. Using acoustic, linguistic, and visual features in a multimodal fusion system, we achieve a 64.8% unweighted average recall score, which improves on the state-of-the-art performance reported on this dataset.

Bipolar disorder involves episodes of mania and depression, as well as mixed episodes where depressive and manic symptoms occur together. The diagnosis of bipolar disorder requires lengthy observation of the patient; otherwise, it can be mistaken for other mental disorders such as anxiety or depression. The disease affects 2% of the population, and sub-threshold forms (recurrent hypomania episodes without major depressive episodes) affect an additional 2% [1]. It is ranked among the top ten diseases according to the disability-adjusted life year (DALY) indicator for young adults by the World Health Organization [2]. On average, it takes 10 years to diagnose bipolar disorder after the first symptoms appear [3].

In bipolar disorder, the clinical appearance of patients changes with the mood they are in. The changes are seen in their voice and visual appearance, as well as in their energy level. In the manic episode, the speech of the patient becomes louder, rushed, or pressured. The patient can be very cheerful, furious, or overly confident. The movements of the patient become more active and exaggerated, and they tend to wear very colorful clothes. Feelings and the state of mind change quickly. Racing thoughts, reduced need for sleep, lack of attention, and an increase in goal-directed activity (work, school, personal life) are some of the situations patients can experience in the manic episode. These symptoms return to a normal state during remission [4].

Today, the diagnosis of mental health disorders relies on questionnaires administered by psychiatrists and on reports from patients and their caregivers. Psychiatrists perform tests to collect information about the patient's cognitive, neurophysiological, and emotional situation [4]. However, these reports are subjective, and there is a need for more systematic and objective diagnostic methods. Especially with the COVID-19 pandemic, remote treatment and diagnosis have gained importance, and they can be supported by automated methods.
One of the tools used to rate the severity of a patient's manic episodes is the Young Mania Rating Scale (YMRS). During the interviews, psychiatrists observe the patient's symptoms and rate them. The 11 items of the YMRS assess elevated mood, increased motor activity-energy, sexual interest, sleep, irritability, speech rate and amount, language-thought disorder, content, disruptive-aggressive behavior, appearance, and insight. Most of these can be observed from speech patterns, body or facial movements, and the content of what is spoken during the interview.

Assessment of mental health disorders using machine learning methods has been an active research area. Many researchers are working on recognizing mental health disorders ranging from depression, Alzheimer's disease, and anxiety to bipolar disorder. Interdisciplinary research between psychiatrists and computer scientists helps create new datasets and bring insights from the medical domain to artificial intelligence. The datasets used in the prediction of mental health disorders contain various data types [5]. They are either collected by psychiatrists, such as electronic health records [12], surveys [13], interviews [6], clinical assessments [14], and brain imaging scans [15], or gathered from the personal information of patients outside the clinic, such as social media posts [16], suicide notes [17], or wearable sensor data [18]. These datasets contain visual, auditory, textual, or biological information, which allows researchers to develop algorithms using computer vision, signal processing, speech processing, or natural language processing models. Some of the datasets are suitable for using several modalities together, which resembles the human decision-making process. For instance, visual, auditory, and textual features can be extracted from patient interviews recorded on video. State-of-the-art results are achieved by the fusion of modalities [19].

Acoustic and visual cues are used in the detection of major depressive disorder in [20]. They use motion history histograms to extract dynamic features from video and audio data and to represent the subtle changes of emotion in depression. Decision-level fusion of the audio and visual modalities demonstrates the effectiveness of the proposed model. In [21], facial action and vocal prosody (suprasegmental) features are extracted from patient interviews conducted by a clinical interviewer. Vocal prosody features provide information about the sound of language beyond its meaning, such as rhythm, stress, and intonation. A support vector machine (SVM) is used to classify the facial action unit features, and logistic regression to classify the vocal prosody features. Both modalities give promising results separately; however, fusion of the audio-visual features was not performed in that work. Another work on the recognition of depression applies hierarchical classifier systems to vocal prosody features and local appearance descriptors extracted from the faces of the patients. A Kalman filter is used for the fusion of the modalities [22], which enables the system to perform better in real time and to deal with sensor failures. Their late fusion method cannot outperform the results obtained from the auditory and visual modalities separately. They state that the performance gap between the audio and video modalities is the reason for the performance drop in the fusion results.
Similarly, audio and visual modalities are commonly used in the detection of bipolar disorder. One of the early works on the classification of bipolar disorder [23] presents the University of Michigan Prechter Acoustic Database, which contains cellular phone recordings of bipolar disorder patients.

Muaremi et al. [25] collected a cellular phone dataset from 12 bipolar patients of a psychiatric hospital. Using the openSMILE toolkit, they extract root mean square energy, mel-frequency cepstral coefficients (MFCCs), pitch, harmonics-to-noise ratio, and zero-crossing rate, and summarize these low-level descriptors (LLDs) with 12 functionals. Besides these acoustic features, they also experiment with phone call statistics, such as the number of phone calls during the day and the average duration of the calls, and social signal processing features, such as average speaking length and average number of speaker turns. Among the three feature sets, the acoustic features perform best. The highest performance is achieved with the early fusion of the three modalities. Using a random forest classifier, an F1 score of 83% is achieved on two classes (manic vs. normal or depressive vs. normal).

Besides speech cues, motor activity related information (body movement, motor response time, level of psychomotor activity) is used for BD classification in [26]. The speech data is collected from the cell phones of the BD patients. During the conversations over the phone, motor activity data is collected from the accelerometer of the phone. Information related to motor activity is also collected from self-assessment questionnaires regarding the patient's psychological state, physical state, and activity level. Their results suggest that the fusion of accelerometer features with the speech-related features gives 82% accuracy in the classification of a manic episode. With the information from the questionnaires, the final result is improved slightly to 85%. However, they argue that the use of questionnaires may harm the fully autonomous nature of the system. In [27], dialogue features extracted from the assessment phone calls between the patient and the clinician are used to identify mood episodes.

The Audio/Visual Emotion Challenge (AVEC) was held for the eighth time in 2018. The mission of the AVEC series is to create a common benchmark and to push the boundaries of audio-visual emotion and health recognition. Some of the previous challenge topics were prediction of self-reported severity of depression, detection of discrete emotion classes, prediction of continuous-valued dimensional affect, depression analysis from human-agent interactions, and emotion recognition from human behaviors captured in the wild. In the 2018 AVEC Challenge, a Bipolar Disorder (BD) corpus was made available [28]. Several groups worked on this corpus within the AVEC Challenge, where the goal was to determine the state of the patient given a short video sequence containing several pre-determined tasks [29-35]. Our research group, as the creator of the bipolar challenge, did not participate in the challenge, but provided the baseline and the protocol. As the performance metric, the unweighted average recall (UAR) score was used during the challenge. Throughout this study, we also use UAR to present the results, so that our findings can be compared with previous studies. In more detail, UAR is the unweighted average of the class-specific recalls obtained from the system for each of the three classes.
Most of the works in the challenge extract both audio and visual features, and apply either decision-level or feature-level fusion [29, 31-33]. All of them obtain their best results using a fusion of these modalities. A large gap between validation and test performance in such systems shows that the model learns the training data too well but cannot generalize to the unseen test data, which is called overfitting.

Fisher vector encoding is a popular aggregation method mostly used in image classification or retrieval problems [40]. Recently, it has also been applied to several signal processing problems with promising results [41]. The work in [32] uses this approach with the Computational Paralinguistics ChallengE (ComParE) feature set. They also propose turbulence features that represent sudden changes in the feature contours of both the audio and visual modalities. The classification is done using a Greedy Ensemble of Weighted Extreme Learning Machines (ELMs) [42], where many weighted ELMs are trained and the ones whose UAR score exceeds a fixed threshold on the validation set are selected. Turbulence features extracted from the visual modality achieve the best test set result of the challenge.

A couple of papers use deep learning methods on the BD set. There are 218 samples from 46 individuals in the BD corpus, so deep learning based models often overfit on this corpus and show a significant drop in test set performance compared to the validation set. In [30], this problem is handled using L1 regularization in a network consisting of an Inception module combined with a long short-term memory (LSTM) network. Using the visual modality, the baseline test set score is achieved in [37]. To deal with the small size of the BD corpus, a Capsule Neural Network (CapsNet) [44] is used in [34]. In CapsNet, the pooling layers of the Convolutional Neural Network (CNN) are replaced with capsules, groups of neurons that allow the model to learn spatial relationships between different parts of the data (mostly images), so that different transformations of the data can be recognized without a loss in performance, which makes the model more efficient on small datasets. Mel-frequency spectrograms are extracted from short segments of the raw audio files to train the CapsNet model. Two more audio representation learning frameworks are also evaluated, and the best test result obtained is 49.8%. As stated in the paper, due to the high computational cost of optimizing the CapsNet hyperparameters, the authors could not fully tune the parameters and evaluate the best result achievable with CapsNet.

Another technique that can be used when classifying small datasets with deep learning models is multi-instance learning. In [35], audio clips are segmented into chunks to increase the dataset size. However, each clip has only one label, and after segmenting the clip, each chunk becomes weakly labeled. For example, a clip may be labeled as 'mania', but a small chunk from that clip may not exhibit any 'mania' features. This problem is addressed with multi-instance learning, where training is performed on bags of weakly labeled chunks. In [36], a multimodal deep learning framework is evaluated on the BD corpus and on the Extended Distress Analysis Interview Corpus (E-DAIC) [45], which is used in the AVEC 2019 challenge [46]. Experimental results show that multimodal frameworks increase the classification performance on mental disorder recognition tasks. On the BD corpus, a 70.9% UAR score is achieved on the development set.
Using 10-fold cross-validation, 60.0% UAR is achieved, which shows that the proposed framework does not overfit the data. In [38], a 93.2% UAR score is achieved on the development set using the acoustic, visual, and textual modalities. However, no test set score is presented. Since the BD dataset is small and prone to overfitting, test scores would be needed to evaluate the system more reliably.

In our experiments, we use the ELM method as the classification algorithm. The BD corpus is an imbalanced dataset, so we also experiment with the Weighted ELM (WELM) method. WELM assigns a weight to each sample in a way that strengthens the minority class (explained in detail in Section 4.6). However, the weights are assigned based on sample counts and may not be optimal. Wang et al. [47] propose the Deep WELM (DWELM) method to address this problem. DWELM combines an enhanced ELM with an enhanced AdaBoost algorithm. The enhanced ELM is created by replacing the linear ELM with a regularized ELM and adding shortcut connections between the building blocks, and the enhanced AdaBoost model, in which the weights are updated for both misclassified and correctly classified samples, is embedded into the DWELM algorithm. Their experimental results show that the proposed algorithm is effective on both binary and multiclass classification problems. In [48], the imbalanced learning problem of the ELM model is addressed using genetic algorithms. They propose a weighted and cost-sensitive ELM model that uses a cost matrix in the weighted least squares method to assign a different weight to each sample, with a genetic algorithm used to obtain the optimal costs. Their experiments show that the cost-sensitive weighted least squares approach performs better than the WELM model.

In this work, we use the Turkish Audio-Visual Bipolar Disorder (BD) Corpus [6], which was also used for the 2018 AVEC Bipolar Disorder and Cross-cultural Affect Recognition Competition [28], as discussed in the previous chapter. Participants were encouraged to achieve the highest performance, considering the baseline performance given by the organizers. The BD corpus contains video clips of 46 bipolar disorder patients and 49 healthy controls from the mental health service of a hospital. The mood of the patients was evaluated using the YMRS and the Montgomery-Asberg Depression Rating Scale (MADRS) on the 0th, 3rd, 7th, and 28th days of hospitalization and in the 3rd month after discharge. On those days, psychiatrists interviewed the patients, asking the same questions each time, and made audiovisual recordings of the sessions. Annotation was done based on the YMRS score [49]. The YMRS is a clinical interview assessment scale used for rating the severity of a patient's manic episodes. Scores range from 0 to 60, where higher scores represent more severe mania. In the BD corpus, bipolar patients are grouped into three classes based on their YMRS score in a session. Grouping follows the scheme below, where Y_t represents the YMRS score of session t: remission (Y_t <= 7), hypomania (7 < Y_t < 20), and mania (Y_t >= 20).

As presented in the AVEC competition, there are 104, 60, and 54 clips in the training, development, and test sets, respectively. Due to the difficulties and ethical issues of collecting healthcare data, such datasets typically contain a small number of recordings, so the corpus size should be considered while working on the problem in order to avoid overfitting and achieve better generalizability.
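The grouping scheme above can be expressed as a small helper function; this is a minimal sketch, and the thresholds are the AVEC 2018 boundaries as reconstructed above, so they should be treated as assumptions rather than values quoted directly from the thesis.

```python
def ymrs_to_state(ymrs_score: int) -> str:
    """Map a session's YMRS score to a BD state label.

    Thresholds assumed from the AVEC 2018 grouping scheme:
    remission (<= 7), hypomania (8-19), mania (>= 20).
    """
    if ymrs_score <= 7:
        return "remission"
    if ymrs_score < 20:
        return "hypomania"
    return "mania"


# Example: scores 5, 12, and 25 map to the three classes.
print([ymrs_to_state(s) for s in (5, 12, 25)])
```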
Clips were recorded in a room where only the participant and the clinician were present. The participants were recorded with a camera while performing tasks. They read the description of each task from a computer screen. After completing a task, they pressed the space bar and the description of the next task appeared on the screen. When the space bar was pressed, a 'knock' sound marked the beginning of a new task. This sound helps to split the recordings into tasks when the tasks are to be used separately for classification. In order to provide baseline results on the data, the creators of the corpus investigate the audio and visual modalities, experiment with both classification and regression models, and propose two approaches, which they call the direct and indirect approaches.

In this chapter, we introduce the features used in the audio, textual, and visual modalities, as well as the preprocessing and feature selection methods applied to the dataset. After that, we explain the ELM algorithm used as the classification method, the cross-validation technique used to evaluate the results, and the modality fusion methods applied to improve the unimodal results.

Feature extraction is the initial stage of most machine learning problems; the aim is to obtain representations of the input that are useful for the pattern recognition steps that follow. For audio feature extraction, we use the openSMILE feature extraction toolkit [24], which provides many built-in configuration files. In our experiments, we use the IS10 [52], eGEMAPS [53], and MFCC feature sets. The IS10 paralinguistic challenge consists of three sub-challenges, namely age, gender, and affect. The IS10 feature set was provided to the participants to be used in the audio classification of the sub-challenge problems. It contains 38 low-level descriptors and their temporal derivatives, as can be seen in Table 4.1. The features were chosen for their ability to represent affective physiological changes in voice production. MFCC features are widely used in speech recognition tasks. They represent the phonemes through the shape of the vocal tract and give information about the human voice perception mechanism. The LLDs of all three feature sets, summarised using the 10 functionals used during the experiments, are listed in Table 3.1.

In the recognition of bipolar disorder, clinicians assess the presence of risk of suicide, risk of violence to persons or property, risk-taking behavior, sexually inappropriate behavior, substance abuse, the patient's ability to care for himself/herself, etc. [54]. These can be deduced from what patients say during the interviews of the BD dataset. For the textual feature extraction, the text version of the interviews is obtained from the audio files using the Google Automatic Speech Recognition (ASR) tool. Since the audio files were clipped into tasks for the audio experiments, transcripts for the tasks were extracted as well. The extracted transcripts contained mistakes, since some words were not heard well. Therefore, we manually transcribed the third task, which asks the patient to describe a sad memory, to further examine the results in a setting with no transcription errors.

As linguistic features, we use Linguistic Inquiry and Word Count (LIWC) [58] features, term frequency-inverse document frequency (tf-idf) features, and polarity features. LIWC counts words in psychologically meaningful categories, including personal concern categories (e.g., work, home, leisure activities), five informal language markers (assents, fillers, swear words, netspeak), and 12 punctuation categories (periods, commas, etc). Tf-idf is a statistical measure that shows how important a word is in a document. Tf-idf features are commonly used in NLP [59, 60], information retrieval [61], and text mining [62] tasks.
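As a minimal sketch of how clip-level acoustic descriptors like those above can be obtained, the snippet below uses the opensmile Python wrapper; the thesis works with openSMILE configuration files directly, so the wrapper calls, the eGeMAPSv02 version, and the example file name are illustrative assumptions.

```python
import opensmile

# Clip-level eGeMAPS functionals (one 88-dimensional row per audio file).
# The feature-set version and the example file name are assumptions.
smile_func = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
clip_features = smile_func.process_file("session_001.wav")

# Frame-level LLDs, which can later be summarised with custom functionals
# (e.g., to build an eGEMAPS10-style representation).
smile_lld = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
llds = smile_lld.process_file("session_001.wav")

print(clip_features.shape, llds.shape)
```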
As a preprocessing step, stop words are removed using the English and German stop-word dictionaries from the NLTK library [63], and stemming is applied using the Porter stemmer algorithm [64]. After these steps, tf-idf features are computed over the set of uni-grams and bi-grams.

As polarity features, we use the outputs of three sentiment analysis tools together, namely the Natural Language Toolkit Valence Aware Dictionary for sEntiment Reasoning (NLTK Vader) [65], TextBlob [66], and Flair [67], since they have different strengths. NLTK Vader is one of the most popular sentiment analysis tools. It uses a sentiment lexicon together with grammatical rules to express polarity. A sentiment lexicon is a dictionary that holds sentiment scores for words, phrases, and emoticons; however, this approach causes the algorithm to perform weakly on unseen words. The algorithm also handles other linguistic cues of sentiment, such as capitalization, punctuation, and adverbs, using heuristics. The TextBlob library performs many NLP tasks, such as tokenization, lemmatization, part-of-speech tagging, and n-gram extraction, as well as sentiment analysis. It returns the sentiment as polarity and subjectivity scores, where subjectivity represents the amount of personal versus factual information in the sentence, which is a useful feature for the valence dimension. However, TextBlob does not consider negation when computing the polarity score, which can be misleading. Flair uses a character-level LSTM network for sentiment analysis, so it can also handle unseen words. The sentiment and subjectivity features obtained from each library are combined into a feature vector, and each feature is then summarized with five functionals, namely the mean, standard deviation, maximum, minimum, and sum.

Clinicians gain significant insight from visual cues in the recognition of bipolar disorder. Some of the scoring items of the YMRS, such as increased motor activity-energy, irritability, elevated mood, appearance, and disruptive-aggressive behavior, can be assessed from visual cues. Moreover, the speech rate and amount can also be observed from facial actions. For the visual experiments, we use facial action units (FAUs), geometric features extracted from each face, and appearance descriptors. All three were presented by the dataset owners as baseline features: the FAU features in the AVEC challenge paper [28], and the other two in the Asian Conference on Affective Computing and Intelligent Interaction (ACII) paper [6]. Lastly, in [6], the authors extracted appearance descriptors from the faces using a DCNN pre-trained on a face emotion corpus. As stated in the paper, this approach has been applied to emotion and apparent personality trait recognition tasks in uncontrolled conditions and gives promising results [68]. From the last convolutional layer of the DCNN, 4,096-dimensional features are extracted and then summarised using the mean and standard deviation functionals.

For the experiments, we use L1-based and tree-based feature selection methods, and principal component analysis (PCA). Regularization is the process of adding a penalty on the model coefficients to reduce overfitting. When a linear model is regularized with the L1 norm, some coefficients may become exactly zero, so the corresponding features can be removed from the model. Thus, L1-based regularization of a linear model can be used as a feature selection method. In our experimental setup, we use a linear SVM model with an L1 penalty, as available in the scikit-learn library.
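A minimal sketch of the polarity feature extraction described above, combining the three sentiment tools and the five functionals; how the Flair label is turned into a signed score, the per-sentence granularity, and all names here are assumptions rather than the thesis' exact implementation.

```python
# pip install nltk textblob flair ; nltk.download("vader_lexicon") is needed once.
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from flair.data import Sentence
from flair.models import TextClassifier

vader = SentimentIntensityAnalyzer()
flair_clf = TextClassifier.load("en-sentiment")


def sentence_polarity(sentence: str) -> list[float]:
    """Polarity/subjectivity scores of one sentence from the three tools."""
    blob = TextBlob(sentence).sentiment
    flair_sent = Sentence(sentence)
    flair_clf.predict(flair_sent)
    label = flair_sent.labels[0]
    flair_score = label.score if label.value == "POSITIVE" else -label.score
    return [
        vader.polarity_scores(sentence)["compound"],  # Vader compound polarity
        blob.polarity,                                # TextBlob polarity
        blob.subjectivity,                            # TextBlob subjectivity
        flair_score,                                  # signed Flair confidence
    ]


def clip_polarity_features(sentences: list[str]) -> np.ndarray:
    """Summarize per-sentence scores with the five functionals (mean, std, max, min, sum)."""
    scores = np.array([sentence_polarity(s) for s in sentences])
    return np.concatenate([
        scores.mean(axis=0), scores.std(axis=0),
        scores.max(axis=0), scores.min(axis=0), scores.sum(axis=0),
    ])


feats = clip_polarity_features(["I feel great today.", "Nothing went right."])
print(feats.shape)  # (20,) = 4 scores x 5 functionals
```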
Random forests are ensemble learning methods that consist of many decision trees, each of which is given only a random subset of the features, which makes the model less prone to overfitting. Each tree repeatedly splits the data so that similar samples end up in the same subset; this is done by finding the optimal split based on the impurities of the features, where impurity measures how mixed the labels within a subset are. When using a random forest for feature selection, the impurity decrease attributable to each feature is computed and the features are ranked by that measure. We use the ExtraTreesClassifier from scikit-learn and treat the subset of features retained by the tree ensemble as the selected feature set.

Principal component analysis (PCA) is an unsupervised dimensionality reduction technique that finds a projection of the data points into a lower-dimensional space. It creates a hierarchical coordinate system whose axes capture the maximum variance in the data. We use PCA mostly for very high dimensional data, such as the TF-IDF features in the linguistic experiments and the features extracted from the DCNN in the visual modality, to reduce the feature set size before the feature selection experiments. We use the PCA implementation from the scikit-learn library.

Like most healthcare datasets, the BD dataset is small, with 164 data points in total in the training and development sets. When working with a small number of observations, it is crucial to avoid overfitting in order to obtain reliable predictions, and classifier selection is one of the most important steps. Deep learning models have improved the state of the art in many problems; however, complex models with many parameters require a lot of data and many iterations to optimize, which results in overfitting on small datasets. Simpler models are a better choice here. In our experiments, we mostly use the kernel ELM [69]. ELM is a simple and robust machine learning model that contains a single hidden layer. The input weights are randomly initialized, so they do not need to be tuned, and the weights between the hidden layer and the output layer are calculated by an inverse operation. In a single hidden layer ELM, the hidden layer output matrix is H ∈ R^(N×h), the weight matrix between the hidden layer and the output layer is β ∈ R^(h×1), and the target matrix is T ∈ R^(N×1), where N is the number of training samples and h is the number of hidden layer nodes. The output weight matrix β is calculated from the least squares solution of Hβ = T as β = H†T, where H† is the Moore-Penrose generalized inverse [70], which minimizes the L2 norms of both Hβ - T and β. For increased generalization and robustness, a regularization coefficient C is used, and the set of weights is calculated as β = H^T (I/C + K)^(-1) T, where I is an identity matrix and K = HH^T is the kernel matrix. We use a radial basis function (RBF) kernel to compute K, as suggested in [71].

When working with small datasets, class imbalance may mislead the model in favor of the majority class, and using weighted models is one solution to this imbalanced learning problem. In weighted ELM [42], we define an N × N diagonal weight matrix W, where N is the number of samples, and each diagonal element stores the multiplicative inverse of the number of training samples with the corresponding label. Integrating W into the formula, the set of weights is calculated as β = H^T (I/C + WK)^(-1) WT.
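A minimal NumPy sketch of the kernel ELM and its weighted variant, assuming the closed-form solutions reconstructed above; the RBF bandwidth, the softmax used to turn scores into pseudo-probabilities, and all names are illustrative assumptions rather than the thesis' exact implementation.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.01):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)


class KernelELM:
    """Kernel ELM classifier; set class_weighted=True for the WELM variant."""

    def __init__(self, C=100.0, gamma=0.01, class_weighted=False):
        self.C, self.gamma, self.class_weighted = C, gamma, class_weighted

    def fit(self, X, y):
        self.X_train = X
        classes, counts = np.unique(y, return_counts=True)
        T = (y[:, None] == classes[None, :]).astype(float)  # one-hot targets
        K = rbf_kernel(X, X, self.gamma)
        n = len(y)
        if self.class_weighted:
            # Diagonal W: 1 / (class count) per sample, strengthening minority classes.
            w = 1.0 / counts[np.searchsorted(classes, y)]
            A, b = np.eye(n) / self.C + np.diag(w) @ K, np.diag(w) @ T
        else:
            A, b = np.eye(n) / self.C + K, T
        self.alpha = np.linalg.solve(A, b)  # (I/C + [W]K)^(-1) [W]T
        self.classes_ = classes
        return self

    def predict_proba(self, X):
        scores = rbf_kernel(X, self.X_train, self.gamma) @ self.alpha
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)  # softmax over class scores

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


# Toy usage with random features standing in for clip-level descriptors.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(104, 88)), rng.integers(0, 3, size=104)
model = KernelELM(class_weighted=True).fit(X, y)
print(model.predict(X[:5]))
```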
There is a trade-off between the weighted and unweighted models, where the former improves UAR while the latter improves accuracy. To find the best performing model, we implement a decision-level fusion model, P_fusion = α P_weighted + (1 - α) P_unweighted, where P is an N × t matrix that contains the class probabilities of each sample, t is the number of classes, and α is a coefficient between 0 and 1. The best α is chosen according to the UAR score of P_fusion.

Cross-validation is a model validation technique in which the model is evaluated on its ability to generalize to independent data. The dataset is repeatedly split into training and development sets, and a model is trained and tested for each split. The BD dataset is small, with 104 training and 60 development samples, so it is important to make sure that the model does not merely perform well on the development set samples but is a general solution to the problem at hand. In addition, cross-validation makes it possible to train the model with more data by reducing the development set size. Our main goal in using cross-validation was to decide which models to try on the test set, based on both the development set and the cross-validation results. In k-fold cross-validation, the dataset is split into k groups. In each turn, one of the k subsets is used as the development set and the remaining subsets are concatenated into a training set. A model is trained and optimized on each training set and evaluated on the corresponding development set, and the predictions on each development set are saved. Finally, the performance is evaluated using the predictions and the ground-truth labels of the whole dataset. The parameter k should be chosen so that, after splitting the data, both the training and development sets remain representative of the dataset. In our case, k is chosen as 4, which creates training sets with 123 samples and development sets with 41 samples.

All these modalities complement each other in processing the information. In affective computing, the datasets mostly contain biological signals that come from various sensors. These signals contain some common information that complements the other modalities, as well as some specific information that cannot be observed in the others. Likewise, psychiatrists observe the patient's speech patterns (rate and amount), appearance, gestures, motor activity, and changes of ideas and topics during the interviews, and all of these signs are used to decide the patient's YMRS score and to diagnose BD episodes.

The majority voting method takes the predicted labels obtained from each model and outputs the most frequent label for each sample. If all three models output a different label for a clip, the label of the audio modality is assigned to that clip, since the audio modality generally performed better. The fused labels are calculated as L_i = mode(L_audio,i, L_text,i, L_visual,i), where L is an N × 1 matrix that contains the label of each video clip and N is the number of samples; the mode is taken for each row separately.

We use the weighted sum method for the fusion of both two and three modalities. The class probabilities of a clip obtained from each model are given as input, and the final class probabilities are obtained as a weighted sum, P_fusion = Σ_i x_i P_i, where the weights x_i are drawn from a Dirichlet distribution with density f(x_1, ..., x_N; α_1, ..., α_N) = (1/B(α)) Π_{i=1}^{N} x_i^(α_i - 1), where B(α) is a normalizing factor given in terms of the multivariate beta function, x_i ∈ (0, 1), and Σ_{i=1}^{N} x_i = 1.

Finally, we also experiment with early fusion (feature-level fusion) methods. In this approach, the features from different modalities are combined into a single feature vector before classification. In our experiments, the feature vectors obtained after the summarization of the LLDs are concatenated before the normalization operation.
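A minimal sketch of the decision-level fusion steps described above, assuming class-probability matrices and predicted labels are already available from the unimodal models; the α grid, the UAR-based selection, and the toy data are assumptions.

```python
import numpy as np
from sklearn.metrics import recall_score


def uar(y_true, y_pred):
    """Unweighted average recall: mean of the per-class recalls."""
    return recall_score(y_true, y_pred, average="macro")


def fuse_weighted_unweighted(P_w, P_u, y_true, alphas=np.linspace(0, 1, 21)):
    """Blend weighted- and unweighted-ELM probabilities, picking alpha by UAR."""
    best_alpha, best_score = 0.0, -1.0
    for a in alphas:
        P = a * P_w + (1 - a) * P_u
        score = uar(y_true, P.argmax(axis=1))
        if score > best_score:
            best_alpha, best_score = a, score
    return best_alpha, best_score


def majority_vote(labels_audio, labels_text, labels_visual):
    """Per-clip majority vote; falls back to the audio label on a three-way tie."""
    fused = []
    for a, t, v in zip(labels_audio, labels_text, labels_visual):
        votes = [a, t, v]
        counts = {c: votes.count(c) for c in set(votes)}
        fused.append(a if max(counts.values()) == 1 else max(counts, key=counts.get))
    return np.array(fused)


# Toy usage with random probabilities standing in for model outputs.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=60)
P_w, P_u = rng.dirichlet(np.ones(3), 60), rng.dirichlet(np.ones(3), 60)
print(fuse_weighted_unweighted(P_w, P_u, y))
print(majority_vote(y, rng.integers(0, 3, 60), rng.integers(0, 3, 60))[:10])
```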
While selecting the fusion models to try on the test set, we consider both the 4-fold cross-validation result of a model and the Multimodal 1 (MM1) metric [72]. The MM1 metric measures the improvement of the fusion model over the best unimodal model and is calculated as MM1 = (UAR_fusion - max(UAR_1, UAR_2, UAR_3)) / max(UAR_1, UAR_2, UAR_3), where UAR_fusion is the UAR score of the fusion model and UAR_1, UAR_2, and UAR_3 are the UAR scores of the models created from the single modalities.

In this chapter, we present the experiment results in three sections. First, we discuss the unimodal systems for both clip-level and task-level data. In the final section, the results of the fusion experiments on the clip-level data are presented.

We extract the IS10 features; however, as the dataset is a small one with 164 clips, we also want to examine the results on smaller feature sets. We therefore also extract the eGEMAPS features, which are summarized using the functionals described in [53]. Throughout the text, we refer to the original eGEMAPS feature set containing 88 features as eGEMAPS, and to the version summarized using 10 functionals as eGEMAPS10. eGEMAPS can be extracted directly as a feature vector using the 'csvoutput' option instead of 'lldcsvoutput' in the openSMILE command line interface. The MFCC feature set is also extracted with an openSMILE configuration file, which computes 13 MFCCs (0-12) and appends their 13 delta and 13 acceleration coefficients. This section presents the experimental results on these feature sets, with ablation studies on the techniques we used to increase the performance.

Table 5.1 shows the results with and without L2 normalization. Z-normalization is applied to each feature separately; after that, L2 normalization is applied to the feature vector of each clip. The ranges and units of the features vary, so the model may give more importance to features with larger values; normalizing the feature vector eliminates this effect. As can be seen in the results, applying L2 normalization improves the performance for both feature sets.

The dimensionality of the features extracted for the audio modality is high considering the sample size of the BD dataset, which can lead the model to overfit the data. Besides, some features may be irrelevant to the problem, as we use generic feature sets, and these irrelevant features may mislead the model and reduce performance. Therefore, we experiment with feature selection methods to prevent overfitting and eliminate the irrelevant features, as explained in Section 4.5. As can be seen in Table 5.2, both the L1-based and tree-based feature selection methods improve the performance equally for the MFCC feature set. However, for the eGEMAPS feature set, feature selection reduces the performance.

To further examine the classification of BD episodes using the visual modality, we use the FAU, geometric, and deep appearance (VGG) features, which were presented by the dataset owners as baseline feature sets. We examine the results of summarizing the LLDs of these features using several functionals. Table 5.6 shows the results obtained with a 49-dimensional feature vector after applying PCA to the VGG feature vector.
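A minimal sketch of the normalization used before classification (per-feature Z-normalization followed by per-clip L2 normalization), using scikit-learn; fitting the scaler on the training partition only is an assumption, as are the array shapes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(104, 88))  # e.g., clip-level eGEMAPS vectors
X_dev = rng.normal(loc=5.0, scale=3.0, size=(60, 88))

# Z-normalization: standardize each feature (column) using statistics
# estimated on the training partition, then apply them to the development set.
scaler = StandardScaler().fit(X_train)
X_train_z, X_dev_z = scaler.transform(X_train), scaler.transform(X_dev)

# L2 normalization: scale each clip's feature vector (row) to unit length.
X_train_n, X_dev_n = normalize(X_train_z, norm="l2"), normalize(X_dev_z, norm="l2")

print(round(float(np.linalg.norm(X_train_n[0])), 3))  # 1.0 for every clip vector
```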
The clips in the BD dataset are recordings of the participants performing seven different tasks, as explained in Chapter 3. The tasks were designed to observe the participants in different mindsets. We wanted to examine the effects of these tasks on the classification of BD episodes. As mentioned in Chapter 3, the tasks are separated by a 'knock' sound. However, sometimes participants kept talking about the previous task after the sound was heard, or accidentally pressed the space bar twice, which creates errors in the separation of the tasks. Therefore, we marked the beginning and end times of the tasks manually and, based on these timestamps, created new sound files. Some participants skipped tasks without answering, so not all task files are available for every clip. Dividing the clips into seven separate tasks shortens the amount of material available for learning classifiers per task; however, the number of samples increases, which may improve the generalizability of the model. The trade-off between these two aspects could improve the overall performance. After creating separate files for each task, eGEMAPS features for the acoustic modality, TF-IDF features for the linguistic modality, and FAU features for the visual modality are extracted. Z-normalization is applied at the feature level (column-wise) and L2 normalization is then applied along the feature vectors (row-wise). Decision-level fusion of the unweighted and weighted kernel ELM models is used for the classification. The results indicate that the emotion eliciting tasks carry more information about the bipolar disorder moods. In the fourth row, the emotion-based task groups are used together to train an ELM model. The number of clips in the training set becomes 302, which is almost three times more than when using the entire clips without any separation or when using the tasks separately. The class probabilities of the tasks belonging to a clip are averaged. The results improve when all emotion groups are used together in training.

The best UAR score in the 4-fold cross-validation, 65.7%, is achieved using eGEMAPS10 with tree-based feature selection, LIWC, and FAU features fused with the majority voting method. The MM1 score shows that the fusion of the modalities increases the maximum unimodal performance by 15%, which is also the highest MM1 score achieved in the 4-fold cross-validation results. The six best performing models use the majority voting method, which shows the effectiveness of this approach. Even the remaining two majority voting lines in Table 5.9 have higher MM1 scores than almost all of the feature fusion lines, which indicates that majority voting improves the most over the unimodal results. The feature fusion method is not as successful as majority voting in improving on the unimodal results, since after concatenating the feature sets, the resulting feature vector has a higher dimension, which requires more data for robust training [73].

Final test set experiments are done using the four top performing multimodal fusion systems (the first four lines in Table 5.9), as we wish to use at most 10 test set probes. In order to calculate their MM1 scores on the test set, we also obtain the test set results of the constituent unimodal models, namely eGEMAPS10, eGEMAPS10 with tree-based feature selection, eGEMAPS, LIWC, FAU, and geometric features. Our best fusion result is higher than the best performing result published so far. Since the test set models are trained on more data than in the cross-validation setting, some test set results give higher UAR scores than the 4-fold cross-validation results. eGEMAPS10 gives the highest unimodal UAR score, which is also higher than the state-of-the-art test set score on this dataset. We achieve 64.8% in both the 4-fold cross-validation and the test set results.
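A minimal sketch of the clip-level aggregation of task predictions described above (averaging the class probabilities of a clip's available task segments); the task names and probability values are purely illustrative.

```python
import numpy as np


def clip_prediction_from_tasks(task_probs: dict[str, np.ndarray]) -> int:
    """Average the class-probability vectors of a clip's available task
    segments and return the index of the most likely class."""
    stacked = np.vstack(list(task_probs.values()))  # one row per task segment
    return int(stacked.mean(axis=0).argmax())


# Toy usage: three task segments of one clip, with probabilities over
# (remission, hypomania, mania); missing tasks are simply absent.
clip_tasks = {
    "task_positive": np.array([0.2, 0.5, 0.3]),
    "task_sad_memory": np.array([0.1, 0.6, 0.3]),
    "task_neutral": np.array([0.3, 0.4, 0.3]),
}
print(clip_prediction_from_tasks(clip_tasks))  # 1 -> hypomania
```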
From the test set confusion matrix, we can see that the mania and remission classes are easier to classify, as they have more distinct features.

In this thesis, we worked on the classification of bipolar disorder episodes (mania, hypomania, remission) using the BD dataset, which contains video recordings of bipolar disorder patients interviewed by their psychiatrists. During the interviews, the patients perform seven different tasks. The tasks are designed to elicit both positive and negative emotions in the patients, and some tasks are emotionally neutral.

We showed that multimodality improves the generalizability of the classification of bipolar disorder. The information coming from the acoustic, textual, and visual modalities complements each other and improves on the performance of the unimodal systems. The results suggest that using all three modalities together gives the best performance; however, a fusion model of the linguistic and acoustic modalities still performs well while requiring less information. As the classification algorithm, we use a fusion of weighted and unweighted ELMs. The ELM was a good fit for this problem, since it is a simple single-hidden-layer neural network and therefore less prone to overfitting. The data imbalance creates a need for a weighted model; however, the weighted ELM mostly favors the minority class, so by fusing the weighted and unweighted ELMs an optimum point between the two is found. The best performing model is obtained using the eGEMAPS10, LIWC, and FAU features, each classified with the fusion of weighted and unweighted kernel ELMs and combined using majority voting as a late fusion step. We achieve a 64.8% test set UAR with this configuration, which is the best result reported on the BD dataset, as can be seen in Figure 6.1.

The results suggest that using all three modalities is beneficial, since the 13 best performing models are all fusions of the three modalities. However, the 14th highest score in Table 5.9 uses only the linguistic and acoustic modalities. So, it is possible to use only audio recordings of the patients, such as phone recordings, and still achieve promising results from the fusion of the linguistic and acoustic modalities. Moreover, the MM1 scores in Table 5.9 show that the fusion of modalities increases the maximum score achieved by a single modality in all configurations.

eGEMAPS is a commonly used minimalistic acoustic feature set, so we used it for the audio classification and in the fusion experiments. In addition, we summarized the eGEMAPS LLDs with the 10 functionals presented in [6]. We achieved better performance with the eGEMAPS10 feature set, which shows that the eGEMAPS LLDs can give better results when summarized with different functionals. The eGEMAPS and eGEMAPS10 feature sets contain 88 and 230 features, respectively, so a larger feature set may help find features that generalize better to the dataset.

These results are still not high enough for use in a real-world application as a decision system. One of the main difficulties was the small size of the BD corpus: there are 25, 38, and 41 clips in the training set for the remission, hypomania, and mania classes, respectively, which is not enough to generalize with high certainty. The dataset was collected in a real-life scenario, so there is some noise, and in some cases the clinician explains the questions to the patients, so her voice can be heard as well. These issues are expected to be present in any real-life application, so the natural recording setup makes this database valuable.
Another difficulty stems from missing information in some clips, where patients do not answer some of the questions. In one of the test set clips, the patient does not answer any question at all. This could itself be used as a feature; however, in our method it caused poor performance.

Besides the clip-level evaluation, we examined the effect of the tasks separately and of grouping tasks that elicit the same emotion during classification. Since some tasks are not performed in every clip, the number of clips per task differs. To be able to compare the results of the task groups with the entire-clip results, we assign the middle class label to the missing clips. Since the dataset is already small, this distorts the final scores somewhat. Still, the task-level experiments show that, as expected, the emotion eliciting tasks are more useful for the classification of BD in all three modalities. In order to increase the dataset size, we also used the task groups as separate data points and performed classification. However, the results were not better than the entire-clip results, which suggests that the information obtained from longer clips is necessary for learning.

Our final best performing model combines information from three different modalities, and each modality is represented by feature vectors of various sizes, which limits the explainability of the model. Creating explainable models is especially important in the medical domain. As further work, the explainability of the system can be investigated; this would also give psychiatrists insight into the features used in the classification, and the best performing ones could be adopted in their decision-making processes.

References

[1] Lifetime and 12-month prevalence of bipolar spectrum disorder in the National Comorbidity Survey replication
[2] The Global Burden of Disease: 2004 update, World Health Organization
[3] The National Depressive and Manic-depressive Association (DMDA) survey of bipolar members
[4] Bipolar Bozukluk - Mani Dönemi Tanılı Bireylerin Görüntü-Ses Özniteliklerinin Klinik Özellikler ve Nörokognitif İşlevlerle İlişkileri
[5] Machine learning in mental health: a scoping review of methods and applications
[6] The Turkish audio-visual bipolar disorder corpus
[7] MP-BGAAD: Multi-Person Board Game Affect Analysis Dataset
[8] MUMBAI: Multi-Person, Multimodal Board Game Affect and Interaction Analysis Dataset
[9] Geeks and guests: Estimating player's level of experience from board game behaviors
[10] OpenFace: an open source facial behavior analysis toolkit
[11] Speech Analysis for Automatic Mania Assessment in Bipolar Disorder
[12] Selecting learning algorithms for simultaneous identification of depression and comorbid disorders
[13] A big data application to predict depression in the university based on the reading habits
[14] Towards the early diagnosis of Alzheimer's disease via a multicriteria classification model
[15] EEG complexity modifications and altered compressibility in mild cognitive impairment and Alzheimer's disease
[16] A depression detection model based on sentiment analysis in micro-blog social network
[17] Using natural language processing to classify suicide notes
[18] Tackling mental health by integrating unobtrusive multimodal sensing
[19] Automated audiovisual depression analysis
[20] Depression recognition based on dynamic facial and vocal expression features using partial least square regression
[21] Detecting depression from facial actions and vocal prosody
[22] Fusion of audiovisual features using hierarchical classifier systems for the recognition of affective states and the state of depression
[23] Ecologically valid long-term mood monitoring of individuals with bipolar disorder using speech
[24] openSMILE: the Munich versatile and fast open-source audio feature extractor
[25] Assessing bipolar episodes using speech cues derived from phone calls
[26] Classification of bipolar disorder episodes based on analysis of voice and motor activity of patients
[27] Identifying Mood Episodes Using Dialogue Features from Clinical Interviews
[28] AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition
[29] Bipolar Disorder Recognition with Histogram Features of Arousal and Body Gestures
[30] Bipolar Disorder Recognition via Multiscale Discriminative Audio Temporal Representation
[31] Multi-modality Hierarchical Recall based on GBDTs for Bipolar Disorder Classification
[32] Automated Screening for Bipolar Disorder from Audio/Visual Modalities
[33] Determine Bipolar Disorder Level from Patient Interviews Using Bi-LSTM and Feature Fusion
[34] Audio-based Recognition of Bipolar Disorder Utilising Capsule Networks
[35] Multi-instance learning for bipolar disorder diagnosis using weakly labelled speech data
[36] Multimodal Deep Learning Framework for Mental Disorder Recognition
[37] A Hybrid Model for Bipolar Disorder Classification from Visual Information
[38] Bipolar Disorder Classification Based on Multimodal Recordings
[39] An Investigation of Emotional Speech in Depression Classification
[40] Image classification with the Fisher vector: Theory and practice
[41] Fusing Acoustic Feature Representations for Computational Paralinguistics Tasks
[42] Weighted extreme learning machine for imbalance learning
[43] Challenges in representation learning: A report on three machine learning contests
[44] Dynamic routing between capsules
[45] SimSensei Kiosk: A virtual human interviewer for healthcare decision support
[46] AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition
[47] Deep weighted extreme learning machine
[48] Weight Learning in Weighted ELM Classification Model Based on Genetic Algorithms
[49] A rating scale for mania: reliability, validity and sensitivity
[50] Analysis of emotion in Turkish
[51] Contrasting and combining least squares based learners for emotion recognition in the wild
[52] Eleventh Annual Conference of the International Speech Communication Association
[53] The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing
[54] A review of bipolar disorder among adults
[55] Language models are unsupervised multitask learners
[56] BERT: Pre-training of deep bidirectional transformers for language understanding
[57] Language models are few-shot learners
[58] Linguistic inquiry and word count: LIWC2001
[59] KNN with TF-IDF based framework for text categorization
[60] Automatic Mood Classification Using TF*IDF Based on Lyrics
[61] A probabilistic justification for using tf×idf term weighting in information retrieval
[62] Improved feature selection approach TFIDF in text mining
[63] Natural language processing with Python: analyzing text with the natural language toolkit
[64] An algorithm for suffix stripping
[65] VADER: A parsimonious rule-based model for sentiment analysis of social media text
[66] TextBlob Documentation
[67] Flair: An easy-to-use framework for state-of-the-art NLP
[68] Video-based emotion recognition in the wild using deep transfer learning and score fusion
[69] Extreme learning machine for regression and multiclass classification
[70] Generalized inverse of a matrix and its applications
[71] Kernel ELM and CNN based facial age estimation
[72] A review and meta-analysis of multimodal affect detection systems
[73] Automatic temporal segment detection and affect recognition from face and body display