key: cord-0646484-0k18eec8 authors: Xie, Wanqing; Liang, Lizhong; Lu, Yao; Wang, Chen; Shen, Jihong; Luo, Hui; Liu, Xiaofeng title: Interpreting Depression From Question-wise Long-term Video Recording of SDS Evaluation date: 2021-06-25 journal: nan DOI: nan sha: 0ae20abf74ee14711433a4fa58842839d087e16d doc_id: 646484 cord_uid: 0k18eec8

The Self-Rating Depression Scale (SDS) questionnaire is frequently used for efficient preliminary depression screening. However, this uncontrolled self-administered measure can easily be affected by careless or deceptive answering, producing results that differ from the clinician-administered Hamilton Depression Rating Scale (HDRS) and from the final diagnosis. Clinically, facial expression (FE) and actions play a vital role in clinician-administered evaluation, yet they remain underexplored for self-administered evaluation. In this work, we collect a novel dataset of 200 subjects to evidence the validity of self-rating questionnaires using their corresponding question-wise video recordings. To automatically interpret depression from the SDS evaluation and the paired video, we propose an end-to-end hierarchical framework for long-term, variable-length video, which is also conditioned on the questionnaire results and the answering time. Specifically, we resort to a hierarchical model that utilizes a 3D CNN for local temporal pattern exploration and a redundancy-aware self-attention (RAS) scheme for question-wise global feature aggregation. Targeting redundant long-term FE video processing, our RAS effectively exploits the correlations of the video clips within a question to emphasize the discriminative information and eliminate the redundancy based on pair-wise feature affinity. Then, the question-wise video feature is concatenated with the questionnaire scores for final depression detection. Our thorough evaluations also show the validity of fusing the SDS evaluation with its video recording, and the superiority of our framework over conventional state-of-the-art temporal modeling methods.

Fig. 1. The illustration of collecting SDS questionnaires and the corresponding synchronized face videos. Depression can be better detected with both the SDS score and the facial expression videos.

Early detection of depression is important for intervention therapy: the earlier that treatment can begin, the more effective it is [2]. However, the comprehensive clinical interview for final diagnosis, i.e., the clinical gold standard [3], can be costly for large-population screening. Depression is even more prevalent during the COVID-19 pandemic [4], [5], while the in-person clinician interview can be inconvenient or even prohibitive. The Self-Rating Depression Scale (SDS) [6] is a widely adopted self-administered fast-screening questionnaire with twenty questions, covering affective, psychological, and somatic symptoms related to depression. Each question is framed in terms of positive and negative statements and is scored on a Likert scale ranging from 1 to 4. The final result is the sum of the question scores, and a larger score indicates that the subject is more likely to be a depression patient. Conventionally, 50 is set as the threshold between normal and depression [7]. The SDS ratings have indicative depression level ranges that may help health assessment and testing [8].
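As a concrete illustration of the scoring convention above, the following minimal Python sketch sums the 20 Likert-scale items and applies the threshold of 50. The function name, the inclusive comparison at 50, and the assumption that reverse-keyed items have already been converted to their scored values are our own illustrative choices, not details of the original SDS software.

```python
def sds_screening(item_scores, threshold=50):
    """Conventional SDS screening: 20 items, each scored 1-4 (reverse-keyed items
    assumed already converted); the total is compared with the threshold of 50.
    Treating a total equal to 50 as 'depression' is an assumption for illustration."""
    assert len(item_scores) == 20 and all(1 <= s <= 4 for s in item_scores)
    total = sum(item_scores)
    return total, ("depression" if total >= threshold else "normal")

# Example: a subject alternating ratings of 2 and 3 reaches the screening threshold.
print(sds_screening([2, 3] * 10))  # (50, 'depression')
```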
While the SDS ratings may aid therapeutic and scientific research, the SDS outcome can differ from the clinical interview used to confirm a depression diagnosis [9]. One reason is that the uncontrolled self-administered measure can easily be affected by careless or deceptive answering [10], producing results that differ from the clinician-administered interview, e.g., the Hamilton Depression Rating Scale (HDRS) [11]. Clinically, facial expression (FE) [12] and actions [13], [14] play an important role in clinician-administered evaluation, while FE and actions are underexplored for self-administered evaluation. In fact, expression and actions can be expressive features for many psychiatric analyses [15], [16].

Based on this insight, we collect a novel dataset of 200 subjects to evidence the validity of self-rating questionnaires with their video recordings. To provide a fine-grained connection between the questionnaire and the video, we adopt a Software-Defined Camera (SDC) system synchronized with the questionnaire software, recording each video from the moment the question is shown until the score is chosen. For each subject, there are 20 question and video pairs. To extract the region of interest (ROI), a face detector is applied, and the face box is extended by 100% to incorporate hand actions, e.g., head-scratching and chin-touching. Moreover, the answering time may also affect the depression diagnosis [17].

The video provides additional information for analysis, while it also introduces several challenges for automatic analysis. First, the video of each question can be quite long (e.g., up to 525 frames in our dataset), and the useful information can be sparse in a long-term sequence. Second, the length of each video varies from 50 to 525 frames, depending on the question and the participant. Third, a practical information fusion scheme is necessary to explore both the SDS evaluation and its question-wise video recording.

To automatically interpret depression from the SDS evaluation and its corresponding FE and action video recording, a redundancy-aware conditional self-attention framework is proposed. Specifically, we resort to a hierarchical model that utilizes a 3D convolutional neural network (CNN) [18] for local temporal pattern exploration and a self-attention scheme for question-wise global feature aggregation. We factorize the question-wise video into fixed-length clips according to the time it takes a human facial expression to develop. With the fixed-length input, the 3D CNN is able to extract local temporal cues efficiently. Then, the clip-wise representations are fed forward to a parametric redundancy-aware self-attention (RAS) scheme to eliminate uninformative signals and extract a representative question-wise feature. Targeting redundant long-term FE and action video processing, our RAS effectively exploits the correlations of the video clips within a question to emphasize the discriminative information and eliminate the redundancy. It traverses the clips within a question to produce a refined representation based on pair-wise feature affinity [19]. Our residual attention term is able to prioritize discriminative clips while ignoring inferior ones for explicit redundancy reduction. In addition, the temporal sequence is explicitly considered with a Gaussian similarity kernel.
Then, the question-wise video feature is concatenated with the questionnaire scores for final depression detection. Our thorough evaluations also show the validity of fusing the SDS evaluation with its video recording, and the superiority of our framework over conventional state-of-the-art temporal modeling methods.

The main contributions of this work can be summarized as follows:
• To the best of our knowledge, this is the first attempt at exploring depression with both the SDS evaluation and its corresponding question-wise face video recording. An elaborate synchronized system is designed, and the final clinician interview results are collected.
• A practical hierarchical conditional self-attention framework is proposed to explore the long-term variable-length video, with a 3D CNN for local temporal modeling, the redundancy-aware RAS for global attention modeling, and SDS-score-conditioned question-wise fusion.
• Our parametric redundancy-aware self-attention (RAS) scheme explicitly emphasizes the discriminative clips and reduces the redundancy based on pair-wise feature affinity, and it is aware of the temporal sequence via a Gaussian similarity kernel.
• The systematic and thorough comparisons with previous temporal modeling methods provide further insights into the potential benefits of our framework.
We note that the proposed framework can potentially be generalized to other classification tasks using both questionnaire and video modalities.

In recent decades, numerous works have been dedicated to collecting better-quality and larger depression datasets [13]. There is a long history of using the SDS evaluation for self-report [7]. In addition, facial expression [20], eye movement [14], and body action [21] can be important modalities for depression detection [13]. However, publicly available datasets of SDS evaluations with their video recordings, suitable for machine learning methods, are missing. To the best of our knowledge, this is the first attempt to automatically explore both the SDS evaluation and the corresponding question-wise video. Moreover, the "ground truth" of many datasets is the self-report (e.g., DAIC-WOZ [22] and AVEC [23]), which is highly unreliable [13]. The clinician interview is usually used for the final diagnosis [24], which can be costly for large-scale labeling. In this work, all subjects have taken a more comprehensive clinical interview to collect the gold-standard label of depression. The scale of our collection, i.e., 200 subjects, is able to support automatic analysis with a deep learning system.

We note that many modalities can be used for depression detection. For example, [25] targets the conversation with the Patient Health Questionnaire (PHQ)-8 metric [26]. [27] proposes to fuse spoken language and 3D facial key points in the DAIC-WOZ dataset [22]. Audio is used in [28] with a self-supervised embedding, and the phoneme feature is used in [29]. These works try to mimic the clinician-based interview. More recently, electroencephalography and paralinguistic behaviors have been fused with a classifier ensemble for depression detection [30]. [31] explores the EEG signal via kernel-target alignment. fNIRS can also be used for diagnosis [32]. These modalities can provide more accurate features, while they are not as scalable for efficient screening as the SDS.
Temporal modeling [33] is an essential part of video-based classification and analysis tasks, e.g., video facial expression recognition [34], [35], [36] and action recognition [37]. The recurrent neural network (RNN) [38] is widely used for temporal pattern modeling and performs dependent sequential processing. However, it is notorious for long-term forgetting and is hard to train [39]; therefore, its performance can be largely affected if the input is too long. The bi-directional long short-term memory (Bi-LSTM) has been proposed to alleviate these difficulties [40], [41], but it is still relatively slow in both training and inference. More recently, the 3D CNN [42] was proposed to explore spatial-temporal patterns in a unified manner. It can be processed quickly and has demonstrated good performance in many tasks. Nevertheless, the input to a 3D CNN must have a fixed size [43], limiting its application to videos of variable length, e.g., the different questions in our dataset. In this work, we propose to factorize a variable-length video into several fixed-length clips so that the 3D CNN can be utilized to better balance performance and efficiency.

Starting from machine translation [44], the attention scheme has demonstrated great potential in many applications and has been the core block of many successful systems [45]. Conventionally, it computes the adjusted output at a position as the weighted sum of all positions in the sentence. A similar philosophy has also been followed in non-local algorithms [46], which focused on the image denoising task. Pair-wise relationships have also been modeled using interaction networks [47], [48], [49], [50]. Moreover, [19] establishes a link between self-attention and the wider category of non-local filtering operations. [51] proposes to learn temporal dependencies between video frames at various time scales. Inspired by these methods, we further adapt this idea to variable-length long-term SDS video analysis.

We collect the Self-Rating Depression Scale (SDS) questionnaires and the corresponding face videos from 200 subjects. The study protocol was approved by the Ethics Committee of the Affiliated Hospital of Guangdong Medical University (No. PJ2021-026). Each participant is instructed to sit alone in a quiet consulting room and fill in the self-report questionnaire following the instructions in the software interface, to avoid being affected by others. Moreover, a Huawei Software-Defined Camera (SDC) is hidden behind a one-way mirror, and the participants are not aware of the camera during the evaluation. The SDC adopts an open-ended software architecture that can flexibly integrate with the code of a specific application. In addition, the data are uploaded to the back-end server for processing. To connect each question with its video, the camera is synchronized with the questionnaire software to record video starting when the question is shown and ending when the score is chosen. The data collection setup is illustrated in Fig. 1. In Fig. 2, we provide some frames of a subject answering a question.

There are 20 questions in the SDS evaluation, and they usually take about 10 minutes to complete [7]. Each participant takes a different time to finish different questions; in our collected dataset, the time for each question varies from 2 s to 21 s. The 25 fps videos with a resolution of 3840×2160 are collected for each question. Since the region of interest is the participant's face, we use the face detector [52] to crop the face region.
To incorporate hand actions such as head-scratching [53] and chin-touching [54], we extend the face box by 100% and resize the extended region of interest (ROI) in each frame to 110×110. Moreover, the image is converted to gray-scale to reduce the input size. The pre-processing flow chart is given in Fig. 3.

After the self-report SDS evaluation, a more comprehensive clinical interview [3], including the clinician-administered Hamilton Depression Rating Scale (HDRS) assessment [11], the SCL-90-R Symptom Checklist [55], and the Self-Rating Anxiety Scale (SAS) [56], is conducted to help the clinician confirm a diagnosis of depression. We use the final diagnosis as our ground-truth label. Considering that the self-administered SDS can be uncontrolled, the result of the SDS evaluation can differ from the final diagnosis [9]. In Tab. I, we provide the detailed statistics of the SDS and final results of our 200 subjects. We note that there are only two classes, i.e., normal and depression, in our dataset. About 10% of subjects have inconsistent results: 10 subjects whose SDS evaluation indicates normal are diagnosed as depression, while another 10 subjects with high SDS scores are diagnosed as normal in the subsequent clinical interview.

The long-term video recording of the SDS evaluation carries rich emotional information, while it also poses several challenges for processing. First, the long video can be redundant, and only very few (i.e., sparse) frames may contain useful cues for depression detection. Second, the length of the video varies across subjects and questions. Considering that a human expression usually takes 200 ms to 500 ms, it is reasonable to segment the video into several fixed-length short clips and explore the local temporal patterns within them. We empirically set each video clip to 10 successive frames in our task. With this fixed length, the 3D CNN can be an efficient module for fast processing [42]. To avoid splitting an expression, we set the overlap ratio to 0.5, so that the first clip of the first question, $\{I^1_{1,i}\}_{i=1}^{10}$, covers frames 1 to 10, and the second clip, $\{I^1_{2,i}\}_{i=1}^{10}$, covers frames 6 to 15. We use the superscript to denote the question index from 1 to 20, and the subscript to denote the images in each clip. Therefore, for an N-frame video, there are M = 2 × (N/10) − 1 clips (this pre-processing and clip factorization are sketched in the code example below). For a 16-second video at 25 fps in our dataset, we have 79 clips, and only very few of them show non-neutral expressions that contribute complementary information.

We extract a 128-dimensional feature for each clip and denote the clip-wise representations of the first question as $\{f^1_m\}_{m=1}^{M}$, where $m \in \{1, \cdots, M\}$ indexes the clips within a question. We note that M can differ across questions, and we omit the question index for simplicity of notation. After the clip-wise representations are extracted, the global attention is applied on top of them to extract the 128-dimensional question-wise features $\{a^i\}_{i=1}^{20}$ for the 20 questions. To adaptively learn the significance of each clip within a question, we resort to a redundancy-aware self-attention scheme. Then, the question-wise features $\{a^i\}_{i=1}^{20}$ are concatenated with the corresponding tabular questionnaire results and the answering time. We note that the SDS has 4 scales, and we use a four-dimensional one-hot vector to encode the choice for each question. The answering time is also concatenated as a one-dimensional scalar.
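To make the pre-processing and clip factorization above concrete, here is a minimal Python sketch (using OpenCV for the image operations). The exact face detector, how the 100% box extension is distributed around the detected face, and the handling of videos whose length is not a multiple of 10 are our assumptions rather than confirmed implementation details.

```python
import cv2

def preprocess_frame(frame, face_box, out_size=110):
    """Per-frame pre-processing sketch: extend the detected face box by 100% (to keep
    hand actions in view), crop, convert to gray-scale, and resize to 110x110.
    face_box is (x, y, w, h); how the extension is centered is our assumption."""
    x, y, w, h = face_box
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = 2 * w, 2 * h                      # 100% extension of the face box
    x0, y0 = int(max(cx - nw / 2, 0)), int(max(cy - nh / 2, 0))
    x1 = int(min(cx + nw / 2, frame.shape[1]))
    y1 = int(min(cy + nh / 2, frame.shape[0]))
    roi = cv2.cvtColor(frame[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    return cv2.resize(roi, (out_size, out_size))

def split_into_clips(frames, clip_len=10, overlap=0.5):
    """Split a variable-length question video into fixed-length overlapping clips.
    With clip_len=10 and overlap=0.5 the stride is 5 frames, giving
    M = 2 * (N / 10) - 1 clips for an N-frame video, as described in the text."""
    stride = int(clip_len * (1 - overlap))
    return [frames[s:s + clip_len] for s in range(0, len(frames) - clip_len + 1, stride)]

# e.g., a 16 s question video at 25 fps has 400 frames -> 79 clips
assert len(split_into_clips(list(range(400)))) == 79
```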
The concatenated feature of each question is a 133-dimensional vector, and all 20 questions are concatenated together to form a 2660-dimensional fused questionnaire-and-video feature. We use fully connected layers with a sigmoid output unit for binary classification, i.e., normal or depression. We note that the 3D CNN and self-attention modules are shared across all clips and questions. In the following subsections, we detail our hierarchically constructed 3D CNN, redundancy-aware attention, and question-level fusion modules.

The 3D CNN has demonstrated its effectiveness for fast temporal representation extraction from relatively short fixed-length videos [18]. We apply the standard 3D convolution operation to model the relationships between successive frames. The basic structure of our 3D CNN is illustrated in Fig. 5. After a few 3D convolutional and max-pooling layers, we obtain a 256-dimensional feature vector, which is sent to a fully connected layer to produce a 128-dimensional clip-wise representation $f^i_m$, where i and m index the 20 questions and the M clips within a question, respectively. Considering that the height and width of the video clips (e.g., 110) are usually much larger than the frame-wise dimension (e.g., 10), the first three max-pooling layers only halve the height and width dimensions, denoted as (2×2×1)-(1×1×1). In the 4-th max-pooling layer, we halve all three dimensions, denoted as (2×2×2)-(1×1×1). The detailed network structure is given in Tab. II.

To explore the correlations between the clips, we resort to the affinity of the clip-wise feature vectors. We use i to index the M clips within a question, and the i-th vector is regarded as the probe vector; j indexes the other M − 1 clips. We note that different questions can have different numbers of fixed-length clips, for which the 3D CNN alone is not applicable [57], [42]. In our redundancy-aware self-attention module, we stack several self-attention blocks, indexed by $l \in \{1, 2, \cdots, L\}$, where L is the total number of stacked self-attention blocks. In each self-attention block, we first traverse all of the clips to set the current clip as the probe, and then traverse the M − 1 clips other than the probe to explore their correlations with the current probe clip. Specifically, our self-attention block can be formulated as

$f_i^{q(l)} = f_i^{q(l-1)} + \Omega^{(l)} \odot \frac{1}{C_i} \sum_{j \neq i} \omega\big(f_i^{q(l-1)}, f_j^{q(l-1)}\big) \, \Delta_{i,j} \, \big(f_j^{q(l-1)} - f_i^{q(l-1)}\big), \quad (1)$

where $\Omega^{(l)} \in \mathbb{R}^{1\times1\times128}$ is the weight vector to be learned, $f_i^{q(0)} = f_i^q$, and $\Delta_{i,j}$ is a temporal Gaussian kernel that will be defined in Eq. (3). The response is normalized by $C_i$. The pair-wise affinity $\omega(\cdot,\cdot)$ is a scalar. The operation $\omega$ in Eq. (1) has many possible candidate functions [19], [51]. We simply choose the embedded Gaussian

$\omega\big(f_i^{q(l-1)}, f_j^{q(l-1)}\big) = \exp\big(\psi(f_i^{q(l-1)})^{\top} \phi(f_j^{q(l-1)})\big), \quad (2)$

where $\psi(f) = \Psi f$ and $\phi(f) = \Phi f$ are two embedding functions, and $\Psi, \Phi \in \mathbb{R}^{128\times128}$ are the corresponding learnable mapping matrices [19].

The residual term, i.e., $f_j^{q(l-1)} - f_i^{q(l-1)}$, is the difference between the neighboring feature $f_j^{q(l-1)}$ and the current probe feature $f_i^{q(l-1)}$. If $f_j^{q(l-1)}$ contains complementary information or more significant cues than the current probe feature $f_i^{q(l-1)}$, our redundancy-aware attention scheme will suppress the information from the inferior $f_i^{q(l-1)}$ and emphasize the more discriminative $f_j^{q(l-1)}$. Compared to the original non-local network, which uses only $f_j^{q(l-1)}$ [19], our formulation is more similar to diffusion maps [58], the graph Laplacian [59], and non-local image processing [60].
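The following is a minimal PyTorch sketch of one RAS block as reconstructed in Eqs. (1)-(2), with the temporal kernel Δ of Eq. (3) below. The choice of normalization $C_i$ (here the sum of the affinity-kernel weights over j ≠ i), the zero initialization of Ω^(l), and the stacking/pooling details in the usage example are our assumptions, not confirmed implementation details.

```python
import torch
import torch.nn as nn

class RASBlock(nn.Module):
    """Sketch of one redundancy-aware self-attention (RAS) block over the M clip
    features of a single question (Eq. (1)-(3); details are assumptions)."""

    def __init__(self, dim=128, sigma=2.0):
        super().__init__()
        self.psi = nn.Linear(dim, dim, bias=False)   # Psi in the embedded Gaussian
        self.phi = nn.Linear(dim, dim, bias=False)   # Phi in the embedded Gaussian
        self.omega = nn.Parameter(torch.zeros(dim))  # Omega^(l); zero init keeps the block near-identity at start
        self.sigma = sigma                           # shape of the temporal Gaussian kernel

    def forward(self, f):                            # f: (M, dim) clip features
        M, _ = f.shape
        logits = self.psi(f) @ self.phi(f).t()                         # (M, M)
        affinity = torch.exp(logits - logits.max())                    # Eq. (2), shifted for stability
        pos = torch.arange(M, dtype=f.dtype, device=f.device)
        delta = torch.exp(-(pos[:, None] - pos[None, :]) ** 2
                          / (2.0 * self.sigma ** 2))                   # Eq. (3)
        w = affinity * delta
        w = w - torch.diag(torch.diag(w))                              # exclude j == i
        w = w / (w.sum(dim=1, keepdim=True) + 1e-8)                    # normalize by C_i (assumption)
        residual = w @ f - w.sum(dim=1, keepdim=True) * f              # sum_j w_ij (f_j - f_i)
        return f + self.omega * residual                               # probe feature added back

# Stack L = 5 blocks (as in the paper) and average-pool over clips to get a^q.
blocks = nn.Sequential(*[RASBlock() for _ in range(5)])
clip_feats = torch.randn(79, 128)          # e.g., 79 clip features of one question
a_q = blocks(clip_feats).mean(dim=0)       # question-wise video feature, shape (128,)
```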
Diffusion maps, the graph Laplacian, and non-local image regularization are all non-local analogues [61] of local diffusions, which are expected to be more stable than the original non-local counterpart [19] due to the nature of the inherited Hilbert-Schmidt operator [61]. We note that the current probe feature is added back at the end, as in a residual neural network. Therefore, the preceding steps in a self-attention block adjust the information of each clip that needs to be emphasized and pass it on to the later blocks. Since the pair-wise residual term takes all possible clips into consideration rather than following sequential modeling, it does not suffer from long-term forgetting. It is therefore an ideal choice for attention modeling over many clips. In addition, the weighted average operation can take any number of inputs, which fits the different clip numbers in different questions. We note that the input and output of a self-attention block have the same size: the M inputs of size 1 × 1 × 128 are processed into M outputs of size 1 × 1 × 128. Permutation invariance is a special property of the self-attention scheme [39], since we use the sum operation in Eq. (1) to fuse the pair-wise residual terms.

In previous self-attention-based video analysis works, each frame is regarded as independent of the others, discarding the sequential patterns [19]. Our video clips are inherently sequential and overlap with their neighboring clips. To exploit the temporal patterns, a Gaussian kernel is used as a sequential neighboring distance measure:

$\Delta_{i,j} = \exp\left(-\frac{(m_i - m_j)^2}{2\sigma^2}\right), \quad (3)$

where $m_i, m_j \in \mathbb{R}$ represent the positions of the i-th and j-th feature vectors in the video of a question, respectively, and $\sigma$ is a hyperparameter controlling the shape of the Gaussian kernel.

After several stacked self-attention blocks, global pooling [19] is applied to these M feature maps for element-wise averaging. The final output is a question-wise video feature $a^q \in \mathbb{R}^{1\times1\times128}$. The SDS questionnaire score of each question is denoted as $s^q \in \mathbb{R}^4$. Moreover, we empirically found that the answering time of each question can also be helpful for the diagnosis. Therefore, we also concatenate the video length of each question, $t^q \in \mathbb{R}$; for a video that takes 3 seconds, we set $t^q$ to 3. A too short or too long answering time may be unreliable [17]. We concatenate $a^q$, $s^q$, and $t^q$ to form a 133-dimensional feature vector for each question. Then, the 20 question-wise video and questionnaire features are concatenated into a 2660-dimensional feature for the final depression diagnosis. Fully connected layers are adopted, and the detailed network structure is given in Tab. III. The widely used Rectified Linear Unit (ReLU) [62] is used as the non-linear mapping function between the fully connected layers. We use the sigmoid output unit $p = \frac{1}{1+e^{-out}} \in (0, 1)$ for binary classification, where $out$ is the network prediction scalar of the last layer, normalized to a probability value, i.e., the likelihood that this subject is a depression patient. We note that we label the normal subjects with 0 (i.e., y = 0) and the depression subjects with 1 (i.e., y = 1) according to the final clinician interview. To train our model with backpropagation, we use the binary cross-entropy loss as the optimization objective,

$\mathcal{L} = -\big[y \log p + (1 - y)\log(1 - p)\big],$

which is zero if p matches its corresponding y. In addition, we simply set the threshold of the binary prediction to p = 0.5 for testing.
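The question-level fusion and training objective described above can be sketched in PyTorch as follows. The hidden-layer sizes of the fully connected classifier and the batch handling are placeholders (Tab. III is not reproduced here), so this is an illustrative sketch rather than the authors' exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionLevelFusion(nn.Module):
    """Sketch of the question-level fusion head: for each of the 20 questions,
    concatenate the 128-d video feature a^q, the 4-d one-hot SDS score s^q and
    the scalar answering time t^q (133-d), stack all questions (2660-d), and
    classify with fully connected layers. Hidden sizes are placeholders."""

    def __init__(self, hidden=(512, 64)):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(20 * 133, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], 1),
        )

    def forward(self, a, s, t):
        # a: (B, 20, 128) video features, s: (B, 20) scores in {1..4}, t: (B, 20) seconds
        s_onehot = F.one_hot(s - 1, num_classes=4).float()
        x = torch.cat([a, s_onehot, t.unsqueeze(-1)], dim=-1)   # (B, 20, 133)
        return torch.sigmoid(self.fc(x.flatten(1))).squeeze(1)  # p in (0, 1)

# Binary cross-entropy objective; the 0.5 threshold is applied at test time.
model = QuestionLevelFusion()
p = model(torch.randn(2, 20, 128), torch.randint(1, 5, (2, 20)), torch.rand(2, 20) * 20)
loss = F.binary_cross_entropy(p, torch.tensor([0.0, 1.0]))
```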
IV. EXPERIMENTS

In this section, we compare the classification performance of our framework against conventional temporal modeling baselines. We also provide a systematic ablation study and sensitivity analysis to demonstrate the effectiveness of the design choices of our framework. All experiments were implemented with the widely adopted deep learning library PyTorch [63] on our server with an NVIDIA V100 GPU and a Xeon E5 v4 CPU with 128 GB memory. Our model is trained with the Adam optimizer [64] with hyper-parameters β1 = 0.9 and β2 = 0.999. We used a batch size of 2 for our dataset. The networks of our framework and the compared methods are trained for 200 epochs for a fair comparison. We report the results of five random initializations and provide the standard deviation (±sd) along with the average performance. We note that the 3D CNN and the redundancy-aware self-attention modules are shared across all questions, which can be processed in parallel. We empirically set L = 5. Training takes about 8 hours, while the average inference time for a subject is only 1.3 s. The threshold for binary classification at test time is set to 0.5.

We adopt five-fold cross-validation for the 200 subjects in our dataset. Specifically, we split the dataset into five subsets of 40 subjects each, with no overlap of subjects between folds. We then sequentially select one fold as our testing set (i.e., 40 participants), while the remaining four folds (i.e., 160 participants) are used for training. For performance evaluation, we use the widely accepted binary classification metrics of accuracy, sensitivity (i.e., recall), and specificity. More formally,

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP},$

where TP, TN, FP, and FN indicate true positives, true negatives, false positives, and false negatives, respectively. We note that positive and negative correspond to depression and normal, respectively. In addition, by varying a threshold, the receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR), and demonstrates the diagnostic performance of a binary classification algorithm. The larger the area under the curve (AUC), the better the performance. We note that TPR = TP/(TP + FN) and FPR = FP/(FP + TN).

With our SDS and video dataset, there are three choices of modality, i.e., SDS only, video only, and both. With only the SDS evaluation result, we can simply sum the scores of the 20 questions and use the threshold of 50 for normal/depression classification. According to the statistics in Tab. I, the SDS results can differ from the clinician diagnosis. We also tried using only the video modality for classification, i.e., without concatenating $s^q$ for the question-level fusion. It is clear that using both SDS and video outperforms SDS only by a large margin, which evidences the effectiveness of the additional video modality; the complementary information in the corresponding video helps detect depression more accurately. It is also appealing that even when we only use the video, we are able to predict depression with an accuracy of 0.69, which is higher than the chance probability of 0.5.

To demonstrate the effectiveness of our model, we also applied two baseline methods, i.e., recurrent neural networks (RNN) [40] and non-local networks [19], for comparison. We note that this is also the first attempt to apply RNN and non-local attention to SDS and video analysis. RNN is a typical choice for temporal modeling [65].
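As a reference point for this baseline, below is a minimal PyTorch sketch of a bi-directional LSTM aggregator over clip-wise features. The choice of pooling the final hidden states of both directions and projecting back to 128 dimensions is our assumption about how such a baseline could be configured, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class BiLSTMAggregator(nn.Module):
    """Sketch of a Bi-LSTM baseline used in place of the RAS aggregation: clip-wise
    features are processed sequentially and the final hidden states of both
    directions are projected to a 128-d question-wise feature."""

    def __init__(self, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, clip_feats):             # clip_feats: (batch, M, 128)
        _, (h_n, _) = self.lstm(clip_feats)    # h_n: (2, batch, 128), one per direction
        return self.proj(torch.cat([h_n[0], h_n[1]], dim=-1))  # (batch, 128)

# e.g., aggregate the 79 clip features of one question into a single 128-d vector
a_q = BiLSTMAggregator()(torch.randn(1, 79, 128))  # shape (1, 128)
```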
In our experiments, we use the bi-directional LSTM [41] for the baseline's video feature extraction. Moreover, the non-local scheme [19] was recently proposed to address the long-term forgetting of RNNs [66]. We use the RNN or the non-local module in place of the 3D CNN and redundancy-aware self-attention in our framework to extract the 128-dimensional question-wise feature representation. The quantitative evaluation results are shown in Tab. IV. In addition, the accuracy over different epochs is plotted in Fig. 8. Our proposed framework achieves significantly better performance than the RNN and non-local counterparts.

We also provide a systematic ablation study of our framework modules:
• Our:color indicates using RGB frames as input, i.e., without the gray-scale pre-processing. We note that we can simply modify the 3D CNN for multi-channel input, while the computational cost increases significantly.
• Our:non-local indicates using the conventional non-local module [19] as an alternative to our redundancy-aware self-attention module. We note that the original non-local module [19] is also first introduced to depression detection in this paper and is regarded as a baseline.
• Our:w/o time denotes that $t^q$ is not concatenated for the question-level fusion.
• Our:$f_j^{(l-1)}$ refers to using $f_j^{q(l-1)}$ instead of the difference term $f_j^{q(l-1)} - f_i^{q(l-1)}$ in Eq. (1). It does not explicitly consider the redundancy and leads to lower accuracy.
• Our:(ΨΦ)^L indicates using the embedded Gaussian pair-wise affinity in every block, which has similar performance but usually doubles the training time.
• Our:w/o ∆ indicates removing $\Delta_{i,j}$ from Eq. (1), i.e., not taking the temporal sequence into consideration.
• Our+gender indicates that we add a 1/0 label for male/female along with the time.
• Our:MLP indicates using only an MLP to fuse the score and the SDS time.
• Our:SLF indicates using score-level fusion instead of feature-level fusion.

The results are provided in Tab. V. The relatively inferior performance of the compared settings demonstrates the effectiveness of our choices. Adding the gender label does not improve the accuracy or AUC metrics. Our proposed method is able to exploit the video information and achieves better performance than using an MLP to fuse the score and SDS time. In addition, our feature-level fusion significantly outperforms score-level fusion.

There are several hyperparameters in our framework. We provide a sensitivity analysis of these settings in this subsection, with the results shown in Fig. 9. Specifically, we tested different output dimensions of the 3D CNN to explore the balance between representativeness and computational cost. In Fig. 9, we can see that the performance is stable for 3D CNN output dimensions between 128 and 256. A longer output feature introduces significant additional cost for the subsequent redundancy-aware self-attention scheme. Moreover, since the redundancy-aware self-attention scheme does not change the length of the feature, the fully connected layers struggle to process longer inputs without enlarging the network structure. The length of each clip is related to the time it takes a human expression or action to develop in this task. The performance is not sensitive within a large range, e.g., 8 to 13 frames per clip. A too-short clip may fail to cover an expression or action, while a longer clip can be harder to process effectively to extract useful information.
In addition, we can use different L to configure the number of redundancy-aware self-attention blocks; with five self-attention blocks, we achieve the best performance in our task. In Table VI, we fix the 40 testing subjects in each cross-validation round and reduce the number of training samples in each round to 40, 80, and 120 subjects. The performance improves with more training data, while the difference between using 120 and 160 subjects is small. In Table VII, we compare the performance of different head box sizes. The performance is not sensitive to the size within a relatively large range, while 200% is a good balance between efficiency and performance. In Table VIII, we investigate the performance with different video frame rates. We see a significant performance drop in both accuracy and AUC for lower fps; therefore, we choose the highest fps in our dataset. It could be promising to apply a higher-frequency camera to capture micro-expression information, although this would be costly in computation and memory.

Depression affects more than 300 million people worldwide, and early diagnosis can be immensely helpful for treatment. Depression is even more prevalent during the COVID-19 pandemic [4], with its long-term quarantine rules. A recent COVID-19 mental health survey [5] indicates that 23% of adults in Ireland reported suffering from depression (https://www.maynoothuniversity.ie/newsevents/covid19mentalhealthsurvey/maynoothuniversityandtrinitycollegefindshighratesanxiety). However, the clinician interview can be prohibited or difficult considering the restrictions for avoiding infectious diseases, e.g., COVID-19. This further calls for efficient self-administered screening for depression detection. The proposed framework has demonstrated good prediction accuracy for normal and depression subjects, and has potential for clinical practice in the future, especially for self-screening.

With the Software-Defined Camera (SDC), we are able to transfer the questionnaire and its video to the back-end server for processing. Moreover, a similar protocol can potentially be applied in smartphone apps, which can easily capture the face video of the user with the front camera. We note that the captured view can differ from our collected data and may result in a domain shift of the appearance. A possible solution to avoid large-scale labeling of the mobile-captured video is to use unsupervised domain adaptation to transfer the knowledge from our dataset to the unlabeled mobile dataset [67], [68], [69]. In addition, it is promising to apply a face-pose-invariant or robust feature extractor as in [35], [70], [71], [72]. Therefore, the subsequent video-level aggregation modules can be shared across datasets with different face poses. In fact, our pre-processing only crops a small area around the head, which is robust to background changes. We can also adjust the threshold for different applications with different sensitivities to misclassification. Positive patients will then be referred to specialized clinics for a more comprehensive diagnosis. A swift, automated deep learning system could partially substitute for and support primary doctors' long-term clinical training, improving the primary diagnosis accuracy of depression in developing countries and laying the groundwork for the early diagnosis and care of depression patients.
Our system targets the SDS evaluation and its video, while the clinician interview usually involves round-based conversation. The spontaneous reaction and the speech (including text and phonemes) can provide more informative features, and an interactive multi-round dialog system can be a promising direction. In addition, we only collect subjects in China, mostly in Guangdong province, and the population shift may affect the performance of our system; we will continuously collect more samples from different areas in follow-up studies to achieve better performance. Moreover, anxiety is closely related to depression but can require different treatment. In future work, we also plan to incorporate anxiety into our diagnostic system.

This study aims to automatically explore both the SDS evaluation and its question-wise video recording. By extending the face detector box, the facial expression, eye movement, and actions such as head-scratching and chin-touching are taken into account. A hierarchical end-to-end neural network framework is proposed to process the long-term variable-length video, which is also conditioned on the questionnaire results and the answering time. Based on the collected SDS and video recording dataset with accurate clinician interview labels, our model can make the diagnosis by fusing the information in the tabular SDS results and the video sequence. The 3D CNN module is able to efficiently explore the local temporal features, while the novel redundancy-aware self-attention module explicitly emphasizes the discriminative clips and reduces the redundancy based on pair-wise feature affinity. Our system exhibits appealing accuracy for depression detection, which can be promising for clinical practice in the future, especially on smartphones. The positive cases will then be referred to a specialized hospital for a final clinician diagnosis and care.
REFERENCES

Depression: Causes and treatment
The clinical interview for depression: development, reliability and validity
Prevalence of depression symptoms in US adults before and during the COVID-19 pandemic
Anxiety and depression in the Republic of Ireland during the COVID-19 pandemic
A self-rating depression scale
Self-rating depression scale in an outpatient clinic: further validation of the SDS
Validity of the Zung self-rating depression scale
Reliability, discriminant and predictive validity of the Zung self-rating depression scale
Factors influencing the self-rating depression scale
A structured interview guide for the Hamilton depression rating scale
Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications
Automatic assessment of depression based on visual cues: a systematic review
An improved classification model for depression detection using EEG and eye tracking data
Impaired recognition of affect in facial expression in depressed patients
Facial expression of schizophrenic patients and their interaction partners
Depression: A clinical-research approach
3D convolutional neural networks for human action recognition
Non-local neural networks
Automatic behavior descriptors for psychological disorder analysis
An automated framework for depression analysis
Humaine Association Conference on Affective Computing and Intelligent Interaction
The distress analysis interview corpus of human and computer interviews
AVEC 2013: the continuous audio/visual emotion and depression recognition challenge
Is this patient clinically depressed?
Text-based depression detection on sparse data
The PHQ-8 as a measure of current depression in the general population
Measuring depression symptom severity from spoken language and 3D facial expressions
DEPA: Self-supervised audio embedding for depression detection
AudVowelConsNet: A phoneme-level based deep CNN architecture for clinical depression diagnosis
Multimodal depression detection: fusion of electroencephalography and paralinguistic behaviors using a novel strategy for classifier ensemble
An optimal channel selection for EEG-based depression detection via kernel-target alignment
Feature-level fusion for depression recognition based on fNIRS data
A survey of temporal knowledge discovery paradigms and methods
Identity-aware facial expression recognition in compressed video
Mutual information regularized identity-aware facial expression recognition in compressed video
Classification-aware semi-supervised domain adaptation
See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification
Automated interpretation of congenital heart disease from multi-view echocardiograms
Permutation-invariant feature restructuring for correlation-aware image set-based recognition
Bidirectional recurrent neural networks
Action recognition in video sequences using deep bi-directional LSTM with CNN features
Dependency-aware attention control for image set-based face recognition
Revisiting temporal modeling for video-based person reID
Attention is all you need
Self-attention with relative position representations
A non-local algorithm for image denoising
Interaction networks for learning about objects, relations and physics
VAIN: Attentional multi-agent predictive modeling
Visual interaction networks: Learning a physics simulator from video
Learning to compare: Relation network for few-shot learning
Temporal relational reasoning in videos
FaceBoxes: A CPU real-time face detector with high accuracy
Psychotic depression: what is it and how should we treat it?
Nonverbal behavior and childhood depression
The SCL-90-R, Brief Symptom Inventory, and matching clinical rating scales
Psychometric attributes of the Self-Rating Anxiety Scale
Dependency-aware attention control for unconstrained face recognition with image sets
Nonlocal neural networks, nonlocal diffusion and nonlocal modeling
Spectral graph theory
Nonlocal linear image regularization and supervised segmentation
Analysis and approximation of nonlocal diffusion problems with volume constraints
Analysis of function of rectified linear unit used in deep learning
Automatic differentiation in PyTorch
Adam: A method for stochastic optimization
How to construct deep recurrent neural networks
On the difficulty of training recurrent neural networks
Subtype-aware unsupervised domain adaptation for medical diagnosis
Energy-constrained self-training for unsupervised domain adaptation
Image2Audio: Facilitating semi-supervised audio emotion recognition with facial expression image
Mutual information regularized feature-level Frankenstein for discriminative recognition
Feature-level Frankenstein: Eliminating variations for discriminative recognition
Disentanglement for discriminative visual recognition