Multi-task UNet: Jointly Boosting Saliency Prediction and Disease Classification on Chest X-ray Images
Hongzhi Zhu, Robert Rohling, Septimiu Salcudean
Date: 2022-02-15

Human visual attention has recently shown its distinct capability in boosting machine learning models. However, studies that aim to facilitate medical tasks with human visual attention are still scarce. To support the use of visual attention, this paper describes a novel deep learning model for visual saliency prediction on chest X-ray (CXR) images. To cope with data deficiency, we exploit the multi-task learning method and tackle disease classification on CXR simultaneously. For a more robust training process, we propose a further optimized multi-task learning scheme to better handle model overfitting. Experiments show that our proposed deep learning model with our new learning scheme can outperform existing methods dedicated either to saliency prediction or to image classification. The code used in this paper is available at https://github.com/hz-zhu/MT-UNet.

Recent work in machine learning and computer vision has demonstrated the advantages of integrating human attention with artificial neural network models, as studies show that many machine vision tasks, e.g., image segmentation, image captioning, and object recognition, can benefit from adding human visual attention (Liu and Milanova, 2018). Visual attention is the ability, inherent in biological visual systems, to selectively recognize regions or features in scenes relevant to a specific task (Borji et al., 2012): "bottom-up" attention (also called exogenous attention) focuses on physical properties of the visual input that are salient and distinguishable, while "top-down" attention (also called endogenous attention) generally refers to mental strategies adopted by the visual system to accomplish the intended visual tasks (Paneri and Gregoriou, 2017). Early research on saliency prediction aimed to understand attention triggered by visual features and patterns, and thus "bottom-up" attention was the research focus (Borji et al., 2012). More recent attempts, empowered by interdisciplinary efforts, study both "bottom-up" and "top-down" attention, and therefore the terms saliency prediction and visual attention prediction are used interchangeably (Sun et al., 2021). In this paper, we use the term saliency prediction for the prediction of human visual attention allocation when viewing 2D images, encompassing both "bottom-up" and "top-down" attention. A 2D heatmap is usually used to represent the human visual attention distribution. Note that the saliency prediction studied in this paper is different from a neural network's saliency/attention, which can be visualized through class activation mapping (CAM) (Zhou et al., 2016) and other methods (Simonyan et al., 2013; Fu et al., 2019; Selvaraju et al., 2016). With the establishment of several benchmark datasets, data-driven approaches have demonstrated major advances in saliency prediction (reviewed in Borji (2019)). However, saliency prediction for natural scenes has been the primary focus, and more needs to be done in the medical domain. Hence, we study saliency prediction for the examination of chest X-ray (CXR) images, one of the most common radiology tasks worldwide.
CXR imaging is commonly used for the diagnosis of cardiac and/or respiratory abnormalities; it is capable of identifying multiple conditions, e.g., COVID-19, pneumonia, and heart enlargement, from a single shot (Çallı et al., 2021). There exist multiple public CXR datasets (Irvin et al., 2019; Wang et al., 2017). However, the creation of large comprehensive medical datasets is labour intensive and requires significant medical resources, which are usually scarce (Castro et al., 2020). Consequently, medical datasets are rarely as abundant as those in non-medical fields, and machine learning approaches applied to medical datasets need to address the problem of data scarcity. In this paper, we exploit multi-task learning as a solution. Multi-task learning is known for its inductive transfer characteristics, which can drive strong representation learning and generalization in each component task (Caruana, 1997). Multi-task learning methods therefore partially alleviate some major shortcomings of deep learning, namely high demands on data sufficiency and heavy computational loads (Crawshaw, 2020). However, applying multi-task learning methods successfully still poses challenges, including the proper selection of component tasks, the architecture of the network, and the optimization of the training scheme, among others (Zhang and Yang, 2021; Crawshaw, 2020). This paper investigates the proper configuration of a multi-task learning model that tackles visual saliency prediction and image classification simultaneously. The main contributions of this paper are: 1) the development of a new deep convolutional neural network (DCNN) architecture for CXR image saliency prediction and classification based on UNet (Ronneberger et al., 2015), and 2) the proposal of an optimized multi-task learning scheme that handles overfitting. Our method aims to outperform state-of-the-art networks dedicated either to saliency prediction or to image classification. DCNNs are the leading machine learning method applied to saliency prediction (Pan et al., 2016; Kümmerer et al., 2016; Jia and Bruce, 2020; Kroner et al., 2020), and transfer learning with pre-trained networks has been observed to boost saliency prediction performance (Oyama and Yamanaka, 2017; Kümmerer et al., 2016; Oyama and Yamanaka, 2018). The majority of DCNN approaches target natural-scene saliency prediction; so far, only a few studies address saliency prediction for medical images. In Cai et al. (2018), a generative adversarial network is used to predict an expert sonographer's saliency when performing standard fetal head plane detection in ultrasound (US) images. However, saliency prediction there serves as a secondary task assisting the primary detection task, and consequently the saliency prediction performance fails to outperform benchmark prediction methods on several key metrics. Similarly, in the proof-of-concept study by Karargyris et al. (2021), gaze data is used as an auxiliary task for CXR image classification, and saliency prediction performance is not reported. Public CXR datasets have enabled data-driven approaches for automatic image analysis and diagnosis (Serte et al., 2020). Advances in standardized image classification networks, e.g., ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and EfficientNet (Tan and Le, 2019), facilitate CXR image classification.
Yet, CXR image classification remains challenging, as CXR images are noisy and may contain subtle features that are difficult to recognize even by experts (Çallı et al., 2021; Khan et al., 2021). As stated in Section 1, component task selection, network architecture design, and the training scheme are key factors for multi-task learning. We select the classification task alongside saliency prediction based on the fact that attention patterns are task specific (Karessli et al., 2017): radiologists are likely to exhibit distinguishable visual behaviors when different patient conditions appear on CXR images (McLaughlin et al., 2017). This section introduces our multi-task UNet (MT-UNet) architecture and derives a better multi-task training scheme for saliency prediction and image classification. Figure 1 shows the architecture of the proposed MT-UNet. The network takes CXR images $x \in \mathbb{R}^{1 \times H \times W}$, where $H$ and $W$ are the image dimensions, as input, and produces two outputs: the predicted saliency $y_s \in \mathbb{R}^{1 \times H \times W}$ and the predicted classification $y_c \in \mathbb{R}^{C}$, where $C$ is the number of classes. As the ground truth for $y_s$ is a human visual attention distribution, represented as a 2D matrix whose elements are non-negative and sum to 1, $y_s$ is normalized by Softmax before output from MT-UNet. Softmax is also applied to $y_c$ before output so that the classification outcome can be interpreted as class probabilities. For simplicity of notation, batch dimensions are omitted. The proposed MT-UNet is derived from the standard UNet architecture (Ronneberger et al., 2015). As a well-known image-to-image deep learning model, the UNet structure has been adopted for various tasks. For example, the UNet has been appended with additional structures for visual scene understanding (Jha et al., 2020); features from the bottleneck (the middle of the UNet) have been extracted for image classification tasks (Karargyris et al., 2021); and, by combining UNet with a feature pyramid network (Lin et al., 2017), features at different depths have been aggregated for enhanced segmentation (Moradi et al., 2019). Moreover, the encoder-decoder structure of UNet has been utilized for multi-task learning, where the encoder learns representative features, followed by designated decoder structures or classification heads for image reconstruction, segmentation, and/or classification (Zhou et al., 2021; Amyar et al., 2020). In our design, we apply classification heads (shaded in light green in Figure 1) added not only to the bottleneck but also to the ending part of the UNet architecture. These additional classification-specific structures aggregate middle- and higher-level features for classification, exploiting features learnt at different depths. The classification heads apply global average pooling to the 4D feature tensors, followed by concatenation and two linear transforms (dense layers) with dropout (rate = 25%) in between to produce classification outcomes, as sketched below. MT-UNet belongs to the hard parameter sharing category of multi-task learning, where different tasks share the same trainable parameters before branching out to each task's specific parameters (Vandenhende et al., 2021). Having more trainable parameters in task-specific structures may improve performance for that task, at the cost of introducing additional parameters and increasing the computational load (Crawshaw, 2020; Vandenhende et al., 2021).
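A minimal PyTorch sketch of such a classification head is given below: global average pooling over feature maps tapped at two depths, concatenation, and two dense layers with 25% dropout in between. The channel counts, hidden width, and exact tap points are our assumptions for illustration, not the released configuration.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Aggregates bottleneck and top-level UNet features for classification.
    def __init__(self, bottleneck_ch=1024, top_ch=64, hidden=128, n_classes=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(bottleneck_ch + top_ch, hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.25),  # dropout rate of 25%, as stated above
            nn.Linear(hidden, n_classes),
        )

    def forward(self, bottleneck_feat, top_feat):
        # bottleneck_feat: (B, bottleneck_ch, h, w); top_feat: (B, top_ch, H, W)
        z = torch.cat([self.pool(bottleneck_feat).flatten(1),
                       self.pool(top_feat).flatten(1)], dim=1)
        return torch.softmax(self.fc(z), dim=1)  # class probabilities y_c
```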
In our design, we wish to avoid heavy structures with many task-specific parameters, and therefore task-specific structures are minimized. In Figure 1, yellow and green shades denote network structures dedicated to saliency prediction and classification, respectively. Balancing the losses between tasks in a multi-task training process has a direct impact on the training outcome (Vandenhende et al., 2021). Several multi-task training schemes exist (Kendall et al., 2018; Chen et al., 2018; Guo et al., 2018; Sener and Koltun, 2018); among them, we adopt the uncertainty-based balancing scheme (Kendall et al., 2018) with the modification proposed in (Liebel and Körner, 2018). Hence, the loss function is

$$L = \frac{1}{\sigma_s^2} L_s + \frac{1}{\sigma_c^2} L_c + \ln(\sigma_s + 1) + \ln(\sigma_c + 1), \quad L > 0, \qquad (1)$$

where $L_s$ and $L_c$ are the loss values for $y_s$ and $y_c$, respectively; $\sigma_s > 0$ and $\sigma_c > 0$ are trainable scalars estimating the uncertainty of $L_s$ and $L_c$, respectively, both initialized to 1; and $\ln(\sigma_s + 1)$ and $\ln(\sigma_c + 1)$ are regularizing terms that prevent an arbitrary decrease of $\sigma_s$ and $\sigma_c$. With Equation (1), the $\sigma$ values can dynamically weigh losses of different amplitudes during training, and a loss with low uncertainty (small $\sigma$ value) is prioritized in the training process. Given $y_s$ and $y_c$ with their ground truths $\bar{y}_s$ and $\bar{y}_c$, respectively, the loss functions are

$$L_s = H(\bar{y}_s, y_s) - H(\bar{y}_s), \qquad (2)$$
$$L_c = H(\bar{y}_c, y_c), \qquad (3)$$

where $H(Q, R) = -\sum_{i=1}^{n} Q_i \ln(R_i)$ is the cross entropy of two discrete distributions $Q$ and $R$, both with $n$ elements, and $H(Q) = H(Q, Q)$ is the entropy (self cross entropy) of the discrete distribution $Q$. $L_s$ is the Kullback-Leibler divergence (KLD) loss, and $L_c$ is the cross-entropy loss. From Equations (2) and (3), only the cross-entropy terms, $H(\cdot, \cdot)$, generate gradients when updating network parameters, as the term $-H(\bar{y}_s)$ in $L_s$ is a constant with zero gradient. Therefore, we extend the method in (Kendall et al., 2018) and use $\frac{1}{\sigma^2}$ to scale a KLD loss ($L_s$) in the same way as a cross-entropy loss ($L_c$). Although the training scheme in Equation (1) has yielded many successful applications, overfitting can still jeopardize the multi-task training process, especially on small datasets (Wang et al., 2020). Multiple factors can cause overfitting, among which the learning rate, $r > 0$, shows the most significant impact (Li et al., 2019). The learning rate also strongly influences the training outcome in general (Smith, 2018), making it one of the most important hyper-parameters of a training process. When training MT-UNet, $r$ is moderated by several factors. The first is the optimizer: many optimizers, e.g., Adam (Kingma and Ba, 2014) and RMSProp (Tieleman et al., 2012), deploy the momentum mechanism or its variants, which can adaptively adjust the effective learning rate, $r_e$, during training. The second is the learning rate scheduler, often used for more efficient training; its influence on $r$ can be adaptive, e.g., reduce learning rate on plateau (RLRP), or more arbitrary, e.g., cosine annealing with warm restarts (Loshchilov and Hutter, 2016). The third factor follows from Equation (1): an uncertainty estimator $\sigma$ for a loss $L$ also serves as a learning rate adaptor for $L$. More specifically, given a loss value $L$ with learning rate $r$, the effective learning rate for parameters trained with the scaled loss $\frac{L}{\sigma^2}$ is $\frac{r}{\sigma^2}$.
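The following is a minimal PyTorch sketch of Equations (1) to (3), with trainable uncertainty scalars and the KLD / cross-entropy component losses; it is our illustration, not the authors' released implementation. Saliency maps are assumed already Softmax-normalized, and class labels are integer indices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyWeightedLoss(nn.Module):
    # Multi-task loss of Equation (1) with component losses (2) and (3).
    def __init__(self):
        super().__init__()
        # Trainable uncertainty estimators, initialized to 1 as in the paper.
        # (A practical implementation might train log-sigma instead, to keep
        # sigma strictly positive; plain parameters are used here for clarity.)
        self.sigma_s = nn.Parameter(torch.tensor(1.0))
        self.sigma_c = nn.Parameter(torch.tensor(1.0))

    def forward(self, y_s, y_s_true, y_c, y_c_true):
        # L_s, Eq. (2): KLD(y_s_true || y_s) = H(y_s_true, y_s) - H(y_s_true).
        # F.kl_div expects log-probabilities as its first argument.
        L_s = F.kl_div(torch.log(y_s.flatten(1) + 1e-12),
                       y_s_true.flatten(1), reduction='batchmean')
        # L_c, Eq. (3): cross entropy between labels and predicted probabilities.
        L_c = F.nll_loss(torch.log(y_c + 1e-12), y_c_true)
        # Eq. (1): uncertainty-scaled sum plus ln(sigma + 1) regularizers.
        return (L_s / self.sigma_s.pow(2) + L_c / self.sigma_c.pow(2)
                + torch.log(self.sigma_s + 1) + torch.log(self.sigma_c + 1))
```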
Decreasing $r$ upon overfitting can alleviate its effects (Smith, 2018; Duffner and Garcia, 2007), but Equation (1) leads to an increased learning rate upon overfitting, further worsening the training process. This happens because the training loss decreases when overfitting occurs, reducing its variance at the same time; thus $\sigma$ decreases accordingly, which increases the effective learning rate, creating a vicious circle of overfitting. A more detailed mathematical derivation is presented in Appendix A. This phenomenon can be observed in Figure 2, which shows the changes of losses and $\sigma$ values during a training process following Equation (1). As seen in Figure 2(a), at epoch 40, after an initial decrease in both the training and validation losses, the training loss starts to decrease at an accelerating rate while the validation loss starts to grow, which is the vicious circle of overfitting. An RLRP scheduler can halt the vicious circle by resetting the model parameters to a former epoch and reducing $r$; yet, even with a reduced $r$, a vicious circle of overfitting can re-emerge in later epochs. To alleviate overfitting, we propose replacing Equation (1) with the following:

$$L = \frac{1}{\sigma_s^2} L_s + \ln(\sigma_s + 1) + L_c, \qquad (4)$$
$$L = L_s + \frac{1}{\sigma_c^2} L_c + \ln(\sigma_c + 1). \qquad (5)$$

The essence of Equations (4) and (5) is to fix the uncertainty term for one loss in Equation (1) to 1, so that the flexibility in changing the effective learning rate is reduced. With the uncertainty term fixed for one component loss, Equations (4) and (5) demonstrate the ability to alleviate overfitting and stabilize the training process. Note that Equations (4) and (5) cannot be used interchangeably: both must be tested to check which achieves better performance, as, depending on the dataset and training process, overfitting can occur with different severity in each component task. In this study, the training process with Equation (5) achieves the best performance (see the sketch at the end of this section). An ablation study of this method is presented in Section 5. We use the "chest X-ray dataset with eye-tracking and report dictation" (Karargyris et al., 2021), shared via PhysioNet (Moody et al., 2000), in this study. The dataset was derived from the MIMIC-CXR dataset (Johnson et al., 2019a,b) with additional gaze tracking and dictation from an expert radiologist. 1083 CXR images are included in the dataset; accompanying each image are tracked gaze data; a diagnostic label (normal, pneumonia, or enlarged heart); segmentations of the lungs, mediastinum, and aortic knob; and the radiologist's audio with dictation. The CXR images in the dataset come in various resolutions, e.g., 3056 × 2044, and we downsample and/or pad each image to 640 × 416. A GP3 gaze tracker by Gazepoint (Vancouver, Canada) was used for the collection of gaze data; the tracker has an accuracy of around 1° of visual angle and a 60 Hz sampling rate (Zhu et al., 2019). Several metrics have been used to evaluate saliency prediction performance, and they can be classified into location-based metrics and distribution-based metrics (Bylinskii et al., 2018). Due to the tracking inaccuracy of the GP3 gaze tracker, location-based metrics are not suited to this study. Therefore, in this paper, we follow the suggestions in (Bylinskii et al., 2018) and use KLD for performance evaluation. We also include histogram similarity (HS) and Pearson's correlation coefficient (PCC) for reference purposes.
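Returning to the learning schemes above, a short sketch of the Equation (5) scheme (MTLS3) follows, under our reading that it is the variant fixing the saliency uncertainty to 1 while keeping $\sigma_c$ trainable; the component losses $L_s$ and $L_c$ are assumed to be computed as in the previous sketch.

```python
import torch
import torch.nn as nn

class FixedSaliencyUncertaintyLoss(nn.Module):
    # Equation (5): only the classification loss keeps a trainable uncertainty.
    def __init__(self):
        super().__init__()
        self.sigma_c = nn.Parameter(torch.tensor(1.0))

    def forward(self, L_s, L_c):
        # L_s enters unscaled, so its effective learning rate can no longer be
        # inflated by a shrinking uncertainty estimate during overfitting.
        return L_s + L_c / self.sigma_c.pow(2) + torch.log(self.sigma_c + 1)
```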
For the evaluation of classification performance, we use the area under the curve (AUC) metric for multi-class classification (Hand and Till, 2001; Fawcett, 2006) and the classification accuracy (ACC) metric. We also include the AUC for each class, normal, enlarged heart, and pneumonia, denoted AUC-Y1, AUC-Y2, and AUC-Y3, respectively. In this paper, all metric values are presented as median statistics followed by standard deviations after the ± sign. For metrics marked with an up-pointing arrow ↑, greater values reflect better performance, and vice versa. The best metrics are shown in bold. In this subsection, we compare the performance of MT-UNet with benchmark networks for CXR image classification and saliency prediction. Detailed training settings are presented in Appendix B. For CXR image classification, the benchmark networks are chosen from the top-performing networks for CXR image classification examined in (El Asnaoui et al., 2021), namely ResNet50 (He et al., 2016) and Inception-ResNet v2 (abbreviated IRNetV2 in this paper) (Szegedy et al., 2017). Following Karargyris et al. (2021), we also include a state-of-the-art general purpose classification network, EfficientNetV2-S (abbreviated EffNetV2-S) (Tan and Le, 2021), for comparison. For completeness, classification using a standard UNet with an additional classification head (denoted UNetC) is included. Results are presented in Table 1; we can see that MT-UNet outperforms the other classification networks. For CXR image saliency prediction, comparison was conducted with three state-of-the-art saliency prediction models: SimpleNet (Reddy et al., 2020), MSINet (Kroner et al., 2020), and VGGSSM (Cao et al., 2020). Saliency prediction using a standard UNet (denoted UNetS) is also included for reference. Table 2 shows the results, where MT-UNet outperforms the rest. Visual comparisons of saliency prediction results are presented in Table 4 in Appendix C.

Table 1: Performance comparison between classification models.

           MT-UNet          UNetC            EffNetV2-S       IRNetV2          ResNet50
ACC ↑      0.670 ± 0.018    0.593 ± 0.009    0.640 ± 0.037    0.640 ± 0.017    0.613 ± 0.013
AUC ↑      0.843 ± 0.012    0.780 ± 0.006    0.826 ± 0.015    0.824 ± 0.014    0.816 ± 0.010
AUC-Y1 ↑   0.864 ± 0.014    0.841 ± 0.007    0.852 ± 0.013    0.862 ± 0.016    0.845 ± 0.015
AUC-Y2 ↑   0.912 ± 0.008    0.840 ± 0.003    0.901 ± 0.015    0.897 ± 0.011    0.896 ± 0.015
AUC-Y3 ↑   0.711 ± 0.027    0.597 ± 0.018    0.653 ± 0.017    0.633 ± 0.036    0.622 ± 0.022

Table 2: Performance comparison between saliency prediction models.

           MT-UNet          UNetS            SimpleNet        MSINet           VGGSSM
KLD ↓      0.726 ± 0.004    0.750 ± 0.002    0.758 ± 0.009    0.748 ± 0.003    0.743 ± 0.007
PCC ↑      0.569 ± 0.004    0.552 ± 0.002    0.545 ± 0.008    0.557 ± 0.002    0.561 ± 0.005
HS ↑       0.548 ± 0.001    0.540 ± 0.001    0.541 ± 0.002    0.545 ± 0.001    0.545 ± 0.003

To validate the modified multi-task learning scheme, an ablation study is performed. The multi-task learning schemes following Equations (1), (4), and (5) are compared, denoted MTLS1, MTLS2, and MTLS3, respectively. Note that the best-performing MTLS3 is used for the benchmark comparisons in Section 5.1. Figure 3 in Appendix C shows the training processes for MTLS2 and MTLS3. From Figures 2 and 3, we can see that overfitting occurs for both MTLS1 and MTLS2, but is reduced in MTLS3. The training processes shown in Figures 2 and 3 use optimized hyper-parameters. The resulting performances are compared in Table 3 in Appendix C.
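For reference, the distribution-based saliency metrics reported in Table 2 can be computed as sketched below, assuming both maps are normalized to sum to 1. In particular, we assume HS is computed as the histogram intersection (sum of element-wise minima), which may differ from the authors' exact formulation.

```python
import numpy as np

def kld(s_true, s_pred, eps=1e-12):
    # Kullback-Leibler divergence KLD(s_true || s_pred); lower is better.
    return float(np.sum(s_true * np.log((s_true + eps) / (s_pred + eps))))

def pcc(s_true, s_pred):
    # Pearson's correlation coefficient between the two maps; higher is better.
    return float(np.corrcoef(s_true.ravel(), s_pred.ravel())[0, 1])

def hs(s_true, s_pred):
    # Histogram similarity as distribution intersection; higher is better.
    return float(np.sum(np.minimum(s_true, s_pred)))
```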
We can see that MTLS3 outperforms the other learning schemes in both classification and saliency prediction. To validate the effect of using classification heads that aggregate features from different depths, we create ablated versions of MT-UNet that use features from either the bottleneck or the top layer of MT-UNet for classification, denoted MT-UNetB and MT-UNetT, respectively. Results are presented in Table 3 in Appendix C; MT-UNet generally performs better than MT-UNetT and MT-UNetB. In this paper, we build the MT-UNet model and propose a further optimized multi-task learning scheme for saliency prediction and disease classification on CXR images. While a multi-task learning model has the potential to enhance the performance of all component tasks, a proper training scheme is one of the key factors to fully unveil this potential. As shown in Table 3, MT-UNet with the standard multi-task learning scheme may barely outperform existing models for saliency prediction or image classification. Several lines of future work could improve this study. The first is the expansion of gaze tracking datasets for medical images: so far, only 1083 CXR images are publicly available with radiologists' gaze behavior, limiting extensive studies of gaze-tracking-assisted machine learning methods in the medical field. Also, more dedicated studies on multi-task learning methods, especially for small datasets, would help medical machine learning tasks. Overfitting and data deficiency are lingering challenges encountered by many studies; a better multi-task learning method may handle these challenges more easily.

Acknowledgments. We would like to thank physionet.org for providing the open platform for dataset sharing, and we would also like to express our gratitude to the contributors who collected, organised, and published the multi-modal chest X-ray dataset for this research. This research is supported by Compute Canada and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Appendix A. Let $L \geq 0$ be the loss for a task $T$, and $\sigma > 0$ be the variance estimator for $L$ used in Equation (1). The loss for $T$ following Equation (1) can then be expressed as

$$\mathcal{L} = \frac{L}{\sigma^2} + \ln(\sigma + 1).$$

The partial derivative of $\mathcal{L}$ with respect to $\sigma$ is

$$\frac{\partial \mathcal{L}}{\partial \sigma} = -\frac{2L}{\sigma^3} + \frac{1}{\sigma + 1}.$$

During a gradient-based optimization process minimizing $\mathcal{L}$, $\sigma$ converges to the equilibrium value ($\sigma$ remains unchanged after gradient descent), which is reached when $\frac{\partial \mathcal{L}}{\partial \sigma} = 0$. Therefore, the following equation holds when $\sigma$ is at its equilibrium value, denoted $\bar{\sigma}$:

$$L = \frac{\bar{\sigma}^3}{2(\bar{\sigma} + 1)},$$

which is obtained by setting $\frac{\partial \mathcal{L}}{\partial \sigma} = 0$. Let $f(\bar{\sigma}) = L$, $\bar{\sigma} > 0$. We can calculate that

$$f'(\bar{\sigma}) = \frac{\bar{\sigma}^2 (2\bar{\sigma} + 3)}{2(\bar{\sigma} + 1)^2} > 0.$$

Therefore, $f(\bar{\sigma})$ is strictly monotonically increasing with respect to $\bar{\sigma}$, and hence the inverse function of $f(\bar{\sigma})$, $f^{-1}(\cdot)$, exists. More specifically, we have $\bar{\sigma} = f^{-1}(L)$. As a pair of inverse functions share the same monotonicity, $\bar{\sigma} = f^{-1}(L)$ is also strictly monotonically increasing. Thus, when $L$ decreases due to overfitting, $\bar{\sigma}$ decreases accordingly, forcing $\sigma$ to decrease. The decreased $\sigma$ leads to an increase in the effective learning rate for $T$, forming a vicious circle of overfitting.
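The monotonicity argument can also be checked numerically. The snippet below (our own illustration, not part of the paper's code) inverts $f$ to recover the equilibrium $\bar{\sigma}$ for a sequence of shrinking loss values, showing $\bar{\sigma}$ falling and the effective learning rate $r/\bar{\sigma}^2$ rising.

```python
import numpy as np
from scipy.optimize import brentq

def sigma_eq(L):
    # Solve f(sigma) = sigma^3 / (2 (sigma + 1)) = L for sigma > 0.
    return brentq(lambda s: s**3 / (2 * (s + 1)) - L, 1e-6, 1e6)

r = 1e-4  # nominal learning rate (illustrative value)
for L in [1.0, 0.5, 0.1, 0.01]:  # training loss shrinking under overfitting
    s = sigma_eq(L)
    print(f"L={L:5.2f}  sigma={s:.3f}  effective lr={r / s**2:.2e}")
```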
Appendix B. We use the Adam optimizer with default parameters (Kingma and Ba, 2014) and the RLRP scheduler for all training processes. The RLRP scheduler reduces the learning rate by 90% when the validation loss stops improving for P consecutive epochs, and resets the model parameters to the earlier epoch at which the network achieved its best validation loss. All training and testing are performed with the PyTorch framework (Paszke et al., 2019). The hyper-parameters for optimization are the learning rate r and the patience P of the RLRP scheduler. The dataset is randomly partitioned into 70%, 10%, and 20% subsets for training, validation, and testing, respectively. The random data partitioning preserves the balanced characteristic of the dataset: all classes have an equal share in all sub-datasets. All results presented in this paper are based on at least 5 independent trainings with the same hyper-parameters. NVIDIA V100 and A100 GPUs (Santa Clara, USA) were used. A sketch of this training setup is given below.

Table 4: Visualization of predicted saliency distributions. The ground truth and predicted saliency distributions are overlaid on CXR images. The jet colormap is used for the saliency distributions, where warmer (red and yellow) colors indicate higher concentrations of saliency and colder (green and blue) colors indicate lower concentrations of saliency.
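A minimal sketch of this training setup follows: Adam with default parameters, a ReduceLROnPlateau scheduler cutting the learning rate to 10% after P stagnant epochs, and a rewind of the model to its best checkpoint whenever the learning rate is reduced. The rewind logic is our own reading of the description above (PyTorch's built-in scheduler only lowers the learning rate), and names such as `train`, `loss_fn`, and the loader's (image, saliency, label) batches are assumptions for illustration.

```python
import copy
import torch

def train(model, loss_fn, train_loader, val_loader, epochs=100, lr=1e-4, P=5):
    # The loss module holds trainable sigma parameters, so they join the optimizer.
    opt = torch.optim.Adam(list(model.parameters()) + list(loss_fn.parameters()), lr=lr)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt, mode='min', factor=0.1, patience=P)  # factor=0.1 removes 90% of r
    best_val, best_state = float('inf'), None
    for epoch in range(epochs):
        model.train()
        for x, y_s_true, y_c_true in train_loader:
            opt.zero_grad()
            y_s, y_c = model(x)
            loss_fn(y_s, y_s_true, y_c, y_c_true).backward()
            opt.step()
        # Validation loss drives both checkpointing and the scheduler.
        model.eval()
        val = 0.0
        with torch.no_grad():
            for x, y_s_true, y_c_true in val_loader:
                y_s, y_c = model(x)
                val += loss_fn(y_s, y_s_true, y_c, y_c_true).item()
        val /= len(val_loader)
        if val < best_val:
            best_val, best_state = val, copy.deepcopy(model.state_dict())
        prev_lr = opt.param_groups[0]['lr']
        sched.step(val)
        if opt.param_groups[0]['lr'] < prev_lr and best_state is not None:
            model.load_state_dict(best_state)  # rewind to the best epoch so far
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```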
References

Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: classification and segmentation.
Saliency prediction in the deep learning era: successes and limitations.
Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study.
What do different evaluation metrics tell us about saliency models?
Multi-task SonoEyeNet: detection of fetal standardized planes assisted by generated sonographer attention maps.
Deep learning for chest X-ray analysis: a survey.
Aggregated deep saliency prediction by self-attention network.
Multitask learning. Machine Learning.
Causality matters in medical imaging.
GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks.
Multi-task learning with deep neural networks: a survey.
An online backpropagation algorithm with validation error-based adaptive learning rate.
Automated methods for detection and classification of pneumonia based on X-ray images using deep learning.
An introduction to ROC analysis.
MultiCAM: multiple class activation mapping for aircraft recognition in remote sensing images.
Dynamic task prioritization for multitask learning.
A simple generalisation of the area under the ROC curve for multiple class classification problems.
Deep residual learning for image recognition.
Densely connected convolutional networks.
CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison.
MT-UNet: a novel U-net based multi-task architecture for visual scene understanding.
EML-NET: an expandable multi-layer network for saliency prediction.
MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs.
Creation and validation of a chest X-ray dataset with eye-tracking and report dictation for AI development. Scientific Data.
Gaze embeddings for zero-shot image classification.
Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.
Intelligent pneumonia identification from chest X-rays: a systematic literature review.
Adam: a method for stochastic optimization.
Contextual encoder-decoder network for visual saliency prediction.
DeepGaze II: reading fixations from deep features trained on object recognition.
Research on overfitting of deep learning.
Accuracy of deep learning for automated detection of pneumonia using chest X-ray images: a systematic review and meta-analysis.
Auxiliary tasks in multi-task learning.
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection.
Visual attention in deep learning: a review.
SGDR: stochastic gradient descent with warm restarts.
Computing eye gaze metrics for the automatic assessment of radiographer performance during X-ray image interpretation.
PhysioNet: a research resource for studies of complex physiologic and biomedical signals.
MFP-Unet: a novel deep learning based approach for left ventricle segmentation in echocardiography.
Fully convolutional DenseNet for saliency-map prediction.
Influence of image classification accuracy on saliency map estimation.
Shallow and deep convolutional networks for saliency prediction.
Top-down control of visual attention by the prefrontal cortex: functional specialization and long-range interactions.
PyTorch: an imperative style, high-performance deep learning library.
Tidying deep saliency prediction architectures.
U-Net: convolutional networks for biomedical image segmentation.
Why did you say that?
Multi-task learning as multi-objective optimization.
Deep learning in medical imaging: a brief review.
Deep inside convolutional networks: visualising image classification models and saliency maps.
A disciplined approach to neural network hyper-parameters: part 1, learning rate, batch size, momentum, and weight decay.
Visual saliency prediction using multi-scale attention gated network.
Inception-v4, Inception-ResNet and the impact of residual connections on learning.
EfficientNet: rethinking model scaling for convolutional neural networks.
EfficientNetV2: smaller models and faster training.
Lecture 6.5, RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Multi-task learning for dense prediction tasks: a survey.
What makes training multi-modal classification networks hard?
Revisiting video saliency prediction in the deep learning era.
ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.
A survey on multi-task learning.
Learning deep features for discriminative localization.
Yue Zhou, Houjin Chen, Yanfeng Li, Qin Liu, Xuanang Xu, Shu Wang, Pew-Thian Yap, and Dinggang Shen. Multi-task learning for segmentation and classification of tumors in 3D automated breast ultrasound images. Medical Image Analysis, 70:101918, 2021.
Hongzhi Zhu, Septimiu E. Salcudean, and Robert N. Rohling. A novel gaze-supported multimodal human-computer interaction for ultrasound machines. International Journal of Computer Assisted Radiology and Surgery, 14(7):1107-1115, 2019.