key: cord-0064972-a6ppngvv authors: Sun, Haijing; Wang, Anna; Wang, Wenhui; Liu, Chen title: An Improved Deep Residual Network Prediction Model for the Early Diagnosis of Alzheimer’s Disease date: 2021-06-18 journal: Sensors (Basel) DOI: 10.3390/s21124182 sha: 413fa5d3fba4111948e77c6d4bf6434eeafa506b doc_id: 64972 cord_uid: a6ppngvv The early diagnosis of Alzheimer’s disease (AD) can allow patients to take preventive measures before irreversible brain damage occurs. It can be seen from cross-sectional imaging studies of AD that the features of the lesion areas in AD patients, as observed by magnetic resonance imaging (MRI), show significant variation, and these features are distributed throughout the image space. Since the convolutional layer of the general convolutional neural network (CNN) cannot satisfactorily extract long-distance correlation in the feature space, a deep residual network (ResNet) model, based on spatial transformer networks (STN) and the non-local attention mechanism, is proposed in this study for the early diagnosis of AD. In this ResNet model, a new Mish activation function is selected in the ResNet-50 backbone to replace the Relu function, STN is introduced between the input layer and the improved ResNet-50 backbone, and a non-local attention mechanism is introduced between the fourth and the fifth stages of the improved ResNet-50 backbone. This ResNet model can extract more information from the layers by deepening the network structure through deep ResNet. The introduced STN can transform the spatial information in MRI images of Alzheimer’s patients into another space and retain the key information. The introduced non-local attention mechanism can find the relationship between the lesion areas and normal areas in the feature space. This model can solve the problem of local information loss in traditional CNN and can extract the long-distance correlation in feature space. 
The proposed method was validated using the ADNI (Alzheimer's Disease Neuroimaging Initiative) experimental dataset and compared with several models. The experimental results show that the classification accuracy of the proposed algorithm reaches 97.1%, the macro precision 95.5%, the macro recall 95.3%, and the macro F1 value 95.4%. The proposed model is more effective than the other algorithms.

Alzheimer's disease (AD) is a common, irreversible, progressive neurological disease characterized by cognitive impairment, whereby the patient's memory and thinking ability are slowly damaged over time [1,2]. AD is characterized by an insidious onset, slow progression, and an irreversible course. Currently, there is no effective treatment that can reverse the damage caused by Alzheimer's disease [3,4]. At present, most patients with clinically diagnosed AD are in the middle or advanced stage, which means that the optimal time for treatment has already passed. Mild cognitive impairment (MCI) is an intermediate state between normal function and AD; it refers to a mild impairment of cognitive and memory functions rather than dementia [5,6]. According to statistics, the conversion rate of people with MCI to AD is significantly higher than that of healthy people [7]. Accurate diagnosis early in the course of the disease may allow patients to initiate preventive and intervention measures to slow or stop the progression of the disease before irreversible brain damage occurs. The main contributions of this study are as follows:

1. We propose a new ResNet model in which a new Mish activation function replaces the Relu function in the ResNet-50 backbone;
2. The STN is introduced between the input layer and the improved ResNet-50 backbone, which enhances the spatial invariance of the model;
3. A non-local attention mechanism is introduced between the fourth and fifth stages of the improved ResNet-50 backbone.

In this section, we review studies related to the early diagnosis of AD.
With the increasing attention given to MCI, more and more researchers have proposed new MCI prediction methods [32]. There are diagnostic methods based on clinical symptoms and cognitive function scales. Several AD screening scales commonly used in clinical practice include the clock-drawing test (CDT), the mini-mental state examination (MMSE), the Montreal cognitive assessment (MOCA), and the Alzheimer's disease assessment scale (ADAS-COG), among others [33-35]. Brodaty H. et al. [36] argued that the clock-drawing test is a very effective screening measure for detecting mild or moderate AD in the clinical population, with very low false-negative and false-positive rates. Pozueta A. et al. [37] proposed that a combination of the MMSE and CVLT-LDTR could distinguish PR-AD from S-MCI at baseline; the analysis of these two neuropsychological predictors is relatively short and may easily be accomplished in a non-specialist clinical setting. Zainal N. et al. [38] proposed that ADAS-COG, which is widely used in clinical trials, may be suitable for an Asian cohort and is useful for detecting MCI and mild AD. Roman F. et al. [39] proposed that the Argentina-type MBT and the MMSE were significantly correlated with memory measures and were effective tools for detecting MCI; the operating characteristics of the MBT are very suitable, more so than those of other commonly used tests for detecting MCI. Carlew A. et al. [40] proposed that detection by the MMSE is significantly affected by the disease course, while in the case of MOCA, severe MCI results in insignificant changes. Although statistically significant, the actual clinical significance of the changes in MOCA is unclear; the growing use of MOCA requires further research to understand what constitutes clinically significant change and whether it is appropriate for tracking cognitive trajectories.
There are also diagnostic methods based on biomarker detection. Cerebrospinal fluid (CSF) biomarkers Aβ, T-tau, and P-tau are well validated and are increasingly used in clinical practice as tools for the affirmative diagnosis of AD [41]. The long-term stability of core CSF biomarkers in patients with AD provides further support for their use in clinical studies and for treatment monitoring in clinical trials [42]. Michael E. et al. [43] proposed that CSF Aβ 1-42 showed the best diagnostic accuracy among the CSF biomarkers; at a sensitivity of 85%, the specificity in differentiating AD dementia from other diagnoses ranged from 42% to 77%. Geijselaers S. et al. [44] provided further evidence of the relationship between brain insulin signaling and AD pathology, which also highlights the need to consider sex and the APOE ε4 genotype during assessment. Gs A. et al. [45] proposed that blood, urine, saliva, and tears have yielded promising results, and several new molecules have been identified as potential brain biomarkers thanks to the development of new ultra-sensitive techniques; in their review, the authors discuss the advantages and limitations of classic CSF biomarkers for AD, as well as the latest prospects for new CSF candidate biomarkers and alternative substrates. Fossati S. et al. [46] found that plasma tau is higher in AD independently of CSF tau; importantly, adding plasma tau to CSF tau or P-tau improves the diagnostic accuracy, suggesting that plasma tau may represent a useful biomarker for AD, especially when added to CSF tau measures. Abe K. et al. [47] found that their serum biomarker set provides a new, rapid, non-invasive, highly quantitative, and low-cost clinical application for dementia screening, and also suggests an alternative pathway or mechanism by which AD causes neuroinflammation and neurovascular unit damage. Nabers A. et al.
[48] used immuno-infrared sensors to measure the secondary structure distribution of amyloid beta (Aβ) and tau in plasma and cerebrospinal fluid as structure-based biomarkers of AD. In the first diagnostic screening step, the structure-based Aβ blood biomarker supports AD recognition with a sensitivity of 90%. There are also neuroimage-based detection methods. Basheera S. et al. [49] used a CNN model with inception blocks to extract deep features from gray matter slices for the early prediction of AD. Ji H. et al. [50] mainly studied the early diagnosis of AD using convolutional neural networks; gray matter and white matter image slices of MRI were used as classification inputs, and after the convolution operation was combined with the output of the deep learning classifier, an ensemble learning method was adopted to improve classification. Tofail B. et al. [51] proposed constructing multiple deep two-dimensional convolutional neural networks (2D-CNNs) to learn various features from local brain images and combining these features for the final AD classification. Subramoniam M. et al. [52] proposed a method for predicting AD from MRI based on deep neural networks; state-of-the-art image classification networks such as VGG and the residual network (ResNet), with transfer learning, show promising results, and the performance of pretrained versions of these networks can be improved by transfer learning. A ResNet-based architecture with a large number of layers was found to give the best results in predicting the different stages of the disease. Hussain et al. [53] proposed a model based on a 12-layer CNN that uses brain MRI data for the binary classification and detection of AD. Among the abovementioned methods available for the early diagnosis of AD, those based on MRI have the advantages of being non-invasive and non-radioactive, and MRI has become an indispensable technical tool in clinical and scientific research on AD.
In this study, ResNet-50 is used as the backbone network because of its relatively simple structure and because the added identity mappings do not reduce network performance. The proposed method can extract more information from the layers by deepening the network structure through deep ResNet [54,55]. The data used in this study come from the ADNI (Alzheimer's Disease Neuroimaging Initiative) (http://adni.loni.usc.edu (accessed on 16 February 2020)) [56]. Generally, the dataset is divided into the following three categories: normal control (NC), mild cognitive impairment (MCI), and Alzheimer's disease (AD). MCI is a major step in the transition from the normal state to AD. We screened a total of 515 samples, comprising 55 AD samples, 255 NC samples, and 205 MCI samples. The proportion of men and women in each category was roughly equal. The MMSE relies on experienced doctors questioning patients to obtain scale scores; the score is an integer from 0 to 30, where a higher score indicates a healthier subject and a lower score indicates more severe dementia. For NC, the MMSE score is 24-30 and the ADAS-Cog score is <12; for MCI, the MMSE score is 23-30 and the ADAS-Cog score is 7-17; for AD, the MMSE score is 20-26 and the ADAS-Cog score is 12-29 [57]. Information on the collected data is shown in Table 1.

ResNet was proposed by four researchers, including Kaiming He, at Microsoft Research, and the proposal of deep ResNet was a milestone in the history of CNN-based image recognition. The residual module in the deep ResNet is shown in Figure 1 [58].
In the figure, x is weighted by the first layer, and F(x) + x is then obtained after the nonlinear transformation of the Relu function and the weighting of the second layer. This is a linear stack, and the two layers constitute a residual learning module; the network composed of residual modules is called ResNet. The difference between ResNet and an ordinary network is the introduction of skip connections, which help the information of the previous residual block flow into the next residual block without obstruction, avoiding the problems of vanishing gradients and degradation caused by too deep a network [58,59]. Since the Relu function often causes the permanent inactivation of neurons, and these inactivated neurons still occupy computational resources, the ability to extract image features needs to be improved. To make up for this deficiency of Relu, the new activation function Mish was selected to replace Relu in the model. The Mish activation function is expressed in Equation (1) [60], as follows:

f(x) = x · tanh(ln(1 + e^x)) (1)

The positive value of the Mish activation function can reach any height, avoiding saturation due to capping. Due to the smoothness of the Mish activation curve, information can penetrate deeper into the neural network, resulting in better accuracy and generalization, and Mish maintains accuracy better as the depth of the network increases. In the deep ResNet-50, the bottleneck residual module is a stack of a 1 × 1 convolution, a 3 × 3 convolution, and a 1 × 1 convolution; the two 1 × 1 convolutions decrease and then restore the dimensionality, respectively. The bottleneck residual module greatly improves computational efficiency and significantly increases the depth of the residual block. The introduction of more Mish activation functions can improve the representation ability of ResNet.
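As a concrete illustration, the Mish function of Equation (1) can be written in a few lines of NumPy. This is an illustrative sketch, not the paper's Keras implementation:

```python
import numpy as np

def mish(x):
    """Mish activation: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

# Unlike Relu, Mish is smooth and unbounded above, and it lets small
# negative values pass through instead of zeroing them out.
print(mish(np.array([-2.0, 0.0, 2.0])))
```

Because mish(x) ≈ x for large positive inputs but stays slightly negative (rather than exactly zero) for small negative inputs, gradients continue to flow where Relu would permanently deactivate a neuron.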
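The shortcut arithmetic of the bottleneck unit described above can be sketched as follows. To keep the example self-contained, the 1 × 1 convolutions are written as per-pixel channel mixes and the middle 3 × 3 convolution is simplified to another channel mix, so this is a toy illustration of the reduce-transform-expand-plus-shortcut pattern, not the actual ResNet-50 layer:

```python
import numpy as np

def mish(x):
    return x * np.tanh(np.log1p(np.exp(x)))

def conv1x1(x, w):
    """A 1x1 convolution on an (H, W, C_in) tensor is a per-pixel channel mix."""
    return x @ w  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)

def bottleneck(x, w_reduce, w_mid, w_expand):
    """Toy bottleneck residual unit: 1x1 reduce -> mix -> 1x1 expand, plus shortcut."""
    out = mish(conv1x1(x, w_reduce))   # dimensionality reduction
    out = mish(conv1x1(out, w_mid))    # stands in for the 3x3 convolution
    out = conv1x1(out, w_expand)       # dimensionality restoration
    return mish(out + x)               # identity shortcut: F(x) + x

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 64))
y = bottleneck(x,
               rng.standard_normal((64, 16)) * 0.1,
               rng.standard_normal((16, 16)) * 0.1,
               rng.standard_normal((16, 64)) * 0.1)
print(y.shape)  # the shortcut forces the output shape to match the input
```

The expand step must restore the channel count so that `out + x` is well defined; this is the shape constraint that the two 1 × 1 convolutions enforce in the real network.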
The bottleneck residual modules of the different layers of the ResNet-50 architecture are shown in Figure 2 [58-60].

STN can adaptively perform spatial transformations. In the case of large spatial differences in the input data, this network can be added to an existing convolutional network to improve classification accuracy. The STN consists of a localization network, a grid generator, and a sampler, as shown in Figure 3 [61]. Localization net: the localization net is a traditional CNN used to regress the transformation parameter θ. Grid generator: the grid generator generates, for each pixel of the output image, the corresponding coordinates in the input image. Sampler: the sampler takes the sampling grid and the input feature map as inputs and produces the transformed feature map. After the input picture passes through the STN module, the transformed picture is obtained and then input into the CNN network. The loss is calculated through the loss function, and the gradient is then computed to update the θ parameter; in this way, the STN module learns how to correct the picture [61,62].
The attention mechanism is a general mechanism for information acquisition, applied in scenarios where specific critical information must be obtained from a large number of sources while avoiding processing all the data. The non-local attention mechanism directly captures long-range dependencies by calculating the interaction between any two locations, rather than being limited to adjacent points. The non-local attention mechanism is shown in Figure 4 [63]. Three feature maps, A, B, and C, are obtained through three 1 × 1 convolutional layers. A and B are multiplied and passed through a softmax to obtain S. The product of S and C is then multiplied by a scale coefficient to obtain D, D is reshaped to the original shape, and X is added to obtain the final output E. The value at each position of E is thus a weighted sum of the original feature and the features at every position.
E_j can be written as Equation (2) [63,64], as follows:

E_j = γ Σ_{i=1}^{N} (s_{ji} C_i) + X_j (2)

where s_{ji} is the attention weight from position i to position j, γ is the scale coefficient, and N is the number of positions in the feature map.

In this paper, a new deep ResNet learning method that combines STN and the non-local attention mechanism is proposed. The model uses MRI slices from a large number of subjects to train the network and automatically learns image features, avoiding manual extraction; it then classifies the input images based on these features to obtain a diagnosis of the subject's state. In this study, the new activation function Mish was selected to replace Relu in the traditional ResNet-50 model. This method can solve the problem of local information loss in an ordinary CNN and can satisfactorily extract the long-distance correlations in the feature space.
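The A, B, C → S → D → E computation described above can be sketched with NumPy on a flattened feature map. Here the 1 × 1 convolutions are replaced by plain channel-mixing matrices, so this is an illustrative sketch under that simplification rather than the exact block used in the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(X, Wa, Wb, Wc, gamma):
    """Toy non-local (position attention) block on a flattened feature map.
    X: (N, C) with N = H*W positions; Wa/Wb/Wc stand in for the 1x1 convolutions."""
    A, B, C = X @ Wa, X @ Wb, X @ Wc   # the three projected feature maps
    S = softmax(A @ B.T, axis=-1)      # (N, N) attention between all position pairs
    D = gamma * (S @ C)                # scaled weighted sum over every position
    return D + X                       # residual connection: E = D + X

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))      # e.g. a 4x4 map with 32 channels, flattened
Wa, Wb, Wc = (rng.standard_normal((32, 32)) * 0.1 for _ in range(3))
E = non_local_block(X, Wa, Wb, Wc, gamma=0.5)
print(E.shape)
```

Note that the (N, N) matrix S is what connects lesion areas to arbitrarily distant normal areas: every output position is a mixture over all input positions, not just its convolutional neighborhood.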
The framework of the proposed method is shown in Figure 5 [58-65]. A localization network is used to regress the transformation parameter θ.
Its input is a feature map, and its output is the spatial transformation parameter θ, produced through a series of hidden network layers. If a 2D affine transformation is required, θ is a 6-dimensional (2 × 3) output vector; the size of θ depends on the type of transformation applied. A grid generator is used to build a sampling grid according to the predicted transformation parameters; its output is the set of points in the input image obtained after sampling and transformation. What the grid generator actually obtains is a mapping relation T_θ [62]. Assuming that the coordinates of each pixel of the input image are (x_i^s, y_i^s), that the coordinates of each pixel of the output image are (x_i^t, y_i^t), and that the spatial transformation function T_θ is a two-dimensional affine transformation, the correspondence between (x_i^s, y_i^s) and (x_i^t, y_i^t) can be written as Equation (3), as follows:

(x_i^s, y_i^s)^T = T_θ(G_i) = A_θ (x_i^t, y_i^t, 1)^T (3)

In Equation (3), S denotes a coordinate point of the input feature map, T denotes a coordinate point of the output feature map, and A_θ is the 2 × 3 affine matrix output by the localization network. The sampler in the STN uses the sampling grid and the input feature map as inputs to produce the output; it obtains the result after the feature map is transformed. Further, n and m traverse all coordinates of the original map U, and U_nm refers to the pixel value at a point of U. Then, (x_i^s, y_i^s) denotes the coordinates on U of the point corresponding to the ith point of V. K denotes a sampling kernel, usually bilinear interpolation. The following Equation (4) is obtained [61,62]:

V_i = Σ_n Σ_m U_nm · K(x_i^s − m) · K(y_i^s − n) (4)

Integrating the STN module between the input and the ResNet allows the network to automatically learn how to transform the feature map, thus helping to reduce the overall cost of network training.
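Equations (3) and (4) can be checked with a small NumPy sketch. The normalized [-1, 1] coordinate convention used here follows the original STN formulation and is an assumption about a detail the text does not spell out:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Equation (3): map each output pixel (x_t, y_t) to a source location
    (x_s, y_s) = A_theta @ [x_t, y_t, 1], with theta the 2x3 regressed matrix."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W) homogeneous
    return theta @ grid                                        # (2, H*W) source coords

def bilinear_sample(U, src, H, W):
    """Equation (4) with a bilinear kernel:
    V_i = sum_nm U_nm * max(0, 1-|x_s-m|) * max(0, 1-|y_s-n|)."""
    xs = (src[0] + 1) * (W - 1) / 2   # back to pixel coordinates
    ys = (src[1] + 1) * (H - 1) / 2
    V = np.zeros(H * W)
    for i, (x, y) in enumerate(zip(xs, ys)):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        for m in (x0, x0 + 1):
            for n in (y0, y0 + 1):
                if 0 <= m < W and 0 <= n < H:
                    V[i] += U[n, m] * max(0, 1 - abs(x - m)) * max(0, 1 - abs(y - n))
    return V.reshape(H, W)

U = np.arange(16, dtype=float).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
V = bilinear_sample(U, affine_grid(identity, 4, 4), 4, 4)
print(np.allclose(V, U))  # the identity transform reproduces the input
```

Because both the grid generation and the bilinear kernel are (sub)differentiable in θ, the loss gradient can flow back into the localization network, which is what lets the STN learn the correction end to end.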
We locate the output value in the network, indicating how to transform each item of training data [61-64]. The non-local attention mechanism is embedded as a component in ResNet-50, and new weights are learned during transfer learning so that the pretrained weights do not become unusable due to the introduction of the new modules [63,65]. The architecture of the proposed method is shown in Table 2 [58].

The experimental environment is a Linux system; the model is designed and implemented with the Keras framework and trained using the Adam optimization algorithm. The experimental steps are as follows:

(1) The experiment uses two-dimensional slices as training data, so the three-dimensional MRI coronal plane must be sliced. To ensure that the input image size of the classifier is consistent, the slices are unified to a size of 224 × 224. The CAT12 toolkit of the SPM12 software is used to preprocess the images; preprocessing includes format conversion, skull stripping, grayscale normalization, MRI slicing, and uniform sizing. The detailed preprocessing process is shown in Figure 6;
(2) The Keras experimental platform was built and the STN + ResNet + attention network model was designed;
(3) The K-fold (K = 5) cross-validation method was used to randomly divide the dataset, with 80% used as the training set and 20% used as the test set;
(4) The training set was input into the network for training and the training results were obtained;
(5) The optimal model parameters were saved and the model was tested using the test set data.

The flow chart with detailed steps is shown in Figure 7. We considered our results in the context of multi-class classification.
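The K = 5 random split of step (3) can be sketched as follows; this is an illustrative sketch, with the sample count of 515 taken from the dataset description above:

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k folds; each fold in turn serves
    as the ~20% test set while the remaining folds form the ~80% training set."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# e.g. the 515 screened ADNI subjects: 5 folds of 412 training / 103 test samples
for fold, (train, test) in enumerate(kfold_indices(515)):
    print(fold, len(train), len(test))
```

Every sample appears in the test set of exactly one fold, which is what allows the per-fold confusion matrices to be summed into a single total confusion matrix afterwards.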
The multi-class data were transformed into binary classification problems and a one-vs-rest strategy was adopted; that is, one category comprised the positive samples and the other categories comprised the negative samples [66-70]. TP_i: the prediction is category i and the reality is category i. TN_i: the prediction is a class other than i and the reality is a class other than i. FP_i: the prediction is category i but the reality is another class. FN_i: the prediction is another class but the reality is category i. Each category was taken in turn as the positive class to calculate the overall accuracy and the per-category precision and recall values. The accuracy can be expressed using Equation (5), as follows:

Accuracy = (Σ_{i=1}^{n} TP_i) / N (5)

where n is the number of categories and N is the total number of samples. The precision of a certain category can be understood as the accuracy of the predictions for that category, expressed as Equation (6), as follows:

Precision_i = TP_i / (TP_i + FP_i) (6)

The recall of a certain category can be understood as the extent to which the correctly predicted samples of category i cover the samples of category i in the sample set, expressed as Equation (7), as follows:

Recall_i = TP_i / (TP_i + FN_i) (7)

To investigate the merits and demerits of the classifiers across categories, a macro average is introduced. Macro-averaging is the arithmetic mean of each statistical index over all classes. The calculations are expressed in Equations (8)-(10), as follows:

Precision_macro = (1/n) Σ_{i=1}^{n} Precision_i (8)

Recall_macro = (1/n) Σ_{i=1}^{n} Recall_i (9)

F1_macro = 2 × Precision_macro × Recall_macro / (Precision_macro + Recall_macro) (10)

The total confusion matrix is obtained by adding the confusion matrices of the individual folds. For convenience in calculating Precision_macro, Recall_macro, and F1_macro, we use the average confusion matrix of the multiple classifications, obtained by dividing the total confusion matrix by five.
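Equations (5)-(10) can be computed directly from a confusion matrix. The matrix below is hypothetical, chosen only to illustrate the arithmetic; it does not reproduce the paper's results:

```python
import numpy as np

def macro_metrics(cm):
    """Macro-averaged metrics from a confusion matrix cm, where cm[i, j]
    counts samples of true class i predicted as class j (one-vs-rest counts)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                  # predicted as i, actually another class
    fn = cm.sum(axis=1) - tp                  # actually i, predicted as another class
    accuracy = tp.sum() / cm.sum()            # Equation (5)
    precision = tp / (tp + fp)                # Equation (6), per class
    recall = tp / (tp + fn)                   # Equation (7), per class
    p_macro, r_macro = precision.mean(), recall.mean()     # Equations (8), (9)
    f1_macro = 2 * p_macro * r_macro / (p_macro + r_macro)  # Equation (10)
    return accuracy, p_macro, r_macro, f1_macro

# Hypothetical 3-class (NC / MCI / AD) confusion matrix for illustration only
cm = np.array([[48, 2, 1],
               [3, 38, 2],
               [0, 1, 10]])
print(macro_metrics(cm))
```

Because every class contributes equally to the macro averages, a small class such as AD weighs as much as the larger NC class, which is exactly why macro averaging is preferred here over plain accuracy.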
The model proposed in this study was trained and tested on the selected ADNI dataset, and the test results are shown in Table 3 [70]. To illustrate the effectiveness of the proposed method, the ResNet-50 baseline, ResNet-50 + Mish, and STN + ResNet-50 + Mish models were selected for experiments on the same dataset. The Accuracy, Precision_macro, Recall_macro, and F1_macro of each model were calculated, as shown in Table 4, and these values were then compared and analyzed, as shown in Figure 8. The experimental results show that, compared with the other three methods, the method proposed in this article achieves a large improvement in classification accuracy, and the standard deviation of its experimental results is smaller. The experiments show that replacing the Relu function with the Mish activation function increases the accuracy by 1.9% over the baseline, and that after the introduction of the STN and the attention mechanism, the accuracy of the model increases by 5.8%.

In this research article, we propose a deep learning model based on ResNet-50 for the early diagnosis of Alzheimer's disease. In the model, a new Mish activation function is selected in the ResNet-50 backbone to replace the Relu function, the STN is introduced between the input layer and the improved ResNet-50 backbone, and a non-local attention mechanism is introduced between the fourth and fifth stages of the improved ResNet-50 backbone.
The Mish activation function is unbounded above (the positive value can reach any height), avoiding saturation due to capping. Theoretically, the slight bias toward negative values allows for better gradient flow in comparison with the hard zero boundary of Relu. Integrating the STN module into the ResNet-50 network allows the network to automatically learn how to transform the feature map, thus helping to reduce the overall cost of network training. The addition of the non-local block attention mechanism module provides a solid improvement. The proposed method was validated using the ADNI experimental dataset and compared with the ResNet-50 baseline, ResNet-50 + Mish, and STN + ResNet-50 + Mish models. The experimental results show that the proposed model is more effective and provides better robustness for clinical application.
The integration of the STN enhances the ability of the model to extract features by improving the spatial invariance of the network, demonstrating its good recognition effect. The introduction of the non-local block attention mechanism further enhances the robustness of the model. In the end, the experimental results obtained on the ADNI dataset show that the proposed model is more effective than the other compared algorithms.

References

- Biomarkers for Alzheimer's Disease (AD) and the Application of Precision Medicine
- Peripheral Biomarkers for Alzheimer's Disease: Update and Progress
- Blood tests to screen for Alzheimer's disease
- Alzheimer's disease
- Alzheimer's disease and other neurodegenerative dementias in comorbidity: A clinical and neuropathological overview
- Cognitive Training and Stress Detection in MCI Frail Older People Through Wearable Sensors and Machine Learning
- Early diagnosis of Alzheimer's disease with deep learning
- Brain Resources: How Semantic Cueing Works in Mild Cognitive Impairment due to Alzheimer's Disease (MCI-AD)
- A novel conversion prediction method of MCI to AD based on longitudinal dynamic morphological features using ADNI structural MRIs
- Biomarkers for the Early Detection and Progression of Alzheimer's Disease
- Biomarkers for Alzheimer's disease: Academic, industry and regulatory perspectives
- Increased prediction value of biomarker combinations for the conversion of mild cognitive impairment to Alzheimer's dementia
- Screening for mild cognitive impairment: A systematic review
- MMSE Subscale Scores as Useful Predictors of AD Conversion in Mild Cognitive Impairment
- The Mini-Cog versus the Mini-Mental State Examination and the Clock Drawing Test in daily clinical practice: Screening value in a German Memory Clinic
- Transmembrane Amyloid-Related Proteins in CSF as Potential Biomarkers for Alzheimer's Disease
- Impact of CSF storage volume on the analysis of Alzheimer's disease biomarkers on an automated platform
- Towards harmonizing subtyping methods for PET and MRI studies of Alzheimer's disease
- Association of short-term cognitive decline and MCI-to-AD dementia conversion with CSF, MRI, amyloid- and 18F-FDG-PET imaging
- Early identification of MCI converting to AD: A FDG PET study
- Convolutional Neural Network-based MR Image Analysis for Alzheimer's Disease Classification
- Effects of mind-body exercise on brain structure and function: A systematic review on MRI studies
- Exploiting Discriminative Regions of Brain Slices Based on 2D CNNs for Alzheimer's Disease Classification
- Survey on identification of Alzheimer disease using magnetic resonance imaging (MRI) images
- GGA: A modified Genetic Algorithm with Gradient-based Local Search for Solving Constrained Optimization Problems
- A Deep Siamese Convolution Neural Network for Multi-Class Classification of Alzheimer Disease
- Alzheimer's Diseases Detection by Using Deep Learning Algorithms: A Mini-Review
- Diagnosis of Alzheimer's Disease Severity with fMRI Images Using Robust Multitask Feature Extraction Method and Convolutional Neural Network (CNN)
- Representation and Classification of Auroral Images Based on Convolutional Neural Networks
- Malware Detection Algorithm Based on the Attention Mechanism and ResNet
- Spatial non-local attention for thoracic disease diagnosis and visualisation in weakly supervised learning
- Early diagnosis model of Alzheimer's disease based on sparse logistic regression with the generalized elastic net
- Clinical and biomarker changes of Alzheimer's disease in adults with Down syndrome: A cross-sectional study
- A Novel Computerized Cognitive Stress Test to Detect Mild Cognitive Impairment
- Early Cognitive Assessment Following Acute Stroke: Feasibility and Comparison between Mini-Mental State Examination and Montreal Cognitive Assessment
- The Clock Drawing Test for dementia of the Alzheimer's type: A comparison of three scoring methods in a memory disorders clinic
- Detection of early Alzheimer's disease in MCI patients by the combination of MMSE and an episodic memory test
- Psychometric Properties of Alzheimer's Disease Assessment Scale-Cognitive Subscale for Mild Cognitive Impairment and Mild Alzheimer's Disease Patients in an Asian Context
- Validation of the Argentine version of the Memory Binding Test (MBT) for Early Detection of Mild Cognitive Impairment
- A-02 Comparing Rate of Change in MoCA and MMSE Scores over Time in an MCI and AD sample
- State-of-the-Art Methods and Emerging Fluid Biomarkers in the Diagnostics of Dementia: A Short Review and Diagnostic Algorithm
- CSF Biomarkers for Early Diagnosis of Synucleinopathies: Focus on Idiopathic RBD
- CSF biomarkers for the differential diagnosis of Alzheimer's disease: A large-scale international multicenter study
- Association of Cerebrospinal Fluid (CSF) Insulin with Cognitive Performance and CSF Biomarkers of Alzheimer's Disease
- AD biomarker discovery in CSF and in alternative matrices
- Plasma tau complements CSF tau and P-tau in the diagnosis of Alzheimer's disease
- A New Serum Biomarker Set to Detect Mild Cognitive Impairment and Alzheimer's Disease by Peptidome Technology
- Aβ and tau structure-based biomarkers for a blood- and CSF-based two-step recruitment strategy to identify patients with dementia due to Alzheimer's disease
- A novel CNN based Alzheimer's disease classification using hybrid enhanced ICA segmented gray matter of MRI
- Early diagnosis of Alzheimer's disease using deep learning
- Binary Classification of Alzheimer's Disease Using sMRI Imaging Modality and Deep Learning
- Deep learning based prediction of Alzheimer's disease from magnetic resonance images
- Deep Learning Based Binary Classification for Alzheimer's Disease Detection Using Brain MRI Images
- An Improved ResNet Based on the Adjustable Shortcut Connections
- A transfer convolutional neural network for fault diagnosis based on ResNet-50
- A multi-model deep convolutional neural network for automatic hippocampus segmentation and classification in Alzheimer's disease
- Deep residual learning for image recognition
- Forward Stability of ResNet and Its Variants
- Mish: A self regularized non-monotonic neural activation function
- Inverse Compositional Spatial Transformer Networks
- Spatial transformer networks
- Non-local Neural Networks
- Attention Is All You Need
- An Automatic Modulation Recognition Method with Low Parameter Estimation Dependence Based on Spatial Transformer Networks
- Plant recognition using spatial transformer network
- Deeply-Learned Spatial Alignment for Person Re-Identification
- Assisted Diagnosis of Alzheimer's Disease Based on Deep Learning and Multimodal Feature Fusion
- The Diagnosis of Alzheimer's Disease Based on Enhanced Residual Neutral Network
- An Empirical Study of Spatial Attention Mechanisms in Deep Networks
- Discovering genomic patterns in SARS-CoV-2 variants