key: cord-1039310-j0eftl0n authors: Messan, K. S.; Pham, L. K. T.; Harris, T.; Kim, Y.; Morgan, V.; Kosa, P.; Bielekova, B. title: Intra-individual reproducibility as essential determinant of clinical utility of smartphone-based neurological disability tests date: 2021-06-04 journal: nan DOI: 10.1101/2021.06.01.21258169 sha: 15c6323b8e8afeea113937039366d9d8739d98b4 doc_id: 1039310 cord_uid: j0eftl0n Technological advances, lack of medical professionals, high cost of face-to face encounters and disasters such as COVID19 pandemic, fuel the telemedicine revolution. Numerous smartphone apps have been developed to measure neurological functions. However, their psychometric properties are seldom determined. Lacking such data, it is unclear which designs underlie eventual clinical utility of the smartphone tests. We have developed the smartphone Neurological Function Tests Suite (NeuFun-TS) and are systematically evaluating their psychometric properties against the gold-standard of complete neurological examination digitalized into NeurExTM App. This paper examines the fifth, and thus far the most complex NeuFun-TS test, the "Spiral tracing". We generated 40 features in the training cohort (22 healthy donors [HD] and 105 multiple sclerosis [MS] patients) and compared their intraclass correlation coefficient, fold-change between HD and MS and correlations with relevant clinical and imaging outcomes. We assembled the best features into machine-learning models and examined their performance in the independent validation cohort (56 MS patients). We show that by aggregating multiple neurological functions, complex tests such as spiral tracing are susceptible to intra-individual variations, decreasing their reproducibility and thus, clinical utility. Simple tests, reproducibly measuring single function(s) that can be aggregated to increase sensitivity are preferable in app design. Expert neurological examination is an art that is slowly but surely disappearing. Skilled neurologist can reliably identify deficit in neurological function(s) and localize it to the specific part of the central (CNS) or peripheral nervous system (PNS). Expert examiner can also differentiate deficit that lacks anatomical substrate, by examining identical neurological function in different ways, noting inconsistencies, and motivating a patient to provide adequate effort. Such neurological examination takes between 30-60 minutes to perform and years to master. Because of examiner-dependency, the quantitative aspect of neurological examination, especially when performed by different raters, is less precise. Traditional neurological disability scales non-algorithmically aggregate semi-quantitative ratings of different neurological functions, usually selected by an individual (e.g., in Expanded Disability Status Scale; EDSS (Kurtzke, 1983) ) or teams of experts (e.g., in Scripps Neurological Rating Scale; SNRS (Sipe et al., 1984) ), into a single number. This is suboptimal for two reasons: 1. The features of the neurological examination aggregated to the disability scale are not data-driven and therefore may not be optimal and 2. Lack of defined algorithm causes errors during translation of the examination into a scale. These drawbacks are eliminated by data-driven scales (such as Combinatorial Weight-Adjusted Disability Scale; CombiWISE (P. Kosa et al., 2016)) and digital tools that allow convenient documentation of neurological examination in its entirety with automated, algorithmically codified computation of relevant disability scales (such as NeurEx TM App (Peter Kosa et al., 2018) ). However, these solutions are useless when the lack of expert medical professionals, limited time for patient encounters or inability to examine patient in person due to pandemic, deprives patients of the benefit of this historically validated tool. Therefore, there is a strong movement to supplement neurological examination or, in some instances, to replace it, by patient-autonomous tests of neurological functions (both cognitive and physical) acquired via smartphones, tablets or web interphase (Balto, Kinnett-Hopkins, & Motl, 2016 ; Alexandra K. Bove et al., 2015; A. Creagh et al., 2020; Erasmus et al., 2001; Longstaff & Heath, 2006; Maillart et al., 2019; Midaglia et al., 2019; Pham et al., 2021; . While some of these Apps are already marketed to patients, they often lack careful assessment of their psychometric properties against the gold standard of neurological examination and imaging or electrophysiological measures of CNS (or PNS) injury. In fact, even simple assessment of test-retest reproducibility may be missing. Many of these Apps use tests adopted from standard neurological examination and modified to self-administered digital test. This is true for the Spiral tracing test, examined in this paper. Spiral tracing has been used in movement disorders to identify tremor and quantify its severity. Its digitalization offers automated identification and quantification of the tremor frequency and amplitude by Fourier transformation (Lin, Chen, Yang, & Chen, 2018) . Furthermore, digitalization of the shape(s) tracing allows other quantitative measurements of speed and precision of the tracing (by finger or stylus) which may reflect neurological (dys)functions. We have reviewed previous studies of digitalized spiral/object tracing (A. Creagh et al., 2020; Erasmus et al., 2001; Feys et al., 2007; Longstaff & Heath, 2006) to derive comprehensive set of digital features (40 total) and determined their psychometric properties (i.e., reproducibility, ability to differentiate MS patients from HD and correlation with relevant features of neurological examination, disability scales and CNS tissue destruction visible on brain MRI) in the training and independent validation cohorts of MS patients. Our data indicate that smartphone modification of spiral tracing has inferior clinical utility compared to simpler smartphone tests in NeuFun-TS, such as finger tapping (Tanigawa et al., 2017) , balloon popping (Alexandra K. and level test (Boukhvalova et al., 2019) , likely because of its complexity and resulting poor reproducibility when modified to smartphone self-administration. The NeurEx data were collected from participants enrolled in the Natural History protocol: Comprehensive Multimodal Analysis of Neuroimmunological Diseases in the Central Nervous System (Clinicaltrials.gov identifier NCT00794352). The study was approved by National Institute of Allergy and Infectious Diseases (NIAID) scientific review and by the National Institutes of Health (NIH) Institutional Review Board. All methods were performed in accordance with the relevant guidelines and regulations. All study participants gave informed consent. HD were recruited in two ways: 1) full participants in the Natural History protocol that underwent comprehensive neurological/imaging evaluation and 2) participants in a substudy of the Natural History protocol to obtain normative data for smartphone apps (without neurological/imaging evaluation). Two different groups are comprised within the MS datasets: a cohort that is tested in a clinic approximately every 6 months (non-granular testing sub-cohorts) and those that had the smartphone at home and did the test more than 5 times during a period of two years (granular testing sub-cohorts). Prior to all analyses, the MS datasets were separated into a 2/3 training and 1/3 test set weighted by one of the clinical features (average 9HPT; see Table 2 in Section 2.3). A summary of the demographic information is provided in Table 1 . center of the spiral). During the experiment, a raw sensor data was collected from the smartphone touchscreen as an x-and y-screen coordinates with a corresponding timestamp in milliseconds and an estimated pressure of the tap based on the surface area of the finger on the touchscreen. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. The complete neurological examination, lasting 30-60 minutes and performed by MS-trained clinician was transcribed into NeurEx TM App (Peter Kosa et al., 2018) . NeurEx TM computes traditional disability scales such as EDSS (Kurtzke, 1983) , SNRS (Sipe et al., 1984) and others. We also extracted relevant subsystem scores of those neurological functions that, based on domain expertise, contribute to the spiral tracing . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 (i.e., pyramidal and motor functions of hands, cerebellar functions, proprioception functions). Finally, we extracted semi-quantitative MRI data of CNS tissue destruction, focusing on brainstem, cerebellum, and medulla/upper cervical spinal cord. These are features previously validated as important in determining physical disability (Kelly et al., 2021; P. Kosa et al., 2015) . The details of MRI sequences and computation of selected MRI features has been previously published (Kelly et al., 2021; P. Kosa et al., 2015) . Thus, together we tested 22 disability features in the MS training cohort (Table 2 ). Though spiral data were obtained from 22 HD, we note that clinical features were generated for only 9 HD (see Material and Methods for details). Memedi et al., 2015) . Thus, we initially computed the velocity (v), radial and angular velocity (av) of the drawing spirals as follows: Where , , = 1 … are the horizontal and vertical coordinate of pixels on the screen respectively with N representing the total number of touch data points. , = 1 … is the timestamp converted to second. The radial velocity is computed as: . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 Where = √( 2 + 2 ). If we denote the four-quadrant inverse tangent ( . . = −1 ( )), then the angular velocity takes the following form: The sum, coefficient of variation, skewness, and kurtosis were computed for the velocity, radial velocity, and angular velocity respectively. To calculate the degree of resemblance between the reference spiral and cohort's drawing, we introduced features related to the Hausdorff distance, which quantify the extent to which each point in the reference spiral lies near the points in the cohort's drawing following procedures illustrated in (A. P. Creagh et al., 2020; Dubuisson & Jain, 1994; Huttenlocher, Klanderman, & Rucklidge, 1993; Jeong & Srinivasan, 2017) . Similar to Figure 3 in (Jeong & Srinivasan, 2017) , a detail example procedure to calculate the Hausdorff distance of the reference and cohort's drawing is presented in Figure 2 . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. Prior to calculating the Hausdorff distance, the x and y screen coordinate points of the reference spiral were interpolated to the length of the cohort's drawing's coordinates using cubic spline interpolation (Fritsch & Carlson, 1980; Hyman, 1983; McKinley & Levine, 1998) . Several Hausdorff distance related features were then calculated (e.g. maximum of Hausdorff distance, interquartile range of Hausdorff distance, etc.). All Hausdorff distance related features are provided in Table 3 . Coefficient of variation of angular velocity Kurtosis of angular velocity Approximate Entropy of angular velocity - Time taken to complete drawing Image entropy of shape drawn with respect to reference shape ------Two approaches were utilized to compute the error related features between the reference spiral and the cohort's drawings. The first error was computed using the trapezoid to integrate over the two spiral regions. This error was calculated by finding the intersection of the two-spiral region (i.e., the difference between the two areas in magnitude). For instance, we note the following trapezoidal formula from (Aghanavesi et al., 2017) : Where N is the total number of x or y screen coordinate points and − is the spacing between points. Suppose the reference spiral and the cohort's spiral are denoted by ( , ) and ℎ ( , ) respectively. Then the error based on the trapezoidal rule becomes . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; Where |. | is the absolute value of the difference between the two Area Under the Curve (AUC). We now proceed with the second error calculated using the following 2D Mean Square Error (MSE; (Asamoah, Ofori, Opoku, & Danso, 2018) ): Where M and N=2 are the number of data points and coordinates points respectively. Again, a spline interpolation was used on the cohort's data point to the M length of the reference data point prior to error calculation. Furthermore, to obtain the similarity between the reference spiral and cohort's drawing, the following 2D correlation coefficient from (Aljanabi, Hussain, & Lu, 2018) was utilized: Where AMN and BMN are the reference and cohort's spiral coordinate points with dimension M x 2 respectively. ̅ = ∑ + ∑ 2 and ̅ = ∑ + ∑ 2 are the reference and cohort's spiral mean. Note that the two spirals mean are not necessarily equal as the x and y coordinate points differ. A full list of features calculated are provided in Table 3 . When applicable, references of the calculated spiral derived features are provided in Table S1 in the supplemental material. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Statistical analyses were used to evaluate the validity and strength of features (i.e., clinical disability scales and spiral derived features) on assessing the upper extremity function in MS patients. The analyses were conducted using the R software (R Version 4.0.4; (Team, 2021) ). Recall that the MS datasets were separated into a 2/3 training and 1/3 test set weighted by the average 9HPT disability scale. A cutoff Benjamini-Hochberg (BH; (Benjamini & Hochberg, 1995) ) adjusted p-value < 0.001 was used to establish statistically significant differences for comparing the HD and MS cohorts. All features that were not statistically significant using the unpair Two-Samples Wilcoxon test (Wilcoxon, 1945) in the training set were removed from subsequent analysis. Moreover, average fold change (FC) between the HD and MS was computed at all difficulty levels and at both dominant and non-dominant hand. A FC > 2 was used as cutoff of significant difference between HD and MS. Test-Retest reliability of the spiral derived features were measured using the Intraclass Correlation Coefficient (Andrea Vianello, Luca Chittaro, Stefano Burigat, & Riccardo Budai) of features obtained from the granular testing HD and MS sub-cohorts. As stated by (Koo & Li, 2017; WEIR, 2005) , there are several versions of the ICC that can give different results when used on the same dataset. However, the authors pointed that the two-way mixed-effects model and the absolute agreement are more appropriate for test-retest reliability studies. Thus, an ICC with two-way mixed-effects model and the absolute agreement was used in this study. Following the recommendation of (Koo & Li, 2017 ) that stated that an ICC between 0.5 and 0.75 are consider moderate, a spearman correlation matrix between clinical disability scales (see Table 2 ) and spiral derived features (see Table 3 ) with ICC > 0.5 and FC > 2 were constructed. A BH adjusted pvalue > 0.05 was used to access statistical significance of the correlation test. To determine the existing relationship between the significant spiral derived features (i.e. HB adjusted p-value < 0.001, FC > 2, and ICC > 0.5 between HC and MS) and the statistically significant clinical disability scales, four different regression models (Elastic Net or ElasticNet, Support Vector Regression with Radial Basis Function Kernel or SVR Radial, Random Forest or RF, and Stochastic Gradient Boosting or GBM) were used where the clinical disability scale were the dependent variables while spiral derived features were the independent variables. For all regression models, the 'caret' . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint library (Kuhn, 2008) in the R software was used along with other libraries such as 'glmnet' (Friedman, Hastie, & Tibshirani, 2010) for ElasticNet model, 'randomForest' (Liaw & Wiener, 2002) for RF model, and 'xgboost' (Chen et al., 2019) for GBM model. Prior to the regression modeling, outliers in the spiral derived features were identified as feature values that lie outside of ± 2(q0.9 -q0.1) where qp is the p-quantile (Hyndman et al., 2020) . To reduce variability in the features, all variables were bi-symmetric log transformed using the transformation formula = ( ) 10 (1 + | / |) where y is the transformed function of the x variable, C has a default value of 1/ln(10), and sgn(x) is the mathematical Signum function (as presented in (Webber, 2012) ). Moreover, a linear regression model was used to assess the relationship between selected clinical disability scales and the sum of the Hausdorff distance (Feature F24 in Table 3 ). During the analysis, adherence to the normality assumptions of the residuals was tested using histograms and quantile plots. All models were evaluated using the Root Mean Square Error (RMSE; measured in seconds) and the coefficient of determination (R-square) of the prediction. Apart from the linear regression model, all model parameters (i.e. the penalty strength parameter λ and the penalties from both L1 and L2 regularization parameter α in ElasticNet; the cost value C and γ of the SVR Radial; the number of variables randomly sampled at each split in the RF model; the number of tree, the interaction depth, the minimum number of samples in tree terminal nodes, and the learning rate in GBM model) were tuned via grid search. 5fold cross-validation (CV) with 10 repetitions was used to assess the model suitability in the training cohort. The out-of-sample test performance was evaluated using the final model from the 5-fold CV based on the RMSE to predict clinical disability scales given the test datasets (i.e., independent validation cohort). To determine clinical disability scales and spiral derived features that are relevant for further analysis, statistical significance between HD and MS was calculated using unpair Two-Samples Wilcoxon test. With the exception of four features (NeurEx TM vision . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint score, brainstem atrophy, medulla/upper c-spine atrophy, and cerebellum atrophy), all clinical disability scales and MRI features were found to be statistically significant at p < 0.001 after adjusting for the p-value using BH approach (Table 2 ). In the dominant hand category, 28, 26, and 27 spiral derived features were statically significant (BH adjusted p-value < 0.001) at the difficulty level 1, 2, and 3, respectively. However, in the nondominant hand category, 28, 22, and 22 spiral derived features were statically significant at the difficulty level 1, 2, and 3, respectively. While spiral derived features that were statistically significant between HD and MS varies between dominant, nondominant hands and difficulty levels, 19 of these features were consistently statistically significant at all levels and both hands (see bold feature description in Table 3 ). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint To determine the relationship between the clinical disability scales and most impactful spiral derived features (Kurtosis of velocity, radial velocity, angular velocity, and the sum of Hausdorff distances), a spearman rho correlation matrix among the features was constructed (see Figure 6 for difficulty level 1 and 2, and Figure S9 for difficulty level 3). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint We observed a very high positive correlation between the sum of Hausdorff distances and the time taken to complete the drawing in both the HD and MS patients (Figures S10-S12; spearman rho > 0.97 for both dominant and non-dominant hand and at all difficulty levels). This is counter intuitive as we expected that increasing speed of spiral . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint drawing will negatively affect tracing accuracy. Instead, it appears that disability and/or person's confidence of his/her ability to trace the spiral affected both speed and accuracy of the tracing congruently. Four regression models (ElasticNet, SVR Radial, RF, GBM) were used to evaluate the relationship between clinical disability scales and our four most impactful features (Kurtosis of velocity, Kurtosis of Angular velocity, and Sum of the Hausdorff Distances). The models had the best performance predicting the clinical disability scales in the (small) HD cohort based on the mean RMSE and R-square across the 5-Fold CV with 10 repetitions (see Tables S2, S5, and S8 for difficulty level 1, 2 and 3 respectively in the supplemental material). Among the HD at the difficulty level 1, RF model performed the best by explaining at most 81 percent in clinical disability scales when the dependent variables were 9HPT Average or EDSS (Table S2) . When the dependent variables are CombiWISE or NeurEx, GBM had the best performance at the difficulty level 1 with R 2 value between 0.72 and 0.90 (Table S2 ). The results of the percent variance explained in model outcomes at the difficulty levels 2 and 3 in healthy donors were lower compare the difficulty level 1 but still range between 30 and 75 percent (Table S5 and S8 in the supplemental material). Model's performance in the (much larger) MS training cohort were much lower at all difficulty levels in comparison to results from the HD. The SVR Radial performed the best by explaining only 6 to 23 percent of variance in clinical disability scale depending on the hand used (dominant or non-dominant), difficulty levels, and clinical disability scale (see Tables S3, S6, and S9 for difficulty level 1, 2 and 3 respectively in the supplemental material). However, the out-of-sample test performance (i.e., the independent validation cohort) were much lower compare to the result from the 5-fold CV of the training cohort (see R 2 values between 0.0002 and 0.18 in Tables S4, S7, and S10 for difficulty level 1, 2 and 3 respectively in the supplemental material). Of these, models with 9HPT . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10. 1101 /2021 Average yield the best performance with an R-square between 0.10 and 0.18 but only for dominant hand (Tables S4, S7, and S10 for difficulty level 1, 2 and 3 respectively in the supplemental material). However, in the independent validation cohort the RSV models validated the worst. Thus, we conclude that cross-validation of the training cohort is overly optimistic and does not reliably predict an independent validation cohort performance. Given that the sum of Hausdorff distances had the highest correlation with the clinical disability scale at all difficulty levels ( Figure 6 and Figure S9 ), a linear regression models were constructed to measure the relationship between the disability scales and . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint the sum of Hausdorff distances alone. In general, all clinical disability scales were positively correlated with the sum of Hausdorff Distances (see Figure 7 for correlation with Average 9HPT and S13-S15 for correlations with EDSS, CombiWISE, and NeurEx respectively). The validation cohort performance of the sum of Hausdorff Distances alone (Figure 8 for 9HPT) was comparable to the more complex ML-based models (i.e., R 2 between 0.04 and 0.14 in Table S11 ). Overall, we observed better predictive accuracy (based on R-square) in dominant hand category than non-dominant hand category and for difficulty levels 1 and 2 compared to difficulty level 3 in the dominant hand (Figure 8) . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Test reproducibility, (measured as ICC, signal-to-noise ratio or concordance coefficient) is generally given lesser importance in the test design than test sensitivity. Presented analyses of the Spiral tracing support the notion that test reproducibility is essential determinant of its clinical utility. Achieving high reproducibility of digital tests should be at forefront of the medical app developers. Spiral tracing is complex test that includes many neurological functions: fine finger movements, negatively affected by motoric dysfunction, proprioceptive loss and cerebellar dysfunction, eye-hand coordination, affected by vision, oculomotor and cerebellar dysfunctions and cognition or anxiety associated with anticipated test difficulty. Additionally, the precision of the test is affected by time of test execution, even though we observed, counterintuitively, strong positive correlation between measures of test accuracy such as sum of Hausdorff distance and the time it took to perform the tracing, even in HD. This suggests that time was not the primary driver of the inaccuracy of tracing. Rather, combined neurological disability and/or lack of confidence in the ability to perform test both fast and accurately, negatively affected the test performance. This explains why Spiral tracing features that adjusted accuracy of tracing for the velocity performed worse than the most successful accuracy measure, the sum of Hausdorff distance. The comparison of cross-validation performance of the MS training cohort with the performance of the ML-based models in the true independent validation cohort demonstrated that training cohort cross-validation performance over-estimates performance of the test in subjects who did not contribute to the model development: when the performance of the strongest ML-based models (i.e., modeling 9HPT) are compared between cross-validation of the training cohort and the independent validation cohort, all four ML algorithms greatly over-estimated performance of the models in the independent validation cohort. In fact, the best feature of the Spiral tracing (the sum of Hausdorff distances) performed comparatively to the ML-based models in . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint the independent validation (Figure 8 ). This overestimation of the model performance from the training cohort data, even when training cohort results are based on crossvalidation, is the rule we observed uniformly in the past decade of our experience with independent validation of complex models. In fact, we used to not even show the training cohort results in our publications, as we consider them irrelevant. However, after realizing that vast majority of ML studies in biomedical literature do not use independent validation cohort and that most readers and reviewers consider crossvalidation of the training cohort equivalent to the independent validation, we now routinely publish training cohort data to demonstrate the level of overfit in comparison to truly independent validation cohort. We also point out that the cross-validation performance of the training cohort does not faithfully predict even the ranking of the models: e.g., the RSV models explained the highest proportion of variance in the training cohort cross-validation, but performed especially poorly in the independent validation cohort. In this regard, because COVID19 pandemic precluded us from recruiting independent validation cohort of HD, we consider cross-validation performance of the HD models unrealistically optimistic (especially in view of the small number of HD) and fully expect that those models represent overfit. Therefore, the HD models should not be considered promising without an independent validation. The poor performance of these ML-based models was, in our experience, expected based on poor ICCs and weak univariate correlations of individual Spiral tracing features with the gold standard of neurological disability measures and brain MRI markers of CNS injury in MS cohorts. In comparison, much simpler test such as rapidly tapping on the screen of the smartphone correlated much stronger with analogous disability measures (i.e., up to spearman Rho of 0.76) and were also much more intraindividually stable(Alexandra K. . Interestingly, both 9HPT and smartphone finger tapping differentiated MS from HD better for non-dominant hand, observation that was reproduced for 9HPT in multiple studies (A. K. Boukhvalova et al., 2018; P. Kosa et al., 2016; Tanigawa et al., 2017) . We interpreted this observation by functional repair: even though MS likely affects both hands equally, the daily use of . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint dominant hand promotes repair, both as remyelination and establishment of new synaptic circuits by remaining neurons. Therefore, digital tests of non-dominant hand, which has less rehabilitation/repair, are more sensitive to measure difference between MS patients and HD and to measure progression of disability in time. Surprisingly, nondominant performed much worse in Spiral tracing test, in both MS cohorts. We believe that this was due to higher intra-individual variance/greater noise, that has little to do with disability and more to do with test complexity. In fact, as test complexity increased to Level 3, the reproducibility and clinical relevance of the Spiral tracing features decreased quite dramatically. We recognize that poor reproducibility of the Spiral tracing observed in our study may be mitigated in situations where spiral tracing is performed on tablets and therefore, the spiral is much larger (Erasmus et al., 2001) . We developed NeuFun-TS for smartphones rather than tablets, due to larger World-wide prevalence and greater availability of different sensors in the former compared to the latter. Clearly, the test selection must consider the screen size difference for apps targeting different mobile devices. In conclusion, in self-administered digital measurements of neurological functions, the designers should strive to develop tests that are easy to perform and therefore highly reproducible, but still reflect a specific neurological (dys)function. These simpler tests will likely be (by design) less sensitive than tests that depend on multiple neurological functions, but the sensitivity can be restored by aggregating results from multiple simple tests, as is being done in NeuFun-TS. However, the total time necessary to complete all tests in NeuFun-TS will likely determine compliance with longitudinal testing. Therefore, as Spiral tracing does not add clinical value beyond existing tapping (Tanigawa et al., 2017) , balloon popping (Alexandra K. and level tests (Boukhvalova et al., 2019) , we plan to drop Spiral tracing from NeuFun-TS standard tests. Spiral tracing Fourier analysis to identify tremor, its frequency and severity may still be very useful in patients with movement disorders. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. BB conceived the study design. BB, LP, TH, YK, VM, and PK performed the app construction and clinical data collection. BB and KSM conceived the analytical methodologies. KSM performed the statistical analysis and ML modelling with their respective visualizations. All authors participated in writing and editing the manuscript. The datasets analyzed in this study along with the R Code have been made available at https://github.com/bielekovaLab/Bielekova-Lab-Code/tree/master/FormerLabMembers/Messan_Komi. Hannikainen, and Samuel Wachamo for helping with data collection in the clinic. Finally, we would like to thank our patients, their caregivers, and healthy volunteers for being partners in researchwithout them this work would be possible. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S4 . Figure S5 . Average fold change of HD and MS of the spiral derived features with respect to their ICC from the training set. ICC was calculated from the granular data of the healthy donors (HD). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 The numbers indicate the feature's labels as illustrated by the label in Table 2 . HD and MS denote Healthy Donors and Multiple Sclerosis respectively. Figure S6 . Average fold change of HD and MS of the spiral derived features with respect to their ICC from the test set. ICC was calculated from the granular data of the healthy donors (HD). The numbers illustrate the feature's labels as indicated by the label in Table 2 . HD and MS denote Healthy Donors and Multiple Sclerosis respectively. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S7 . Table 2 . HD and MS denote Healthy Donors and Multiple Sclerosis respectively. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S8 . Table 2 . HD and MS denote Healthy Donors and Multiple Sclerosis respectively. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S9 . Spearman rho correlation matrix between the statistically significant clinical disability and the top four most significant spiral derived features based on FC at the difficulty level 3. The number indicates the spearman correlation coefficient. Red is negative correlation while blue stand for positive correlation. The white color indicates correlation that are not statistically significant at BH adjusted p-value of 0.05. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S10 . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S11 . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint Figure S12 . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Sum of radial velocity -F6 Coefficient of variation of radial velocity -F7 Skewness of radial velocity -F8 Kurtosis of radial velocity -F9 Sum of angular velocity -F10 Coefficient of variation of angular velocity -. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. True Asymmetry in comparison with reference shape -F38 2D image Correlation between two images (Aljanabi et al., 2018; A. P. Creagh et al., 2020) F39 Image entropy of shape drawn (A. P. Creagh et al., 2020; Tsai, Lee, & Matsuyama, 2008) F40 Image entropy of shape drawn with respect to reference shape (A. P. Creagh et al., 2020) . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2021. ; https://doi.org/10.1101/2021.06.01.21258169 doi: medRxiv preprint A smartphone-based system to quantify dexterity in Parkinson's disease patients An Entropy-Histogram Approach for Image Similarity and Face Recognition Measuring the Performance of Image Contrast Enhancement Technique Accuracy and precision of smartphone applications and commercially available motion sensors in multiple sclerosis Spiral drawing time as a measure of bradykinesia Controlling the False Discovery Rate -a Practical and Powerful Approach to Multiple Testing Smartphone Level Test Measures Disability in Several Neurological Domains for Patients With Multiple Sclerosis. Front Neurol Identifying and Quantifying Neurological Disability via Smartphone Identifying and Quantifying Neurological Disability via Smartphone Evaluating more naturalistic outcome measures: A 1-year smartphone study in multiple sclerosis Xgboost: extreme gradient boosting Smartphone-based remote assessment of upper extremity function for multiple sclerosis using the FLOODLIGHT draw a shape test Smartphone-based remote assessment of upper extremity function for multiple sclerosis using the Draw a Shape Test A modified Hausdorff distance for object matching Measurement of ataxic symptoms with a graphic tablet: standard values in controls and validity in Multiple Sclerosis patients Digitised spirography as an evaluation tool for intention tremor in multiple sclerosis Regularization paths for generalized linear models via coordinate descent Monotone piecewise cubic interpolation Comparing images using the hausdorff distance Accurate monotonicity preserving cubic interpolation forecasting: Forecasting functions for time series and linear models. R package version 8.13 Fast selection of geologic models honoring CO2 plume monitoring data using Hausdorff distance and scaled connectivity analysis Machine learning-optimized Combinatorial MRI scale (COMRISv2) correlates highly with cognitive and physical disability scales in Multiple Sclerosis patients A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research NeurEx: digitalized neurological examination offers a novel high-resolution disability scale Development of a Sensitive Outcome for Economical Drug Screening for Progressive Multiple Sclerosis Treatment Novel composite MRI scale correlates highly with disability in multiple sclerosis patients Building predictive models in R using the caret package Rating neurologic impairment in multiple sclerosis: an expanded disability status scale (EDSS) Classification and regression by randomForest. R news A digital assessment system for evaluating kinetic tremor in essential tremor and Parkinson's disease Spiral drawing performance as an indicator of fine motor function in people with multiple sclerosis Multiple Sclerosis self-assessment digital solution: results of a comparative study versus standard tests Cubic spline interpolation. College of the Redwoods A method for measuring Parkinson's disease related temporal irregularity in spiral drawings Automatic Spiral Analysis for Objective Assessment of Motor Symptoms in Parkinson's Disease. Sensors (Basel) Adherence and Satisfaction of Smartphone-and Smartwatch-Based Remote Active Testing and Passive Monitoring in People With Multiple Sclerosis: Nonrandomized Interventional Feasibility Study Smartphone-based symbol-digit modalities test reliably captures brain damage in multiple sclerosis A neurologic rating scale (NRS) for use in multiple sclerosis Finger and foot tapping as alternative outcomes of upper and lower extremity function in multiple sclerosis R: A language and environment for statistical computing. R Foundation for Statistical Computing Information entropy measure for evaluation of image quality MotorBrain: A mobile app for the assessment of users' motor performance in neurology MotorBrain: A mobile app for the assessment of users' motor performance in neurology A bi-symmetric log transformation for wide-range data Quantifying test-retest reliability using the intraclass correlation coefficient and the sem Individual comparisons by ranking methods We would like to thank clinicians Alison Wichman and Mary Sandford, research nurse Tiffany Hauser, and patient care coordinator Michelle Woodland for their excellent patient care. We would like to thank former laboratory members Alex Boukhvalova, Ann . CC-BY-NC-ND 4.0 International license It is made available under a