key: cord-1049233-ff34v95s
authors: Neary, Martha; Bunyi, John; Palomares, Kristina; Mohr, David C.; Powell, Adam; Ruzek, Josef; Williams, Leanne M.; Wykes, Til; Schueller, Stephen M.
title: A process for reviewing mental health apps: Using the One Mind PsyberGuide Credibility Rating System
date: 2021-10-29
journal: Digit Health
DOI: 10.1177/20552076211053690
sha: 4a31b97bce5edff600f484c6c44d45285709dd0f
doc_id: 1049233
cord_uid: ff34v95s

OBJECTIVE: Given the increasing number of publicly available mental health apps, we need independent advice to guide adoption. This paper discusses the challenges and opportunities of current mental health app rating systems and describes the refinement process of one prominent system, the One Mind PsyberGuide Credibility Rating Scale (PGCRS).

METHODS: PGCRS Version 1 was developed in 2013 and deployed for 7 years, during which time a number of limitations were identified. Version 2 was created through multiple stages, including a review of evaluation guidelines and consumer research, input from scientific experts, testing, and evaluation of face validity. We then re-reviewed 161 mental health apps using the updated rating scale, investigated the reliability and discrepancy of initial scores, and updated ratings on the One Mind PsyberGuide public app guide.

RESULTS: Reliabilities across the scale's 9 items ranged from −0.10 to 1.00, demonstrating that some characteristics of apps are more difficult to rate consistently. The average overall score of the 161 reviewed mental health apps was 2.51/5.00 (range 0.33–5.00). Ratings were not strongly correlated with app store star ratings, suggesting that credibility scores provide different information from that contained in star ratings.

CONCLUSION: The PGCRS summarizes and weights available information in 4 domains: intervention specificity, consumer ratings, research, and development. Final scores are created through an iterative process of initial rating and consensus review. The process of updating this rating scale and integrating it into a procedure for evaluating apps demonstrates one method for determining app quality.

The increasing availability of technologies, such as smartphones, affords opportunities to increase access to mental health care. These technologies are more crucial than ever in the era of COVID-19, when mental health concerns are increased and additional, unique barriers to care exist (such as physical distancing measures which limit contact with providers). 1 An estimated 325,000 mobile health apps are available in the app marketplace, 2 with at least 10,000 of those for mental health. 3 Consumer interest in mental health apps (MH apps) is growing; 64% of teens and young adults report using health apps, with many of those apps being related to mental health, including sleep, meditation, stress, and substance use. 4 Despite reported interest, those wanting to use MH apps often have little help in selecting potentially effective products. The app stores only provide star ratings, and these user reviews correlate poorly with clinical utility. 5 People who contribute app ratings are a self-selected sample, likely to represent technologically savvy users 5 or those with a particularly negative or positive experience to share. 6 App developers can also leave ratings for their own apps or pay others to do so, and there is no way to distinguish genuine consumer ratings from those that are fraudulent. 7
While thousands of MH apps are available, it is well documented that few have been reviewed, researched, or vetted in any systematic way. 6,8-10 In 2017, Firth and colleagues 11 conducted a systematic search of seven databases and identified only 18 randomized controlled trials (RCTs) examining the effects of mental health interventions for depression delivered via smartphones. A similar study in the same year identified only nine RCTs for anxiety apps. 12 While RCTs are the gold standard and would provide useful information for making choices, they are not available for every app, and the incentives for developers to complete trials are misaligned with the incentives for researchers. 13 Consumers now increasingly look to professionals and "trusted sources" for app recommendations, 14 which means that we need frameworks for rigorous evaluations. 15

Navigating the MH app marketplace: some potential solutions

In navigating the MH app marketplace, two common questions exist: "which apps are effective?" and "how does one distinguish a good app from a bad app?". Efforts to help people answer these questions can be broadly categorized into evaluation guidelines and app rating platforms. 6,16 Evaluation guidelines, for example the Mobile App Rating Scale (MARS), 17 the American Psychiatric Association's App Evaluation Model, 3 and Enlight, 18 aim to guide the consumer through a number of questions to decide whether or not to proceed with using an app. However, these frameworks do not provide clear metrics to guide app choices. Even with the help of evaluation guidelines, consumers (even clinician consumers) generally do not have the time or qualifications to thoroughly evaluate apps. 19 This is despite recent efforts to make these guidelines more streamlined or to provide additional materials to support their use. 20 These sorts of guidelines require careful consideration and evaluation of apps for security, credibility, and clinical efficacy, and so will be even more challenging for lay consumers, who generally want simpler information to make choices. 9,21 Third-party quality reviews might fill this gap by providing information on the quality of an app at the point of download (e.g. on the app stores). 21 In the absence of such a solution, independent rating platforms that produce scores for smartphone apps can help consumers and clinicians distinguish high-quality apps. These include the Organization for the Review of Care & Health Applications (ORCHA), MindTools.io, Credible Mind, and One Mind PsyberGuide, but these too have drawbacks. Recent work by Carlo and colleagues 22 demonstrated inconsistencies across different rating systems. They found low rating agreement for the most commonly downloaded MH and wellness apps reviewed by ORCHA, MindTools, and One Mind PsyberGuide. Ratings of credibility and evidence base demonstrated the most agreement, and ratings of user experience the least. Powell and colleagues 19 also found poor inter-rater reliability using the same measures, particularly for ratings of effectiveness. This "inherent methodological subjectivity" must be acknowledged, 22 and developers of rating systems need to define criteria clearly to ensure consistency. 19

The One Mind PsyberGuide Credibility Rating Scale

One Mind PsyberGuide (hereafter "PsyberGuide") is a nonprofit organization providing reviews of digital tools (including both apps and web-based programs) for mental health and wellness. All reviews are publicly available at https://onemindpsyberguide.org/.
In addition to narrative reviews by professionals, PsyberGuide reviews digital tools on three different metrics (shown in Figure 1), which map onto key considerations for service users 23,24 to help them make informed decisions. Although all three metrics might affect user adoption and engagement, in this paper we focus only on the PsyberGuide Credibility Rating Scale (PGCRS). The other measures have been described and evaluated elsewhere. 17,25 This paper describes the process of updating the PGCRS to better reflect the evidence and support backing MH apps.

The PGCRS is completed by a trained app reviewer for each tool. This rating is reviewed and discussed with a supervision team comprising two Master's-level staff members and one PhD-level clinical psychologist. Final scores are based on discussion with the supervision team, and the maximum number of points possible for any tool is five. The first version of the PGCRS (PGCRS 1.0) was created in 2013 and used for seven years, with some minor periodic updates. Informal feedback from consumers, developers, and researchers on key aspects of the original scale demonstrated that it did not capture all the information that would be useful. Version 2 of the PGCRS (PGCRS 2.0) was developed to respond to these issues. PGCRS 2.0 development followed a series of stages, outlined in Figure 2 and explained in more detail below.

We reviewed available app rating frameworks, for example Enlight, 18 the American Psychiatric Association App Evaluation Model, 3 and ORCHA. 26 We also reviewed consumer research to understand what additional consumer questions pertaining to issues of credibility were not addressed by PGCRS 1.0. 14,21,27,28 Finally, we reviewed anecdotal feedback we had received on PGCRS 1.0 over the course of its implementation. The preliminary PGCRS 2.0 was then developed based on this evidence. Three experienced raters, who had completed dozens of ratings using PGCRS 1.0, used PGCRS 2.0 to review 10 apps. This process produced further clarifications to the wording and criteria (for example, adding examples to the anchors for clarity of purpose; see Appendix for the full rating tool). PGCRS 2.0 was reviewed by the PsyberGuide Scientific Advisory Board, including all co-authors of this paper, to assess face validity. Based on their feedback, we added items on indirect research evidence, development processes, efficacy of other products by the same development team, and the average value of consumer ratings. Details of the item changes from PGCRS 1.0 to PGCRS 2.0 based on these three stages are presented in the Results.

Reviewers were three undergraduates and two graduates. They completed six weeks of training delivered by a team experienced in digital mental health (including two graduate-level trained app reviewers and one clinical psychologist). During training they rated 15 training apps; each rating involved downloading the assigned app, using it for at least two hours across more than one day, and then producing an initial rating. Training apps were completed in batches of five. After each batch, reviewers and their supervisors met to review the initial scores, answer questions, and discuss the experience. In these meetings a consensus (final) score was determined for each of the 15 apps by resolving discrepancies through discussion. To understand scoring differences between raters and the reliability of initial scores, discrepancies between the raters' initial scores and the app's consensus scores were examined.
For each of the 15 training apps, consensus scores were subtracted from initial individual rater scores for each subscale. Inter-rater reliability of the initial scores was determined by calculating Krippendorff's alpha using the R statistical computing environment and the script provided by Zapf and colleagues 29 (an illustrative computational sketch of these analyses is shown below). Inter-rater reliability of the final scores could not be calculated because these scores were produced through a consensus process. Once training was completed, reviewers used each of the remaining available apps from the PsyberGuide App Guide (N = 146) and completed the PGCRS 2.0 (one rating per tool). Final scores were determined using a consensus process. Including training apps, this resulted in 161 rated platforms in total, completed over a period of five and a half months. We compared the consensus scores between PGCRS 1.0 and 2.0 and investigated them in detail if the ratings changed by one point or greater (N = 36) or if they were in the top 10% (N = 17) or the bottom 10% (N = 16) of all ratings. This investigation was carried out by an experienced supervisor, who downloaded and used the tool, examined the ratings, and approved the final score. If this supervisor had questions or was unsure of the score change, an additional supervisor also reviewed the tool and discussed the score in order to reach a decision.

When all reviews were completed and approved, we examined the correlations between the app store star ratings and the PGCRS 2.0 scores for tools that had a smartphone app available in either the Apple App Store or Google Play Store (N = 147). The star ratings (range 1 to 5) were obtained from the iOS and Android app stores using AppTrace, an analysis service which programmatically queries both the iTunes and Google Play application programming interfaces (APIs). Mirroring the method used by Singh et al., 5 we queried the cumulative star rating from all previous versions of the app, instead of the summary rating for the current version only, which is presented in the app store. As noted by Singh et al., 5 the rating from all versions represents a more stable estimate of an app's perceived value. For multiplatform apps, we calculated a mean rating based on the iOS and Android star ratings. Because the PGCRS 2.0 accounts for consumer ratings, we ran two correlations: (1) app store star rating and total PGCRS 2.0 score, and (2) app store star rating and PGCRS 2.0 score minus the consumer rating item.

The main features assessed by the scale, and changes from PGCRS 1.0 to PGCRS 2.0 (made in Stages 1-3 of development), are shown in Table 1. The full rating tool and scoring are provided in the Appendix. To understand scoring differences between raters and reliability of initial scores, discrepancies between the raters' initial scores and the app's consensus scores were examined. Average discrepancies, standard deviation (SD), mean absolute error, and Krippendorff's alpha are presented in Table 2. Because only four reviewers completed the last batch of apps, Krippendorff's alphas are presented as ranges across the first 10 and last five apps. Of the 177 tools listed on the PsyberGuide App Guide, 15 (9%) were identified as no longer available (e.g. had been removed from the app store), leaving 161 tools currently available to the public. For ratings using PGCRS 2.0, the average overall score for the 161 tools was 2.51 (range 0.33-5.00; SD = 1.23), and compared to PGCRS 1.0, 42% (n = 67) of tools increased their score and 58% (n = 93) decreased.
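To make the analysis steps described above concrete, the following is a minimal sketch, not the authors' actual pipeline: it computes per-item discrepancies between initial and consensus scores, estimates inter-rater reliability of initial scores with Krippendorff's alpha, and correlates app store star ratings with credibility scores. All input values are hypothetical, and the sketch assumes the third-party Python package `krippendorff` (the original work used the R script by Zapf and colleagues) plus numpy and scipy.

```python
# Illustrative sketch only; all data below are hypothetical placeholders.
import numpy as np
import krippendorff                 # pip install krippendorff
from scipy.stats import spearmanr   # pip install scipy

# Hypothetical initial scores for one scale item:
# rows = raters, columns = the 15 training apps.
initial_scores = np.array([
    [2, 1, 0, 2, 1, 2, 0, 1, 2, 2, 1, 0, 2, 1, 2],
    [2, 1, 1, 2, 1, 2, 0, 1, 1, 2, 1, 0, 2, 1, 2],
    [1, 1, 0, 2, 0, 2, 0, 1, 2, 2, 1, 0, 2, 2, 2],
], dtype=float)

# Hypothetical consensus (final) scores for the same item and apps.
consensus_scores = np.array([2, 1, 0, 2, 1, 2, 0, 1, 2, 2, 1, 0, 2, 1, 2], dtype=float)

# Discrepancy = initial rating minus consensus, per rater and app.
discrepancies = initial_scores - consensus_scores
print("Mean discrepancy:", np.nanmean(discrepancies))
print("SD of discrepancy:", np.nanstd(discrepancies, ddof=1))
print("Mean absolute error:", np.nanmean(np.abs(discrepancies)))

# Inter-rater reliability of the initial scores for this item (ordinal data).
alpha = krippendorff.alpha(reliability_data=initial_scores,
                           level_of_measurement="ordinal")
print("Krippendorff's alpha:", round(alpha, 2))

# Spearman correlations between app store star ratings and credibility scores,
# once with and once without the consumer-rating item (hypothetical vectors).
star_ratings = np.array([4.6, 4.2, 3.9, 4.8, 4.1])
total_scores = np.array([3.3, 2.7, 1.8, 4.0, 2.4])
scores_minus_consumer_item = np.array([3.0, 2.3, 1.8, 3.3, 2.1])
print(spearmanr(star_ratings, total_scores))
print(spearmanr(star_ratings, scores_minus_consumer_item))
```

In practice such a script would loop over all nine scale items and read the reviewers' ratings from a shared spreadsheet or database rather than hard-coded arrays.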
The average score change was small (−0.04), although some tools did show large changes (e.g. 2.24). Score changes were most commonly attributable to the number and average score of consumer ratings and to ongoing maintenance and updates (date of last software update), which regularly fluctuate.

Correlating app store scores and credibility scores

Spearman's rho correlation coefficient was used to examine the relationship between app store star ratings and total PGCRS 2.0 final scores, and showed a small correlation, r_s(145) = 0.18, p = 0.024. We also examined the correlation between app store star ratings and PGCRS 2.0 scores minus the consumer rating item. There was no significant correlation between the two, r_s(145) = 0.08, p = 0.268.

[Table 1 (excerpt): Goals should not only be clear, but achievable; a tool which over-promises or makes lofty claims (for example, "become more successful" or "change your life") is unlikely to deliver on those goals. (2) Consumer ratings: (i) number of app store ratings; (ii) average value. In addition to the number of consumer reviews, which serves as a proxy for popularity, the average value of reviews can help distinguish apps which consumers rate highly or poorly.]

This paper reports on the development and face validity of PGCRS 2.0 and its application to all available digital tools for mental health and wellness listed on the PsyberGuide App Guide at onemindpsyberguide.org. During the review process, nearly a tenth of digital tools listed on PsyberGuide were identified as no longer available (and were moved to the "currently unavailable" section of the guide). This speaks to the rapidly changing marketplace in which consumers and clinicians search for suitable apps, supporting our view that app recommendations are vital, but also demonstrating the challenge of keeping these ratings current enough to enable consumer choices.

The final scale was reasonably reliable and provided a measure to assess the credibility of MH apps when completed by trained raters (i.e. undergraduate students) under supervision. It includes updated and additional items deemed important through a review of available frameworks, relevant literature, feedback from developers, and input from subject matter experts. We also chose specifically to embed factors identified in the literature as important to consumers, such as development processes and feasibility data, 21 direct research evidence and evidence-based content, and clinical input in development. 14,27,28 Our rating process included reviewers producing initial scores using the PGCRS and then creating final scores through a consensus process with discussion and supervision. Even after training, ratings were regularly discussed in team meetings to allow opportunities to calibrate scores, provide ongoing supervision, and produce final scores. Reviewing the discrepancies between the raters' initial scores and the app's final (consensus) scores (Table 2) allowed us to identify those items requiring further discussion. Using 0.667 as an acceptable level of reliability, 30 five items had good reliability (items 2a, 3a, 3c, 4b, 4d) and four items had relatively low reliability (items 1a, 3b, 4a, 4c).
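As a small illustration of the classification just described, the sketch below partitions scale items against the 0.667 acceptability threshold. The alpha values are placeholders chosen for illustration, not the published Table 2 figures.

```python
# Illustrative only: flag items whose inter-rater reliability falls below the
# commonly used 0.667 threshold. Alpha values below are placeholders.
ACCEPTABLE_ALPHA = 0.667

item_alphas = {
    "1a": 0.41, "2a": 0.95, "3a": 0.88, "3b": 0.52, "3c": 0.79,
    "4a": 0.10, "4b": 0.72, "4c": 0.35, "4d": 1.00,
}

good = sorted(item for item, a in item_alphas.items() if a >= ACCEPTABLE_ALPHA)
low = sorted(item for item, a in item_alphas.items() if a < ACCEPTABLE_ALPHA)
print("Acceptable reliability:", good)  # items that can be rated consistently
print("Low reliability:", low)          # candidates for clearer anchors or examples
```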
Discrepancies for low reliability items were likely due to sparse or hard-to-find information; for example, (1) information on stakeholder or consumer involvement is not always readily available (item 4a), and (2) it is not always easy to track new products from the same development team, with frequent changes to company names, app names, and websites (item 4b). Items that required discussion included whether a development team could be counted as clinical input (item 4c). High reliability items were those where objective information was available on app stores or in publicly available databases, for example the number of consumer ratings (item 2a), direct research evidence (item 3a), and date of last software update (item 4d). In future, these low-reliability items will be refined to include more concrete anchors or examples. As clinical input and consumer involvement are now emphasized in app development, the extent of such input should also become more transparent. It was unsurprising that credibility ratings showed poor correlations with app store star ratings, replicating previous research showing that star ratings are not predictive of app quality, 5,19 and confirming that additional ratings beyond app store information are needed to guide consumer and clinician choices.

Although many app evaluation models and rating systems have been developed over the last decade, few have incorporated the views or needs of patients and consumers. Consumers are ultimately the end users of all MH apps, and factors which influence consumer adoption and use of apps do not necessarily align with the views of expert or academic groups. Wykes and Schueller 21 propose that the information consumers need to make app choices falls into four domains, which were derived from experimental studies, systematic reviews, and reports of patient concerns: (1) privacy and data security, (2) development characteristics, (3) feasibility data, and (4) benefits (the first concern is addressed by the PsyberGuide Transparency Rating, while concerns 2-4 are addressed by PGCRS 2.0, which is the focus of the current paper). More work is needed to ensure that consumer perspectives are central to MH app choices and that we integrate both "bottom up" (consumer-informed) and "top down" (expert-driven) processes in evaluation.

Our efforts to re-rate all tools on the PsyberGuide App Guide demonstrate the importance of regular updating. Changes in scores reflect not only the updated scale, but also changes in the information that informs the scores, such as additional research. This demonstrates the need for app ratings to be nimble and to regularly assess whether new information will affect those scores. As more MH apps become available, the challenge of keeping reviews up to date will grow more arduous. We agree with calls for continuous, real-time evaluation of apps to guide evaluation efforts. 13,31 However, to date, there is no process through which third-party app evaluators can obtain real-world effectiveness data for multiple products, and in the absence of such an infrastructure, expert reviews which are regularly updated are likely the best current solution. Echoing Powell and colleagues, 9 we believe it is problematic to ask clinicians and patients to fend for themselves when evaluating apps. For app ratings to be truly informative and useful, they need to come from objective, unbiased, third-party reviewers, independent from commercial app development efforts.
The necessity for independent app evaluation systems has only been heightened by COVID-19, due to both the increased need for digital supports for mental health 1 and the further loosening of FDA regulation in order to expand the availability of digital health therapeutic devices for patient and consumer use. 32 The PGCRS 2.0 is a measure of app quality that differs from what is provided via the app stores and star ratings. However, it is worth noting some limitations of the current investigation and the scale. We have not carried out a measure validation study. No gold standard measure of app quality exists or is widely available; therefore, we cannot validate the accuracy of the scale in predicting app quality. Establishing the clinical validity of the PGCRS would require examining whether it correlates with clinical benefits, but as of now, no repository of such information exists. Scores resulting from PGCRS 1.0 ratings have been shared in various studies and contexts, 19,33 and by various organizations including the Anxiety and Depression Association of America and the International Obsessive Compulsive Disorder Foundation, suggesting acceptable face validity of the ratings. We have also responded to consumer feedback in the development of PGCRS 2.0. However, the only item that directly considers consumer input is item 2a (number of consumer ratings in the app store and the average star rating). Further consideration should be given to how to incorporate more informative consumer-driven approaches.

The PGCRS 2.0 presents one evaluative framework to quantify the quality of an MH app through the lens of credibility. This credibility metric includes considerations of research evidence (both direct and indirect), development processes, intervention specificity, and consumer ratings. It is meant to simplify and weight the information available on MH apps in a manner that can be used by consumers, both professionals and non-professionals, to guide decision-making. The PGCRS underlies one aspect of the rating system used at PsyberGuide, which has been an influential system in rating apps, demonstrated through its use by various organizations. The process of updating the PGCRS from Version 1.0 to 2.0 also illustrates some important considerations as the field of digital mental health has developed, including the incorporation of indirect evidence, given the growing evidence base in this field, and of development processes. Although no system is perfect, this description and analysis helps demonstrate some of the strengths and limitations of this metric, including the usefulness of embedding metrics into a consensus process. Better transparency around the different evaluative frameworks used in this space will hopefully help drive the field forward and improve access to information that can help all stakeholders make informed decisions.

Appendix (excerpt). Anchors for the ongoing maintenance item (date of last software update):
The application has been revised within the last 6 months.
The application has been revised within the last 12 months.
The application has not been revised or was revised more than 12 months ago.

Scoring instructions. For mobile applications: assign a score for each feature and add the feature scores to obtain the total score. No items need to be reverse coded. To normalize to a 5-point scale, divide the total score by 3. For web-based tools: omit items 2a and 4d, assign a score for each feature, and add the feature scores to obtain the total score. No items need to be reverse coded. To normalize to a 5-point scale, multiply the total score by 5/11.
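The following is a minimal sketch of the scoring instructions above, normalizing summed item scores to a 5-point scale (divide by 3 for mobile apps; omit items 2a and 4d and multiply by 5/11 for web-based tools). The item identifiers match those used in the text, but the example ratings and the assumption of integer item scores are illustrative only.

```python
# Illustrative sketch of the Appendix scoring rules; example ratings are hypothetical.
WEB_OMITTED_ITEMS = {"2a", "4d"}  # items not applicable to web-based tools

def credibility_score(item_scores: dict, is_web_tool: bool = False) -> float:
    """Normalize summed PGCRS 2.0 item scores to a 0-5 scale."""
    if is_web_tool:
        total = sum(v for k, v in item_scores.items() if k not in WEB_OMITTED_ITEMS)
        return total * 5 / 11   # raw maximum of 11 once items 2a and 4d are omitted
    total = sum(item_scores.values())
    return total / 3            # raw maximum of 15 for mobile applications

# Hypothetical ratings for one app across the nine scale items.
ratings = {"1a": 2, "2a": 2, "3a": 1, "3b": 1, "3c": 2,
           "4a": 1, "4b": 0, "4c": 1, "4d": 2}
print(round(credibility_score(ratings), 2))                    # scored as a mobile app
print(round(credibility_score(ratings, is_web_tool=True), 2))  # scored as a web-based tool
```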
References

Online mental health services in China during the COVID-19 outbreak
325,000 mobile health apps available in 2017 - Android now the leading mHealth platform. https://research2guidance.com/325000-mobile-health-apps
A hierarchical framework for evaluation and informed decision making regarding smartphone apps for clinical care
Digital Health Practices, Social Media Use, and Mental Well-Being Among Teens and Young Adults in the U.S.
Many mobile health apps target high-need, high-cost populations, but gaps remain
State of the field of mental health apps
A systematic review of quality assessment methods for smartphone health apps
App-based psychological interventions: friend or foe?
In search of a few good apps
Current research and trends in the use of smartphone applications for mood disorders
The efficacy of smartphone-based mental health interventions for depressive symptoms: a meta-analysis of randomized controlled trials
Can smartphone mental health interventions reduce symptoms of anxiety? A meta-analysis of randomized controlled trials
Continuous evaluation of evolving behavioral intervention technologies
Discovery of and interest in health apps among those with mental health needs: survey and focus group study
NIMH Opportunities and Challenges of Developing Information Technologies on Behavioral and Social Science Clinical Research
Quality assessment of self-directed software and mobile applications for the treatment of mental illness
Mobile App Rating Scale: a new tool for assessing the quality of health mobile apps
Enlight: a comprehensive quality and therapeutic potential evaluation tool for mobile and web-based eHealth interventions
Interrater reliability of mHealth app rating measures: analysis of top depression and smoking cessation apps
Why reviewing apps is not enough: transparency for trust (T4T) principles of responsible health app marketplaces
By the numbers: ratings and utilization of behavioral health mobile applications
Adoption of mobile apps for depression and anxiety: cross-sectional survey study on patient interest and barriers to engagement
The effects of improving sleep on mental health (OASIS): a randomised controlled trial with mediation analysis
Reviewing the data security and privacy policies of mobile apps for depression
Specific features of current and emerging mobile health apps: user views among people with and without mental health problems. mHealth
A qualitative study of user perceptions of mobile health apps
Measuring inter-rater reliability for nominal data - which coefficients and confidence intervals are appropriate
Content analysis: an introduction to its methodology
Beyond validation: getting health apps into clinical practice
Enforcement Policy for Digital Health Devices for Treating Psychiatric Disorders During the Coronavirus Disease 2019 (COVID-19) Public Health Emergency
The model of gamification principles for digital health interventions: evaluation of validity and potential utility