title: Establishing Objective Measures of Clinical Competence in Undergraduate Medical Education through Immersive Virtual Reality
authors: Zackoff, Matthew W.; Young, Daniel; Sahay, Rashmi D.; Fei, Lin; Real, Francis J.; Guiot, Amy; Lehmann, Corinne; Klein, Melissa
date: 2020-10-20
journal: Acad Pediatr
DOI: 10.1016/j.acap.2020.10.010

OBJECTIVE: The Association of American Medical Colleges defines recognition of the need for urgent or emergent escalation of care as a key Entrustable Professional Activity (EPA) for entering residency (EPA #10). This study pilots the use of an immersive virtual reality (VR) platform for defining objective observable behaviors as standards for evaluation of medical student recognition of impending respiratory failure. METHODS: A cross-sectional observational study was conducted from July 2018 to December 2019, evaluating student performance during a VR scenario of an infant in impending respiratory failure using the Oculus Rift™ VR platform. Video recordings were rated by two pairs of physician reviewers blinded to student identity. One pair provided a consensus global assessment of performance (not competent, borderline, or competent), while the other used a checklist of observable behaviors to rate performance. Binary discriminant analysis was used to identify the observable behaviors that predicted the global assessment rating. RESULTS: Twenty-six fourth-year medical students participated. Student performance of eight observable behaviors was found to be most predictive of a rating of competent, with a 91% probability. Correctly stating that the patient required an escalation of care had the largest contribution towards predicting a rating of competent, followed by commenting on the patient's increased heart rate, low oxygen saturation, and increased respiratory rate, and stating that the patient was in respiratory distress.
CONCLUSIONS: This study demonstrates that VR can be used to establish objective and observable performance standards for assessment of EPA attainment, a key step in moving towards competency-based medical education.

Competency-based medical education (CBME) is anchored in the concept that trainees master skills at different paces. Progression through the educational continuum should be dictated by demonstrating the proficiencies required for transition to the subsequent rank. 1 The Association of American Medical Colleges (AAMC) published Core Entrustable Professional Activities (EPAs) for entering residency, which describe specific skills and behaviors expected of all graduating medical students on their first day of residency. 2 However, before CBME can transition from promise to practice, objective measures to assess performance are required. 3 Simulation-based medical education (SBME) offers students a safe environment to perform skills and has demonstrated improved educational outcomes compared to traditional didactics. [4] [5] [6] While standardized patient encounters, a form of SBME, have become the gold standard for clinical skills assessment, their application to the array of clinical competencies remains limited. 7, 8 Specifically, Core EPA #10 for students entering residency requires students to demonstrate recognition of patients requiring urgent or emergent care. 2 In pediatrics, respiratory distress from bronchiolitis is the most common cause of hospitalization for infants, with nearly 14% of hospitalized patients progressing to respiratory failure. 9 Unfortunately, standardized patients are not an option for pediatric respiratory distress, and many available patient simulators cannot display several critical exam findings (e.g., mental status, work of breathing) needed to create realistic conditions for an accurate assessment of competency.
Immersive virtual reality (VR) simulation is a promising new approach to SBME, whereby students are taken to the patient's bedside within a virtual 3D environment. VR simulations promote deliberate practice 10 of skills through safe and realistic interactions with graphical character representations (avatars). VR has been used successfully for training in various contexts, such as performing procedures, 11 learning empathy, 12 addressing vaccine hesitancy, 13, 14 and performing a clinical assessment. [15] [16] [17] However, VR has yet to be leveraged for the establishment of competency standards or formal assessment of performance. To address this gap, our study aimed to establish competency standards related to student recognition of impending respiratory failure on an immersive VR platform, using the clinical scenario of an infant admitted with bronchiolitis.

A cross-sectional observational study was conducted at Cincinnati Children's Hospital Medical Center, in association with the University of Cincinnati College of Medicine, from July 2018 to December 2019. Fourth-year medical students were recruited via email and were provided a $20 gift card for their voluntary participation. Consent was obtained per our Institutional Review Board's approval. The development and content of the VR scenario, including a simulated inpatient environment with virtual patient and preceptor avatars, a vital signs monitor, and room décor, along with its functionality (visual and auditory cues, including the patient's breath sounds), have been previously described in Academic Pediatrics 15 (https://drive.google.com/file/d/1m-1j7hbxvIu-dK1jdgz9MRQYcubS6-IS/view?usp=sharing), with demonstrated effectiveness as a teaching tool.
16 To establish our competency standards, we focused on a case of impending respiratory failure during which the virtual infant displayed altered mental status, increased work of breathing, abnormal breath sounds, tachycardia, tachypnea, and hypoxia, consistent with a need for escalation of clinical care. 15, 16 Following orientation to the VR environment and functionality, the student was provided a prompt with the pertinent history and presumptive diagnosis of viral bronchiolitis. The student was asked to verbally report the physical exam findings and interpretation of vital signs to the avatar preceptor, provide an overall assessment of the patient's clinical status, and describe next steps for management. If the student did not independently state whether the patient required an escalation of care, the avatar preceptor asked, "Do you think the patient is stable for the floor?" Following completion of the session, students were provided feedback by a study author (MZ) on overall performance. The session lasted approximately 20 minutes and concluded with a demographic survey.

Standard Setting Approach

VR sessions were video recorded, de-identified, and stored on an internal password-protected drive to facilitate review. Our approach for establishing standards of competence was based on the borderline group method, a strategy previously utilized for standardized patient encounters. 18, 19 This methodology involves cross-referencing two assessment strategies to establish consistent criteria for "passing": 1) a categorized global assessment of performance (competent, borderline, or not competent) and 2) performance on an itemized observable behavior checklist.
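The core of the classic borderline group method can be sketched concisely: the checklist pass mark is set at the mean checklist score of the students who received a borderline global rating. The Python sketch below uses entirely hypothetical data, and note that this study ultimately cross-referenced the two assessments via binary discriminant analysis rather than a single cut score.

```python
# Sketch of the classic borderline group standard-setting method
# (hypothetical data; not the study's actual analysis code).
from statistics import mean

def borderline_group_cut_score(global_ratings, checklist_scores):
    """Pass mark = mean checklist score of students rated 'borderline'."""
    borderline = [s for r, s in zip(global_ratings, checklist_scores)
                  if r == "borderline"]
    if not borderline:
        raise ValueError("no borderline-rated students in sample")
    return mean(borderline)

# Hypothetical cohort: global rating and number of checklist items met.
ratings = ["competent", "borderline", "not competent", "competent", "borderline"]
scores = [12, 8, 4, 13, 10]
cut = borderline_group_cut_score(ratings, scores)  # mean of 8 and 10 -> 9
passed = [s >= cut for s in scores]
```

The intuition is that borderline students sit at the boundary between acceptable and unacceptable performance, so their average checklist score is a defensible pass/fail threshold.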
Two physicians with master's training in education and expertise in medical student education and evaluation through their roles as pediatric student clerkship directors (AG, CL) performed a blinded independent review of each student's video session and provided the global assessment of performance. There was no pre-determined description of the global assessment groups, and the behavior checklist was not provided, to minimize bias. However, both reviewers were prompted to consider the AAMC Core EPAs for entering residency when performing their assessment of the student. The reviewers met after completing independent review of batches of five recordings to discuss any discrepancies and reach an overall consensus score for each student. This ensured ongoing calibration in scoring and provided a consensus global assessment for standard setting. Generally, reviewers agreed on the consensus score, and discrepancies were rarely identified. The videos were also reviewed by a second group of physicians (FR, DY) using a structured observable behavior checklist (Figure 1), which had been developed for a previous VR study using a modified Delphi approach. 16 Two sample scenarios were graded independently by the two reviewers, followed by a debriefing session to compare scores and reach consensus to enhance reliability. A key grading perspective established during this debriefing was that students who required prompting to state that the patient required an escalation of care would still receive credit for a correct response, as they had accurately interpreted the clinical scenario. Reviewer 1's scores were used for standard setting, while Reviewer 2's scores were used to assess interrater reliability. Data were entered into a secure web-based application (Research Electronic Data Capture). 20 A key component of the borderline group approach is the use of an itemized observable behavior checklist.
However, because performing a clinical assessment is complex, the list of potential observable behaviors is extensive, ranging from reporting and interpreting individual findings through synthesizing information into an overall assessment. We utilized binary discriminant analysis to identify which observable behaviors best predicted the global assessment of performance, allowing the creation of a core set of observable behaviors that must be met to establish competency. Binary discriminant analysis identified which observable behaviors from the checklist (i.e., independent variables) discriminated between the global assessment ratings (i.e., dependent variable). T-scores were generated, with a higher t-score (>0) for an observed behavior signifying a higher probability that performance of that behavior predicted the assigned global assessment rating. A negative t-score signified that the behavior predicted a different global assessment rating, while a t-score of zero indicated that the behavior had no contribution in discriminating between global assessment ratings. In other words, when assessing students who received a global assessment rating of competent, a behavior with a t-score > 0 would be highly predictive of a rating of competent. Alternatively, a behavior with a t-score < 0 would signify that the behavior was more predictive of either a borderline or not-competent rating, while a t-score of zero would signify that the behavior was not predictive of any rating. A sensitivity analysis was also conducted by dropping the behaviors that had little or no contribution to predictability. 22 Analyses were performed using the 'binda' package in R. 21, 22 Reliability between reviewers for use of the observable behavior checklist was examined with intraclass correlation coefficients (ICCs) and, for categorical variables, with Kappa statistics, with analyses performed in SAS 9.4 (SAS Institute, Cary, NC).
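To make the t-score logic concrete, the following Python sketch approximates the idea behind the R 'binda' package (it is not the authors' actual code, and both the data and the exact standard-error formula are assumptions): for each rating class, a behavior's t-score compares how often that class performed the behavior against the pooled frequency, so positive values point toward the class, negative values away, and values near zero carry no discriminating signal.

```python
import math

def binda_style_tscores(X, y):
    """Per-class t-scores for binary predictors (behavior performed = 1).
    t > 0: performing the behavior points toward the class;
    t < 0: it points toward a different class; t ~ 0: no signal.
    X: list of 0/1 rows (students x behaviors); y: class labels."""
    n, p = len(X), len(X[0])
    pooled = [sum(row[j] for row in X) / n for j in range(p)]
    tscores = {}
    for c in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == c]
        ts = []
        for j in range(p):
            mu_c = sum(row[j] for row in rows) / len(rows)
            # Bernoulli-based standard error of the class-vs-pooled difference
            se = math.sqrt(pooled[j] * (1 - pooled[j]) / len(rows)) or 1.0
            ts.append((mu_c - pooled[j]) / se)
        tscores[c] = ts
    return tscores

# Hypothetical data for six students and two behaviors:
# behavior 0 = "stated escalation of care needed",
# behavior 1 = "noted increased heart rate".
X = [[1, 1], [1, 1], [1, 0], [0, 0], [0, 1], [0, 0]]
y = ["competent"] * 3 + ["borderline"] * 3
ts = binda_style_tscores(X, y)
```

In this toy example the escalation behavior, performed by every competent student and no borderline student, receives a strongly positive t-score for the competent class and an equally negative one for the borderline class, mirroring how the performance patterns separate in the study's Figure 2.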
Through our overall strategy for checklist generation, reviewer selection, and establishing reviewer reliability, we strove to establish content and internal structure validity of our assessment approach. Twenty-six students elected to participate. Most participants were between 25 and 29 years old (N=23, 88%), and the sample skewed female (N=16, 62%). Students self-identified as Caucasian (77%), Asian (11%), mixed (8%), or Hispanic (4%). For the global assessment of performance, 14 students were rated as competent, 9 as borderline, and 3 as not competent. None of the 'reporter' findings on the checklist were predictive of performance. The binary discriminant analysis examining the eight observable behaviors representing the 'interpreter' findings is presented in Figure 2. Correctly stating that the patient required an escalation of care (highest t-score) had the largest contribution towards predicting that the student would be rated as competent. In addition, correct interpretation of vital signs (i.e., increased heart rate, low oxygen saturation, and increased respiratory rate) and stating that the patient was in respiratory distress also predicted a rating of competent (the competent performance pattern). T-scores < 0 for these five factors differentiated students into either the borderline or not-competent categories. The addition of a positive t-score for recognition of increased work of breathing predicted a rating of not competent (the not-competent performance pattern), while a positive t-score for recognizing altered mental status predicted a rating of borderline (the borderline performance pattern). Abnormal aeration, with a t-score of zero (the vertical midline in each performance pattern), had no contribution in predicting the global assessment. The degree to which student performance of these eight observable behaviors predicted the global assessment of performance ratings is seen in Table 1.
The predicted probability that a student would be assigned to the not-competent category based on exhibiting the not-competent performance pattern of observable behaviors was 74%. For the borderline and competent performance patterns, the predicted probabilities were 69% and 91%, respectively. A further sensitivity analysis, excluding the observable behaviors of recognizing increased work of breathing, altered mental status, and abnormal aeration (the findings with the smallest t-scores), yielded similar results (Appendix II and III). Good reliability was demonstrated for the complete checklist of observable behaviors, with an ICC of 0.71. When examining agreement between the two raters for each of the eight behaviors identified through binary discriminant analysis that predicted global performance, reliability ranged from very good agreement for recognition of respiratory distress and altered mental status (kappa = 1) to moderate agreement for increased heart rate (kappa = 0.66) (Appendix I). This study demonstrated the novel use of immersive VR to identify objective standards for performance assessment, moving towards the goals set forth by the AAMC Core EPAs for Entering Residency. 2 Our standard setting approach defined observable behaviors that demonstrate a high correlation with global performance ratings. Evaluators can leverage these key observable behaviors to form the basis of an objective assessment metric that may predict, and potentially replace, subjective global assessments of competency related to assessment of respiratory distress. VR may represent a modality that can begin to close the competency assessment gap by providing a realistic environment with sufficient fidelity to prompt learners to display behaviors they would perform in a true clinical encounter.
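The per-behavior kappa values reported above are standard Cohen's kappa, which corrects two raters' observed agreement for the agreement expected by chance. A minimal Python sketch with hypothetical ratings (the study's reliability analyses were run in SAS 9.4):

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from each
    rater's marginal category frequencies."""
    n = len(rater1)
    assert n == len(rater2) and n > 0
    cats = set(rater1) | set(rater2)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in cats)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical: whether each of 8 students performed one checklist
# behavior (1/0), as scored independently by two reviewers.
r1 = [1, 1, 0, 1, 0, 1, 1, 0]
r2 = [1, 1, 0, 0, 0, 1, 1, 1]
k = cohens_kappa(r1, r2)  # 6/8 observed agreement, chance-corrected to ~0.47
```

Identical rating vectors yield kappa = 1, matching the perfect agreement reported for recognition of respiratory distress and altered mental status.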
Our use of binary discriminant analysis allowed identification of the key observable behaviors that predict performance, as opposed to a priori weighting of factors, which may introduce investigator bias into assessment tool development. Our approach may serve as a strategy for medical educators to define objective measures of performance that corroborate subjective global assessments. Binary discriminant analysis can be applied in other training or assessment scenarios to identify patterns of performance that relate to a defined category of an outcome. Such experiences could involve observed patient encounters, standardized patients, or even mannequin simulation. Our study has several limitations. First, it was performed at a single site with 26 participants who were mainly rated as competent, limiting generalizability due to potential selection bias and limiting applicability to less competent students. Specifically, our sample did not allow the establishment of meaningful discriminators between the borderline and not-competent groups, limiting the use of our current findings for establishing passing standards. Second, while we have identified which objective behaviors predict the global assessment rating for this cohort of students, we have no evidence that global assessment ratings correlate with actual clinical performance. We have elucidated the objective findings that informed the global ratings at our institution, but these may or may not be consistent across training programs. Replication of this study across multiple institutions, generating a robust collection of student performance data and global assessment ratings, could help establish comprehensive observable behaviors that define competence for these clinical skills across programs. Third, our clinical scenario was limited to an infant with bronchiolitis.
While this limits our ability to generalize to all respiratory distress, the key characteristics that define impending respiratory failure and the need for an escalation of care are consistent across underlying etiologies (e.g., pneumonia, asthma, or sepsis) and patient ages. Finally, VR is a resource that may not be available at all institutions. However, VR is becoming more affordable than modern computerized manikins and standardized patients, with potentially greater opportunity for realism in illness scenarios. Additionally, VR content is easily and rapidly disseminated and can be used remotely by learners, a functionality for which we now have greater appreciation secondary to the COVID-19 pandemic. Despite these limitations, we believe this study serves as an important early step in demonstrating the potential for VR technology to establish objective and observable competency standards for a medical student EPA. VR can overcome limitations of traditional SBME and was demonstrated through this study to be a practical method for implementing an objective assessment. Expansion of this approach may represent an effective strategy to enhance capacity for objective performance assessments, a vital step in our pursuit of CBME and in ensuring students can be entrusted to provide safe and reliable care for patients upon entering residency.

Figure 2. Degree of predictability of observable behaviors for the global assessment of student performance. The distance away from the midline (t-score) for each observed behavior corresponds to the degree to which that behavior predicts the global assessment of performance. Observable behaviors at or near the midline had minimal to no contribution to the prediction of the global assessment of competence.

Table 1.
Predicted probabilities (95% confidence intervals) for receiving a global assessment rating (not competent, borderline, or competent), based on the scores computed for students' performance on each of the eight observable behaviors from the checklist, using binary discriminant analysis.

References
1. Toward a shared language for competency-based medical education.
2. Toward Defining the Foundation of the MD Degree: Core Entrustable Professional Activities for Entering Residency.
3. Toward Competency-Based Medical Education.
4. Simulation-based medical education in clinical skills laboratory.
5. Technology-enhanced simulation for health professions education: a systematic review and meta-analysis.
6. Does simulation-based medical education with deliberate practice yield better results than traditional clinical education? A meta-analytic comparative review of the evidence.
7. Federation of State Medical Boards of the United States, Inc., and the National Board of Medical Examiners. United States Medical Licensing Exam: Step 2 Clinical Skills (CS) - Content Description and General Information.
8. Curriculum Reports: SP/OSCE Final Examinations at US Medical Schools.
9. Viral bronchiolitis. Lancet.
10. Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains.
11. Interactive collaboration for virtual reality systems related to medical education and training.
12. Understanding Empathy Training with Virtual Patients.
13. A Virtual Reality Curriculum for Pediatric Residents Decreases Rates of Influenza Vaccine Refusal.
14. Resident perspectives on communication training that utilizes immersive virtual reality. Educ Health (Abingdon).
15. Medical Student Perspectives on the use of Immersive Virtual Reality for Clinical Assessment Training.
16. Impact of an Immersive Virtual Reality Curriculum on Medical Students' Clinical Assessment of Infants With Respiratory Distress. Pediatr Crit Care Med.
17. The Future of Onboarding: Implementation of Immersive Virtual Reality for Nursing Clinical Assessment Training. J Nurses Prof Dev.
18. A comparison of empirically- and rationally-defined standards for clinical skills checklists.
19. A comparison of standard-setting procedures for an OSCE in undergraduate medical education.
20. Research electronic data capture (REDCap): a metadata-driven methodology and workflow process for providing translational research informatics support.
21. The analysis of multivariate binary data.
22. Differential protein expression and peak selection in mass spectrometry data by binary discriminant analysis.

Acknowledgments
The authors would like to thank the medical students from the University of Cincinnati College of Medicine for their participation in this study. This study was supported in part by funding through the Council on Medical Student Education in Pediatrics (COMSEP). Funders played no role in the design and conduct of this study; the collection, management, analysis, or interpretation of the data; or the preparation, review, or approval of this article.