key: cord-0882787-v1yety3y
title: Radiology Dictation Errors with COVID-19 Protective Equipment: Does Wearing a Surgical Mask Increase the Dictation Error Rate?
authors: Femi-Abodunde, Abiola; Olinger, Kristen; Burke, Lauren M. B.; Benefield, Thad; Lee, Ellie R.; McGinty, Katrina; Mervak, Benjamin M.
date: 2021-09-24
journal: J Digit Imaging
DOI: 10.1007/s10278-021-00502-w

Our aim was to determine the effect of wearing a surgical mask on the number and type of dictation errors in unedited radiology reports. IRB review was waived for this prospective matched-pairs study, in which no patient data was used. Model radiology reports (n = 40) simulated those typical of an academic medical center. Six randomized radiologists dictated the reports using speech-recognition software, both with and without a surgical mask. Dictations were compared to the model reports, and errors were classified according to type and severity. A statistical model showed that error rates for all types of errors were greater when masks were worn (unmasked: 21.7 ± 4.9 errors per 1000 words; masked: 27.1 ± 6.0 errors per 1000 words; adjusted p < 0.0001). A sensitivity analysis excluding a reader with a disproportionately large number of errors found a similar difference in error rates, although significance was attenuated (unmasked: 16.9 ± 1.9 errors per 1000 words; masked: 20.1 ± 2.2 errors per 1000 words; adjusted p = 0.054). We conclude that wearing a mask results in at least a near-significant increase in the rate of dictation errors in unedited radiology reports created with speech recognition, a difference that may be accentuated in some groups of radiologists. Additionally, we find that most errors are minor single incorrect words and are unlikely to result in a medically relevant misunderstanding.

The COVID-19 pandemic has presented many challenges to businesses across the world, including hospital systems, and has necessitated rapid changes to the daily practice of medicine. Public use of face masks has been one effective method of source control recommended by the US Centers for Disease Control and Prevention (CDC) [1]. Following this recommendation, most hospital systems have mandated the occupational use of masks to limit the spread of aerosols or droplets generated by activities like speaking [2-4]. In modern radiology practices, speech-recognition dictation software is widely used to generate radiology reports and assist with patient care. Despite significant advances in speech-recognition software over the last 30+ years, automated transcription of speech remains imperfect even in optimal situations, with varying reports on accuracy [5-10]. Prior studies have described the types of errors that speech recognition can introduce (wrong tenses, word substitutions, word omissions, nonsense or incomplete phrases, punctuation errors, incorrect measurements, laterality errors, and wrong dates, among others [7, 10]), and potentially confusing errors have been shown to occur in more than 20% of routine radiology reports dictated using speech-recognition software [10]. During the COVID-19 pandemic, a published study demonstrated a negative impact of personal protective equipment (PPE) on interpersonal healthcare communication in a clinical setting, including speech discrimination and understanding [11].
Experiments by Toscano and Nguyen et al. also illustrated the impact of mask-wearing on voice recognition at low and high frequencies [12, 13]. Their results suggested that wearing a mask has varying effects on speech recognition; Toscano further demonstrated that the degree of sound dampening depends on the talker, the level of background noise, and the type of mask. Radiologists may anecdotally feel that their dictation accuracy has been affected by PPE, but to our knowledge, the effect of masks on the accuracy of speech recognition and the rate of dictation errors has not been established. The purpose of this study is to determine the effect of surgical masks on the number and type of dictation errors in unedited radiology reports.

Overview

A matched-pairs study design was used, with no patient data included. A power analysis (detailed below) was conducted to plan the sample size, and a corresponding number of model radiology reports (n = 40) were created. Six participating radiologists used speech-recognition software to create dictations based on these model reports. The dictations (n = 480) were compared to the model reports, and errors were manually tallied and classified according to type and severity. A statistical model was used to compare error rates for masked versus unmasked dictations.

Before beginning the study, the Institutional Review Board (IRB) for our hospital system was consulted and determined that this project was exempt from a full review, as no patient data was included.

To determine the total number of dictations required of our study participants, we conducted a power analysis using G*Power v3.1.9.2 (Heinrich Heine Universität, Düsseldorf, Germany) [14]. Existing literature was used to estimate the mean number of errors per report expected when dictating without a mask (1.6 ± 1.1) [10], and the mean number of dictation errors with a mask was hypothesized to be 20% greater (1.9 ± 1.1). We found that with a matched-pairs study design, an upper bound of 211 dictations in each group would provide 80% power at alpha = 0.05.
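This sample-size calculation can also be reproduced programmatically. The sketch below uses statsmodels in place of G*Power; it assumes a two-sided test and the conservative independent-samples formula, which is what makes the per-group figure an "upper bound" for a matched-pairs design. The effect-size inputs are the values quoted above.

```python
# Minimal sketch of the power analysis described above, using statsmodels
# instead of G*Power. The independent-samples formula gives a conservative
# upper bound on the per-group sample size for a matched-pairs design.
from statsmodels.stats.power import TTestIndPower

mean_unmasked, mean_masked, sd = 1.6, 1.9, 1.1    # errors per report [10]
effect_size = (mean_masked - mean_unmasked) / sd  # Cohen's d, roughly 0.27

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(round(n_per_group))  # ~212, consistent with the ~211-dictation upper bound
```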
Six radiologists agreed to participate as readers in the project: five attending diagnostic radiologists (four female and one male), each with at least eight years of experience dictating, and one female diagnostic radiology resident in her fourth year of postgraduate training (PGY-4) with more than two years of experience dictating.

To meet the target number of dictations, a total of 40 model radiology reports were created by the radiology resident with oversight from one of the faculty radiologists, then validated by a second faculty radiologist to ensure that the reports approximated the structure and complexity commonly generated during a workday at our tertiary care center. No patient data was used. Because the five participating attending radiologists were within the division of abdominal imaging and could be expected to have dictation voice models highly tuned to terms and conditions found in abdominal imaging reports, model reports were limited to varieties that would be reported by an abdominal imaging division. Reports were evenly balanced, with ten each of computed radiography/radiofluoroscopy (CR/RF), ultrasound (US), computed tomography (CT), and magnetic resonance imaging (MR) reports.

Departmental structured templates served as a foundation for these reports, to which features were added, including dates and times; indications; fictitious comparisons; common, uncommon, and incidental imaging findings; biplanar/multiplanar measurements; and image/series numbers as commonly dictated at our institution. A variety of benign and malignant conditions were included. A summary of study indications and an example report are included in Fig. 1. The total number of words in each model report was counted so that error rates per 1000 words could be evaluated.

Each radiologist was instructed to dictate the contents of the 40 model reports word-for-word twice: once while wearing a mask and once without, for a total of 80 dictations per reader and 480 dictations overall. To control for bias from dictating the same reports twice, readers were randomized into two equal groups, with one group dictating masked first and then unmasked, and the other dictating unmasked first and then masked. Masks were provided to each radiologist by the radiology department as personal protective equipment and consisted of a standard disposable surgical mask attached to the face via elastic ear-loops; no N95 masks were used. When dictating with a mask, participants were instructed to wear the mask tight to the face, fully covering both the nose and mouth.

To standardize the process of dictation, reading radiologists were required to: dictate de novo all section headers, words, numbers, dates, and punctuation exactly as written in the model report; refrain from proofreading reports during or after dictation (except for an obvious manual error or accidental garbling of words due to something other than the mask itself); dictate at a natural pace, tone, and volume; and dictate all reports in the same physical location to minimize variation due to microphone, room noise, or other environmental factors. All reports were created using PowerScribe 360 v4.0-SP2 reporting software and a PowerMic III (Nuance Communications, Burlington, Massachusetts) and then copied directly from the reporting software into a separate text document. Radiologists used their own user accounts and associated personalized voice models, which had been attuned to their patterns of speech through daily use for more than two years in each case. To more closely simulate a real-world setting, the dictation wizard was not run at the beginning of each dictation session, as this is not commonly done on a day-to-day basis.

Using a comparison feature in Microsoft Word (Microsoft Corporation, Redmond, WA) to highlight differences, model reports were compared side-by-side with the radiologists' dictations, and dictation errors were manually tallied and categorized by one attending radiologist and the participating PGY-4 resident.
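To illustrate how such a word-level comparison can be mechanized, the following hypothetical Python sketch diffs a model report against a dictation and maps the differences onto the incorrect-, missing-, and added-word categories defined below. The study itself used Microsoft Word's comparison feature with manual tallying; the report strings here are invented examples.

```python
# Hypothetical word-level diff between a model report and an unedited
# dictation, mapping differences onto the study's single-word error
# categories. The study used Microsoft Word's compare feature; this is
# illustrative only, and both strings are invented.
import difflib

model = "No evidence of acute intra-abdominal process".split()
dictation = "Now evidence of acute process".split()

matcher = difflib.SequenceMatcher(None, model, dictation)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "replace":   # candidate incorrect-word error(s)
        print("incorrect word(s):", model[i1:i2], "->", dictation[j1:j2])
    elif tag == "delete":  # candidate missing-word error(s)
        print("missing word(s):  ", model[i1:i2])
    elif tag == "insert":  # candidate added-word error(s)
        print("added word(s):    ", dictation[j1:j2])

# Output: incorrect word(s): ['No'] -> ['Now']
#         missing word(s):   ['intra-abdominal']
```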
Categories of errors (outcome variables) included: incorrect words, missing words, additional words, missing or incorrect phrases (defined as 3+ sequential words), incorrect terms of negation (e.g., errors in "no," "not," or "without"), sidedness errors, incorrect image numbers, incorrect measurements, incorrect dates/times, and punctuation errors. Every error was counted, with no limit on the number of errors codified per report. Incorrect-, missing-, and additional-word errors were subclassified as minor, moderate, or major based on a subjective assessment of their potential to result in a clinically significant misunderstanding for the ordering provider or a future radiologist. Missing/incorrect phrases of 3+ words, errors in words of negation, sidedness errors, and incorrect measurements were all subclassified as major errors that could result in a clinically significant misunderstanding. Incorrect image numbers and incorrect dates/times were subclassified as moderate errors that might result in misunderstanding.

All 480 dictations were codified, including 240 in the masked group and 240 in the unmasked group. To validate data coding and address inherent subjectivity, a selection of these dictations (20%; 96/480) was separately coded by a second attending radiologist and compared to the initial coding; the discrepancy rate was 6.3% (6/96).

Error rates were modeled for each outcome as a function of the presence or absence of a mask, assuming a Poisson distribution with a log link; graphical evaluation showed no evidence of overdispersion. The number of words in each dictation report was included as an offset, and the model controlled for the nuisance parameter of randomization order. The model included mixed effects to control for radiologist-level correlation and correlation within a study document. Predicted error rates per 1000 words were computed for the mask vs. no-mask groups and compared using a t-test. P-values for these comparisons were adjusted using the false discovery rate method to control the Type 1 error rate [15].

Following an initial data review, an approximately fourfold difference was seen in the total number of errors generated by one participant (1346 total errors for this reader vs. a mean of 308 for the other participants). This participant was notable for being the only trainee as well as the only participant with accented speech. Using the model above, predicted error rates per 1000 words were computed and compared for this individual vs. the other five readers for the "all errors," "major errors," "moderate errors," and "minor errors" outcome variables. To reduce the potential for study outcomes to be biased toward the error patterns of this individual, a sensitivity analysis using the same model described above was performed excluding this trainee.

A separate subgroup analysis was conducted to evaluate whether modality was associated with the "all errors" outcome variable. We implemented the same model described above with the addition of a modality indicator variable. Predicted error counts were computed for each modality and compared using a t-test, with p-values again adjusted using the false discovery rate method to control the Type 1 error rate.

Results are described using model-based error rates per 1000 words with standard errors and associated adjusted p-values [15]. Adjusted p-values < 0.05 are considered statistically significant.
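For concreteness, the modeling approach described above can be sketched in a few lines of Python. This is a simplified approximation, not the authors' exact model: the mixed effects for radiologist- and document-level correlation are replaced with cluster-robust standard errors by reader, and all data below are synthetic stand-ins for the real dictations.

```python
# Simplified sketch of the error-rate model: Poisson regression with a log
# link and a log(word-count) offset, controlling for randomization order.
# The published model used mixed effects for reader/document correlation;
# cluster-robust errors by reader stand in for them here. Data are synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "reader": np.repeat(np.arange(6), 80),        # 6 readers x 80 dictations
    "masked": np.tile(np.repeat([0, 1], 40), 6),  # 40 unmasked + 40 masked each
    "order": np.repeat([0, 1], 240),              # randomization-order nuisance
    "words": rng.integers(150, 400, size=480),    # words per model report
})
# Simulate ~22 errors per 1000 words unmasked, ~25% higher when masked
df["errors"] = rng.poisson(0.022 * df["words"] * np.where(df["masked"] == 1, 1.25, 1.0))

fit = smf.glm(
    "errors ~ masked + order",
    data=df,
    family=sm.families.Poisson(),                 # Poisson with log link
    offset=np.log(df["words"]),                   # word count as exposure offset
).fit(cov_type="cluster", cov_kwds={"groups": df["reader"]})

rate_unmasked = 1000 * np.exp(fit.params["Intercept"])
rate_masked = rate_unmasked * np.exp(fit.params["masked"])
print(f"errors per 1000 words: unmasked {rate_unmasked:.1f}, masked {rate_masked:.1f}")

# Benjamini-Hochberg false-discovery-rate adjustment across outcomes [15]
raw_p = [0.0001, 0.004, 0.03, 0.20]  # hypothetical raw p-values, one per outcome
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
```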
When analyzing outcomes for all participants, the overall model-based error rate (per 1000 words) was 21.7 ± 4.9 in reports dictated without masks and 27.1 ± 6.0 with masks, a difference of 25% (adjusted p < 0.0001). Significant differences were also seen in the error rates for major errors (5.6 ± 1.6 unmasked vs. 7.3 ± 2.0 masked; adjusted p = 0.008), minor errors (11.9 ± 2.6 unmasked vs. 15.2 ± 3.2 masked; adjusted p = 0.0002), punctuation errors (0.4 ± 0.3 unmasked vs. 0.7 ± 0.6 masked; adjusted p < 0.0001), missing-word errors (3.5 ± 0.9 unmasked vs. 4.3 ± 1.1 masked; adjusted p = 0.049), and errors involving terms of negation (0.1 ± 0.05 unmasked vs. 0.2 ± 0.1 masked; adjusted p = 0.018). Significant differences were also seen in subsidiary outcomes, including incorrect-word errors of major severity (3.9 ± 1.0 unmasked vs. 5.0 ± 1.3 masked; adjusted p = 0.044) and missing-word errors of moderate severity (0.3 ± 0.2 unmasked vs. 0.6 ± 0.4 masked; adjusted p = 0.001).

A significant difference was seen in the error rate for the one trainee participant for the "all errors," "major errors," "moderate errors," and "minor errors" outcome variables (all p < 0.0001). Outcomes for the subgroup of five attending radiologists differed from outcomes for the group that included the radiologist in training. The overall model-based error rate (per 1000 words) for the attending-only subgroup was 16.9 ± 1.9 in reports dictated without masks and 20.1 ± 2.2 when wearing a mask, a difference of 19%; this difference was borderline significant (adjusted p = 0.054). Incorrect-word errors and the subsidiary outcome of incorrect-word errors of minor severity were also marginally significant (adjusted p = 0.054 and 0.066, respectively). Other types of dictation errors did not occur at a significantly different rate when wearing a mask versus dictating unmasked.

The most frequent types of errors were: incorrect-word errors, with a model-based error rate of 14.3 ± 2.7 per 1000 words when unmasked and 15.9 ± 2.9 when masked; missing-word errors, with a rate of 3.5 ± 0.9 per 1000 words when unmasked and 4.3 ± 1.1 when masked; and added-word errors, with a rate of 1.7 ± 0.4 per 1000 words when unmasked and 2.0 ± 0.4 when masked. Errors in numerals (i.e., measurements, image numbers, or dates) were less frequent, with a total model-based error rate of 1.1 ± 0.4 per 1000 words when unmasked and 1.3 ± 0.5 when masked. Model-based error rates per 1000 words for all outcome variables are listed in Table 1 for the entire group of participants and in Table 2 for the subgroup of attending radiologists.

An analysis of the effect of modality on the all-type error rate revealed that MR and CR/RF had significantly higher error rates than CT. When evaluating all participants, MR also had significantly more errors than US, although this difference was not significant in the attending radiologist subgroup. Other pairwise comparisons did not reach statistical significance for either group. Error rates per 1000 words by modality, pairwise comparisons of modalities, and adjusted p-values are listed for the primary group and the attending subgroup in Table 3.

In the COVID-19 era, the use of facial coverings in workplaces is necessary for reducing aerosolized particles and is typically mandated in hospital settings. However, masks add complexity to a radiologist's daily work practices, and it is important to better understand the effects of this physical barrier on the accuracy of speech recognition. Although the specific ways masks might affect dictation accuracy were not assessed in this study, we would hypothesize that a combination of factors is involved, such as the sound-dampening effects of masks on speech demonstrated in prior work [12, 13].

Interestingly, the conclusions drawn from this study differ slightly depending on whether data from one reader with significantly higher error rates are considered. The precise reason(s) for this individual's high error rates are also outside the scope of this study, although it is notable that this participant was the only resident in the study as well as the only individual with an accented pattern of speech.
We believe that any conclusions drawn with these data included would be skewed toward the error rates and error patterns of this individual, and that a subgroup analysis consisting of only the five attending radiologists is more broadly applicable to radiologists in the US. Data from this subgroup is therefore used as the basis for discussion.

In the subgroup of attending radiologists, wearing a mask increased the overall error rate by approximately 19%, a finding that neared statistical significance (adjusted p = 0.054). However, the majority of all errors were clinically inconsequential (minor errors), most commonly the result of a single replaced word; additional errors resulting from mask-wearing would therefore also be expected to be minor. On one hand, minor errors are a nuisance in that they can affect the perceived quality of radiology reports, including from a medicolegal standpoint, or put undue pressure on the reader when assessing the results of a radiologic study [16]. Transcription errors, irrespective of their effect on interpretability, can also negatively affect the perceived professionalism of a report's author and might affect a radiologist's professional relationships or referral patterns [16]. On the other hand, minor errors do not generally impact patient care, making them of far lesser clinical importance than moderate or severe errors. Moderate and severe errors, while of greater importance, occurred at a lower rate than minor errors, and no severity of error increased proportionally to a significant degree when wearing a mask.

This study also allows for an exploration of the relative incidence of the different types of dictation errors encountered when using an automated dictation system. The most common type of error was the erroneous substitution of a single word, which comprised 64-66% of all errors and was the only type of error to approach a significant difference when wearing a mask (adjusted p = 0.054). While single-word replacement errors might be of any severity, the majority were minor, with many instances in which articles or conjunctions (e.g., "the," "of," or "and") were replaced with other articles or conjunctions, or similar words were substituted (e.g., "duct" instead of "ductal," or "maximum" replacing "maximal"). Single missing words and single added words were the second and third most frequent error types. For all single-word error types, minor and clinically insignificant errors predominated, and none were significantly affected by wearing a mask.

Errors involving numerals were infrequent, representing about 4% of errors. This is notable because radiologists are often tasked with measuring lesions and may reference image numbers or measurements from prior studies when reviewing subsequent imaging. Missed or incorrect terms of negation (e.g., errors involving "no," "not," or "without") are also a constant worry for radiologists, in that such errors are easily missed and can greatly affect patient management; these were also uncommon, representing about 0.5% of all errors. Again, none of these error types were significantly affected by a mask.

Significantly higher error rates were noted in MR and CR/RF compared to CT, while other pairwise comparisons of modalities showed no significant difference. Specific reasons for these higher error rates were not explored in this study, although one hypothesis is the presence of complex descriptions and pathologies in MR reports.
The high error rate with CR/RF was unexpected and might be partially attributable to infrequently dictated terminology specific to fluoroscopy. It is worthwhile for radiologists to know which modalities may result in reports with more errors, as additional proofreading may be required when interpreting those imaging modalities.

There were several limitations to our study. First, participants were aware of the study objectives and could not be blinded to wearing or not wearing a mask. Next, radiologists were specifically instructed not to proofread their dictated reports, which is typically done before signing a clinical report during a workday. However, the goal of this study was to determine the effect of masks on dictation errors, and if editing were allowed, it would have been impossible to differentiate errors resulting from a lack of editing from errors resulting from a mask. As a result, the effect of masks on real-world reports in a patient's electronic medical record remains unknown, although it would be expected to be smaller than in these unedited reports because of proofreading. There was also some inherent subjectivity during data coding, most importantly during the subclassification of errors as minor, moderate, or major. Standardization was attempted to the extent possible through introductory meetings at which data coding methods were discussed, including outlining definitions and reviewing examples for each severity classification, and through the process of data validation by a separate radiologist. Individual clinicians or radiologists may nonetheless differ in opinion as to what constitutes a minor versus moderate versus major error; this does not invalidate the conclusion that all-type error rates were significantly or near-significantly greater when radiologists wore masks.

Although we believe that these results apply to the majority of radiologists in the US, some factors may limit generalizability. We tested only one voice-recognition dictation system, as it is the software available at our institution. However, the vendor used by our institution is the market leader, with an estimated 81% market share [17], and the findings therefore apply to most radiologists; minor variations might be expected for radiologists using other vendors' dictation software. Furthermore, the demographics of the participating radiologists may factor into generalizability. First, five participants in this study were female and one was male. While gender is not expected to affect speech recognition, this question was not directly studied. More importantly, highly significant differences in error rates were seen between the attending radiologists and the single resident radiologist, who also happened to be the only participant with accented speech. While the reasons for this were not specifically studied, we hypothesize that an accent may degrade voice recognition, although other conceivable reasons exist; for example, resident radiologists may not have dictation voice models as highly tuned as those of attendings, given that residents dictate across multiple subspecialties as they rotate through a radiology department. Of note, there were significantly more errors of negation for the resident radiologist, which would not be expected to depend on the level of training or subspecialty. Implications of accented speech for the accuracy of speech-recognition software have long been hypothesized, although current research is sparse.
This may provide an interesting area for further study. Finally, this study focused on examinations generally interpreted by abdominal imagers, as the participating faculty radiologists were within the division of abdominal imaging and would be expected to have dictation voice models highly tuned to terms and conditions found in abdominal imaging reports. If other types of exams (e.g., CT head, MR knee) had been included, dictation errors could potentially have resulted from weaker voice modeling for terms commonly found in those types of reports. Although not directly assessed, error rates for radiologists practicing with a personal voice model attuned to their own study mix would not be expected to vary greatly.

We conclude that wearing a mask while dictating results in at least a near-significant increase in the rate of dictation errors in unedited radiology reports created with speech recognition, a difference which may be accentuated in some groups of radiologists. Notably, however, most errors are minor single incorrect words and are unlikely to result in a medically relevant misunderstanding.

Funding: No funding was obtained for this study.

Data Availability: Data is available on request from the authors.

Ethics Approval: The Institutional Review Board (IRB) for our hospital system was consulted and determined that this project was exempt from a full review as no patient data was included.

Consent to Participate: All participants consented to participate in this study.

Consent for Publication: All participants consented to publication of this manuscript.

Conflicts of Interest: No authors had conflicts of interest or competing interests.

References:

1. In: Centers for Disease Control and Prevention
2. Respiratory virus shedding in exhaled breath and efficacy of face masks
3. Visualizing Speech-Generated Oral Fluid Droplets with Laser Light Scattering
4. National Academies of Sciences E (2020) Rapid Expert Consultation on the Possibility of Bioaerosol Spread of SARS-CoV-2 for the COVID-19 Pandemic
5. Initial evaluation of a continuous speech recognition program for radiology
6. Accuracy of a voice-to-text personal dictation system in the generation of radiology reports
7. Risks and benefits of speech recognition for clinical documentation: a systematic review
8. Speech, and Audio Signal Processing and Associated Standards
9. A systematic review of speech recognition technology in health care
10. Frequency and Spectrum of Errors in Final Radiology Reports Generated With Automatic Speech Recognition Technology
11. The negative impact of wearing personal protective equipment on communication during coronavirus disease 2019
12. Effects of face masks on speech recognition in multi-talker babble noise
13. Acoustic voice characteristics with and without wearing a facemask
14. Statistical power analyses using G*Power 3.1: tests for correlation and regression analyses
15. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
16. Error Rates in Breast Imaging Reports: Comparison of Automatic Speech Recognition and Dictation Transcription
17. Speech Recognition in Radiology - State of the Market