key: cord-0129687-9tq5we2u
authors: Korfiatis, Alex Papadopoulos; Moramarco, Francesco; Sarac, Radmila; Savkov, Aleksandar
title: PriMock57: A Dataset Of Primary Care Mock Consultations
date: 2022-04-01
journal: nan
DOI: nan
sha: 62358a816910437928282489c1e8895e44854cf9
doc_id: 129687
cord_uid: 9tq5we2u

Recent advances in Automatic Speech Recognition (ASR) have made it possible to reliably produce automatic transcripts of clinician-patient conversations. However, access to clinical datasets is heavily restricted due to patient privacy, thus slowing down normal research practices. We detail the development of a public access, high quality dataset comprising 57 mocked primary care consultations, including audio recordings, their manual utterance-level transcriptions, and the associated consultation notes. Our work illustrates how the dataset can be used as a benchmark for conversational medical ASR as well as consultation note generation from transcripts.

The use of Automatic Speech Recognition (ASR) is widespread in the clinical domain, but it is generally used to alleviate the administrative burden of clinical notes through dictation (Hodgson and Coiera, 2016; Kumah-Crystal et al., 2018). However, the adoption of telemedicine, especially in primary care, generates vast quantities of clinical interaction recordings. Additionally, ASR models have become much more robust in clinical-domain applications. In turn, this is beneficial for downstream Natural Language Processing (NLP) tasks, such as information extraction from clinical conversations (Selvaraj and Konam, 2021; Soltau et al., 2021) and automatic generation of consultation notes (Finley et al., 2018; Enarvi et al., 2020a; Quiroz et al., 2020; Molenaar et al., 2020).

Despite this being an active area of research, it still lacks a commonly recognised ASR benchmark due to the sensitive nature of clinical conversations. Furthermore, as the datasets are not shared, research teams always need to invest time and resources into making their own private dataset. These limitations slow down progress in the field. We release a high quality public dataset of primary care consultation audio recordings, including manual transcriptions and associated consultation notes, which is the basis of our contributions: 1. a benchmark for ASR for primary care conversations; 2. a benchmark for automatic generation of consultation notes for primary care.

Automated transcription of clinical consultations has attracted significant research interest; however, as mentioned above, there is no easily accessible common benchmark dataset in the style of Switchboard (Godfrey et al., 1992) or Fisher (Cieri et al., 2004), which are both non-medical conversational audio datasets. Because of this, comparing different approaches for clinical conversation ASR is challenging. For example, Chiu et al. (2018) detail a dataset of approximately 14,000 hours of recorded and manually transcribed consultations that they use to train an end-to-end clinical conversation ASR model. Similarly, Kim (2020) and Soltau et al. (2021) develop end-to-end ASR models for clinical conversations, and Mani et al. (2020) train a sequence-to-sequence machine translation model to correct the errors of general-domain ASR engines; but they all use different, proprietary datasets. Johnson et al. (2014) and Kodish-Wachs et al. (2018) perform systematic reviews of the accuracy of a number of open-source and commercial ASR models for clinical conversation transcription; again, on proprietary datasets.
As for open-access datasets, the creators of MedDialog compile and release two clinical dialogue datasets in Chinese and English, covering a wide range of clinical specialties; the creators of CovidDialog do the same for COVID-19-related clinical dialogue. These datasets are gathered from online clinical question answering sources; while they are relevant for clinical chatbot research, they are not representative of clinical interactions and do not include audio. Kazi et al. (2020) provide a dataset of audio recordings, automated transcripts and consultation notes for 70 mock psychiatric consultations, but no human transcripts.

Automatic consultation note generation and other long-form text summarisation tasks have rapidly developed due to recent advances in Natural Language Generation (NLG) architectures (Vaswani et al., 2017; Devlin et al., 2019). Several studies (Liu et al., 2019; MacAvaney et al., 2019; Enarvi et al., 2020b; Joshi et al., 2020; Krishna et al., 2021; Chintagunta et al., 2021; Yim and Yetisgen-Yildiz, 2021; Moramarco et al., 2021; Zhang et al., 2021) use proprietary datasets of transcripts and notes to train NLG models end-to-end, and a number of them carry out automatic or human evaluations on their proprietary test sets. However, in a similar fashion to the ASR studies discussed above, most studies do not publish these resources; hence, it is again prohibitively difficult to compare their proposed methods. Kazi et al. (2020) provide the only open access clinical dataset that could be used as a benchmark, but it only contains psychiatric consultations, which is less applicable to primary care.

Releasing a dataset containing Personal Health Information (PHI) is typically costly and involves collecting patient consent and/or de-identification, which is especially challenging with audio recordings. As a pragmatic alternative, we built a mock consultation dataset that is as close as possible to real conditions. The diagram in Figure 1 shows an overview of the data collection process.

Figure 1: Overview of the data collection process. A mock patient, reading from a medical case card, has a consultation with a clinician which is recorded and transcribed. The resulting dataset includes the consultation audio recordings, notes and manual transcripts.

We employed 7 clinicians and 57 actors posing as patients from a range of ethnicities. The clinicians had experience with virtual consultations. Participation was optional and anyone could choose to withdraw at any time. Four of the clinicians were men and three were women; five of them had a British English accent and two an Indian accent. The patient accent distribution is as follows: British English (47.4%), various European (31.6%), other English (10.5%), and other non-English (10.5%). The gender distribution was relatively even (52.6% women, 47.4% men); most participants were between 25 and 45 years old (see Figure A.1).

Each mock patient was given a case card that included background information (age, social history, family history of illnesses) as well as information about their presenting complaint, symptoms, conditions, and medications. The case cards were drawn from a pool of primary care conditions, representative of presenting complaints in UK primary care. For a breakdown of presenting complaints, see Table 1. An example case card is given in Table 2 (excerpt: "There is some blood in the urine - pink colour; Pain below belly button; Feeling nauseated but no vomiting").
We recorded 57 mock consultations (8h38m6s in total) over 5 days, using proprietary telemedicine software that allowed us to export the individual clinician and patient audio channels. In order to emulate real clinical practice, clinicians were using laptops while patients were using mobile phones in an office environment with background noise. Clinicians were asked to act as closely as possible to their actual consultation sessions, including conforming to a consultation length of 10 minutes and writing a consultation note in the SOAP format (Pearce et al., 2016). The resulting mock consultations ranged between 3m48s and 14m18s, with an average consultation length of 9m5s.

To transcribe the consultation recordings, we employed transcribers with experience in the clinical conversation domain, who were asked to:
1. Listen to the consultation audio recordings, in separate channels for clinicians and patients;
2. Identify the start and end points of individual utterances (continuous speech segments ending in a pause);
3. Provide an accurate transcription of each of the utterances identified.
Thus we obtained a collection of start times, end times, and utterance-level transcriptions, important for the ASR evaluation described below.

Consultations have 92 conversation turns and 1,489 words on average; clinicians tend to speak more than patients (897 vs. 592 words per consultation) and take longer turns (19.3 vs. 12.8 words per turn). Interestingly, patients tend to take longer turns than clinicians at the beginning of the consultation, where they presumably state their presenting complaint; turns are more balanced in the middle, and clinicians seem to take over during the diagnosis and management at the end (see Figure 2).

Figure 2: Average utterance length for clinician and patient as a function of conversation turns. The patient initially speaks more than the clinician, but later in the consultation this trend is reversed.

We perform a baseline study of ASR for clinical conversations by passing the audio recordings of the mock consultations through commonly used open-source and commercial speech-to-text engines, among them Azure Speech-to-text (ASTT), a commercially available, general-domain service for which we use the Standard model; the other engines referred to below include Google Speech-to-text, a medical-domain Amazon model, and Conformer.

To test the accuracy of the above services, we first extract the audio for each individual utterance identified by our human transcribers. We then generate a transcript for the utterance using each of the ASR engines. We ensure consistency by performing post-processing steps on both human and automatic transcripts, including removing disfluencies ("umm", "uhh", etc.), which are included in the reference transcripts but often omitted by the STT services. Finally, we compute the Word Error Rate (WER) for each utterance using SCTK's sclite tool.

The mean WER, including a breakdown by gender, role, and accent, can be seen in Table 3. Even though both are general-domain models, Google and Azure together are the best performing models on our dataset (p = 0.097). Conformer performs surprisingly well, given that it is a character-level model evaluated on a word-level metric.
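To make the per-utterance evaluation concrete, the following is a minimal sketch of the WER computation under the post-processing described above, assuming reference and hypothesis utterances have already been paired as strings; the disfluency list is illustrative only, and the paper itself relies on SCTK's sclite rather than custom code.

# Minimal sketch of per-utterance WER with disfluency removal.
# The disfluency list and normalisation are assumptions for illustration.
import re

DISFLUENCIES = {"umm", "uhh", "um", "uh", "erm"}  # assumed list

def normalise(text):
    # Lower-case, keep word-like tokens, drop disfluencies.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in DISFLUENCIES]

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + insertions + deletions) / reference length,
    # via a standard Levenshtein alignment over words.
    ref, hyp = normalise(reference), normalise(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: mean WER over a list of (reference, hypothesis) utterance pairs.
pairs = [("there is some blood in the urine", "there is um some blood in urine")]
print(sum(word_error_rate(r, h) for r, h in pairs) / len(pairs))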
The base WER metric treats all words in a transcript as equally important; this may be less desirable in the clinical domain, where the correct transcription of specific clinical terms is expected to be more important. To test this, we use a proprietary clinical information extraction engine based on fuzzy string matching, linking to SNOMED-CT (Donnelly et al., 2006). We extract medical concepts from each utterance in both reference and hypothesis transcripts, then compare the concepts extracted to estimate accuracy based on clinical terminology (ECCA in Table 3). The results mostly match the WER comparisons; the medical-domain Amazon model does not seem to perform better.

The consultation transcripts and corresponding notes (see example in Table 4) are intended as a parallel dataset to evaluate methods for automatically generating primary care consultation notes. We propose a benchmark for this task by evaluating a number of baseline approaches and reporting common automatic metric scores on our dataset. The approaches considered include:
BART-CNN: a neural sequence-to-sequence summariser based on the BART model (Lewis et al., 2020) and fine-tuned on the Dailymail/CNN dataset (Nallapati et al., 2016);
BERT-ext: a general-purpose extractive summariser based on BERT embeddings (Miller, 2019);
Random: a baseline that extracts 15 random sentences from the transcript and collates them to form a note;
BART-finet: a BART-CNN model further fine-tuned on a proprietary dataset of 8,000 real transcripts and consultation notes.

We evaluate the models on our dataset and report common summarisation metric scores: Rouge-1, -2 and -L (Lin, 2004), which compute the F-score across n-grams between generated and human notes; and BERTScore (Zhang et al., 2019), which computes the similarity between BERT embeddings of the notes.

Table 5: Average common metric scores of the different models on the 57 consultations. R1 through L represent Rouge F1 scores for unigrams, bigrams, and longest common subsequence. B represents non-rescaled BERTScore; scores range between 0.7 and 0.9, so differences are less pronounced.

The results can be seen in Table 5: the fine-tuned BART model scores highest on all metrics, while BART-CNN and BERT-ext fail to outperform the Random baseline model. This highlights the differences between consultation note generation and general-purpose summarisation. A more detailed evaluation of this task can be found in Moramarco et al. (2022); example notes can be found in Appendix Table A.3.
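As a rough illustration of how such baselines and metrics fit together, the sketch below builds a Random-style note (up to 15 randomly sampled transcript sentences) and scores it against a human note with Rouge F1, assuming the open-source rouge-score package; the transcript and note strings are toy examples rather than items from PriMock57, and BERTScore could be computed analogously with the bert-score package.

# Minimal sketch of scoring a baseline note against a human note.
# Assumes the `rouge-score` package is installed; strings are toy examples.
import random
import re
from rouge_score import rouge_scorer

def random_baseline(transcript, n_sentences=15):
    # The Random baseline: sample up to n sentences from the transcript
    # and collate them into a "note".
    sentences = [s.strip() for s in re.split(r"[.?!]", transcript) if s.strip()]
    sampled = random.sample(sentences, min(n_sentences, len(sentences)))
    return ". ".join(sampled) + "."

transcript = ("How can I help you today? I have a swelling on my left elbow. "
              "It started about a week ago. It is not painful. Any injury? No.")
human_note = "1 week history of left elbow swelling. Not painful. No trauma."

generated = random_baseline(transcript)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(human_note, generated)  # (target, prediction)
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))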
We present a dataset of 57 high quality mocked consultation audio recordings, their manually aligned and diarised transcripts, and consultation notes. By publishing this dataset, we hope to offer a benchmark for future studies in both ASR for clinical conversations and Consultation Note Generation for the primary care domain.

Appendix examples (Table A.3):

Human note: Hx: 1 week history of spontaneous elbow swelling left. Not painful. No trauma. No FH of rheumatological disease. NB pt says he has been told he has OA previously by doctors - ? need to confirm this. Works in a desk job. Not happened before. Otherwise well. PMHx: nil of note. FH: nil of note. DH: not on any medication, allergic to peanuts. SH: exercises regularly, active. Ex: looks well, not in pain. Mild erythema and minimal swelling (if any) around olecranon process left elbow. Imp: possible bursitis. Plan: for NSAIDs - usual advice re SE. For rheum bloods: ESR, CRP, FBC, rheum factor and urate. Review thereafter in person / via video. To contact us back in interim if any deterioration/concerns - pt warned re symptoms of septic arthritis.

Generated note: Doctor Deen Mirza from GP at Hand sees John Smith. John says he has a weird swelling on his left elbow. He also says he is allergic to peanuts. Deen takes a look at John's elbow to see if there is anything wrong with it.

Generated note: Do you have any other illnesses at all?

BERT-ext: Before we start your appointment, could you please tell me your first name and your date of birth. And I was born on the fifth of April, nineteen seventy three. But it's just, just a bit, a bit weird, to see that. And, in terms of your job, do you do anything physical? So you know you said you think you've got osteoarthritis. And, do you have any other illnesses at all? I run regularly, like two, three times a week. What I think we should do is, I think you should be on some anti-inflammatory medication, in the, in the first instance. And, there'll be instructions within that pack, about where to go to get those blood tests done. And, your, your joint doesn't look like that. However, if your, the elbow was to become very red, very painful, and the redness was to spread or become, you know, more intense. That would require more immediate assessment, more immediate treatment. Do you, do you think it's something dangerous? Like something, like could I die from that, or is it, is it No. That's four hundred milligrams, two times a day. Maybe within a, actually you know, the follow-up appointment doesn't have to be face-to-face, if it's more convenient for you do, to do it over the phone, we can do that over the phone, over video. We can do that as well, that's, that's your call. Sure. No, no I haven't noticed that before. OK, OK, great. Yes, a few years ago. Do you, do you think it's something dangerous? Fantastic. But you contact us, after you've had the blood test done, and we can review things then, OK. OK. OK, yeah that sounds good. OK. Yeah, no, I'm, think I'm healthy. So, this, this is not the case right now. I run regularly, like two, three times a week. Don't need to worry. All right then, OK. Take care then.

Generated note: You have a problem with your left elbow. 1 week ago noticed a weird swelling on the left elbow. Not painful at all, but slightly warm, slightly warm. No pain, no swelling, no fluid in the elbow. No injury. No previous history of this. No injury to the elbow. NKDA. SH: Mobile and active, exercise 2-3 times a week, running. Osteoarthritis of the elbow. You should start the treatment you have been prescribed. You should begin the treatment prescribed as we discussed. You may want to take some ibuprofen or paracetamol in addition to any prescribed medication.

Table A.3: Examples of a human written note and automatically generated notes with the four baseline models.

Xavier Amatriain, and Anitha Kannan. 2021. Medically aware GPT.
The Fisher corpus: A resource for the next generations of speech-to-text.
BERT: Pre-training of deep bidirectional transformers for language understanding.
SNOMED-CT: The advanced terminology and coding system for eHealth.
Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models.
Generating medical reports from patient-doctor conversations using sequence-to-sequence models.
An automated medical scribe for documenting clinical encounters.
SWITCHBOARD: Telephone speech corpus for research and development.
Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented transformer for speech recognition.
MedDialog: Two Large-scale Medical Dialogue Datasets.
Risks and benefits of speech recognition for clinical documentation: a systematic review.
A systematic review of speech recognition technology in health care.
Dr. Summarize: Global summarization of medical dialogue by exploiting local structures.
CovidDialog: Medical dialogue datasets about COVID-19.
Upulee Kanewala, and Indika Kahanda. 2020. Dataset for automated medical transcription.
End-to-End Speech Recognition on Conversations. Thesis.
A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech.
QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions.
Generating SOAP notes from doctor-patient conversations using modular summarization techniques.
Electronic Health Record Interactions through Voice: A Review.
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
ROUGE: A package for automatic evaluation of summaries.
Topic-aware pointer-generator networks for summarizing spoken conversations.
Ontology-aware clinical abstractive summarization.
Towards Understanding ASR Error Correction for Medical Conversations.
Leveraging BERT for extractive text summarization on lectures.
Fabiano Dalpiaz, and Sjaak Brinkkemper. 2020. Medical Dialogue Summarization for Automated Reporting in Healthcare.
A preliminary study on evaluating consultation notes with post-editing.
Human evaluation and correlation with automatic metrics in consultation note generation.
Abstractive text summarization using sequence-to-sequence RNNs and beyond.
The essential SOAP note in an EHR age.
The Kaldi speech recognition toolkit.
Identifying relevant information in medical conversations to summarize a clinician-patient encounter.

Appendix example (Table A.2):

Human transcript (excerpt): Um whereabouts in your skin is it affected? Patient: Uh, mostly like my chest, my, my hands, my arms. Like, like really, it [...] next week, if things don't get better. Patient: That sounds good. Doctor: OK? Um do you have any questions for me? Patient: Uh, no that's it. Thank you very much. Bye. Thank you as well. Bye.

Google Speech-to-text transcript (excerpt): Doctor: Hello. Patient: Hello, can you hear me wet? Doctor: Yes, I think it's a bit better. It's a bit. It's a bit. It's not very clear. But let's continue. Anyway, Patient: Okay. Doctor: okay, let's talk again. So, how can I help you, sir? Patient: Yes, so it's been a few days now. I have like a sore and the Redskin it's kind of it's really itchy and it's like super annoying. Doctor: Okay. Patient: So I'd like to find something quick to serve it. Doctor: No, no problem. Happy to help whereabouts of your skin is affected. Patient: Mostly like my chest my my hands my arms like agree. It's super annoying like it's itching a lot like all the time and I can't even sleep at night. Like I really need something quickly to study because even at work I like when I'm in the meeting and I have to like think about my work Focus like actually focus on my work. It's Doctor: Yeah. Patient: really annoying because I can actually think about what happened say, I'm always like disturbed by this disease

Table A.2: An example of a human transcript and a Google Speech-to-text transcript for one of the mock consultations.