title: Public Health Informatics: Proposing Causal Sequence of Death Using Neural Machine Translation
authors: Zhu, Yuanda; Sha, Ying; Wu, Hang; Li, Mai; Hoffman, Ryan A.; Wang, May D.
date: 2020-09-22

Each year there are nearly 57 million deaths around the world, with over 2.7 million in the United States. Timely, accurate and complete death reporting is critical in public health, as institutions and government agencies rely on death reports to analyze vital statistics and to formulate responses to communicable diseases. Inaccurate death reporting may misdirect public health policies. Determining the causes of death is, nevertheless, challenging even for experienced physicians. To facilitate physicians in accurately reporting causes of death, we present an advanced AI approach to determine a chronologically ordered sequence of the clinical conditions that lead to death, based on the decedent's discharge record from the last hospital admission. The sequence of clinical codes on the death report is termed the causal chain of death and is coded in the tenth revision of the International Statistical Classification of Diseases (ICD-10); the priority-ordered clinical conditions on the discharge record are coded in ICD-9. We identify three challenges in proposing the causal chain of death: the two different coding systems for clinical codes, conflicts with medical domain knowledge, and data interoperability. To overcome the first challenge in this sequence-to-sequence problem, we apply neural machine translation models to generate the target sequence. We evaluate the quality of the generated sequences with the BLEU (BiLingual Evaluation Understudy) score and achieve 16.44 out of 100. To address the second challenge, we incorporate expert-verified medical domain knowledge as a constraint when generating the output sequence, excluding infeasible causal chains. Lastly, we demonstrate the usability of our work in a Fast Healthcare Interoperability Resources (FHIR) interface to address the third challenge.

There are more than 2.7 million deaths in the United States every year [1], and nearly 57 million per year around the world. Accurate death reporting is essential for public health institutions such as the National Center for Health Statistics (NCHS) and the Centers for Disease Control and Prevention (CDC) to analyze vital statistics such as life expectancy and to formulate responses to communicable disease threats and epidemics. Beyond simple information such as the demographics of the deceased, an important component of death reporting is determining the causes of death. The U.S. death reporting system requires two types of causes of death to be filled on death certificates: a single medical condition that is the underlying cause of death, and an ordered list, a causal chain, of the medical conditions that led to the death. An example causal chain of death is "chronic obstructive pulmonary disease, unspecified (ICD-10: J44.9) → other disorders of lung (ICD-10: J98.4)". Here ICD-10 stands for the 10th revision of the International Statistical Classification of Diseases and Related Health Problems, the coding system commonly used in death reporting. The process of determining such causal chains of death, nevertheless, is challenging even for an experienced physician.
Such a process involves careful reasoning with one's medical domain knowledge, posing challenges for young or inexperienced physicians. Even worse, a sudden and unexpected death can further complicate filling out the death report when the physician can find only limited electronic health records for the deceased.

Complete and accurate reporting of the full chain of causes of death has multiple benefits. These data are an invaluable public health resource for tracking the prevalence of causes of death, targeting public health interventions, and tracking the effectiveness of those interventions over time. Frequently reported chains can help physicians and public health experts understand the correlations and causal relationships between clinical diseases, potentially allowing the discovery of causal relationships that had not previously been observed. From a patient-level, personalized-medicine perspective, it may even be possible to warn individual patients of potential diseases leading to death before any symptoms can be diagnosed. This can help improve clinical care and, in turn, the well-being of patients.

To facilitate timely, accurate and complete reporting of deaths and to reduce the subjectivity of reporting physicians, in this paper we aim to develop a decision support system that suggests probable causal sequences of death based on decedents' health histories. These chains of causes of death form the basis of the NCHS Multiple Causes of Death data, a critically valuable data source in public health. They outline the chain of medical events and conditions that led to the death, arranged in cause-effect order.

We identify three challenges in predicting the causal chain of death from the obtained mortality data set using deep learning algorithms. Table I summarizes these three challenges and our proposed solutions. The first challenge comes from the different coding systems for clinical conditions. Causes of death (COD) in the United States have been coded with the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) since January 1999 [2]. On the other hand, healthcare institutions and practitioners in the U.S. were still filing patients' health records using the ninth revision (ICD-9) until October 2015 [3]. ICD-10 codes differ substantially from ICD-9 codes in both coding structure and quantity: ICD-10 has nearly five times as many diagnosis codes as ICD-9. Thus, no one-to-one mapping between the two systems can be expected.

Intuitively, a solution to this challenge arises from the analogy between our task and natural language translation. The input sequence of diagnosis codes comes from the last hospital discharge record of the deceased, and the output sequence is the corresponding causes of death for that decedent. Just as English sentences are translated into French sentences, we intend to propose a succinct causal chain of death in ICD-10 codes from the priority-ordered discharge record in ICD-9 codes. The research area of Natural Language Processing (NLP) contains extensive studies of machine translation. Specifically, machine translation models can be classified into autoregressive (AR) [4]-[7] and autoencoder (AE) models [8]-[10]: the former factorize the probability of a given corpus into a series of conditional probabilities, while the latter generate output by reconstructing from corrupted input.
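To make the autoregressive view concrete, the per-token factorization and training objective that AR translation models optimize can be written as follows. This is our notation for the standard textbook form, not a formula taken from this paper:

```latex
% Standard autoregressive NMT factorization and maximum-likelihood objective
% (notation introduced here for illustration; D is the parallel training corpus).
p_\theta(y \mid x) = \prod_{t=1}^{n} p_\theta\!\left(y_t \mid y_1, \ldots, y_{t-1},\, x\right),
\qquad
\hat{\theta} = \arg\max_{\theta} \sum_{(x, y) \in \mathcal{D}} \log p_\theta(y \mid x)
```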
The second challenge is conflict with medical domain knowledge. As a purely data-driven approach, a deep learning model can sometimes generate sequences that confuse physicians; some results may even contradict medical domain knowledge. Consequently, physicians may find it difficult to trust the generated results. To solve this problem, we incorporate medical domain knowledge to guide the deep learning framework. In particular, we use an external source of expert-curated rules, given as pairs of causal relationships between clinical condition codes. When the deep learning model searches for the next clinical condition while generating the output sequence, we impose the constraint that only clinical conditions consistent with this medical domain knowledge can serve as candidates.

The last challenge is data interoperability in death reporting. Currently, the National Center for Health Statistics (NCHS) coordinates with 57 reporting jurisdictions across the United States to aggregate mortality data [11]. These reporting jurisdictions have different regulations and local laws. To streamline data storage and transmission between hospitals and these public health institutions, and to increase the data size for future big data analytics, Fast Healthcare Interoperability Resources (FHIR) is utilized to standardize mortality data reporting. We have developed a web-based FHIR [12] platform that adopts the HL7 [13] healthcare standard to access electronic health record data. The newly developed Android mobile app is FHIR compatible; it can pre-populate different sections of the death certificate to extract essential information from the health history of the decedent. Furthermore, it serves as a graphical user interface for physicians: the mobile app can automatically query the pre-trained causal-chain-of-death prediction models to provide decision support. In future versions, such applications may collect data that can be used to refine and train decision support models in real time, expanding the impact of this work beyond retrospective analysis and improving clinical practice at the point of care.

A recent work on a similar topic [14] was submitted in late March 2020; Figure 1 shows the differences between the pipelines of our work and theirs. Our work has five key differences from theirs:
1) Different clinical tasks: we aim to automatically generate the causal chain of death in ICD-10 codes given the discharge data of the decedent's last hospital visit, coded in ICD-9. In comparison, their work recognizes and converts medical entities in natural language (French) to ICD-10 codes; the medical entities in French have already been entered as causes of death on the death certificate by medical experts.
2) Different input data formats: our input data are priority-based ICD-9 codes from the discharge data of the decedent's last hospital visit, while their input data are medical entities of causes of death in French on the death certificates.
3) Different methods: in addition to the neural machine translation model named the transformer, we also adapt and apply three recurrent neural network based encoder-decoder frameworks to generate the causal chain of death; furthermore, we test the more recent cross-lingual language model (XLM) [10] and analyze why it fails on our task.
4) Medical domain knowledge constraint: we learn medical domain knowledge as a constraint from the ACME decision table and apply it to the encoder-decoder frameworks.
5) Different evaluation metrics: we use a modified BLEU score (1-gram and 2-gram precision), which is popular for sequence-to-sequence translation tasks in natural language processing, whereas their work uses precision, recall and F-measure for evaluation.

In this work, we adopt encoder-decoder models as the main framework to automatically generate a causal sequence of death in ICD-10 codes given the priority-based ICD-9 diagnosis codes from a decedent's last hospital discharge record. In addition, we add an expert domain knowledge graph, learnt from the ACME decision table, as a constraint to restrict the output of the purely data-driven framework. The overall structure is shown in Figure 2. In summary, our work makes the following contributions:
1) We are the first to develop data-driven approaches for suggesting causal chains of death based on death reports and decedents' last hospital visit discharge records;
2) We apply the state-of-the-art model for neural machine translation and augment it with domain knowledge constraints;
3) We are the first to interpret the deep learning results through visualization, identifying meaningful associations between clinical conditions coded in different ICD versions;
4) We are the first to use the BLEU score to evaluate the performance of deep learning models in generating causal chains of death;
5) We implement the knowledge-guided deep model on the FHIR interface.

We use last hospital visit discharge records from the Michigan Vital Statistics Data, covering 181,137 patients. As shown in Figure 3, each patient has exactly one line of essential information from the last hospital visit, including up to 45 clinical diagnosis codes, one underlying cause of death and up to 17 related causes of death. On average, each patient has 18.84 diagnosis codes and 2.25 causes of death (including the underlying cause of death). The diagnosis codes are sequences of ICD-9 codes, while the priority-based causes of death are in ICD-10 codes. Typically, we have a longer input source sequence of around 16-20 codes and a much shorter output target sequence of roughly 2-3 codes. Such short sequences of death codes are expected in death reports: we accessed ten years (2009 to 2018) of the National Center for Health Statistics (NCHS) Mortality Multiple Cause Files database and calculated the average length of the death code sequence among 26,322,220 decedent samples to be 2.95 codes. (Note that discharge codes from the last hospital admission may include discharge codes from previous admissions.)

An ontology of medically valid causal relationships between ICD-10 codes was developed, improved, and promulgated by an international team of medical experts [15]. This ACME (Automatic Classification of Medical Entry) decision table, which contains 95,321 lines of causal relationships, was used to learn the medical domain knowledge constraint [16].

Mathematically, we can define the generation of causal chains as follows. [Generation of Causal Chains] Given a deceased's medical history represented as a collection of medical codes x = x_1, ..., x_m, the goal of causal chain generation is to identify another list of medical codes y = y_1, ..., y_n that summarizes the conditions leading to the death. The objective is to propose the causal chain of death, an ordered sequence of death codes (ICD-10); the inputs are sequences of diagnosis codes (ICD-9).
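As a concrete illustration of this formulation, the sketch below shows how one decedent record could be flattened into a parallel source/target pair of space-separated codes, one sequence per line, the format expected by standard NMT toolkits. This is not the authors' preprocessing code; the field names ("diagnosis_codes", "underlying_cod", "related_cods") and the ordering of the target chain are hypothetical placeholders for the Michigan Vital Statistics layout.

```python
# Illustrative sketch (not the authors' code): flatten one decedent record into a
# parallel source/target pair, treating each ICD code as one "word".
# Field names are hypothetical placeholders for the Michigan data layout.

def record_to_pair(record):
    """Return (source, target) strings of space-separated codes for one decedent."""
    # Source: up to 45 priority-ordered ICD-9 diagnosis codes from the last admission.
    src_tokens = [c.strip() for c in record["diagnosis_codes"] if c and c.strip()]
    # Target: the underlying cause of death followed by the related causes of death,
    # i.e. the short causal chain coded in ICD-10 (ordering assumed, not verified).
    tgt_tokens = [record["underlying_cod"]] + [
        c.strip() for c in record["related_cods"] if c and c.strip()
    ]
    return " ".join(src_tokens), " ".join(tgt_tokens)


def write_parallel_corpus(records, src_path, tgt_path):
    """Write one sequence per line, ready for an NMT toolkit such as OpenNMT."""
    with open(src_path, "w") as f_src, open(tgt_path, "w") as f_tgt:
        for rec in records:
            src, tgt = record_to_pair(rec)
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")
```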
To generate a sequence in one domain (language) from a sequence in another domain (language), we apply state-of-the-art algorithms from neural machine translation. Diagnosis codes in ICD-9 are defined as source codes (input sequence), while death codes in ICD-10 are target codes (output sequence). Both source and target codes are split into training, validation and testing sets at a ratio of 7 : 1 : 2, and we apply 5-fold cross validation. Each line of codes is treated as one sequence, with each code in the sequence treated as one word in the natural language setting.

Translation finds a target sentence y = y_1, ..., y_n that maximizes the conditional probability p(y|x) given a source sentence x = x_1, ..., x_m. Neural machine translation (NMT) maximizes this conditional probability over source-target sentence pairs by fitting a parameterized model on a parallel training corpus. As shown in Figure 4, an NMT system has two basic components:
• An encoder, which encodes the input sequence x into a representation s;
• A decoder, which generates the output sequence y.
The decoder models the conditional log-probability log p(y_t | y_1, y_2, ..., y_{t-1}, s): the probability of the next generated word y_t is jointly decided by the learned representation s and all previously generated words y_1, ..., y_{t-1}.

Fig. 4. Neural machine translation consists of an encoder (stacked recurrent networks in blue) and a decoder (stacked recurrent networks in red). The symbol <eos> is a special token referring to the end of a sentence. Adapted from [7].

In this section, we briefly introduce the following four encoder-decoder models for neural machine translation.

1) LSTM Encoder - LSTM Decoder: In an LSTM encoder-decoder framework [17], [18], the encoder reads and encodes an input sequence of embedded vectors x. The encoder generates a hidden state h_t at time t from the current input x_t and the previous hidden state h_{t-1}, i.e., h_t = f(x_t, h_{t-1}), and the representation vector takes the form s = q({h_1, ..., h_m}). Here f and q are non-linear functions. For the basic RNN/LSTM model, the conditional probability of the output word at time t can be written as p(y_t | y_1, ..., y_{t-1}, s) = g(y_{t-1}, d_t, s), where d_t denotes the decoder hidden state at time t and g is a (multi-layered) nonlinear function. A generic RNN or LSTM encoder-decoder framework has to process the sentence word by word and fails to preserve long-term dependencies. Luong et al. [6] proposed global attention, which predicts the position of alignment for the current word before computing the context vector using a window centered around that source position. Here, global attention is used in the decoder.

2) Mean Encoder - LSTM Decoder: The mean encoder is a simplified encoder. Instead of an LSTM in the encoder, the mean encoder speeds up computation by using a mean pooling layer. The same stacked LSTM decoder and global attention mechanism are used.

3) Bidirectional RNN Encoder - LSTM Decoder: A major disadvantage of the traditional encoder-decoder model is that the neural networks compress source sentences into fixed-length vectors, which may significantly limit the ability to translate long sentences [19]. Bahdanau et al. [4] proposed a bidirectional RNN encoder-decoder approach so that the model can learn to align and translate jointly.

4) Transformer: As shown in Figure 5, a transformer consists of a stack of encoders and an equal number of decoders.
The embedded input is passed to the encoder stack at the bottom; the output of the top encoder is passed to all decoders. The top decoder passes its output to a linear layer and a softmax layer to generate the predicted sentence. Each encoder has two layers: a multi-head self-attention layer and a feed-forward layer (shown in part A of Figure 6). Each decoder has an extra multi-head attention layer that processes both the output from the encoder stack and the output from the previous multi-head attention layer (shown in part B of Figure 6).

A straightforward decoding method is to keep and predict only the single word with the highest score at each step, given the previous steps. It is efficient and easy to understand, yet one small mistake in the output can corrupt all remaining predictions. A better strategy, beam search, keeps the top k hypotheses at each step and selects the best one when reaching the end of the sequence; here, k is the beam size.

We also include medical domain knowledge as a constraint during translation. The ACME (Automatic Classification of Medical Entry) decision table specifies all the "feasible" pairwise causal relationships between ICD diagnosis codes [15], [16]. Using this decision table, we construct a domain knowledge graph over all diagnosis codes in the Michigan data before training. With diagnosis codes as nodes, we add directed edges between them only if the corresponding causal relationship can be found in the ACME decision table. When decoding, the networks are required to look up the knowledge graph and are only allowed to include "feasible" codes in the top k hypotheses.

Furthermore, we need to increase interpretability and demonstrate it through visualization. Several studies visualized attention weights to show that their models captured the correspondence between English and French words [4], [20], [21]. The ability to demonstrate such correspondences through visualization is also critical for our task: because the discharge records and causes of death are coded under different versions of the ICD, associating a cause of death with several past diseases can help us qualitatively evaluate generated causal sequences.

In addition to qualitative evaluation, quantitative evaluation remains critical. Here we evaluate how well our proposed causal chain Ŷ = {Ŷ_1, ..., Ŷ_{M_1}} aligns with the physicians' decision Y = {Y_1, ..., Y_{M_2}}, where Ŷ_i and Y_i are individual codes and M_1, M_2 are the respective lengths of the chains. A perfect alignment means M_1 = M_2 and Ŷ_i = Y_i for i = 1, ..., M_1. This is rarely the case, so we instead compute a weighted average precision of the alignment over sub-sequences of variable lengths, i.e., the BLEU (BiLingual Evaluation Understudy) score [22]. Following the natural language processing literature, we call a sub-sequence of length i an "i-gram". The BLEU score ranges from 0 to 1 (or from 0 to 100 if multiplied by 100); the higher the BLEU score, the better the clinical alignment with the physicians' chains.
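A minimal sketch of this modified score (clipped 1- and 2-gram precision, geometric mean, and brevity penalty) is shown below, following the standard BLEU recipe [22]. The function names are ours, and since the paper's exact evaluation script is not published, this textbook variant may differ in minor counting details from the reported numbers; a worked example follows.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token (code) list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference, with each
    n-gram's count clipped by its count in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / total


def modified_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        geo_mean = 0.0
    else:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidate chains shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))
    return brevity_penalty * geo_mean
```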
We use a simple example to illustrate the computation of the BLEU score. In our proposed candidate sequence, the underlying cause of death, Asphyxia and Hypoxemia (R909), leads to Pneumonia, Unspecified Organism (J189), which leads to Respiratory Failure, Unspecified (J969). The reference sequence, determined by the physician, consists of Asphyxia and Hypoxemia (R909), Pneumonia, Unspecified Organism (J189) and then Acute Respiratory Failure (J960). As shown in Table III, we first list the 1-grams and 2-grams from Ŷ and Y, and we compute the precision for the two cases. Here the definition of precision is similar to that in the classification setting: among all the predictions made in the candidate sequence Ŷ, how many are correct with respect to the reference sequence Y? After computing all the precision values, we calculate their geometric average as the BLEU metric, in this case approximately 0.47.

In natural language settings, the BLEU score is usually calculated as the geometric average of up to 4-gram precision. In our case, however, we only compute the geometric average up to 2-gram precision, and apply clipping to each precision term; because the average length of a causal chain of death in the Michigan dataset is 2.25 codes, including 3-gram precision would lead to a substantially inaccurate evaluation. Furthermore, we include a brevity penalty BP to penalize sentences that are too short, giving

BLEU = BP · exp( (1/2) Σ_{i=1}^{2} log precision_i ),

where precision_i is defined as (the count of i-grams in Ŷ that appear in Y) / (the count of i-grams in Ŷ). We refer readers to the original paper for a detailed discussion of the variants of the BLEU metric [22].

For clinical interpretation, our modified BLEU score indicates how well our proposed sub-sequences of causal conditions match the physicians' results. The 1-gram precision emphasizes matching of individual condition codes, while the 2-gram precision evaluates the causal relationship between two neighboring condition codes. Physicians can manually check whether the generated causal relationship between any two neighboring condition codes fulfills or contradicts their medical domain knowledge; in addition, a data-driven algorithm can use the ACME decision table as medical domain ground truth to assess the validity of two neighboring condition codes. In Table IV, we show an example of different candidate sequences that have perfect 1-gram precision but different 2-gram precision. The reference sequence, from underlying cause of death to immediate cause of death, is: I251 (Atherosclerotic heart disease of native coronary artery), I38 (Endocarditis, valve unspecified), I429 (Cardiomyopathy, unspecified) and I469 (Cardiac arrest, cause unspecified). We argue that our modified BLEU score favors candidate sequences with more reasonable and feasible condition codes in pairwise causal relationships.

IV. EXPERIMENTS

Using the OpenNMT package [23], we tested five different encoder-decoder models. In addition to OpenNMT, we applied the state-of-the-art pretraining model, the cross-lingual language model (XLM) [10], to our data set. OpenNMT serializes the training, validation and vocabulary data into PyTorch files for preprocessing. During training, the default LSTM framework uses a 2-layer LSTM with 500 hidden units in each layer. The mean encoder framework simply removes the two LSTM layers in the encoder but keeps the other parts unchanged. For the bidirectional RNN encoder, a 2-layer bidirectional LSTM with 500 and 250 hidden units is implemented. The CNN-based encoder includes two gated convolutional layers with 500 and 1000 hidden units and kernel size (3, 1), followed by two convolutional multi-step attention layers. The transformer has 6 stacked layers, with 2048 hidden units in the feed-forward layers and 8 heads in the multi-head attention layers.
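The ACME decision table enters the pipeline in two places: the knowledge-constrained beam search described earlier and the validity check on training and validation chains described below. The sketch that follows illustrates both uses; it is not the authors' implementation, it assumes the table has already been parsed into (cause, effect) ICD-10 code pairs, and the hook into the OpenNMT decoder is not shown.

```python
from collections import defaultdict


def build_acme_graph(pairs):
    """Directed graph of feasible causal relationships: each code maps to the
    set of codes it may lead to, built from (cause, effect) ICD-10 pairs."""
    graph = defaultdict(set)
    for cause, effect in pairs:
        graph[cause].add(effect)
    return graph


def chain_is_valid(chain, graph):
    """A causal chain is valid only if every adjacent pair appears in the table."""
    return all(b in graph.get(a, set()) for a, b in zip(chain, chain[1:]))


def filter_training_chains(chains, graph):
    """Validity check: drop target chains containing an infeasible pair."""
    return [c for c in chains if chain_is_valid(c, graph)]


def prune_beam(hypotheses, graph, k):
    """One knowledge-constrained beam-search step: keep only hypotheses whose
    newest transition is feasible, then retain the top k by score.
    `hypotheses` is a list of (chain, log_probability) tuples."""
    feasible = [(chain, score) for chain, score in hypotheses
                if len(chain) < 2 or chain[-1] in graph.get(chain[-2], set())]
    return sorted(feasible, key=lambda h: h[1], reverse=True)[:k]
```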
XLM (cross-lingual language model) [10] combines the masked language modeling (MLM) objective proposed in BERT (Bidirectional Encoder Representations from Transformers) [8] with the transformer model to improve translation performance. Preprocessing includes tokenizing and applying fastBPE (byte pair encoding) [24] to the monolingual and parallel data. MLM is the core strategy in monolingual language model pretraining. Training consists of three major steps: denoising auto-encoding, parallel data training and online back-translation. Due to the limited size of our data set, we concatenate all training, validation and testing data into two corpora for monolingual pre-training. Masked language modeling (MLM) perplexities are used for validation during pre-training. We further train the cross-lingual model with parallel validation data and predict on parallel test data. We configure the transformer with an embedding size of 512 and 4 attention heads, and vary the encoder-decoder stacking size from 6 layers down to 1 layer. The dropout rate is 0.1, the attention dropout 0.1, the batch size 32 and the sequence length 128. We use GELU activations and the Adam optimizer.

In search of better prediction performance, we add an extra pre-processing step, the validity check. For the training and validation data, we adopt the algorithm of [16] to remove pairs of sentences whose target sentence includes an "invalid" causal relationship between diagnosis codes. This reduces the number of sentences in the training set from 136,753 to 107,711 and in the validation set from 34,385 to 27,009. We then follow the same pipeline to train and translate with the same five models.

As shown in Table V, we calculate the average BLEU score for each encoder-decoder framework across five folds. Except for the mean encoder model, whose BLEU score is 2% to 3% lower after the validity check, the models achieve BLEU scores 1.5% to 25.8% higher after removing invalid samples from the Michigan data. The knowledge constraint yields a BLEU score up to 1.8% higher for the mean encoder and transformer models, while it causes a 24.9% to 28.1% drop in performance for the BRNN model; it has a mixed impact on the default LSTM model. According to [7], a larger vocabulary size tends to yield a higher BLEU score: their proposed hybrid NMT model achieved a BLEU score of 17.7 with a 10k vocabulary on an English-Czech translation task. Our vocabulary size is 7,616 in the source set and 2,649 in the target set, so our translation performance is close to that of the state of the art.

The CNN-based encoder-decoder framework frequently reported errors during the experiments. Conceptually, RNN-based encoder-decoder frameworks with attention mechanisms are better suited to sequence-to-sequence tasks, whereas the CNN-based model struggles with sequences of different lengths. Consequently, we do not include the inconsistent results from the CNN-based encoder-decoder framework.

In addition, we visualize the attention during translation. As shown in Figure 7, the source sentence is at the top of the graph and the predicted sentence is on the left. In this example, the code C349 (malignant neoplasm of unspecified part of bronchus or lung) is a correct prediction. We can observe that the code 1890 (malignant neoplasm of kidney, except pelvis) from the source sentence has high attention with the predicted code.
Yet a non-negligible flaw is that the source code 2761 (hyposmolality and/or hyponatremia) attends incorrectly to the <EOS> symbol in the prediction.

To our surprise, the state-of-the-art XLM (cross-lingual language model) performs much worse than the other encoder-decoder frameworks: all BLEU scores are below 1 after trying different combinations of hyper-parameters. Even though pre-training finishes successfully after 500-700 epochs on the monolingual corpora, using masked language model perplexity as the evaluation metric, training on the parallel data with the BLEU score as the evaluation metric fails to converge properly. We argue that the core algorithm behind BERT and XLM, masked language modeling, does not work on our data set. The idea of masked language modeling is to randomly mask (hide, making it unknown) a few words in a sentence (either source or target) during training and then to recover these masked words from the surrounding context. Since our target sentences contain 2.25 words on average, masking one word can make recovery extremely difficult; even worse, over 31% of our target sentences consist of only one word, and masking the only word makes recovery impossible.

We have implemented a prototype Android mobile application to demonstrate the usefulness and applicability of this work. The prototype app supports causal chain prediction, patient search, patient information display, determining causes of death, and review/submit. The app allows a physician to add death-related information when filling out the "Pronouncing Death" screen, and the data bundle is compatible with FHIR servers. When generating a causal chain of death for clinical decision support, the app automatically retrieves medical condition codes from the FHIR server. Future versions of the mobile app will be capable of querying the pre-trained Python model for prediction; for demonstration purposes, we store predictions for our synthetic test patients in a local text file, and the app loads the indexed predicted causal chain of death from this file based on the source sentence of conditions. Figure 8 shows a screenshot of the FHIR Android app displaying the causal chain of death. The ICD-10 codes have already been mapped to human-readable short descriptions. From top to bottom, we show the ordered causes of death (from underlying cause of death to immediate cause of death).

The use of mobile apps such as this prototype for delivering public health informatics creates the opportunity for real-time, point-of-care feedback. In the immediate term, clinicians completing mortality reporting can be provided with decision support to improve the accuracy and completeness of the reported causes of death. In the future, such infrastructure may even be able to provide predicted causes of death for still-living patients, enabling predictive medical care.

In this paper, we are the first to propose the causal chain of death using neural machine translation frameworks, supporting timely, accurate and complete death reporting. The generated sequences achieve a BLEU score of 16.44, close to state-of-the-art performance in the natural language domain (an English-Czech translation task achieving a BLEU score of 17.7 with a comparable vocabulary size of around 10k). In addition, we incorporate medical domain knowledge as a constraint when generating the output sequence.
Furthermore, we visualize the results with the attention mechanism, providing a tool to explore the relationships between condition codes in the source sentence and those in the target sentence. Lastly, we demonstrate a FHIR-compatible mobile app to retrieve, modify and upload data.

Still, this work has a few limitations. The visualization of the attention mechanism clearly shows failures of alignment during translation, largely due to the extremely imbalanced lengths of source and target sentences. Furthermore, even though the cross-lingual language model (XLM) has proven effective in natural language translation, it fails on our task; one potential cause is that masked language modeling may not work on extremely short sentences (on average 2.25 words per sentence). One unsolved problem is the one-word target sentence. Only rarely does a natural language sentence consist of a single word, yet 31.77% of the training data, 31.68% of the validation data and 31.27% of the testing data are one-word target sentences. These samples significantly undermine the effectiveness of neural machine translation models. Future work includes data augmentation so that the target sentence length suits the newest masked language models (such as XLM), and adapting the model framework to the imbalanced source and target sentence lengths. Furthermore, a fully automatic query between the iOS mobile app and the Python code should be implemented to fulfill translation requests for new data samples.

REFERENCES
[1] Deaths: final data for 2016.
[2] International statistical classification of diseases and related health problems: instruction manual.
[3] Transition to the ICD-10 in the United States: an emerging data chasm.
[4] Neural machine translation by jointly learning to align and translate.
[5] Semi-supervised sequence learning.
[6] Effective approaches to attention-based neural machine translation.
[7] Achieving open vocabulary neural machine translation with hybrid word-character models.
[8] BERT: pre-training of deep bidirectional transformers for language understanding.
[9] XLNet: generalized autoregressive pretraining for language understanding.
[10] Cross-lingual language model pretraining.
[11] A primer and comparative review of major US mortality databases.
[12] Intelligent mortality reporting with FHIR.
[13] HL7 clinical document architecture, release 2.
[14] Neural translation and automated recognition of ICD-10 medical entities from natural language.
[15] Using ACME (Automatic Classification of Medical Entry) software to monitor and improve the quality of cause of death statistics.
[16] Improving validity of cause of death on death certificates.
[17] Learning phrase representations using RNN encoder-decoder for statistical machine translation.
[18] Sequence to sequence learning with neural networks.
[19] On the properties of neural machine translation: encoder-decoder approaches.
[20] Attention is all you need.
[21] Hierarchical attention networks for document classification.
[22] BLEU: a method for automatic evaluation of machine translation.
[23] OpenNMT: open-source toolkit for neural machine translation.
[24] Neural machine translation of rare words with subword units.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The work was supported by the U.S.
Department of Health and Human Services Centers for Disease Control and Prevention under Award HHSD2002015F62550B, the NIH National Center for Advancing Translational Sciences under Award UL1TR000454, and the National Science Foundation under Award NSF1651360. The authors would like to thank Paula Braun (CDC) for her invaluable assistance and support in shaping this project. This article does not reflect the official policy or opinions of the CDC, NIH or NSF and does not constitute an endorsement of the individuals or their programs.