key: cord-0559149-fjqnjhtd
authors: Apostolova, Emilia; Karim, Fazle; Muscioni, Guido; Rana, Anubhav; Clyman, Jeffrey
title: Self-supervision for health insurance claims data: a Covid-19 use case
date: 2021-07-19
journal: nan
DOI: nan
sha: d8aea0b8fe1005e9fc092f91fba4881fbf83d5ae
doc_id: 559149
cord_uid: fjqnjhtd

In this work, we modify and apply self-supervision techniques to the domain of medical health insurance claims. We model a patient's healthcare claims history as analogous to a free-text narrative, and introduce pre-trained 'prior knowledge', which is later used for patient outcome prediction on a challenging task: predicting Covid-19 hospitalization given a patient's pre-Covid-19 insurance claims history. Results suggest that pre-training on insurance claims not only produces better prediction performance but, more importantly, improves the model's 'clinical trustworthiness' and stability/reliability.

Self-supervision or pre-training on large unlabeled corpora (word2vec [1], Glove [2], Elmo [3], Bert [4] and related models, GPT1-3 [5] [6] [7], etc.) has led to continuously improving state-of-the-art results on numerous Natural Language Processing (NLP) tasks. The success of pre-training and self-supervision has recently been extended to other fields, such as imaging [8], human activity recognition [9], molecular data [10], time series clinical data [11], etc.

In this work, we modify and apply NLP self-supervision techniques to the domain of medical health insurance claims (a subset of clinical data). The US health insurance process requires providers (physicians and hospitals) to submit detailed visit claim information for the purposes of health insurance payments. Typically, an insurance claim contains billing codes for various medical diagnoses, procedures, and medications relevant to the billing process. These billing codes constitute a subset of the patient's electronic medical record (EMR) and exclude more comprehensive clinical information, such as vital signs and clinical notes. The claims history of a patient can be used for a variety of patient outcome predictions that can help guide and advise patient and provider behaviour for improved health outcomes and healthcare affordability.

We model patients' anonymized health care claims history as a 'free-text narrative' and apply self-supervision to introduce prior knowledge, later utilized for patient outcome predictions. The health insurance claim 'narrative' consists of a sequence of diagnosis, procedure, and medication codes submitted for billing purposes, together with some basic demographic information, such as age and gender. An example of the information used from a set of anonymized health insurance claims is shown below:

In this study, we focus on utilizing medical health insurance claims pre-training for predicting hospitalization due to a Covid-19 infection, as efforts to reduce mortality due to Covid-19 include early identification of, and outreach to, patients who have the highest risk of developing severe complications from the disease [12]. Predicting post-Covid-19 hospitalization given a patient's pre-Covid-19 insurance claims history is an extremely challenging task due to both the clinical complexity of the disease [13] and the inherently limited and noisy nature of insurance claims (containing only a subset of the patient's EMR, relevant to billing purposes).

The work relevant to this study falls into 2 categories: machine learning models focusing on health insurance claims, and self-supervision on clinical data, in particular data present in insurance claims, such as diagnosis, procedure, and medication codes.

The majority of the literature focusing on health insurance claims aims to predict fraud, anomalies, and errors in health insurance claims [14] [15] [16] [17] [18] [19] and typically uses traditional data mining and machine learning approaches. A few studies focus on predicting medical outcomes from claims. Hung et al. [20] show that a deep neural net and Gradient Boosting Machines (GBM) outperform Support Vector Machines (SVM) and logistic regression on the task of stroke prediction from electronic medical claims. Vekeman et al. [21] use a random forest classifier for identifying patients with Lennox-Gastaut syndrome in health insurance claims. Valdez et al. [22] analyze the conditions of myalgic encephalomyelitis and chronic fatigue syndrome in insurance claims, and conclude that the symptom information in claims is insufficient to identify diagnosed patients. Nagata et al. [23] apply GBM and LSTM models to predict the risk of type-2 diabetes using claims data.

In terms of self-supervision on clinical data, a number of studies focus on low-dimensional representation learning of medical concepts and medical codes [24] [25] [26] [27] [28], utilizing word2vec, Glove, a continuous bag-of-words model with time-aware attention, as well as graph-based attention models utilizing medical ontologies. More recently, BEHRT [29] applies Bert-like transformer pre-training to Electronic Health Records (EHR) using a masked language model, and outperforms previous deep EHR representations, such as [30], which combines word2vec embeddings with a CNN. G-BERT [32] is a model that combines Graph Neural Networks and BERT to learn medical representations from MIMIC III. Med-Bert [31] is another BERT-like model, pre-trained on data from 28 million patients, that outperforms BEHRT and G-BERT. We were unable to find studies that focus specifically on self-supervision for medical claims.

In this study, a historical anonymized claims dataset is used to pre-train the model. This dataset contains information from 50 million claims submitted in 2019 and 2020 to a major US health insurance provider. A separate internal dataset of 471,971 anonymized Covid-19 positive patients (based on lab result or diagnosis) is used to build a model that detects patients who are at risk of being hospitalized due to Covid-19 complications. This dataset contains the prior 3 years of claim records (diagnosis, procedure, and medication codes) of each Covid-19 positive patient, together with their age and gender. To avoid data leakage, claims up to 7 days prior to the Covid-19 positive diagnosis date are dropped, as they may contain information relevant to current Covid-19 infection signs and symptoms. Further, age is discretized into clinically meaningful age ranges [33]. Covid-19 related hospitalizations were identified based on the primary diagnosis associated with the hospitalization claim. Conversely, an individual was considered not hospitalized if the individual had non-hospitalization claims subsequent to the Covid-19 positive diagnosis date or did not have any claims 30 days after the Covid-19 positive diagnosis date. The Covid-19 hospitalization rate for the dataset is 15%. This rate is significantly higher than that reported in the US [34] due to an inherent bias in the dataset, which contains patients whose Covid-19 positivity was determined solely by the primary diagnosis of hospitalization. The rate is also overestimated because only Covid-19 tests submitted as insurance claims are captured (excluding Covid-19 tests without insurance claims and individuals with mild symptoms who were not tested).
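The leakage filter described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the claim record layout (a dict with a `date` field), the function name `pre_covid_claims`, the treatment of the boundary day, and the approximation of 3 years as 3 x 365 days are all hypothetical.

```python
from datetime import date, timedelta


def pre_covid_claims(claims, diagnosis_date, gap_days=7, history_years=3):
    """Keep claims from the history window, excluding the days just before diagnosis.

    Assumptions (not stated in the paper): each claim is a dict with a
    `date` field, a claim dated exactly `gap_days` before the diagnosis
    is dropped, and a year is approximated as 365 days.
    """
    cutoff = diagnosis_date - timedelta(days=gap_days)
    start = diagnosis_date - timedelta(days=365 * history_years)
    return [c for c in claims if start <= c["date"] < cutoff]


# Toy usage with made-up claims for one patient.
diagnosis = date(2020, 6, 15)
claims = [
    {"date": date(2020, 6, 10), "code": "R05"},    # within 7 days: dropped
    {"date": date(2020, 6, 7), "code": "I10"},     # kept
    {"date": date(2019, 1, 1), "code": "E11.9"},   # kept
    {"date": date(2016, 1, 1), "code": "J45.909"}, # older than 3 years: dropped
]
kept = pre_covid_claims(claims, diagnosis)
```

Filtering on dates before any feature extraction keeps the same cutoff in force for every model that consumes the claims history.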

We compare the performance of 4 prediction models on the task of identifying post-Covid hospitalization, given the prior 3 years of medical claims history.

As a simple baseline method, we used mappings of diagnosis and procedure codes to the set of known Covid-19 risk factors, e.g. all neoplasm ICD-10 codes (C00-D49) were converted to the risk-factor variable 'cancer'. A total of 25 risk factor variables, together with age and gender, were used to build a logistic regression model on the task. A second baseline method utilizes all available diagnosis, procedure, and drug codes as a bag-of-words representation of the historical claims 'narrative' and a Support Vector Machines model [35] . The baseline methods do not use pre-training and utilize only the dataset of 471,971 Covid-19-diagnosed patients.
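As an illustration of the first baseline's feature construction, the sketch below maps ICD-10 codes to binary risk-factor variables. Only the neoplasm range (C00-D49 mapped to 'cancer') comes from the text; the table name `RISK_FACTOR_RANGES`, the helper names, and the range-matching logic are assumptions, and the remaining risk factors used in the study are not reproduced here.

```python
# Hypothetical lookup table: risk-factor name -> inclusive ICD-10 category range.
# Only the 'cancer' entry is taken from the text.
RISK_FACTOR_RANGES = {
    "cancer": ("C00", "D49"),
}


def icd10_category(code):
    """Return the 3-character ICD-10 category, e.g. 'C50.911' -> 'C50'."""
    return code.replace(".", "").upper()[:3]


def risk_factors(codes):
    """Map a patient's ICD-10 codes to a dict of 0/1 risk-factor flags."""
    flags = {name: 0 for name in RISK_FACTOR_RANGES}
    for code in codes:
        category = icd10_category(code)
        for name, (low, high) in RISK_FACTOR_RANGES.items():
            # Categories share a fixed 3-character layout, so plain
            # string comparison orders them correctly within a range.
            if low <= category <= high:
                flags[name] = 1
    return flags


flags = risk_factors(["C50.911", "E11.9"])
```

The resulting flags, together with age and gender, would form the input row for a logistic regression model.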

The third approach utilizes pre-training on diagnosis, procedure, and medication codes, analogous to word embeddings. Word2vec embeddings for diagnosis (ICD-10), procedure (Healthcare Common Procedure Coding System), and medication (National Drug Code Directory) codes, each of size 1000, were generated utilizing data from close to 50 million historical claims. The embeddings were then applied to the Covid-19 positive patients by averaging the embeddings for each type of code (diagnosis, procedure, and medication) over the prior 3 years of the patient's claims history. The 3 types of averaged embeddings were concatenated together with the demographic information (age and gender) and used in a Gradient Boosting Machine (GBM) model [36] to predict post-Covid-19 hospitalization status.
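The averaging and concatenation step can be sketched as below. This is a toy illustration, assuming the embeddings are available as per-type dictionaries from code to vector; the function names, the handling of codes missing from the embedding table (skipped, with a zero vector when none are found), and the numeric encoding of the demographics are assumptions not specified in the text.

```python
import numpy as np

CODE_TYPES = ("diagnosis", "procedure", "medication")


def average_embedding(codes, table, dim):
    """Mean of the embeddings of the known codes; zeros if none are known (assumption)."""
    vectors = [table[c] for c in codes if c in table]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def patient_features(codes_by_type, tables, dim, demographics):
    """Concatenate one averaged embedding per code type with the demographics."""
    parts = [
        average_embedding(codes_by_type.get(t, []), tables[t], dim)
        for t in CODE_TYPES
    ]
    parts.append(np.asarray(demographics, dtype=float))
    return np.concatenate(parts)


# Toy 4-dimensional embeddings (the study uses size 1000).
tables = {
    "diagnosis": {"E11.9": np.array([1.0, 0.0, 0.0, 0.0]),
                  "I10": np.array([0.0, 1.0, 0.0, 0.0])},
    "procedure": {"99213": np.ones(4)},
    "medication": {},
}
history = {"diagnosis": ["E11.9", "I10", "UNKNOWN"],
           "procedure": ["99213"],
           "medication": ["not-in-table"]}
x = patient_features(history, tables, dim=4, demographics=[3, 1])
```

The resulting vector has 3 x `dim` embedding entries followed by the demographic entries, and is the kind of fixed-length input a gradient boosting model expects.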

Lastly, in the fourth approach, the dataset of 50 million historical claims was used to pre-train a transformer-based masked language model: RoBERTa [37]. Before pre-training, data from each of the 50 million claim records were randomly shuffled. RoBERTa was trained by masking 30% of the tokens, which include diagnosis/procedure/medication codes, age, and gender. The RoBERTa model was then fine-tuned on the Covid-19 dataset to predict post-Covid-19 hospitalization status.
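To make the masking objective concrete, the following sketch masks each token of a claim 'narrative' independently with probability 0.30. It is a simplification: every selected token is replaced by the mask token, whereas standard BERT-style masking also leaves some selected tokens unchanged or swaps them for random tokens. The token spellings, the `<mask>` string, and the function name are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder spelling, an assumption


def mask_tokens(tokens, mask_prob=0.30, rng=None):
    """Return (masked_tokens, labels) for masked language model training.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so the loss is computed only on masked positions.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical claim narrative: demographics followed by billing codes.
narrative = ["age_50_64", "gender_F", "E11.9", "99213", "00093-7214"]
masked, labels = mask_tokens(narrative, rng=random.Random(0))
```

Because age and gender are ordinary tokens in the sequence, they are masked and predicted in the same way as the billing codes.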

Due to the sensitive, clinical nature of the dataset/task and the inherent bias of healthcare claims data, the models' performance needed to be evaluated not only in terms of metrics, such as precision and recall, but also in terms of 'clinical trustworthiness' and model stability/reliability. In an attempt to generate explainable model predictions, we applied the LIME feature attribution model [38] on a random sample of 100 positive and 100 negative predictions for the machine learning models described above. Two clinicians were invited to review the model predictions and determine which model is most clinically 'trustworthy' by reviewing explainability results. Unfortunately, due to the size, variability, and both limited and noisy nature of the claims data, the clinicians were not able to utilize the LIME explanations. As a substitute for human evaluation, we instead measured model stability/reliability by introducing input feature perturbations. For each of our Covid-19 training samples, we substituted each diagnosis/procedure/medication code with the code closest in the corresponding embedding space. Table 1 shows an example of a feature input (medical claim), together with a perturbation automatically generated by substituting each code with the code closest in the pre-trained embedding spaces for diagnoses, procedures and medications. We then measured the differences in prediction probability between the original input and the perturbed input, as well as the differences in the corresponding variable importance scores. The expectation is that such small variations in input should result in minor differences in output and variable significance. The perturbations also try to mimic real-world coding discrepancies, as medical billing coders have some freedom as to how to code a claim, and the choice of a particular billing code from a set of similar codes is often subjective or circumstantial [39].

Table 1. An example of a feature input (a medical claim), together with a perturbation automatically generated by substituting each code with the code closest in the pre-trained embedding spaces for diagnoses, procedures and medications respectively.
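The code substitution used for the perturbations can be sketched as a nearest-neighbour lookup in an embedding matrix. The text does not state which distance is used, so cosine similarity is an assumption here, as are the function names and the choice to leave codes without an embedding unchanged.

```python
import numpy as np


def nearest_code(code, vocab, matrix):
    """Return the code most similar to `code` (cosine similarity), excluding itself.

    `vocab` is a list of codes and `matrix` holds one embedding per row,
    in the same order. Cosine similarity is an assumed choice of metric.
    """
    idx = vocab.index(code)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # never return the code itself
    return vocab[int(np.argmax(sims))]


def perturb(codes, vocab, matrix):
    """Substitute every known code with its nearest neighbour; keep unknown codes."""
    return [nearest_code(c, vocab, matrix) if c in vocab else c for c in codes]


# Toy 2-dimensional embedding space with made-up codes.
vocab = ["A", "B", "C"]
matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
perturbed = perturb(["A", "C", "Z"], vocab, matrix)
```

In the study's setting, one such lookup would be run per code type, each against its own embedding space.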

The source code for all experiments will be made available at the time of publication.

Table 2 shows the performance of the four models. 70% of the 471,971 Covid-19 positive patients were used for training and cross-validation, and the remaining 30% were used for testing. The data used for pre-training consists of claims submitted prior to the first Covid-19 diagnosis in the dataset. As shown, the task proved to be a challenge for all algorithms, with modest precision and recall scores. Results are comparable to those reported in the literature on the same task using much cleaner, EMR-based datasets [13]. Clinicians concurred that the task is challenging for human experts, as it is extremely difficult to predict Covid-19 related hospitalizations based solely on the pre-Covid medical history, lacking Covid-related signs, symptoms, and vital signs. The task is further complicated by the noisy and limited nature of medical claims history. Of the two pre-trained models, only the GBM model was able to surpass the SVM and logistic regression baselines.

Table 3. Differences in predicted probability (in percent) between original and perturbed input pairs for the three algorithms. The row Predict Agreement shows the prediction agreement between the original and perturbed inputs at a probability threshold of 0.5. The row Var Importance MSE shows the mean squared error of the LIME variable importance of the original vs. the perturbed input.

Pre-training, however, seemed to have a more significant impact on the 'stability' and 'trustworthiness' of the model and on its explainability. Table 3 summarizes the prediction probability differences between the original input and the perturbed input, produced by substituting each diagnosis/procedure/medication code with the code closest in the corresponding embedding space. As individual codes are not used in the logistic regression model, that model was excluded from this evaluation. The differences are summarized in terms of the mean difference between the prediction probability values of the original and perturbed inputs (Predict Prob Diff Mean) and in terms of the prediction agreement between the original and perturbed inputs at a probability threshold of 0.5 (Predict Agreement). The table also shows the mean squared error computed by comparing the LIME variable importance scores of the original input vs. the perturbed input (Var Importance MSE). Statistics were produced based on 5,000 random samples from the test set. While the baseline bag-of-words SVM approach (without pre-training) exhibits the lowest probability output variability, the methods using pre-training exhibit higher prediction agreement on the original vs. perturbed inputs, as well as less variability in terms of input variable importance. This could suggest that the predictions of the pre-trained models are more 'stable' in terms of both binary prediction outcome and model explainability.
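The three stability statistics named above can be computed along the following lines. This is a sketch under assumptions: the mean difference is taken over absolute differences, a probability equal to the threshold counts as a positive prediction, and the variable importance scores of the original and perturbed inputs are aligned position by position. None of these details are specified in the text, and the function name is hypothetical.

```python
import numpy as np


def stability_metrics(p_orig, p_pert, imp_orig, imp_pert, threshold=0.5):
    """Summarize how much predictions and importances move under perturbation."""
    p_orig = np.asarray(p_orig, dtype=float)
    p_pert = np.asarray(p_pert, dtype=float)
    imp_orig = np.asarray(imp_orig, dtype=float)
    imp_pert = np.asarray(imp_pert, dtype=float)
    return {
        # Mean absolute change in predicted probability (assumed definition).
        "predict_prob_diff_mean": float(np.mean(np.abs(p_orig - p_pert))),
        # Share of samples whose thresholded prediction is unchanged.
        "predict_agreement": float(
            np.mean((p_orig >= threshold) == (p_pert >= threshold))
        ),
        # MSE between position-aligned variable importance scores.
        "var_importance_mse": float(np.mean((imp_orig - imp_pert) ** 2)),
    }


# Toy values for three samples with two importance scores each.
metrics = stability_metrics(
    p_orig=[0.9, 0.4, 0.6],
    p_pert=[0.8, 0.6, 0.7],
    imp_orig=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    imp_pert=[[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]],
)
```

A model can score well on the first statistic and poorly on the second when many of its probabilities sit close to the threshold, which is why the two are reported separately.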

Lastly, as a sanity check, we evaluated the predictions of the 3 models using as input variations of diagnosis and procedure codes for all conditions associated with a high risk of Covid-19 hospitalization, such as cancer, chronic kidney disease, COPD, etc. The logistic regression model was again excluded from this analysis, as the model is explicitly based on known Covid-19 risks. As expected, in all cases the models predicted Covid-19 related hospitalization. However, the SVM baseline model probability averaged 68%, while the probability of the pre-trained models was significantly higher, averaging 94% and 78% for GBM and RoBERTa respectively, indicating that the pre-trained models are more confident in predicting such 'clear-cut' hospitalization examples.

This work demonstrated the utility of self-supervision of medical insurance claims data, which can allow Health Insurance Providers to improve ML model performance on a variety of prediction outcome tasks, aiming to improve patient outcomes and health insurance affordability. Pre-training improved both model prediction performance and model stability on the challenging task of predicting Covid-19 hospitalizations.

[1] Efficient estimation of word representations in vector space

[2] Glove: Global vectors for word representation

[3] Deep contextualized word representations

[4] Pre-training of deep bidirectional transformers for language understanding

[5] Improving language understanding by generative pre-training

[6] Language models are unsupervised multitask learners

[7] Language models are few-shot learners

[8] Generative pretraining from pixels

[9] Masked reconstruction based self-supervision for human activity recognition

[10] Self-Supervised Graph Transformer on Large-Scale Molecular Data

[11] A comprehensive evaluation of multi-task learning and multi-task pre-training on EHR time-series data

[12] Distribution of patients at risk for complications related to COVID-19 in the United States: Model development study. JMIR public health and surveillance

[13] Personalized Predictive Models for Symptomatic COVID-19 Patients Using Basic Preconditions: Hospitalizations, Mortality, and the Need for an ICU or Ventilator. medRxiv

[14] Data mining to predict and prevent errors in health insurance claims processing

[15] Predicting medical provider specialties to detect anomalous insurance claims

[16] A fraud detection approach with data mining in health insurance

[17] A scoring model to detect abusive billing patterns in health insurance claims

[18] Decision support system (DSS) for fraud detection in health insurance claims using genetic support vector machines (GSVMs)

[19] Using massive health insurance claims data to predict very high-cost claimants: a machine learning approach

[20] Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database

[21] Development of a classifier to identify patients with probable Lennox-Gastaut syndrome in health insurance claims databases via random forest methodology. Current medical research and opinion

[22] Estimating prevalence, demographics, and costs of ME/CFS using large scale medical claims data and machine learning

[23] Prediction models for risk of type-2 diabetes using health claims

[24] Learning low-dimensional representations of medical concepts

[25] Multi-layer representation learning for medical concepts

[26] GRAM: graph-based attention model for healthcare representation learning

[27] Embedding and clustering medical diagnosis data

[28] Knowledge-based attention model for diagnosis prediction in healthcare

[29] BEHRT: transformer for electronic health records

[30] Deepr: a convolutional net for medical records

[31] Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction

[32] Pre-training of graph augmented transformers for medication recommendation

[33] Redefining meaningful age groups in the context of disease

[34] Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019-COVID-NET, 14 States

[35] LIBLINEAR: A library for large linear classification

[36] Greedy function approximation: a gradient boosting machine

[37] A robustly optimized BERT pretraining approach

[38] Explaining the predictions of any classifier

[39] Regional variation in medical classification agreement: benchmarking the coding gap