key: cord-0324232-f63meabu
authors: Hong, C.; Rush, E.; Liu, M.; Zhou, D.; Sun, J.; Sonabend, A.; Castro, V. M.; Schubert, P.; Panickan, V. A.; Cai, T.; Costa, L.; He, Z.; Link, N.; Hauser, R.; Gaziano, J. M.; Murphy, S. N.; Ostrouchov, G.; Ho, Y.-L.; Begoli, E.; Lu, J.; Cho, K.; Liao, K. P.
title: Clinical Knowledge Extraction via Sparse Embedding Regression (KESER) with Multi-Center Large Scale Electronic Health Record Data
date: 2021-03-13
journal: nan
DOI: 10.1101/2021.03.13.21253486
sha: 818b393f8e2379798682e042632b55aa760e903b
doc_id: 324232
cord_uid: f63meabu

Objective: The increasing availability of Electronic Health Record (EHR) systems has created enormous potential for translational research. Even with a working knowledge of EHR, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions to establish a cooperative and integrated knowledge network. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease or condition of interest. Method: We constructed large-scale code embeddings for a wide range of codified concepts, including diagnosis codes, medications, procedures, and laboratory tests, from the EHRs of two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis based on the trained code embeddings. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. In addition, we developed an integrated clinical knowledge map combining embedding data from both institutions. Results: The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Additionally, features identified automatically via KESER and used in the development of phenotype algorithms resulted in performance comparable to algorithms built upon features selected manually or identified via existing feature selection methods requiring patient-level data. The knowledge map created using the integrative analysis identified disease-disease and disease-drug pairs more accurately than maps built from single-institution data. Conclusion: Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among diseases, treatments, procedures, and laboratory measurements. This approach automates the grouping of clinical features, facilitating studies of the conditions of interest. KESER bypasses the need for patient-level data in individual analyses, providing a significant advance in enabling multi-center studies using EHR data.

Creating a clinical knowledge network using EHR data requires two major advancements. First, a general approach is needed to efficiently integrate the different types of structured data, also referred to as codified data, available in the EHR. Codified EHR data include ICD (International Classification of Disease) codes 8,9 for disease conditions, LOINC (Logical Observation Identifiers Names and Codes) 10 for laboratory tests, CPT (Current Procedural Terminology) 11 and CCS (Clinical Classifications Software) 12 for procedures, as well as RxNorm 13 and NDC (National Drug Code) for medications.
Approaches for extracting knowledge from codified EHR data using machine learning algorithms have been proposed in recent years 14-16. However, these algorithms focused on specific tasks and required training with patient-level EHR data. Second, establishing a highly cooperative and shareable clinical knowledge network across institutions requires methods that can ensure data privacy. Existing approaches for data mining require patient-level EHR data, posing significant administrative challenges for data sharing across research groups and institutions. To overcome these challenges, we propose to transform EHR data into embedding vectors 17, thus uncoupling the data from the individual patient. The downstream machine learning tasks would then use the embedding vectors as summary data rather than individual patient data. Our use of embedding in this study refers to projecting an EHR code into another representation space. In the past decade, embedding vectors have been successfully derived for clinical concepts from textual data and various sub-domains of codified EHR data 18-24. These embeddings were primarily derived for specific applications and not for the creation of knowledge networks. In addition, most existing word embedding algorithms tune the key hyper-parameters, e.g., the dimension of the embedding vectors, to optimize a specific downstream task. For example, Code2Vec 19 tuned the embedding dimension via a clustering task, and Med2Vec 21 chose the dimension via future code prediction. However, this approach may limit the applicability of the learned embedding vectors to other downstream tasks. This study therefore aims to derive large-scale code embeddings that are useful across a range of downstream tasks and to develop knowledge extraction methods built upon them.

Optimal Window Size and k. We conducted additional sensitivity analyses using different window sizes and values of k to construct the co-occurrence matrices, based on a total of about 70K patients from the MGB Biobank. Varying the window size over 7, 30, and 60 days and k over 1, 5, and 10, we observed that the embedding quality was best at k = 1 and was not sensitive to the choice of window size (Table S2 of the Supplementary Materials).

Table 1 summarizes the overall accuracy of between-vector cosine similarities in detecting known similarity and relatedness relationships with embedding vectors derived from either SVD-SPPMI or GloVE 25. We focus on GloVE trained with dimensions 50 and 100, since the GloVE algorithm did not converge at higher dimensions. For detecting similar pairs, the SVD-SPPMI based cosine similarities attained an AUC of 0.839 at MGB and 0.888 at VA with the dimension set at d_AUC. Thresholding the cosine similarities to classify pairs as similar, with cut-offs chosen to maintain a false positive rate (FPR) of 0.05 or 0.10, yielded sensitivities of 0.593 and 0.669 at MGB and 0.679 and 0.772 at VA. For relatedness, the cosine similarities based on SVD-SPPMI embeddings at d_95% achieved AUCs of 0.868 at MGB and 0.862 at VA, with sensitivities of 0.608 and 0.717 at MGB and 0.582 and 0.688 at VA at FPR = 0.05 and 0.10. Compared to GloVE, embeddings derived via SVD-SPPMI achieved similar AUCs but higher sensitivities. As shown in Table S2 of the Supplementary Materials, the accuracy is overall fairly high in assessing most types of relationships, including "may cause", "differential diagnosis", "complications", and "symptoms", with AUCs close to 0.9. The accuracy is lower in detecting "risk factors" and "similar drugs", with AUCs close to 0.8.
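To make the evaluation procedure concrete, the following is a minimal Python sketch, not the authors' code, of how cosine similarities between embedding vectors of known related pairs and randomly sampled pairs yield an AUC and a sensitivity at a fixed FPR; the `embeddings` dictionary and the pair lists are hypothetical inputs.

```python
# Minimal sketch of the embedding-quality evaluation described above:
# cosine similarities for known related pairs vs. randomly sampled pairs,
# summarized by AUC and by sensitivity at a fixed false positive rate.
import numpy as np
from sklearn.metrics import roc_auc_score

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_pairs(embeddings, known_pairs, random_pairs, fpr=0.05):
    pos = np.array([cosine(embeddings[a], embeddings[b]) for a, b in known_pairs])
    neg = np.array([cosine(embeddings[a], embeddings[b]) for a, b in random_pairs])
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
    auc = roc_auc_score(labels, scores)
    # Choose the cut-off so that a fraction `fpr` of random pairs exceed it,
    # then report the fraction of known pairs above the cut-off (sensitivity).
    cutoff = np.quantile(neg, 1 - fpr)
    sensitivity = float((pos > cutoff).mean())
    return auc, sensitivity
```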
Although assessed using different knowledge sources, these observed levels of accuracy are similar to those previously reported based on embedding vectors trained for natural language processing (NLP) concepts 20.

Code Translation Across Institutions. Analogous to machine translation between languages, we learned an orthogonal transformation between embedding vectors across the two institutions to enable mapping of a given VA code to the corresponding MGB code 26. As summarized in Table S3 of the Supplementary Materials, the top-1 and top-5 accuracy of code mapping is around 38% and 67% for VA medication codes → RxNorm, and around 42% and 74% for PheCode → PheCode, using embeddings of dimension d_AUC. The code mapping accuracy is fairly comparable when using the larger d_SNR. The observed code mapping accuracy is comparable to the translation accuracy between different languages reported in the literature 26,27.

The KESER approach was developed to select features using embeddings trained within a specific healthcare center, as well as by leveraging embeddings from multiple healthcare centers while incorporating between-site heterogeneity. In Table 2, we summarize the average sensitivities and FPRs of KESER integrative knowledge extraction using embedding data from both MGB and VA (KESER_INT) in detecting known associations. For comparison, we also provide results based on KESER performed using MGB data only (KESER_MGB) and VA data only (KESER_VA). The integrative analysis based on KESER_INT attained a sensitivity of 0.660 in detecting known related pairs while maintaining the FPR below 5%. The KESER_INT algorithm attained accuracy substantially higher than the KESER algorithms trained with single-institution data, and the accuracy is generally higher using embeddings from SVD-SPPMI than from GloVE. For one target phenotype, for example, the selected features included relevant differential diagnoses as well as important procedures such as colonoscopy, proctoscopy and colorectal resection.

Using codified EHR data from 68,213 MGB Biobank participants, we compared the performance of two supervised phenotype algorithms, the adaptive LASSO (aLASSO) and random forest (RF), trained with existing feature selection strategies to those trained with KESER-selected features. The existing feature selection strategies included the main PheCode of the disease only (PheCode), all features (FULL), and informative features selected manually or extracted using unsupervised algorithms such as SAFE 15. The accuracies of the aLASSO phenotyping algorithms trained with different feature sets are summarized in Figure 4, and more detailed comparisons, including the RF results, are given in Figure S10 of the Supplementary Materials. Given the same feature set, the RF algorithms generally performed slightly worse than the aLASSO algorithms, in part due to overfitting. The relative performance of the RF algorithms trained with different feature sets is similar to that of aLASSO. The algorithms generally attained higher performance using embeddings from SVD-SPPMI than from GloVE. The results are quite similar when using KESER_INT versus KESER_MGB, and hence using MGB embedding information may be sufficient for phenotyping at MGB.
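As a concrete illustration of this phenotyping step, the sketch below trains an adaptive LASSO classifier on a feature matrix restricted to KESER-selected codes. This is a minimal sketch, not the authors' implementation: it assumes a hypothetical patient-by-feature matrix `X` (e.g., log code counts) with gold-standard labels `y`, and uses the standard column-rescaling trick to express the adaptive (weighted) L1 penalty with scikit-learn's single-penalty solvers.

```python
# Minimal adaptive LASSO (aLASSO) phenotyping sketch on KESER-selected features.
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

def adaptive_lasso_phenotype(X, y, eps=1e-8):
    # Step 1: an initial ridge fit provides the adaptive weights |beta_init|.
    init = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
    w = np.abs(init.coef_.ravel()) + eps
    # Step 2: an ordinary L1 fit on the rescaled columns X * w is equivalent
    # to an adaptive (weighted) L1 fit on X, with cross-validated strength.
    model = LogisticRegressionCV(
        penalty="l1", solver="liblinear", Cs=10, cv=5, max_iter=5000
    ).fit(X * w, y)
    beta = model.coef_.ravel() * w  # map coefficients back to the original scale
    return beta, model
```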
We hence focus our discussions below on the aLASSO algorithms and, for KESER, on KESER_MGB with SVD-SPPMI embeddings for brevity. Across the 8 phenotypes, phenotyping algorithms trained via aLASSO with KESER_MGB-selected features attained higher AUCs and F-scores than those based on the PheCode alone or the FULL feature set, and similar AUCs to those trained with SAFE features. On average, the AUC of the KESER_MGB (SVD-SPPMI) based algorithms was 0.052, 0.144 and 0.007 higher than those based on the PheCode, FULL and SAFE features, respectively. The average F-score of the KESER_MGB based algorithms was 0.173, 0.157 and 0.013 higher than those based on the PheCode, FULL and SAFE features. The bootstrap-based 95% confidence intervals of these accuracy measures are shown in Figure 4.

We summarized the clinical knowledge network, namely a knowledge map, by performing node-wise KESER across all PheCodes and RxNorm codes (https://github.com/celehs/KESER). Figure 5 is a screenshot of the web API for a specific target drug, RxNorm 214555 (etanercept). The node-wise knowledge extraction aims to find the neighborhood of codes related to the target code etanercept.

The single-site analyses attained lower accuracy for lab codes. This is expected because the majority of the lab codes are unique to the site, resulting in high cross-site heterogeneity in lab coding. By integrating data from both sites, KESER_INT is able to achieve higher accuracy in reflecting clinical knowledge. These results demonstrate that KESER can successfully select informative and clinically meaningful features that can be used effectively for phenotyping and other downstream analyses.

The KESER approach efficiently summarizes patient-level longitudinal EHR data into hospital-specific embedding data and enables extraction of clinical knowledge based only on summary-level data. This summary data is generated based solely on relationships between codes and clusters related codes together, providing ready information on features that may be important for identifying or studying different phenotypes. The KESER approach enables assessment of conditional dependency between EHR features by performing sparse regression of embedding vectors without requiring additional patient-level data. In this paper we demonstrate the advantage of integrative analyses across sites in detecting known associations. Ultimately, we believe this innovation provides a potential solution for barriers facing much-needed multi-center collaborative studies using EHR data. The majority of EHR-based clinical studies are performed entirely behind the firewalls of individual institutions. Collaborations across centers typically require that each institution perform analyses individually, with results then compared across institutions. However, coding behaviors, disease management strategies, and healthcare delivery patterns 28 can vary across healthcare systems.
For example, at VA, medication procedures (such as infliximab injection) are coded as HCPCS procedure codes, while at MGB they are coded as local medication codes that map directly to RxNorm. At VA, the majority of patients are male, and thus the pattern of diseases and treatments may differ from MGB, where females are the majority. Variations between the two institutions were observed when validating the embedding vectors against known PheCode-RxNorm pairs (Table 1). While the knowledge derived from the embedding vectors captures all the relevant RA treatments at both VA and MGB, the weights of the individual treatments differed slightly between the two healthcare systems (Figure 3). Among the top-50 weighted treatments, 36 concepts were obtained from both healthcare systems. Integrating the data from both systems improves the robustness of the identified relationships and accounts for the heterogeneity of data in each system. Notably, since the embedding vectors contain no patient data, the integration of these data can be performed outside of each system.

Embedding vectors also provide information on highly related groups of codes. Unlike ICD codes, which have established groupings and hierarchies, lab codes are much less standardized, and no established grouping structure can be used at scale for research studies. As an example, for the inflammatory marker C-reactive protein (CRP), potential lab codes include LOINC:11039-5 (crp), LOINC:30522-7 (crp, high sens, cardio), and LOINC:X1166-8 (crp (mg/L)). Additionally, at both VA and MGB, individual labs within each institution also had unique lab codes that do not map to LOINC codes. The embedding vectors derived from the co-occurrence matrices enable grouping of codes based on the similarity between the vectors, thus allowing the use of grouped lab codes in research studies.

We also addressed the need to tailor the dimension of embedding vectors to the goals of a particular study. Currently, there is no clear evidence regarding how to select the optimal dimension for analyses using embeddings. Existing embedding-based approaches usually use 300-dimensional GloVE word embeddings 25 or 500-dimensional CUI embeddings from cui2vec 20. We demonstrate that different dimensions may be preferred for different tasks. Lower dimensions appear to be better suited to identifying near-synonymous concepts or translations, while higher dimensions are needed for assessing relatedness and for embedding regression aimed at feature selection and building knowledge networks. While lower dimensions may be useful for many downstream tasks such as code mapping between institutions, we recommend keeping embedding vectors at high dimensions for dissemination to enable better assessment of relatedness, while allowing users to further truncate to lower dimensions for other tasks, as illustrated in the sketch below.
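The following is a minimal sketch of this truncation and tuning idea, assuming the embeddings come from an SVD of the SPPMI matrix (factors `U` and singular values `s`, as described in the methods). Because SVD dimensions are ordered, a d-dimensional embedding is obtained by truncating and rescaling the first d columns, so a single high-dimensional SVD supports many candidate dimensions; the known-pair and random-pair lists are hypothetical inputs.

```python
# Minimal sketch: truncating SVD-based embeddings and tuning the dimension d
# by the signal-to-noise ratio SNR(d) = mean cosine similarity among known
# related pairs / mean cosine similarity among random pairs.
import numpy as np

def embed(U, s, d):
    return U[:, :d] * np.sqrt(s[:d])  # rows are d-dimensional code embeddings

def mean_cosine(W, pairs):
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return float(np.mean([Wn[i] @ Wn[j] for i, j in pairs]))

def select_dim_by_snr(U, s, related_pairs, random_pairs, dims):
    snrs = {}
    for d in dims:
        W = embed(U, s, d)
        snrs[d] = mean_cosine(W, related_pairs) / mean_cosine(W, random_pairs)
    return max(snrs, key=snrs.get), snrs
```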
To demonstrate the potential of KESER for studying emerging diseases, we constructed two separate co-occurrence matrices and derived embeddings via SVD-SPPMI using all EHR data up to November 2020 from 30K COVID+ patients at MGB and 100K COVID+ patients at VA. As a proof of concept, we identified the clinical concepts most related to the COVID code. As shown in Figure S11 of the Supplementary Materials, the results are encouraging in that the top selected codes include highly important laboratory tests for monitoring COVID progression (e.g., D-dimer, CRP, ferritin) and medications for managing COVID patients (e.g., norepinephrine, often used as a first-line vasoactive agent; cefepime, used for managing bacterial pneumonia complications; tocilizumab; dexamethasone; and remdesivir), as well as related diagnoses and complications (e.g., viral pneumonia, respiratory insufficiency, shock, and Kawasaki disease).

In conclusion, KESER provides an approach allowing investigators to integrate patient-level data as embedding vectors from multiple EHR systems for downstream analyses. We provide an example of using the knowledge network to automatically supply features that may be important for phenotyping, without requiring additional patient-level data. This innovation will facilitate multi-center collaborations and bring the field closer to the promise of creating distributed networks for learning across institutions while maintaining patient privacy. We highlight three key innovations detailed below in the methods. First, we provide an approach to integrate four domains of codified data, ICD, CPT, laboratory codes, and medications, from two large hospital systems. Second, we apply a data-driven approach to specify the dimension of embedding vectors. Third, we develop a method to use embedding vectors rather than patient-level data as the input to a sparse graphical model.

Data Sources. At VA, we used EHR data from the Corporate Data Warehouse (CDW) (1999-2019). The CDW supports both business operations and research. A total of 12.6 million patients with inpatient and outpatient codified data from at least 1 visit were included in this analysis. We defined outpatient visits to include services from all VA outpatient stop codes. There are over 500 outpatient stop codes covering a wide range of services, such as emergency department visits, therapy, and primary care. We first extracted records from the CDW. We then grouped each patient's records together in ascending chronological order. Codes occurring multiple times for the same patient within the same day are counted once per day. The resulting files were stored using parquet, a columnar storage format well suited to storing this data compactly while also allowing parallel processing (a minimal example is sketched below). At MGB, we used codified EHR data collected between 1998 and 2018 and stored patient visit-level data in the same format as described above for VA.
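As an aside on the storage layer, the sketch below writes such per-patient, per-day code records to partitioned parquet files with pandas/pyarrow; the column names and partitioning scheme are hypothetical, not the study's actual schema.

```python
# Minimal sketch of storing visit-level codified records in parquet, the
# columnar format described above. Partitioning into blocks lets downstream
# co-occurrence counting be parallelized across files.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "day":        [10, 10, 3],  # days since the patient's first visit
    "code":       ["PheCode:714.1", "RxNorm:214555", "LOINC:11039-5"],
})
records = records.drop_duplicates()              # codes counted once per day
records["block"] = records["patient_id"] % 100   # coarse partition key
records.to_parquet("codified_ehr", partition_cols=["block"], index=False)
```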
Code Roll-up. We gathered four domains of codified data, including diagnoses, procedures, lab measurements, and medications, from the VA and MGB EHRs. Since multiple EHR codes can represent the same broad concept (e.g., acute myocardial infarction (MI) of anterolateral wall and acute MI of inferolateral wall are separate codes that describe the same concept of MI), we rolled individual codes up to a code representing a more general concept. ICD codes were aggregated into PheCodes to represent more general diagnoses, e.g., MI rather than acute MI of inferolateral wall, using the ICD-to-PheCode mapping from the PheWAS catalog (https://phewascatalog.org/phecodes). We utilized multiple levels of granularity of PheCodes, including the integer level, one-digit level, and two-digit level. To reduce the effect of collinearity when conducting KESER regression, for phenotypes with multiple levels of PheCodes we only included the one-digit level PheCodes. For laboratory measurements, due to the difference in coding systems between VA and MGB, we created a code dictionary for each site. At VA, this was done by grouping local lab codes to manually annotated lab concepts or LOINC codes, together with individual lab codes that have not been annotated but occurred in at least 1,000 patients. At MGB, all local lab codes were aggregated into groups, and a LOINC code was assigned to each group. Since embeddings cannot be trained well for very low frequency codes, we only included codes occurring >1,000 times at MGB and >5,000 times at VA. Different thresholds were used because VA has a larger population and a larger number of codes than MGB. A total of 9,535 codes (1,776 PheCodes, 1,561 RxNorms, 5,974 labs and 224 CCS groups) at VA and 5,245 codes (1,772 PheCodes, 1,238 RxNorms, 1,992 labs and 243 CCS groups) at MGB passed the frequency control.

We obtained embeddings by performing singular value decomposition (SVD) on the shifted positive pointwise mutual information (SPPMI) matrix, known as the SVD-SPPMI algorithm. This approach provides embeddings that are efficient to compute and equivalent to those derived from the skip-gram algorithm with negative sampling 17,20,30,31.

Co-occurrence Matrix. We first constructed code co-occurrence matrices as described in Beam et al 14. For any given patient, we scanned through each of their codes as a target code. For any given target code occurring at time t, denoted by $w_t$, we counted all codes occurring within 30 days after $w_t$ as co-occurrences with $w_t$. The total numbers of co-occurrences for all possible pairs of codes are aggregated over all target codes within each patient and then across all patients, yielding the co-occurrence matrix, denoted by $\mathbb{C} = [C(i,j)]$. Although only codes that occur after the target code are counted during the scan, this is the same as finding co-occurring codes within 30 days of the target code (i.e., between -30 and 30 days), owing to the symmetry of the counting. Thus, given a target phenotype (e.g., PheCode 714.1 for RA), we take the context code vocabulary to be the codes that co-occurred with the target code within a 30-day window. This step requires considerable computational resources, and a detailed algorithm for efficiently computing the co-occurrence matrix was created for this study (https://github.com/rusheniii/LargeScaleClinicalEmbedding); a simplified version is sketched below. Since our sparse regression procedures (described in later sections) require selection of tuning parameters, we constructed two separate co-occurrence matrices at each site.
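The following simplified sketch illustrates the counting procedure just described (the optimized implementation is in the repository linked above). It assumes each patient's records arrive as hypothetical (day, code) pairs; codes are deduplicated per day and pairs within the 30-day window are counted.

```python
# Minimal co-occurrence counting sketch: per patient, codes are counted once
# per day, and every pair of codes occurring within `window` days contributes
# to the symmetric co-occurrence counts. Scanning only forward in time
# suffices by symmetry.
from collections import Counter

def cooccurrence(patients, window=30):
    """patients: iterable of lists of (day, code) tuples, one list per patient."""
    counts = Counter()
    for records in patients:
        events = sorted(set(records))          # once per code per day
        for i, (day_i, code_i) in enumerate(events):
            for day_j, code_j in events[i + 1:]:
                if day_j - day_i > window:
                    break                      # events are sorted by day
                counts[(code_i, code_j)] += 1
                counts[(code_j, code_i)] += 1  # keep the matrix symmetric
    return counts
```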
At VA, from the 12.6 million patients, we used data from 11.6 million patients to create a training matrix $\mathbb{C}_{VA}^{train}$ and data from the remaining 1 million patients to create a validation matrix $\mathbb{C}_{VA}^{valid}$. At MGB, we used half of the patients to create the training matrix and the other half to create the validation matrix, respectively denoted by $\mathbb{C}_{MGB}^{train}$ and $\mathbb{C}_{MGB}^{valid}$.

From a co-occurrence matrix $\mathbb{C} = [C(i,j)]$, we computed the SPPMI matrix as $\mathrm{SPPMI}(i,j) = \max\{\log \frac{C(i,j)\,C(\cdot,\cdot)}{C(i,\cdot)\,C(\cdot,j)} - \log k, 0\}$ with the negative sample k set as 1 (i.e., no shifting), where $C(i,\cdot)$ is the row sum of $[C(i,j)]$ and $C(\cdot,\cdot)$ is the total sum. For each given SPPMI matrix, we obtain its rank-d SVD, $\mathrm{SPPMI} \approx U\,\mathrm{diag}(\sigma_1,\ldots,\sigma_d)\,V^\top$, and then construct the d-dimensional embedding vectors as the rows of $W = U\,\mathrm{diag}(\sqrt{\sigma_1},\ldots,\sqrt{\sigma_d})$.

We propose to infer conditional dependency among the clinical codes based on the conditional dependency among their corresponding embedding vectors. To provide a rationale for this framework, we note that the skip-gram model with negative sampling 16 implicitly factorizes the SPPMI matrix, so the embedding vectors preserve the co-occurrence structure of the codes.

To choose the dimension d, we first initialized the dimension by retaining 95% of the variation in the SVD, denoted by $d_{95\%}$. Subsequently, we considered two data-driven strategies for optimizing the dimension up to $d_{95\%}$, maximizing (i) the signal-to-noise ratio (SNR) and (ii) the AUC, where $\mathrm{SNR}(d) = \mu_{rel}/\mu_{rand}$, and $\mu_{rel}$ and $\mu_{rand}$ are the average cosine similarities among all pairs with known relationships and among random pairs, respectively. For similarity, we used the PheCode hierarchy to tune the optimal dimension, defining pairs as similar if they shared the same integer-level PheCode when calculating the SNR and AUC. For relatedness, we used 10% of the known related PheCode-PheCode pairs from Wikipedia and PheCode-RxNorm pairs from https://www.drugs.com/ and MEDRT to tune the dimension, and used the remaining known related pairs for validation.

We evaluated the quality of the derived embedding vectors by quantifying their accuracy in detecting known similar pairs (RxNorm-RxNorm and Lab-Lab) and related pairs (PheCode-PheCode, PheCode-RxNorm), and evaluated the KESER algorithm by quantifying its power in detecting known related pairs as described above. For each type of relationship, since the vast majority of pairs are unrelated, we randomly sampled a large number of pairs within each relationship type to obtain the reference distribution for unrelated pairs, and obtained the cosine similarities of the embedding vectors between known pairs and between random pairs. We first calculated the area under the receiver operating characteristic curve (AUC) as an overall accuracy summary. We then reported the sensitivity of detecting related pairs by thresholding the cosine similarities to achieve a false positive rate (FPR) of 0.01, 0.05 or 0.10. We also evaluated the performance of KESER for feature selection at each site and of integrative feature selection using both sites, reporting the sensitivity in detecting known related PheCode-PheCode and PheCode-RxNorm pairs, that is, the proportion of pairs detected by KESER among all known pairs.
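To make the pipeline concrete, the sketch below combines the two steps just described: constructing d-dimensional embeddings from the SPPMI matrix via SVD, and selecting features for a target code by a node-wise sparse (lasso) regression of its embedding vector on the embeddings of all other codes. This is a minimal illustration under simplifying assumptions (dense matrices, plain cross-validated lasso), not the released KESER package (https://github.com/celehs/KESER).

```python
# Minimal SVD-SPPMI embedding + node-wise sparse embedding regression sketch.
import numpy as np
from sklearn.linear_model import LassoCV

def sppmi_svd(C, d, k=1.0):
    """C: symmetric co-occurrence matrix (codes x codes), as a dense array."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    pmi = np.log(C * total / (row * row.T) + 1e-12)
    sppmi = np.maximum(pmi - np.log(k), 0.0)     # k = 1 means no shifting
    # For vocabularies of realistic size, use a truncated/randomized SVD here.
    U, s, _ = np.linalg.svd(sppmi)
    return U[:, :d] * np.sqrt(s[:d])             # d-dimensional embeddings

def keser_select(W, target):
    """Regress the target code's embedding on the other codes' embeddings.

    W: (n_codes x d) embedding matrix; target: row index of the target code.
    Each of the d embedding dimensions serves as one regression observation;
    codes with nonzero lasso coefficients are the selected features.
    """
    y = W[target]                                 # length-d response
    others = np.delete(np.arange(W.shape[0]), target)
    X = W[others].T                               # d observations x (n-1) codes
    fit = LassoCV(cv=5).fit(X, y)
    return others[np.abs(fit.coef_) > 0]
```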
Figure 4. Comparison of AUC-ROCs, AUC-PRCs and F-scores against gold-standard labels for adaptive LASSO phenotyping algorithms for 8 diseases using the main PheCode only (PheCode), all features (FULL), SAFE-selected features (SAFE), and KESER_MGB and KESER_INT selected features based on SVD-SPPMI embeddings, as well as KESER_MGB and KESER_INT selected features based on GloVE embeddings. F-scores are calculated at the cut-off points where the estimated prevalence equals the population prevalence. Bootstrap-based 95% confidence intervals (bars) are shown.

References
- Considerations for the analysis of longitudinal electronic health
- Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review
- Using electronic health records to drive discovery in disease genomics
- Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data
- EHRs connect research and practice: Where predictive modeling, artificial intelligence, and clinical decision support intersect
- Building the Partners Healthcare Biobank at Partners Personalized Medicine: Informed Consent, Return of Research Results, Recruitment Lessons and Operational Considerations
- Electronic health records to facilitate clinical research
- Clinical Classifications Software (CCS)
- Utilizing RxNorm to support practical computing applications: capturing medication history in live electronic health records
- Learning probabilistic phenotypes from heterogeneous EHR data
- Surrogate-assisted feature extraction for high-throughput phenotyping
- Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network
- Distributed representations of words and phrases and their compositionality
- Building the graph of medicine from millions of clinical narratives
- Embedding and clustering medical diagnosis data
- Multi-layer representation learning for medical concepts
- Regularization and variable selection via the elastic net
- A survey of named entity recognition and classification
- High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)
- Improvements on cross-validation: the 632+ bootstrap method
- High-throughput multimodal automated phenotyping (MAP) with application to PheWAS