key: cord-0845402-kttjkhai
title: EMR2vec: Bridging the Gap Between Patient Data and Clinical Trial
authors: Dhayne, Houssein; Kilany, Rima; Haque, Rafiqul; Taher, Yehia
date: 2021-03-15
journal: Comput Ind Eng
DOI: 10.1016/j.cie.2021.107236
sha: 6ca0e4f9fc86dd3f609d47d5ae2fd9fa40afbc38
doc_id: 845402
cord_uid: kttjkhai

The human suffering caused by life-threatening viruses such as SARS, Ebola, and COVID-19 motivated many of us to study and discover the best means to harness the potential of data integration to assist clinical researchers in curbing these viruses. Integrating patient data with clinical trials data is enormously promising, as it provides a comprehensive knowledge base that accelerates clinical research's ability to respond to emerging infectious disease outbreaks. This work introduces EMR2vec, a platform that customises advanced NLP, machine learning and semantic web techniques to link potential patients to suitable clinical trials. Linking these two different but complementary datasets allows clinicians and researchers to compare patients to clinical research opportunities or to automatically select patients for personalized clinical care. The platform derives a 'bag of medical terms' (BoMT) from eligibility criteria by normalizing extracted entities through the SNOMED-CT ontology. Using the BoMT, an ontological reasoning method is proposed to represent EMRs and clinical trials in a vector space model. The platform presents a matching process that reduces vector dimensionality using a neural network, then applies orthogonal projection to measure the similarity between vectors. Finally, the proposed EMR2vec platform is evaluated with an extendable prototype based on Big Data tools.

The growing adoption of electronic medical records (EMRs) in diverse healthcare institutions has undoubtedly reshaped the way healthcare is delivered and health data is documented, enabling the collection of medical data from millions of patients.
EMRs, including diagnosis codes, laboratory results and prescription data, are typically used for the systematic collection of patient health records in a digital format for the purpose of patient diagnosis and treatment. This collected electronic medical data holds great promise; EMRs not only contribute significantly to the provision of health care but can also be linked to datasets collected by other sectors to support a wide range of clinical research. During the last decade, the concept of evidence-based medicine (EBM) has aroused great interest because it integrates the best available evidence obtained by clinical research with the experience of the practitioner and the expectations of the patient [?]. Additionally, personalized medicine typically involves a combination of diagnostic steps to provide a patient-specific profile and an actual treatment step. Moreover, it is widely recognized that different patients respond differently to the same treatment. Therefore, integrating EMRs and clinical trials could be used to apply the outcomes of clinical trials as personalized recommendations by identifying patients for whom the benefits of treatment outweigh the harms, ultimately enabling more personalized clinical care. Motivation. Researchers need to take advantage of any data that is available. For instance, the advent of big EMR data offers an unprecedented opportunity to draw on the characteristics of real-world patients to guide and inform clinical research; this requires the linking and integration of big EMR data with clinical trial datasets. Integrated data could be extremely helpful in supporting investigators; it can provide a better understanding of actual patient populations, optimise the precision, recruitment feasibility and representation of eligibility criteria, and reduce the capture of redundant data.
It can also assist in verifying the feasibility of clinical trials, evaluating the efficacy and results of treatment, and carrying out post-marketing surveillance and long-term monitoring [?]. Indeed, several studies have described the advantage of leveraging EMRs to improve trial recruitment [?]. In clinical trials, the eligibility criteria specify the characteristics of patients for whom a research protocol may be applicable. The criteria differ from one study to another; they can include age, gender, medical history and current health status. More than 74% of eligibility criteria could be evaluated using available structured data elements in the EMR [?]; the most common categories are disease, symptom or sign (36%), therapy or surgery (13%), and medication (10%). When linking patients to clinical trials, it is helpful to match patient medical information to eligibility criteria, allowing clinicians and researchers to compare patients to clinical research opportunities or to automatically select patients for personalised clinical care [?]. Consequently, there is a need to develop scalable integrated healthcare platforms to manage and link EMR datasets with clinical trials. In such a platform, the linking process identifies the eligibility criteria for each trial and then automatically determines eligible patients based on information from the EMR. Each created link is made up of the clinical trial identity, the patient identity and a numeric value. This value represents the similarity score between the trial criteria and the patient's condition. Challenges. There are significant challenges in linking EMR data to clinical trials, which, to our knowledge, have not all been systematically addressed [?]. (1) Eligibility criteria are usually described as free text in order to be human-readable; consequently, they are both syntactically and semantically complex.
Computational processing requires the extraction and representation of the semantics of the eligibility criteria in a machine-processable manner. (2) There is a semantic 'gap' between current expressions of clinical trial eligibility criteria and clinical data from the EMR; while eligibility criteria are described by coarser (more generic) clinical concepts or by defining their characteristics (attributes), EMR data is presented as granular (more specific) information. This discrepancy requires matching at the semantic concept level rather than verification of the absence or presence of a criterion at the lexical level. (3) Health data comes in many forms: vital signs, diagnoses, procedures, prescriptions and various types of medical reports. While the main forms are structured and can easily be analysed over time, medical reports must be analysed and interpreted using advanced natural language processing tools. Proposal. In our previous work, we proposed a semantic-driven engine to integrate structured and unstructured patient data in order to reformulate an entire patient's medical record and query patient data across different data sources [?]. Furthermore, in another work, we proposed a framework for automated matching of patients to clinical trials based on unstructured data from both datasets [?]. That framework used BioBERT, a pre-trained biomedical language representation model, to match unstructured medical data. In this context, we found that vector representation methods such as word2vec, med2vec and BERT were very adequate for representing words and phrases as embeddings, but were not sufficient for representing complex objects such as a patient or a clinical trial, since these objects are usually composed of different types of information, both structured and unstructured.
At the experimental level, the vector space model has proven to be an effective and robust framework for representing entities as vectors and querying about them [?]. Therefore, in this research, we explore the potential of using this model to match and link EMR data to clinical trials. With this model, we represent EMR and clinical trial objects as vectors of feature values, where each feature corresponds to a dimension in the vector space model. Vector elements are generally represented by weights that describe the degree to which the corresponding feature describes the object (EMR or clinical trial). In the vector space model, object vector representation plays a crucial role in many tasks, from object matching and data clustering to similarity measuring [?]. We therefore present EMR2vec, a vector space platform that links two different but complementary datasets, patient data and clinical trials. To this end, we have customised and combined advanced technologies from NLP, machine learning and the Semantic Web, and have derived a bag of medical terms (BoMT) from eligibility criteria. Utilising the BoMT, we propose a method based on the vector space model to represent structured data from EMRs and unstructured eligibility criteria from clinical trials in order to develop an effective matching measure between patients and clinical trials. Platform overview. Fig. 1 shows an overview of the proposed EMR2vec platform. It includes five main stages: BoMT preparation, EMR vectorisation, clinical trial vectorisation, dimensionality reduction and data matching. Two datasets, clinical trials and EMR, represent the main inputs of the platform. 1) The BoMT preparation stage consists of extracting medical terms (features) from a set of inclusion and exclusion criteria in order to construct the features of the BoMT. Different processes are applied to these criteria, including classification, named entity recognition and normalisation.
2) In addition to the processes applied to construct the BoMT, negation detection is performed to support representing clinical trials in the vector space. 3) At the EMR and clinical trial vectorisation stages, BoMT features are used to represent data in a vector space. A medical ontology is used to convert the EMR to a vector by inferring relationships, as well as by measuring the similarity between EMR terms and the BoMT. 4) Since the high-dimensional space of the BoMT is susceptible to noise, a dimensionality reduction technique is applied in stage four to reduce the feature vector size and eliminate noise from the data. 5) At the data-matching stage, a projection similarity measure is used, whereby an orthogonal projection of the EMR vector onto the clinical trial vector calculates a value showing the degree of matching between these two vectors. Contributions. The main contributions of the paper are summarized as follows: • We first defined a pipeline describing a novel methodology to extract the main SNOMED-CT terms from the criteria of targeted clinical trials. • Independently of specific hospital information systems, we described a novel methodology to transform EMR data into vectors in a vector space whose dimensions are derived from clinical trials. • We investigated the power of combining machine learning with ontological reasoning techniques to match structured and unstructured medical data. • To find links between EMR and clinical trial datasets, we systematically analysed their common medical characteristics, then introduced respective geometric measures to match a patient to a clinical trial. The remainder of this paper is organized as follows. In Section 2, related background knowledge is covered. Section 3 deals with detailed descriptions of the BoMT preparation. Sections 4 and 5 describe the profiling vectors of clinical trials and EMRs, and how they are matched. The experimental evaluations are presented in Section 6, while Section 7 draws conclusions.
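The five stages above can be summarised as a minimal pipeline skeleton. This is a hypothetical sketch, not the authors' implementation: every function body is a toy stand-in (simple token filtering instead of NER and SNOMED-CT normalisation, truncation instead of an autoencoder).

```python
# Hypothetical sketch of the five-stage EMR2vec pipeline; each step is a toy
# stand-in for the technique named in the platform overview.

def prepare_bomt(criteria):
    """Stage 1: derive a bag of medical terms (features) from criteria.
    Stand-in for classification + NER + SNOMED-CT normalisation."""
    return sorted({w.lower() for c in criteria for w in c.split() if len(w) > 4})

def vectorize(terms, bomt):
    """Stages 2-3: represent an EMR or a trial over the BoMT dimensions."""
    present = {t.lower() for t in terms}
    return [1.0 if f in present else 0.0 for f in bomt]

def reduce_dim(vec, keep):
    """Stage 4: stand-in for autoencoder dimensionality reduction."""
    return vec[:keep]

def match(emr_vec, ct_vec):
    """Stage 5: scalar projection of the EMR vector onto the trial vector."""
    dot = sum(e * c for e, c in zip(emr_vec, ct_vec))
    norm_ct = sum(c * c for c in ct_vec) ** 0.5 or 1.0
    return dot / norm_ct
```

Feeding two toy criteria through `prepare_bomt` and vectorising a patient's terms against the resulting feature list reproduces the overall data flow of fig. 1 in miniature.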
There are many initiatives in the literature aimed at providing solutions for this matching problem. While in this paper we also use a vector space model, our work distinguishes itself from the cited references in that it exploits both machine learning and semantic web technologies to represent EMR and clinical trial data. An EMR is an application environment, replacing paper medical records, composed of patient information that can be created, gathered, managed and consulted by authorized clinicians and staff within one healthcare organization. The basic features and functions of EMRs include the following [?]: • Manage patient information including patient problems, medications, allergies, notes, past medical history, and observation results (such as laboratory, radiology, and other testing results). • Provide substantial benefits to healthcare practitioners, such as physicians and clinic practices, in monitoring and managing patient information. • Guide workflow and manage patient-specific care plans, provide appropriate guidelines and protocols, and support clinical decision-making. • Provide a 360° view of the patient's condition at all times. A clinical trial is a type of research that provides a longstanding foundation for the practice of medicine and the evaluation of new medical treatments. Each trial has eligibility criteria describing the characteristics a patient or participant must satisfy: all "inclusion criteria" and none of the "exclusion criteria". In this respect, the criteria differ from study to study. The authors in [?] analysed 1000 eligibility criteria and showed that 23% of the criteria are simple, or can be reduced to simple criteria, and that 77% of the criteria remain complex to evaluate.
Therefore, a formally computable representation of eligibility criteria would require natural language processing techniques as part of various research functions in the era of the EMR, including evaluating feasibility, cohort identification and trial recruitment. The EMR can help physicians find patients who meet the criteria by searching its structured data. SNOMED-CT [?] is a standardized, multilingual ontology of clinical terminology that includes terms from all medical domains and provides the general core terminology for the EMR. SNOMED-CT has been developed over the past 30 years in a multinational effort and is accepted as the global common language for health terms. It is a comprehensive international clinical terminology that is used in over fifty countries. With more than 349,548 unique biomedical terms (concepts) and 1.2 million synonyms grouped into 19 top-level concepts, SNOMED-CT has very good clinical conceptual coverage. The concepts in SNOMED-CT are divided into hierarchies as diverse as body structure, clinical findings, geographic location, and pharmaceutical/biological product [?]. The core component types in SNOMED-CT are: • Concepts that represent clinical meanings organized in hierarchies. • Descriptions that relate the appropriate human-readable terms to concepts. • Relationships that link each concept to other related concepts. Matching an EMR to a clinical trial, or automatically screening a patient for clinical trial eligibility, is the task of comparing the clinical features of the patient, f_pi (f_pi ∈ emr), to the features extracted from the inclusion and exclusion eligibility criteria (EC) of a clinical trial (ct), f_tj (f_tj ∈ ct), where emr and ct are two sets representing the clinical features of the patient's EMR and of the clinical trial's EC, respectively. From a practical standpoint, in order to be able to compare emr and ct, the terms in these two datasets should be normalized using the same medical standard or ontology (such as SNOMED-CT).
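This normalisation precondition can be illustrated in a few lines. The lookup table below is an illustrative placeholder for a real SNOMED-CT concept map; the point is only that the raw term sets are disjoint until both sides are rewritten to shared concepts.

```python
# Toy illustration of the matching precondition: patient features (emr) and
# trial criteria features (ct) are only comparable after normalisation to a
# shared terminology. TO_SNOMED is a made-up placeholder, not a real map.
TO_SNOMED = {
    "cardiac insufficiency": "Heart failure",
    "heart failure":         "Heart failure",
    "dm type 1":             "Diabetes mellitus type 1",
}

def normalize(terms):
    """Map each raw term to its shared concept (identity if unknown)."""
    return {TO_SNOMED.get(t.lower(), t) for t in terms}

def shared_features(emr_terms, ct_terms):
    """Features comparable between a patient and a trial after normalisation."""
    return normalize(emr_terms) & normalize(ct_terms)
```

Without the shared ontology, "Cardiac Insufficiency" and "Heart Failure" would never intersect; with it, they map to the same concept.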
In this work, we represent patients and clinical trials in the vector space model by converting them to feature vectors with dimensionality equal to the number of distinct medical terms extracted from all eligibility criteria. In the vector space model, the features characterize the vector instance. Each feature of the vector is assigned a weight, which captures the relative importance of the feature in the vector. Thus, to represent these features, entities are extracted from the eligibility criteria. Every criterion is processed to extract the various medical terms using Named Entity Recognition (NER), a popular NLP technique. Each extracted term is replaced with the corresponding normalized term from the SNOMED-CT ontology. The output is stored and serves as the features that represent the dimensions of the EMR and clinical trial vectors. The objective of this work is to match clinical trials with data from the EMR. Since the primary content of the EMR is the medical and treatment history of patients in one practice, and since some ECs (such as demographics and race) will probably be difficult to retrieve from a traditional EMR, we focused our process on selecting ECs of the problem and treatment categories only. As we will see in the next section, classifying sentences into these classes helps reduce ambiguity when identifying and categorizing named entities mentioned in ECs [?]. In the previous section, each EC was classified as a single entity, but given that a sentence from an EC contains important medical concepts, we were interested in extracting the concepts (named entities) that express the main idea of the sentence. For instance, in the case of a sentence representing a problem, we aimed to detect the name of the disease. Consider the sentence "the patient suffers from severe cardiovascular diseases"; the goal was to detect the term "cardiovascular disease". To that end, we used medical NER, which is an important task of natural language processing (NLP).
It enables the detection of a medical entity (problem, treatment, ...) in a sentence. Several methods have been implemented to examine the performance of medical NER. Most of these methods are based on Conditional Random Fields (CRF) and supervised machine learning models which utilize both textual and contextual information. Named entities recognized from ECs have to be compared to EMR structured data represented by standard coding systems. Therefore, entities extracted from ECs must be normalized according to a standard ontology for medical concepts. We chose the SNOMED-CT ontology. Term normalization was performed by querying MetaMap for an entity and retrieving the corresponding concept with its Concept Unique Identifier (CUI) from SNOMED-CT. As a result, for each clinical trial, we ended up with a list of ECs presented by a classification type, a normalized term, and a CUI (fig. 2). By linking ECs to a medical ontology, it becomes possible to match semantic terms between clinical trials and the EMR using automated inference over medical concepts. We used MetaMap to process and annotate these ECs with concepts from UMLS, and preserved sentences containing only these two concepts. UMLS is a compendium of many controlled vocabularies in the biomedical sciences; each concept is assigned one or more semantic types, which are linked to each other by semantic relationships. To filter these ECs, we proposed to regroup some semantic types from UMLS. Each sentence of the input dataset was submitted to the MetaMap tool. From the generated results produced by MetaMap, we considered the semantic types and the mapping scores. It is worth noting that, for the same sentence, MetaMap provides multiple semantic types with different scores. Once a semantic type matched one of the semantic types mentioned above, the sentence was accepted into that group. If none of the semantic types provided by MetaMap was found in the above list, the sentence was discarded.
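The accept/discard logic just described can be sketched as follows. The semantic-type groups below are a small illustrative handful of UMLS type abbreviations, not the paper's exact lists, and the scores are made-up MetaMap-style mapping scores.

```python
# Sketch of the MetaMap-based filtering step: a sentence is kept only if one of
# its candidate semantic types belongs to an allowed group, and the candidate
# with the highest mapping score decides the group. Type sets are illustrative.
PROBLEM_TYPES = {"dsyn", "sosy", "neop"}     # disease, sign/symptom, neoplasm
TREATMENT_TYPES = {"topp", "phsu", "clnd"}   # procedure, pharmacologic substance, drug

def assign_group(candidates):
    """candidates: list of (semantic_type, score) pairs from MetaMap-like output.
    Returns 'Problem', 'Treatment', or None (sentence discarded)."""
    best_group, best_score = None, float("-inf")
    for sem_type, score in candidates:
        if sem_type in PROBLEM_TYPES and score > best_score:
            best_group, best_score = "Problem", score
        elif sem_type in TREATMENT_TYPES and score > best_score:
            best_group, best_score = "Treatment", score
    return best_group
```

A sentence annotated with both a disease type and a higher-scoring procedure type would thus land in the Treatment group, while a sentence with no allowed type is dropped.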
If, for a sentence, MetaMap provided multiple semantic types included in our list, the one with the highest score was considered eligible. Eventually, we obtained an imbalanced dataset of 5000 sentences in which the Problem group dominates, with more than 4000 sentences. In order to account for this imbalance, we manually removed most of the Problem sentences, as well as the similar ones. The final EC dataset, containing 1500 ECs, was prepared for verification and annotation in order to be manually labelled by a nurse and a data scientist according to the two classes, Problem and Treatment. The following criteria were applied: • Problem class: includes the patient's complaints, symptoms, diseases, and diagnoses. • Treatment class: includes medications, surgeries and other procedures. In the case of multi-entity recognition in a sentence, the context of the sentence must be checked. For example, an ambiguous sentence such as "Concurrent Medication: Allowed, Aerosol ribavirin for short-term treatment of RSV" (NCT00000961) discusses the patient's medication; therefore it was labelled as Treatment despite the presence of a disease entity (Respiratory Syncytial Virus infection, RSV). Once the ECs were labelled, the next step was to train a classifier. For that purpose, we split the dataset into 80% of the samples for the training set and 20% for evaluation and testing. In our experiment, we explored and empirically compared five methods which are widely used in classification as the baselines of our classification: SVM, CNN, LSTM, C-LSTM and BioBERT. We used PubMed-and-PMC-w2v and the average word embedding to turn each EC into a vector representation form, which can be manipulated by machine learning algorithms [?]. • Support Vector Machine (SVM): SVM is a supervised machine learning algorithm that is widely used for classification challenges due to its high accuracy.
SVM aims to create a hyperplane or set of hyperplanes to classify all inputs in a high-dimensional space. We took advantage of the pretrained PubMed-and-PMC-w2v embeddings to create a vector representation of each EC, which is the input of the SVM algorithm. • Long Short-Term Memory (LSTM): LSTM is an effective type of recurrent neural network (RNN) architecture. The basic unit of an LSTM network is the memory block, which is able to learn long-term dependencies. We measure the final experiment results using three different criteria: Recall, Precision and F1-score (eq. 1), which are common criteria for evaluating classification performance. Profiling medical data is a very challenging task. In this section, we first discuss feature weighting; the main drawback of purely lexical weighting is that it ignores any semantic similarity between terms. In the following section, we detail our proposed similarity weight that operates on patient features and the BoMT. As stated above, clinical trials have to be constructed as profiles in a vector space model with dimensionality equal to the number of terms in the BoMT. As presented in fig. 4, the clinical trial vector can be constructed by carrying out the same steps that were done to create the BoMT and by adding a negation step. For negation detection we use NegScope, which reaches an F1-score of 97% in clinical notes and 85% in biological literature [?]. As an example, NegScope detects the negation "No severe cardiac failure" in the following EC: "uncontrolled or persistent hypercalcemia Cardiovascular: No severe cardiac failure" (NCT00010088). Inclusion criteria negated according to the negation method are treated as exclusion ECs and, vice versa, negated exclusion criteria are treated as inclusion ECs. Based on Section 3, which represents each EC in the form of SNOMED-CT terms, the clinical trial is represented as a vector of zeros except at the positions corresponding to the SNOMED-CT terms found in the eligibility criteria.
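The inclusion/exclusion sign flip under negation can be sketched as follows. This is a toy: negation detection is a naive prefix check standing in for NegScope, and the ±1 weights are illustrative placeholders for the paper's weight formula.

```python
# Sketch of clinical-trial vectorisation over BoMT dimensions: positive weight
# for inclusion terms, negative for exclusion terms, zero otherwise. A negated
# inclusion criterion is treated as exclusion and vice versa. is_negated is a
# toy stand-in for NegScope; the +/-1 weights are illustrative only.
def is_negated(criterion):
    return criterion.lower().startswith("no ")

def trial_vector(inclusion, exclusion, bomt):
    """inclusion/exclusion: lists of (criterion_text, normalized_term) pairs."""
    vec = [0.0] * len(bomt)
    for text, term in inclusion:
        if term in bomt:
            vec[bomt.index(term)] = -1.0 if is_negated(text) else 1.0
    for text, term in exclusion:
        if term in bomt:
            vec[bomt.index(term)] = 1.0 if is_negated(text) else -1.0
    return vec
```

With the NCT00010088 example, the negated inclusion "No severe cardiac failure" contributes a negative (exclusion-like) weight.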
The weight of features in the clinical trial vector is calculated as follows: In order to represent a patient in the vector space model, patient information is collected from the different admissions (visits) in the EMR, regardless of the admission timeline. As we detailed in Section 3.1, the primary content of the EMR is the medical and treatment history of patients. Therefore, the target patient information is generated by creating two lists, problems and treatments, which combine terms extracted from EMR structured data such as diseases (for the problem list) and procedures and prescriptions (for the treatment list). Other patient information (like age and gender) is not included in the EMR vector, as it can easily be used to filter the EMR database before starting the data linkage process. In NLP, the TF-IDF method ignores any semantic similarity between terms. Indeed, terms can be different enough to be considered different features although they are semantically similar. For example, "Cardiac Insufficiency" represents a feature of a patient in the EMR, while "Heart Failure" is an entity extracted from the EC "Clinical diagnosis of heart failure" (NCT03390088). TF-IDF considers these two terms to be different, and therefore patients with "Cardiac Insufficiency" could not be mapped to any EC feature. Thus, semantic similarity between terms yields better results for applications such as biomedical information retrieval. To determine each feature weight of a patient vector, unlike TF-IDF, which reflects how important a feature is to a document in a corpus, we propose a similarity metric among clinical features (fig. 5). The new weight metric, W_EMR, calculates the maximum Term Semantic Similarity (TermSemSim) between the patient terms PT of patient P_i and the BoMT features. While ECs are described by coarser (more generic) clinical concepts or by defining their characteristics (attributes), EMR data is presented as granular (more specific) information.
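A toy illustration of this maximum-similarity weighting, bridging generic criteria terms and specific EMR terms: the "is-a" fragment below is an invented placeholder standing in for the SNOMED-CT hierarchy, and the 1/(1+hops) decay is an illustrative similarity, not the paper's TermSemSim definition.

```python
# Toy version of the W_EMR weight: for each BoMT feature, take the maximum
# term semantic similarity over the patient's terms. Similarity decays with
# distance along a toy is-a chain (SNOMED-CT in the paper).
IS_A = {  # child -> parent, illustrative fragment only
    "Cardiac arrhythmia": "Heart disease",
    "Heart failure": "Heart disease",
    "Heart disease": "Disorder",
}

def term_sem_sim(term_a, term_b):
    """1.0 for identical concepts, 1/(1+hops) along the is-a chain, else 0."""
    if term_a == term_b:
        return 1.0
    hops, cur = 0, term_a
    while cur in IS_A:
        cur, hops = IS_A[cur], hops + 1
        if cur == term_b:
            return 1.0 / (1 + hops)
    return 0.0

def w_emr(patient_terms, feature):
    """Maximum similarity between any patient term and the BoMT feature."""
    return max((term_sem_sim(t, feature) for t in patient_terms), default=0.0)
```

Under this sketch, a patient with the specific condition "Cardiac arrhythmia" still receives a non-zero weight on the more generic criteria feature "Heart disease".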
To overcome this matching challenge, TermSemSim uses semantic reasoning by incorporating knowledge from the SNOMED-CT ontology, such as the "is a" and "has a-type" relations, to semantically match both datasets. For example, using the "is a" relation, a patient with the "Cardiac arrhythmia" condition will be mapped to the EC "Underlying Heart disease" (NCT02217267); likewise, a reasoning task can be performed to map the patient's condition "Benign tumor of lung parenchyma" to the EC "History or presence of any benign neoplasm considered by the investigator to be clinically significant" (NCT01839279) using "Has associated morphology". Based on the above, we define TermSemSim as a function that calculates the similarity between a patient feature and the BoMT, using relations (such as T is-a BoMT_j and T has-characteristic BoMT_j) selected on what appeared to produce reasonably valid matching. Measuring the patient-clinical trial matching score requires the use of similarity measures between vectors. More precisely, the more common features two vectors share, the larger the matching value will be. However, the number of features is very large; therefore, it is common to reduce the original dimensionality before measuring the patient-clinical trial similarity. By collating all patient vectors together, the EMR dataset is represented as a matrix EMR_m ∈ R^(n×m), where n is the number of patients and m is the number of BoMT features. In this work, we applied the autoencoder model to reduce the dimensionality of the features and obtain a low-dimensional approximation that extracts the main features of the BoMT and eliminates noise from the data. Let the following definitions be given, describing the content of the BoMT, the EMR and the clinical trial: • A bag of medical terms with m features f_1, f_2, ..., f_m: BoMT = (f_j | f_j extracted from criteria). • An electronic medical record emr_i of a patient i with r features f_p1, f_p2, ..., f_pr is represented as an m-dimensional vector, i.e. P_i = (e_i1, e_i2, ..., e_im | e_ij = W_EMR(P_i, f_j)).
• A clinical trial ct_i with l features f_t1, f_t2, ..., f_tl is represented as an m-dimensional vector. The criteria from the clinical trial are the ones that establish the matching process, since the presence of a feature in the trial requires searching for it in the patient, while the absence of a feature in the trial should never affect the matching process. Therefore, considering a feature f_j ∈ BoMT, the following "matching rules" should be respected when computing the matching score between an emr vector and a clinical trial ct vector: 1. The matching score should increase when f_j appears in both emr and the inclusion criteria. 2. The matching score should decrease when f_j appears in both emr and the exclusion criteria. 3. The matching score should not be affected when f_j appears in neither the inclusion nor the exclusion criteria. The score calculation rules between emr and ct vectors mentioned above are incompatible with certain similarity metrics such as the Cosine, Euclidean and Jaccard distances. Consider, for example, two patients P1(1, 0.5, 1) and P2(1, 1, 1) and a clinical trial CT1(1, 0, 1). With Cosine, the similarity between P1 and CT1 is 0.9, which is greater than the similarity between P2 and CT1, 0.8. The same result is found using the Euclidean distance measure: the distance between P1 and CT1 is 0.5, which is lower than the distance between P2 and CT1, 1. These results show that the two measures do not satisfy rule 3. The similarity projection of emr onto ct is the vector denoted proj_ct emr, given by the standard orthogonal projection proj_ct emr = ((emr · ct) / (ct · ct)) ct, whose length is ||proj_ct emr|| = (emr · ct) / ||ct|| = ||emr|| cos θ, where "·" denotes the dot product of two vectors and θ the angle between emr and ct. Therefore, given a clinical trial ct and a patient document emr, the dot product assigned to the pair (emr, ct) is a score in the interval [-1, 1].
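The numeric example from this section can be checked directly. The sketch below reproduces the paper's P1, P2 and CT1 vectors and shows that cosine and Euclidean distance violate rule 3 while the scalar projection does not (note that with these unnormalised toy vectors the projection score is not confined to [-1, 1]).

```python
# Verifying the paper's example: the extra feature in dimension 2 appears in
# neither the inclusion nor the exclusion criteria of CT1, so by rule 3 it
# should not change the score; only the projection measure respects this.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return dot(a, a) ** 0.5

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def projection_score(emr, ct):
    """Length of the orthogonal projection of emr onto ct."""
    return dot(emr, ct) / norm(ct)

P1, P2, CT1 = [1, 0.5, 1], [1, 1, 1], [1, 0, 1]
```

Cosine gives roughly 0.94 for P1 versus 0.82 for P2, Euclidean distance gives 0.5 versus 1, but the projection score is identical for both patients, as rule 3 requires.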
Thus, the longer the projection of the emr vector onto the ct vector, the higher the matching score. The proposed matching derived from vector projection captures the proposed matching rules, which was not the case for the Cosine and Euclidean distance measures. Therefore, the proposed matching is considered the most appropriate. The growing amount of data in the healthcare industry has made it impossible to perform data integration using traditional tools. For instance, legacy data warehouses are unfit to handle data with high volume, high variety and high velocity. Thus, to ensure scalable load capacity, we have to adopt Big Data tools when developing an application for integrating and analyzing healthcare data [?]. To meet this challenge, we adopted a Data Lake infrastructure for developing our prototype. A Data Lake is a massively scalable storage technology which enables us to answer specific analytical questions by simplifying the processing of data variety using modern integrated tools [?]. In this context, the Data Lake concept realizes the polyglot persistence model by collecting data from huge heterogeneous data sources and providing an integrated view of the data without any predefined schema [?]. The data is stored in its raw original format and is processed whenever required for a particular analysis task to meet a specific request. We used Hive, MongoDB and Blazegraph, respectively, for storing and accessing relational data, semi-structured data and RDF (Resource Description Framework) data. PostgreSQL is used as a metadata repository where we maintain the schema information and the similarity score results from dataset matching. An overview of our prototype is shown in fig. 7. The main function of the prototype is to allow a data analyst to integrate and query data from the EMR and clinical trial datasets. For instance, some researchers have successfully used EMRs as supportive tools to facilitate the assessment of clinical trial outcomes [?].
To illustrate our platform's work in outcome assessment, let us consider the example where a researcher (user) is interested in assessing the effectiveness of a diabetes treatment across a number of trials. As a first step, they need to match and link patients over 15 years old with a "Family history of diabetes mellitus" (ICD-9 code v180) to several clinical trials related to testing drugs for treating "Diabetes Mellitus, Type 1". The process of creating links between patients and clinical trials is initiated by the user, who expresses these two queries through the platform's graphical interface. The role of these two queries is to select two subsets from the EMR and clinical trial datasets respectively. The first one filters patients from Hive and the second one filters clinical trials from MongoDB. In addition, the user defines the semantic similarity threshold ξ, with a value between 0.5 and 1, as well as the different mapping files that provide the mapping between EMR coding systems (such as ICD-9) and SNOMED-CT. These two queries are then executed by our platform's pipeline, which processes their results, produces the BoMT, creates the representation vectors and matches patients to clinical trials. Finally, the pipeline stores the matching scores in the metadata of the data lake. Once the matching scores have been populated, the researcher can begin investigating the newly linked datasets. Unlike many existing matching tools, the linked datasets generated by the pipeline allow researchers to ask ad-hoc analytical queries. Fig. 8 illustrates the user interface for preparing data and configurations. To test our platform, we used two datasets: MIMIC-III (Medical Information Mart for Intensive Care) and ClinicalTrials.gov. In MIMIC-III, SUBJECT_ID refers to a single patient and HADM_ID refers to a single admission. In our experiment, we used 3 tables of events (diagnoses, procedures, prescriptions) and imported data from PostgreSQL into Hive using Apache Sqoop.
Diseases and procedures are encoded using International Classification of Diseases version 9 (ICD-9) codes, while prescriptions use various coding systems for drug representation, including the Generic Sequence Number (GSN) and the National Drug Code (NDC). ClinicalTrials.gov is the preferred resource for analyzing knowledge from clinical trials. After downloading the XML-encoded data from the official NCT website, we automatically processed and converted it into JSON format in order to import it into MongoDB using Apache NiFi. The XML package contains all information about every study registered in ClinicalTrials.gov. However, it does not contain discrete eligibility features and therefore does not automatically support the required analysis of eligibility criteria. The following section describes how to prepare eligibility criteria for the processing step. The test was done according to the methodology described in the previous sections, using a random subset of 10,000 patients paired with a random subset of 10,000 clinical trials. To fully achieve system interoperability and prepare an infrastructure that supports the measurement of similarity between different classification systems, SNOMED-CT is used in this work as the semantic bridge between the various terminologies found in the different data sources. Therefore, we needed to map ICD9 to SNOMED-CT using the "ICD-9-CM Diagnostic Codes to SNOMED-CT Map" mapping files created by the National Library of Medicine (NLM). On the other hand, we also mapped NDC codes to SNOMED-CT using RxNorm. The eligibility criteria are usually organized as free-text paragraphs or as bullet lists. Therefore, the classification of criteria into classes of treatment and problem requires the extraction of each criterion as an autonomous sentence.
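A minimal sketch of the code-normalization step, assuming the mapping file has been reduced to two tab-separated columns (the actual NLM and RxNorm distributions carry more columns and may contain one-to-many mappings):

```python
import csv

def load_code_map(path):
    """Build a dict from a source code (e.g. ICD-9-CM or NDC) to a
    SNOMED-CT concept id, given a two-column TSV mapping file."""
    mapping = {}
    with open(path, newline="") as fh:
        for source_code, snomed_id in csv.reader(fh, delimiter="\t"):
            mapping.setdefault(source_code, snomed_id)
    return mapping

def to_snomed(code, mapping):
    # Returns None when the code has no SNOMED-CT equivalent, so that
    # unmapped terms can be reported rather than silently dropped.
    return mapping.get(code)
```

Normalizing every source vocabulary onto SNOMED-CT in this way is what lets terms from the EMR and from eligibility criteria land in the same BoMT space.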
Since the range of our input vectors is [−1, 1], we found that, for this task, the combination of Leaky-ReLU for the hidden layer and Tanh for the output layer achieves better performance than ReLU and Sigmoid, because: • Leaky-ReLU assigns a non-zero slope to negative values, in contrast to ReLU, in which the negative part is totally removed. • The Tanh function is a variant of the Sigmoid function, but it maps a real input into the range [−1, 1] (see fig. 9). We are interested in evaluating the performance of matching patients to a clinical trial, and therefore it is essential to measure the quality of top patient retrieval results. Thus, we evaluate using P@K, based on the fact that the higher the proportion of true positives for a given patient at the top of a ranked list, the better the platform's performance. These techniques are widely used to evaluate text retrieval, since the best search results need to be prioritized. The mAP score (Eq.8) for a set of clinical trials is the mean of the average precision scores AP (Eq.7) of these clinical trials. The results of P@K for the 3 experiment evaluations are presented in fig. 10, with the following (ξ) threshold values in equation 4: 0.5, 0.6 and 0.7. As we can observe, the distribution of the points which represent the precision at k for the 6 clinical trials suggests that the threshold 0.6 might be better suited than 0.7 and 0.5. For threshold 0.6, the ratio of precision 1.0 is 63%, against 20% and 14% for thresholds 0.7 and 0.5 respectively. The AP values increase and decrease in the same way according to the clinical trial; this behaviour will be detailed in the discussion section. Furthermore, fig. 11b shows that the best overall mAP of 0.86 was obtained using a relevance threshold of 0.6.
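The two activation choices can be sketched as follows; this is a plain-Python illustration of the functions themselves, not our actual network code.

```python
import math

def leaky_relu(x, slope=0.01):
    # Keeps a small non-zero slope for negative inputs, unlike
    # ReLU, which removes the negative part entirely.
    return x if x >= 0 else slope * x

def tanh(x):
    # Squashes any real input into [-1, 1], matching the
    # [-1, 1] range of our input vectors.
    return math.tanh(x)
```

A hidden layer built from `leaky_relu` therefore never produces dead units on negative inputs, while a `tanh` output layer keeps the reduced vectors in the same [−1, 1] range as the inputs.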
Consequently, using 0.6 as the semantic similarity threshold leads to the best correspondence between the eligible patients and the clinical trial. Table 3 reports the AP and mAP values for each clinical trial and each threshold experiment. To empirically compare our results with the state of the art of clinical trial-patient matching, we explored the most relevant works. We were able to extensively compare with [? ], which followed an architecture similar to ours. Consequently, we can deduce that combining machine learning, semantic web and information extraction techniques, and more precisely, our proposed semantic similarity measurement, which uses the SNOMED-CT medical ontology to construct the patient vector, has a very interesting impact in improving matching performance. To further explore the insights of our EMR2vec, we investigated the reason why the AP in the 3 evaluation experiments changed in a balanced way. We reviewed the eligibility criteria for the clinical trials and quantified the number of medical terms in each trial. We found a relationship between the number of medical terms on the one hand, and the value of the APs on the other. To corroborate this statement, we normalized the number of extracted terms between 0 and 1 using eq.9, so that a higher value would indicate more complex criteria. On the whole, the experimental result (mAP=0.86) of our proposed EMR2vec platform demonstrates that our algorithm is able to efficiently link the EMR and clinical trial datasets. It is now well accepted that vector representations offer a reliable approach towards automatically matching patients and eligibility criteria. Moreover, automated inference under SNOMED-CT plays a critical role in semantic similarity, as SNOMED-CT proved suitable for automatic reasoning over the semantic relations between terms.
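The P@K, AP and mAP scores discussed above can be computed as in the following sketch (a plain-Python rendering of the standard retrieval metrics behind our Eq.7 and Eq.8; the ranked lists in the usage example are hypothetical):

```python
def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k ranked patients that are truly eligible."""
    return sum(1 for p in ranked[:k] if p in relevant) / k

def average_precision(relevant, ranked):
    """Mean of P@k over the ranks k at which an eligible patient appears."""
    hits, total = 0, 0.0
    for k, p in enumerate(ranked, start=1):
        if p in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(per_trial):
    """per_trial: list of (relevant_set, ranked_list) pairs, one per trial."""
    return sum(average_precision(r, rk) for r, rk in per_trial) / len(per_trial)
```

For example, with eligible patients {1, 2} and ranking [1, 3, 2], P@2 is 0.5 and AP averages the precisions at ranks 1 and 3.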
While structured data may be the main information representing patients, unstructured clinical text captured in the EMR, including discharge summaries, treatment plans, and progress notes, is certainly also useful for the matching process. Since the matching score should reflect both structured and unstructured data, we propose integrating into the current pipeline a new technique to match unstructured data from both datasets, in order to give the platform complete control over all data types. Another set of errors was caused by missing fields in the EMR, such as laboratory test observations. Laboratory test results typically require a more advanced standard in order to be compared to inclusion/exclusion criteria such as "AST/SGOT and ALT/SGPT < 3.0 times upper limit of normal (ULN)". Logical Observation Identifiers Names and Codes (LOINC) is a database and universal standard for identifying medical laboratory observations. Unfortunately, a lack of adoption of LOINC codes in EMRs and a lack of LOINC codes in eligibility criteria limited such comparisons. In this paper, we presented EMR2vec, a Big data vector space platform for medical data linking. The EMR2vec platform allows health researchers to match, link and query two different but complementary datasets: EMR data and clinical trials. To the best of our knowledge, this is the first study aimed at highlighting the eligibility of patients for a trial using a vector space model approach and combining machine learning and semantic web techniques. EMR2vec features three pipelines that are coupled together to support data matching: 1) BoMT creation, 2) patient data conversion into a vector and 3) clinical trial representation as a vector. Our matching process reduces vector dimensionality using a neural network, then applies orthogonality projection to measure the similarity between vectors.
In this work, we carefully analyzed the element types and data structures of both datasets, and investigated how to handle the diversity of medical nomenclatures, vocabularies, coding and classification systems in order to support a smooth integration of health datasets. We verified the effectiveness of leveraging machine learning and semantic web techniques on EMRs and eligibility criteria. The potential of machine learning emerged in converting unstructured data into queryable data, whereas the semantic web provided data reasoning as well as interoperability between heterogeneous datasets. We evaluated the performance of the proposed platform by carrying out several experiments. The outcome shows that the vector space model is a reliable approach for medical data matching tasks. More specifically, the vector space provides efficient semantic representations of both datasets. To sum up, the proposed EMR2vec platform is a promising approach to bridge the gap between patient data and clinical trials. It can be very effective and efficient in treating patients suffering from life-threatening diseases, and can hence play a significant role in saving lives.
Fig. 10: Precision-at-k (k=1 to 5) obtained for 6 clinical trials and 3 semantic similarity thresholds.