key: cord-0443973-z20mownq authors: Zhao, Zhengyun; Jin, Qiao; Yu, Sheng title: PMC-Patients: A Large-scale Dataset of Patient Notes and Relations Extracted from Case Reports in PubMed Central date: 2022-02-28 journal: nan DOI: nan sha: 769eae977c287f7696ad8fd4cc568785fdbe1779 doc_id: 443973 cord_uid: z20mownq

We present PMC-Patients, a dataset consisting of 167k patient notes with 3.1M relevant article annotations and 293k similar patient annotations. The patient notes are extracted by identifying certain sections from case reports in PubMed Central, and those with at least a CC BY-NC-SA license are redistributed. Patient-article relevance and patient-patient similarity are defined by citation relationships in PubMed. We also perform four tasks with PMC-Patients to demonstrate its utility: Patient Note Recognition (PNR), Patient-Patient Similarity (PPS), Patient-Patient Retrieval (PPR), and Patient-Article Retrieval (PAR). In summary, PMC-Patients provides the largest collection of patient notes to date, with high quality, diverse conditions, easy access, and rich annotations.

Recent Natural Language Processing (NLP) models have shown great success in a variety of tasks (Devlin et al., 2019; Brown et al., 2020), such as text classification, information extraction and retrieval, question answering, and summarization. They provide promising opportunities for the development of clinical decision support systems (Garg et al., 2005; Kawamoto et al., 2005), which automatically analyze Electronic Health Records (EHR) of patients and assist clinical decision making. Data unavailability due to privacy concerns is one of the biggest challenges in clinical NLP (Chapman et al., 2011), and about half of the studies surveyed in a recent review used private datasets (Wu et al., 2020), which hinders reproduction. Only a few patient note datasets are publicly available (Styler et al., 2014; Johnson et al., 2016; Caufield et al., 2018), most of which are small-scale.
MIMIC-III (Johnson et al., 2016), an EHR dataset, is the most notable example and has been widely used for various purposes (Komorowski et al., 2018; Mullenbach et al., 2018; Alsentzer et al., 2019; Huang et al., 2019). However, it only includes 46.1k patients in critical care departments without relational annotations, and the data quality is questionable (Kurniati et al., 2019). An ideal patient note dataset should be large-scale, high-quality, cover diverse medical conditions, and have rich annotations to support downstream tasks.

In this paper, we collect such a patient note dataset from case reports in PubMed Central (PMC). "Case report" is a specific type of medical publication, which typically consists of 1. a patient note that summarizes the patient's whole admission, progress, discharge, and follow-up situations; and 2. a literature review of similar cases and relevant articles. We extract the patient notes by identifying certain sections and consider scientific articles and other case reports in the references to be relevant and similar, respectively. An overview of PMC-Patients is shown in Figure 1. In summary, PMC-Patients is:
1. publicly available: the whole dataset can be easily downloaded;
2. large-scale: PMC-Patients has the largest scale with 167k patients, and more can be extracted with distant supervision;
3. high-quality: the patient notes are of publication quality;
4. diverse: PMC-Patients covers a variety of medical conditions;
5. annotated: patient notes are annotated with relevant articles and similar notes.

The dataset collection procedure and characteristics are introduced in Section 2 and Section 3, respectively. PMC-Patients can be used for various purposes, such as case report literature mining (Karami et al., 2019), pre-training clinical language models (Alsentzer et al., 2019), and building downstream tasks.
In this paper, we explore Patient Note Recognition, Patient-Patient Similarity, Patient-Patient Retrieval, and Patient-Article Retrieval using PMC-Patients, which are described in Sections 4-7.

PubMed Central (PMC, https://www.ncbi.nlm.nih.gov/pmc/) is a free full-text archive of biomedical and life sciences journal literature, currently archiving 7.6M articles. As a subset of PMC, PMC Open Access (OA, https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) includes over 4M articles available for reuse, though under various licenses, some of which do not allow redistribution. Therefore, only publications with at least a CC BY-NC-SA license (https://creativecommons.org/licenses/by-nc-sa/4.0/), which amount to nearly 3.2M, are used to build PMC-Patients. Relational annotations are defined by citation relationships in PubMed (https://pubmed.ncbi.nlm.nih.gov/). Publications without titles or abstracts are excluded.

The collection pipeline of PMC-Patients is shown in Figure 2 (input: PMC OA articles, n=3,180,413) and can be summarized as follows:
(a) For each section of each article, identify whether there are potential patient notes with extraction triggers.
(b) For sections triggering in step (a), extract patient note candidates with extractors.
(c) Apply various filters to candidates to exclude non-patient-notes.
(d) Relational annotations are defined by citation relationships in PubMed.

Extraction triggers are a set of regular expressions that identify whether a given section contains no, one, or multiple potential patient notes. They consist of two successive triggers:
section_title_trigger: Searching in the section title for certain phrases that indicate the presence of patient notes, such as "Case Report" and "Patient Representation".
multi_patients_trigger: Searching for certain patterns in the first sentence of each paragraph and titles of subsections to identify whether multiple notes are presented, such as "The second patient" and "Case 1" (e.g. "Case 1: a 7-year-old boy...").

Extraction is performed at the paragraph level. Depending on whether multi_patients_trigger is triggered, different extractors are used:
single_patient_extractor: Extract all paragraphs in the section as one note, if not triggered.
multi_patients_extractor: Extract paragraphs between successive triggering parts (the last one is taken till the end of the section) as multiple patient notes, if triggered.

We remove noisy candidates with three filters:
length_filter: Candidates with fewer than 10 words are excluded.
language_filter: Candidates with more than 3% non-English characters are excluded.
demographic_filter: The age and gender of a patient are identified using regular expressions. Candidates missing either demographic characteristic are excluded.

Two types of relations are annotated:
Patient-Article Relevance: All articles citing or cited by the article containing a patient note, plus the article itself, are defined as relevant articles to the patient.
Patient-Patient Similarity: Patients extracted from the relevant articles to a patient are defined as similar patients to the given patient with a similarity of 1, and patients from the same article are annotated with a similarity of 2.

To evaluate the quality of the automatically extracted patient notes and demographics, we randomly sample 500 articles with at least one extracted case for human annotation. Two independent biomedical experts (senior M.D. candidates) are employed to label the patient note spans and their demographics in these articles. Annotations on which the experts agree are directly taken as the ground truth. Disagreements are discussed in a second round, and the final consensus is used as the ground truth.
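The two relation definitions above can be illustrated with a minimal sketch. It assumes the citation graph is available as (citing, cited) pairs and that extracted patients are grouped by source article; the function name `annotate_relations` and the data layout are hypothetical and are not taken from the PMC-Patients code.

```python
from collections import defaultdict


def annotate_relations(citations, patients_by_article):
    """Derive relevance and similarity annotations from citations.

    citations: iterable of (citing_article, cited_article) pairs (assumed layout).
    patients_by_article: dict mapping article id -> list of patient ids.
    Returns (relevant_articles, similar_patients).
    """
    # Citation links are used in both directions: citing or cited by.
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    relevant_articles = {}
    similar_patients = {}
    for article, patients in patients_by_article.items():
        # Relevant articles: citation neighbors plus the article itself.
        relevant = neighbors[article] | {article}
        for patient in patients:
            relevant_articles[patient] = relevant
            similarity = {}
            for other_article in relevant:
                # Same article -> similarity 2, other relevant article -> 1.
                level = 2 if other_article == article else 1
                for other in patients_by_article.get(other_article, []):
                    if other != patient:
                        similarity[other] = level
            similar_patients[patient] = similarity
    return relevant_articles, similar_patients


# Toy example with hypothetical article and patient identifiers.
citations = [("A", "B"), ("C", "A")]
patients_by_article = {"A": ["a1", "a2"], "B": ["b1"]}
relevant, similar = annotate_relations(citations, patients_by_article)
```

In this toy example, patient `a1` is linked to articles A, B, and C, is similar to `a2` at level 2 (same article), and to `b1` at level 1 (article connected by a citation).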
The extraction quality of PMC-Patients and of the two experts is shown in Table 1. The results show that the patient note spans extracted in PMC-Patients are of high quality, with a strict F1 score above 90%, which is discussed further in Section 4. In addition, the extracted demographics are close to 100% correct.

Conditions We also analyze the medical conditions associated with the patients: for PMC-Patients, we use the MeSH Diseases terms of the articles as a proxy; for MIMIC, we use the ICD codes. The most frequent medical conditions are shown in Figure 5. In PMC-Patients, the majority (16/30) of frequent conditions are related to cancer, with an interesting exception of COVID-19 as the second most frequent condition. In MIMIC-III, severe non-cancer diseases (e.g. congestive heart failure) have the highest relative frequencies, and their absolute values are much higher than those of the most frequent conditions in PMC-Patients. For example, hypertension and lung neoplasms are the most frequent conditions in MIMIC-III and PMC-Patients, respectively. Over 40% of MIMIC-III patients have hypertension, while less than 4% of patients in PMC-Patients have lung neoplasms. In addition, PMC-Patients covers 4031/4933 (81.7%) of MeSH Diseases terms, a higher coverage than the 6770/14567 (46.5%) of ICD codes covered by MIMIC-III.

In PMC-Patients, there are over 3M patient-article relevance annotations, with an average of 18.64 articles per patient, bridging patient notes and PubMed publications, and nearly 300k patient-patient similarity annotations at two different levels, with an average of 1.76 per patient.

Though the patient notes in PMC-Patients are relatively clean (shown in Section 2.3), a large number of other potential patient notes in PMC are missed since the extraction triggers and the filters (especially demographic_filter) are quite strict.
The goal of the Patient Note Recognition (PNR) task is to identify patient notes in the rest of PMC, where patient notes cannot be easily extracted by heuristics, and include such notes in PMC-Patients to form a "PMC-Patients-Large" dataset. Heuristically extracted patient notes in PMC-Patients can be used to train the PNR models.

We model PNR as a paragraph-level sequential labeling task, similar to the named entity recognition (NER) task. For each article, given as input a sequence of texts $p_1, p_2, \dots, p_n$, where $n$ is the number of paragraphs, the output is a sequence of BIO tags $t_1, t_2, \dots, t_n$. Annotations are generated automatically during extraction. We randomly sample 5k articles as the dev set and another 5k as the test set. Notes extracted from articles in each split are denoted as PMC-Patients-train/dev/test, respectively. Table 3 shows the dataset statistics.

PNR performance is evaluated at two levels:
Note level: Similar to the NER F1 score, only valid predicted spans are taken as predicted notes, and predictions matching a ground truth note exactly are considered correct. Precision, recall, and F1 score are reported.
Paragraph level: For each article, precision is defined by matching each predicted span with every true span, taking the maximum percentage of overlapped paragraphs (divided by the length of the predicted span), and averaging over all predictions. Recall is defined in reverse, by matching each true span with the predicted spans, dividing the maximum overlap by the length of the true span, and averaging over all true spans. F1 is calculated from the precision and recall defined above and averaged over articles.

We use the final expert annotations (500 articles) in Section 2.3 as ground truths to evaluate the PNR performance of different baseline models and annotations. The results are reported in Table 4.
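The paragraph-level metric defined above can be made concrete with a short sketch for a single article. Spans are represented as (start, end) paragraph indices with an exclusive end. How invalid tag sequences are handled is not specified in the text, so ignoring an "I" tag that does not continue a span is an assumption, as are all function names.

```python
def bio_to_spans(tags):
    """Decode paragraph-level BIO tags into (start, end) spans, end exclusive.

    Assumption: an "I" tag that does not follow a "B" or "I" tag is ignored.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            continue  # extends the open span, or is ignored if none is open
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def overlap(a, b):
    """Number of paragraphs shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def directed_score(spans, references):
    """Average over spans of (best overlap with any reference) / span length."""
    if not spans:
        return 0.0
    total = 0.0
    for span in spans:
        best = max((overlap(span, ref) for ref in references), default=0)
        total += best / (span[1] - span[0])
    return total / len(spans)


def paragraph_level_scores(pred_spans, true_spans):
    """Paragraph-level precision, recall, and F1 for one article."""
    precision = directed_score(pred_spans, true_spans)
    recall = directed_score(true_spans, pred_spans)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


pred = bio_to_spans(["O", "B", "I", "I", "O", "B"])
gold = bio_to_spans(["O", "B", "I", "O", "O", "B"])
p, r, f1 = paragraph_level_scores(pred, gold)
```

Here the first predicted note covers three paragraphs, two of which overlap the true note, so its contribution to precision is 2/3, while both true notes are fully covered and recall is 1. The corpus-level score would then be the mean of the per-article values.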
Automated extraction of PMC-Patients yields high quality, with a note-level F1 greater than 90%, and is comparable to human expert performance in terms of paragraph-level metrics. BioBERT achieves nearly 90% note-level F1, and the paragraph-level F1 of CNN-LSTM-CRF is only slightly lower than that of the annotations, illustrating the possibility, enabled by PMC-Patients, of collecting a high-quality "PMC-Patients-Large".

The goal of the patient-patient similarity (PPS) task is to measure the similarity between any given pair of patients, which is the core task of patient-patient retrieval (PPR). Models pre-trained on the PPS dataset can serve as a reranker for the PPR task in Section 6.

We model PPS as a 3-way classification task: each instance consists of a pair of notes $(n_0, n_1)$ and a label $l \in \{0, 1, 2\}$. Labels 1 and 2 represent the respective similarities (positive samples), and label 0 is given to randomly sampled irrelevant notes (negative samples). The ratio of positive to negative samples is 1:1. For the training set, both notes are collected from PMC-Patients-train, while for the dev/test set, one note is from PMC-Patients-dev/test and the other from the union of PMC-Patients-train and PMC-Patients-dev/test. Dataset statistics are reported in Table 5. Performance is evaluated by accuracy.

Logistic Regression For each pair of patient notes, we extract the following features to run a logistic regression (Berkson, 1944): 1. age difference: $|a_1 - a_2|$, where $a_1$ and $a_2$ are the ages of the patients; 2. gender difference: $\mathbb{I}(g_1 \neq g_2)$, where $\mathbb{I}$ denotes the indicator function and $g_1, g_2$ are the genders of the patients; 3. topic similarity: $\mathrm{DSC}(E_1, E_2)$, where DSC denotes the Dice similarity coefficient (Dice, 1945) and $E_1, E_2$ are the sets of named entities detected by Scispacy (Neumann et al., 2019) in each note.

BERT We fine-tune a BioBERT and a Clinical BERT classifier that takes two notes separated by a "[SEP]" token as input and outputs the probabilities of the three labels.
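The three logistic regression features described above can be sketched as follows. The Dice similarity coefficient is $\mathrm{DSC}(E_1, E_2) = 2|E_1 \cap E_2| / (|E_1| + |E_2|)$. The sketch takes entity sets as plain Python sets rather than calling Scispacy, returns 0 when both sets are empty, and encodes gender difference as 1 when the genders differ; these choices and the function names are assumptions for illustration, not the authors' implementation.

```python
def dice(entities_1, entities_2):
    """Dice similarity coefficient between two entity collections."""
    e1, e2 = set(entities_1), set(entities_2)
    if not e1 and not e2:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return 2 * len(e1 & e2) / (len(e1) + len(e2))


def pair_features(age_1, gender_1, entities_1, age_2, gender_2, entities_2):
    """Feature vector for a pair of patient notes.

    Returns [age difference, gender difference indicator, topic similarity].
    """
    return [
        abs(age_1 - age_2),
        float(gender_1 != gender_2),  # assumed form of the indicator
        dice(entities_1, entities_2),
    ]


# Hypothetical entity sets for two notes.
features = pair_features(
    53, "F", {"EHE", "CD34", "liver"},
    40, "M", {"EHE", "CD34", "brain"},
)
```

The resulting vectors could then be passed to any standard logistic regression implementation, with the labels 0, 1, and 2 as classes.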
Results on the test set are reported in Table 6. Logistic regression, even using only topic similarity as a single feature, performs quite well, indicating that patients annotated as similar generally share similar topics, with more co-mentioned named entities. Besides, the ability of such a model to distinguish positive and negative samples by topic similarity confirms the diversity of topics in PMC-Patients. BERT outperforms logistic regression by a large margin, and BioBERT slightly outperforms Clinical BERT. The confusion matrix shown in Figure 7 demonstrates the model's capacity to distinguish between different levels of similarity, and the most frequent mistake is labeling pairs with similarity 1 as irrelevant.

It should be noted that many notes in PMC-Patients have token counts far exceeding BERT's 512-token limit, and truncation is applied in our baselines, which inevitably loses information. Efficient transformers (Tay et al., 2020), represented by BigBird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020), are potential solutions, and we leave that for future work.

The goal of the Patient-Patient Retrieval (PPR) task is to retrieve, for any given patient note, similar patient notes from a large database (e.g., a local EHR database or PMC-Patients). The diagnosis, treatment responses, and prognosis of similar patients can provide evidence-based decision support for the given patient.

We model PPR as a 2-grade retrieval task, without distinguishing between the two levels of similarity. Each patient note serves as a query and the other notes constitute the document collection D. Note that some notes in PMC-Patients do not have similarity annotations and are excluded from the queries. For each split, queries are collected from the corresponding split of PMC-Patients. Documents of the training set only consist of PMC-Patients-train, while the dev/test set uses the union of PMC-Patients-dev/test and PMC-Patients-train as the document collection.
Dataset statistics are reported in Table 7. PPR performance is evaluated with mean reciprocal rank (MRR), precision at 10 (P@10), and recall at 1k and 10k (R@1k, R@10k).

DSC We use the topic similarity defined in Section 5 as the patient similarity score and rank all the patients.

The goal of the Patient-Article Retrieval (PAR) task is to find, for any given patient note, relevant scientific articles in PubMed. Such articles can provide clinical decision support for the management of the given patient.

We model PAR as a 2-grade retrieval task where each patient note serves as a query and the document collection D consists of the titles and abstracts of all PubMed publications with machine-readable titles or abstracts. The dataset split is based on the split of patient notes (queries). Dataset statistics are reported in Table 9.

KNN For each query patient in the test set, we first retrieve similar patients (nearest neighbors) in only PMC-Patients-train using BM25, then take their relevance annotations (the training set of PAR) as candidates. Note that some articles are relevant to multiple patients, so the relevance score between an article and the query is defined as the sum of the similarity scores of all retrieved patients to which the article is relevant.

BM25 performs much worse on PAR than PPR, as shown in Table 8.

PMC-Patients is basically a case report dataset that: 1. contains patient notes similar to clinical narratives and 2. is annotated with patient-patient similarity and patient-article relevance information. In this section, we discuss four categories of related works: case report datasets, clinical narrative datasets, patient-patient retrieval datasets, and patient-article retrieval datasets.

Case report datasets are typically extracted from published articles (case reports). Caufield et al. reformulate about 12k case reports into 100k gap-filling machine reading comprehension instances. Compared to these works, PMC-Patients is much larger, in English, and contains document-level relevance annotations.
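For reference, the retrieval metrics used to evaluate PPR (MRR, P@10, and recall at k) can be sketched as below. The sketch assumes each query comes with a ranked list of unique document identifiers and a set of relevant identifiers; the function names and the dictionary layout are hypothetical, and established evaluation tools may differ in edge-case handling.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions filled with relevant documents."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def evaluate(runs, qrels, k_precision=10, k_recall=1000):
    """Average the per-query metrics over all queries in `runs`.

    runs: dict query id -> ranked list of document ids (no duplicates).
    qrels: dict query id -> set of relevant document ids.
    """
    n = len(runs)
    mrr = sum(reciprocal_rank(r, qrels[q]) for q, r in runs.items()) / n
    p_at_k = sum(precision_at_k(r, qrels[q], k_precision) for q, r in runs.items()) / n
    r_at_k = sum(recall_at_k(r, qrels[q], k_recall) for q, r in runs.items()) / n
    return {"MRR": mrr, "P@k": p_at_k, "R@k": r_at_k}


# Toy run with hypothetical identifiers.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
scores = evaluate({"q1": ranked}, {"q1": relevant}, k_precision=2, k_recall=4)
```

In the toy run, the first relevant document appears at rank 2, one of the top two results is relevant, and two of the three relevant documents are retrieved within the top four.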
Apart from MIMIC, which has already been discussed and compared above, several other clinical narrative datasets have been proposed, mostly collected from EHRs. MTSamples contains 5k sample transcription reports. The THYME project (Styler et al., 2014), which aims to extract useful temporal relations from clinical narratives, has released a corpus of about 1.2k clinical, pathology, and radiology records for patients with brain and colon cancer. The n2c2 (originally named i2b2) project has released many clinical note datasets that are specifically annotated with certain information, e.g., one dataset annotates 14k sentence pairs from MIMIC for clinical language inference. These datasets are manually annotated and thus limited in size, typically containing several hundred to a few thousand clinical note pieces, which only reflect part of patients' medical situations. Grabar et al. (2020) show that case reports and clinical narratives have high textual similarity, so PMC-Patients has the potential to facilitate the processing and understanding of clinical narratives.

To build a patient-patient retrieval dataset, it is essential to define and calculate patient-patient similarity, which is a hard research question (Seligson et al., 2020) with many ongoing efforts (Sharafoddini et al., 2017; Parimbelli et al., 2018). However, to the best of our knowledge, no publicly available similar patient retrieval dataset exists, possibly due to the difficulty of defining patient similarity. There are only a few works on similar patient retrieval (Plaza and Díaz, 2010; Arnold et al., 2010), all of which use private datasets and annotations. In PMC-Patients, we leave the hard task of defining patient-patient similarity to case report authors, who usually cite other case reports that contain similar patients and describe their similarity in the corresponding anchor texts.
The Text REtrieval Conference (TREC) has organized several challenges on retrieving relevant documents for given patient queries: the Clinical Decision Support (CDS, 2014-2016) tracks focus on retrieving relevant PMC articles for given patient summary notes with specific intents (e.g., finding treatment/diagnosis) (Simpson et al., 2014; Roberts et al., 2015, 2016); the Precision Medicine (PM, 2017-2020) tracks focus on retrieving relevant PubMed articles and eligible clinical trials for given semi-structured patient queries, which include demographics, cancer types, gene mutations, and tentative treatments (Roberts et al., 2017, 2018, 2019, 2020); the Clinical Trial (CT, 2021) track focuses on retrieving eligible clinical trials for given patient summary notes. Each year, 30-75 patient queries (topics) are released and about 30k submitted patient-document pairs are annotated with relevance for evaluation. Annotations from previous years are usually used to train the next year's participating systems. Though the annotation size is relatively large, the diversity of patient queries is severely limited. PMC-Patients provides over 3M relevant article annotations of 167k patients, where relevance is also defined by the case report authors.

In this paper, we present PMC-Patients, a publicly available, large-scale, high-quality, and diverse patient note dataset that is annotated with patient-patient similarity and patient-article relevance labels. PMC-Patients can be used to find more patient notes, calculate patient-patient similarity, and retrieve similar patients or relevant articles for any given patient note. PMC-Patients will be updated in the future from the following perspectives: 1. collecting a "PMC-Patients-Large" with distant supervision; 2. incorporating human evaluations of PPR and PAR datasets; and 3. leveraging diverse information including tables, figures, and journal attributes.
There is also much room for improving downstream tasks' performances. More elaborate sampling methods in PPS, such as ones that introduce multi-hop citations (connected by intermediate articles, rather than directly cited), and corresponding graph-based encoding methods may help rerank candidates and generate better results in PPR. Besides, more sophisticated rerankers, especially ones able to process long texts, can be designed to improve performances in PPR and PAR tasks. Table 11 gives several samples of PMC-Patients. To give examples of the relational annotations, we sample two anchor patients and then take samples from their annotations randomly. A 53-year-old woman was referred to our clinic with waist and back pain and numbness of the lower limbs for more than 1 month. The pain was not related to her posture and became more prominent when she moved. She had a medical history of lumbar disc herniation and no history of trauma. On initial evaluation, her vital signs were stable. Apart from the pain of the waist and back, physical examination revealed unremarkable findings. Routine blood tests were obtained. Further, liver function tests revealed normal results. The blood CA199, CA125, CEA, and AFP levels were also within normal limits. Computed tomography of the chest revealed scattered pulmonary nodules with calcifications associated with a soft tissue mass measuring 3.3 cm × 2.4 cm and without pleural thickening at the superior lobe of the right lung () (SOMATOM definition, Siemens Healthcare, Erlangen, Germany; tube voltage, 100-120 kVp; tube current, 450 mA; slice thickness, 0.625 mm; pitch, 0.992:1; rotation speed: 0.5 s/rot; ASIR-V:30%.). Enlarged lymph nodes of the right hilar were also evident. 
Abdominal contrast-enhanced CT revealed diffuse lesions with massive calcifications in the liver, which shows faint peripheral enhancement in the arterial phase and low enhancement in the portal phase (Iopromide Injection, Bayer Pharma AG; the arterial phase and portal venous phase were obtained at 25 s and 60 s after contrast injection.). The largest lesion measuring 10.2 cm × 5.9 cm was located in the right lobe of the liver. CT examination also revealed osteolytic lesions with a massive thick sclerotic rim in the right second rib, 11th thoracic vertebra, and first lumbar spine. Bone scintigraphy with 99mTc-methylene diphosphonate showed multiple hypermetabolic activities in the involved bones. Cerebral magnetic resonance imaging (MRI) revealed no anomalies. The patient underwent transthoracic needle biopsy of the largest pulmonary lesion located in the right superior lobe. Histopathological analysis revealed epithelioid cells arranged in a glandular pattern with clear cytoplasm. Immunohistochemical staining showed that the neoplastic cells were positive for CD31, CD34, CAMTA1, and EMA, but negative for ERG, TFE3, PCK, and desmin, with a Ki-67 index rate of 10%. Histopathological examination indicated a rare low-grade malignant vascular neoplasm, confirming the diagnosis of EHE. Considering the multiple intra-pulmonary, right hilar lymph node, liver, and bone metastases, the patient was treated with chemotherapy with paclitaxel liposome (240 mg/m2; day 1) and carboplatin (550 mg/m2; day 1). At 8 months, the patient had completed four cycles of combination therapy. There were no changes in the patient's disease status on CT at the 8-month follow-up visit.

Similarity 1

The patient was a 40-year-old Asian male with a four-month history of a dry cough, dyspnea and hemoptysis. The patient was a heavy smoker with an unremarkable medical history.
A chest computed tomography (CT) scan revealed the presence of multiple nodules scattered in both lungs without hilar and mediastinal lymphadenopathy or pleural effusion (). Initially, a bronchofibroscope examination failed to reveal any abnormalities. In order to obtain a definitive diagnosis, the tissue specimens were taken by diagnostic right thoracoscopic lung biopsy. The histological diagnosis of PEH was based on the pathological examination. The pathological examination of the biopsied specimen revealed that the center of the pulmonary nodule was sclerotic and hypocellular, with hyalinization and calcification. The tumor cells were round with abundant eosinophilic cytoplasm, intracytoplasmic vacuolization and a signet ring-like appearance (). Immunohistochemical analysis revealed that the tumor cells were positive for the endothelial markers, factor-VIII-related antigen and CD34 (). PEH disease progressed rapidly in this patient one month after pulmonary surgery. The T1-weighted magnetic resonance imaging (MRI) section examination revealed a nodular lesion in the brain, which was strongly suggestive of brain metastasis (). The CT revealed a spreading of the nodules throughout both lungs three months after surgery (). At this point, the patient began treatment with one cycle of chemotherapy with cisplatin, paclitaxel and endostar (15 mg/day for 14 consecutive days). The patient demonstrated improvements in dyspnea and a dramatic improvement in their clinical status. However, no change in the size of the pulmonary nodules over the period of chemotherapy was observed. The patient subsequently received another two cycles (two, bi-weekly) of chemotherapy treatment with carboplatin, paclitaxel and endostar. No significant reduction was observed in the tumor size and number, and the disease progressed. Following three months of stabilization, progression of the disease was evident. Therefore, the patient was discharged without further treatment. 
The patient survived for six months following the initial diagnosis. Similarity 1 The patient was a 54-year-old female, non-smoker, who complained of chest pain, dyspnea and a dry cough for 11 months. A chest CT scan revealed intrapulmonary masses in the bilateral superior lobes, and a small right pleural effusion. Abdominal and pelvic CT scans did not reveal any lesions. A thoracoscopic lung biopsy from the right superior lobe was performed in order to examine the nodules. The postoperative course of the patient during follow-up was uneventful. Examination of the nodular sections revealed clusters of neoplastic cells as well as individual tumor cells. The normal pulmonary architecture was replaced by alveoli containing nodules of neoplastic cells and matrix. The histological features of pulmonary epithelioid hemangioendothelioma (EHE) were evident with confirmatory CD31 and CD34 immunohistochemical stains. No markers of mesothelial and muscular differentiation were observed. As a result, the patient was diagnosed with PEH. Immediately following confirmation of the diagnosis, combination chemotherapy with carboplatin, paclitaxel and bevacizumab (15 mg/kg) was initiated for six cycles, without distinct toxicities. The stabilization of the disease was evident, as the chest pain gradually subsided. Following eight months of stabilization, progression of the disease was evident. The patient survived for 15 months following the initial diagnosis. Similarity 1 Epithelioid hemangioendothelioma: a vascular tumor often mistaken for a carcinoma Epithelioid hemangioendothelioma is a unique tumor of adult life which is characterized by an "epithelioid" or "histiocytoid" endothelial cell. Forty-one cases of this rare tumor have been recognized at the Armed Forces Institute of Pathology. They may occur in either superficial or deep soft tissue, and in 26 cases appeared to arise from a vessel, usually a medium-sized or large vein. 
They are composed of rounded or slightly spindled eosinophilic endothelial cells with rounded nuclei and prominent cytoplasmic vacuolization. The latter feature probably represents primitive lumen formation by a single cell. The cells grow in small nests or cords and only focally line well-formed vascular channels. The pattern of solid growth and the epithelioid appearance of the endothelium frequently leads to the mistaken diagnosis of metastatic carcinoma. The tumor can be distinguished from a carcinoma by the lack of pleomorphism and mitotic activity in most instances and by the presence of focal vascular channels. Ultrastructural study in four cases confirmed the endothelial nature of the tumor in demonstrating cells surrounded by basal lamina, dotted with surface pinocytotic vesicles, and occasionally containing Weibel-Palade bodies. Follow-up information in 31 cases indicated that 20 patients were alive and well following therapy; three developed local recurrences and six metastases. It is suggested the term epithelioid hemangioendothelioma be used to designate these biologically "borderline" neoplasms. The significance of the epithelioid endothelial cell is not entirely clear. Since it may be observed in both benign and malignant vascular lesions, its presence alone does not define a clinicopathologic entity.

A 60-year-old male presented to our outpatient clinic for a routine visit. He had no complaints except for minimal hand dryness and denied fatigue, generalized weakness, heat or cold intolerance, constipation, diarrhea, hair loss, or any recent weight changes. His past medical history was pertinent for a basal cell carcinoma of scalp treated with extensive excision 3 years prior. On physical examination, an enlarged, firm, non-tender thyroid gland was appreciated with about 3 cm left thyroid lobe mass felt on palpation. Xerosis of hands, arms, and legs were also noted with no pruritus or erythematous rash.
No palpable lymphadenopathies or hepatosplenomegaly was noted. He denied any lump like sensation in his neck, neck pain, difficulty in swallowing, hoarseness, or dry cough. Signs of tracheal, esophageal, or neck vein compression were not found. He was an active smoker (35 pack-years) but did not have any history of radiation exposure, family history of thyroid-related disorders, or malignancy. His diet was normal with iodine-rich meals. Complete blood count and basic metabolic profile were within normal limits. TSH was found to be elevated at 10.14 uIU/mL. All other lab values were within normal limit. Thyroid ultrasound was done which revealed large 5.8 × 3.1 × 2.5 cm hypoechoic mass occupying almost complete volume of the left thyroid lobe, and persistent elevation of TSH was found in repeat thyroid panel. Patient underwent fine needle aspiration (FNA) of the mass which revealed 'indeterminate follicular neoplasm' (Bethesda category IV). The cytologic features were suspicious for neoplastic process, and differential diagnosis was broad including medullary thyroid carcinoma, neuroendocrine tumors, hematolymphoid process, as well as metastatic malignancy. With concern for malignancy, Positron Emission Tomography–Computed Tomography (PET–CT) scan was done. This revealed intensely hypermetabolic left thyroid mass and mildly hypermetabolic and prominent juxta thyroid lymph nodes in the left neck. Diffuse uptake throughout the remainder of the thyroid gland was present – a common finding associated with Hashimoto's thyroiditis. Considering probable malignancy which was thus far undefined, decision for thyroid lobectomy was made. The pathology and immunohistochemical (IHC) stains of the excised left thyroid lobe supported histologic impression of extra nodal marginal B-cell lymphoma (MALT lymphoma) with Hashimoto's thyroiditis. IHC indicated positivity for CD20, CD79A, CD43, and CD45.
IHC of the superior level 6 lymph node did not reveal any pathologic changes. Bone marrow was performed subsequently which did not show any evidence of infiltrating lymphoma. The patient was determined to be at stage IAE (Ann Arbor staging) considering his disease was limited to the thyroid gland without marrow infiltration, evidence of metastatic disease, or B symptoms. He was started on thyroxine replacement and then referred to Medical and Radiation Oncology, where further radiation therapy was recommended. He remains in close follow-up with his primary care provider and continues to be in remission. He deferred further radiation therapy. patient_uid An 18-year-old boy with a 2-year history of UC was referred to our hospital from a clinic for endoscopy. The boy had been diagnosed with UC of the total colitis type at a large general hospital 2 years earlier. Our hospital performed colonic endoscopy, which revealed erosions, ulcers, and edema continuously from the anus to the cecum. The terminal ileum was not involved. Six biopsies were obtained from various sites of the colorectum. The microscopic examinations of all six biopsies showed severe infiltration of atypical small lymphocytes. They showed hyperchromatic nuclei and increased nucleocytoplasmic ratio. Immunoblastic cells were scattered. Centrocyte-like atypical lymphocytes (CCLs), monocytoid cells, and plasma cell differentiations were seen in some places. Vague germinal centers were present, and apparent lymphoepithelial lesions (LELs) were seen. No crypt abscesses were seen, and there were few neutrophils. No apparent features of UC such as crypt abscess, deletion of goblet cells, abnormal branching of the crypts and cryptal atrophy were seen. No Crohn's granuloma was seen. They were also positive for CD45RO, CD3, and CD15, but these positive cells were very scant compared with CD20 and CD79α. 
The infiltrates were negative for CD10, CD30, CD56, cytokeratin (CK) AE1/3, CK CAM5.2, CK34BE12, CK5, CK6, CK7, CK8, CK14, CK18, CK19, CK20, EMA, chromogranin, synaptophysin, NSE, S100 protein, CEA, CA19-9, p63, and HMB45. Without clinical information, the appearances are those of MALT lymphoma. However, given the clinical information, there was hesitation in making the diagnosis of MALT lymphoma. The pathological diagnosis made by the author was atypical lymphoid infiltrates indistinguishable from MALT lymphoma in an adolescent male patient. The plan was to follow the patient up without therapy for MALT lymphoma but with treatment of UC with salazosulfapyridine and steroids. Malignant lymphoma of mucosa-associated lymphoid tissue. A distinctive type of B-cell lymphoma. As illustrated in the two cases described in this paper, close morphologic and immunohistochemical similarities exist between Mediterranean lymphoma (MTL) and primary gastrointestinal lymphoma of follicle center cell (FCC) origin as it occurs in Western countries. Similarities between the two conditions include a dense noninvasive monotypic lamina propria plasma cell infiltrate, present in all cases of MTL and in some cases of Western gastrointestinal FCC lymphoma, and an invasive infiltrate of FCCs morphologically distinct from the plasma cells. A distinctive lesion produced by individual gland invasion characterizes both types of lymphoma. A clonal relationship between the lamina propria plasma cells and the invasive FCCs, long suspected but never proved in MTL, can be demonstrated in Western cases. Many of the histologic and clinical features common to these lymphomas can be explained in the context of the normal maturation sequences of gut-associated lymphoid tissue. It is suggested that MTL and Western cases of primary FCC gastrointestinal lymphoma share a common histogenesis from mucosa-associated lymphoid tissue. 
-pubmed_id 25253369 Extranodal marginal zone B-cell lymphoma of Mucosa-Associated Lymphoid Tissue (MALT lymphoma) in ulcerative colitis. Extranodal marginal zone B-cell lymphoma of mucosa-associated lymphoid tissue (MALT lymphoma) occurring in inflammatory bowel diseases, including ulcerative colitis (UC) and Crohn's disease, has been reported, although it is extremely rare. An 18-year-old man with a two-year history of UC underwent colon endoscopy, and was found to have active total UC ranging from anus to cecum. Six biopsies were obtained. The microscopic examinations showed severe infiltrations of atypical small lymphocytes. They showed hyperchromatic nuclei and increased nucleocytoplasmic ratio and scattered immunoblastic cells. Centrocyte-like atypical lymphocytes, monocytoid cells, and plasma cells were seen in some places. Vague germinal centers were present, and apparent lymphoepithelial lesions were seen. No crypt abscesses were seen, and there were few neutrophils. No apparent other findings of UC were seen. Immunohistochemically, the atypical lymphocytes were positive for vimentin, CD45, CD20, CD79α, CD138, κ-chain, λ-chain, and p53 and Ki-67 antigen (labeling index = 63%). They were also positive for CD45RO, CD3, and CD15, but these positive cells were very scant compared with CD20 and CD79α. They were negative for CD10, CD30, CD56, cytokeratin (CK) AE1/3, CK CAM5.2, CK34BE12, CK5, CK6, CK7, CK8, CK14, CK18, CK19, CK20, EMA, chromogranin, synaptophysin, NSE, S100 protein, CEA, CA19-9, p63, and HMB45. Without clinical information, the appearances are those of MALT lymphoma. However, given the clinical information, there was hesitation in making the diagnosis of MALT lymphoma. It is only mentioned herein that atypical lymphocytic infiltrations indistinguishable from MALT lymphoma occurred in an 18-year-old male patient with a two-year history of UC. 