Spanish Biomedical and Clinical Language Embeddings

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino, Ona de Gibert, Aitor Gonzalez-Agirre, Marta Villegas

2021-02-25

Abstract

We computed both word and sub-word embeddings using FastText. For the sub-word embeddings, we selected the Byte Pair Encoding (BPE) algorithm to represent the sub-words. We evaluated the biomedical word embeddings, obtaining better results than previous versions, which suggests that with more data we obtain better representations.

1 Introduction

The effectiveness of BERT-like (Devlin et al., 2019) and GPT-like (Brown et al., 2020) language models has been corroborated for most Natural Language Processing tasks; however, computing more traditional embeddings is still useful because, for tasks with little data or in scenarios without large computational resources, they remain competitive. The new embeddings presented here are built on a new Spanish Biomedical Corpus and a Spanish Clinical Corpus, and add BPE sub-word embeddings.

We explain the process of generating the embeddings from two unprecedented Spanish health-domain corpora. First, we describe the data and the cleaning process; then we explain the embedding methods; finally, we report the evaluation results.

We have developed two types of embeddings using two different corpora: the Spanish Biomedical Corpus and the Spanish Clinical Corpus. Since the Biomedical Corpus is of a much larger magnitude in size than the Clinical Corpus, we decided to compute the embeddings separately and provide them as distinct resources.

The Biomedical Corpus gathers a variety of medical resources, namely scientific literature, clinical cases and crawled data, including several PDF collections, SciELO data and a large medical crawl (Krallinger et al., 2021). See Villegas et al. (2018) for further details on the MeSpEn resource, which compiles many of the previously mentioned corpora. The Clinical Corpus comprises five main corpora, whose content consists mainly of COVID-19 cases and ictus (stroke) cases.

The sources of both the biomedical and the clinical corpora come in multiple formats: PDF, WARC, plain text, etcetera. We cleaned each corpus independently, applying a cleaning pipeline with customized operations designed to read data in different formats, split it into sentences, perform language detection, remove noisy and malformed sentences, deduplicate, and eventually output the data with its original document boundaries. Finally, in order to avoid repetitive content, we concatenated all the individual corpora and deduplicated once more the documents they have in common.

We provide two types of embeddings: FastText word embeddings and BPE sub-word embeddings. FastText embeddings are described in Bojanowski et al. (2017). We tokenized the sentences and used the training script available on the FastText website. As embedding sizes we used 50, 100 and 300 dimensions, and as embedding methods we used both CBOW and Skip-gram. For the Biomedical Corpus, we set the minimum word-frequency threshold to 1; for the Clinical Corpus, we increased the threshold to 4 to avoid leaking sensitive data.

For the BPE sub-word embeddings, the vocabulary size parameter controls the sub-word splitting mechanism. We set the vocabulary size to 8,000 for the clinical domain and 10,000 for the biomedical domain.
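To make the process described above concrete, the sketches that follow illustrate the three stages of this section: corpus cleaning, word-embedding training and sub-word embedding training. First, a minimal sketch of the kind of cleaning pipeline we describe (sentence splitting, language detection, removal of short or malformed sentences, deduplication). It is illustrative only: the file layout is hypothetical and the langdetect library is a stand-in for an unspecified language detector.

    import hashlib
    import re

    from langdetect import detect, DetectorFactory  # stand-in detector; ours is unspecified

    DetectorFactory.seed = 0  # make language detection deterministic

    SENT_END = re.compile(r'(?<=[.!?])\s+')  # naive sentence splitter

    def clean_document(text, min_chars=20):
        """Split a raw document into sentences and keep well-formed Spanish ones."""
        kept = []
        for sent in SENT_END.split(text):
            sent = sent.strip()
            if len(sent) < min_chars:     # drop noisy or malformed fragments
                continue
            try:
                if detect(sent) != 'es':  # keep Spanish sentences only
                    continue
            except Exception:             # detector fails on e.g. digit-only input
                continue
            kept.append(sent)
        return kept

    def deduplicate(documents):
        """Drop exact duplicate documents by hashing their cleaned text."""
        seen, unique = set(), []
        for sentences in documents:
            digest = hashlib.md5(' '.join(sentences).encode('utf-8')).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(sentences)
        return unique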
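Second, the word embeddings: we trained them with the official FastText script, but the same configuration can be expressed through the official FastText Python bindings, as in the sketch below. Corpus and output file names are placeholders; the loop shows the biomedical setting (minCount=1), while the clinical corpus would use minCount=4.

    import fasttext

    # Train CBOW and Skip-gram models at every embedding size we provide.
    for method in ('cbow', 'skipgram'):
        for dim in (50, 100, 300):
            model = fasttext.train_unsupervised(
                'biomedical_tokenized.txt',  # one pre-tokenized sentence per line
                model=method,
                dim=dim,
                minCount=1,                  # 4 for the clinical corpus
            )
            model.save_model(f'biomedical_{method}_{dim}d.bin')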
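Third, the sub-word embeddings. We do not fix a specific BPE implementation in this description; following BPEmb (Heinzerling and Strube, 2018), the sketch below uses SentencePiece to learn the BPE vocabulary, segments the corpus into sub-word units, and trains FastText over those units. File names and the Skip-gram/300-dimension choice are illustrative; the 8,000-unit vocabulary is the clinical setting stated above.

    import fasttext
    import sentencepiece as spm

    # 1. Learn the BPE vocabulary (8,000 units for clinical, 10,000 for biomedical).
    spm.SentencePieceTrainer.train(
        input='clinical.txt', model_prefix='clinical_bpe',
        vocab_size=8000, model_type='bpe',
    )

    # 2. Segment the corpus into BPE units, one sentence per line.
    #    For the uncased variant, lowercase each line before training and encoding.
    sp = spm.SentencePieceProcessor(model_file='clinical_bpe.model')
    with open('clinical.txt') as src, open('clinical_bpe.txt', 'w') as dst:
        for line in src:
            pieces = sp.encode(line.strip(), out_type=str)
            dst.write(' '.join(pieces) + '\n')

    # 3. Train FastText embeddings over the BPE units (no frequency threshold).
    model = fasttext.train_unsupervised('clinical_bpe.txt', model='skipgram',
                                        dim=300, minCount=1)
    model.save_model('clinical_bpe_skipgram_300d.bin')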
For the uncased version, the corpus is lowercased before computing the BPE vocabulary. After computing the BPE sub-words, FastText embeddings are computed over them using the official script, omitting the word-frequency threshold in the case of the clinical corpus.

We evaluated the biomedical embeddings using the same evaluation scenario as previous work (Soares et al., 2019).

In this work, we provide new materials for the Natural Language Processing community for the medical domain in Spanish. With these resources, we aim to alleviate the lack of resources in medical AI caused by the sensitivity of medical data. We explained how our corpora are composed, the embedding methods we used and the evaluation we followed. We have shown that, with a larger corpus, our embeddings capture more information and obtain better evaluation results; they show a steady improvement as more corpora become available. All the embeddings have been uploaded to Zenodo.

Acknowledgements

This work has been partially funded by the State Secretariat for Digitalization and Artificial Intelligence (SEDIA) to carry out specialised technical support activities in supercomputing within the framework of the Plan TL signed on 14 December 2018, and by the ICTUSnet INTERREG Sudoe programme.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135-146.

Tom B. Brown et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019.

Aitor Gonzalez-Agirre, Montserrat Marimon, Ander Intxaurrondo, Obdulia Rabal, Marta Villegas, and Martin Krallinger. 2019. PharmaCoNER: Pharmacological Substances, Compounds and Proteins Named Entity Recognition Track. In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of LREC 2018. European Language Resources Association (ELRA).

Ander Intxaurrondo, [...], Marta Villegas, and Martin Krallinger. 2018. Finding Mentions of Abbreviations and Their Definitions in Spanish Clinical Cases: The BARR2 Shared Task Evaluation Results.

Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, and Jordi Armengol-Estapé. 2019. Medical Word Embeddings for Spanish: Development and Evaluation. In Proceedings of the 2nd Clinical Natural Language Processing Workshop.

Marta Villegas, Ander Intxaurrondo, Aitor Gonzalez-Agirre, Montserrat Marimón, and Martin Krallinger. 2018. The MeSpEN Resource for English-Spanish Medical Machine Translation and Terminologies: Census of Parallel Corpora, Glossaries and Term Translations. European Language Resources Association (ELRA).