key: cord-0317514-z6ch8cua
authors: Ebbehoj, A.; Thunbo, M.; Andersen, O. E.; Glindtvad, M. V.; Hulman, A.
title: Transfer learning for non-image data in clinical research: a scoping review
date: 2021-10-03
journal: nan
DOI: 10.1101/2021.10.01.21264290
sha: 49c85e18c682746215009a179a447c8634a2cb67
doc_id: 317514
cord_uid: z6ch8cua

Background: Transfer learning is a form of machine learning where a model pre-trained on a specific task is reused as a starting point and tailored to another task in a different dataset. While transfer learning has garnered considerable attention in medical image analysis, its use for clinical non-image data is not well studied. Therefore, the objective of this scoping review was to explore the use of transfer learning for non-image data in the clinical literature.

Methods and Findings: We systematically searched medical databases (PubMed, EMBASE, CINAHL) for peer-reviewed clinical studies that used transfer learning on human non-image data. We included 83 studies in the review. More than half of the studies (63%) were published within 12 months of the search. Transfer learning was most often applied to time series data (61%), followed by tabular data (18%), audio (12%) and text (8%). Thirty-three (40%) studies applied an image-based model to non-image data after transforming data into images (e.g. spectrograms). Twenty-nine (35%) studies did not have any authors with a health-related affiliation. Many studies used publicly available datasets (66%) and models (49%), but fewer shared their code (27%).

Conclusions: In this scoping review, we have described current trends in the use of transfer learning for non-image data in the clinical literature. We found that the use of transfer learning has grown rapidly within the last few years. We have identified studies and demonstrated the potential of transfer learning in clinical research in a wide range of medical specialties. More interdisciplinary collaborations and the wider adoption of reproducible research principles are needed to increase the impact of transfer learning in clinical research.

There is no doubt that most clinicians will use technologies integrating artificial intelligence (AI) to automate routine clinical tasks in the future. In recent years, the U.S. Food and Drug Administration has been approving an increasing number of AI-based solutions, dominated by deep learning algorithms [1]. Examples include atrial fibrillation detection via smart watches, diagnosis of diabetic retinopathy based on fundus photographs, and other tasks involving pattern recognition [1]. In the past, the development of such algorithms would have taken an enormous effort, both regarding computational capacity and technical expertise. Nowadays, computational tools, including cloud computing, free software, and training materials, are more easily accessible than ever before [2]. This means that more researchers, with more diverse backgrounds, have access to machine learning, and that they can focus more on the subject matter when developing AI-based solutions for clinical practice.

Despite this trend, machine learning, neural networks, transfer learning, and other elements of AI still seem to be surrounded by mystery in the clinical research community, and AI has yet to reach its potential in the clinic [1].

A neural network is a type of machine learning model inspired by the structure of the human brain (Box 1). In the simplest scenario, the input data flows through layers of artificial neurons, known as hidden layers. Each hidden neuron takes the results from the neurons in the previous layer, calculates a weighted sum, applies a nonlinear function, and feeds the resulting value forward to the next layer. The final layer transforms the results according to the prediction task, for example into probabilities when the task is to predict whether a lung nodule on a chest X-ray is malignant. Fitting or training a neural network means optimizing the weights (parameters) against some performance metric. Deep learning refers to the use of neural networks with several hidden layers of neurons. In the last decade, several neural network architectures have been designed, some consisting of more than 100 layers and tens of millions of weights [3]. In such a deep neural network, neurons at lower levels (i.e. closer to the input layer) 'learn' to recognize lower-level features in the data (e.g. circles, vertical and horizontal lines in images), which the higher-level neurons combine into more complex features (e.g. a face or some text in an image) [4]. Neural networks are popular tools in machine learning as they can approximate any complex nonlinear association. This flexibility, however, comes at a price, as fitting complex neural networks requires very large datasets, which limits the range of fields where their application is feasible.
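The flow of data through a neural network described above can be illustrated with a minimal Python sketch. The layer sizes, weights, and input values are hypothetical and hand-picked for illustration; they do not come from any of the reviewed studies.

```python
import math

def relu(z):
    # A common nonlinear function for hidden neurons
    return max(0.0, z)

def sigmoid(z):
    # Maps any real number to a value between 0 and 1 (a probability)
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    # Each neuron calculates a weighted sum of its inputs, adds a bias,
    # and applies a nonlinear function
    return [activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def forward(x, layers):
    # Feed the values forward, one layer at a time
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output probability
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
prob = forward([1.0, 2.0, 3.0], layers)[0]
print(f"Predicted probability: {prob:.4f}")
```

Training would consist of adjusting the numbers in `layers` so that the predicted probabilities match the observed outcomes as closely as possible.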

Transfer learning circumvents the above-mentioned limitation and unlocks the potential of machine learning for smaller datasets by reusing a pre-trained neural network, built for a specific task and typically on a very large dataset, on another dataset and potentially for a different task. The pre-trained model is also known as the source model, while the new dataset and task are referred to as the target data and target task. A common example is to take a computer vision model, trained to identify everyday objects in millions of images, and further train this model for grading diabetic retinopathy on only a few thousand fundus photographs, instead of training the model from scratch on this smaller dataset [5]. This example demonstrates a type of transfer learning known as fine-tuning, or weight or parameter transfer. Another type of transfer learning is feature-representation transfer, where the features from the hidden layers of a model are used as inputs for another model. Other forms of transfer learning also exist [6-8].
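The mechanics of fine-tuning (parameter transfer) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a logistic regression stands in for a neural network, the source and target datasets are synthetic, and all function names and settings are hypothetical. It shows how training on the target data starts from the source weights rather than from zero; it makes no claim about performance in real applications.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    # weights[0] is the bias; the remaining weights multiply the inputs
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def log_loss(weights, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(initial_weights, data, epochs, lr=0.1):
    # Full-batch gradient descent, starting from a copy of the given weights
    w = list(initial_weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(data) for wj, g in zip(w, grad)]
    return w

def make_data(n, shift, rng):
    # Synthetic data: the label depends on the sum of two inputs and a shift
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        data.append((x, 1 if x[0] + x[1] + shift > 0 else 0))
    return data

rng = random.Random(0)
source = make_data(500, 0.0, rng)   # large source dataset
target = make_data(20, 0.5, rng)    # small target dataset, related task

source_model = train([0.0, 0.0, 0.0], source, epochs=100)
fine_tuned = train(source_model, target, epochs=20)    # start from source weights
scratch = train([0.0, 0.0, 0.0], target, epochs=20)    # start from zero

print("Target loss, source model:", round(log_loss(source_model, target), 3))
print("Target loss, fine-tuned:  ", round(log_loss(fine_tuned, target), 3))
print("Target loss, from scratch:", round(log_loss(scratch, target), 3))
```

The only difference between `fine_tuned` and `scratch` is the starting point of the optimization, which is the essence of parameter transfer.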

Applications of transfer learning are common in computer vision, where large datasets are publicly available to train models that can then be adapted to different domains. One of the most influential datasets is ImageNet, which includes more than a million images from everyday life [9]. Two recent scoping reviews focused on transfer learning for medical image analysis, one of which identified around a hundred articles applying ImageNet-based models for clinical prediction tasks [10, 11]. The number of published articles using transfer learning for medical image analysis approximately doubled every year in the last decade, demonstrating an increasing interest in transfer learning [10, 11]. To our knowledge, a comprehensive overview of the use of transfer learning for non-image data types is lacking, even though tabular and time series data seem to dominate in the clinical literature.

medRxiv preprint doi: https://doi.org/10.1101/2021.10.01.21264290; this version posted October 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Therefore, the objective of this scoping review was to fill this knowledge gap by exploring and characterizing studies that used transfer learning for non-image data in clinical research.

Scoping reviews identify available evidence, examine research practices, and characterize attributes related to a concept (i.e. transfer learning in our case) [12]. This format fits our research objective better than a traditional systematic review and meta-analysis, as the latter requires a more well-defined research question. During the process, we followed the 'PRISMA for Scoping Reviews' guidelines and the manual for conducting scoping reviews by the Joanna Briggs Institute [13, 14].

In accordance with the aim of this review, we only wanted to include studies using transfer learning for a clinical purpose. Similarly, we were only interested in articles indexed in medical databases, as opposed to purely technical databases such as the Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library, with which most clinical researchers and practitioners might be unfamiliar. Moreover, such articles are often written in a technical language, which limits their utility for the clinical research community. By only including articles indexed in medical databases, we could gauge how exposed clinical researchers are to clinical use of transfer learning and could also provide a list of articles that can serve as inspiration to clinical researchers interested in transfer learning.

To be considered for inclusion, studies needed to 1) be published, peer-reviewed, written in English, and indexed in a medical database (defined as PubMed, EMBASE, or CINAHL) since database inception, 2) be a clinical study or focus on clinically relevant outcomes or measurements, 3) use data from human participants or synthetic data representing human participants as target data, 4) use transfer learning with either fine-tuning (parameter transfer) or feature-representation transfer, and 5) analyze non-image target data (text, time series, tabular data, or audio). Accordingly, we excluded preprints and conference abstracts, basic research, cell studies, animal studies, and studies analyzing image data. Videos were considered as blends of audio and images; therefore, we did not include studies analyzing this type of target data. However, we did not exclude studies that converted non-image data into images (e.g. converting an audio file into a spectrogram) and then analyzed them using models from computer vision. Finally, we also excluded studies which combined source data with the target data as an integral part of their analysis as opposed to reusing only a source model. Eligibility criteria were predefined and the review protocol is available online [15].

[...] online [15]. In brief, we searched for all studies that either specifically mentioned 'transfer learning' or included both a similar but less specific phrase (e.g. 'transfer of learning', 'transfer weights', 'connectionist network', etc.) and an AI-related keyword (e.g. 'artificial intelligence', 'neural network', 'NLP', etc.). The search strategy was supplemented by a call-out on Twitter by AH (@adamhulman) on May 25, 2021, and by scanning the references of relevant reviews found during the screening process.

As per our aim, we did not include grey literature.
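The Boolean logic of the search described above can be sketched as a simple Python predicate. This is only an illustration: the phrase lists contain just the examples quoted in the text, not the full search strategy, and the function name and the plain substring matching are assumptions that do not reflect how the databases were actually queried.

```python
# Only the example terms quoted in the text; the full lists are in the protocol
SIMILAR_PHRASES = ("transfer of learning", "transfer weights", "connectionist network")
AI_KEYWORDS = ("artificial intelligence", "neural network", "nlp")

def matches_search(text):
    """Return True if a record would be retrieved by the simplified search logic."""
    text = text.lower()
    # Branch 1: the specific phrase alone is sufficient
    if "transfer learning" in text:
        return True
    # Branch 2: a less specific phrase AND an AI-related keyword
    has_similar_phrase = any(phrase in text for phrase in SIMILAR_PHRASES)
    has_ai_keyword = any(keyword in text for keyword in AI_KEYWORDS)
    return has_similar_phrase and has_ai_keyword

print(matches_search("Transfer learning for ECG classification"))        # specific phrase
print(matches_search("Transfer of learning in motor skill acquisition")) # no AI keyword
print(matches_search("Transfer weights in a deep neural network"))       # phrase + keyword
```

Requiring an AI-related keyword in the second branch filters out records that use phrases such as 'transfer of learning' in an educational or psychological sense.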

The search results from each database were imported to the reference program EndNote 20.1 (Clarivate Analytics, Philadelphia, PA, USA). Duplicate removal was done by AE and is documented online [15]. Thereafter, the records were transferred to Covidence (Veritas Health Innovation, Melbourne, Australia) for the screening process.

Abstracts and titles were screened against the predefined eligibility criteria by at least two independent authors (AE, MT, OEA, AH). If an abstract and title clearly did not meet the inclusion criteria, the record was excluded. In case of uncertainty, the record was included for full-text screening. Next, full-text versions of all reports were retrieved if possible and assessed for eligibility. If a report was excluded at this step of the screening process, a reason was documented. At any stage of the screening process, conflicts were resolved through discussion between the dissenting authors. In case of no consensus, the last author (AH) made the final decision. The study selection process is presented using a PRISMA flow diagram [16] in Figure 1.

Research questions were pre-specified and published in the scoping review protocol [15]. Data of interest from each included study were extracted by two independent researchers based on a pre-developed and tested data extraction form [15]. Disagreements were resolved as described above. Data on study characteristics, such as the study area within medicine, the affiliation of the authors (medical or technical departments), and the aim of the study, were extracted. Furthermore, extracted data included information on model characteristics, such as the method and origin of the transferred model, the type of transfer learning, the type of source and target data, and the advantages/disadvantages compared to a non-transfer learning method, if reported. Lastly, information on the reproducibility of the studies was registered, i.e. the public availability of the data, the reused model, and the code for the analysis, as well as the software used. The studies are listed by field within medicine and data type in Table 1, and the complete dataset is available in the electronic supplement.

The search resulted in 4,902 records, of which 2,097 were duplicates (Figure 1). After screening the remaining 2,805 records, 2,528 were excluded as irrelevant. Of the 277 records included for full-text review, 194 were excluded, mainly because they focused on basic, animal, or other non-clinical research (n=107) or did not use transfer learning (n=43). Another 64 records were identified in reviews or on Twitter, but none of the papers were eligible for inclusion. In total, 83 studies were included in this review (Table 1).
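The record counts reported above are internally consistent, which can be verified with a few lines of Python. The variable names are illustrative; the numbers are those stated in the text.

```python
records_identified = 4902
duplicates = 2097
excluded_at_title_abstract = 2528
excluded_at_full_text = 194
eligible_from_other_sources = 0  # 64 records from reviews or Twitter, none eligible

screened = records_identified - duplicates
full_text_reviewed = screened - excluded_at_title_abstract
included = full_text_reviewed - excluded_at_full_text + eligible_from_other_sources

print("Records screened:", screened)
print("Full-text reviews:", full_text_reviewed)
print("Studies included:", included)
```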

Only one of the identified studies [17] [...]

The most common fields of study were neurology (n=26) and cardiology (n=18), followed by genetics, infectious diseases, and psychiatry (n=5 for each). In line with the medical specialties, analyses of electroencephalography (EEG) and electrocardiogram (ECG) data were common, with 20 and 19 studies, respectively.

Study aims were most often described with the terms: 'prediction', 'detection', and 'classification'.

In total, 50 out of 83 (60%) studies included at least one author with a clinical affiliation and at least one with a technical affiliation. Studies with purely technical affiliations were more common (29 out of 83; 35%) than studies with purely clinical affiliations (4 out of 83; 5%).

Time series was the most common target data type (n=51; 61%), followed by tabular data (n=15; 18%), audio (n=10; 12%) and text (n=7; 8%). The reused model was [...] developed in the ImageNet [9] dataset (n=27). Figure 3 shows the different combinations of source and target data types.
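The percentages reported for author affiliations and target data types follow directly from the study counts, as the short Python check below shows. The dictionary keys and the helper function are illustrative; the counts are those stated in the text.

```python
TOTAL_STUDIES = 83

affiliations = {"clinical and technical": 50, "technical only": 29, "clinical only": 4}
data_types = {"time series": 51, "tabular": 15, "audio": 10, "text": 7}

def percent(count, total=TOTAL_STUDIES):
    # Rounded to the nearest whole percent, as in the text
    return round(100 * count / total)

for name, counts in (("Affiliation", affiliations), ("Data type", data_types)):
    # Each grouping should account for all included studies
    assert sum(counts.values()) == TOTAL_STUDIES
    for label, count in counts.items():
        print(f"{name}: {label}: n={count} ({percent(count)}%)")
```

Because each value is rounded separately, the data type percentages add up to 99% rather than 100%.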

Fine-tuning was approximately three times as common (58 out 

Applications of transfer learning were identified in a variety of clinical studies and demonstrated the potential of reusing models across different prediction tasks, data types, and even species. Improvements in predictive performance were especially striking when transfer learning was applied to smaller datasets, as compared to training machine learning algorithms from scratch. Image-based models seemed to have a large impact outside of computer vision and were reused for the analysis of time series, audio, and tabular data. Many studies utilized publicly available datasets and models, but surprisingly few shared their code.

Transfer learning via fine-tuning and feature-representation learning was almost unknown in the clinical literature until 2019, when the number of articles started to increase rapidly. This development is lagging a few years behind the trend seen in medical image analysis [11], which might be explained partly by the impact that computer vision models have on non-image applications too. Also, we only included peer-reviewed articles indexed in medical databases, while the latest developments in computer science are most likely to be first published as preprints or conference proceedings.

More than half of the studies were from the fields of neurology and cardiology, while the rest of the studies were more equally distributed across other fields. This could be due to the high-resolution nature of EEG and ECG data, which is well suited to the application of machine learning. Also, there are many publicly available datasets and data science competitions that attract the attention of the computer science community.

The majority of study aims were predictions of binary or categorical outcomes on the individual level, e.g. detection of a disease, classification of disease stages, or risk estimation. Few studies focused on prediction of continuous outcomes (e.g. glucose levels [26, 27]) and forecasts on the population level (e.g. COVID-19 trend [28] and dengue fever outbreaks [29]). Prediction models are abundant in the clinical literature; however, most do not make it into routine clinical use [30]. For machine learning algorithms, one of the barriers is that clinical practitioners often consider them as black boxes without intuitive interpretations, even though new solutions are constantly being developed to solve this problem [31]. Another issue is that machine learning algorithms are often used to analyze modestly sized tabular datasets and (not surprisingly) fail to outperform traditional statistical methods, contributing to even more skepticism. Transfer learning helps to overcome this problem by reusing models trained on large datasets and tailoring them to smaller ones. This provides the opportunity to capitalize on recent developments in machine learning research in new areas of clinical research. However, to guarantee the development of easily accessible and clinically relevant models that go beyond the 'proof-of-concept' level into clinical deployment, it is crucial that computer scientists and clinical researchers work together, which we observed in less than two-thirds of the studies. This proportion would probably have been even lower if we had considered studies indexed in non-clinical databases. Even the clinical studies using transfer learning that we identified are still largely being published in rather specialized journals focusing on an audience with some technical background (e.g. IEEE and interdisciplinary journals) and have not yet reached mainstream clinical journals.

We were surprised by the high number of studies reusing a model that was developed in a dataset of a different data type than the target dataset. This also made it difficult to compare the size of source and target datasets quantitatively, as they might have been characterized in very different units (e.g. hours of audio recordings in the target dataset vs. number of images in the source dataset), but our observation was that the target datasets were often much smaller than the source datasets. For example, models developed in the ImageNet dataset (>1 million images) were used even in smaller clinical studies with <100 patients [24, 32, 33], where the use of machine learning models would otherwise not be recommended or feasible.

Many studies compared long lists of different model architectures (e.g. ResNet [34], Inception [35], GoogLeNet [36], AlexNet [37], VGGNet [38], MobileNet [39], etc.) or even different methods. This, often combined with the use of several performance metrics (area under the ROC curve, F1-score, accuracy, sensitivity, specificity, etc.), makes it difficult to summarize the advantages/disadvantages of transfer learning, even though many studies included comparisons with other deep learning solutions without transfer learning. We observed that the degree of performance improvement varied greatly. Some studies reported a faster and more stable model fitting process with transfer learning due to better utilization of data. Some of these results are described in the sections below characterizing the different data types.
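For readers less familiar with the performance metrics mentioned above, the following Python sketch shows how they are commonly computed for a binary classifier. The confusion matrix counts and prediction scores are invented for illustration and do not come from any of the reviewed studies; the function names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common metrics derived from the counts of a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # proportion of true cases detected (recall)
    specificity = tn / (tn + fp)   # proportion of non-cases correctly ruled out
    precision = tp / (tp + fp)     # proportion of positive predictions that are correct
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}

def auc(case_scores, control_scores):
    """Area under the ROC curve: the probability that a randomly chosen case
    receives a higher score than a randomly chosen control (ties count half)."""
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in case_scores for k in control_scores)
    return wins / (len(case_scores) * len(control_scores))

# Invented example: 50 cases and 50 controls
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(metrics, round(example_auc, 3))
```

As the metrics summarize different aspects of performance, two models can rank differently depending on which metric is reported, which is one reason why comparisons across studies are difficult.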

Even though half of the studies reused a publicly available model and even more used a publicly available dataset, principles of reproducible research were followed less often. One-third of the studies did not report at all which software they used and even fewer shared their code. Those who did most often used Python, followed by MATLAB, while we did not find a single study using R for the transfer learning part. This might be a consequence of the fact that commonly used libraries like PyTorch, TensorFlow, or fastai are native to Python and only recently became available in R. Those authors who shared their code did so almost exclusively on GitHub.

A discussion of data type-specific observations is organized in the following four sections.

Time series was the most common target data type among the included studies, dominated by studies of sleep staging and seizure detection based on EEG data [19, 32, 40-55] and ECG analyses [23, 24, 56-70] (e.g. arrhythmia classification). Moreover, transfer learning was used for prediction of glucose levels [26, 27], estimation of Parkinson's disease severity [71], detection of cognitive impairment [33] and schizophrenia [52], and forecasting of infectious disease trends [28] and outbreaks [29], among other applications [72-82].

Almost half of the studies transformed their data into images (e.g. spectrograms, scalograms), most often using Fourier (e.g. short-time or fast) or continuous wavelet transforms, to be able to utilize models from computer vision. A reason for this might be that publicly available time series datasets were rather small until recently, while in computer vision, ImageNet has been described and widely used as a large benchmark dataset for about a decade [9]. In 2020, PTB-XL [83] [...] where it would not be feasible to train models from scratch. This may have a positive impact on the study of rare diseases or minority groups, regardless of the type of data. Lopes et al. [69] used this strategy to detect a rare genetic heart disease based on ECG recordings, although they only had 155 recordings from patients. First, the authors developed a convolutional neural network for sex detection, which required only a readily available outcome from their database of approximately a quarter of a million ECG recordings. Even though the model was initially trained for a task with little clinical relevance, it still learnt some useful features from the data. This model was then fine-tuned for the detection of a rare genetic heart disease using only 310 recordings from 155 patients and 155 matched controls. With this approach, the authors not only achieved major improvements in model discrimination compared to training the same architecture from scratch, but also outperformed various machine learning approaches and clinical experts [61].
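The transformation of a time series into a spectrogram, mentioned at the beginning of this section, can be illustrated with a short-time Fourier transform written in plain Python. This is a minimal sketch under stated assumptions: the signal is a synthetic 8 Hz sine wave, the sampling rate, window size, and hop length are hypothetical, and real studies would typically rely on an optimized signal processing library rather than this naive implementation.

```python
import cmath
import math

def stft_magnitude(signal, window_size, hop):
    """Return a spectrogram as a list of time frames; each frame holds the
    magnitudes of frequency bins 0..window_size // 2."""
    # A Hann window tapers each segment to reduce artifacts at its edges
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (window_size - 1))
              for n in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = [s * w for s, w in zip(signal[start:start + window_size], window)]
        spectrum = []
        for k in range(window_size // 2 + 1):
            # Discrete Fourier transform coefficient for frequency bin k
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                        for n, x in enumerate(segment))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames

fs = 64  # hypothetical sampling rate in Hz
signal = [math.sin(2 * math.pi * 8 * t / fs) for t in range(256)]  # 4 s of an 8 Hz wave
spec = stft_magnitude(signal, window_size=32, hop=16)

# Each bin spans fs / window_size = 2 Hz, so the 8 Hz wave peaks in bin 4
peak_bins = [max(range(len(frame)), key=lambda k: frame[k]) for frame in spec]
print(len(spec), "time frames x", len(spec[0]), "frequency bins; peak bins:", peak_bins)
```

The resulting two-dimensional array (time by frequency) can be rendered as an image and passed to a pre-trained computer vision model.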

In the previous example, an easily accessible outcome or label (i.e. sex) made it possible to train the base model; however, even this can be avoided by using autoencoders [84]. An autoencoder is an unsupervised machine learning algorithm often used for denoising a dataset by first reducing its dimensions and only keeping relevant information (encoding), before reconstructing the original dataset (decoding). The feature representations learnt during the encoding step might then be transferred to new tasks. Jang et al. [59] developed an autoencoder based on 2.6 million unlabeled ECG recordings, which was then reused as the base of an ECG rhythm classifier. The authors compared this approach to an image-based transfer learning classifier and a model trained from scratch. The autoencoder solution performed best, but the difference to the model trained from scratch was much smaller than in the other example. More interestingly, the authors compared the results when using 100%, 50% and 25% of the available data, and found a major drop in performance of the model trained from scratch when using only 25% of the data, while the autoencoder-based transfer learning solution still had an excellent performance as it still indirectly utilized data from >2 million ECG recordings. The image-based solution had a slightly lower F1-score (indicating worse performance) than other methods when using 100% of the data, but it did not change much when using only 25% of the data, which led to a better performance than that of the model trained from scratch.

EEG data were most often used for sleep staging and epileptic seizure detection. We highlight the study by Raghu et al. [51] using pre-trained convolutional neural networks for seizure detection based on EEG-based spectrograms, because the authors described both fine-tuning and feature-representation learning solutions. In the latter approach, features were fed into a support vector machine classifier, a popular machine learning technique. The authors found that taking features from deeper layers, representing higher-level features, resulted in higher accuracy for most architectures. For almost all architectures, feature-representation transfer was found to be better than fine-tuning regarding accuracy. However, this can partly be a consequence of the fact that many more models were fitted including features from different layers. Further, once the authors found out which layer provided the most useful representation for a specific problem, the optimization process was shortened markedly compared to the fine-tuning process (~4 vs. 52 min with the Inception-v3 architecture).
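The idea of feature-representation transfer, where features from a chosen layer of a frozen pre-trained model are fed into a separate classifier, can be sketched as follows. This is not the method of Raghu et al.: the two-layer network with hand-picked weights is hypothetical, the data are invented, and a simple nearest-centroid classifier is swapped in for the support vector machine to keep the example self-contained.

```python
import math

# Hypothetical frozen (pre-trained) layers; these weights are never updated
FROZEN_LAYERS = [
    [[1.0, 0.5], [-0.5, 1.0], [0.3, -0.8]],   # layer 1: 2 inputs -> 3 features
    [[0.7, -0.2, 0.4], [-0.6, 0.9, 0.1]],     # layer 2: 3 features -> 2 features
]

def extract_features(x, depth):
    # Pass the input through the first `depth` frozen layers
    for weights in FROZEN_LAYERS[:depth]:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return x

def fit_centroids(samples, labels, depth):
    # The new classifier: one mean feature vector (centroid) per class
    grouped = {}
    for x, y in zip(samples, labels):
        grouped.setdefault(y, []).append(extract_features(x, depth))
    return {y: [sum(col) / len(col) for col in zip(*feats)]
            for y, feats in grouped.items()}

def classify(x, centroids, depth):
    # Assign the class whose centroid is closest in feature space
    f = extract_features(x, depth)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(f, centroids[y])))

# Invented target data: four labeled samples
samples = [[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
labels = [1, 1, 0, 0]

for depth in (1, 2):  # compare features taken from different layers
    centroids = fit_centroids(samples, labels, depth)
    print("depth", depth, "->", classify([1.8, 1.2], centroids, depth))
```

Only the small classifier is fitted to the target data, which is why this approach can be much faster than fine-tuning once a suitable layer has been chosen.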

Voice recognition and other audio-based applications of AI surround us in our everyday lives. In a clinical setting, doctors have traditionally used audio signals from stethoscopes to screen patients, e.g. for heart and lung diseases. Additionally, vocal biomarkers processed with machine learning algorithms are getting more and more attention in a research setting, and are expected to aid diagnosis and monitoring of diseases in the future [85]. Despite the increasing interest and opportunities for relatively easy and cheap data collection, we only found 10 studies that used transfer learning on audio data. However, these studies covered a variety of fields in medicine (and corresponding audio signals): neurology (speech and electromyography [21, 22, 86, 87]), cardiology (heart sound [88, 89]), pulmonology (respiratory sounds [90, 91]), infectious diseases (cough [92]), and otorhinolaryngology (breathing [93]).

Recordings of speech contain different levels of information: linguistic features that can be derived from transcripts, and temporal and acoustic features that can be derived from raw audio recordings. Two studies applied BERT [94], an open source, pre-trained natural language model, to analyze transcripts of speech with the aim of diagnosing Alzheimer's disease [21, 87]. Balagopalan et al. [87] also tested a model for the same task that included acoustic features; however, this model performed worse than the model based only on linguistic features. This finding highlights the possible complexity of audio data analysis and the importance of benchmarking different models against one another to identify the best performing algorithm.

Machine learning algorithms pre-trained on labeled image data, like ImageNet, can recognize features in non-image data like audio [95]. To use image models on audio data, the data has to be transformed, usually into a spectrogram image. We were surprised to find that half of the audio studies used pre-trained image models to analyze audio data, and only two studies used pre-trained audio models. This, again, highlights the impact of image models even outside of computer vision. Of interest, Koike et al. [88] aimed to predict heart disease from heart sounds with transfer learning and compared two models: one trained on audio data and another on

datasets exist (e.g. AudioSet [96], LibriSpeech [97]), and the results of Koike et al. highlight how models pre-trained on audio might perform better than image models.
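The transformation of audio into a spectrogram image, mentioned above, can be sketched as follows. This is a minimal illustration only, assuming a short-time Fourier transform with a Hann window; the function names (`spectrogram`, `to_three_channel_image`), the window length, and the hop size are hypothetical choices and are not taken from any of the reviewed studies. The sketch uses only the Python standard library, whereas a real pipeline would typically rely on a dedicated signal-processing library.

```python
import cmath
import math


def spectrogram(signal, win_len=64, hop=32):
    """Return a list of time frames, each a list of log-magnitudes per frequency bin."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
              for n in range(win_len)]
    n_bins = win_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(win_len)]
        row = []
        for k in range(n_bins):
            # Naive discrete Fourier transform for bin k (slow, but fine for a sketch).
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            row.append(math.log1p(abs(acc)))
        frames.append(row)
    return frames


def to_three_channel_image(spec):
    """Scale values to [0, 1] and repeat them across 3 channels, as RGB image models expect."""
    flat = [v for row in spec for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0
    gray = [[(v - lo) / scale for v in row] for row in spec]
    return [gray, gray, gray]  # layout: channels x time x frequency


# Usage: a synthetic one-second 96 Hz tone sampled at 512 Hz (not clinical data).
sample_rate = 512
tone = [math.sin(2 * math.pi * 96 * t / sample_rate) for t in range(sample_rate)]
spec = spectrogram(tone)
image = to_three_channel_image(spec)
```

The resulting three-channel array has the layout that image models pre-trained on datasets such as ImageNet usually expect, which is what makes reuse of such models on audio possible.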

With audio data easily obtainable in a clinical setting, there is a great opportunity to develop and improve transfer learning models that can aid clinicians to screen and diagnose patients in a cost-effective way.

Tabular data is probably the most used data type within clinical research. However, we only identified 15 studies using transfer learning on tabular data, covering very different fields in medicine: two-thirds of them were from genetics [98-102], pathology [103-105], and intensive care [18, 106], while the remaining five were from surgery [17], neonatology [107], infectious disease [108], pulmonology [109], and pharmacology [110]. Oncological applications like classification of cancer and prediction of cancer survival were common among the studies in genetics or pathology. As an example of zooming in from a broader disease category to a specific, rarer disease, one study developed a model using gene expression data from a broad variety of cancer types and then fine-tuned the model using a much smaller dataset to predict cancer survival in lung cancer patients [101].
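The two-step pattern described above (pre-train on a large, broad dataset, then fine-tune on a small, specific one) can be illustrated with a deliberately simple sketch. It assumes a logistic regression model trained by full-batch gradient descent on synthetic data; the cited study used a different model and real gene expression data, and all names and numbers below (`make_data`, `train`, the weight vectors, the sample sizes) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, data):
    """Mean binary cross-entropy of a logistic model on (features, label) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train(weights, data, lr=0.1, epochs=100):
    """Full-batch gradient descent starting from `weights`; returns new weights."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def make_data(rng, n, true_w):
    """Synthetic stand-in for tabular data; NOT real gene expression values."""
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in true_w]
        p = sigmoid(sum(w * xi for w, xi in zip(true_w, x)))
        rows.append((x, 1 if rng.random() < p else 0))
    return rows


rng = random.Random(0)
source = make_data(rng, 1000, [2.0, -1.0, 0.5, 1.5])  # large, broad source task
target = make_data(rng, 40, [2.2, -0.8, 0.4, 1.2])    # small, related target task

# Step 1: pre-train from scratch on the source data.
pretrained = train([0.0] * 4, source)
# Step 2: fine-tune briefly on the target data, starting from the pre-trained weights.
fine_tuned = train(pretrained, target, lr=0.05, epochs=20)
```

The key design choice is that the second call to `train` starts from `pretrained` rather than from zeros, so the small target dataset only needs to adjust, not relearn, the model. Whether this beats training from scratch depends on how closely the two tasks are related.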

Transformation of tabular data into images was rare compared to time series and audio data. We only identified two studies transforming gene expression data into images [98, 101] and then applying computer vision models for a prediction task.
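One simple way to turn a tabular feature vector into an image is sketched below. It assumes a row-major reshape into the smallest square grid that fits, with min-max scaling and zero padding; the reviewed studies may have arranged features differently, and the function name `vector_to_image` and the example values are hypothetical.

```python
import math


def vector_to_image(values, pad_value=0.0):
    """Arrange a 1-D feature vector into the smallest square grid that fits it."""
    if not values:
        raise ValueError("need at least one feature")
    side = math.isqrt(len(values) - 1) + 1  # equals ceil(sqrt(n)) for n >= 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0
    # Min-max scale to [0, 1] so values behave like pixel intensities.
    pixels = [(v - lo) / scale for v in values]
    # Pad the tail so the vector fills the square completely.
    pixels += [pad_value] * (side * side - len(pixels))
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Usage with ten made-up expression values (not real data): a 4 x 4 grid results.
expression = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0, 4.0, 6.0, 0.0]
image = vector_to_image(expression)
```

Note that the position of a feature in the grid is arbitrary in this sketch, whereas computer vision models assume that neighboring pixels are related, so the ordering of features is a design decision in its own right.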

Both of these studies used open access gene expression datasets: AlShibili et al. [98] studied classification of cancer types in a dataset from cBioPortal [111], and López-García et al. [101] predicted cancer survival in a dataset from the UCSC Xena Browser [112]. Both studies reported better accuracy, sensitivity, and specificity when using transfer learning compared to non-transfer learning approaches based on the original tabular data.

We also identified a study reusing a model across species, which highlights another potential of transfer learning. Seddiki et al. [105] reused a model developed on mass spectrometry data from animals to classify human mass spectrometry data. Here, the benefits of transfer learning are clear, as genetic data from animals are often easier to retrieve and share due to the lack of privacy issues. This can potentially provide new opportunities within translational clinical research by reusing scientific knowledge gained from animal studies.

Wardi et al. [108] used both a transfer learning approach and a non-transfer learning approach to predict septic shock in an emergency department setting based on data from electronic health records (EHR) like blood pressure, heart rate, temperature, saturation, and blood sample values. They showed that transfer learning outperformed the traditional machine learning model, especially when only a smaller fraction of data was used. Furthermore, the transfer learning model was externally validated with promising results, which supports its clinical utility. The study by Wardi et al. makes individual-specific predictions possible, which is a prerequisite before an AI-based tool can be implemented in everyday clinical practice to support decision making [1] .

Another promising use of transfer learning was described in the study by Gao et al. [99]. The authors developed models for various prediction tasks using genetic data from a heterogeneous population with a focus on how to translate models that perform well on the ethnic majority group to ethnic minority groups. Transfer learning clearly improved performance as compared to other machine learning approaches developed from scratch separately for each ethnic minority group. This application demonstrates how transfer learning can be used to tackle inequalities in health research.

Our main finding regarding text and natural language processing (NLP) was how underutilized transfer learning was in clinical research. We identified only seven studies using transfer learning on text. Applications ranged from risk assessment of psychiatric stressors, diseases, and medication abuse [20, 113-115], to prediction of morbidity, mortality [116, 117], and adverse incidents from oncological radiation [118]. It is somewhat surprising that we found so few studies, given the massive interest in NLP and the field's rapid development in recent years [119, 120]. The relatively slow adoption of transfer learning in clinical research is presumably due to the many technical challenges in medical text analysis (e.g. ambiguous abbreviations, specialty-specific jargon, etc.) [121]. We can indirectly observe this in our review, where all seven articles were written by clinicians and technical authors together or by clinicians alone. This is in contrast with what we observed for time series, tabular data, and audio, where articles were more often written by technical authors only. This is not to say that technical authors are not interested in the use of transfer learning on clinical text, but rather a reflection that transfer learning in text analysis is still a relatively new method. Several articles have been written on transfer learning using radiological reports or EHR, but the studies often focus on purely technical aspects of NLP and are published in journals that are not indexed in the databases that clinicians are familiar with [119, 120]. While transfer learning in NLP has made considerable progress from a technical perspective, it seems that this trend has only just begun to appear in clinical research.

Another important reason why we found so few articles on text is that there currently only exist a few large, annotated, and publicly available datasets with EHR [121].

Removing patient-identifiable data from EHR is one of the key challenges currently limiting the sharing of large medical text datasets, though automatic tools have recently been developed to speed up this task [122]. Indeed, among the seven identified studies, three analyzed public social media posts to predict mental illnesses, two used relatively small institutional EHR datasets (<1000 patients), one used a large UK database with restricted access, and only a single study used a large, curated, and openly available dataset (the MIMIC-III critical care database [123]). A review on the use of all types of NLP on radiological reports found a similar pattern, where most studies used institutional EHR datasets [124], again highlighting the need for more large, annotated, and publicly available datasets.

As a final note on transfer learning in text, we would like to highlight the study by Si et al. [117]. In this paper, the authors proposed a new method to analyze medical records to predict mortality and identify obesity-related comorbidities. In brief, this method took the temporal aspect of each patient's documents into account when predicting the patient's risk of mortality within the next year. It makes intuitive sense from a clinical perspective that a recent myocardial infarction could be more informative for mortality than one that occurred more than ten years ago, and this technical development could be important for future research in clinical text analysis.

The identification of clinical studies turned out to be more challenging than we expected, as it is a concept that is difficult to define. We had long discussions on whether to include brain-computer interface studies [125], where transfer learning seems to be a promising method to speed up calibration, with the final aim of helping people with a clinical condition. However, we concluded that the actual prediction models solve engineering problems rather than predict a clinical outcome. Similarly, we excluded NLP studies on biomedical named entity recognition, because even though the datasets might have been medically relevant documents, the task of identifying specific terms is only indirectly relevant from a clinical perspective, e.g. by speeding up knowledge synthesis.

The type of the target data might also be ambiguous, depending on how we define raw data. Transcripts of audio recordings could easily be considered as text, but we tried to identify the first dataset that was output, e.g. by a medical device, without further processing. Using this approach, the recordings were considered as a dataset of audio type, and transcribing them to text was considered as a data transformation.

Our scoping review did not include clinical studies if they were only published as preprints or in proceedings of computer science conferences. However, we chose our inclusion criteria this way consciously, so that we could give an overview of the field from the perspective of clinical researchers. Therefore, our review likely does not include the latest technical developments within AI research and transfer learning, but instead includes the techniques that have started to impact clinical research.

During data extraction, we planned to find the best proxy variables to answer our research questions, but some of these were challenging. As described previously, we decided not to extract quantitative data on the size of the source and the target datasets, as the unit of observation often differed. It was also difficult to characterize comparisons of transfer learning vs. non-transfer learning solutions because many studies reported many different models and performance metrics. We were curious about the background of the authors (clinical vs. technical), but used only the affiliations as a proxy, as that was easily accessible information in most cases.

Our scoping review is a roadmap to transfer learning for non-image data in clinical research, providing clinical researchers with an easily accessible resource on this relatively new technique in machine learning. The interest in transfer learning for non-image data in clinical research began to increase rapidly only recently, lagging a few years behind trends in its use for medical image analysis. Applications are unbalanced between different clinical research areas and data types. Neurology and cardiology seem to be among the 'first movers' with time series data, partly driven by the public availability of EEG and ECG datasets, which also suit machine learning due to their high-resolution nature. We found fewer classical epidemiological studies than expected, even though transfer learning can help to overcome some of the big challenges of the field, i.e. data collection on a large scale is often difficult and expensive, and data sharing can be hindered by privacy concerns. In the future, some of the largest epidemiological datasets (e.g. UK Biobank [126]) could serve as source data for the development of machine learning algorithms that a wide range of smaller studies could build on by using transfer learning without access to the actual dataset. This is in line with the FAIR principles [127] supporting the reuse of digital assets in an environment with increased volume and complexity of datasets.

Moreover, the wider use of reproducible research principles and stronger interdisciplinary collaborations between clinical researchers and computer scientists are crucial for the development of clinically relevant prediction models that can be reused with transfer learning across studies, clinical specialties, or even species.

There is no conflict of interest in this project.

The data extracted from the identified articles and then presented in the Results section is available as an electronic supplement along with the code used for the analysis.

Some articles could have been assigned to more medical specialties, but the authors have chosen a primary one with consensus for the sake of simplicity.

Box 1 Glossary (Definitions were taken or adapted from Howard & Gugger [2] and the Machine Learning Glossary by Google [128])

Artificial intelligence: A non-human program or model that can solve sophisticated tasks.

Machine learning (ML): The training of programs developed by allowing a computer to learn from its experience, rather than through manually coding individual steps.

Neural network: A neural network is a particular kind of ML model. A model that, taking inspiration from the brain, is composed of layers of neurons. An input layer (first layer) receives the input data. This is followed by hidden layer(s). Hidden neurons (yellow circles) typically take multiple input values and generate one output value, which is calculated by applying a nonlinear transformation (activation function) to a weighted sum of input values. An output layer (final layer) returns the results. Model fitting (training) aims to optimize the weights to get the best model according to a pre-defined performance metric (loss).

Deep learning: A type of machine learning that uses neural networks with multiple hidden layers.

Transfer learning: The use of a pre-trained model for a task different to what it was originally trained for. For example, to take a model that can recognize everyday objects and further train it on fundus photographs to grade diabetic retinopathy, instead of training a model from scratch on fundus photographs only.

Fine-tuning: A type of transfer learning that updates the parameters of a pre-trained model by training for a different task than it was originally trained for.

Feature extraction: A transfer learning technique that passes input data through a pre-trained model and extracts feature representations (values from hidden layers), which then become inputs for another model for a different task than the pre-trained model was trained for.
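The glossary entries on hidden neurons and on reusing hidden-layer values can be made concrete with a small sketch. It assumes a tiny network with one hidden layer whose weights stand in for a pre-trained model; all weights, names (`dense`, `extract_features`, `HIDDEN_W`, `new_head_w`), and input values are made up for illustration and do not come from any reviewed study.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron applies an activation to a weighted sum of its inputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical "pre-trained" hidden layer (2 inputs -> 3 hidden neurons).
# These made-up weights are kept frozen when the layer is reused.
HIDDEN_W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
HIDDEN_B = [0.0, -0.5, 0.0]


def extract_features(x):
    """Reuse the frozen hidden layer as a fixed transformation of the input."""
    return dense(x, HIDDEN_W, HIDDEN_B, relu)


# A new output layer for the new task; only these weights would be trained.
new_head_w = [[0.2, -0.4, 0.1]]
new_head_b = [0.0]

features = extract_features([1.0, 2.0])  # values from the hidden layer
prediction = dense(features, new_head_w, new_head_b, sigmoid)[0]
```

In this sketch the hidden-layer weights never change, which corresponds to reusing a pre-trained model as a fixed transformation. Updating `HIDDEN_W` and `HIDDEN_B` during training on the new task as well would instead correspond to updating the parameters of the pre-trained model.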


High-performance medicine: the convergence of human and artificial intelligence

Deep Learning for Coders with fastai and

A survey of the recent architectures of deep convolutional neural networks

Visualizing and Understanding Convolutional Networks

Diabetic Retinopathy Grading Using ResNet Convolutional Neural Network. 2020 IEEE Conference on Big Data and Analytics (ICBDA)

Direct transfer of learned information among neural networks. AAAI'91: Proceedings of the ninth National conference on Artificial intelligence

A Survey on Transfer Learning

ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition

A scoping review of transfer learning research on medical image analysis using ImageNet

Not-so-supervised: A survey of semisupervised, multi-instance, and transfer learning in medical image analysis

Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach

PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation

Chapter 11: Scoping Reviews

Transfer learning for non-image data in clinical research: a scoping review protocol

The PRISMA 2020 statement: an updated guideline for reporting systematic reviews

Improving surgical models through one/two class learning

Language models are an effective representation learning technique for electronic health record data

Rebouças Filho PP. Bi-Dimensional Approach Based on Transfer Learning for Alcoholism Pre-disposition Classification via EEG Signals

Deep Learning-Based Natural Language Processing for Screening Psychiatric Patients

Transformer-based deep neural network language models for Alzheimer's disease risk assessment from targeted speech

Towards Computer-Based Automated Screening of Dementia Through Spontaneous Speech

From ECG signals to images: a transformation based approach for deep learning

Multi-Modal Diagnosis of Infectious Diseases in the Developing World

PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals

Multi-Hour Blood Glucose Prediction in Type 1 Diabetes: A Patient-Specific Approach Using Shallow Neural Network Models

Adversarial multi-source transfer learning in healthcare: Application to glucose prediction for diabetic people

ALeRT-COVID: Attentive Lockdown-awaRe Transfer Learning for Predicting COVID-19 Pandemics in Different Countries

Chinese Cities Based on the Deep Learning Method

Prognosis and prognostic research: application and impact of prognostic models in clinical practice

From Local Explanations to Global Understanding with Explainable AI for Trees

Automatic sleep stage classification using time-frequency images of CWT and transfer learning using convolution neural network

Quantitative Assessment of Resting-State for Mild Cognitive Impairment Detection: A Functional Near-Infrared Spectroscopy and Deep Learning Approach

Deep Residual Learning for Image Recognition

Rethinking the Inception Architecture for Computer Vision

Going Deeper with Convolutions

ImageNet classification with deep convolutional neural networks

Very Deep Convolutional Networks for Large-Scale Image Recognition

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Expert-level automated sleep staging of long-term scalp electroencephalography recordings using deep learning

Multichannel Sleep Stage Classification and Transfer Learning using Convolutional Neural Networks

A Pilot Study on Fast Adaptation of Biosignals-Based Sleep Stage Classifier to New Individual Subject Using Meta-Learning

A Deep Learning Architecture for Temporal Sleep Stage Classification Using Multivariate and Multimodal Time Series

Efficient Epileptic Seizure Prediction Based on Deep Learning

Deep Convolutional Neural Network-Based Epileptic Electroencephalogram (EEG) Signal Classification

Deep learning approach to detect seizure using reconstructed phase space images

Transfer learning from ECG to PPG for improved sleep staging from wrist-worn wearables

Detection of Focal and Non-focal Epileptic Seizure Using Continuous Wavelet Transform-Based Scalogram Images and Pre-trained Deep Neural Networks

Detection of Epileptic Seizure Using Pretrained Deep Convolutional Neural Network and Transfer Learning

Deep Learning for EEG Seizure Detection in Preterm Infants

EEG based multi-class seizure type classification using convolutional neural network and transfer learning

Transfer learning with deep convolutional neural network for automated detection of schizophrenia from EEG signals

Epileptic Signal Classification with Deep Transfer Learning Feature on Mean Amplitude Spectrum

Automatic sleep scoring: A deep learning architecture for multi-modality time series

Cross-Subject Seizure Detection in EEGs Using Deep Transfer Learning

Transfer Learning for Detection of Atrial Fibrillation in Deterministic Compressive Sensed ECG

Convolutional Neural Networks for Electrocardiogram Classification

Atrial fibrillation identification based on a deep transfer learning approach

Effectiveness of Transfer Learning for Deep Learning-Based Electrocardiogram Analysis

A transfer learning approach to detect paroxysmal atrial fibrillation automatically based on ballistocardiogram signal

Computer versus cardiologist: Is a machine learning algorithm able to outperform an expert in diagnosing a phospholamban p.Arg14del mutation on the electrocardiogram? Heart Rhythm

Robust, ECG-based detection of Sleep-disordered breathing in large population-based cohorts

An incremental learning system for atrial fibrillation detection based on transfer learning and active learning

Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL

Cardiovascular disease diagnosis using cross-domain transfer learning

Multi-task deep learning for cardiac rhythm detection in wearable devices

Transfer learning for ECG classification

Self-adjustable domain adaptation in personalized ECG monitoring integrated with IR-UWB radar

Improving electrocardiogram-based detection of rare genetic heart disease using transfer learning: An application to phospholamban p.Arg14del mutation carriers

Automated detection of diabetic subject using pre-trained 2D-CNN models with frequency spectrum images extracted from heart rate signals

Ensemble deep model for continuous estimation of Unified Parkinson's Disease Rating Scale III

Robust Methods to Detect Abnormal Initiation in the Gastric Slow Wave from Cutaneous Recordings

Deep learning based photoplethysmography classification for peripheral arterial disease detection: a proof-of-concept study

Deep learning of spontaneous arousal fluctuations detects early cholinergic defects across neurodevelopmental mouse models and patients

Falls Risk Classification of Older Adults Using Deep Neural Networks and Transfer Learning

Noninvasive wearable seizure detection using long-short-term memory networks with transfer learning

Towards More Accurate Automatic Sleep Staging via Deep Transfer Learning

A Novel Deep Learning Approach for Recognizing Stereotypical Motor Movements within and across Subjects on the Autism Spectrum Disorder

PPGnet: Deep Network for Device Independent Heart Rate Estimation from Photoplethysmogram

Tridirectional Transfer Learning for Predicting Gastric Cancer Morbidity

Classification of aortic stenosis using conventional machine learning and deep learning methods based on multi-dimensional cardio-mechanical signals

Voice for Health: The Use of Vocal Biomarkers from Research to Clinical Practice

Deep learning for waveform identification of resting needle electromyography signals

Comparing Pretrained and Feature-Based Models for Prediction of Alzheimer's Disease Based on Speech

Audio for Audio is Better? An Investigation on Transfer Learning Models for Heart Sound Classification

Cross-Domain Transfer Learning for PCG Diagnosis Algorithm

Convolutional neural networks based efficient approach for classification of lung diseases

Breathing Sound Segmentation and Detection Using Transfer Learning Techniques on an Attention-Based Encoder-Decoder Architecture

AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app

A novel deep feature transfer-based OSA detection method using sleep sound signals

Rethinking cnn models for audio classification

Audio Set: An ontology and human-labeled dataset for audio events

Librispeech: An ASR corpus based on public domain audio books

A Shallow Convolutional Learning Network for Classification of Cancers Based on Copy Number Variations

Deep transfer learning for reducing health care disparities arising from biomedical data inequality

Improved survival analysis by learning shared genomic information from pan-cancer data

Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data

A meta-learning approach for genomic survival analysis

CancerSiamese: one-shot learning for predicting primary and metastatic tumor types unseen during model training

Domain adaptation and self-supervised learning for surgical margin detection

Cumulative learning enables convolutional neural network representations for small mass spectrometry data classification

Deep Multi-Modal Transfer Learning for Augmented Patient Acuity Assessment in the Intelligent ICU. Front Digit Health

A multi-task, multistage deep transfer learning model for early prediction of neurodevelopment in very preterm infants

Predicting Progression to Septic Shock in the Emergency Department Using an Externally Generalizable Machine-Learning Algorithm

Performance improvement of machine learning techniques predicting the association of exacerbation of peak expiratory flow ratio with short term exposure level to indoor air quality using adult asthmatics clustered data

A transfer learning approach to drug resistance classification in mixed HIV dataset

The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data

Visualizing and interpreting cancer genomics data via the Xena platform

Text classification models for the automatic detection of nonmedical prescription medication use from social media

Extracting psychiatric stressors for suicide from social media using deep learning

Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study

Transformer for Electronic Health Records

Patient Representation Transfer Learning from Clinical Notes based on Hierarchical Attention Network

Automatic Incident Triage in Radiation Oncology Incident Learning System

A survey on transfer learning in natural language processing

Medical Information Extraction in the Age of Deep Learning

Clinical Text Data in Machine Learning: Systematic Review

Automated de-identification of free-text medical records

MIMIC-III, a freely accessible critical care database

A systematic review of natural language processing applied to radiology reports

Application of Transfer Learning in EEG Decoding Based on Brain-Computer Interfaces: A Review

An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age

The FAIR Guiding Principles for scientific data management and stewardship

Machine Learning Glossary

Tseng 2021 Infectious diseases

The authors are grateful to Anne Vils Møller (Librarian at the Royal Danish Library) for her valuable advice on the search strategy. We are also grateful for the

Reports sought for retrieval (n = 64). Reports not retrieved (n = 0). Reports excluded: basic, animal, or other nonclinical research (n = 107); no transfer learning (n = 43); studies using actual source data (n = 30); image data (n = 9); not peer-reviewed (n = 3); duplicate reports (n = 1); non-English language (n = 1).