key: cord-0500926-r6exuzxs
authors: Calderon-Ramirez, Saul; Shengxiang-Yang,; Moemeni, Armaghan; Elizondo, David; Colreavy-Donnelly, Simon; Chavarria-Estrada, Luis Fernando; Molina-Cabello, Miguel A.
title: Correcting Data Imbalance for Semi-Supervised Covid-19 Detection Using X-ray Chest Images
date: 2020-08-19
journal: nan
DOI: nan
sha: 90c635cb07e1b45e444286878208ed7e93e72766
doc_id: 500926
cord_uid: r6exuzxs

The Corona Virus (COVID-19) is an international pandemic that has quickly propagated throughout the world. The application of deep learning for image classification of chest X-ray images of COVID-19 patients could become a novel pre-diagnostic detection methodology. However, deep learning architectures require large labelled datasets. This is often a limitation when the subject of research is relatively new, as in the case of the virus outbreak, where dealing with small labelled datasets is a challenge. Moreover, in the context of a new highly infectious disease, the datasets are also highly imbalanced, with few observations from positive cases of the new disease. In this work we evaluate the performance of the semi-supervised deep learning architecture known as MixMatch using a very limited number of labelled observations and highly imbalanced labelled datasets. We propose a simple approach for correcting data imbalance: re-weighting each observation in the loss function, giving a higher weight to the observations corresponding to the under-represented class. For unlabelled observations, we propose the usage of the pseudo and augmented labels calculated by MixMatch to choose the appropriate weight. The MixMatch method combined with the proposed pseudo-label based balance correction improved classification accuracy by up to 10% with respect to the non-balanced MixMatch algorithm, with statistical significance. We tested our proposed approach with several available datasets using 10, 15 and 20 labelled observations. Additionally, a new dataset is included among the tested datasets, composed of chest X-ray images of Costa Rican adult patients.

Coronavirus is an endemic kind of virus that affects vertebrate animals, ranging from mammals to reptiles and birds.

The SARS-CoV-2 virus is a member of this family. Coronaviruses (COVs) belong to the group of Ribonucleic Acid (RNA) viruses. They have the biggest RNA genomes found in the viral world, reaching up to 32 kb [2]. Coronaviruses spread across the gastrointestinal and the respiratory tracts within a large variety of animal groups. The majority of viruses use single animal groups as hosts. However, phylogenetic studies and sequencing of genomes have proven that the COVs have managed to migrate to new host groups [3], which is referred to as a zoonosis. A zoonosis is a contagious disease produced by an infectious agent, such as a virus, which has managed to move across from a vertebrate animal to humans. About sixty percent of new infectious diseases are believed to be of zoonotic origin [27]. Infections caused by zoonosis are of significant concern worldwide. As more and more people regularly travel across the world, rapid spread on a worldwide scale is a lurking danger.

A key priority for global organizations, including the World Health Organization (WHO) as well as governments across the world, is to develop tools to enable the identification of virus outbreaks and to be able to diagnose them in a short time frame. The quick identification of potential virus carriers is vital to contain a virus outbreak. This is where state-of-the-art Artificial Intelligence (AI) based techniques, such as deep learning, can play a key role, enabling pre-diagnostic and triage systems to effectively identify the presence of the virus in a subject. They offer quick diagnosis responses to enable health systems to cope with the rapid spread of virus outbreaks.

This research extends a novel Semi-supervised Deep Learning (SSDL) framework known as MixMatch [10] for the detection of COVID-19 based on chest X-ray images. A semi-supervised learning method allows the combination of labelled and unlabelled data to train the model. This is more cost effective and accessible, as unlabelled data is cheaper than labelled data. Semi-supervised models can easily be adapted for mutations of the virus at a later stage, with relatively small labelled samples.

We propose a modification for the MixMatch algorithm, designed to improve its accuracy under data imbalance settings. Added to smaller labelled datasets, in an outbreak situation, datasets can also be strongly imbalanced, as data available for the subjects manifesting symptoms of the new pathogen are more scarce than non-pathogenic patient records.

arXiv:2008.08496v2 [eess.IV] 20 Aug 2020

A. Use of X-ray images towards the diagnosis of COVID-19

A common, well established and robust method for the detection of the COVID-19 virus is the Real-time Reverse Transcription Polymerase Chain Reaction (RT-PCR) test [13]. This is a molecular test, which uses respiratory tract samples to identify and confirm infection of COVID-19 [1]. The objective of the method is to find the nucleic acid of SARS-CoV-2 within both the lower and the upper respiratory areas. Samples from symptomatic patients suspected of infection of COVID-19 are gathered [42]. However, new research shows the need for testing asymptomatic individuals as well [9]. RT-PCR is the main method used for detecting the presence of the disease [4]. Nevertheless, the costs associated with the use of RT-PCR can be significant, since the facilities and trained personnel needed to perform these tests can be expensive. These costs severely limit the use of this technique in less industrialized countries. This makes the development of more accessible methods urgent, particularly given the possible need to test asymptomatic patients [31].

Diagnosing COVID-19 based on medical imaging can be a reliable and accurate alternative, and is still under exploration. The accuracy and sensitivity levels of this approach as a first stage in COVID-19 detection using chest images, have been analyzed in a number of studies [16] , [21] .

The usage of X-ray images for COVID-19 diagnosis has been studied recently. In [6] the authors proposed a severity score using chest radiography images. The dataset used in this study had a total of 783 SARS-CoV-2 infected cases. The score was used to identify patients that could potentially acquire more life threatening symptoms. Several studies [14], [16], [39] have suggested that, in a small number of people, the manual detection of alterations in medical images of the chest that can indicate the presence of COVID-19 has a low level of sensitivity. The use of features extracted and learned by a machine might overcome the variable subjective evaluation of X-ray images. This leads us to explore the potential implementation of deep learning solutions using more widely available and less expensive chest X-ray images. As typical deep learning architectures require many labelled images, we aim to explore the usage of SSDL for COVID-19 detection using X-ray images, evaluating it under another frequent challenge: data imbalance.

In this paper, we extensively test the SSDL technique known as MixMatch [10] in a variety of data imbalance situations, with a very limited number of labelled observations. We aim to assess MixMatch's performance under real-world scenarios, specifically medical imaging in the context of a virus outbreak, where small labelled samples are available with a strong under-representation of the new pathology, leading to imbalanced datasets. An imbalanced dataset can frequently lead also to a distribution mismatch between the labelled and unlabelled dataset, as described in [33] .

Moreover, in this work we propose a simple, yet effective approach for correcting data imbalance for the SSDL algorithm MixMatch. We implement a loss-based imbalance correction, giving more weight to the under-represented classes in the labelled dataset, a common approach for this aim. In the context of MixMatch, we make use of the pseudo-label and augmented label predictions to choose the corresponding class weight. The implemented SSDL solution for COVID-19 detection makes use of unlabelled data. This might help improve the model's accuracy in the absence of high quality labelled data.

The proposed method uses chest X-ray images. X-ray machines are commonly available, which results in a wealth of unlabelled datasets due to the shortage of radiologists and technicians who can label the images. As an example, India, with its current 1.44 billion population, has a ratio between radiologists and patients of 1:100,000 [8]. However, unlike other medical devices such as computed tomography scanners, X-ray machines can be found even in remote areas of underdeveloped countries [37].

We also make available a first sample of a chest X-ray dataset from the Costa Rican private medical clinic Imagenes Medicas Dr. Chavarria Estrada, with observations containing no findings, and test its usage for training the SSDL framework.

In the event of a viral outbreak, it becomes essential to help health practitioners to quickly identify and classify viral pathologies using digital X-ray images. Outbreaks create a large number of cases, which require the intervention of trained radiologists. Labelling data is time consuming, and in the context of a virus outbreak gathering high quality and reliable labelled data can be challenging. SSDL can provide much needed key support for the diagnosis, tracing and isolation of the COVID-19 infection and other future pandemics through an early, fast and cheap diagnosis, by using more widely available unlabelled data.

The identification of COVID-19 infection based on X-ray images is a new challenge. Thus, to date there is not much research available with regards to the use of deep learning models for automatically identifying COVID-19 infection. For this reason, this paper mainly reviews pre-published work in the area. Since most pre-published articles have not been peer reviewed, they are used here as a general guide and not as a reference for performance.

A classification model based on a support vector machine fed with deep features was presented in [36]. Different common deep learning architectures were used for feature extraction. These included: VGG16, AlexNet, GoogleNet, VGG19, several variations of Inception and ResNet, DenseNet201 and XceptionNet. The dataset used included a total of fifty observations with half representing COVID-19 images and the other half representing a combination of pneumonia and normal images. The COVID-19 images were acquired from the GitHub repository created by Dr. Joseph Cohen from the University of Montreal [18]. COVID-19 negative images were downloaded from the public repository of X-ray images presented in [28]. The highest level of accuracy was obtained with the ResNet50 model, which was combined with a support vector machine as a top model. An accuracy of around 95%, with statistical significance, was obtained.

Several machine learning algorithms were compared in [5]. Some of the methods considered included: support vector machines, random forests and Convolutional Neural Network (CNN) models. The results reported the CNN model as the best performing approach, with an accuracy of 95.2%. The dataset used in this work includes 48 COVID-19+ cases and 23 COVID-19 negative cases from Dr. Cohen's repository [18]. Data augmentation was used to deal with scarce labelled data.

Another study involving the use of CNNs along with transfer-learning for the automatic classification of pneumonia, COVID-19 and images presenting no lung pathology was presented in [7] . The authors used a 10-fold cross-validation, to test the following CNN architectures: VGG-19, MobileNet v2, Inception, Xception and Inception ResNet v2. An accuracy of around 93% was obtained in the identification of COVID-19, with the use of a VGG-19 model. No statistical significance tests were performed. As for the data used in [7] , similar to related proposed solutions, positive COVID-19 cases were extracted from [18] , while pneumonia and no lung pathology observations were taken from [28] .

A deep learning model for the automatic detection of COVID-19 and pneumonia was proposed in [15]. The system proposed classifies images into three classes: COVID-19+, viral pneumonia and normal readings. To increase the number of observations, the authors relied on data augmentation techniques including rotation, translation and scaling, along with transfer learning. The architectures tested included: AlexNet, ResNet19, DenseNet201 and SqueezeNet. The SqueezeNet model outperformed all the other CNN networks. Regarding the data used in that work, a combination of two data repositories [44], [28] was used for the viral and normal image categories, and the data repository in [18] was used for positive COVID-19 cases.

Explainability for deep learning models is an important feature for medical imaging based systems [23] . Model uncertainty estimation is a common approach to enforce model explainability and usage safety [23] . A COVID-19 detection system with uncertainty assessment was proposed in [22] . By providing practitioners with a confidence factor of the prediction, the overall reliability of the system is improved. A high correlation between the prediction accuracy of the model and the level of uncertainty was reported [22] . The dataset used for positive COVID-19 cases also uses Dr. Cohen's repository [18] , and normal X-ray readings were collected from [28] .

In [29], a semi-supervised approach for defining relevant features for COVID-19 detection was developed. The suspicious regions were extracted by training a semi-supervised auto-encoder architecture that minimizes the reconstruction error. This approach relies on the wider availability of COVID-19− cases to learn relevant features. Such extracted features were used for classifying the input observations into three classes: COVID-19+, pneumonia and normal, using a common supervised CNN approach. The extracted features were used to enforce model explainability. Similar to previously reviewed approaches, the datasets provided in [18], [28] were used.

Similarly, the work in [17] used a feature extractor built from training a model to classify X-ray images in larger datasets with non COVID-19 observations. The model was trained for the regression of COVID-19 severity. Similar to [29] , the feature extractors built ease the extraction of further information from the model, improving the model's explainability. A wider range of datasets were used in such work for training the feature extractor [19] , [11] , [26] , [30] , [44] , [25] .

In summary, the reviewed works implemented transfer learning and data augmentation to deal with limited labelled data. Fewer works trained more specific feature extractors [17], [29]. The datasets in [18], [44], [28] have been used extensively in previous work. The frequently used dataset in [18] includes COVID-19+ observations made available by Dr. Joseph Cohen, from the University of Montreal [18]. The images were collected from websites such as radiopaedia.org and the Italian Society of Medical and Interventional Radiology. The images were also collected from recent publications in this area such as [18]. The dataset is made of chest X-ray images involving over 100 patients. Their ages range from 27 to 85 years old. The countries of origin include: Iran, China, Italy, Taiwan, Australia, Spain and the United Kingdom. A warning has been raised by the authors of [18] with regards to any diagnostic performance claims made prior to doing a proper clinical study. As for the dataset available in [28], frequently used in previous work for normal and pneumonia readings, all of its samples were taken from pediatric Chinese patients. The usage of such data as negative COVID-19 cases can be less reliable, since different populations were sampled for the COVID-19 and non-COVID-19 cases. Observations of adults (with ages ranging between 20 and 86 years old) were used for COVID-19+ cases, while for the normal and pneumonia cases in [28], the images were sampled from pediatric patients. Therefore, in this work we test a wider variety of sources for COVID-19− cases, including a new dataset with Costa Rican adult patients.

Little exploration on the benefits of using a fully SSDL model can be found in the literature. Furthermore, to our knowledge no work on the impact and correction of data imbalance in SSDL for COVID-19 detection has been developed so far in the literature.

In general, deep learning models require a large number of labelled observations to provide good levels of generalisation. This limitation makes it hard to apply these techniques to medical applications, since there is a lack of labelled data. SSDL is gaining increasing popularity in the academic community. It is well suited to deal with datasets which are poorly labelled, or have few labels, making SSDL attractive for computer aided medical imaging analysis.

Semi-supervised methods require the use of both labelled samples $S_l = (X_l, Y_l)$ and unlabelled samples $S_u = X_u = \{x_1, \ldots, x_{n_u}\}$. Each labelled observation in $X_l = \{x_1, \ldots, x_{n_l}\}$ has an associated label in the set $Y_l = \{y_1, \ldots, y_{n_l}\}$. No labels are associated with the unlabelled set.

SSDL architectures can be classified as follows: pre-training, self-training (also known as pseudo-labelled) and regularization based. The regularization methods include generative, consistency loss and graph based approaches. An extensive survey on SSDL approaches can be found in [41].

The MixMatch approach developed in [10] merged intensive data augmentation with unsupervised regularization and pseudo-label based semi-supervised learning. This method produced better results compared to other regularized, pseudo-labelled and generative based SSDL methods, as shown in [10].

Data imbalance in the labelled dataset can be approached as a particular case of the data distribution mismatch problem outlined in [33], which arises when the unlabelled dataset presents a different distribution. This is common under real-world usage conditions of SSDL techniques. In [33], the authors took a first look at the impact of Out of Distribution (OOD) data in the unlabelled dataset $S_u$, which leads to a mismatch between the distributions of $S_l$ and $S_u$.

The work in [12] went deeper into the impact of OOD data in SSDL. The authors tested several distribution mismatch scenarios with different degrees of OOD data contamination and different OOD data sources. The results showed an important influence of the degree of OOD data in the unlabelled dataset $S_u$, as well as of the distribution of the OOD observations itself.

In [24], the authors explored further the impact of the distribution mismatch, in the particular case of using imbalanced datasets. The results showed a degradation in classification accuracy ranging from 2% to 10% for the SSDL model. Furthermore, the authors proposed a straightforward approach for correcting such accuracy degradation. The approach assigned weights to each unlabelled observation, depending on the number of observations per class. Higher weights were used for under-represented observations in the unlabelled loss term $\mathcal{L}_u$. To pick the right weight for each unlabelled observation, the class with the highest predicted output of the model at the current epoch was used. The authors implemented and tested the approach in the mean teacher model [40]. The results demonstrated a significant accuracy gain from implementing the proposed approach. We base our contribution on these findings, and propose an extended data imbalance correction approach for MixMatch in the context of semi-supervised COVID-19 detection.

The proposed SSDL method is based on the MixMatch [10] algorithm. It creates a set of pseudo-labels, and also implements an unsupervised regularization term. The consistency loss term used by the MixMatch method minimizes the distance between the pseudo-labels and the predictions that the model makes on the unlabelled dataset $X_u$.

The average model output over the transformed versions of an input $x_j$ was used to estimate its pseudo-label:

$$\hat{y}_j = \frac{1}{K} \sum_{\eta=1}^{K} f_w\left(\Psi^{\eta}(x_j)\right)$$

Here $K$ corresponds to the number of transformations $\Psi^{\eta}$ (like image flipping) performed. Based on the work by [10], a value of $K = 2$ is recommended. The authors also mentioned that the estimated pseudo-label $\hat{y}_j$ usually presents a high entropy value. This can increase the number of non-confident estimations. Therefore, the output array $\hat{y}$ was sharpened with a temperature $\rho$:

$$s(\hat{y}, \rho)_i = \frac{\hat{y}_i^{1/\rho}}{\sum_{k} \hat{y}_k^{1/\rho}}$$

When $\rho \to 0$, the sharpened distribution $\tilde{y} = s(\hat{y}, \rho)$ becomes a Dirac function, assuming a one-hot vector representation. The term $\tilde{S}_u = (X_u, \tilde{Y})$ defines the dataset with the sharpened estimated pseudo-labels, where $\tilde{Y} = \{\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{n_u}\}$.
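The pseudo-label estimation and sharpening steps can be illustrated with a minimal sketch in plain Python. The names `average_predictions`, `sharpen`, `toy_model`, `identity` and `flip` are hypothetical and introduced only for illustration; the toy model and the input values are assumptions, not the Wide-ResNet or the data used in this work.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def average_predictions(
    model: Callable[[Sequence[float]], Vector],
    transforms: Sequence[Callable[[Sequence[float]], Sequence[float]]],
    x: Sequence[float],
) -> Vector:
    """Average the model output over the K transformed copies of x."""
    outputs = [model(transform(x)) for transform in transforms]
    k = len(outputs)
    return [sum(column) / k for column in zip(*outputs)]


def sharpen(y_hat: Sequence[float], rho: float) -> Vector:
    """Raise each probability to the power 1/rho and renormalise."""
    powered = [p ** (1.0 / rho) for p in y_hat]
    total = sum(powered)
    return [p / total for p in powered]


def toy_model(x: Sequence[float]) -> Vector:
    """Hypothetical two-class model: the first pixel is the positive score."""
    p = x[0]
    return [1.0 - p, p]


def identity(x: Sequence[float]) -> Vector:
    return list(x)


def flip(x: Sequence[float]) -> Vector:
    return list(reversed(x))


# K = 2 transformations, sharpening temperature 0.5 (values stated in the text).
y_hat = average_predictions(toy_model, [identity, flip], [0.2, 0.4, 0.9])
y_tilde = sharpen(y_hat, rho=0.5)
```

In this toy case the averaged prediction is close to uniform, and sharpening moves probability mass towards the most likely class while keeping the output a valid distribution.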

In [10] the authors argued that data augmentation is a key aspect when it comes to SSDL. The authors used the MixUp approach, as proposed in [46] , to further augment data using both labelled and unlabelled observations:

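The MixUp method proposed to create new observations based on a linear interpolation of a combination of unlabelled (together with their pseudo-labels) and labelled data. More specifically, for two labelled or pseudo-labelled data pairs $(x_a, y_a)$ and $(x_b, y_b)$, MixUp creates a new observation with its corresponding label $(x', y')$ based on the following steps, as defined in [10]:

1) Sample the MixUp parameter $\lambda$ from a Beta distribution, $\lambda \sim \text{Beta}(\alpha, \alpha)$.
2) Ensure that $\lambda \geq 0.5$ by taking $\lambda' = \max(\lambda, 1 - \lambda)$.
3) Create the new observation as a linear interpolation of the two observations: $x' = \lambda' x_a + (1 - \lambda') x_b$.
4) Create the corresponding label with the same interpolation: $y' = \lambda' y_a + (1 - \lambda') y_b$.

The augmented datasets $S'_l$ and $\tilde{S}'_u$ were used by the MixMatch algorithm to train a model as specified in the training function $T_{\text{MixMatch}}$, which minimizes the following loss over the model weights $w$:

$$\mathcal{L}(S, w) = \sum_{(x_i, y_i) \in S'_l} \mathcal{L}_l(w, x_i, y_i) + \gamma \, r(t) \sum_{(x_j, \tilde{y}_j) \in \tilde{S}'_u} \mathcal{L}_u(w, x_j, \tilde{y}_j)$$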
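As an illustration of the interpolation described above, the following minimal sketch mixes one pair of observations. It assumes the MixMatch variant of MixUp from [10], which keeps the interpolation coefficient at or above 0.5 so that the mixed observation stays closer to its first argument. The function name `mixup_pair` and the toy vectors are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mixup_pair(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    alpha: float,
    rng: random.Random,
) -> Tuple[List[float], List[float], float]:
    """Mix two (observation, label) pairs with a Beta(alpha, alpha) coefficient."""
    lam = rng.betavariate(alpha, alpha)
    # MixMatch variant (assumption taken from [10]): keep the coefficient >= 0.5.
    lam = max(lam, 1.0 - lam)
    x_new = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_new = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_new, y_new, lam


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x_mix, y_mix, lam = mixup_pair(
    [0.0, 1.0], [1.0, 0.0],  # toy observation a with a one-hot label
    [1.0, 0.0], [0.0, 1.0],  # toy observation b with a one-hot label
    alpha=0.75,              # Beta parameter stated in the text
    rng=rng,
)
```

The mixed label is a soft label: its entries still sum to one, and the entry of the first observation's class equals the coefficient used.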

For the labelled loss term, a cross-entropy loss was used: $\mathcal{L}_l(w, x_i, y_i) = \delta_{\text{cross-entropy}}(y_i, f_w(x_i))$. As for the unlabelled loss term, a Euclidean distance was implemented: $\mathcal{L}_u(w, x_j, \tilde{y}_j) = \lVert \tilde{y}_j - f_w(x_j) \rVert$. The coefficient $r(t)$ was proposed as a ramp-up function that increases its value as the epochs $t$ increase. In our implementation, $r(t)$ was set to $t/3000$. The $\gamma$ factor was used as a regularization weight. This coefficient controls the influence of the unlabelled data. It is important to highlight that unlabelled data also has an effect on the labelled data term $\mathcal{L}_l$, because the MixUp method uses unlabelled data to artificially increase the number of observations for the labelled term as well.

In this work an implementation of a data imbalance correction in the loss function of the MixMatch method is proposed. Positive results were yielded in [24] for correcting dataset imbalance by weighting the unsupervised loss function terms on a per observation basis. The authors in [24] developed a similar approach by modifying the SSDL framework known as mean teacher [40]. We extend this approach to the MixMatch algorithm, using both the pseudo-labels and the augmented labels to select the appropriate weights for both the unlabelled and labelled loss terms. We refer to the approach proposed in this work as Pseudo-label based Balance Correction (PBC).
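The way the two loss terms and the ramp-up coefficient combine can be sketched as follows. This is a minimal sketch under stated assumptions: the per-observation losses are summed over the batch (the text does not state whether a sum or a mean is used), the model predictions are passed in directly rather than computed by a network, and `gamma = 1.0` is a placeholder, since the value of the regularization weight is not given here. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (label, model prediction)


def ramp_up(t: int, period: int = 3000) -> float:
    """Ramp-up coefficient r(t) = t / 3000, as stated in the text."""
    return t / period


def cross_entropy(y: Sequence[float], p: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy between a (possibly soft) label y and a prediction p."""
    return -sum(y_k * math.log(p_k + eps) for y_k, p_k in zip(y, p))


def euclidean(y: Sequence[float], p: Sequence[float]) -> float:
    """Euclidean distance between a pseudo-label y and a prediction p."""
    return math.sqrt(sum((y_k - p_k) ** 2 for y_k, p_k in zip(y, p)))


def mixmatch_loss(labelled: List[Pair], unlabelled: List[Pair], gamma: float, t: int) -> float:
    """Labelled term plus the ramped, gamma-weighted unlabelled term."""
    labelled_term = sum(cross_entropy(y, p) for y, p in labelled)
    unlabelled_term = sum(euclidean(y, p) for y, p in unlabelled)
    return labelled_term + gamma * ramp_up(t) * unlabelled_term


# Toy batch with made-up predictions, for illustration only.
labelled_batch = [([1.0, 0.0], [0.9, 0.1])]
unlabelled_batch = [([0.6, 0.4], [0.5, 0.5])]
loss_start = mixmatch_loss(labelled_batch, unlabelled_batch, gamma=1.0, t=0)
loss_later = mixmatch_loss(labelled_batch, unlabelled_batch, gamma=1.0, t=1500)
```

At `t = 0` only the labelled term contributes, and the unlabelled term gains influence linearly as training progresses.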

The number of observations per class is used to compute the array of correction coefficients $c$. First, the array $v$ is calculated using the inverse of the number of observations available for each class in $S_l$: $v_i = 1/n_i$, where $n_i$ corresponds to the total number of observations for class $i$. The next step consists of the computation of the array with the normalized weights $c$ as follows:

$$c_i = \frac{v_i}{\sum_{k=1}^{C} v_k}$$

where $C$ corresponds to the total number of classes; in this work $C = 2$, as a binary classification model is developed. The augmented, pseudo, and original labels $y_i$ and $\tilde{y}_j$ are contained in the augmented labelled and unlabelled datasets, $S'_l$ and $\tilde{S}'_u$, respectively, after the MixUp method mentioned in Section II-C is executed. Such augmented labels are used to select the corresponding weight in $c$. To do so, the one-hot vector notation of the labels is converted to a numeric one:

$$b_i = \underset{k}{\arg\max} \; y_{i,k}, \qquad b_j = \underset{k}{\arg\max} \; \tilde{y}_{j,k}$$

for every observation $i$ and $j$ in $S'_l$ and $\tilde{S}'_u$, respectively. The chosen indices $b_i$ and $b_j$, obtained from the real and estimated labels, are used to index the array of weights $c$, and the selected weights are applied to both loss terms. We used a cross-entropy and a mean squared error (MSE) loss for the labelled and unlabelled loss terms, respectively. Therefore, the modified cross-entropy and MSE functions are respectively described as follows: $\mathcal{L}_l(w, x_i, y_i) = \delta_{\text{cross-entropy}}(c_{b_i} y_i, c_{b_i} f_w(x_i))$ and $\mathcal{L}_u(w, x_j, \tilde{y}_j) = \lVert c_{b_j} \tilde{y}_j - c_{b_j} f_w(x_j) \rVert$. The re-weighted loss functions are minimized as usual.
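The balance correction can be sketched in a few lines of plain Python. The sketch assumes class index 0 is COVID-19− and index 1 is COVID-19+, and uses a hypothetical labelled sample of 10 observations with an 80%/20% split. For the labelled term, the weight multiplies the per-observation cross-entropy, which is the common way of implementing a class-weighted loss and is an assumption here, since the text writes the weight applied to both arguments of the loss. For the unlabelled term, scaling both arguments is equivalent to scaling the distance. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def class_weights(counts: Sequence[int]) -> List[float]:
    """Normalised inverse-frequency weights: v_i = 1 / n_i, c = v / sum(v)."""
    v = [1.0 / n for n in counts]
    total = sum(v)
    return [v_i / total for v_i in v]


def label_index(y: Sequence[float]) -> int:
    """Convert a one-hot or soft label into the index of its largest entry."""
    return max(range(len(y)), key=lambda k: y[k])


def weighted_labelled_loss(c, y, p, eps: float = 1e-12) -> float:
    """Cross-entropy scaled by the weight of the label's class (assumed form)."""
    weight = c[label_index(y)]
    return -weight * sum(y_k * math.log(p_k + eps) for y_k, p_k in zip(y, p))


def weighted_unlabelled_loss(c, y_tilde, p) -> float:
    """Euclidean distance between the weighted pseudo-label and prediction."""
    weight = c[label_index(y_tilde)]
    return math.sqrt(sum((weight * a - weight * b) ** 2 for a, b in zip(y_tilde, p)))


# Hypothetical labelled sample: 8 negative and 2 positive observations.
c = class_weights([8, 2])
loss_minority = weighted_labelled_loss(c, [0.0, 1.0], [0.5, 0.5])
loss_majority = weighted_labelled_loss(c, [1.0, 0.0], [0.5, 0.5])
loss_unlabelled = weighted_unlabelled_loss(c, [0.3, 0.7], [0.5, 0.5])
```

With 8 and 2 observations, $v = (1/8, 1/2)$ and the normalized weights are $c = (0.2, 0.8)$, so the same prediction error costs four times more on an observation assigned to the under-represented class.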

A system to classify X-ray images into COVID-19+ and no lung pathology (COVID-19−) is presented in this work. We used different previously existing datasets, and added a new one containing negative COVID-19 cases.

The following previously existing datasets were used in this work:

1) COVID-19+ dataset: Images containing COVID-19+ observations were collected from the publicly available GitHub repository accessible from [18]. This repository was built by Dr. Joseph Cohen, from the University of Montreal [18]. The images were collected from websites such as radiopaedia.org and the Italian Society of Medical and Interventional Radiology. Images were also collected from recent publications in this area such as [18]. Only images containing signs of COVID-19+ were used in this study. All other images relating to Middle East Respiratory Syndrome (MERS), Acute Respiratory Distress Syndrome (ARDS) and Severe Acute Respiratory Syndrome (SARS) were discarded. This reduced the dataset to a subset of 102 frontal chest X-ray images containing COVID-19+ observations. The gray-scaled observations were stored with varying resolutions from 400 × 400 up to 2500 × 2500 pixels.
2) Chinese pediatric patients dataset: A dataset of 5856 observations containing images of pneumonia and normal observations was defined in [28]. The patient sample used for the study corresponds to Chinese children [28]. These images are divided into 4273 observations of pneumonia (including viral and bacterial) and 1583 observations with no lung pathology (normal). We used the observations with no findings, and refer to them as the Chinese pediatric dataset. The negative and pneumonia observations from this dataset have been used extensively in recent research related to COVID-19 detection [32], [47], [43], [20], [34], [7]. Most of the images were stored with a resolution of 1300 × 600 pixels.
3) ChestX-ray8 dataset: The ChestX-ray8 dataset, made available in [25], is also used for the category of no findings in this work. The dataset includes 224,316 chest radiographs from 65,240 patients from Stanford Hospital, US. The studies were done between October 2002 and July 2017. We picked a sample of this dataset available on its website, given the low labelled data setting used in this work. Patients sampled in this dataset were aged from 0 to 94 years old.
4) Indiana Chest X-ray dataset: The dataset published in [19] gathers 8121 images from the Indiana Network for Patient Care. Only the observations with no pathologies were used in this work. The dataset can be accessed from its repository. Images were stored with a resolution of 1400 × 1400 pixels.

In this work we also used a dataset we gathered from a Costa Rican private clinic, Clinica Imagenes Medicas Dr. Chavarria Estrada. The data corresponds to chest X-rays from 153 different patients, with ages ranging from 7 to 86 years old. 63% of the patients were female and 37% were male. The images were taken using a Konica Minolta digital X-ray machine with a pixel spacing of 0.175. The images were stored with a resolution of 1907 × 1791 pixels. As the images were digitally sampled, no tags or manual labels are contained in the images.

All the datasets have been preprocessed to exclude artifacts (manual labels) whenever the other dataset in the comparison does not present any, to avoid artifact bias. Data augmentation using flips and rotations is implemented. No crops were used, to avoid losing regions that might be important for image discrimination. Single-channel images stored with 8 bits were replicated across 3 channels to fit the input of the selected CNN architecture.

We used the following hyper-parameters for the MixMatch model in all the experiments performed: $K = 2$ transformations, a sharpening temperature of $T = 0.5$ (the temperature $\rho$ of the sharpening step) and $\alpha = 0.75$ for the Beta distribution. A Wide-ResNet [45] model has been used for all the experiments, with an input image size of 110 × 110 pixels, and the following hyper-parameters: a weight decay of 0.0001, a learning rate of 0.00001, a batch size of 12 observations, a cross-entropy loss function and an Adam optimizer with a 1-cycle policy [38].

For each configuration, we trained the model 10 times for a total of 50 epochs. For each run, a sample dataset of 204 observations was picked from both the evaluated COVID-19− dataset and the COVID-19+ dataset available in [18]. Therefore, a total of 10 different training and test samples were used. The same samples were used for all the tested algorithm variations. A completely balanced validation dataset comprising 30% of the 204 observations was used.

To assess the data imbalance impact, we evaluated both the supervised and the semi-supervised architectures using three balance configurations: 50%/50%, 80%/20% and 70%/30% for the labelled dataset $S_l$. The under-represented class corresponds to the COVID-19+ class. We tested different sizes of labelled samples, $n_l = 10$, $n_l = 15$ and $n_l = 20$. The remaining data was used as unlabelled data, with close to a 50% data balance between the two classes. This leads to a distribution mismatch between $S_u$ and $S_l$. Tables I, II, III and IV show this layout. Given the low labelled setting, we report the highest validation accuracy, assuming the usage of early stopping to avoid over-fitting. We trained the MixMatch model with both the uncorrected loss function and the proposed PBC modification for data imbalance correction. For reference, we also tested the supervised model with balance correction and without it.

Table V summarizes the accuracy gains when using MixMatch with PBC vs. not using MixMatch, and using MixMatch with no balance correction (under the same balance conditions) vs. using MixMatch with PBC. A non-parametric Wilcoxon test was performed to detect whether the accuracy gain was statistically significant (with p < 0.1) across the 10 runs (observations) sampled. Gains that are not statistically significant according to this criterion are written in italic in Table V.

Finally, as a qualitative experiment, we calculated the gradient activation maps using the technique proposed in [35] 6. For this qualitative experiment we compared the supervised model and the MixMatch modification with the proposed PBC. The objective of this experiment was to spot the changes in the regions used by the model to output its decision when trained with the semi-supervised approach. A sample with 20 labelled observations and around 180 unlabelled observations (for the MixMatch model with PBC) was used for training the model. A completely balanced dataset of 61 observations was used for validation. We trained a DenseNet121 model for 50 epochs, for both the supervised and semi-supervised frameworks. Figure 1 includes sampled heatmaps for the ChestX-ray8 and Indiana datasets. The net weights in the final output layer for each entry, and the real and predicted labels, are also shown for each output image in Figure 1.
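The significance test on the accuracy gains can be illustrated with a small self-contained sketch. It assumes the paired Wilcoxon signed-rank variant, which fits the design in which the same 10 samples are used for all the algorithm variations; the text does not state which variant was applied. The exact two-sided p-value is obtained by enumerating all sign patterns, which is feasible for 10 runs. The accuracy values below are hypothetical and are not results from this work.

```python
from itertools import product
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, exact two-sided p-value) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    patterns = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
        patterns += 1
    return w_plus, extreme / patterns


# Hypothetical validation accuracies (%) over 10 paired runs.
acc_with_pbc = [90, 92, 88, 91, 93, 89, 94, 90, 92, 91]
acc_without = [85, 88, 86, 84, 90, 87, 85, 89, 86, 83]
w_plus, p_value = wilcoxon_signed_rank(acc_with_pbc, acc_without)
significant = p_value < 0.1
```

With 10 runs in which every paired difference has the same sign, only 2 of the 1024 sign patterns are as extreme as the observed one, so the gain in this made-up example would be reported as significant at the 0.1 level.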

The results using accuracy as a metric for the Costa Rican dataset are depicted in Table I. The base-line accuracy is rather high for very limited labelled settings, even with the base-line supervised model, with accuracies ranging from 87% to 95%, using 10 and 20 labels, respectively. SSDL is perhaps only attractive when using 10 labels, with an accuracy gain of around 7%, as displayed in the summary Table V. The accuracy gain from implementing PBC vs. using the non-balanced MixMatch approach remains similar regardless of the number of labels used, always with statistical significance. However, the accuracy gain of using MixMatch, even with the PBC modification, diminishes as the number of labels increases. The accuracy gain is rather similar for both of the data imbalance configurations tested. As seen in Table I, the implemented PBC corrects the data imbalance impact, yielding results similar to those obtained with the completely balanced dataset.

Regarding the test results using the Chinese pediatric dataset, the baseline supervised accuracy results are initially low (from 86% to 92%), giving more room for SSDL accuracy gain, as seen in Table II. The usage of MixMatch with the proposed PBC over regular supervised learning yields an accuracy gain of over +11%, as seen in Table V. Similar to the Costa Rican dataset, as the number of labels increases, the accuracy gain decreases. The benefit of using the PBC over the off-the-shelf MixMatch implementation is higher when facing a more imbalanced dataset scenario, as seen in Table V for the Chinese dataset. The accuracy gain is almost three times higher when using the 80%/20% configuration, increasing from around +3% for the 70%/30% imbalance scenario to +10% for the 80%/20% scenario. The PBC is able to almost correct the impact of data imbalance, as the accuracies in Table II show.

6 We used the FastAI implementation of the gradient activation maps available at https://forums.fast.ai/t/gradcam-and-guided-backprop-intergration-in-fastai-library/33462

Table III summarizes the results yielded for the Chest X-ray8 dataset. The baseline accuracy of the supervised model is the lowest among the tested datasets, sitting at around 75%. The accuracy gain of using MixMatch with PBC versus the usual supervised model ranges from +5% to +9.6%, as seen in Table V, in the row for the Chest X-ray8 dataset. As for the accuracy gain of using MixMatch with PBC vs. MixMatch with no balance correction, it stays around +3% to +5% for the 70%/30% imbalance configuration. Higher accuracy gains are obtained when dealing with the more challenging imbalance scenario of 80%/20%, with gains of up to 14%. As with the other datasets, the PBC corrects the impact of data imbalance on MixMatch's accuracy most of the time, as seen in Table III.

Finally, the test results for the Indiana dataset are reported in Table IV. The baseline accuracy for the Indiana chest X-ray dataset ranges from 84% to 88%. The accuracy gain from implementing MixMatch with PBC ranges from 4% to 5.6% versus the baseline supervised model. Implementing the PBC versus the original MixMatch implementation yields an accuracy gain from +4.5% to +14%. In the case of this dataset, data imbalance seems to further decrease MixMatch's accuracy, as we can see in Table IV when comparing the accuracy results of the 50%/50% configuration to the 70%/30% and 80%/20% imbalance settings.

For the tested datasets, the accuracy of the baseline supervised model can be considered very similar under the different data imbalance conditions, as seen in Tables I, II, III and IV, which suggests that MixMatch is more sensitive than the supervised model when trained with imbalanced data. The overall trend of the accuracy gain of using the proposed MixMatch with PBC over its original implementation is positive, as seen in Table V, across all the datasets tested. Most of the accuracy gains are higher than 3%, and most of them are also statistically significant according to a non-parametric Wilcoxon test, using p < 0.1 as the criterion for a significant difference between the accuracies of both configurations. There are some cases where the default MixMatch implementation does not bring any accuracy gain when facing an imbalanced dataset, as seen for instance in the test results of the Indiana dataset, detailed in Table IV. For example, the accuracy of the supervised model with 10 labels is around 83%, and the accuracy of the MixMatch model with no PBC is no higher than 83%. This implies that correcting data imbalance is mandatory for the MixMatch model, given its high sensitivity to data imbalance.

Finally, regarding the qualitative experiments proposed, Figure 1 shows sample heatmaps for the Indiana and Chest X-ray8 datasets. For both datasets, the heatmaps reveal how the neural network tends to focus more on lung areas when using the semi-supervised model. For the tested sample from the Indiana dataset, the DenseNet121 model trained with MixMatch including the PBC modification yielded an accuracy of 91.3%, compared with 67.74% for the supervised model. For the Chest X-ray8 dataset, an accuracy of 93.4% was yielded by the MixMatch framework with PBC, compared with 77.4% for the supervised model.
We can see in Figure 1 how the hot pixels move towards lung regions when using the semi-supervised model, and also how the net weights of the output layer become steeper. This tends to happen even when the resulting predictions in both models are correct.

In this work we have analyzed the impact of data imbalance for the detection of COVID-19 using chest X-ray images. This is a real-world problem, which can arise frequently in the context of a pandemic, where few observations are available for the new pathology. To our knowledge, this is the first data imbalance analysis of an SSDL model designed to perform COVID-19 detection using chest X-ray images. The experimental results suggest a strong impact of data imbalance on the overall MixMatch accuracy, since the results in Table V reveal a stronger sensitivity of SSDL when compared to a supervised approach. The accuracy hit of training MixMatch with an imbalanced labelled dataset lies in the 2-11% range, as seen in Tables I, II, III and IV. This reinforces the argument developed in [33], [12], which draws attention to data distribution mismatch between the labelled and the unlabelled datasets as a frequent real-world challenge when training an SSDL model.

Moreover, a simple and effective approach for correcting data imbalance by modifying MixMatch's loss function was proposed and tested in this work. The proposed method gives a higher weight to the observations belonging to the under-represented class in the labelled dataset. Both the unlabelled and the labelled loss terms were re-weighted, as opposed to the unlabelled re-weighting developed for the mean teacher model in [24], which only modifies the weights of the unlabelled term. This was done since in our empirical tests the unlabelled term had less impact on the overall model accuracy. For the pseudo-labelled and MixUp augmented observations, we assigned the weights using the pseudo and augmented labels. The proposed method is computationally cheap, and avoids the need for complex and expensive generative approaches to correct data imbalance. A systematic accuracy gain is yielded when comparing the original MixMatch implementation with the proposed PBC for data imbalance correction, as seen in Table V.
For the tested datasets, the proposed PBC often leads to significant accuracy gains over the supervised model, since data imbalance can otherwise cancel out any accuracy gain of using MixMatch, as seen in Tables I, II, III and IV. The accuracy gain ranges between 3% and 11%, with statistical significance for most of the datasets tested. In most of the datasets, the accuracy gain is higher for the 80%/20% imbalance setting. Among the tested datasets, we included a new one with digital X-rays from healthy Costa Rican patients, which we make available to the community.
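The re-weighting idea summarised in these conclusions can be sketched as follows. This is only an illustration under stated assumptions: the inverse-frequency weights normalised to sum to the number of classes, the function names, and the coefficient `lambda_u` are all hypothetical choices, and the exact formulation of the PBC is the one given in the method section of this work, not this sketch. The structure follows the general MixMatch loss (a cross-entropy term for labelled data plus a squared-error term for unlabelled data), with each observation's weight selected by the dominant class of its (pseudo or MixUp-augmented) label.

```python
import math


def class_weights(labels, n_classes):
    """Inverse-frequency class weights, scaled to sum to n_classes.

    Assumed weighting scheme: the under-represented class in the labelled
    dataset receives the higher weight.
    """
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    inverse = [1.0 / c if c > 0 else 0.0 for c in counts]
    scale = n_classes / sum(inverse)
    return [w * scale for w in inverse]


def dominant_class(distribution):
    """Index of the largest entry of a (soft) label distribution."""
    return max(range(len(distribution)), key=distribution.__getitem__)


def weighted_labelled_loss(probs, targets, weights):
    """Weighted cross-entropy between predictions and soft (MixUp) labels."""
    total = 0.0
    for p, q in zip(probs, targets):
        cross_entropy = -sum(qk * math.log(pk + 1e-12) for pk, qk in zip(p, q))
        total += weights[dominant_class(q)] * cross_entropy
    return total / len(probs)


def weighted_unlabelled_loss(probs, pseudo_labels, weights):
    """Weighted mean squared error between predictions and pseudo-labels."""
    total = 0.0
    for p, q in zip(probs, pseudo_labels):
        squared_error = sum((pk - qk) ** 2 for pk, qk in zip(p, q)) / len(q)
        total += weights[dominant_class(q)] * squared_error
    return total / len(probs)


def balanced_mixmatch_loss(lab_probs, lab_targets, unl_probs, unl_pseudo,
                           weights, lambda_u=1.0):
    """Labelled term plus lambda_u times the unlabelled term (both re-weighted)."""
    return (weighted_labelled_loss(lab_probs, lab_targets, weights)
            + lambda_u * weighted_unlabelled_loss(unl_probs, unl_pseudo, weights))


# Hypothetical 80%/20% labelled set: class 1 is the under-represented class.
weights = class_weights([0] * 8 + [1] * 2, n_classes=2)
loss = balanced_mixmatch_loss(
    lab_probs=[[0.7, 0.3], [0.4, 0.6]],
    lab_targets=[[0.9, 0.1], [0.2, 0.8]],
    unl_probs=[[0.6, 0.4]],
    unl_pseudo=[[0.3, 0.7]],
    weights=weights,
)
print(weights, loss)
```

Because the weights depend only on class counts in the labelled dataset and on an argmax over labels that MixMatch already computes, the correction adds essentially no training cost, which is consistent with the claim that the method is computationally cheap.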

This work can be extended by using the customized feature extractors proposed in [17], as our architecture uses the more common transfer learning approach of pre-training on a generic dataset (ImageNet) and later refining the feature extractor. The semantic relevance of the extracted features can be improved along with the model explainability, as seen in Figure 1. However, the solution proposed in this work can be ported to use a more specific feature extractor. Therefore, we plan to test its usage under different customized feature extractors. Furthermore, it would be interesting to investigate the impact of SSDL on deep learning explainability and uncertainty measures. We suspect that unlabelled data can improve models' uncertainty estimations and explainability accuracy.

Advice on the use of point-of-care immunodiagnostic tests for COVID-19

Coronavirinae -an overview -ScienceDirect Topics

Deltacoronaviruses -an overview -ScienceDirect Topics

Correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases

Hiam Alquran, Isam Abuqasmieh, and Amin Alqudah. Covid-2019 detection using x-ray images and artificial intelligence hybrid systems

Radiographic severity index in covid-19 pneumonia: relationship to age and sex in 783 italian patients

Covid-19: Automatic detection from X-Ray images utilizing Transfer Learning with Convolutional Neural Networks

The training and practice of radiology in India: current trends. Quantitative imaging in medicine and surgery

Presumed asymptomatic carrier transmission of covid-19

Mixmatch: A holistic approach to semi-supervised learning

Padchest: A large chest x-ray image dataset with multilabel annotated reports

Jordina Torrents-Barrena, Shengxiang Yang, Armaghan Moemeni, Wojciech Samek, and Miguel A

Mixmood: A systematic approach to class distribution mismatch in semi-supervised learning using deep dataset dissimilarity measures

Improved molecular diagnosis of covid-19 by the novel, highly sensitive and specific covid-19-rdrp/hel real-time reverse transcriptionpolymerase chain reaction assay validated in vitro and with clinical specimens

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study

Nasser Al-Emadi, and Mamun Bin Ibne Reaz. Can AI help in screening Viral and COVID-19 pneumonia

Ct imaging features of 2019 novel coronavirus (2019-ncov)

Predicting covid-19 pneumonia severity on chest x-ray with deep learning

Covid-19 image data collection

Preparing a collection of radiology examinations for distribution and retrieval

COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images

Sensitivity of chest ct for covid-19: comparison to rt-pcr

Estimating Uncertainty and Interpretability in Deep Learning for Coronavirus (COVID-19) Detection

Causability and explainability of artificial intelligence in medicine

Class-imbalanced semisupervised learning

Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison

Mimic-cxr: A large publicly available database of labeled chest radiographs

Global trends in emerging infectious diseases

Identifying medical diagnoses and treatable diseases by image-based deep learning

Coronet: A deep network architecture for semi-supervised task-based identification of covid-19 from chest x-ray images. medRxiv

Chest radiograph interpretation with deep learning models: assessment with radiologist-adjudicated reference standards and population-adjusted evaluation

Pooling rt-pcr or ngs samples has the potential to cost-effectively generate estimates of covid-19 prevalence in resource limited environments. medRxiv

Automatic Detection of Coronavirus Disease (COVID-19) Using X-ray Images and Deep Convolutional Neural Networks

Realistic evaluation of deep semi-supervised learning algorithms

COVID-19 Detection using Artificial Intelligence

Grad-cam: Visual explanations from deep networks via gradient-based localization

Detection of coronavirus disease (covid-19) based on deep features

Assessment of the availability of technology for trauma care in india

A disciplined approach to neural network hyperparameters: Part 1-learning rate, batch size, momentum, and weight decay

Emerging 2019 novel coronavirus (2019-ncov) pneumonia

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

A survey on semi-supervised learning

In vitro diagnostic assays for covid-19: Recent advances and emerging trends

COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images

Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases

Wide residual networks

mixup: Beyond empirical risk minimization

COVID-19 Screening on Chest X-ray Images Using Deep Learning based Anomaly Detection

This work is partially supported by Spanish grants TIN2016-75097-P, RTI2018-094645-B-I00, UMA18-FEDERJA-084 and the funding from the Universidad de Málaga. We acknowledge Clinica Imagenes Medicas Dr. Chavarria Estrada, La Uruca, San Jose, Costa Rica, for its support to the data compilation process of the digital X-ray image dataset used in this work.