FundusQ-Net: a Regression Quality Assessment Deep Learning Algorithm for Fundus Images Quality Grading

Or Abramovich, Hadas Pizem, Jan Van Eijgen, Ingeborg Stalmans, Eytan Blumenthal, Joachim A. Behar

Date: 2022-05-02

Objective: Ophthalmological pathologies such as glaucoma, diabetic retinopathy and age-related macular degeneration are major causes of blindness and vision impairment. There is a need for novel decision support tools that can simplify and speed up the diagnosis of these pathologies. A key step in this process is to automatically estimate the quality of the fundus images to ensure that they are interpretable by a human operator or a machine learning model. We present a novel fundus image quality scale and a deep learning (DL) model that can estimate fundus image quality relative to this new scale. Methods: A total of 1,245 images were graded for quality by two ophthalmologists in the range 1-10, with a resolution of 0.5. A DL regression model was trained for fundus image quality assessment. The architecture used was Inception-V3. The model was developed using a total of 89,947 images from 6 databases, of which 1,245 were labeled by the specialists and the remaining 88,702 were used for pre-training and semi-supervised learning. The final DL model was evaluated on an internal test set (n=209) as well as an external test set (n=194). Results: The final DL model, denoted FundusQ-Net, achieved a mean absolute error of 0.61 (0.54-0.68) on the internal test set. When evaluated as a binary classification model on the public DRIMDB database as an external test set, the model obtained an accuracy of 99%. Significance: The proposed algorithm provides a new robust tool for automated quality grading of fundus images.

Ophthalmological pathologies are the leading cause of blindness in the world [1]. Globally, it is estimated that 237.1 million people suffer from moderate or severe visual impairment, and 38.5 million people are blind [1]. The pathologies with the most severe impact on visual acuity are age-related macular degeneration (AMD), cataract, diabetic retinopathy and glaucoma [1]. A variety of techniques are used for diagnosis, including optical coherence tomography (OCT) imaging, color fundus photography, fluorescein angiography (FA), fundus autofluorescence (FAF) and optical coherence tomography angiography (OCTA) [2]-[4]. Among these, computerized analysis of color fundus images is a widely used means of diagnosis [5]-[7]. A digital fundus image (DFI) is an image of the inner lining of the eye which captures the optic disc, the fovea, the macula, the retina and the blood vessels. Early detection and treatment can help slow down and sometimes prevent further vision loss [8]. However, there is currently a global shortage of ophthalmologists which, according to current trends, will only intensify in the coming years [9]. This prevents many people from being diagnosed in a timely manner. Techniques such as automated screening devices and telemedicine have recently been discussed as ways to provide fast diagnoses without the need for an on-site ophthalmologist [10]. Real-world DFIs can be of low quality for many reasons: a dirty camera lens, improper flash and gamma adjustment, eye blinks and occlusion by eyelashes [11], as well as media opacity [12] and insufficient technician skill.
Therefore, a large-scale screening or telemedicine device must be able to automatically identify and handle low-quality images. However, this is not how state-of-the-art systems currently operate: the authors of the main works in the field manually discard low-quality DFIs from their databases as a preliminary step [13], [14]. For example, Liu et al. [13] manually reviewed 274,413 DFIs, employing several tiers of human graders, and discarded 1.7% of them (n=4,812). Li et al. [14] trained 21 ophthalmologists to discern between gradable and ungradable DFIs; these ophthalmologists reviewed 48,116 DFIs. This is a time-consuming process that limits the clinical applicability of the developed algorithms.

Analyzing the quality of a DFI is a fundamentally different task from analyzing the quality of a regular image. Raj et al. [11] explained that typical Image Quality Assessment (IQA) methods may not be adequate for Retinal Image Quality Assessment (RIQA) because the statistical properties of DFIs differ vastly from those of natural images. This means that field-specific methods must be developed.

RIQA algorithms can be separated into three groups: similarity-based, segmentation-based and ML-based. Similarity-based algorithms compare the target image to a set of high-quality images. Segmentation-based algorithms first extract structures from the target image and then analyze them according to different parameters. ML-based algorithms involve training an ML model, either on features extracted from the DFI or on the DFI itself.

Fig. 1. An example of different DFIs used for the reference set. Image A is graded as 2, image B is graded as 5, image C is graded as 7.5 and image D is graded as 9.5.

Similarity-based and segmentation-based methods are not commonly used nowadays [11]. Similarity-based algorithms are unpopular because they fail to take into consideration the structural information contained in the DFI. Segmentation-based algorithms are very rigid and only function when the fundus has certain characteristics, such as a specific shape, size and location of physiological features within the image; changes to these parameters lead to a reduction in the algorithm's performance [11]. These factors, in conjunction with advances in hardware and convolutional neural network (CNN) architectures, have led to the rise of ML-based algorithms, and specifically CNN-based algorithms [15]. In the following section we review state-of-the-art ML-based models used for RIQA.

In 2019, Fu et al. [16] used the novel MCF-Net architecture and the EyeQ database, a subset of EyePACS containing 28,792 DFIs, with a train/test split of 43.5/56.5. They experimented with three different color spaces for the purpose of quality assessment: RGB, HSV and Lab. MCF-Net is composed of three parallel CNNs, each of which analyzes the DFI in one color space; the network then fuses the features from the color spaces together and returns the result. The authors used three classes, "Good", "Usable" and "Reject", and asked two experts to grade the quality of the DFIs. They then tested their network against several architectures that accept only a single color space, achieving an accuracy of 91.8%.

In 2020, Zapata et al. [17] used a novel CNN architecture called CNN-1 and the Optretina database, which contains 306,302 DFIs, 150,075 of which were labeled for quality using the labels "Good" and "Bad".
They reported an AUC of 0.947 and an accuracy of 91.8% using 10-fold cross-validation.

In 2021, Karlsson et al. [18] proposed a novel continuous quality scale. The quality score lies on a scale of 0.0 to 1.0 and takes two features into consideration: focus and contrast. They first extracted the relevant features using various filters, and then used a random forest regression algorithm to estimate the score for each feature. Their database consisted of 787 retinal oximetry images (an imaging modality closely related to fundus photography) and 253 DFIs. After choosing a threshold of 0.625, they measured their results on the binary-labeled DRIMDB database and achieved an accuracy of 98.1%.

To our knowledge, two open-access databases exist for evaluating fundus quality assessment algorithms: DRIMDB and EyeQ [19], [16]. DRIMDB consists of 216 DFIs, which is insufficient for DL, and EyeQ was originally created as a diabetic retinopathy database and uses a ternary labeling system: "Good", "Usable" and "Bad".

Many works in the field rely on binary classification into "Good" and "Bad" quality [11], [15]. This is problematic for several reasons. First, the definition of "good" and "bad" is highly subjective and depends heavily on the pathology of interest and on the ophthalmologist defining the scale. Second, it ignores the fact that the problem at hand is in essence a regression problem, since quality lies on a spectrum; binary labels can lead to images of vastly different quality receiving the same label, making the classification prone to error when dealing with borderline-quality images. In addition, a numerical assessment of image quality could serve several other purposes, including confidence estimation for a downstream diagnosis classification task. Even when a quality scale is used, as in the case of Karlsson et al. [18], it takes into consideration only the focus and contrast of the image, while ignoring other aspects that are important for clinical usage, such as the visibility of intraocular structures like the macula and the optic disc. Finally, it is difficult to benchmark the performance of algorithms in the field: most authors report results on their own private test sets, labeled using their own uniquely defined standards. When researchers do use external validation, DRIMDB is usually chosen, owing to its binary labeling.

In this work our goal is to create a generic and flexible data-driven fundus quality assessment algorithm. The quality scale used for this work is made available to other researchers at the URL (will be available at publication).

In our work, we employed an objective approach built upon subjective principles. Using the open fundus photograph databases Drishti-GS, ORIGA and REFUGE, a total of 28 DFIs were selected to span a scale from 1 to 10, with 1 being the lowest possible quality score and 10 the highest; increments of 0.5 were considered. The reference set was constructed by a glaucoma ophthalmologist with 30 years of experience (EZB) and remained open for viewing and comparison during the subsequent scoring of individual DFIs.

Two ophthalmologists, the same glaucoma specialist (EZB) and a senior resident (HP), provided independent quality annotations for an image set of 1,245 DFIs. Each DFI was independently scored on a scale of 1-10 by each of the two ophthalmologists. When grading, the ophthalmologists considered resolution, focus, contrast, brightness, artifacts, and the ability to detect fine details, with the optic disc and the peripapillary retina serving as the primary reference for scoring. If a DFI did not include the optic disc, it was discarded. All discrepancies were discussed, and a final score was agreed upon by consensus. An example of the quality scale can be seen in Fig. 1, and a breakdown of the distribution of the scores in Fig. 2.
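As an illustration of this annotation protocol, the short Python sketch below snaps raw grades to the 0.5-increment scale and flags DFIs on which the two graders disagree for consensus discussion. The function names and the flagging rule are our own illustrative assumptions, not the authors' annotation tooling.

```python
def snap_to_scale(score: float) -> float:
    """Round a raw grade to the nearest 0.5 increment within [1, 10]."""
    return min(10.0, max(1.0, round(score * 2) / 2))

def flag_for_consensus(grades_a, grades_b):
    """Return (index, grade_a, grade_b) for every DFI the graders disagree on."""
    flagged = []
    for i, (a, b) in enumerate(zip(grades_a, grades_b)):
        a, b = snap_to_scale(a), snap_to_scale(b)
        if a != b:  # any discrepancy is discussed until consensus is reached
            flagged.append((i, a, b))
    return flagged

# Example: the second DFI (index 1) would be discussed by the two graders.
print(flag_for_consensus([2.0, 5.5, 9.5], [2.0, 6.0, 9.5]))  # [(1, 5.5, 6.0)]
```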
This quality scale provides five major benefits. First, it describes the quality of the image at a higher resolution than other works. Second, since it was established by ophthalmologists, it is more easily interpretable by them, making it better suited for clinical practice. Third, keeping the reference scale open for scrutiny while grading each image greatly assists in scoring and prevents a "drift" in the scores caused by the overall quality of the specific database being analyzed. Fourth, it accommodates the fact that different CNNs have different quality requirements: the quality score gives more flexibility when deciding on the threshold for discarding DFIs, or when deciding its influence on the diagnosis part of the pipeline. Fifth, since the reference quality scale was established using public databases, other researchers can implement it in their own work and then compare their results to ours.

Table I summarizes all the databases used in this research. The distribution of the quality scores for the 1,245 DFIs can be seen in Fig. 2.

1) Quality scale definition and supervised learning: The ORIGA database contains 650 DFIs of subjects of Malay Singaporean origin, collected by the Singapore Eye Research Institute for the purpose of glaucoma analysis and research [20]. The subjects' ages range from 40 to 79 years, 25.8% of the images are of glaucomatous patients, and images of males comprise 52.0% of the total database. A total of 145 DFIs from the ORIGA database were randomly chosen for quality annotation using the new fundus quality scale.

The Drishti-GS database is a glaucoma-focused database comprising 101 DFIs collected consensually from visitors to the Aravind eye hospital in Madurai, India. The glaucomatous patients were selected by clinical investigators during examinations, and the healthy subjects were chosen from people undergoing routine refraction tests. The subjects ranged in age from 40 to 80 years, with an approximately equal male/female distribution [21]. A total of 21 DFIs from the Drishti-GS database were randomly chosen for quality annotation using the new fundus quality scale.

The LEUVEN database is a new, private database that contains 37,345 DFIs from 9,965 unique patients. There are 874 unique labels in the database, of which 61 describe variants of glaucoma. The labels include pathology diagnoses, such as glaucoma or myopia, as well as procedures, such as trabeculectomy or LASIK [22]. A total of 995 DFIs were randomly chosen.

2) Semi-supervised learning databases: The EyePACS database is a diabetic retinopathy database containing 88,702 DFIs, captured by a variety of camera models and types [16]. 59,910 DFIs from this database were used for semi-supervised learning.

3) Pre-training databases: A total of 28,792 DFIs from the EyePACS database were annotated for quality by Fu et al. [16], using the labels "Good", "Usable" and "Reject". These images are known as the EyeQ database. In our research, we used the EyeQ database for transfer learning.

4) External test set: The DRIMDB database contains 216 DFIs with three classes: "Good" (n=125), "Poor" (n=69) and "Outlier" (n=22) [19]. The "Good" and "Poor" images are images of the fundus, whereas the outliers are images of the external eye or of random objects. Because our grading scale does not translate to the outlier category, only the 194 DFIs from the "Good" and "Poor" classes were used in this work, for the purpose of evaluating the generalization of the DL model to an external database.

We propose a DL model based on the Inception-V3 architecture. The model was developed using a total of 89,947 images from 6 databases, of which 1,245 were labeled by the specialists using the new scale, and the remaining 88,702 were used for pre-training and semi-supervised learning with a pseudo-labeling approach.

1) Data preprocessing: All DFIs were cropped to remove their black borders, eliminating their influence on the neural network. The DFIs were subsequently resized to 224 x 224 pixels to match the input layer of the Inception-V3 architecture [23]. Image augmentation was avoided so that the quality of the DFIs would not be accidentally modified.
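Below is a minimal sketch of such a preprocessing step, assuming NumPy and Pillow. The intensity threshold used to locate the black border is our assumption; the paper specifies only the cropping, the 224 x 224 resize and the absence of augmentation.

```python
import numpy as np
from PIL import Image

def preprocess_dfi(path: str, border_threshold: int = 10, size: int = 224) -> Image.Image:
    """Crop the black border around the fundus and resize to the network input."""
    img = Image.open(path).convert("RGB")
    arr = np.asarray(img)
    # A pixel belongs to the fundus if any channel rises above the threshold.
    mask = arr.max(axis=2) > border_threshold
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    top, bottom = np.argmax(rows), len(rows) - np.argmax(rows[::-1])
    left, right = np.argmax(cols), len(cols) - np.argmax(cols[::-1])
    cropped = img.crop((left, top, right, bottom))
    # No augmentation is applied, so the image quality is left untouched.
    return cropped.resize((size, size), Image.BILINEAR)
```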
2) Pre-training: In this study, the Inception-V3 convolutional neural network was chosen [23]. Due to the limited number of DFIs quality-graded with the new scale, pre-training can be used to help the DL model learn from the small database. Typically, DL models used for vision tasks are pre-trained on the ImageNet database, which contains millions of images organized into over 20,000 categories [24]. However, because natural images differ from DFIs, we performed two experiments: (1) pre-training on ImageNet and (2) pre-training on the EyeQ database.

3) Transfer learning: Following the previous step, the model was modified to perform the regression task. This was done by replacing the classifier layer with a new fully-connected layer connected to a single output neuron. The model with the transferred weights was then trained and evaluated using the 1,245 DFIs labeled with the new quality scale, with a train/validation/test split of 932/104/209. The training, validation and test sets were stratified according to the quality scores, to guarantee an equal representation of each quality class.

4) Semi-supervised learning: In supervised learning, classifiers require labeled data to train. Creating these labels is often challenging, however, since it involves a time-consuming annotation process by expert human annotators. Unlike traditional supervised learning, semi-supervised learning enables the use of large amounts of unlabeled data together with small amounts of labeled data to boost performance [25]. In this work we used the pseudo-labeling approach [26]: a model is first trained on the labeled data and then used to label the unlabeled data; these new labels are called "pseudo-labels", and a different model is then trained using both the labeled and the pseudo-labeled data.

Because of the limited number of graded DFIs, we incorporated semi-supervised learning into the training process. Pseudo-labeling was used, with EyePACS as the large unlabeled database. The DFIs that also exist in EyeQ were excluded, resulting in 59,910 unlabeled DFIs. The model from the transfer learning experiment then assigned pseudo-labels to these 59,910 unlabeled images, which were added to the training data; the resulting train/validation/test split was 57,899/3,047/209. A histogram of the pseudo-labels is shown in Fig. 3.

Fig. 3. A histogram detailing the quality score distribution of the pseudo-labeled images from the EyePACS database.
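The following PyTorch sketch outlines this pseudo-labeling procedure under our own assumptions: the training loops, hyperparameters and data loader (`unlabeled_loader`) are placeholders, and the hidden width of the new fully-connected layer is illustrative, as the paper specifies only that the classifier was replaced by a fully-connected layer feeding a single output neuron.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_fundusq_regressor() -> nn.Module:
    """Inception-V3 backbone with its classifier replaced for regression."""
    net = models.inception_v3(weights=None, aux_logits=False)
    net.fc = nn.Sequential(
        nn.Linear(net.fc.in_features, 256),  # hidden width is our assumption
        nn.ReLU(),
        nn.Linear(256, 1),                   # single neuron: quality score
    )
    return net

@torch.no_grad()
def assign_pseudo_labels(model: nn.Module, unlabeled_loader):
    """Stage 2: score unlabeled DFIs with the trained regressor."""
    model.eval()
    pseudo = []
    for images in unlabeled_loader:
        scores = model(images).squeeze(1)           # predicted quality, 1-10
        pseudo.extend(zip(images, scores.tolist())) # (image, pseudo-label)
    return pseudo

# Stage 1: train build_fundusq_regressor() on the 1,245 expert-graded DFIs.
# Stage 2: pseudo-label the 59,910 unlabeled EyePACS DFIs with that model.
# Stage 3: train a new model on the labeled and pseudo-labeled data combined.
```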
After training the Inception-V3 model, the DRIMDB database was used for external validation [19]. The final DL model inferred a quality score for each DFI from DRIMDB, which was then translated into "Good" or "Poor" according to a threshold of 6.5 recommended by our consulting ophthalmologists. The average, standard deviation (STD), maximal and minimal scores were reported for each class, as well as the overall accuracy, sensitivity and specificity.

Since the network solves a regression problem rather than a classification problem, the metrics chosen were the mean absolute error (MAE), the standard deviation of the errors and the maximal error. To measure the improvement made by each step, the Wilcoxon signed-rank test was used [27].

The results of the transfer learning experiment are summarized in Table II. Model 1, which was pre-trained on the ImageNet database, achieved an MAE of 0.77±0.08, with a maximal error of 4.04 and a minimal error of <0.01. Model 2, which was pre-trained on the EyeQ database, achieved an MAE of 0.66±0.08, a maximal error of 4.12 and a minimal error of 0.01. As expected, pre-training on DFIs yields a lower MAE than pre-training on ImageNet. Model 3 (FundusQ-Net) achieved an MAE of 0.61±0.07, a maximal error of 3.7 and a minimal error of <0.01 (Table II). The pseudo-labeling method thus reduced the mean error by 7.6%, the maximal error by 10% and the STD by 14.5%. Applying the Wilcoxon signed-rank test between each pair of consecutive models yields a p-value smaller than 0.01, demonstrating that each model achieved a meaningful improvement over the previous one. The histogram of the errors for the best model, using pre-training on EyeQ and pseudo-labeling, is shown in Fig. 4.

The histograms of the quality scores for the DRIMDB database can be seen in Fig. 6, and the results of the experiment in Table III. Using a threshold of 6.5, our model achieved an accuracy of 99%, with a sensitivity of 98.4% and a specificity of 100%. This suggests that 6.5 is an appropriate threshold for this specific database, and demonstrates the flexibility of our proposed quality scale. In addition, our model outperforms previously reported results in accuracy and specificity, while being a very close second in sensitivity.

In this paper we sought to establish a new, meaningful quality scale for DFIs, and we further developed a DL algorithm for quality grading of DFIs. For that purpose we combined the state-of-the-art Inception-V3 DL architecture with field-specific pre-training and pseudo-labeling. Overall, the final model, denoted FundusQ-Net, obtained high performance, with an MAE (confidence interval) of 0.61 (0.54-0.68) on the test set and a generalization accuracy of 99% on the external DRIMDB database. Furthermore, we demonstrated that domain pre-training and pseudo-labeling significantly improved model performance, from an MAE of 0.77 to 0.61 (p<0.05; see Table II).
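As a concrete illustration of the evaluation described above, the sketch below converts continuous quality scores into the binary DRIMDB labels at the 6.5 threshold and computes the reported metrics; the paired model comparison uses `scipy.stats.wilcoxon`. Variable names and the array-based layout are our assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def regression_metrics(y_true, y_pred):
    """MAE, standard deviation of the absolute errors, and maximal error."""
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return {"MAE": err.mean(), "STD": err.std(), "max_error": err.max()}

def drimdb_metrics(scores, is_good, threshold=6.5):
    """Binarize quality scores at the threshold and compare to DRIMDB labels."""
    pred_good = np.asarray(scores, float) >= threshold
    true_good = np.asarray(is_good, bool)
    tp = np.sum(pred_good & true_good)
    tn = np.sum(~pred_good & ~true_good)
    fp = np.sum(pred_good & ~true_good)
    fn = np.sum(~pred_good & true_good)
    return {"accuracy": (tp + tn) / len(true_good),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Paired significance test between two models' absolute errors on the
# same test set, as used to compare the consecutive models:
# stat, p_value = wilcoxon(abs_errors_model_a, abs_errors_model_b)
```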
We performed an error analysis to better understand the cases in which FundusQ-Net performed poorly. Following consultation with the specialists, outliers from the test set were defined as DFIs for which the absolute difference between the reference score and the estimated score was larger than 1.5. Using this definition we analyzed 11 outliers. In 54.5% of cases (n=6) the estimated score was higher than the reference score, increasing to 80% of cases (n=4) when considering the 5 largest differences. A possible factor contributing to the error is a discrepancy between the quality of the disc and the quality of the retina: such a discrepancy was identified in 36% (n=4) of cases, with the disc area being of higher quality in 75% (n=3) of them. Examples of both cases can be seen in DFIs (a) and (b) in Fig. 7. Another contributing factor could be a very pathological eye, as seen in image (c) in Fig. 7; this made the DFI very different from most training set examples, which could have caused the error.

For the external test set, as previously stated, the chosen threshold was 6.5. There were two misclassified examples, both from the "Poor" class and misclassified as "Good". Upon review, a possible reason is that both DFIs are extremely dark and no such examples were present in the training set. Overall, augmenting the training set with such low-quality examples (highly contrasted, very pathological and dark images) would help improve FundusQ-Net's performance. Both outliers are included in Fig. 7.
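A minimal sketch of the outlier definition used in this error analysis: test-set DFIs whose absolute error exceeds 1.5 are collected for manual review, together with the direction of the error. All names are illustrative.

```python
def find_outliers(y_true, y_pred, max_delta=1.5):
    """Collect test-set DFIs whose absolute error exceeds the chosen delta."""
    outliers = []
    for i, (t, p) in enumerate(zip(y_true, y_pred)):
        if abs(t - p) > max_delta:
            outliers.append({"index": i,
                             "reference": t,
                             "estimated": p,
                             "overestimated": p > t})
    return outliers

# Example: the second DFI is an outlier whose quality was overestimated.
# find_outliers([7.0, 4.5], [7.2, 6.5]) ->
# [{'index': 1, 'reference': 4.5, 'estimated': 6.5, 'overestimated': True}]
```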
In conclusion, we presented a novel fundus image quality scale and a DL model, denoted FundusQ-Net, that can estimate fundus image quality relative to this new scale. The proposed data-driven algorithm provides a new robust tool for automated quality grading of fundus images.

REFERENCES
[1] Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis.
[2] Imaging in Diabetic Retinopathy.
[3] Fundus autofluorescence and age-related macular degeneration.
[4] A review of optical coherence tomography angiography (OCTA).
[5] Diabetic Retinopathy Diagnosis Through Computer-Aided Fundus Image Analysis: A Review.
[6] Automatic detection of age-related macular degeneration pathologies in retinal fundus images.
[7] Deep convolution neural network for accurate diagnosis of glaucoma using digital fundus images.
[8] Glaucoma in Adults-Screening, Diagnosis, and Management: A Review.
[9] The number of ophthalmologists in practice and training worldwide: a growing gap despite more than 200 000 practitioners.
[10] Virtual Ophthalmology: Telemedicine in a COVID-19 Era.
[11] Fundus image quality assessment: survey, challenges, and future scope.
[12] Image quality characteristics of a novel colour scanning digital ophthalmoscope (SDO) compared with fundus photography.
[13] Development and Validation of a Deep Learning System to Detect Glaucomatous Optic Neuropathy Using Fundus Photographs.
[14] Efficacy of a Deep Learning System for Detecting Glaucomatous Optic Neuropathy Based on Color Fundus Photographs.
[15] Deep Learning for Retinal Image Quality Assessment of Optic Nerve Head Disorders.
[16] Evaluation of Retinal Image Quality Assessment Networks in Different Color-spaces.
[17] Artificial Intelligence to Identify Retinal Fundus Images, Quality Validation, Laterality Evaluation, Macular Degeneration, and Suspected Glaucoma.
[18] Automatic fundus image quality assessment on a continuous scale.
[19] Identification of suitable fundus images using automated quality assessment methods.
[20] ORIGA-light: An online retinal fundus image database for glaucoma analysis and research. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC'10.
[21] Drishti-GS: Retinal image dataset for optic nerve head (ONH) segmentation.
[22] Accurate prediction of glaucoma from colour fundus images with a convolutional neural network that relies on active and transfer learning.
[23] Rethinking the Inception Architecture for Computer Vision.
[24] ImageNet Large Scale Visual Recognition Challenge.
[25] Semi-Supervised Learning Literature Survey.
[26] Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks.
[27] Wilcoxon-Signed-Rank Test.
[28] Automated Quality Assessment of Fundus Images via Analysis of Illumination, Naturalness and Structure.
[29] Quality and content analysis of fundus images using deep learning.

ACKNOWLEDGMENT
We thank Mr. Joshua Melamed for providing useful feedback and comments on the manuscript.