key: cord-0676618-qy6l5vaq authors: Nardini, John T.; Pugh, Charles W. J.; Byrne, Helen M. title: Statistical and Topological Summaries Aid Disease Detection for Segmented Retinal Vascular Images date: 2022-02-20 journal: nan DOI: nan sha: 0ae94258a77e77cc1f060ab108f5646b2da8ba80 doc_id: 676618 cord_uid: qy6l5vaq Disease complications can alter vascular network morphology and disrupt tissue functioning. Diabetic retinopathy, for example, is a complication of type 1 and 2 diabetus mellitus that can cause blindness. Microvascular diseases are assessed by visual inspection of retinal images, but this can be challenging when diseases exhibit silent symptoms or patients cannot attend in-person meetings. We examine the performance of machine learning algorithms in detecting microvascular disease when trained on either statistical or topological summaries of segmented retinal vascular images. We apply our methods to four publicly-available datasets and find that the fractal dimension performs best for high resolution images. By contrast, we find that topological descriptor vectors quantifying the number of loops in the data achieve the highest accuracy for low resolution images. Further analysis, using the topological approach, reveals that microvascular disease may alter morphology by reducing the number of loops in the retinal vasculature. Our work provides preliminary guidelines on which methods are most appropriate for assessing disease in high and low resolution images. In the longer term, these methods could be incorporated into automated disease assessment tools. Blood vessel networks deliver nutrients and remove waste to maintain tissue homeostasis [1] . Disease complications can alter vascular network morphology, which may lead to insufficient oxygen levels or waste buildup that disrupt tissue function [2, 3] . For example, type 1 and 2 diabetes mellitus (T1DM and T2DM) cause high blood sugar levels, which damage small retinal blood vessels and can lead to diabetic retinopathy (DR) [4, 5, 6] . DR is the leading cause of blindness in the United States for individuals between the ages of 20 and 64 [7, 8] . Due to the high prevalence of T1DM and T2DM worldwide, DR incidence is expected to increase over the next 20 years [9] . To prevent DR, the United States and United Kingdom recommend that patients diagnosed with T1DM or T2DM attend annual eye exams. During these exams, physicians image patients' retinas using fundus photography. The image is then visually inspected for signs of DR, which include leaking blood vessels, edema, capillary non-perfusion, and damaged nerve tissue [2, 10] . In extreme cases, more advanced imaging techniques can be used to provide a more detailed view of the retinal vasculature. While these examinations are effective in preventing DR severity, DR incidence is still projected to increase due to patients exhibiting silent DR symptoms, low exam compliance, and a lack of patient access to health care [9, 11] . There is thus a growing need for the development of widely-accessible screening tools that are capable of detecting early signs of DR as a means to reduce its economic and healthcare impact on the global population. Computational methods that quantify vascular morphology present a promising tool to aid microvascular disease detection. Many metrics summarizing static vascular features have been proposed to aid microvascular disease detection, including artery and vessel diameter ratios, tortuosity, and average daughter vessel branching angle, among others [2] . The fractal dimension (Df) descriptor is now commonly used to quantify network complexity, and Popovic et al. showed that Df increases with DR severity on the Standard Diabetic Retinopathy Database Calibration Level 1 dataset [12, 13] . The effectiveness of the Df descriptor for DR detection is limited, however, as in this study it was only applied to images of retinal vascular networks with many branches. There is no consensus about the relationship between Df and DR; other studies have reported Df increases [14, 15] and decreases [16, 17, 18] with increasing DR severity. Topological data analysis (TDA) provides an alternative computational method to aid disease detection. TDA is a recent field of mathematics that uses concepts from topology and geometry to infer the structure of data [19, 20] . Persistent homology (PH) is a commonlyused area of TDA that identifies topological features (e.g., connected components and loops) that are characteristic of a dataset [21] . As opposed to scalar-valued summary statistics, PH produces vectors that summarize features in the data across multiple scales. Additionally, previous theoretical results guarantee that PH computations are robust when data is noisy or incomplete [22] . TDA summaries have been used to detect disease from medical data, with applications in melanoma [23] , breast cancer [24] , pulmonary disease [25] , heart disease [26] , and Covid-19 [27, 28] . Several of these studies combine TDA summaries with machine learning algorithms to cluster, or classify, the data into groups that share similar topological characteristics and disease status [26, 27, 29] . Disease prediction is then performed by computing the same topological summaries on unseen data and identifying the cluster to which each dataset belongs. Thus, TDA, and in particular PH, computations represent a promising multiscale approach to describe vascular network morphology, which may aid microvascular disease detection. PH has been applied to vascular network data previously: Bendich et al. showcased how PH summaries can reveal the effects of aging on brain vascular morphology [30] , and Stolz et al. used PH to uncover effects of radiotherapy on tumor vascular structure not seen using statistical methods [31] . We recently applied PH computations and a machine learning algorithm to cluster a large collection of simulated vessel networks according to their topological features [32] . We found that simulations within the same clusters were generated from similar input model parameters. We propose that both statistical and PH computations of segmented retinal vascular network images, can be combined with machine learning algorithms to detect disease in medical images. While we focus on segmented vascular images, Garside et al. combined PH computations of colored fundus images with machine learning to detect DR [33] . Their work was applied to high resolution images, and the prediction algorithms required many input summaries to achieve high accuracy levels. We apply our methods to low and high resolution images and are able to achieve similar accuracy levels with only one statistical or topological summary. In this work, we show that both statistical and topological computations provide summaries that can aid microvascular disease detection from segmented vascular images. We apply these methods to four existing and publicly-available datasets, namely the STructured Analysis of the REtina (STARE) [34] , Digital Retinal Images for Vessel Exraction (DRIVE) [35] , Child Heart And health Study in England (CHASE) [36] , and High Resolution Fundus (HRF) [37] datasets. Each dataset provides two-dimensional binary image segmentations that were manually annotated by experts from fundus images and a disease classification for each image. We train support vector machines to predict disease classification from either statistical or topological computations and use 5-fold cross validation to estimate the accuracy of each approach. We find that these methods achieve high accuracy levels when using statistical summaries of high resolution images and topological summaries of low resolution images. This work may provide guidelines for which computational methods are most suitable for microvascular disease analysis based on the quality of data. These guidelines can be used to develop automated medical tools to aid clinical decisions, which may make healthcare more effective, accessible, and affordable. We obtained two-dimensional binary vessel segmentation images (VSIs) and their disease classifications from four publicly available datasets. We summarize the morphology of each VSI by computing standard descriptor vectors and topological descriptor vectors. These vectors are used to train a machine learning algorithm that predicts VSI disease classification. By comparing the algorithm's accuracy when trained with each input vector type, we identify descriptor vectors are most informative for disease prediction. The study pipeline is summarized in Figure 1 and consists of three steps: 1. data collection, 2. data analysis, and 3. disease prediction. Table 1 . This dataset contains fundus images from 20 patients [34] . Each image was annotated separately by two experts, and has resolution 700×605 pixels. Of the 20 patients, 10 were diagnosed as 'Normal', and 10 were diagnosed with a specific disease, such as Hollenhorst emboli and vein occlusion. For simplicity, we classify the latter patients as 'Diseased'. The STARE data are available at https://cecas.clemson.edu/ ahoover/stare/. This dataset includes fundus images from 20 patients [35] . Each image was annotated once and has resolution 565×584 pixels. Of the 20 patients, 17 were diagnosed as Normal and 3 as Diseased. The DRIVE data are available at https://drive.grand-challenge.org/. F I G U R E 1 Methods pipeline overview. 1. Data collection. Two-dimensional fundus images, their corresponding binary vessel segmentation images (VSIs), and their disease classifications were collected from four publicly available datasets. 2. Data analysis. We compute standard descriptor vectors and topological descriptor vectors for each VSI. A) Standard descriptors. We compute standard descriptor vectors by first representing each VSI as a graph. We then count the number of nodes and edges in the graph, and compute its average edge length. We also quantify vessel network complexity by computing the fractal dimension for each VSI. B) Topological data analysis (TDA). We construct flooding, Vietoris-Rips, radial inward, and radial outward filtrations for each VSI. Each filtration yields two two-dimensional persistence images (PIs) that describe the connected components and loops of the filtration. The PIs are then vectorized. 3. Disease prediction. VSI disease classification is predicted using either a standard descriptor vector or a vectorized PI from Step 2. A) Principal components analysis (PCA) is used to project vectorized PIs to their two dominant principal components. B) Support vector machines are trained to predict VSI disease classification using either dimension-reduced PIs or standard descriptor vectors. We downsampled all VSIs to have the same resolution of 523×356 pixels. We summarise VSIs by computing three graph-based and one complexity vectors and refer to them collec-tively as standard descriptor vectors. In order to compute the standard descriptor vectors, we first use the Network Extraction From Images (NEFI) software to generate a graph for each VSI [38] . As shown in Figure We compute the fractal dimension (Df) of each VSI using the box-counting method [12] . This method decomposes the VSI into boxes of pre-determined side lengths. We consider ten box side lengths, s: for the largest value of s, four boxes cover the entire VSI; for the smallest value, each box is the size of one pixel. For each value of s, boxes are assigned a value of one if they contain at least one nonzero pixel; otherwise they are assigned a value of zero. We denote by N (s) the number of boxes of size s and value one. We assume that ln (N (s)) is linearly proportional to ln (1/s), in which case the slope of the line of best fit approximates the Df of a given VSI [39] . TDA is an emerging field of research that combines concepts from algebraic topology and computational geometry to analyze the shape of high dimensional data. PH is a widely-used methodology within TDA [19, 21, 20] . In contrast to descriptors that summarize data at a sin-NEFI-generated VSI graph representation We now present an illustrative example of our PH pipeline (Figure 3 ), in which we apply the flooding filtration to the VSI from step 1B of Figure 1 . In part 1, we compute the flooding filtration by constructing a sequence of two-dimensional binary images. The first binary image identifies the vessel segments present in the VSI as white pixels, the second identifies the original vessel segments and their nearest neighbors as white pixels, the third identifies the original segments, their nearest neighbors and second nearest neighbors as white pixels, and so on. The process terminates when F I G U R E 3 Persistent homology (PH) pipeline to quantify vessel segmentation image (VSI) morphology. The flooding filtration is applied to the VSI from step 1B of Figure 1 . 1) Topological filtration. A sequence of binary images is constructed by applying a flooding step to the previous image, and starting with the initial image. Each binary image is then converted into a simplicial complex, K i for i = 1, 2, 3, 4. PH quantifies the presence of connected components (CC) and loops within the filtration K = {K 1 , K 2 , K 3 , K 4 }. 2) Barcodes and persistence diagrams. The lifetime of each topological feature type is summarized by a barcode or persistence diagram. 3) Topological descriptor vectors -Persistence images. Persistence diagrams are converted into persistence images, a form of topological descriptor vector that is suitable for machine learning tasks. all pixels are white (for the VSI in Figure 3 , this corresponds to binary image 5, not shown). We apply four flooding steps in this example, but typically more steps are needed to analyze real VSIs. The sequence of binary images is then converted into a filtration by replacing each binary image with a set of points, lines and triangles, known collectively as a simplicial complex. The i th (i = 1, 2, 3, 4) simplicial complex, K i , is generated from the i th binary image by placing a point at the centroid of each white pixel, drawing lines between any two neighboring points, and putting filled in triangles between any three neighboring points that form a right triangle (see zoomed-in inset of K 1 in Figure 3 ). The filtration is given by We use PH to quantify the lifetime, or persistence, of topological features (connected components and loops) there are three connected components and one loop; in There is currently no consensus about which topological filtrations are best suited for determining how disease alters VSI morphology and, in turn, predicting the presence of disease from VSI data. To address this question, we consider four filtrations: the Flooding filtration, the Vietoris-Rips filtration, and the radial outward and radial inward filtrations (Figure 4 ). For each filtration, we define N values of its scale parameter; from the i th (i = 1, . . . , N ) value, we build a simplicial complex, K i . A simplicial complex is a structure consisting of points, lines, triangles, and their higherorder counterparts that connect data points together. We refer to the resulting sequence of simplicial complexes, For computational reasons, we subsample 2000 representative points from the resulting point cloud data (i.e., approximately 10% of the points) using a Greedy furthest point sampling algorithm [41] . To construct the Vietoris-Rips filtration, K VR , we define N = 40 distance values 0 ≤ 1 < 2 < . . . < N and place a circle of radius i around each subsampled point. We connect, with a line, any two points whose circles overlap, and connect with a filled-in triangle any three points whose circles pairwise overlap. Each K VR i (i = 1, 2, . . . , N ) is generated by combining all such points, lines, and filled-in triangles that result from i . We choose i = (i − 1)185/39 to analyze each dataset. The radial outward and inward filtrations: We consider a radial outward filtration, K outward , and a radial inward filtration, K inward (the radial outward filtration is depicted in Figure 4 , bottom row). In both cases, for a given VSI, we first compute a graph that comprises a a set of nodes, V , and a set of edges, E (Figure 2 ). Each node is placed at locations where three vessel segments intersect and also at terminal points of vessel segments. Edges connect pairs of nodes that are connected by a vessel segment in the VSI. We use the NEFI software to generate each graph [38] . We designate the node from V with the smallest betweenness centrality measure as the graph's central node, v c (see green star in Figure 2 ). This definition for v c was chosen to identify a node located near the optic disk of each VSI. we consider N = 40 radial values 0 ≤ r 1 < r 2 < · · · < r 40 . All nodes from V located within distance r i of v c are in- We require vectors of the same length to represent each VSI for disease prediction. Therefore, we map all persistence diagrams into two-dimensional persistence images sian normal distribution, with mean (b, p) and variance σ (here, we fix σ = 1.0); Each Gaussian is discretized to produce a twodimensional image of size N P I × N P I (here, we fix N P I = 50); All Gaussian images are multiplied by their persistence, p, and then summed to produce the PI; The PI is vectorized into a N 2 P I ×1 topological descriptor vector for disease prediction. More information on PIs and their computation is available in [40] . We compute separate PIs for connected components and loops from each filtration, producing 8 total topological descriptor vectors for each VSI. Each vectorized PI is of length 2,500 because we fix N P I = 50. We use principal components analysis (PCA), to reduce each vectorized PI into a length 2 vector (See Step 3A of Figure 1 ) and facilitate data analysis. The PCA algorithm maximizes the amount of information (i.e., statistical variability) present in the reduced data. We use Sci-kit learn's decomposition package (version 0.24.2) to perform PCA. For each descriptor vector, we use a supervised learning algorithm to predict VSI disease classification. In more detail, we implement support vector machines (SVMs), with radial basis function kernels, to identify the curves or surfaces that best partition the data into distinct groups according to disease type (See Step 3B in We predicted VSI disease classifications using SVMs associated with each standard descriptor vector for the STARE, HRF, and All datasets (Table 2) We used SVMs associated with each topological descriptor vector to predict VSI disease classifications for all datasets (Table 4 ). On the STARE datasets, SVMs associated with flooding (loops) PIs perform best or secondbest and outperform the Df descriptor vector. The SVM trained on the Vietoris-Rips (loops) PIs perform best on the HRF dataset. On the All dataset, the SVM associated with the Vietoris-Rips (connected components) PI performs best, though the flooding (loops) and Vietoris-Rips (loops) PIs also perform well. Plotting the dimension-reduced flooding (loops) PIs from the STARE expert 1 dataset onto their first two A strength of TDA is that it is underpinned by mathematical theory, including stability results which guarantee that small changes to input data (e.g., the removal or addition of vessel segments) do not significantly alter the resulting output [22] . This stability result is illustrated by the good agreement between the output of the flooding filtration (loops) when applied to the two sets of annotations from the STARE dataset. In future work, we plan to implement extended persistence and multiparameter persistence to determine whether they provide more robust quantification of VSIs [44, 45] . We note that artificial neural networks are an alternative approach to automate the detection of microvascular disease [46] . Because deep neural networks perform exceptionally well on computer vision tasks, several recent studies have proposed their use to extract VSIs from fundus images and then use them to predict disease classification [47, 48, 49] . A concern about the use of neural networks, however, is that it can be challenging to interpret the outputs from these black-box models [50] . Topological and statistical analyses, on the other hand, generate metrics that are interpretable. In future work, we plan to combine topological and statistical methods with machine learning algorithms and mathematical modeling to infer the mechanisms that lead to diabetic retinopathy and other microvascular diseases of the retina [32, 51] . A limitation of this study is the small sample size. We focused on the STARE, DRIVE, CHASE, and HRF datasets due to their wide use in the scientific community and public availability. In the future, we aim to produce a fully automated pipeline that uses neural networks to extract VSIs from fundus images and then uses our methodology for their analysis and classification [52] . This will enable us to analyze larger retinal image datasets and to distinguish different disease types (e.g., Normal, diabetic retinopathy, or Glaucoma). In this work, we focused on two-dimensional VSIs, but TDA methods have been used to characterize three-dimensional vascular networks in the brain and in cancer [30, 31] . These methods could be adapted to analyze three-dimensional, optical coherence tomography angiography images [53] . Our work shows how statistical and topological meth- The methods developed in this work could also contribute to the growing use of telemedicine in making healthcare more accessible, affordable and effective. High resolution digital images of a patient's eyes, taken using smartphones, could be sent to clinics for automated analysis and disease classification, prior to review by a medical expert. In addition to delivering remote clinical services to patients, automated diagnoses prior to medical consultations, would enable doctors to devote more time to evaluating non-standard cases. The authors thank BJ Stolz for helpful discussion and commentary. The authors declare no conflicts of interest. Spatial statistical modelling of capillary non-perfusion in the retina Spatial distribution of diabetic capillary non-perfusion. Microcirculation Retinal Microvascular Abnormalities and their Relationship with Hypertension, Cardiovascular Disease, and Mortality Quantitative Retinal Venular Caliber and Risk of Cardiovascular Disease in Older Persons: The Cardiovascular Health Study The Evolving Diabetes Burden in the United States The Prevalence of and Factors Associated With Diabetic Retinopathy in the Australian Population The role of tele-ophthalmology in diabetic retinopathy screening Classification of diabetic retinopathy and diabetic macular edema IDF Diabetes Atlas: Global estimates for the prevalence of diabetes for 2015 and 2040 Fractal characterization of retinal microvascular network morphology during diabetic retinopathy progression DIARETDB1 diabetic retinopathy database and evaluation protocol Quantitative Assessment of Early Diabetic Retinopathy Using Fractal Analysis Retinal Vascular Fractal Dimension and Risk of Early Diabetic Retinopathy: A prospective study of children and adolescents with type 1 diabetes Retinal Vessel Calibers Predict Long-term Microvascular Complications in Type 1 Diabetes: The Danish Cohort of Pediatric Diabetes 1987 (DCPD1987) Retinal vascular fractals predict long-term microvascular complications in type 1 diabetes mellitus: the Danish Cohort of Pediatric Diabetes 1987 (DCPD1987) Early risk stratification in paediatric type 1 diabetes A roadmap for the computation of persistent homology The persistent topology of data Stability of Persistence Diagrams. Discrete & Computational Geometry Assessment of skin barrier function using skin images with topological data analysis. npj Systems Biology and Applications Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival Lung Topology Characteristics in patients with Chronic Obstructive Pulmonary Disease Topological Data Analysis for Classification of Heart Disease Data Fusion of Persistent Homology and Deep Learning Features for COVID-19 Detection in Chest X-Ray Images Analysis of Spatiotemporal Anomalies Using Persistent Homology: Case Studies with COVID-19 Data Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition Persistent Homology Analysis of Brain Artery Trees Multiscale Topology Characterises Dynamic Tumour Vascular Networks Topological data analysis distinguishes parameter regimes in the Anderson-Chaplain model of angiogenesis Topological data analysis of high resolution diabetic retinopathy images Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response Ridge-based vessel segmentation in color images of the retina An Ensemble Classification-Based Approach Applied to Retinal Blood Vessel Segmentation Robust Vessel Segmentation in Fundus Images NEFI: Network Extraction From Images Hausdorff Dimension, Its Properties, and Its Surprises Persistence images: a stable vector representation of persistent homology Greedy Sampling for Approximate Clustering in the Presence of Outliers Persistent homology -A survey Computational Topology: An Introduction. Providence R. I.: American Mathematical Society Topological Approximate Bayesian Computation for Parameter Inference of an Angiogenesis Model Multiparameter persistent homology landscapes identify immune cell spatial patterns in tumors Deep Learning in Medicine-Promise, Progress, and Challenges Computer-Assisted Diagnosis for Diabetic Retinopathy Based on Fundus Images Using Deep Convolutional Neural Network Effective blood vessels reconstruction methodology for early detection and classification of diabetic retinopathy using OCTA images by artificial neural network Retinal vascular junction detection and classification via deep neural networks Opportunities and obstacles for deep learning in biology and medicine Progression of Diabetic Capillary Occlusion: A Model Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation Image Projection Network: 3D to 2D Image Segmentation in OCTA Images