key: cord-0195860-gqiapgiv authors: Dlotko, Pawel; Rudkin, Simon title: Visualising the Evolution of English Covid-19 Cases with Topological Data Analysis Ball Mapper date: 2020-04-07 journal: nan DOI: nan sha: 4d19ca3b426c1864ac9d5bc719e47977c352e981 doc_id: 195860 cord_uid: gqiapgiv Understanding disease spread through data visualisation has concentrated on trends and maps. Whilst these are helpful, they neglect important multi-dimensional interactions between characteristics of communities. Using the Topological Data Analysis Ball Mapper algorithm we construct an abstract representation of NUTS3 level economic data, overlaying onto it the confirmed cases of Covid-19 in England. In so doing we may understand how the disease spreads on different socio-economical dimensions. It is observed that some areas of the characteristic space have quickly raced to the highest levels of infection, while others close by in the characteristic space, do not show large infection growth. Likewise, we see patterns emerging in very different areas that command more monitoring. A strong contribution for Topological Data Analysis, and the Ball Mapper algorithm especially, in comprehending dynamic epidemic data is signposted. As Covid-19 spreads through the global community, the bridge between economics and data science offers much to unlocking the information within the official statistics. Where summary statistics speak of trends, and maps of cases help get a visual handle on the spatial scale, Topological Data Analysis (TDA) after Carlsson (2009) and particularly the Ball Mapper (BM) algorithm of D lotko (2019) can quickly highlight patterns within the characteristics of communities for policy to attend to. This short note, firstly, contributes a first look at how BM produces an abstract two dimensional representation of NUTS3 data and how, by doing so, we can see where in the characteristic space cases are particularly fast rising in terms of number of infections. Whilst regional characteristics are comparatively static, the changing nature of cases facilitates the envisioning of changing outcomes on the point cloud. A second contribution is thus to show how the relative levels of Covid-19 are changing within the space. Moreover we bring to the empirical economics literature an approach that recognises the outcomes we study do not manifest linearly through space as our models may present, nor are they organised neatly along axes. Interactions matter, location in the space is important, and visualising the true picture from our data is the only way to see and quantify what is truly there. Intuition for the BM plot is as straightforward as the wealth of information we would gain from seeing a two-dimensional scatter plot of two characteristics. The BM technique allows us to visualize multiple dimensional data sets represented in the form of point clouds. Having this helps to understand where the considered activity, in this case the infection spread, is most prevalent. As an observation has a coordinate on a scatter plot so it is a point in this multi-dimensional cloud. Where regions in that cloud have high incidences of Covid-19 so attention must focus on the combination of each characteristic that is associated therewith; just as we would look at whether high incidences were top left, top right, etc. in a scatter plot. The dynamics are thus quickly viewed and trends therein exposited. Adda (2016) speaks much to the present situation with Covid-19, placing the spread of epidemics firmly within a world of increased transportation and trade. The high frequency use of influenza data in Adda (2016) aids economists in understanding how economic conditions link to disease spread, engaging a discussion with epidemiology. This contribution to the literature owes much to the pioneering work in combining high, and low, frequency data to marry annually updated regional characteristics with the daily information on Covid-19 cases. Obtaining data at a sufficiently micro level for study in the UK presents its own challenges, met in this paper by the use of NUTS3 level data 1 . A strong focus in the discussion of economics and disease spread is placed on employment, following works on influenza by Markowitz et al. (2019) and Eichenbaum et al. (2020) . Likewise there are considerations of income, working patterns and population age that can help understand spread (Böckerman et al., 2009; Barnay, 2016) . Population density is a natural key to contamination rates also and must be included (Jones et al., 2008) . Familiarity with what propagates the spread of common flu viruses gives strong guidance to what may then drive the spread of Covid-19. This paper maps the characteristics space for UK NUTS3 regions to the extent of available data. Early work to link Covid-19 to economic conditions in China has sought data science to explain which factors are important to spread (Qiu et al., 2020) , but such Machine Learning approaches are subject to overfitting criticisms. Unlike the "black box" nature of non-linear and deep-learner models, the BM approach here is simply a representation of the data that the user interprets. There is also much to remind of other visualisation methods, such as t-SNE (Maaten and Hinton, 2008) as endorsed recently in medical analysis by Linderman et al. (2019) , and the original TDA mapper algorithm of Singh et al. (2007) . Unlike these approaches BM does not require the construction of a full distance matrix between points as t-SNE does, and is more stable than the original mapper as BM has just one parameter and no functional choices. Consequently the insight gained into economics and the spread of disease from this paper have full transparency. In what follows we take the learning from past studies of disease spread, the methodological advancements of TDA and BM and explore what we may learn from the evolving English picture. BM is a means of representing complex and high dimensional data sets as abstract graphs. After covering them with a collection of balls, it creates a visualisation of the considered multidimensional phenomenon as an abstract two-dimensional form. BM graphs are constructed by covering the considered data set with a collection of overlapping balls of a radius . Central points of each ball are referred to as landmarks, the number of landmarks being a function of the density of the data and the choice of . Where there are points in the intersection of two balls those balls are connected by an edge on the graph. Numbers of points covered by a ball guide its size in the graph. Consequently we see the density of data, connectivity and relationships between observations in multi-dimensional space. Overlaying the BM graph with colour facilitates the consideration of an outcome of interest, here the number of cases of Covid-19. By default the colour of a ball picks up the average outcome for the points within the ball, but if desired the colouring function may be altered by the user. As implemented in the Ball Mapper R package Dlotko (2019) a colour bar is added to the plot whereby the lowest values are red and the highest are purple. Labels on the bar direct on the specific meanings of colour for that plot. Seeing pairs of connected balls with very different colouration immediately attracts attention. Recalling that the balls link to observations in the underlying dataset we may Notes: Data on population density, median age, regional Gross Domestic Product (GDP) and regional Gross Value Added (GVA) is collected at the NUTS3 level and downloaded from Eurostat. Data on average hours worked and average annual gross income is taken from the Earnings and hours worked, place of work by local authority: ASHE Table 7 survey produced by the Office for National Statistics (ONS) in the UK. Case data is released by the UK Government via www.gov.uk at the local authority level. Merging from local authority to NUTS3 is performed using the look up queries offered by the ONS and supplemented with manual matching. Where one local authority covers multiple NUTS3 codes cases, GDP and GVA are spread evenly amongst those codes. (n = 134) quickly merge back to the data set using the point ID to link that data with the ball labels. Once this is done it is possible to acquire a more detailed picture of what has been found, including non-continuous variables that had been necessarily kept out of the cloud formation. In this case the region name is a relevant example of something it is useful to match. Grouping by ball offers summary statistics and a chance to measure differentials in the euclidean space. Data on NUTS3 regions is taken from Eurostat, whilst data on earnings in the UK is collected from the Earnings and hours worked, place of work by local authority: ASHE Table 7 survey produced by the Office for National Statistics (ONS) in the UK. Cases of Covid-19 by local authority are reported by the UK government and are as collated for researchers on the Github site of Tom White 2 . Matching from local authority to NUTS3 is performed using the look up queries from the Office for National Statistics, supplemented by some manual work on cases where no match can be found. Because of the nature of reporting, the data used here only considers England. Summary statistics for the variables used are provided in Table 1 , with the six characteristics of the point cloud showing very different scales. To overcome the challenge of having one dominant axis, the axes of the dataset are normalised into the interval [0, 1] so that the maximal value of each column gets mapped to 1, the minimal value gets mapped to 0, and all the intermediate values are mapped inbetween. In this paper raw values of Gross Domestic Product (GDP) and Gross Value Added (GVA) are used as our aim is to distinguish tangibly between regions. Employing the logscaling would create greater spread lower down the distribution, meaning more balls at the lower income level, but would equally connect very different high income areas at the top of the scale. Average hours worked do not vary greatly but we do see a large regional income disparity; such is well studied for the United Kingdom. Because this is economic data, appeal is made to the geographic imbalance of the United Kingdom when considering spatial extents; weighting of GDP and wages towards London means they serve to capture much of the geography of the spread. Likewise on a micro scale the movement away from regional economic centres like Manchester, Bristol or Newcastle conveys a similar spatial extent. Covid-19 case data is updated daily and may be neatly summarised in Table 2 . The rise in cases is clear, but equally the growth of the standard deviation reminds that the disparities across regions are large. These numbers are the total number of cases reported up to, and including, that date. At this time the trend continues upwards and the need to understand more of the spatial economics behind these rises is clear. In the first instance we are concerned with the evolving picture of cases in England. The BM graph is constructed using the six characteristics of the NUTS3 regions shown in Table 1 , recalling that these are normalised onto the interval [0, 1] owing to the varying scales of the different regional economic measures. In choosing the radius for the balls we considered carefully the trade-off between noise in the colouration and being able to chart neatly the dynamics of the cases. Colouration of the balls is a three step process. Firstly the average number of cases on a given day within the ball is calculated. This number is then divided by the sum of cases for all balls at that day. That gives a proportion of the cases within the considered ball to all cases at the given day. Note that this is done in this way because points appear in multiple balls where they sit in the intersection. The resulting variable is then passed back to the BM graph which is subsequently plotted. Altering the colouration in this way reminds that the total number of cases is increasing every day and ensures the balls only show the relative position. The alternative of calculating the new cases is liable to be volatile, high infection areas may report low cases on one day and appear lower risk for example. Hence using our adjustment allows the continued charting of dynamics. Figure 1 shows four plots which chart the changing picture of Covid-19 cases in England. These BM graphs are drawn with six axes and demonstrate a broad connectivity between most regions of England. However there are a series of balls which do not connect at this value of . These include those areas with particularly low population density, coloured red as they have a low number of cases, and those with particularly high earnings such as the combined region of Hammersmith, Fulham, Kensington and Chelsea, coloured purple. Indeed Ball 20, which represents this NUTS3 area is the highest in three of the four plots. Panels (b) to (d) show a broad consistency in the relative proportion of cases in each ball. By the time we reach 3rd April, panel (a) shows the dominance of ball 20 starts to break, with more blues emerging and a general move into the orange across the plot. What we see in panel (d) is a much more even distribution of the cases of the space. Since information about the NUTS3 areas within each ball is known we can quickly check the implications of the transmission. Whilst these new hotspots have similar population densities, median ages, GDP, GVA and hours worked there is notable differential in the average pay. It is the large differential in pay that leaves ball 20 disconnected from the space. From a socio-economic standpoint this transition into the hotspots being in areas of lower wage will be noteworthy in informing the policy response. Because the colouring of each ball shows the relative number of cases within this ball to the global number of cases in all balls for the day, there is not a noticeable change in colour across the first three panels of Figure 1 . Rather, interest instead lies in the way that different subsets of the space change their colouration. Moving from (b) to (f) we see many balls catching up with ball 20, this is particularly true of the balls to the right of the connected shape. In this setting Ball 1 is interesting as it remains red whilst it's neighbour, ball 24, becomes progressively bluer. In early April the brightening of the connected component into orange has coincided with the replacement of ball 20 as the highest number of cases by ball 22. Panels (e) to (g) show the gradual lightening of the London balls, especially ball 20 which is green in the most recent plot of panel (a). The ball 1 to ball 24 contrast remains stark although recent plots are showing a closing of the gap. To understand the plots further we may colour by the values of each axis instead of the number of Covid-19 cases. Figure 2 offers such a plot with one panel for each of the six axes. Panel (a) shows that it is the balls to the lower right of the main shape that have the highest population density, but the colouration of these is very mixed, the right-most set are where the numbers of cases increase notably but the left set do not see any increase at all. This suggests further confounding factors at play. Panel (b) may offer an explanation as those regions with the higher median age do correlate neatly with the ones that have the lowest reported number of Covid-19 cases. GDP and GVA, panels (c) and (d) are similar across the connected shapes as we know it is these two variables that drive the separation of the outlier balls. Likewise there is some similarity in gross pay, panel (f), though there we do witness differentiation between higher rates to the lower right and low rates at the top left. Hours worked do vary greatly with some correlation suggested there with Covid-19 cases. BM graphs offer informative overviews of data, but their value comes in identifying the driving In each case the plot shows the averaged total number of cases reported up to the given date for each ball. Ball numbers refer to those used throughout this paper and are as featured in Figure 2 . forces of those patterns. In the results we saw how there can be very different behaviour in balls that are similar enough to be connected. The most obvious case being balls 24 and 1. The former is predominantly districts of Outer London, whilst the latter is smaller cities such as Peterborough, Swindon, Hartlepool and the Thames Gateway. Connection between the two balls comes from Milton Keynes lying at the intersection. With the exception of Hartlepool, these are places which would see strong commuter links to London and yet the contrast with the balls that contain London is stark. Similar movement towards blue is seen in ball 27, which also contains London boroughs. There is a second interest at the other end of the connected shape where we see balls 9 and 14 become more yellow. This is not taking them to the same scale as the highest set, but it is evidence that there are non-uniform changes in other parts of the shape. These balls are the industrial north, including North-East Manchester, Blackburn and Darwen, Bradford and Sheffield. They are not major urban centres, and their overall population density is relatively low compared to other build up areas. Whether this transition is indicative of a growing cluster is unknown, but such changes are something which authorities may wish to monitor as more data comes in. The cities of Manchester, Liverpool, Nottingham, Birmingham and Bristol sit in ball 7, this lies between the major growth balls and balls 9 and 14. A suggestion is made that these serve as a conduit to their satellites, but ball 8 that has characteristics between the two, does not show such change in early April. As time has gone on we see distinct increases in the relative proportions through ball 8 into balls 9 and 14. The latter remain ahead of their neighbours. Our dynamic plots help greatly in charting change. To chart the evolution of these stories Figure Panel (b) reflects on the two balls to the top left of the plot where the proportion of cases rose quickest, balls 9 and 14. These link through the characteristic space to London via balls 7 and 8, the former including Manchester, Leeds and other cities that would attract commuters from the areas of 9 and 14. Region 7 contains suburbs of Manchester and serves as a link to 9 and 14. Of all the regions plotted it is ball 7 that consistently has the lowest number of cases. As the Covid-19 pandemic grows, so these evolutions will continue to interest. None of our plots suggest any flattening of the new cases curve anywhere except ball 20. Indeed the suggestion that other balls are starting on their own curves will be of concern to those watching the spread of Covid-19. Exploring these developments across the characteristic space will be a rich seem of work as the data develops. Using a highly aggregated dataset of NUTS3 level economic indicators, this paper has charted the spread of the Covid-19 virus across England. The BM algorithm has been employed to represent the economic characteristic space. We have done so across six key axes, giving a mapping of the importance of interactions of these characteristics in determining outbreak extents. Developments in Outer London, and particularly the contrast with relatively similar areas which serve London as commuter towns, will be of interest in modelling the spread. Likewise, the growing intensity seen in the North of England, in the industrial heartlands of Lancashire and Yorkshire, will also be something for further thought for UK policy-makers. Development of the dataset, fine graining the information to better represent target populations and understanding more economic variables, will all help enhance the modelling. A further omission is the spatial dimension, there is no measure here to capture the physical geography; rather we simply appeal to the idea that many of our measures are driven by the distance to London because of the economic imbalance of the United Kingdom. Notwithstanding these concerns, the goal of this paper has been a demonstration of what BM can do and inspire the growth of its use. Our dynamic evaluation is constantly updating, and as more is learned so the narrative takes shape. Extending to other countries where localised reporting of cases is available offers a rich seam for impact generating exploration. Important steps are taken in both opening new perspectives to statistical studies in economics, but also in appreciating the spread of disease across socio-economic space. Economic activity and the spread of viral diseases: Evidence from high frequency data Health, work and working conditions: a review of the european economic literature Economic inequality and population health: looking beyond aggregate indicators Topology and data Ball mapper: a shape summary for topological data analysis BallMapper: Create a Ball Mapper graph of the input data The macroeconomics of epidemics Global trends in emerging infectious diseases Fast interpolation-based t-sne for improved visualization of single-cell rna-seq data Visualizing data using t-sne The effects of employment on influenza rates Impacts of social and economic factors on the transmission of coronavirus disease 2019 (covid-19) in china Topological methods for the analysis of high dimensional data sets and 3d object recognition