key: cord-0744829-k0dyizfc
authors: Kepner, Jeremy; Jones, Michael; Andersen, Daniel; Buluc, Aydin; Byun, Chansup; Claffy, K; Davis, Timothy; Arcand, William; Bernays, Jonathan; Bestor, David; Bergeron, William; Gadepally, Vijay; Houle, Micheal; Hubbell, Matthew; Klein, Anna; Meiners, Chad; Milechin, Lauren; Mullen, Julie; Pisharody, Sandeep; Prout, Andrew; Reuther, Albert; Rosa, Antonio; Samsi, Siddharth; Stetson, Doug; Tse, Adam; Yee, Charles; Michaleas, Peter
title: Spatial Temporal Analysis of 40,000,000,000,000 Internet Darkspace Packets
date: 2021-08-15
journal: IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/hpec49654.2021.9622790
sha: 23a8dac66ed5d5272545d6a4f1a1b05532e15cc0
doc_id: 744829
cord_uid: k0dyizfc

The Internet has never been more important to our society, and understanding the behavior of the Internet is essential. The Center for Applied Internet Data Analysis (CAIDA) Telescope observes a continuous stream of packets from an unsolicited darkspace representing 1/256 of the Internet. During 2019 and 2020 over 40,000,000,000,000 unique packets were collected, representing the largest ever assembled public corpus of Internet traffic. Using the combined resources of the Supercomputing Centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT, the spatial temporal structure of anonymized source-destination pairs from the CAIDA Telescope data has been analyzed with GraphBLAS hierarchical hypersparse matrices. These analyses provide unique insight into this unsolicited Internet darkspace traffic with the discovery of many previously unseen scaling relations. The data show a significant sustained increase in unsolicited traffic corresponding to the start of the COVID19 pandemic, but relatively little change in the underlying scaling relations associated with unique sources, source fan-outs, unique links, destination fan-ins, and unique destinations. This work provides a demonstration of the practical feasibility and benefit of the safe collection and analysis of significant quantities of anonymized Internet traffic.

For over five billion people the Internet has become as essential as land, sea, air, and space for enabling activities as diverse as commerce, education, health, and entertainment [1]. Understanding the Internet is likewise as important as studying these other domains [2]. Developing scientific insights on how the Internet behaves requires observation and data [3]-[6]. The largest public Internet observatory is the Center for Applied Internet Data Analysis (CAIDA) Telescope, which operates a variety of sensors, including one that collects a continuous stream of packets from an unsolicited darkspace representing 1/256 of the Internet. This network telescope monitors an Internet darkspace (also referred to as a black hole, Internet sink, or darknet): a globally routed /8 network that carries almost no legitimate traffic because there are few allocated addresses in this Internet prefix. After the small amount of legitimate traffic is discarded from the incoming packets, the remaining data represent a continuous view of anomalous unsolicited traffic, or Internet background radiation. Almost every computer on the Internet will receive some form of this background traffic.
This unsolicited traffic results from a wide range of events, such as backscatter from randomly spoofed source denial-of-service attacks, the automated spread of Internet worms and viruses, scanning of address space by attackers or malware looking for vulnerable targets, and various misconfigurations (e.g., mistyping an IP address). In recent years, traffic destined to darkspace has evolved to include longer-duration, low-intensity events intended to establish and maintain botnets. CAIDA personnel maintain and expand the telescope instrumentation; collect, curate, archive, and analyze the data; and enable data access for vetted researchers.

During 2019 and 2020 over 40,000,000,000,000 unique packets were collected by the CAIDA Telescope. This data set represents the largest ever assembled public corpus of Internet traffic, and is perhaps the largest public collection of streaming network events of any type. Figure 1 shows the number of packets in each month and indicates a significant increase aligned with the COVID19 pandemic. Analysis of such a large network data set is computationally challenging [7]-[9]. Using the combined resources of the Supercomputing Centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT, the spatial temporal structure of anonymized source-destination pairs from the CAIDA Telescope data has been analyzed, leveraging prior work on massively parallel GraphBLAS hierarchical hypersparse matrices [10]-[16]. Applying this type of analysis to other collections of billions of network packets has revealed power-law distributions [17], novel scaling relations [18], and inspired new models of network traffic [19].

The goal of this paper is to understand the scaling relations revealed by the CAIDA Telescope data set. These scaling relations can provide fundamental insights into Internet background traffic. This work can also provide a demonstration of the practical feasibility and benefit of the safe collection and analysis of significant quantities of anonymized Internet traffic. The outline of the rest of the paper is as follows. First, the network quantities and their distributions are defined in terms of traffic matrices. Second, multi-temporal analysis of hypersparse hierarchical traffic matrices is described. Third, the method for determining scaling relations as a function of the packet window $N_V$ is presented, along with the resulting scaling relations observed in the telescope traffic data. Finally, our conclusions and directions for further work are presented.

The CAIDA Telescope monitors the traffic into and out of a set of network addresses, providing a natural observation point for network traffic. These data can be viewed as a traffic matrix where each row is a source and each column is a destination. The CAIDA Telescope traffic matrix can be partitioned into four quadrants (see Figure 2). These quadrants represent different flows between nodes internal and external to the set of monitored addresses. Because the CAIDA Telescope network addresses are a darkspace, only the upper left (external → internal) quadrant will have data; a minimal selection sketch follows the list below. Internet data must be handled with care, and CAIDA has pioneered standard trusted data sharing best practices that include [2]

• Data is made available in curated repositories
• Using standard anonymization methods where needed: hashing, sampling, and/or simulation
• Registration with a repository and demonstration of legitimate research need
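To make the quadrant selection concrete, here is a minimal Octave/MATLAB sketch (an assumed illustration, not the authors' pipeline) that keeps only the external → internal packets for a hypothetical set of monitored darkspace addresses; all identifiers are illustrative.

```matlab
% Minimal sketch (assumed, not the authors' code) of selecting the
% external -> internal quadrant of the traffic.  All identifiers below are
% hypothetical stand-ins for anonymized addresses.

src = [10; 99; 42; 10; 77];        % packet sources
dst = [ 7;  7;  3; 99;  3];        % packet destinations
monitored = [3; 7];                % addresses inside the darkspace (internal)

internalDst = ismember(dst, monitored);   % packets arriving at the darkspace
externalSrc = ~ismember(src, monitored);  % ...that were sent from outside it

keep = externalSrc & internalDst;         % external -> internal quadrant only
src  = src(keep);
dst  = dst(keep);
```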
Streams of interactions between entities are found in many domains. For Internet traffic these interactions are referred to as packets [20]. Figure 3 illustrates essential quantities found in all streaming dynamic networks. These quantities are all computable from anonymized traffic matrices created from the sources and destinations found in packet headers. These sources and destinations are referred to as Internet Protocol (IP) addresses.

Processing such a large volume of data requires additional computational innovations. The advent of GraphBLAS hypersparse hierarchical traffic matrices has enabled the processing of hundreds of billions of packets in minutes [16], [21]-[23]. The CAIDA Telescope archives its trillions of collected packets at the supercomputing center at Lawrence Berkeley National Laboratory (LBNL), where the packets are aggregated into CryptoPAN [24] anonymized GraphBLAS traffic matrices of $N_V = 2^{17}$ contiguous packets. This process compresses the data down to approximately 3 bytes/packet. The resulting matrices are stored and sent to the MIT SuperCloud where the network quantities shown in Figure 3 are computed. Using 384 64-core compute nodes (24,576 cores total) on the MIT SuperCloud, all 40,000,000,000,000 packets were processed in four days. The code was implemented using Matlab/Octave with the pMatlab parallel library [10]. A typical run could be launched in a few seconds using the MIT SuperCloud triples-mode hierarchical launching system [13]. Typical launch parameters were [384 16 4], corresponding to 384 nodes, 16 Matlab/Octave processes per node, and 4 OpenMP threads per process. On each node, each of the 16 processes was pinned to 4 adjacent cores to minimize interprocess contention and maximize cache locality for the GraphBLAS OpenMP threads [25]. Three levels of parallelism were used within the program. At the top level, each month in a year was processed independently using 384/12 = 32 compute nodes. Within each month, the packet windows were split among the 32×16 = 512 Matlab/Octave processes with some overlap to allow for contiguous processing of the streaming data. Within each Matlab/Octave process, the underlying GraphBLAS OpenMP parallelism was used on 4 cores. At the end of the processing, the results were aggregated using asynchronous file-based messaging [26].

The contiguous nature of these data allows the exploration of a wide range of packet windows from $N_V = 2^{17}$ (subsecond) to $N_V = 2^{27}$ (minutes), providing a unique view into how network quantities depend upon time. These observations provide new insights into normal network background traffic that could be used for anomaly detection, AI feature engineering, polystore index learning, and testing theoretical models of streaming networks [27]-[29].

The network quantities depicted in Figure 3 are computable from anonymized origin-destination matrices that are widely used to represent network traffic [30]-[33]. It is common to filter the packets down to a valid set for any particular analysis. Such filters may limit particular sources, destinations, protocols, and time windows. To reduce statistical fluctuations, the streaming data should be partitioned so that for any chosen time window all data sets have the same number of valid packets [15]. At a given time $t$, $N_V$ consecutive valid packets are aggregated from the traffic into a sparse matrix $A_t$, where $A_t(i,j)$ is the number of valid packets between the source $i$ and destination $j$.
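As a concrete illustration of this aggregation, the following Octave/MATLAB sketch builds $A_t$ for one packet window using a built-in sparse matrix. The packet identifiers are synthetic, and the relabeling step is exactly the extra index bookkeeping that the authors' hypersparse GraphBLAS matrices avoid.

```matlab
% Sketch (assumed, not the authors' GraphBLAS/D4M code) of aggregating
% N_V consecutive packets into a sparse traffic matrix A_t.
% src and dst are synthetic stand-ins for anonymized identifiers.

NV  = 2^17;                          % packet window size
src = randi(2^32, NV, 1);            % stand-in anonymized source addresses
dst = randi(2^32, NV, 1);            % stand-in anonymized destination addresses

% Built-in sparse matrices need compact indices, so the identifiers are
% relabeled here; a hypersparse matrix can index by the raw values directly.
[usrc, ~, i] = unique(src);          % sources mapped to rows 1..numel(usrc)
[udst, ~, j] = unique(dst);          % destinations mapped to cols 1..numel(udst)

% A_t(i,j) = number of packets from source i to destination j;
% sparse() sums duplicate (i,j) pairs, which performs the aggregation.
At = sparse(i, j, 1, numel(usrc), numel(udst));

assert(full(sum(sum(At))) == NV);    % total packet count equals the window size
```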
The sum of all the entries in $A_t$ is equal to the number of valid packets in the window:
$$\sum_{i,j} A_t(i,j) = N_V$$
All the network quantities depicted in Figure 3 can be readily computed from $A_t$ using the formulas listed in Table I. Because matrix operations are generally invariant to permutation (reordering of the rows and columns), these quantities can readily be computed from anonymized data.

TABLE I: Formulas for computing network quantities from the traffic matrix $A_t$ at time $t$ in both summation and matrix notation. $\mathbf{1}$ is a column vector of all 1's, $^{\mathsf{T}}$ is the transpose operation, and $|\,|_0$ is the zero-norm that sets each nonzero value of its argument to 1 [34]. These formulas are unaffected by matrix permutations and will work on anonymized data.

Each network quantity will produce a distribution of values whose magnitude is often called the degree $d$. The corresponding histogram of the network quantity is denoted by $n_t(d)$. The largest observed value in the distribution is denoted $d_{\max}$. The normalization factor of the distribution is given by
$$n_t = \sum_{d=1}^{d_{\max}} n_t(d)$$
with corresponding probability
$$p_t(d) = n_t(d)/n_t$$
and cumulative probability
$$P_t(d) = \sum_{d'=1}^{d} p_t(d')$$
Because of the relatively large values of $d$ observed, the measured probability at large $d$ often exhibits large fluctuations. However, the cumulative probability lacks sufficient detail to see variations around specific values of $d$, so it is typical to pool the differential cumulative probability with logarithmic bins in $d$, where $d_i = 2^i$ [35]:
$$D_t(d_i) = P_t(d_i) - P_t(d_{i-1})$$
All computed probability distributions use the same binary logarithmic binning to allow for consistent statistical comparison across data sets [35], [36]. The corresponding mean and standard deviation of $D_t(d_i)$ over many different consecutive values of $t$ for a given data set are denoted $D(d_i)$ and $\sigma(d_i)$.

Figure 4 provides an example distribution of external → internal source packets with packet windows of $N_V = 2^{17}$ and $N_V = 2^{27}$ at two different times. The resulting distribution exhibits the power-law behavior frequently observed in network data [37]-[43].

Fig. 4: Example differential cumulative probability (normalized histogram) for two packet windows ($N_V = 2^{17}$ and $N_V = 2^{27}$) of the number (degree) of source packets from each source using logarithmic bins $d_i = 2^i$. Sources sending out a single packet are denoted "leaf nodes". The source with the largest number of packets $d_{\max}$ is referred to as the "supernode".

Network traffic is dynamic and exhibits varying behavior on a wide range of time scales. A given packet window size $N_V$ will be sensitive to phenomena on its corresponding timescale. Determining how network quantities scale with $N_V$ provides insight into the temporal behavior of network traffic. Efficient computation of network quantities on multiple time scales can be achieved by hierarchically aggregating data in different time windows [15]. Figure 5 illustrates a binary aggregation of different streaming traffic matrices. Computing each quantity at each hierarchy level eliminates redundant computations that would be performed if each packet window was computed separately. Hierarchy also ensures that most computations are performed on smaller matrices residing in faster memory. Correlations among the matrices mean that adding two matrices, each with $N_V$ entries, results in a matrix with fewer than $2N_V$ entries, reducing the relative number of operations as the matrices grow. One of the important capabilities of the SuiteSparse GraphBLAS library is support for hypersparse matrices, where the number of nonzero entries is significantly less than either dimension of the matrix.
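To make the Table I formulas and the distribution definitions concrete, here is a short Octave/MATLAB sketch (an assumed illustration, not the authors' implementation) that computes the network quantities from a toy $A_t$, with spones() standing in for the zero-norm $|\,|_0$, and then forms the log-binned differential cumulative probability $D_t(d_i)$ of the per-source packet counts.

```matlab
% Sketch (assumed, not the authors' implementation) of the Table I network
% quantities and the log-binned differential cumulative probability D_t(d_i),
% computed here from a tiny example traffic matrix.

At = sparse([1 1 2 3 3 3], [2 2 1 1 2 3], 1);  % toy A_t; duplicates are summed
Z  = spones(At);                   % zero-norm |At|_0: 1 wherever At is nonzero

validPackets  = full(sum(At(:)));       % sum_ij At(i,j)         (= 1'*At*1)
uniqueLinks   = nnz(At);                % sum_ij |At(i,j)|_0     (= 1'*|At|_0*1)
uniqueSources = nnz(sum(At, 2));        % sum_i |sum_j At(i,j)|_0
uniqueDests   = nnz(sum(At, 1));        % sum_j |sum_i At(i,j)|_0
maxLink       = full(max(At(:)));       % packets on the most active link
maxSrcPackets = full(max(sum(At, 2)));  % packets from the most active source
maxSrcFanout  = full(max(sum(Z, 2)));   % largest source fan-out (unique destinations)
maxDstFanin   = full(max(sum(Z, 1)));   % largest destination fan-in (unique sources)

% Log-binned differential cumulative probability of a degree list, here the
% per-source packet counts (the kind of distribution shown in Fig. 4).
d  = nonzeros(sum(At, 2));              % degree d of each source
nt = histc(d, 1:max(d));                % histogram n_t(d)
pt = nt / sum(nt);                      % probability p_t(d)
Pt = cumsum(pt);                        % cumulative probability P_t(d)
di = 2.^(0:ceil(log2(max(d))));         % logarithmic bin edges d_i = 2^i
PtBins = Pt(min(di, max(d)));           % P_t evaluated at the bin edges
Dt = diff([0; PtBins(:)]);              % D_t(d_i) = P_t(d_i) - P_t(d_{i-1})
```

Because each level of the binary aggregation hierarchy is itself just a traffic matrix (the sum of two adjacent window matrices), the same computations can be reused unchanged at every level of Figure 5.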
If the packet source and destination identifiers are drawn from a large numeric range, such as those used in the Internet protocol, then a hypersparse representation of $A_t$ eliminates the need to keep track of additional indices and can significantly accelerate the computations [16].

Over 40,000,000,000,000 CAIDA Telescope anonymized packet headers from 2019 and 2020 have been collected and analyzed. The network quantities in Table I have been computed for packet windows ranging from $N_V = 2^{17}$ to $N_V = 2^{27}$. The averages and standard deviations of these quantities have been computed over sets of $2^{27}$ packets. A key property is how the various network quantities scale as a function of the window size $N_V$. Figure 6 (top) shows the average total number of unique sources as a fraction of the packet window $N_V$ measured for the months of 2019-06 and 2020-06. Figure 6 (bottom) is the result of scaling the data using the formula $\beta N_V^{\alpha}$. Performing a similar analysis across all the network quantities produced the scaling relations for each month.

Figures 7, 8, and 9 show the scaling exponents $\alpha$ for the sources, destinations, and links. In most cases, these scaling exponents are remarkably consistent and lie in the range $0.9 < \alpha < 1.0$. Three notable exceptions are the scaling of the unique sources, the unique destinations, and the destination supernode fan-in (the destination with the most unique sources). The unique sources shown in Figure 7 (top) appear to scale as $N_V^{0.5}$ in 2019, increasing to $N_V^{0.7}$ in 2020, implying that the relative diversity of observed darkspace sources grew during 2020. The unique destinations shown in Figure 8 (top) consistently scaled as $N_V^{0.8}$ throughout 2019 and 2020, indicating that although the source variety may have increased, the set of destination addresses they were reaching out to remained similar. The destination supernode fan-in scaling shows significant fluctuations throughout 2019 and 2020, spanning $0.5 < \alpha < 1.0$.

Table II summarizes the median scaling relations of both the averages and standard deviations of all the network quantities for 2019 and 2020. These results reveal a strong dependence of these quantities on the packet window size $N_V$ as well as remarkable consistency between 2019 and 2020. To our knowledge, these scaling relations have not been previously reported and represent a new view of the background behavior of network traffic. Analysis of network quantities over packet windows $N_V = 2^{17}, \ldots, 2^{27}$ reveals a strong dependence of many of these quantities on $N_V$ as well as remarkable consistency between 2019 and 2020.
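As an illustration of how scaling relations of the form $\beta N_V^{\alpha}$ can be extracted, the following Octave/MATLAB sketch fits $\alpha$ and $\beta$ by least squares in log-log space using polyfit. The window sizes match those used in the paper, but the counts are made-up values, not the CAIDA measurements.

```matlab
% Sketch (assumed, with synthetic data) of fitting the scaling relation
% count ~ beta * NV^alpha by least squares in log-log space.

NV     = 2.^(17:27);                         % packet window sizes used in the paper
counts = 3.2 * NV.^0.72 .* ...
         exp(0.05 * randn(size(NV)));        % synthetic "unique sources" counts

c     = polyfit(log2(NV), log2(counts), 1);  % straight-line fit of log2(count)
alpha = c(1);                                % slope     -> scaling exponent alpha
beta  = 2^c(2);                              % intercept -> prefactor beta

% If the power law holds, dividing by beta*NV.^alpha collapses the measurements
% onto a roughly constant value, as in the rescaled curves of Figure 6 (bottom).
rescaled = counts ./ (beta * NV.^alpha);
```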
Understanding the behavior of the Internet is essential, as the Internet has never been more important to our society. The CAIDA Telescope provides a unique perspective on the Internet by observing a continuous stream of darkspace traffic representing 1/256 of the Internet. Over 40,000,000,000,000 unique packets were collected during 2019 and 2020. This is the largest public corpus of Internet traffic ever collected. The Supercomputing Centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT have combined their resources to analyze the spatial and temporal structure of anonymized source-destination pairs, leveraging GraphBLAS hierarchical hypersparse matrices. Analysis of this unsolicited Internet darkspace traffic has revealed many previously unseen scaling relations. The data show a significant sustained increase in unsolicited traffic corresponding to the start of the COVID19 pandemic, but relatively little change in the underlying scaling relations associated with unique sources, source fan-outs, unique links, destination fan-ins, and unique destinations. This work provides a demonstration of the practical feasibility and benefit of the safe collection and analysis of significant quantities of anonymized Internet traffic.

[2] Zero botnets: An observe-pursue-counter approach
[3] Measuring the internet
[4] A survey of network flow applications
[5] Measuring the internet
[6] Workshop on internet economics (wie 2019) report
[7] Challenges in parallel graph processing
[8] Tensor decompositions and applications
[9] The world's technological capacity to store, communicate, and compute information
[10] Parallel MATLAB for Multicore and Multinode Computers
[11] Graph algorithms in the language of linear algebra
[12] Mathematics of big data: Spreadsheets, databases, matrices, and graphs
[13] Interactive supercomputing on 40,000 cores for machine learning and data analysis
[14] Hyperscaling internet graph analysis with d4m on the mit supercloud
[15] Streaming 1.9 billion hypersparse network updates per second with d4m
[16] 75,000,000,000 streaming inserts/second using hierarchical hypersparse graphblas matrices
[17] Hypersparse neural network analysis of large-scale internet traffic
[18] Multi-temporal analysis and scaling relations of 100,000,000,000 network packets
[19] Hybrid power-law models of network traffic
[20] Software-Defined networking and security: from theory to practice
[21] Mathematical foundations of the graphblas
[22] Design of the graphblas api for c
[23] Algorithm 1000: Suitesparse:graphblas: Graph algorithms in the language of sparse linear algebra
[24] Prefix-preserving ip address anonymization: measurement-based security evaluation and a new cryptography-based scheme
[25] Optimizing xeon phi for interactive data analysis
[26] Large scale parallelization using file-based communications
[27] A demonstration of the bigdawg polystore system
[28] The case for learned index structures
[29] Classifying anomalies for network security
[30] How to identify and estimate the largest traffic matrix elements in a dynamic environment
[31] Estimating point-to-point and point-to-multipoint traffic matrices: an information-theoretic approach
[32] Community structure in time-dependent, multiscale, and multiplex networks
[33] Internet traffic matrices: A primer
[34] Measuring sparseness of noisy signals
[35] Power-law distributions in empirical data
[36] Network science
[37] On the self-similar nature of ethernet traffic (extended version)
[38] On power-law relationships of the internet topology
[39] Internet: Diameter of the world-wide web
[40] Emergence of scaling in random networks
[41] Power-law distribution of the world wide web
[42] Scale-free networks: a decade and beyond
[43] A tale of the tails: Power-laws in internet measurements

The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, Ronisha Carter, Cary