Title: SPPOLAP: Computing Privacy-Preserving OLAP Data Cubes Effectively and Efficiently: Algorithms, Complexity Analysis and Experimental Evaluation
Author: Alfredo Cuzzocrea
Journal: Procedia Computer Science, 2020. DOI: 10.1016/j.procs.2020.09.337

Abstract. This paper provides significant contributions to the privacy-preserving OLAP research area by extending the previous SPPOLAP results recently presented. SPPOLAP is a state-of-the-art algorithm whose main goal is to compute privacy-preserving OLAP data cubes effectively and efficiently. The main innovations introduced by SPPOLAP are the novel privacy OLAP notion and the flexible adoption of sampling-based techniques for obtaining the final privacy-preserving data cube. In line with the main SPPOLAP results, this paper significantly extends the previous research efforts by means of the following contributions: (i) the complete algorithms of the whole SPPOLAP algorithmic framework; (ii) a complexity analysis and its results; (iii) a comprehensive experimental analysis of SPPOLAP against real-life multidimensional data cubes, according to several experimental parameters. These contributions nicely complete the state-of-the-art SPPOLAP results.

Privacy-preserving big data management and analytics (e.g., [7, 8, 9, 26, 27, 28]) is a novel and rapidly emerging research area with plenty of real-life applications, spanning from sensor network analysis to smart city tools, from social network analysis to bio-informatics and bio-medical systems, from epidemic analysis and spread-prediction tools (like those related to the recent COVID-19 pandemic, e.g., [10, 11, 12]) to graph analytics systems, and so forth. In all these application scenarios, the main goal consists in providing big data management and analytics services in a privacy-preserving manner, i.e., preserving the privacy of sensitive information (e.g., [13]) while still ensuring effectiveness and, moreover, efficiency. In this context, privacy-preserving OLAP (e.g., [14, 15]) is a prominent research line, due to two main aspects: (i) OLAP tools are "premiere-class" big data analytics tools; (ii) the structural, array-based nature of OLAP data cubes lends itself "naturally" to elegant privacy-preserving solutions, as confirmed by recent, authoritative initiatives (e.g., [16, 17]).

In this so-delineated context, SPPOLAP [1] is a state-of-the-art privacy-preserving OLAP algorithm that introduces the two following main innovations: (i) a novel privacy OLAP notion; (ii) the flexible adoption of sampling-based techniques for obtaining the final privacy-preserving data cube. As regards the first innovation, SPPOLAP overcomes classical privacy-preserving approaches (e.g., [16]) that "see" OLAP data cubes as intrinsic multidimensional arrays and try to compute the respective privacy-preserving multidimensional arrays. Contrary to this, SPPOLAP introduces the nice and elegant amenity of focusing on the privacy of OLAP aggregations, thus better exploiting the semantics of the powerful OLAP data cube model [18]. As regards the second innovation, SPPOLAP effectively and efficiently makes use of sampling-based techniques as follows. First, a suitable privacy grid is computed on top of the target OLAP data cube by considering a reference query workload on the cube.
Then, an innovative privacy metrics that focuses on the privacy of OLAP aggregations is introduced, and the so-called privacy threshold is consequently defined, capturing a lower bound for the resulting inference error. The latter measures how much OLAP queries issued to the privacy-preserving data cube are "wrong", while still keeping an acceptable accuracy degree, so that they preserve the privacy of sensitive OLAP aggregations (i.e., they limit the possibility of inferring further information from the knowledge already acquired). Finally, samples are extracted from the grid regions (determined by the reference privacy grid) so as to satisfy the privacy threshold, according to a greedy strategy. By "possibly" satisfying the privacy threshold on all the grid regions, the sampling-based privacy-preserving OLAP data cube is finally obtained.

It should be noted that, as regards the emerging big data management and analytics research area, SPPOLAP can really provide relevant opportunities. In fact, while the need for powerful big data analytics tools is clear (e.g., [19, 20]), at the same time, privacy-preserving requirements are nowadays indispensable aspects to be considered in every real-life system. Smart city application scenarios represent convincing case studies on this, where the privacy preservation axiom becomes a hard obstacle to smart citizen services for e-society, e-procurement, e-government, and so forth. With respect to [1], where the main SPPOLAP results are presented, this paper significantly extends [1] by means of the following contributions: (i) the complete algorithms of the whole SPPOLAP algorithmic framework; (ii) a complexity analysis and its results; (iii) a comprehensive experimental analysis of SPPOLAP against real-life multidimensional data cubes, according to several experimental parameters.

The remaining part of this paper is organized as follows. In Section 2, we provide a summary of the main SPPOLAP approach. Section 3 provides the complexity analysis and results for the SPPOLAP algorithm. In Section 4, a comprehensive experimental analysis of SPPOLAP against real-life multidimensional data cubes, according to several experimental parameters, is presented. Finally, Section 5 contains conclusions and future work of our research. Further, Appendix A reports all the algorithms of the complete SPPOLAP algorithmic framework.

In this Section, we provide a summary of the main SPPOLAP approach [1]. Given an OLAP data cube C, the goal of the SPPOLAP algorithm is to compute a synopsis privacy-preserving data cube C'. Here, we first provide the basic definitions and metrics introduced by SPPOLAP, then we describe each step of SPPOLAP. Appendix A reports all the algorithms of the whole SPPOLAP algorithmic framework.

A data cube C is defined as a tuple C = <Δ, D, H, M>, such that: (i) Δ is the data domain of C, which contains the data cells, i.e., the elementary aggregations of C computed against the relational data source R; (ii) D is the set of dimensions of C, i.e., the functional attributes with respect to which tuples in R are aggregated; (iii) H is the set of hierarchical representations of the functional attributes in D, shaped in the form of trees and known as hierarchies; (iv) M is the set of measures of C, i.e., the attributes with respect to which the SQL aggregations stored in the data cells of C are computed. Let d ∈ D be a generic dimension of C; we define |d| as the cardinality of d and h(d) as the hierarchy of d.
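To make the data cube and range-query notions concrete, the following minimal Python sketch (ours, not part of the SPPOLAP framework; all names and sizes are illustrative) represents a two-dimensional data cube as a dense array and evaluates a range-SUM query over contiguous ranges:

```python
# Minimal sketch (not from the paper): a two-dimensional data cube stored as a
# dense array, together with a range-SUM query evaluated over contiguous ranges.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data cube C with dimension cardinalities |d_0| = 100, |d_1| = 80.
C = rng.integers(0, 50, size=(100, 80)).astype(float)

def range_sum(cube: np.ndarray, ranges: list[tuple[int, int]]) -> float:
    """SUM aggregation over the sub-domain bounded by the contiguous ranges
    I_l = [lo, hi) given for each queried dimension."""
    index = tuple(slice(lo, hi) for lo, hi in ranges)
    return float(cube[index].sum())

# A 2-dimensional range-SUM query G = <[10, 40), [5, 25), SUM>.
answer = range_sum(C, [(10, 40), (5, 25)])
print(answer)
```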
Let C be a |D|-dimensional data cube, and let m be an integer such that m ≤ |D|. We define an m-dimensional range query G against C as a tuple G = <I_{l_0}, I_{l_1}, ..., I_{l_{m-1}}, A>, such that: (i) I_{l_i} is a contiguous range defined on the dimension d_{l_i} of C, with 0 ≤ l_i ≤ |D| - 1; (ii) A is a SQL aggregation operator (i.e., SUM, COUNT, AVG). When applied to C, G returns the A-based aggregation over the data cells of C contained in the multidimensional sub-domain of C bounded by the ranges I_{l_0}, I_{l_1}, ..., I_{l_{m-1}}. In this work, we focus on range-SUM queries, as SUM aggregations are very popular in OLAP and also support the derivation of many other SQL aggregations, such as COUNT and AVG. We also define the region of a query G, denoted by H(G), as the sub-domain of C bounded by the ranges I_{l_0}, I_{l_1}, ..., I_{l_{m-1}}.

Given a data domain Δ, we define the volume of Δ, denoted by ||Δ||, as the number of data cells contained in Δ, i.e.:

||Δ|| = ∏_{k=0}^{|D|-1} |d_k|

The definition of volume can also be extended to a multidimensional data cube C, thus introducing the volume of C, denoted by ||C||, and to a multidimensional range query G, thus introducing the volume of G, ||G||, which in the literature is also known as the selectivity of G.

Given a |D|-dimensional data cube C, its privacy grid Z(C) is defined as a tuple Z(C) = <r_0, r_1, ..., r_{|D|-1}>, such that r_k is a range partitioning the dimension d_k of C, with 0 ≤ k ≤ |D| - 1, into an r_k-based partition. By combining the partitions along all the dimensions d_k ∈ D, we obtain Z(C) as a regular partition of H(C) (i.e., the multidimensional region associated with C) composed of the grid regions H_{Z(C)}^k = <r_{l_0}^k, r_{l_1}^k, ..., r_{l_{|D|-1}}^k>. Starting from this definition, we can regard Z(C) as a collection of grid regions, namely Z(C) = {H_{Z(C)}^0, H_{Z(C)}^1, ..., H_{Z(C)}^{|Z(C)|-1}}.

We now describe the metrics that we use for evaluating the accuracy of answers to queries and the privacy of multidimensional aggregates. We start from the definition of the relative query error between exact and approximate answers, which is a well-recognized quality measure in the literature (e.g., [2, 3]). For a given query G, we denote by C(G) the exact answer to G (more precisely, the answer to G evaluated on the data cube C) and by C'(G) the approximate answer to G (more precisely, the answer to G evaluated on the synopsis data cube C'). The relative query error is then defined as follows:

E(G) = |C(G) - C'(G)| / C(G)

Since our goal is to preserve privacy, we design metrics that consider how sensitive information can be discovered from aggregate data. Starting from the knowledge about a given query G (i.e., its volume and its exact answer), it is possible to infer knowledge about sensitive ranges of data contained in H(G). For instance, it is possible to derive the average contribution of each elementary data cell to C(G). We name this quantity the singleton aggregation of G, denoted by V(G), and define it as follows:

V(G) = C(G) / ||G||

Starting from V(G), it is possible to progressively discover aggregations of larger ranges of data within H(G). Indeed, by combining the knowledge about the synopsis data cube C' and the knowledge about the query G, we can derive an estimation of V(G), denoted by V'(G), defined as follows:

V'(G) = C'(G) / Σ_N(G)

such that Σ_N(G) is the number of samples extracted from H(G) to compute C'. We then introduce the relative inference error EV(G), which gives us a metric of the privacy of C'(G), defined as follows:

EV(G) = |V(G) - V'(G)| / V(G)
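Under the reconstruction of the formulas given above, the inference-error metrics can be sketched as follows (an illustrative Python sketch with our own identifiers, not the paper's code):

```python
# Illustrative sketch of the inference-error metrics of Section 2, under the
# reconstruction given above: V(G) = C(G)/||G||, V'(G) = C'(G)/Sigma_N(G),
# EV(G) = |V(G) - V'(G)| / V(G). All identifiers are ours, not the paper's.
def singleton_aggregation(exact_answer: float, query_volume: int) -> float:
    # Average contribution of each elementary data cell to C(G).
    return exact_answer / query_volume

def estimated_singleton_aggregation(approx_answer: float, n_samples: int) -> float:
    # Estimate of V(G) derivable from the synopsis C' and the query G.
    return approx_answer / n_samples

def relative_inference_error(exact_answer: float, approx_answer: float,
                             query_volume: int, n_samples: int) -> float:
    # Assumes a non-zero exact answer and at least one extracted sample.
    v = singleton_aggregation(exact_answer, query_volume)
    v_est = estimated_singleton_aggregation(approx_answer, n_samples)
    return abs(v - v_est) / abs(v)
```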
We also introduce the user-perceived singleton aggregation, denoted by V'_P(G), which is the singleton aggregation perceived by external applications on the basis of the knowledge made available to them. V'_P(G) is defined as follows:

V'_P(G) = C'(G) / ||G||

Then, we modify Equation (5) so as to define the relative user-perceived inference error, as follows:

EV_P(G) = |V(G) - V'_P(G)| / V(G)

The input of SPPOLAP consists of: (i) the data cube C; (ii) the space constraint B; (iii) an integer parameter b (described in the following); (iv) the privacy threshold T; (v) the typical query workload on C, denoted by Gw.

The first step of SPPOLAP consists in computing the privacy grid of the data cube C, Z(C). To compute Z(C), we need to determine the range r_k for each dimension d_k ∈ D. We set r_k to an adequately small fraction of the selectivity of the queries in Gw. Let S be the typical query selectivity of Gw, computed by composing all the most frequent query ranges in Gw, and let ||H_{Z(C)}^k|| be the volume of the grid region H_{Z(C)}^k. If ||H_{Z(C)}^k|| ≤ S, C' can be computed by using the grid region as the elementary reasoning unit, adopting a resolution level lower than the resolution level of the queries against C. By doing so, it is possible to achieve an effective information gain during the computation of C'. If the sampling of the grid region H_{Z(C)}^k is performed in a way that satisfies the privacy threshold T while ensuring the accuracy of the approximate answers that involve the region H_{Z(C)}^k, the same properties are also inherited by the input queries on C, since the latter are defined on top of grid regions.

The second step of SPPOLAP consists in obtaining the synopsis data cube C'. This is done by means of a greedy strategy guided by the space constraint B. Due to B, we must compute a "best-effort" synopsis data cube C' such that: (i) it satisfies the privacy threshold T; (ii) it ensures the accuracy of approximate answers; (iii) it fits within B. The greedy strategy works as follows: at each iteration j, it selects a grid region, denoted by H_{Z(C)}^j, and extracts from it a set of samples, denoted by Σ(H_{Z(C)}^j). The greedy selection is repeated until the space constraint B is consumed. The greedy selection criterion considers the properties of the data distributions associated with the grid regions of the privacy grid Z(C), more precisely, the skewness of the data distributions associated with the different grid regions. Let F(H_{Z(C)}^k) be the data distribution associated with the grid region H_{Z(C)}^k. The main idea behind the selection criterion is the assumption that, in order to describe a grid region whose distribution is skewed (i.e., the values of F(H_{Z(C)}^k) are distributed according to a Zipf distribution, with asymmetric peaks), a number of samples higher than the number of samples required to describe a uniform grid region (i.e., a region where the values of F(H_{Z(C)}^k) are distributed around their average value) is needed. We can safely assume that the F(H_{Z(C)}^k) are multidimensional, being the grid regions defined on top of multidimensional data cubes.

In order to determine whether a given data distribution is skewed, we adopt the result described in [4]. According to [4], a data distribution F is considered skewed if the skewness of F, skew(F), is greater than its standard deviation, σ_skew(F), by a factor equal to 2.6, namely: skew(F) > 2.6 · σ_skew(F). According to [5], skew(F) can be computed as follows:

skew(F) = μ_3(F) / (μ_2(F))^{3/2}

where μ_i(F) denotes the i-th central moment of F, computed over the q data items of F with respect to their mean value μ. Based on the definitions in [4], we introduce the characteristic function φ(F), which maps to 1 if F is skewed and to 0 if F is uniform. φ(F) is defined as follows:

φ(F) = 1 if skew(F) > 2.6 · σ_skew(F); 0 otherwise
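A minimal sketch of the skewness-based characteristic function is reported below; it assumes the standard moment-based skewness and the classical approximation sqrt(6/q) for the standard deviation of the skewness estimator, whereas the paper relies on [4, 5] for the exact formulation:

```python
# Sketch (ours) of the skewness-based characteristic function phi(F) described
# above. The sqrt(6/q) standard error of the skewness estimator is a classical
# approximation, assumed here; the paper refers to [4, 5] for the exact formulas.
import numpy as np

def skewness(values: np.ndarray) -> float:
    mu = values.mean()
    m2 = ((values - mu) ** 2).mean()   # 2nd central moment
    m3 = ((values - mu) ** 3).mean()   # 3rd central moment
    return m3 / (m2 ** 1.5) if m2 > 0 else 0.0

def phi(values: np.ndarray, factor: float = 2.6) -> int:
    """Return 1 if the distribution F is skewed, 0 if it is (roughly) uniform."""
    q = values.size
    sigma_skew = np.sqrt(6.0 / q)      # approximate std. dev. of the skewness estimator
    return 1 if skewness(values) > factor * sigma_skew else 0
```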
Once the region H_{Z(C)}^j is selected according to the aforementioned greedy criterion, samples must be extracted from it in order to obtain the set of samples Σ(H_{Z(C)}^j). This operation is the baseline for computing the synopsis data cube C'. We employ a strategy called uniform sampling, i.e., sampling based on the conventional Uniform distribution. Uniform sampling works as follows: let [Δ_min, Δ_max], with Δ_max > Δ_min, be the interval over which the data domain Δ is defined (we assume that Δ is one-dimensional). The i-th sample is extracted by first drawing a random index n from a Uniform distribution defined over [Δ_min, Δ_max] and then returning the value Δ[n], if it has not already been selected in a previous iteration. This procedure can easily be generalized to a multidimensional data domain by iterating the same procedure over all the dimensions d_k ∈ D.

Given the grid region H_{Z(C)}^j at iteration j of SPPOLAP, we first consider the corresponding range-SUM query G_{Z(C)}^j, whose range is equal to the range of H_{Z(C)}^j. Then, we use the input parameter b, with b > 0, for iteratively sampling the region H_{Z(C)}^j. More precisely, we sample b-sized sub-sets of samples from H_{Z(C)}^j until (i) the privacy constraint on H_{Z(C)}^j is satisfied (i.e., EV_P(G_{Z(C)}^j) ≥ T) or (ii) B is consumed. According to our design of SPPOLAP, b can be seen as a buffer size used during the sampling phase. This solution avoids the computational overheads that would arise from performing sampling on massive-in-size data cubes. Our sampling method also considers the nature of OLAP queries and the privacy requirements. More precisely, we apply uniform sampling on the subset U_{Z(C)}^j of the data cells in H_{Z(C)}^j whose value is higher than the average value of H_{Z(C)}^j. This particular strategy allows us to obtain an approximate answer to G_{Z(C)}^j with a good degree of approximation and, at the same time, a high degree of privacy, as it allows us to satisfy the constraint EV_P(G_{Z(C)}^j) ≥ T. These properties are in turn inherited by the input queries on C, as described above.
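For concreteness, the greedy, buffered sampling step described above can be sketched as follows. This is our reading of Section 2, not the paper's actual code: the space constraint B is represented as a total number of sampled cells, phi() is the characteristic function sketched earlier, and all identifiers are illustrative.

```python
# Sketch (our reading of Section 2) of SPPOLAP's greedy, buffered sampling step:
# the most skewed unsampled grid regions are considered first, and b-sized
# batches of above-average cells are drawn uniformly until the privacy
# constraint EV_P >= T is met or the space budget B is exhausted.
import numpy as np

def sppolap_sampling(regions, B, b, T, rng=None):
    """regions: list of 1-D numpy arrays, one per grid region (flattened cells).
    B: space budget expressed as a total number of sampled cells (assumption).
    b: buffer size, i.e. how many cells are drawn per batch.
    T: privacy threshold on the user-perceived inference error."""
    rng = rng or np.random.default_rng(0)
    synopsis = {}          # region index -> list of (cell index, value) samples
    budget = B
    # Greedy order: skewed regions first, since they need more samples.
    order = sorted(range(len(regions)), key=lambda j: phi(regions[j]), reverse=True)
    for j in order:
        cells = regions[j]
        v_true = cells.sum() / cells.size            # singleton aggregation V(G_j)
        if v_true == 0:
            continue
        candidates = np.flatnonzero(cells > cells.mean())  # subset U_j of above-average cells
        taken = []
        while budget > 0 and candidates.size > 0:
            k = int(min(b, budget, candidates.size))
            picked = rng.choice(candidates, size=k, replace=False)
            candidates = np.setdiff1d(candidates, picked)
            taken.extend((int(i), float(cells[i])) for i in picked)
            budget -= k
            v_perceived = sum(v for _, v in taken) / cells.size  # V'_P(G_j), reconstructed
            if abs(v_true - v_perceived) / abs(v_true) >= T:
                break                                # privacy constraint satisfied
        synopsis[j] = taken
        if budget <= 0:
            break
    return synopsis
```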
In this Section, we provide the complexity analysis and results for the SPPOLAP algorithm. First, consider that the SPPOLAP approach is inspired by the well-known research area of OLAP data cube compression algorithms (e.g., [25]). In this context, given an OLAP data cube C with |D| dimensions, the complexity of any arbitrary data cube compression algorithm P_C due to computing the compressed version of C, denoted by C_P, said complexity being denoted by O_{P_C}(n), is already known to be exponential in the number of dimensions |D|, i.e.:

O_{P_C}(n) = O(||C||) = O(∏_{k=0}^{|D|-1} |d_k|)

Obviously, the exponential nature of the algorithm cannot be avoided, but suitably lower-cost computations of SPPOLAP can be achieved in most executions. The latter is another hidden goal of the SPPOLAP algorithm. Let |Gw| be the cardinality of the input query workload Gw, and |Z(C)| the cardinality of the privacy grid Z(C) built on top of C (see Section 2). The complexity of the SPPOLAP algorithm due to computing the privacy-preserving version of C, C', denoted by O_SPPOLAP(n), is given by:

O_SPPOLAP(n) = O(|Gw| · |Z(C)|)

Since, due to the SPPOLAP construction method (see Section 2), the following relation holds:

|Gw| · |Z(C)| ≪ ||C||

then, the following relation holds:

O_SPPOLAP(n) ≪ O_{P_C}(n)

Hence, we can conclude that, given an OLAP data cube C, the complexity of the SPPOLAP algorithm due to computing the privacy-preserving version of C, C', is, for most executions, significantly bounded by the exponential computational complexity with respect to the number of dimensions of C, |D|. This is a significant theoretical result that is further corroborated by the experimental analysis provided in Section 4 (specifically, Section 4.3, where we experimentally assess the performance of SPPOLAP).

In this Section, we provide an experimental assessment of SPPOLAP on real-life data cubes. We consider several perspectives of analysis, aiming at testing the quality, effectiveness and sensitivity of our proposed technique. The performance of SPPOLAP is compared with that of the Zero-Sum algorithm described in [6]. We focus on two-dimensional data cubes, which well cover the goals of a reliable experimental evaluation of privacy-preservation capabilities. Indeed, we also tested our method on more challenging multidimensional data cubes, obtaining results very similar to those experienced on two-dimensional data cubes; for this reason, here we present only the results on two-dimensional data cubes. The parameters used in our evaluation are the following: (i) the cardinality of each dimension of the data cube, denoted by L0 and L1; (ii) the range sizes of the grid regions of the privacy grid, K0 and K1; (iii) the sparseness coefficient s, which measures the ratio between the number of non-null data cells and the total number of data cells; (iv) the space constraint B; (v) the privacy threshold T; (vi) C, which models the type of data cube used in the evaluation; (vii) the query selectivity S. In addition, we introduce ad-hoc metrics for each perspective of analysis that is object of our experimental assessment.

In the quality analysis, we employ the privacy and accuracy factors defined by Sung et al. [6], respectively FP and FA; we define both of them in the following. Let C and C' be, respectively, the input data cube and the synopsis data cube, and let ω(k) be a data cube cell having k as multidimensional index, with ω(k) ∈ C ∪ C'. The privacy factor FP measures the average amount of distortion of the data cells contained in the blocks of C' (Equation (14)). In the Zero-Sum method, a block is a sub-cube with respect to which the marginal sums of the perturbed data cells along rows and columns are kept equal to zero. Basically, FP is a way to measure how good the privacy preservation of C' is. Since Zero-Sum is a method oriented to data cells, whereas our technique is based on the privacy OLAP notion, we adapt the definition of FP to better fit our problem. First, we replace the concept of block, underlying the definition of Equation (14), with the concept of grid region (indeed, they are very similar). Second, in case C'[k] = NULL (i.e., when C'[k] has not yet been sampled), we replace C'[k] with the corresponding singleton aggregation computed with respect to the grid region containing C'[k]. Concerning the accuracy factor FA, it is instead defined, in Equation (15), in dependence on a query G over the synopsis data cube C', in terms of the exact answer C(G) to G and the approximate answer C'(G) to G. Basically, FA is a measure of how good the degree of approximation ensured by C' for a given query G is. We extend the definition of Equation (15) to an input query workload Gw as the average of FA(G) over all the queries in Gw, as follows:

FA(Gw) = (1/|Gw|) · Σ_{G ∈ Gw} FA(G)

For the purpose of the quality analysis, Gw is set as composed of the collection of range-SUM queries corresponding to blocks in the case of the Zero-Sum method, and to grid regions in the case of SPPOLAP.
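For illustration, one plausible reading (ours, not necessarily Sung et al.'s exact formulation) of the workload-level accuracy factor is sketched below, with the per-query factor taken as the relative deviation between exact and approximate answers:

```python
# One plausible reading (ours, not Sung et al.'s exact formulation) of the
# accuracy factor: the relative deviation per query, averaged over the workload
# Gw as in Equation (16). exact_fn and approx_fn stand for query evaluation on
# C and on the synopsis C', respectively.
def accuracy_factor(exact_answer: float, approx_answer: float) -> float:
    # Relative deviation between the exact and approximate answers to a query G.
    return abs(exact_answer - approx_answer) / abs(exact_answer)

def workload_accuracy_factor(workload, exact_fn, approx_fn) -> float:
    factors = [accuracy_factor(exact_fn(G), approx_fn(G)) for G in workload]
    return sum(factors) / len(factors)
```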
In the effectiveness analysis, we use the average relative user-perceived inference error, generalized from Equation (7) to the typical query workload Gw, as follows:

EV_P(Gw) = (1/|Gw|) · Σ_{G ∈ Gw} EV_P(G)

In our experimental framework, the queries in Gw are synthetically generated as queries that completely "span" the target data cube and whose selectivity S is fixed to a percentage of the volume of the data cube. For our experimental evaluation, we select two-dimensional data cubes coming from three well-known datasets: APB, TPC-H and UsCensus. Fig. 1 shows the DFM [21] of the two-dimensional data cube built from APB.

In this Section, we describe the results of the quality analysis of SPPOLAP on the APB, TPC-H and UsCensus data cubes. The quality analysis is first performed with respect to the sparseness coefficient s. The metrics that we use are the above-described accuracy and privacy factors FA and FP, defined in Equation (16) and Equation (14), respectively. Fig. 4 shows the results of the quality analysis with respect to the sparseness coefficient s against the target data cubes, with the following settings of the experimental parameters: Kk = 10%, Lk = 10%, T = 70%, B = 20%, b = 20%, S = 10%. More precisely, Fig. 4 (a), Fig. 4 (b) and Fig. 4 (c) show the accuracy factor trend of SPPOLAP against the APB, TPC-H and UsCensus data cubes, respectively, in comparison with the results given by Zero-Sum, while Fig. 4 (d), Fig. 4 (e) and Fig. 4 (f) show the privacy factor trend of SPPOLAP against the APB, TPC-H and UsCensus data cubes, respectively, still in comparison with Zero-Sum. Fig. 6 shows the results of the quality analysis with respect to the sampling percentage sp, defined as the ratio between the number of sampled cells and the total number of cells in the input data cube. In this analysis, the following setting of the experimental parameters is used: Kk = 10%, Lk = 10%, T = 70%, B = 20%, s = 20%, b = 20%, S = 10%. In more detail, Fig. 6 (a), Fig. 6 (b) and Fig. 6 (c) show the accuracy factor trend of SPPOLAP against the APB, TPC-H and UsCensus data cubes, respectively, while Fig. 6 (d), Fig. 6 (e) and Fig. 6 (f) show the privacy factor trend of SPPOLAP against the APB, TPC-H and UsCensus data cubes, respectively.

In this Section, we describe the results of the effectiveness analysis of SPPOLAP on the APB, TPC-H and UsCensus data cubes. The metric that we use is the average relative user-perceived inference error reported in Equation (17). Fig. 7 shows the results of the effectiveness analysis with respect to the selectivity S against the target data cubes, with the following settings of the experimental parameters: Kk = 10%, Lk = 10%, T = 70%, B = 20%, b = 20%, s = 20%. More precisely, Fig. 7 (a), Fig. 7 (b) and Fig. 7 (c) show the effectiveness trend of SPPOLAP against the APB, TPC-H and UsCensus data cubes, respectively, in comparison with the results given by Zero-Sum. Fig. 8 shows the results of the effectiveness analysis with respect to the space bound B against the target data cubes, with the following settings of the experimental parameters: Kk = 10%, Lk = 10%, T = 70%, b = 20%, s = 20%, S = 10%. More precisely, Fig. 8 (a), Fig. 8 (b) and Fig. 8 (c) show the effectiveness trend of SPPOLAP against the APB, TPC-H and UsCensus data cubes, respectively, in comparison with the results given by Zero-Sum.
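For illustration, the sketch below (ours; it reuses the hypothetical range_sum helper and the reconstructed user-perceived quantities from the earlier sketches) shows how the metric of Equation (17) can be evaluated over a synthetic workload of range-SUM queries whose selectivity is fixed to a fraction S of the cube volume:

```python
# Sketch of the effectiveness metric of Equation (17) evaluated over a synthetic
# workload of range-SUM queries spanning the cube with selectivity ~ S * ||C||.
# All helper names are illustrative, not the paper's code.
import numpy as np

def synthetic_workload(shape, S, n_queries=50, seed=0):
    """Generate n_queries 2-D range queries whose volume is about S * ||C||."""
    rng = np.random.default_rng(seed)
    side0 = max(1, int(round(shape[0] * np.sqrt(S))))
    side1 = max(1, int(round(shape[1] * np.sqrt(S))))
    queries = []
    for _ in range(n_queries):
        o0 = int(rng.integers(0, shape[0] - side0 + 1))
        o1 = int(rng.integers(0, shape[1] - side1 + 1))
        queries.append([(o0, o0 + side0), (o1, o1 + side1)])
    return queries

def avg_user_perceived_inference_error(C, C_synopsis, workload):
    # Assumes non-zero exact aggregates; range_sum is the helper sketched earlier.
    errors = []
    for q in workload:
        volume = (q[0][1] - q[0][0]) * (q[1][1] - q[1][0])
        v_true = range_sum(C, q) / volume                 # V(G)
        v_perceived = range_sum(C_synopsis, q) / volume   # V'_P(G), reconstructed
        errors.append(abs(v_true - v_perceived) / abs(v_true))
    return sum(errors) / len(errors)
```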
In this Section, we evaluate the performance of SPPOLAP in comparison with the Zero-Sum approach. Performance is modeled in terms of sampling time (in seconds). We evaluate performance with respect to the buffer size b, due to the fact that, as described in Section 2.3, the use of a buffer size allows us to avoid the excessive computational overhead that would be caused by performing sampling on massive-in-size data cubes without buffering. The results of the performance analysis are shown in Fig. 9. In this analysis, the following setting of the experimental parameters is superimposed: Kk = 10%, Lk = 10%, T = 70%, B = 20%, s = 20%, S = 10%. Fig. 9 (a), Fig. 9 (b) and Fig. 9 (c) show the sampling time for the APB, TPC-H and UsCensus data cubes, respectively, in comparison with the Zero-Sum approach.

Starting from the extremely emerging context of privacy-preserving big data management and analytics, this paper has further extended the research contributions initially provided by the state-of-the-art SPPOLAP algorithm [1] for supporting privacy-preserving OLAP. The novel contributions focus on the complete algorithms of the whole SPPOLAP algorithmic framework, the complexity analysis and its results, and a comprehensive experimental analysis of SPPOLAP against real-life multidimensional data cubes, according to several experimental parameters. In particular, the experimental results have confirmed the suitability of SPPOLAP in a wide range of emerging big data applications, such as smart city application scenarios, where the privacy preservation axiom becomes a hard obstacle to smart citizen services for e-society, e-procurement, e-government, and so forth. Future work is mainly oriented towards the extension of the actual framework to other big data management and analytics paradigms (e.g., [22, 23, 24]).
Acknowledgements. This research has been partially supported by the French PIA project "Lorraine Université d'Excellence", reference ANR-15-IDEX-04-LUE. The author is grateful to Dr. Vincenzo De Maio for his contribution to the experimental part of early versions of this work.

References
[1] A Robust Sampling-Based Framework for Privacy Preserving OLAP
[2] XℓPPX: a Lightweight Framework for Privacy Preserving P2P XML Databases in Very Large Publish-Subscribe Systems
[3] Inference Control in Statistical Databases
[4] Kendall's Advanced Theory of Statistics
[5] Probability, Random Variables, and Stochastic Processes
[6] Privacy Preservation for Data Cubes
[7] Privacy and Security of Big Data: Current Challenges and Future Research Perspectives
[8] Toward Efficient and Privacy-Preserving Computing in Big Data Era
[9] Privacy Preserving Data Analytics for Smart Homes
[10] Abnormal Respiratory Patterns Classifier May Contribute to Large-Scale Screening of People Infected with COVID-19 in an Accurate and Unobtrusive Manner
[11] Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection & Patient Monitoring using Deep Learning CT Image Analysis
[12] Propagation Analysis and Prediction of the COVID-19
[13] Privacy-Preserving Big Data Publishing
[14] Privacy Preserving OLAP over Distributed XML Data: A Theoretically-Sound Secure-Multiparty-Computation Approach
[15] Privacy Preserving OLAP and OLAP Security
[16] Privacy Preserving OLAP
[17] Privacy Preserving Aggregate Query of OLAP for Accurate Answers
[18] Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals
[19] Big Data Analytics
[20] OLAP*: Effectively and Efficiently Supporting Parallel OLAP over Big Data
[21] The Dimensional Fact Model: A Conceptual Model for Data Warehouses
[22] OLAP Analysis of Multidimensional Tweet Streams for Supporting Advanced Analytics
[23] Big Graph Analytics: The State of the Art and Future Research Agenda
[24] DyNMF: Role Analytics in Dynamic Social Networks
[25] OLAP Data Cube Compression Techniques: A Ten-Year-Long History
[26] Private Databases on the Cloud: Models, Issues and Research Perspectives
[27] Fuzzy Logic-Based Data Analytics on Predicting the Effect of Hurricanes on the Stock Market
[28] An Innovative Framework for Supporting Frequent Pattern Mining Problems in IoT Environments