key: cord-0032785-a6c6yscb authors: de Oliveira, Hélio M.; Ospina, Raydonal; Leiva, Víctor; Martin-Barreiro, Carlos; Chesneau, Christophe title: A New Wavelet-Based Privatization Mechanism for Probability Distributions date: 2022-05-14 journal: Sensors (Basel) DOI: 10.3390/s22103743 sha: 84b784a89ae586fb95a9f621e2566f5d3489391c doc_id: 32785 cord_uid: a6c6yscb

In this paper, we propose a new privatization mechanism based on a naive theory of a perturbation on a probability distribution using wavelets, in the same way that noise perturbs the signal of a digital image sensor. Wavelets are employed to extract information from a wide range of types of data, including audio signals and images often related to sensors, as unstructured data. Specifically, the cumulative wavelet integral function is defined and then used to build the perturbation on a probability distribution. We show that an arbitrary distribution function additively perturbed is still a distribution function, which can be seen as a privatized distribution, with the privatization mechanism being a wavelet function. Thus, we offer a mathematical method for choosing a suitable probability distribution for data by starting from some guessed initial distribution. Examples of the proposed method are discussed. Computational experiments were carried out using a database-sensor and two related algorithms. Several knowledge areas can benefit from the new approach proposed in this investigation. The areas of artificial intelligence, machine learning, and deep learning, which are closely related to sensors, constantly need techniques for data fitting. Therefore, we believe that the proposed privatization mechanism is an important contribution to increasing the spectrum of existing techniques.
Probability models capable of capturing the fundamental information contained in modern data, such as those used for artificial intelligence [1] and big data [2], as well as models presenting unique features, have promoted derivations of novel continuous probability distributions [3, 4]. Numerous and diverse approaches have been proposed over time to generate new probability or statistical distributions [5]. One of the most common approaches allows us to enhance the functionality of a base continuous cumulative distribution function (CDF). This can be achieved utilizing various transformations based on exponential, logarithmic, power, or other functions [6]. On this topic, we may refer to the so-called "families of probability distributions", as described in [7, 8]. The new probability distributions may be employed efficiently in diverse settings, as described in [9, 10]. We may also refer to the work stated in [11], which points out the importance of continuous probability distributions in the definition of various measures. In view of the impacts of the current research on probability distributions [12], diverse applications related to the areas of artificial intelligence [1], machine learning [13], and deep learning [14], which are closely related to sensors, constantly require new techniques for data fitting. Additionally, to aid in the progress of computer sciences, new approaches are welcome to expand the options of a reference probability distribution [15].

An application of probability models can be introduced by perturbing a CDF additively, similarly to how noise perturbs the signal of a digital image sensor [16]. Surprisingly, such a strategy does not appear to have received much attention in the literature. More precisely, given a continuous CDF, one can add another function (the perturbation function) to it in such a way that the resulting function is also a continuous CDF.
To propose a manageable perturbation [17], one can employ a special, well-known function called a wavelet [18, 19]. Basically, such a function has a wave-like oscillation with an amplitude that starts at zero and increases or decreases before returning to zero, one or more times. Wavelets may be utilized to extract information from a wide range of data, including audio signals and images often related to sensors [20], as unstructured data. To thoroughly analyze data, wavelet sets might be used. For more information on wavelets, we refer the reader to [21-23]. More specifically, in [24], transients and their wavelet coefficients are modeled as mixed Laplace probability density functions (PDFs). In [25], image segmentation based on a wavelet feature descriptor and dimensionality reduction was applied to remote sensing. Thus, one could involve a wavelet function to define a valid perturbation, and then a privatized probability distribution can be obtained through theoretical and practical tools.

The main objectives of this article are to propose and derive a naive theory of an additive perturbation on a continuous probability distribution based on a wavelet approach, and to illustrate it with a sensor-related application. The use of wavelets in this probability distribution setting is original, and our findings open a new modeling horizon, which is examined in depth. Therefore, we offer a mathematical method for choosing a suitable probability distribution to model data by starting from some guessed initial probability distribution. Examples of the proposed method are also presented. For the computational experiments, we utilize a database-sensor and two related algorithms.

The rest of the article is organized as follows. Section 2 introduces the new wavelet approach. In Section 3, we discuss the choice of a perturbation for an arbitrary probability distribution. Section 4 proposes a correction for statistical moments due to the perturbation.
Then, in Section 5, the generalization of the perturbation approach at further levels is presented. In Section 6, we provide an empirical application of our approach. Finally, Section 7 gives the concluding remarks.

Suppose we have a random variable $X$ with a continuous CDF $F_X$. Let us consider an additive (functional) perturbation, denoted as $\varepsilon$-perturbation, so that

$$F_{\mathrm{priv}}(x) = F_X(x) + \varepsilon(x), \quad x \in \mathbb{R}, \tag{1}$$

with the CDF $F_{\mathrm{priv}}$ stated in (1) being a privatized CDF. Note that, in the expression defined in (1), the CDF of the variable $X$ has been perturbed and a new function $F_{\mathrm{priv}}$ is obtained. However, the choice of the perturbation cannot be arbitrary because it could violate the requirements that a probability distribution must meet. The following conditions must be met by the perturbation:

(C1) $\lim_{|x| \to +\infty} \varepsilon(x) = 0$;

(C2) $\varepsilon$ is differentiable and satisfies $|d\varepsilon(x)/dx| \le f_X(x)$, where $f_X$ denotes the PDF related to the CDF $F_X$.

The conditions (C1) and (C2) stated above guarantee that $F_{\mathrm{priv}}$ is also a CDF. Indeed, (C1) preserves the limits $F_{\mathrm{priv}}(-\infty) = 0$ and $F_{\mathrm{priv}}(+\infty) = 1$, whereas (C2) gives $dF_{\mathrm{priv}}(x)/dx = f_X(x) + d\varepsilon(x)/dx \ge 0$, so that $F_{\mathrm{priv}}$ is non-decreasing. This new distribution could be seen as a privatized version of the reference distribution.

To describe our new wavelet approach, some definitions need to be given. Let us begin with the mathematical definition of a wavelet.

Definition 1. A wavelet is a Lebesgue measurable function $\psi(x)$ that is both absolutely integrable and square-integrable, such that

$$\int_{-\infty}^{+\infty} \psi(x)\,dx = 0, \tag{2}$$

$$\int_{-\infty}^{+\infty} \psi^2(x)\,dx = 1. \tag{3}$$

On the one hand, from the expression established in (2), observe that $\psi$ integrates to zero (0) over the entire real line, with absolute integrability ensuring that this integral exists. On the other hand, in the formula stated in (3), note that the square of $\psi$ is also integrable over $\mathbb{R}$ and its integral is equal to one (1). Keep in mind that, in this study, we deal with compactly supported wavelets [26], that is, the closure of the set on which the wavelet is non-vanishing is a compact set. Specifically, if $\psi$ is a wavelet function and the closure of $\{x: \psi(x) \neq 0\}$ is a compact set, then we say $\psi$ is a wavelet of compact support.
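Henceforth, we assume that $\mathrm{support}\{\psi(x)\} \equiv [a, b]$, which plays a crucial role in our proposal [21, 27]. The next definition presents the notion of wavelet cumulative function in this setting.

Definition 2. The wavelet cumulative function of a wavelet $\psi$ is defined as

$$\Psi(x) = \int_{-\infty}^{x} \psi(t)\,dt, \quad x \in \mathbb{R}. \tag{4}$$

Since only compactly supported wavelets are considered, the wavelet cumulative function given in (4) can be simplified to

$$\Psi(x) = \int_{a}^{x} \psi(t)\,dt, \quad x \in [a, b]. \tag{5}$$

Thus, from the expression stated in (5), the following properties can be verified:

$$\Psi(a) = 0, \tag{6}$$

$$\Psi(b) = \int_{a}^{b} \psi(t)\,dt = 0, \tag{7}$$

$$\frac{d\Psi(x)}{dx} = \psi(x). \tag{8}$$

Note that the properties formulated in (6)-(8) are helpful in what follows.

To begin with, let us deal with the uniform distribution, denoted as $U[0, 1]$, whose CDF is given by

$$F_X(x) = \begin{cases} 0, & x < 0; \\ x, & 0 \le x \le 1; \\ 1, & x > 1. \end{cases}$$

Then, for a wavelet $\psi$ with support $[0, 1]$, we propose to choose a particular perturbation $\varepsilon$ according to

$$\varepsilon(x) = \Psi(x) = \int_{0}^{x} \psi(t)\,dt, \quad x \in [0, 1]. \tag{9}$$

For the particular choice stated in (9), the new distribution defined in (1) has the same support as the original, unperturbed distribution. Furthermore, imposing the condition $|\psi(t)| \le 1$, it follows that

$$|\varepsilon(x)| \le \int_{0}^{x} |\psi(t)|\,dt \le \int_{0}^{x} dt = x. \tag{10}$$

From the expression established in (10), we can guarantee that $|\varepsilon(x)| \le x$, for all $x \in [0, 1]$. Therefore, the condition $F_{\mathrm{priv}}(x) \ge 0$ is assured, for all $x \in [0, 1]$. Hence, we must determine whether $F_{\mathrm{priv}}$ is always a non-decreasing function or not. Thus, we examine the behavior of the corresponding PDF formulated as

$$f_{\mathrm{priv}}(x) = \frac{dF_{\mathrm{priv}}(x)}{dx} = f_X(x) + \frac{d\varepsilon(x)}{dx}, \tag{11}$$

implying

$$f_{\mathrm{priv}}(x) = 1 + \psi(x), \quad x \in [0, 1], \tag{12}$$

where $f_{\mathrm{priv}}$ denotes the PDF related to the CDF $F_{\mathrm{priv}}$. From the formulas given in (11) and (12), together with (2) and $|\psi(x)| \le 1$, it follows that $\int_{-\infty}^{+\infty} f_{\mathrm{priv}}(x)\,dx = 1$ and $f_{\mathrm{priv}}(x) \ge 0$, for all $x$, thereby proving that this is indeed a valid PDF. Then, this new PDF and its associated CDF might be regarded as a privatized version of the reference distribution, with the privatization mechanism being named wavelet perturbation. This is what we call "privatization analysis".

As an example, let us first consider the compactly supported wavelet defined within $[0, 1]$ that was proposed in [28], whose expression is labeled (13). Figure 1 shows the original distribution, that is, $U[0, 1]$, and the new distribution generated by the perturbation identified in (13).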
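The uniform case can be checked numerically with a short sketch. Because the expression of the wavelet in (13) is not reproduced here, the code below uses a hypothetical stand-in, namely the one-cycle function `AMPLITUDE * sin(2*pi*x)` on $[0, 1]$, which has zero integral and satisfies $|\psi| \le 1$ but is not normalized to unit energy as in (3). The function names and the value of `AMPLITUDE` are illustrative assumptions, not part of the original work.

```python
import numpy as np

AMPLITUDE = 0.5  # hypothetical choice; any |AMPLITUDE| <= 1 keeps |psi| <= 1


def psi(x, amplitude=AMPLITUDE):
    """Stand-in wavelet on [0, 1]: amplitude * sin(2*pi*x), zero elsewhere."""
    x = np.asarray(x, dtype=float)
    inside = (x >= 0.0) & (x <= 1.0)
    return np.where(inside, amplitude * np.sin(2.0 * np.pi * x), 0.0)


def psi_cumulative(x, amplitude=AMPLITUDE):
    """Closed form of the integral of psi from 0 to x (zero outside [0, 1])."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
    return amplitude * (1.0 - np.cos(2.0 * np.pi * x)) / (2.0 * np.pi)


def uniform_cdf(x):
    """CDF of the U[0, 1] distribution."""
    return np.clip(np.asarray(x, dtype=float), 0.0, 1.0)


def privatized_cdf(x, amplitude=AMPLITUDE):
    """Perturbed CDF: uniform CDF plus the wavelet cumulative function."""
    return uniform_cdf(x) + psi_cumulative(x, amplitude)


def privatized_pdf(x, amplitude=AMPLITUDE):
    """Perturbed PDF: 1 + psi(x) on [0, 1], zero elsewhere."""
    x = np.asarray(x, dtype=float)
    inside = (x >= 0.0) & (x <= 1.0)
    return np.where(inside, 1.0 + psi(x, amplitude), 0.0)


def trapezoid(y, x):
    """Plain trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 10001)
    print("integral of psi      :", trapezoid(psi(grid), grid))
    print("integral of f_priv   :", trapezoid(privatized_pdf(grid), grid))
    print("min of f_priv        :", privatized_pdf(grid).min())
    print("F_priv is monotone   :", bool(np.all(np.diff(privatized_cdf(grid)) >= 0)))
```

With this stand-in, the perturbed density stays between `1 - AMPLITUDE` and `1 + AMPLITUDE`, so choosing `AMPLITUDE` strictly below one keeps the privatized CDF strictly increasing on $[0, 1]$.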
Another family of compactly supported wavelets with adjustable parameters is the beta wavelet family [29]. One of the advantages of adopting beta wavelet perturbations is that the shape ($\alpha > 0$) and scale ($\theta > 0$) parameters are easy to change, which makes the perturbation $\psi_{\mathrm{beta}}(x, \alpha, \theta)$ flexible. In other words, this wavelet family allows for a simple parametrization that drives the asymmetry of the resulting probability distribution. The plots of two beta wavelet perturbations are shown in Figure 2 as examples. This approach can be employed to introduce asymmetries in a chosen probability distribution, controlled by the beta wavelet parameters.

Among the compactly supported wavelets, certainly the most used are the Daubechies (DB4) wavelets [27]. Close approximations of the Daubechies wavelets of any order have been proposed in [30]. Using MATLAB commands, these continuous approximations were employed to plot the DB4 perturbation adapted to the $U[0, 1]$ distribution, denoted by $\Psi_{\mathrm{DB4}}$, in Figure 4.

Now, we offer a valid perturbation for an arbitrary CDF $F_X$. For a given compactly supported wavelet $\psi$ satisfying $|\psi| \le 1$, with its cumulative function $\Psi$ (see Definition 2), consider a new chosen CDF according to

$$F_{\mathrm{priv}}(x) = F_X(x) + \varepsilon(x), \tag{14}$$

with

$$\varepsilon(x) = \frac{1}{b - a}\,\Psi\big((b - a)F_X(x) + a\big).$$

From (14), note that $F_{\mathrm{priv}}(-\infty) = 0$, $F_{\mathrm{priv}}(+\infty) = 1$, and

$$f_{\mathrm{priv}}(x) = f_X(x) + \frac{d\varepsilon(x)}{dx}, \tag{15}$$

with $d\varepsilon(x)/dx$ stated in (15) given by

$$\frac{d\varepsilon(x)}{dx} = \psi\big((b - a)F_X(x) + a\big)\,f_X(x). \tag{16}$$

Then, $\varepsilon$ is a valid perturbation. The condition (C1) is satisfied because $\lim_{|x| \to +\infty} \varepsilon(x) = 0$, due to $\Psi(a) = 0$ and $\Psi(b) = \int_{a}^{b} \psi(u)\,du = 0$. The condition (C2) is also satisfied, since by (16) and $|\psi| \le 1$, we have

$$\left|\frac{d\varepsilon(x)}{dx}\right| \le f_X(x). \tag{17}$$

Thus, any wavelet of compact support can be used to induce a different perturbation in the vicinity of the probability distribution initially assigned. From the expressions stated in (14)-(17), note that, after applying the perturbation, the resulting function is also a CDF.
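The construction for an arbitrary base CDF can be sketched as follows, assuming the perturbation takes the form $\varepsilon(x) = \Psi((b - a)F_X(x) + a)/(b - a)$, which is the form consistent with the differential relation $dF_{\mathrm{priv}}(x) = dF_X(x) + \psi((b - a)F_X(x) + a)\,dF_X(x)$ used for the moments. The base distribution (standard logistic), the wavelet support $[-1, 2]$, the stand-in sine wavelet, and all function names are hypothetical choices made only for illustration.

```python
import numpy as np

A, B = -1.0, 2.0   # hypothetical wavelet support [a, b]
AMPLITUDE = 0.8    # |AMPLITUDE| <= 1 keeps |psi| <= 1


def psi(u):
    """Stand-in one-cycle wavelet on [A, B] with zero integral."""
    u = np.asarray(u, dtype=float)
    inside = (u >= A) & (u <= B)
    phase = 2.0 * np.pi * (u - A) / (B - A)
    return np.where(inside, AMPLITUDE * np.sin(phase), 0.0)


def psi_cumulative(u):
    """Closed form of the integral of psi from A to u (zero outside [A, B])."""
    u = np.clip(np.asarray(u, dtype=float), A, B)
    phase = 2.0 * np.pi * (u - A) / (B - A)
    return AMPLITUDE * (B - A) * (1.0 - np.cos(phase)) / (2.0 * np.pi)


def base_cdf(x):
    """Standard logistic CDF, used here as the initial guess F_X."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def base_pdf(x):
    """Standard logistic PDF f_X."""
    cdf = base_cdf(x)
    return cdf * (1.0 - cdf)


def epsilon(x):
    """Perturbation: Psi((b - a) F_X(x) + a) / (b - a)."""
    return psi_cumulative((B - A) * base_cdf(x) + A) / (B - A)


def epsilon_derivative(x):
    """Derivative of the perturbation: psi((b - a) F_X(x) + a) * f_X(x)."""
    return psi((B - A) * base_cdf(x) + A) * base_pdf(x)


def privatized_cdf(x):
    return base_cdf(x) + epsilon(x)


def privatized_pdf(x):
    return base_pdf(x) + epsilon_derivative(x)


if __name__ == "__main__":
    x = np.linspace(-30.0, 30.0, 60001)
    print("epsilon at the tails :", epsilon(-30.0), epsilon(30.0))            # (C1)
    print("(C2) holds on grid   :",
          bool(np.all(np.abs(epsilon_derivative(x)) <= base_pdf(x) + 1e-15)))
    print("min of f_priv        :", privatized_pdf(x).min())
```

The two printed checks correspond to conditions (C1) and (C2); any other continuous base CDF and any other bounded, zero-mean wavelet on a compact support could be substituted in the same way.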
In summary, given a random variable $X$ with CDF $F_X$, a perturbation can be added, which guarantees that the modified function is still a CDF around the original CDF. This new CDF, and its associated distribution, as mentioned, are privatized versions of the reference distribution using a wavelet-based privatization mechanism.

Based on the random variable $X$, the hypothesized distribution (initial or prior distribution around which the wavelet perturbation is introduced) has its $k$-th moment defined by

$$E[X^k] = \int_{-\infty}^{+\infty} x^k\,dF_X(x), \tag{18}$$

provided that it exists in the mathematical sense. By introducing the wavelet perturbation, the new (adjusted/privatized) $k$-th moment is stated as

$$E[X_{\mathrm{priv}}^k] = \int_{-\infty}^{+\infty} x^k\,dF_{\mathrm{priv}}(x). \tag{19}$$

Consider the equation given by $dF_{\mathrm{priv}}(x) = dF_X(x) + \psi((b - a)F_X(x) + a)\,dF_X(x)$. Then, by using the expressions given in (18) and (19), it follows that

$$E[X_{\mathrm{priv}}^k] = E[X^k] + \int_{-\infty}^{+\infty} x^k\,\psi\big((b - a)F_X(x) + a\big)\,dF_X(x). \tag{20}$$

The second term on the right side of (20) accounts for a moment correction due to the introduced wavelet perturbation.

Let us consider now the particular case of a perturbation in a (normalized) uniform distribution, that is, $X \sim U(0, 1)$. To evaluate the moments of the new CDF $F_{\mathrm{priv}}$, under the wavelet perturbation $\psi$ with a compact support $[0, 1]$, we have

$$E[X_{\mathrm{priv}}^k] = \int_{0}^{1} x^k\,\big(1 + \psi(x)\big)\,dx. \tag{21}$$

Note that the moment of the wavelet used to build the additive perturbation also adds to the moment of the starting distribution, because

$$E[X_{\mathrm{priv}}^k] = \frac{1}{k + 1} + \int_{0}^{1} x^k\,\psi(x)\,dx. \tag{22}$$

If the support set is the unit interval, that is $[0, 1]$, then the formulas stated in (21) and (22) may be utilized. In the general case, if $\psi$ has a support $[a, b] \neq [0, 1]$, we can build a modified (support-normalized) wavelet defined as

$$\tilde{\psi}(x) = \psi\big((b - a)x + a\big), \quad x \in [0, 1].$$

Hence, we have that

$$E[X_{\mathrm{priv}}^k] = \frac{1}{k + 1} + \int_{0}^{1} x^k\,\tilde{\psi}(x)\,dx. \tag{23}$$

Under the assumption that the integral term given in (23) vanishes, the moments of the new and hypothesized distributions coincide. In the case of a beta perturbation over a $U[0, 1]$ distribution, this integral term depends on the parameters $\alpha$ and $\theta$ of the perturbation wavelet.
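The moment correction for the uniform case can be illustrated numerically. The sketch below assumes the same hypothetical stand-in wavelet `AMPLITUDE * sin(2*pi*x)` on $[0, 1]$ (not the wavelet of [28] or a beta wavelet) and computes the privatized $k$-th moment in two independent ways: as the uniform moment $1/(k + 1)$ plus the wavelet moment, and as a Riemann-Stieltjes sum against the perturbed CDF. All names are illustrative.

```python
import numpy as np

AMPLITUDE = 0.5  # hypothetical amplitude of the stand-in wavelet


def psi(x):
    """Stand-in wavelet on [0, 1]; the inputs are assumed to lie in [0, 1]."""
    return AMPLITUDE * np.sin(2.0 * np.pi * np.asarray(x, dtype=float))


def psi_cumulative(x):
    """Integral of psi from 0 to x, for x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    return AMPLITUDE * (1.0 - np.cos(2.0 * np.pi * x)) / (2.0 * np.pi)


def trapezoid(y, x):
    """Plain trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def moment_correction(k, n=20001):
    """Wavelet moment: integral of x^k * psi(x) over [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    return trapezoid(x**k * psi(x), x)


def privatized_moment(k, n=20001):
    """Uniform k-th moment 1/(k+1) plus the wavelet moment correction."""
    return 1.0 / (k + 1) + moment_correction(k, n)


def privatized_moment_from_cdf(k, n=20001):
    """Riemann-Stieltjes sum of x^k dF_priv with F_priv(x) = x + Psi(x)."""
    x = np.linspace(0.0, 1.0, n)
    cdf = x + psi_cumulative(x)
    midpoints = 0.5 * (x[1:] + x[:-1])
    return float(np.sum(midpoints**k * np.diff(cdf)))


if __name__ == "__main__":
    for k in range(4):
        print(k, privatized_moment(k), privatized_moment_from_cdf(k))
```

For this stand-in, the first-moment correction has the closed form `-AMPLITUDE / (2 * pi)`, so a positive amplitude shifts the mean below 1/2, in line with the perturbed CDF lying above the uniform CDF.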
Thus, it is worth rewriting, via the equations stated in (1)-(9), that

$$F_{\mathrm{priv}}(x) = \underbrace{F_X(x)}_{\text{approximation}} + \underbrace{\varepsilon(x)}_{\text{detail}}. \tag{24}$$

The interpretation presented in (24) of wavelet theory (approximation + detail) can be generalized along the lines of a wavelet tree with several levels. First, we present level-1 parameters $(\alpha, \theta)$ by means of the expression labeled (25). In Figure 3, we can see examples of this case. Second, we introduce level-2 LH parameters $(\alpha_L, \theta_L : \alpha_H, \theta_H)$ by means of the expression labeled (26). An example can be provided using the parameters $\alpha_L = 4$, $\theta_L = 3$, and $\alpha_H = 3$, $\theta_H = 7$. These parameters are similar to those employed in Figure 3. However, note that different wavelets may be selected to fit different segments of the initial distribution support. For instance, in a level-2 perturbation, the sub-level-L can use a beta wavelet, whereas the sub-level-H may employ a Mexican-hat wavelet, denoted by $\Psi_M$, as in Figure 5. The parameterization $\alpha_L = 4$, $\theta_L = 3$, and $\alpha_H = 3$, $\theta_H = 7$ is used in Figure 6, with the corresponding perturbation denoted by $\Psi_{\text{level-2}}$. Next, we present level-4 LL LH HL HH parameters, namely $(\alpha_{LL}, \theta_{LL} : \alpha_{LH}, \theta_{LH} \ldots \alpha_{HL}, \theta_{HL} : \alpha_{HH}, \theta_{HH})$, by means of the expression labeled (27). An example of this level-4 approach is illustrated utilizing the values given by $(\alpha_{LL}, \theta_{LL} : \alpha_{LH}, \theta_{LH} \ldots \alpha_{HL}, \theta_{HL} : \alpha_{HH}, \theta_{HH}) = (4, 3 : 3, 7 \ldots 5, 3 : 2, 7)$. An interpretation of this approach is to consider a distinct perturbation in each quartile of the distribution.

In short, the privatization mechanism allows us to perturb a probability distribution employing levels (applying a partition on the compact support), which may be very attractive when fitting data. We can use the expression stated in (25) when implementing one level, in (26) when implementing two levels, and in (27) when implementing four levels.

Next, we apply our privatization approach to a real-world problem. An e-commerce company sells products on the Internet and wants to analyze the possibility of adding more servers or changing its most important server.
By collecting daily data, we find many days in which the best server has almost all its hardware resources consumed 70% of the time. Looking at the empirical PDF and CDF, we see that a triangular distribution, with support on the set $[0, 1]$ and mode equal to 0.7, might represent the data well. However, when performing goodness-of-fit tests, the results tell us that a triangular distribution is not the best option. Instead, a "quasi-triangular" distribution could be an appropriate probability model for the random variable $X$ that measures the daily proportion of time with full resource consumption of the best server. Among the known techniques to fit data, the privatization mechanism that we propose in this work is an excellent option to slightly perturb the triangular distribution and describe the data well.

Let $X$ be a continuous variable, which is triangularly distributed, with support on the interval $[0, 1]$, and whose mode is $m$, for $0 < m < 1$. The PDF and CDF of $X$ are, respectively, given by

$$f_X(x) = \begin{cases} \dfrac{2x}{m}, & 0 \le x \le m; \\[2mm] \dfrac{2(1 - x)}{1 - m}, & m < x \le 1; \\[2mm] 0, & \text{otherwise}, \end{cases}$$

and

$$F_X(x) = \begin{cases} 0, & x < 0; \\[1mm] \dfrac{x^2}{m}, & 0 \le x \le m; \\[2mm] 1 - \dfrac{(1 - x)^2}{1 - m}, & m < x \le 1; \\[2mm] 1, & x > 1. \end{cases}$$

Now, we use the wavelet function defined in (13). Figure 7 shows the plot of the CDF corresponding to $X$ (original triangular distribution) and also the plot of the privatized version that corresponds to the random variable $X_{\mathrm{priv}}$ (perturbed triangular distribution). We consider the value $m = 0.7$ in the calculations carried out. Note that, in the perturbed triangular distribution, the CDF values are greater than those of the original triangular distribution for values of $X$ less than 0.5, while for values of $X$ greater than 0.5, the opposite occurs. This behavior is due to the wavelet function employed in this empirical application. In practice, this method is flexible, allowing us to choose the most convenient wavelet to fit the data. For the computational experiments that were carried out, a database-sensor and two related algorithms were used.
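A minimal sketch of the perturbed triangular model with $m = 0.7$ is given below. It assumes the perturbation $\varepsilon(x) = \Psi(F_X(x))$ for a wavelet supported on $[0, 1]$ and again uses the hypothetical stand-in wavelet `AMPLITUDE * sin(2*pi*u)` instead of the wavelet in (13), so its output is not expected to reproduce Figure 7. The sampler based on numerical inversion of the perturbed CDF is also an illustrative addition, and all names are assumptions.

```python
import numpy as np

MODE = 0.7       # mode m of the triangular distribution, as in the application
AMPLITUDE = 0.5  # hypothetical amplitude of the stand-in wavelet


def triangular_cdf(x, m=MODE):
    """CDF of the triangular distribution on [0, 1] with mode m."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
    return np.where(x <= m, x**2 / m, 1.0 - (1.0 - x)**2 / (1.0 - m))


def triangular_pdf(x, m=MODE):
    """PDF of the triangular distribution on [0, 1] with mode m."""
    x = np.asarray(x, dtype=float)
    inside = (x >= 0.0) & (x <= 1.0)
    density = np.where(x <= m, 2.0 * x / m, 2.0 * (1.0 - x) / (1.0 - m))
    return np.where(inside, density, 0.0)


def psi(u):
    """Stand-in wavelet evaluated at u in [0, 1]."""
    return AMPLITUDE * np.sin(2.0 * np.pi * np.asarray(u, dtype=float))


def psi_cumulative(u):
    """Integral of psi from 0 to u, for u in [0, 1]."""
    u = np.asarray(u, dtype=float)
    return AMPLITUDE * (1.0 - np.cos(2.0 * np.pi * u)) / (2.0 * np.pi)


def privatized_cdf(x):
    """Perturbed triangular CDF: F_X(x) + Psi(F_X(x))."""
    base = triangular_cdf(x)
    return base + psi_cumulative(base)


def privatized_pdf(x):
    """Perturbed triangular PDF: f_X(x) * (1 + psi(F_X(x)))."""
    return triangular_pdf(x) * (1.0 + psi(triangular_cdf(x)))


def sample_privatized(n, seed=0, grid_size=4001):
    """Draw n values from the perturbed distribution by numerical CDF inversion."""
    grid = np.linspace(0.0, 1.0, grid_size)
    cdf_values = privatized_cdf(grid)
    u = np.random.default_rng(seed).uniform(size=n)
    return np.interp(u, cdf_values, grid)


if __name__ == "__main__":
    points = np.array([0.25, 0.5, 0.7, 0.9])
    print("x         :", points)
    print("F_X(x)    :", triangular_cdf(points))
    print("F_priv(x) :", privatized_cdf(points))
    print("sample mean of X_priv:", sample_privatized(20000).mean())
```

Inversion by interpolation is used because the perturbed CDF generally has no closed-form quantile function; it relies on the perturbed CDF being increasing, which holds here since `AMPLITUDE` is strictly below one.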
Algorithm 1 shows the steps to perturb a probability distribution with compact support. If a perturbation by levels is required, we propose Algorithm 2 as a generalization following Section 5, where the number $k$ of levels is left to the consideration of the data analyst.

Algorithm 1 Approach to perturb a probability distribution with a database-sensor.
1: Consider a random variable $X$ with compact support $[a, b]$.
2: Select a wavelet with compact support $[a, b]$ to perturb the distribution of the previous step, with the computations being performed by a first process denoted by A that sends the generated data to a database.
3: State a sensor in the database that detects the entry of new data, so that, using a trigger, the sensor responds by sending a copy of the stored data to a second process denoted by B.
4: Establish that process B receives the perturbed data and is responsible for building the CDF of the resulting distribution.
5: Confirm that process B generates the corresponding plots showing, between $a$ and $b$, the original distribution, the wavelet used, and the perturbed distribution.

Algorithm 2 Approach to perturb a probability distribution by levels.

This paper has presented a new method for building an additive wavelet-based perturbation, as a privacy mechanism, to modify a given continuous probability distribution. Then, the initial guess could be perturbed as some sort of "prospecting within the ensemble of possible probability distributions around the starting distribution". The method we have proposed in this investigation is flexible with respect to the perturbation function that may be employed to fit the data, since different wavelets are available. A procedure was also offered to employ four different perturbations, one in each quartile of the original distribution, which can be quite attractive when fitting data. Examples of the proposed method were discussed. Computational experiments were carried out using a database-sensor and two related algorithms.

Several knowledge areas can benefit from using the new method proposed in this study. Stochastic programming, simulation studies, and multivariate analysis [31-34], among other areas of knowledge, may also benefit from the utilization of the new approach proposed in this investigation. The Internet of things, robotics, monitoring stations, telemetry, and the use of sensors are also important fields for data reading and fitting. Concrete applications via this new approach may now emerge, with an efficient configuration for the involved functions. Another benefit of this technique is its ease of implementation in any programming language. Software developers must be the first to get involved to make this technique available to data analysts. The areas of artificial intelligence, machine learning, and deep learning [35], which are closely related to sensors, constantly require new techniques for data fitting. Accordingly, we think that the proposed privatization mechanism is an important contribution to increasing the spectrum of existing techniques. An avenue of future work to be considered is to provide a method that allows us to determine the most appropriate wavelet during data fitting.
Overview of explainable artificial intelligence for prognostic and health management of industrial assets based on preferred reporting items for systematic reviews and meta-analyses
Recent developments of control charts, identification of big data sources and future trends of current research
A novel claim size distribution based on a Birnbaum-Saunders and gamma mixture capturing extreme values in insurance: Estimation, regression, and applications
Truncated inverted Kumaraswamy generated family of distributions with applications
Continuous Univariate Distributions
Continuous Univariate Distributions
Compounding of distributions: A survey and new generalized classes
Recent developments in distribution theory: A brief survey and some new generalized classes of distributions
Elbatal, I. The truncated Cauchy power family of distributions with inference and applications
The truncated Burr X-G family of distributions: Properties and applications to actuarial and financial data
A brief review of generalized entropies
Two new mixture models related to the inverse Gaussian distribution
Classifying COVID-19 based on amino acids encoding with machine learning algorithms
Abnormality detection and failure prediction using explainable Bayesian deep learning: Methodology and case study with industrial data
On some mixture models based on the Birnbaum-Saunders distribution and associated inference
Secure and robust digital image watermarking scheme using logistic and RSA encryption
Selesnick, I.W. Introduction to Wavelets and Wavelet Transforms: A Primer
High-speed continuous wavelet transform processor for vital signal measurement using frequency-modulated continuous wave radar
A survey on change detection and time series analysis with applications
Introduction to Time-Frequency and Wavelet Transforms
A fast signal estimation method based on probability density functions for fault feature extraction of rolling bearings
Image segmentation based on wavelet feature descriptor and dimensionality reduction applied to remote sensing
Orthonormal bases of compactly supported wavelets
A new information theory concept: Information-weighted heavy-tailed distributions. arXiv 2016
Compactly supported one-cyclic wavelets derived from beta distributions
Close approximations for daublets and their spectra. arXiv 2010
A new approach to predicting cryptocurrency returns based on the gold prices with support vector machines during the COVID-19 pandemic using sensor-related data
Lot-size models with uncertain demand considering its skewness/kurtosis and stochastic programming applied to hospital pharmacy with sensor-related COVID-19 data
A new principal component analysis by particle swarm optimization with an environmental application for data science
A new algorithm for computing disjoint orthogonal components in the parallel factor analysis model with simulations and applications to real-world data
Information Theory, Inference and Learning Algorithms

The authors warmly thank the editors and reviewers for their helpful comments, which have led to an improved version of our paper. The authors declare no conflict of interest.