key: cord-0129870-0ac9v2ky authors: Wo'zniak, Michal; Zyblewski, Pawel; Ksieniewicz, Pawel title: Active Weighted Aging Ensemble for Drifted Data Stream Classification date: 2021-12-19 journal: nan DOI: nan sha: bad31276095ad88710222dc590b9aef56d20cfab doc_id: 129870 cord_uid: 0ac9v2ky One of the significant problems of streaming data classification is the occurrence of concept drift, consisting of the change of probabilistic characteristics of the classification task. This phenomenon destabilizes the performance of the classification model and seriously degrades its quality. An appropriate strategy counteracting this phenomenon is required to adapt the classifier to the changing probabilistic characteristics. One of the significant problems in implementing such a solution is the access to data labels. It is usually costly, so to minimize the expenses related to this process, learning strategies based on semi-supervised learning are proposed, e.g., employing active learning methods indicating which of the incoming objects are valuable to be labeled for improving the classifier's performance. This paper proposes a novel chunk-based method for non-stationary data streams based on classifier ensemble learning and an active learning strategy considering a limited budget that can be successfully applied to any data stream classification algorithm. The proposed method has been evaluated through computer experiments using both real and generated data streams. The results confirm the high quality of the proposed algorithm over state-of-the-art methods. The paper focuses on constructing efficient data stream classifiers. Recently, most processed data is characterized by a large volume that comes to be processed as a data stream. This requires that the designed methods will take into account the streaming nature of the data and to select for this purpose appropriate processing so that the employed training algorithm can improve the recent classification model. On the other hand, it should be noted that traditional processing methods assume the stationarity of the classification task and thus do not take into account that probabilistic characteristics may change during the model lifetime. This phenomenon, called concept drift [69] , generally negatively affects the classification quality achieved by the model. Therefore, the proposed classifier training methods should also be able to cope with the elimination of the impact of concept drift on the classifier quality. The appearance of the concept drift [69] is common in many practical decisionmaking tasks, as fraudsters may change the content of the e-mail to get past spam filters. The occurrence of changes in the characteristics of a classification task is usually unpredictable. Still, we can identify factors that influence some problems, as the current pandemic situation related to the spread of the covid- virus very strongly influences consumer behavior. In the data classification task, the classifier aims to predict the class label j (j ∈ M = {1, . . . , M }, where M is a finite set of labels). The decision is made based on the attribute values of a given instance x (x = [x (1) , . . . , x (d) ] T ∈ X , where X is a d-dimensional feature space), i.e., Ψ : X → M. We assume that x and i are observed values of a pair of random variables (X, J) [18] . When we observe that the join distribution between two different time stamps varies, it means that concept drift appears. The concept drift impacts the mentioned probability distributions [20] , as either real or virtual concept drifts. The first one means that changes will impact the shapes of decision boundaries, i.e., the posterior probabilities p(i|x) have been changed [56, 69] . The virtual drift does not alter the shape of decision boundaries, but it impacts the unconditional probability density functions [68] . Olivera et al. [49] noted that although virtual concept drift does not affect the change in decision boundaries and has not been the focus of much research, it is important to note that it can also affect the usefulness of learned decision boundaries by classifiers, e.g., for unrepresentative learning sets used to build a classifier. Therefore, in practical terms, it is not crucial what type of concept drift occurs, at the end of the result in all scenarios, the current model needs to be altered. Another taxonomy of concept drift is based on the rapidity of change. Mainly, we may distinguish (i) sudden (or abrupt) concept drift, when the new concept suddenly replaces the old concept; and (ii) incremental concept drift when we may observe a steady progression from the old concept toward a new one such that at each time step the distance from the old concept increases and the distance to the new one decreases [67] . Minku et al. [45] proposed that probabilistic concept drift should also be considered, i.e., it occurs when there are two alternating concepts, such that initially, one concept dominates and over time, the other concept begins to dominate. However, many researchers, as Huang et al. [26] do not distinguish between these two types of drift. Gradual concept drift should also be mentioned when the two concepts may occur with different intensities during the period of change between the old and the new concept. An interesting phenomenon is periodic changes, referred to as recurring concept drift [59] when previously occurring concepts return. This type of drift is typical for data streams associated with seasonal phenomena. In this case, we may observe a variant of recurring concept drift called a cyclical concept, when a specific number of concepts recur in a particular order [25] . Although many methods have been proposed that attempt to classify nonstationary data streams efficiently, there is still a need for new approaches that are the focus of intense research. Several methods dedicated to such classification can be found in the literature, including algorithms that continuously train a model of the classifier, the so-called online learners. Domingos [16] formulated the following conditions that should be satisfied by such methods: (i) each object must be processed only once during training; (ii) the system should consume only a limited amount of memory and processing time, regardless of the execution time and the amount of data processed; (iii) the training process can be stopped at any time, and its accuracy should not be lower than that of a classifier trained on batch data collected up to a given time. These methods are generally suitable for stationary data streams because all learning objects are equally valid, regardless of when they appeared. However, these methods could be applicable if we could control the process of forgetting objects from an outdated concept. Hence, among the methods used to classify non-stationary data streams, we can distinguish those that incorporate a forgetting mechanism. This approach is based on the assumption that the most relevant data have arrived recently, as they contain features of the current concept. However, their relevance decreases over time. Therefore, narrowing the scope of the data to those that have been read recently can help create a dataset that embodies the actual context. Three strategies are possible here: (i) instance selection using a sliding window that cuts off older instances; (ii) weighting the data based on relevance; and (iii) using bagging and boosting algorithms that focus on misclassified instances. For a sliding window, the main question is how to adjust the window size. On the one hand, a shorter window allows focusing on the emerging concept, although the data may not be as representative as in the case of a longer window. On the other hand, a wider window may result in a mixture of instances representing different concepts. Bifet and Gavalda [7] observed that although classifiers trained on wider windows are characterized by greater stability, they do not respond quickly enough to sudden concept drift. Examples of the relationship between window size, classifier accuracy and standard deviation are discussed in detail in [36] . Therefore, some advanced algorithms dynamically adjust the window size depending on the detected condition, e.g., flora [69] and adwin [8] . Then variances are compared using F-test and chunk size increases if the p-value is less than a predefined threshold. More advanced algorithms can even use multiple windows [39] . Another approach is to use so-called concept drift detectors, i.e., algorithms that can inform the classification system about changes occurring in the data stream distributions. A decision is made based on incoming information about new examples, i.e., labels or classifier performance are required to detect a concept drift. Several approaches aim to detect drift from unlabeled data, but they are more suitable for detecting virtual concept drift [59] . Additionally, many detectors can return both a signal that drift has been unambiguously detected and a warning level has been reached. It is a signal to start collecting new data to update or rebuild the model soon, i.e., when the explicit detection signal is returned. It is also important to realize that drift detection is a non-trivial task since the detection should be performed as soon as possible to replace the outdated model and thus minimize the reconstruction time. On the other hand, false alarms are unacceptable, as they will lead to misadaptation of the model and spending resources where there is no need to do so [24] . Many drift detection methods have been proposed, including cumulative sum [51] , which is a simple sequential analysis technique based on measuring the average value of the input data. The Exponentially Weighted Moving Average [54] combines current and historical observations in a way that can quickly detect changes in the mean using aggregate chart statistics. Well-known ddm (Drift Detection Method ) [22] incrementally estimates the error of the classifier assuming convergence of the classifier training method. eddm (Early Drift Detection Methods) [3] is an extension of ddm, where the window size selection procedure is based on the same heuristics. Interesting drift detectors based on Hoeffding and McDiarmid inequalities were proposed in [10] . Also worth mentioning is adwin [6] , which is based on an adaptive sliding window, being highly suitable for handling sudden drifts. While Nishida's algorithm [8] assumes that the accuracy of the classifier for recent examples is the same as the overall accuracy since the beginning of learning if the target concept is stationary. We also should mention combined models that combine the outputs of several detectors and then send a signal to a learning algorithm based on an ensemble decision. When one has an appropriate set of detectors and a good combination rule, the resulting ensemble detection can be expected to be more accurate and robust to noise. Maciel et al. propose a three-detector ensemble [44] , based on different default detector combinations depending on the selected ensemble sensitivity. Du et al. [17] developed the ensemble pruning technique to choose the most valuable component of the drift detector ensemble. Lapinski et al. [74] also studied several novel models of combined drift detectors. The last approach to classifying non-stationary data streams is ensemble classifiers, which will be discussed in more detail in the next section. Here it is worth noting that the appropriate mechanisms must be developed to adapt the classification model to the changing probabilistic characteristics of the task. These can be divided into the following approaches [34] : (i) dynamic combiners -base classifiers are trained before the model is run, and a combination rule (e.g., for weighted voting or aggregation by changing the weights associated with each classifier in the ensemble) is responsible for adapting to changes [27, 41] ); (ii) updating ensemble members using recent training examples [9, 50, 66] ; (iii) modifying ensemble line-up [29] . An additional complication during the classification of non-stationary data streams is that the delivery of labeled data is required when we train the classifier under the changing probabilistic characteristics. This process can be difficult, primarily due to the frequent delay in delivering the correct classification. For example, the true label for the credit approval task for the client's score is not known until two years. The problem concerning the delay in accessing the label is significant. It should be noted that, especially in cases where the delay is substantial, the correct label may already be delivered when the evaluated object belongs to an outdated concept. Therefore, to the authors' best knowledge, there are currently no known solutions for this problem. However, the issue we want to address in this paper is the labeling cost. The labeling itself generally involves domain experts, so it is expensive. To reduce the expenses incurred for object labeling, it is possible to use methods that have their origin in semi-supervised learning methods, including those based on active learning methods [23] . Such an approach allows the selection of valuable learning instances according to their influence on the classifier's quality, which should be labeled. It thus positively influences the required budget needed for object labeling. In a nutshell, the main contributions of this work are as follows: The structure of this article is as follows. The following section will briefly present works related to classifier ensemble for the non-stationary data stream and active learning. Then awae will be discussed, and the active learning poolbased sampling technique with limited labeling budget bals for data stream classifiers will be presented. The following section contains the experimental results and answers to the researcher questions posed at the beginning of this section. The last part summarizes the work and indicates directions for possible further research. The data stream can be split into into small data blocks called chunks. Therefore learning from these portions of data are called batch-based or chunk-based learning [32] . Choosing the proper size of the chunk is crucial because it may significantly affect the classification [28] . Although there are methods mentioned earlier to adjust the size of the data chunk to changing distributions, most chunkbased classification methods assume that the size of the data chunk is priorly set and remains unchanged during the data processing. Instead of chunk-based learning, the algorithm can learn incrementally (online), i.e., training examples arrive one by one at a given time and are not kept in memory. When processing a non-stationary data stream, we can rely on a drift detector to point moments when data distribution has changed and take appropriate actions. Alternatively, we may use inherent adaptation capabilities of models. One class of such models is classifier ensemble [35] . Firstly, it is worth mentioning work by Krawczyk et al. [32] where a comprehensive overview of classifier ensemble learning from data streams is presented. Due to processing data streams, we may employ online and chunk-based ensemble learning. Within the first group of methods, we should first mention Online Bagging proposed by Oza and Russel [50] , which is inspired by bagging, whereby when a new object arrives, it is used to train each base classifier. The number of recent object presentations for each base classifier is determined by P oisson(1) distribution. The proposed method was later developed by Lee and Clyde [40] in the Bayesian Online Bagging algorithm. In contrast, Bifet et al. proposed Adaptive-Size Hoeffding Trees algorithm. It ensures ensemble diversity by learning Hoeffding trees of different sizes. Leverage Bagging [9] combines the simplicity of bagging with adding more randomization to the input and output of the classifiers by using P oisson(λ) distribution, where the user can determine λ. Leverage Bagging also uses random output codes and allows training new base classifiers when concept drift is detected using adwin. Oza and Russell [50] also authored Online Boosting. It employs a sequence of base models trained during the boosting procedure. When a new learning example is received, each base classifier is updated multiple times according to the P oisson(λ) distribution. For the first classifier in the sequence λ = 1 and in the case of misclassification, the parameter λ is increased for the next base classifier, and in the case of correct classification, it is decreased accordingly. Several important modifications of the above methods are also worth mentioning. Santos et al. [55] proposed Adaptable Diversity-based Online Boosting (adob), which can accelerate the update of base classifiers after concept drifts by making the λ parameter dependent on the prediction quality of base classifiers. Based on this approach, Accuracy Weighted Diversity-based Online Boosting (awdob) [4] employs weighted voting according to previous base classifier evaluations results. Baros et al. developed Boosting-like Online Learning Ensemble (bole) [5] , which used a modification of the adob algorithm involving weakening the requirements to allow the base classifier to vote and use different concept drift detector. Also Ultra Fast Forest Tree (ufft) [21] is worth mentioning. It uses an ensemble of online training binary Hoeffding trees. Lan et al. [38] proposed Ensemble of Online Extreme Learning Machines being a simple aggregation of randomized neural networks trained online. Shan et al. [58] proposed an online active learning ensemble that composed a long-term stable classifier and multiple dynamic classifiers. This method also used an active learning strategy to select instances to label. Let us move on to the second group of chunk-based stream processing methods to build classifier ensembles. Streaming Ensemble Algorithm (sea) [63] is the classifier ensemble with changing lineup, where the individual classifiers are trained on the following data chunks. The base classifiers with the lowest accuracy are removed from the ensemble to keep the model up-to-date. Wang et al. proposed Accuracy Weighted Ensembles (awe) [70] employing the weighted voting rules, where weights depend on the accuracy obtained on the testing data. Brzezinski and Stefanowski proposed Accuracy Updated Ensemble (aue), extending awe by using online classifiers and updating them according to the current distribution [12] . Shan et al. [58] developed an online classifier ensemble consisting of a so-called stable classifier and multiple updating base classifiers to better react to different concept drifts. This approach also employed active learning to minimize the required number of labeled instances. Cano and Krawczyk proposed Kappa Updated Ensemble (kae) [13] that combines online and block-based approaches. kae uses Kappa statistic for dynamic weighing and selection of base classifiers. It is also worth mentioning work by Cohen and Straus [14] where the problem of maintaining time-decaying was formulated and analyzed, and statistics of a data stream. Liu et al. [42] proposed dividing data chunks in case of drift occurrence. Lu et al. [43] compared two chunks based on classifier predictions variance calculated using the original method called subunderbagging. Bifet et al. [8] introduced a method for handling concept drift with varying chunk sizes. Each incoming chunk is divided into two parts: older and new. Empirical means of data in each subchunk are compared using Hoeffding bound. If the difference between two means exceeds the threshold defined by confidence value, then data in the older window is qualified as out of date and is dropped. Later window with data for current concept grows, until next drift is detected and data is split again. This approach allows for detecting drift inside the chunk. Due to the high cost of data labeling, active learning methods are gaining more and more popularity [57] , including data stream classification [2] . Zliobaitė et al. [64] discussed the theoretical framework for predictive model learning using active learning approach and described tree labeling strategies. Kurlej and Wozniak [37] proposed the active learning data stream classification methods based on minimal distance classifiers. The decision about a given instance labeling depends on the distance between an example and the current decision boundary. Nguyen et al. [48] developed an incremental algorithm cslstream, that performs clustering and classification at the same time. Zgraja et al. [72] developed alcc (Active Learning by Clustering for Drifted Data Stream Classification) algorithm which employed query by clustering into new classifier training. Mohamad [46] proposed a similar data stream classifier, which combines uncertainty and density-based querying criteria. Bouguelia et al. [11] proposed a new query strategy based on instance weighting, but the time complexity of the weigh calculation was relatively high. Korycki and Krawczyk [31] applied an active learning strategy to classify data streams and combine it with the self-labeling approach. Ksieniewicz et al. [33] focused on active learning strategy for neural network classifiers. The authors implemented the forgetting mechanism using the catastrophic forgetting phenomenon. Shan et al. [58] employed a mixed strategy (based on active learning uncertainty sampling and random sampling) to select incoming objects to label. This approach was inspired by [71] where the proposed algorithm reacts to concept drift by using a mixed strategy to choose the instances to be labeled. Let's Π denotes a pool of L base classifiers Π = {Ψ 1 , Ψ 2 , Ψ k ..., Ψ L } to be used by the combined classifierΨ . In this study, we employ weighted voting, where weights are assigned to each base classifier, i.e.,Ψ where w k is the weight that is assigned to the kth individual and [ ] denotes Iverson's bracket. In this section, we will discuss a novel awae algorithm (Active Weighted Aging Ensemble), which can adapt the model to changes caused by the appearance of concept drit. On the other hand, it reduces the cost of labeling the data in each data chunk by using an active learning strategy with a maximum labeling budget assumed. It is worth noting here that the proposed algorithm is inspired by previous works of the team related to the wae algorithm [70] , among others. However, the presented version has a number of modifications in terms of reducing the label demand and proposes new mechanisms for calculating weights for base classifiers and ensemble pruning. We will first briefly introduce the most critical components of the awae algorithm, propose their implementation, and then present a run-time analysis of the proposed method. Instead of drift detection, awae tries to construct a self-adapting classifier ensemble that can adapt to the changes in a data stream. We assume that the classified data stream is given in the form of data chunks denoted as DS k , where k is the chunk index. The concept drift could appear in the incoming data chunks. Therefore based on each chunk, one individual is trained and we check if it could form a valuable ensemble with the previously trained models. If the pool size of L classifiers is exceeded, then the least valuable classifier must be removed from the pool (we choose L out of L + 1 individuals). This choice (ensemble pruning) is made based on a proposed criterion that is either the chosen diversity measure (in the case of pre pruning, i.e., before the weights are computed) or a linear combination of diversity measure and accuracy (in the case of post pruning after the weights of the base classifiers are calculated). Three procedures are used to calculate the weights. The weight calculating procedure calculates weights according to the importance of a given classifier for ensemble quality. The rejuvenating procedure artificially reduces the residence counter of classifiers with a high impact on the quality of the ensemble classifier, i.e., it weakens the forgetting effect associated with the selected classifiers. In contrast, aging procedure reduces the weights of the base classifiers depending on the value of the residence index. The detailed description of the awae is presented in Alg. 1. awae has built-in mechanisms to ensure a maximum ensemble size. If this is exceeded, a pool of L base classifiers with the best evaluation function value is selected. Pruning is possible either as soon as a classifier is added to the ensemble or only after determining its weight used by the combination rule. In the first case, pruning is performed if the variable pre pruning == 1 and the selected diversity measure (GeneralizedDiversity proposed by Partridge and Krzanowski [52] was selected in the experiments) is used for evaluation [35] . When post pruning is selected (variable post pruning == 1), the following criterion is used where P a (Π) stands for the accuracy of the classifier ensemble, diversity(Π) denotes its diversity, while α ∈ [0, 1] is user-defined parameter used for the linear combination of accuracy and diversity. Both metrics are calculated on the basis of incoming data chunk. We propose the following methods of weight calculation: The same weights for each classifier in the pool, i.e., majority vote is use as the combination rule Input: data stream DS = {DS1, DS2, ...}, data chunk size, train -classifier training function using active learning stategy; L -maximal ensemble size; pruning -pruning procedure; pre pruning -pre pruning procedure on/off; post pruning -post pruning procedure on/off; criterion p -pruning criterion; diversity -diversity measure; weighting -weight calculating procedure; aging -aging procedure; reindex -reindexing of classifier identifiers starting from 1; reju -rejuvenating procedure. 1: k := 0 2: Π = ∅ 3: while new data chunk DS k do 4: if pre pruning then 7: Π ← pruning(Π, L, diversity) 8: end if 9: reindex(Π) 10: w := 0 11: weighting(Π, DS k ) 12: reju(Π, DS k ) 13: aging(Π, DS k ) 14: for j := 1 to |Π| do 15: if w(Ψj) == 0 then 16: Kuncheva's weights -suggested by Kuncheva [35] w Weights proportional to accuracy related to the whole ensemble accuracy if w(Ψ i ) < θ then w(Ψ i ) = 0, where θ ∈ [0, 1] is the parameter which responsible for the removing less important classifiers. Weights proportional to accuracy related to the whole ensemble accuracy using bell curve if w(Ψ i ) < θ then w(Ψ i ) = 0, where θ ∈ [0, 1] as previously stands for the parameter responsible for the removing less important classifiers. We propose the following aging methods: Weight aging proportional to classifier accuracy where itter(Ψ ) stands for the residence counter of the classifier Ψ in Π), i.e., how many iterations have elapsed since a given base classifier was trained. if w(Ψ i ) < θ then w(Ψ i ) = 0, where θ ∈ [0, 1] as previously stands for the parameter responsible for the removing less important (old enough) classifiers and δ is user-defined parameter responsible for the aging rate. Gaussian aging if w(Ψ i ) < θ then w(Ψ i ) = 0, where θ ∈ [0, 1] as previously stands for the parameter responsible for the removing less important (old enough) classifiers and ξ is user-defined parameter responsible for the aging rate. We propose to rejuvenate an individual classifier if it has a big impact on the classifier ensemble, i.e., if its weight is bigger than average weight of base classifiers. Then the residence counter of a base classifier in the ensemble (itter) is decreased. The idea is presented in Alg. 2, where [] stands for entier. The training algorithm train used by awae algorithm also takes into account the possibility of limited label access, in which the Budget Active Labeling Strategy (bals) algorithm [73] -based on the random budget and active learning paradigm -is employed. The decision about the object labeling in each consecutive data chunk depends on two parameters: threshold t -which is responsible for choosing the "interesting" examples, i.e., if support function related with the decision is lower than a given threshold the object seems to be interesting and algorithm is asking for its label. -budget b -which defines the percent of instances in each data chunk, for which a label will be randomly obtained. The addition of the randomly chosen budget to the instances selected using the threshold-based active learning method aims at increasing the generalization ability of a classifier while reducing the possibility of overfitting. The exact procedure of the processing performed by the bals algorithm is presented in the Alg. 3, and below is a brief description of the functions used in it: update classif ier() -Updates the classification model using the labeled instances from a given data chunk, or with whole data chunk in case of the first iteration. active learning() -Selects instances for labeling from a given data chunk based on the support function threshold, which can be interpreted as a distance from the decision boundary in the case of binary classification problems. -random budget() -Selects a random percent of samples from a given data chunk for labeling. -get labels() -Obtains the labels for the selected instances. Input: input data stream, data chunk size Ψ -classification algorithm, t -threshold value, b -budget value. 1: k := 0 2: while new data chunk DS k do 3: if k == 0 then 4: Ψ ← update classif ier(Ψ, DS k ) 5: else 6: X k = active learning(t, DS k ) 7: X k ← random budget(b, DS k ) 8: LS k = get labels(X k ) 9: Ψ ← update classif ier(Ψ, LS k ) 10: end if 11: k := k + 1 12: end while The awae algorithm can be decomposed into a few stages. First, for each data chunk, a new base classifier is trained. Training time depends on the type of classification algorithm but is constant for each instance. This results in computational complexity of O(|DS i |) for each new classifier training. Then, the rejuvenation process is performed for each base model, with the computational complexity of O(|Π|). If the maximal ensemble size L is exceeded, the pruning process is performed. By limiting the possible combinations of base classifiers in the pool only to those containing |Π| − 1 elements, the computational complexity of pruning is O(L). Next, the weight calculation and aging are performed for each base model with a complexity of O(|Π|) and finally, all weights are updated with the same computational complexity (O(|Π||)). The active learning bals algorithm is composed of two stages. First, it calculates each sample's distance from the decision boundary (which is an absolute difference of obtained support and .5), which has the complexity of O(|DS i |). Then the objects are sorted according to the obtained distance and only those whose distance values do not exceed the set threshold t are used. This operation has the computational complexity of O(|DS i | log |DS i |). In the second stage, bals uses a simple sampling without replacement in order to choose the random budget b of instances from each data chunk DK i . This operation has the computational complexity of O(b log b). In this section, we will describe the details of a conducted experimental study that can assess the usefulness of awae The experiments are designed to answer the following research questions: RQ1. What is the best parameter setting for awae and how it impact the behaviour of the proposed algorithm as its ability to classify data streams with concept drift? RQ2. How flexible is awae to be used with the different classifiers? RQ3. How does awae behave when there is limited access to the labels? RQ4. How does the awae compare to the state-of-the-art algorithms,explicitly designed for the drifting data streams classification tasks? RQ5. How do the selected state-of-the-art algorithms behave when using the proposed bals active learning strategy? Data streams. In order to perform the experimental evaluation of the awae, data streams -both synthetic and based on real concepts -with various characteristics were used. Synthetic data streams were generated using the Python stream-learn library 1 . Three balanced streams were prepared, differing in the concept drift type, and replicated five times based on a random state value to stabilize the results and enable statistical analysis. The following parameters characterized these streams: data chunks number -200, -chunk size -250, -global label noise -1%, -concept drift typesudden, gradual, and incremental, -drifts number -10, -number of features -8. The second data source was a generator, which enabled the creation of data streams based on static concepts originating from known benchmark static datasets available in data repositories such as uci [19] or keel [1] . The datasets used in the stream generation process are presented in Table 1 . The generation procedure is based on the temporal interpolation of normalized random projections of original data set into a given subspace, which, depending on the interpolation, may generate sudden drifts (nearest-neighbor interpolation) and incremental drifts (cubic interpolation) [30] . Finally, the experiments were also carried out using 9 real benchmark data streams [13, 60] and presented in Table 2 . Multiclass streams have been binarized and the longest possible fragments have been cut from them, ensuring both classes are present in data chunks containing 250 instances. Reference methods. Throughout the conducted experiments the proposed method was compared with a selection of state-of-the-art data stream classification algorithms: -Streaming Ensemble Algorithm (sea) [62] -which trains a new base classifier on each incoming data chunks, and adds it to the maintained classifier pool. In case of exceeding the set maximum pool size, the worst model is removed. -Accuracy Weighted Ensemble (awe) [65] -which gives individual base classifiers a weight on the basis of mean squared error. -Accuracy Updated Ensemble (aue) [12] -which modifies the awe algorithm in order to allow updating of base models. -Learn++ for Non Stationary Environments (nse) [15] -which combines base classifiers using dynamically weighted majority voting, where weights are calculated based on classifiers' errors. -Online Active Learning Ensemble (oale) [58] -which employs a long-term stable classifier and multiple dynamic models, supplemented by a hybrid labeling strategy. Each ensemble method contained a base classifier pool with maximum size equal to 10 model. The following parameters were selected for base classifiers: -Gaussian Naïve Bayes (gnb) -where the portion of the largest feature variance that is added to variances for calculation stability is equal to 1e − 9, -Hoeffding Tree (ht) -where the number of instances that should be observed before leaf split attempts is equal to 200 and Hellinger distance is used as a split criterion, -Multilayer Perceptron (mlp) -with a single hidden layer containing 100 artificial neurons, ReLU activation function, adam optimizer, constant learning rate, and 200 maximum iterations. Experimental protocol. All experiments were conducted using the streamlearn library and based on the Test-Than-Train [32] evaluation protocol. Statistical analysis. It was performed using the t-test [61] during all of the conducted pairwise comparisons. The results of all of the performed tests were reported at a significance level α = 0.05. Reproducibility. All experiments were conducted using the stream-learn library and based on the Test-Than-Train [32] evaluation protocol. As we mentioned above all ensemble methods have been tested for three types of base classifiers, namely, Gaussian Naive Bayes (gnb), Hoeffding Tree (ht), and Multilayer Perceptron (mlp) according to scikit-learn and scikit-multiflow implementations [47, 53] . In the case of synthetic streams, accuracy score values are reported, while in experiments containing streams based on real concepts -in order to eliminate the impact of possible data imbalance -a balanced accuracy score was used. The experiments presented in this article can be replicated using the code available in the GitHub repository 2 . We proposed three groups of experiments to answer the formulated research questions. For all experiments carried out in the restricted access to labels scenario, the following values of the bals algorithm hyperparameters were adopted: The above values were selected on the basis of previous research results on the bals algorithm. Experiment 2 -Comparison with state-of-the-art on synthetic streams The second experiment presents a comparative analysis of the awae method (in the default and optimized configuration) with state-of-the-art methods depending on the type of concept drift and the base classifier used, measuring their performance on the examples of synthetic data streams. The presented results have been extended by statistical analysis using the t-student test. The study covers both scenarios with full and limited labeling. The oale algorithm in the second experiment was deprived of the active learning module to ensure a reliable comparison. The third experiment expands the research from the second one by analyzing real streams with real concepts based on the same pool of comparative methods and identical assumptions. Here, the oale algorithm is used together with described bals approach, which is similar to the active learning technique used originally by the authors of the Online Active Learning Ensemble. Active Weighted Aging Ensemble is characterized by a reasonably strong hyperparameterization, allowing for the adaptation of the modeling procedure individually to each problem under consideration. Unfortunately, such an approachdue to evident data peeking coming from fine-tuning -would hinder the proper comparative assessment of the recognition effectiveness against the state-of-theart methods. Therefore, a preliminary experiment was designed and carried out to select global strategies for scenarios with full and limited labeling. The conducted review indicated, however, that for both of these strategies, the best hyperparameterization is consistent -showing the same values -differing in the other terms, like drift type, base classifier, pruning approach and classifier weighting method, therefore this chapter presents only the results for the full labeling scenario. Even the reduced analysis leads to many tables with the statistical analysis of individual configurations. Therefore the article has been extender with supplementary materials 3 in which there is a full set of reference results. This chapter has been limited to the visualization of statistical relationships of models with different configurations (Fig. 1) and table with the selected hyperparameterization for the particular scenarios (Table 3 ). The initial phase of the analysis eliminated the rejuvenation parameter. It combined pruning criterion (Section II.A.3) as essential for processing, leaving four factors tested in six pairs using sudden, gradual and incremental drift using three classification algorithms. The analyzed factors were: pruning location -post-or pre-pruning (2 values), -aging -proportional, constant and gaussian (3 values), -weight calculation -same, kuncheva, proportional to accuracy and bell curve (4 values), -θ -range from 0 to 10% (5 values). When comparing pruning location with the aging, a significant advantage of proportional aging was visible, with slight differences between pre-and postpruning, in some cases showing a statistically insignificant advantage of postpruning. These relations were not dependent on the drift nature or the base classifier used. In the juxtaposition of the weight calculation with the aging, the high stability of the proportional strategy was also evident. Still, it turned out to be slightly inferior, without a significant statistical difference, to the constant strategy paired with the Kuncheva weight calculation method for each case except for gradual drifts using mlp as a base classifier. On the other hand, the comparison of the weight calculation with the pruning location showed a significant advantage of Kuncheva weight over all other strategies, especially with the application of post-pruning. Comparing the value of θ parameter with the aging would theoretically suggest the selection of a constant strategy -leading to the highest results -, but it also shows a strong dependence of this parameter on theta value, building a relatively narrow window in which it is possible to achieve the local optimum. A minimally worse but statistically dependent result, already desensitized to theta value, is achieved with proportional aging, which seems to be a more universal choice of default awae hyperparameterization. The comparison of the value of θ parameter with the pruning location shows that, up to a certain range, there is a robust and proportional dependence of the model quality on θ parameter, especially enhanced by post-pruning, which, however, is interrupted suddenly around 0.1 of theta parameter, leading to a significant decrease in predictive ability of a system. The last comparison between the θ parameter and the weight calculation is the only one that shows the differences in the hyperparameterization of the methods for different drift dynamics. It is clearly visible here that the bell method of weight calculation -which was not apparent in previous analyzes -works best for streams with gradual and incremental characteristics, when the Kuncheva method promotes streams with sudden drift. As can be seen from the above analysis, the available parameterization of the awae method allows for its strong adaptation to the problems. Nevertheless, the quality adjustments introduced by the changes does not always lead to statistically significant changes. Therefore, an attempt was made to select the default configuration, carried out on the review of all 120 combinations, among which the most common among the solutions statistically dependent on the best, for scenarios with full labeling, were the following settings: θ -5%, -weight calculationbell, -agingconstant. pruning -both pre-and post-pruning were dependent to the best solution. In the case of the active learning scenarios, this hyperparameterization only changed when Hoeffding Trees was used as the base classifier. Such configuration is presented in Table 3 . The evaluation of data stream classification methods with the use of synthetic streams, thanks to the possibility of replication of many streams sharing the same characteristics, allows for clear identification of their basic properties, desensitized to the observations of outliers of detailed parameters of the detailed concept. The relevant analysis was carried out in the second experiment for scenarios with full labeling (Table 4 ) and using active learning (Table 5 ). In the case of the full labeled synthetic streams, mlp clearly turns out to be the best classification algorithm, which allows for the highest classification quality for each type of drift. Additionally, it is the algorithm that works best when paired with the proposed awae algorithm. Both in the case of sudden, gradual, and incremental drifts, it achieves a significant statistical advantage over any of the state-of-the-art methods. Relatively the worst is the match between ht-awae, where the statistical advantage is achieved only over the sea, and awe methods, and the ht-aue match is the best, which is still statistically significantly worse than the mlp-awae. The gnb-awae pair allows obtaining results dependent on nse, with which together it is the best choice when using the Gaussian Naive Bayes classifier. The results for the active learning strategy are similar. Here also globally the best combination is mlp-awae. The presentation of the processing efficiency of classification algorithms in the case of real streams is optimal when visualizing the accumulative sums of the flow efficiency, thanks to which both the dynamics of learning new concepts and the possible tendency to reduce the discriminative abilities of the recognition system in the course of neutralization to successive drifts are clearly visible. Therefore, the experimental evaluation of the awae algorithm on streams with real concepts ( Figure 2 ) and real streams (Figure 3 ) was carried out using appropriate plots. In the case of the classification of streams with real concepts, the observations from Experiment 2 are replicated, where the mlp-awae combination turned out to be the best match. In the case of learning with full labeling, mlp-nse is highly competitive with it, and ht-aue is equal competition in the case of active learning scenarios. This is visible in both cubic and nearest drift. The only cases in which the mlp-awae combination do not turn out to be the optimal or optimal-dependent processing strategy is a part of real streams, in particular from the insects group, where it differs significantly from other comparative methods. It probably results from the different characteristics of the concept, where the default parameterization defined by the outcome of Experiment 1 turns out to be a wrong strategy. It can be seen that awae establishes similar parameters when dealing with the full labeled data stream. The preference is constant aging with θ = 5% and the method of weight calculating consistent with Eq. 6. For the active learning scenario, we may observe that similar parameter values are determined when mlp and gnb are chosen as base classifiers. The situation is different when ht is selected when the suggested parameters strongly depend on the drift type present (RQ1 answered). awae can work with any type of base classifier, but in the experiments, it was limited to only three methods, of which the best quality is obtained by combining awae with mlp (RQ2 answered). The same combination also performs best when compared to state-of-the-art methods. For synthetic streams, for all drift types, we observe statistically significantly better quality of mlp-awae classification compared to reference methods using mlp as base classifier. In the case of using gnb, awae is statistically significantly better than all methods except nse, which also for this classifier achieves better quality than the other methods. When ht is used, aue is by far the best method, especially in the case of sudden drift. However, it should be noted that a similar relationship can be observed for real data streams (RQ3 and RQ4 answered). Similar observations can be made for real data streams when the active learning strategy bals is applied, the mlp-awae performs best. The exceptions are selected streams from the insects group, where awae performs quite average, while the best quality is mainly characterized by aue and awe. For all real data streams, olae and nse perform the worst for scenarios with bals. However, it should be noted here that olae has its own built-in active learning method that bals replaced for the experiments, which did not perform well for it and making a general conclusion about olae quality in the case of its modification would be unfair (RQ5 answered). To summarize, the awae algorithm can be recommended as an effective tool for data stream processing, emphasizing, however, that this method is quite sensitive to configuration and requires special attention when used for a particular task. It is visible for some real data streams, where a lack of appropriate parameter settings may deteriorate the predictive performance. This study aimed to develop a novel, effective framework for a drifted data stream classification task. We proposed the Active Weighted Aging Ensemble algorithm that utilizes the changing ensemble lineup to appropriately react to concept drift and active learning strategy to reduce the budget required for upcoming instance labeling. The research conducted on benchmark data streams confirmed the effectiveness of the proposed solution. It highlighted its strengths in comparison with state-of-the-art methods. It is also worth mentioning that estimated computational complexity is acceptable and comparable to the benchmark algorithms. This work is a step forward towards using active and ensemble learning to design effective classification models for drifted data streams. The obtained results encourage us to continue working on this concept. Future research may include: -Application of the proposed method to imbalanced data stream classification, especially considering the application of data preprocessing for data balancing. -Employing a drift detection techniques to speed-up the reaction to the concept drift, especially in the case of abrupt changes. -Evaluation of how awae is robust to different distributions of the label and feature noise. -Using awae on massive and high-speed data streams requires a deeper study on the effective ways of its parallelization. -Proposing an effective active learning strategy for multi-modal data stream processing, especially for possible label propagation among modalities. -Application of awae to a real-world data stream susceptible to the presence of data stream, i.e., medical or banking data. Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework KDD '11, Association for Computing Machinery Early drift detection method Accuracy weighted diversity-based online boosting A boosting-like online learning ensemble Kalman filters and adaptive windows for learning in data streams Learning from time-changing data with adaptive windowing Learning from time-changing data with adaptive windowing Leveraging bagging for evolving data streams Online and non-parametric drift detection methods based on hoeffding's bounds An adaptive streaming active learning strategy based on instance weighting Accuracy updated ensemble for data streams with concept drift Kappa updated ensemble for drifting data stream mining Maintaining time-decaying stream aggregates Incremental learning of concept drift from streaming imbalanced data A general framework for mining massive data streams A selective detector ensemble for concept drift detection Pattern Classification UCI machine learning repository A survey on concept drift adaptation Forest trees for on-line data Learning with drift detection Learning cost-sensitive active classifiers Adaptive Filtering and Change Detection Learning from streaming data with concept drift and imbalance: an overview Tracking drift types in changing data streams Adaptive mixtures of local experts Streaming chunk incremental learning for class-wise data stream classification with fast learning speed and low structural complexity Dynamic weighted majority: a new ensemble method for tracking concept drift Data stream generation through real concept's interpolation Combining Active Learning and Self-Labeling for Data Stream Mining Ensemble learning for data stream analysis: A survey Data stream classification using active learned neural networks Classifier ensembles for changing environments Combining Pattern Classifiers: Methods and Algorithms -second edition Impact of window size in active learning of evolving data streams Active learning approach to concept drift problem Ensemble of online sequential extreme learning machine Using multiple windows to track concept drift Lossless online bayesian bagging The weighted majority algorithm Weighted ensemble with dynamical chunk size for imbalanced data streams in nonstationary environment Adaptive chunk-based dynamic weighted majority for imbalanced data streams with concept drift A lightweight concept drift detection ensemble The impact of diversity on online ensemble learning in the presence of concept drift A bi-criteria active learning algorithm for dynamic data streams Scikit-multiflow: A multi-output streaming framework Concurrent Semi-supervised Learning with Active Learning of Data Streams Tackling virtual and real concept drifts: An adaptive gaussian mixture model Online bagging and boosting Continuous Inspection Schemes Software diversity: practical statistics for its measurement and exploitation Scikit-learn: Machine learning in Python Exponentially weighted moving average charts for detecting concept drift Speeding up recovery from concept drifts Incremental learning from noisy data Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning Online active learning ensemble framework for drifted data streams Concept drift detection and model selection with simulated recurrence and ensembles of statistical detectors Challenges in benchmarking stream learning algorithms with real-world data How to design the fair experimental classifier evaluation A streaming ensemble algorithm (sea) for large-scale classification A streaming ensemble algorithm (sea) for large-scale classification Active learning with drifting streaming data Mining concept-drifting data streams using ensemble classifiers Resampling-based ensemble methods for online class imbalance learning Characterizing concept drift Effective learning in dynamic environments by explicit context tracking Learning in the presence of concept drift and hidden contexts Application of combined classifiers to data stream classification Active learning over evolving data streams using paired ensemble framework Active learning by clustering for drifted data stream classification Combination of active and random labeling strategy in the non-stationary data stream classification An empirical insight into concept drift detectors ensemble strategies This work is supported by the CEUS-UNISONO programme, which has received funding from the National Science Centre, Poland under grant agreement No. 2020/02/Y/ST6/00037.