key: cord-0434916-hsfmz96r
authors: Kozal, Jędrzej; Guzy, Filip; Woźniak, Michał
title: Employing chunk size adaptation to overcome concept drift
date: 2021-10-25
journal: nan
DOI: nan
sha: 65420ad635cd7aef7be3fe1cc8257a64ff6b59c0
doc_id: 434916
cord_uid: hsfmz96r

Modern analytical systems must be ready to process streaming data and respond correctly to changes in data distribution. Such changes are called concept drift, and they may harm the quality of the models in use. Because concept drift can appear at any time, the algorithms must continuously adapt the model to the changing data distributions. This work focuses on non-stationary data stream classification, where a classifier ensemble is used. To keep the ensemble up to date, new base classifiers are trained on the incoming data blocks and added to the ensemble while outdated models are removed. One of the problems with this type of model is reacting quickly to changes in data distributions. We propose a new Chunk Adaptive Restoration framework that can be applied to any block-based data stream classification algorithm. The proposed algorithm adjusts the data chunk size when concept drift is detected to minimize the impact of the change on the predictive performance of the model. The experimental research, backed up by statistical tests, shows that Chunk Adaptive Restoration significantly reduces the model's restoration time.

Data stream mining focuses on knowledge extraction from streaming data, mainly to construct predictive models that assign arriving instances to one of the predefined categories. This process is characterized by additional difficulties that arise when the data distribution evolves over time. This is visible in many practical tasks, such as spam detection, where spammers keep changing the message format to cheat anti-spam systems. Another example is medical diagnostics, where new SARS-CoV-2 mutations may cause different symptoms, which forces doctors to adapt and improve diagnostic methods [82].

The phenomenon mentioned above is called concept drift, and its nature can vary in both character and rapidity. It forces classification models to adapt to new data characteristics and forget old, useless concepts. An important characteristic of such systems is their reaction to concept drift, i.e., how much predictive performance deteriorates when it occurs and when the classification system will reach an acceptable predictive quality for the new concept. We should also consider another limitation: the classification system should be ready to classify incoming objects immediately, and dedicated computing and memory resources are limited.

Data processing models used by stream data classification systems can be roughly divided into two categories: online (object by object) processing (online learners), or block-based (chunk by chunk) data processing (block-based learners) [27]. Online learners require model parameters to be updated when a new object appears, while the block-based method requires updates once per batch. The advantage of online learners is their fast adaptation to concept drift. However, in many practical applications, the computational effort of updating models after processing each object is unacceptable. The model update can require many operations that involve changing data statistics, updating the model's internal structure, or learning a new model from scratch. These requirements can become prohibitive for high-velocity streams. Hence, block-based data processing, which requires less computational effort, is more popular. However, it limits the model's potential for quick adaptation to changes in data distribution and fast restoration of performance after concept drift. In consequence, the proper selection of the chunk size is a significant problem. A smaller data block size results in faster adaptation but increases the overall computing load. On the other hand, larger data chunks require less computation but result in a lower adaptive capacity of the classification model. Another valid consideration is the impact of chunk size on prediction stability. Models trained on smaller chunks typically have larger prediction variance, while models trained on larger chunks tend to have more stable predictions when the data stream is stationary. If concept drift occurs, a larger chunk increases the probability that data from different concepts will be placed in the same batch. Hence, selecting the chunk size is a trade-off encompassing computation power, adaptation speed, and prediction variance.

The trade-off described above includes features that are equally desired in many applications. Especially consumption of computation power and adaptation speed are both important when processing large data streams. We propose a new method that alleviates the downfalls of choosing between small or large chunk sizes by dynamically changing the current batch size. More precisely, our work introduces the Chunk-Adaptive Restoration (CAR), a framework based on combined drift and stabilization detection techniques that adjusts the chunk sizes during the concept drift. This approach slightly redefines the previous problem based on the observation that for many practical classification tasks, a period of changes in data distributions is followed by stabilization. Hence, we propose that when the concept drift occurs, the model should be quickly upgraded, i.e., the data should be processed in small chunks, and during the stabilization period, the data block size may be extended. The advantage of the proposed method is its universality and the possibility of using it with various chunk-based data stream classifiers.

This work offers the following contributions:

• Proposing the Chunk-Adaptive Restoration framework to empower fluent restoration after concept drift appearance.

• Formulating the Variance-based Stabilization Detection Method, a technique complementary to all concept drift detectors that simplifies chunk size adaptation and metrics calculation.

• Employing Chunk-Adaptive Restoration for the adaptive data chunk size setting for selected state-of-the-art algorithms.

• Introducing a new stream evaluation metric, Sample Restoration, to show the gains of the proposed methods.

• Experimental evaluation of the proposed approach based on various synthetic and real data streams and a detailed evaluation of its usefulness for the selected state-of-art methods.

This section provides a review of the related works. Firstly, we will discuss challenges specific to the learning from non-stationary data streams. Next, we discuss different methods of processing data streams. Following, we describe existing drift detection algorithms and ensemble methods. We continue by reviewing existing evaluation protocols and computational and memory requirements. We conclude this section by providing examples of other data stream learning methods that employ variable chunk size.

A data stream is a sequence of objects described by their attributes. In the case of a classification task, each learning object should be labeled. The number of items may be vast, potentially infinite. Observations in the stream may arrive at different times, and the time intervals between their arrival could vary considerably. The main differences between analyzing data streams and static datasets include [56] :

• No one can control the order of incoming objects

• The computation resources are limited, but the analyzer should be ready to process the incoming item in a reasonable time

• The memory resources are also limited, but the data stream size may be huge or infinite, which makes memorizing all the items impossible

• Data streams are susceptible to change, i.e., data distributions may change over time

• The labels of arriving items are not free; in some cases they are impossible to obtain or are available only with delay (e.g., in banking, for the credit approval task, after a few years)

The canonical classifiers usually do not consider that the probabilistic characteristics of the classification task may evolve [65] . Such a phenomenon is known as concept drift [30] and a few concept drift taxonomies have been proposed. The most popular consider how rapid the drift is, then we can distinguish sudden drift and incremental one. An additional difficulty is a case when, during the transition between two concepts, objects from two different concepts appear for some time simultaneously (gradual drift). We can also take into consideration the influence of the probabilistic characteristics on the classification task [33] :

• virtual concept drift does not impact the decision boundaries but affects the probability density functions [66] , and Widmer and Kubat [30] imputed it rather to incomplete data representation than to the true changes in concepts,

• real concept drift affects the posterior probabilities and may impact the unconditional probability density function [30] .

The data stream can be divided into small portions of data called data chunks. This method is known as batch-based or chunk-based learning. Choosing the proper chunk size is crucial because it may significantly affect the classification [54]. Unfortunately, the unpredictable appearance of concept drift makes this difficult. Several approaches may help overcome this problem, e.g., using different windows for processing data [68] or adjusting chunk size dynamically [30]. Unfortunately, most chunk-based classification methods assume that the size of the data chunk is set in advance and remains unchanged during data processing. Instead of chunk-based learning, the algorithm can also learn incrementally (online). Training examples arrive one by one at a given time, and they are not kept in memory. The advantage of this solution is the need for small memory resources. However, the computational effort of updating models after processing each individual object is unacceptable, especially for high-velocity data streams, e.g., in Internet of Things (IoT) applications.

When processing a non-stationary data stream, we can rely on a drift detector to point moments when data distribution has changed and take appropriate actions. The alternative is to use inherent adaptation properties of models (update & forget). In the following subsection, we will discuss both of these approaches.

A drift detector is an algorithm that can inform about any changes taking place within data stream distributions. The data labels or a classifier's performance (measured using any metric, such as accuracy) are required to detect a real concept drift [69]. We have to realize that drift detection is a non-trivial task. The detection should be done as quickly as possible to replace an outdated model and minimize restoration time. On the other hand, false alarms are unacceptable, as they lead to incorrect model adaptation and to spending resources where there is no need for it [70]. DDM (Drift Detection Method) [71] is one of the most popular detectors; it incrementally estimates the error of a classifier. Because we assume the convergence of the classifier training method, the error should decrease with the appearance of subsequent learning objects [72]. If the reverse behavior is observed, then we may suspect a change of probability distributions. DDM uses the three-sigma rule to detect a drift. EDDM (Early Drift Detection Method) [73] is an extension of DDM, where the window size selection procedure is based on the same heuristics. Additionally, the distance error rate is used instead of the classifier's error rate. Blanco et al. [74] proposed drift detectors that use a non-parametric estimation of the classifier error employing Hoeffding's and McDiarmid's inequalities.
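The three-sigma rule used by DDM can be illustrated with a minimal sketch. It follows the commonly described form of the detector: the running error rate p and its standard deviation s are tracked, the minimum of p + s is remembered, and a warning or drift is signalled when p + s exceeds that minimum by two or three standard deviations. The class name `SimpleDDM`, the 30-sample warm-up, and the strict inequality are assumptions made for this illustration, not details taken from [71].

```python
import math


class SimpleDDM:
    """Illustrative DDM-style detector (hypothetical helper, not a library API)."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples      # assumed warm-up length
        self.warning_level = warning_level  # two-sigma warning zone
        self.drift_level = drift_level      # three-sigma drift rule
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a misclassified object, 0 otherwise; return the state."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)   # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            # Remember the best (lowest) error level seen so far.
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()                        # start over on the new concept
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"
```

Feeding the detector a stream whose error rate jumps from about 10% to 100% makes the drift state appear a few objects after the change, while the stationary prefix produces no drift signal.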

One of the most promising data stream classification research directions, which usually employs chunk-based data processing is the classifier ensemble approach [27] . Its advantage is that the classifier ensemble can easily adapt to the concept drift using different updating strategies [60] :

• Dynamic combiners: individual classifiers are trained in advance, and they are not updated anymore. The ensemble classifier adapts to changing data distribution by changing the combination rule parameters.

• Updating training data: incoming examples are used to retrain component classifiers (e.g., online bagging [62]).

• Updating ensemble members [64, 67] .

• Changing ensemble lineup: replacing outdated classifiers in the ensemble, e.g., new individual models are trained on the most recent data and added to the ensemble. An ensemble pruning procedure is applied, which chooses the most valuable set of individual classifiers [61].

A comprehensive overview of techniques using classifier ensembles was presented by Krawczyk et al. [27].

Because this work mainly focuses on improving classifier behavior after concept drift appears, apart from the classifier's predictive performance, we should also consider memory consumption, the time required to update the model, and the time needed to make a decision. However, it should also be possible to evaluate how the model reacts to changes in the data distribution. Shaker and Hüllermeier [41] presented a complete framework for evaluating the recovery rate, including two metrics: restoration time and maximum performance loss. In this framework, the notion of pure streams was introduced, i.e., streams containing only one concept. Two pure streams S_A and S_B are mixed into a third stream S_C, starting with concepts only from the first stream and gradually increasing the percentage of concepts from the second stream. Restoration time was defined as the length of the time interval between two events: first, the performance measured on S_C drops below 95% of the S_A performance, and then the performance on S_C rises above 95% of the S_B performance. The maximum performance loss is the maximum difference between the S_C performance and the lowest performance on either S_A or S_B. Zliobaite et al. [75] proposed that evaluating the profit from a model update should consider the memory and computing resources involved in that update.

While designing a data stream classifier, we should also consider the computation power and memory limitations and that we usually have limited access to data labels. These data stream characteristics pose the need for other algorithms than ones previously developed for batch learning, where data are stored infinitely and persistently. Such learning algorithms cannot fulfill all data stream requirements, such as memory usage constraints, limited processing time, and one scan of incoming examples. However, simple incremental learning is usually insufficient, as it does not meet tight computational demands and does not tackle evolving nature of data sources [58] .

Constraints on memory and time have resulted in different windowing techniques, sampling (e.g., reservoir sampling), and other summarization approaches. Also, we have to realize that when the concept drift appears, data from the past may become irrelevant or even harmful for the current models, deteriorating the predictive performance of the classifiers. Thus an appropriate implementation of a forgetting mechanism (where old data instances are discarded) is crucial.

Dynamic chunk size adaptation was proposed in some works earlier [79] [80] [81] . Liu et al. [79] utilize information about the occurrence of drift from drift detector. If drift occurs in the middle of the chunk, data is divided into two chunks, hence dynamic chunk size. If there is no drift inside the chunk, the whole batch is used. In the prepared chunk, the majority class is undersampled. A new classifier is trained and added to the ensemble, and older classifiers are updated. Lu et al. [80] also utilize an ensemble framework for imbalanced stream learning. In this approach, chunk size grows incrementally. Two chunks are compared based on ensembles predictions variance. An algorithm for calculating prediction variance called subunderbagging is introduced. Computed variance is compared using F-test. Chunk size increases if the p-value is less than a predefined threshold; otherwise, the whole ensemble is updated with the selected chunk size. The whole process repeats as long as the p-value is lower than the threshold. In both of these works, dynamic chunk size was used as means of handling imbalanced data streams. In contrast, we show that changing chunk size can be beneficial when handling concept drifts in general. Therefore, we do not focus primarily on imbalanced data.

Bifet et al. [81] introduced a method for handling concept drift with varying chunk sizes. Each incoming chunk is divided into two parts: older and new. The empirical means of the data in each subchunk are compared using the Hoeffding bound. If the difference between the two means exceeds the threshold defined by the confidence value, then the data in the older window is qualified as out of date and is dropped. Later, the window with data for the current concept grows until the next drift is detected and the data is split again. This approach allows for detecting drift inside the chunk.

This paper presents a general framework that can be used for training any chunk-based classifier ensemble. This approach aims to reduce the restoration time, i.e., the period needed to stabilize the classification model performance after concept drift occurs. As we mentioned, most methods assume a fixed data chunk size, which is a parameter of these algorithms. Our proposal does not modify the core of the learning algorithm itself. Instead, based on the predictive performance estimated on a given data chunk, it only indicates what data chunk size is to be taken by a given algorithm in the next step. We provide a schema of our method in Fig. 1. Intuition tells us that after concept drift occurs, the chunk size should be small in order to quickly train new models that will replace the ensemble members learned on data from the previous concept. When stabilization is reached, the ensemble contains base models trained on data from the new concept. At this moment, we can extend the chunk size so that classifiers in the ensemble can achieve better performance and even greater stability by learning on larger portions of data from the stream, because the analyzed concept is already stable. Let us present the proposed framework in detail.

Starting the learning process, we sample the data from the stream with a constant chunk size c and monitor the classifier performance using a concept drift detector to detect changes in data distribution. When drift occurs, we decrease the chunk size to a smaller value c_d < c, i.e., c_d is the predefined size of a batch for concept drift. The sizes of subsequent chunks after drift at a given time t are computed using the following equation:

$$c_t = \min(\alpha \, c_{t-1}, \; c), \qquad c_0 = c_d,$$

where α > 1. The chunk size grows continuously with each step to reach the original value c unless stabilization is detected. Then the chunk size is set to c immediately. Let us introduce the Variance-based Stabilization Detection Method (VSDM) to detect predictive performance stabilization. First, we define the fixed-size sliding window W containing the last K predictive performance metric values obtained for the most recent chunks. We also introduce the stabilization threshold s. Stabilization is detected when the following condition is met:

$$\mathrm{Var}(W) < s,$$

where Var(W) is the variance of the scores obtained for the last K chunks. A sample data stream with detected drift and stabilization is presented in Fig. 2. The primary assumption of the proposed method is a faster model adaptation caused by the increased number of updates after a concept drift. This strategy allows for using larger chunk sizes when the data is not changing. It also reduces the computational costs of retraining models. Alg. 1 presents the whole procedure. Our method works with existing models for online learning. For this reason, we argue that the approach proposed in this paper is easier to deploy in practice.
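The chunk size schedule and the variance-based stabilization check described above can be sketched as a small controller. This is a minimal illustration, not the authors' implementation: the class name `ChunkSizeController`, the geometric growth c_t = min(α·c_{t-1}, c), the condition Var(W) < s, the use of population variance, and the requirement that the window be full before stabilization can be declared are all assumptions made for this sketch.

```python
from collections import deque
from statistics import pvariance


class ChunkSizeController:
    """Hypothetical helper that proposes the next chunk size for a chunk-based learner."""

    def __init__(self, base_size=1000, drift_size=30, alpha=1.1,
                 window=30, threshold=1e-4):
        self.base_size = base_size            # c, chunk size on a stable stream
        self.drift_size = drift_size          # c_d, chunk size right after drift
        self.alpha = alpha                    # growth factor, alpha > 1
        self.threshold = threshold            # s, stabilization threshold
        self.scores = deque(maxlen=window)    # W, last K performance scores
        self._size = float(base_size)
        self.adapting = False

    def next_size(self, score, drift_detected):
        """Record the score of the latest chunk and return the next chunk size."""
        self.scores.append(score)
        if drift_detected:
            # Drift: fall back to the small chunk size.
            self.adapting = True
            self._size = float(self.drift_size)
        elif self.adapting:
            window_full = len(self.scores) == self.scores.maxlen
            stabilized = window_full and pvariance(self.scores) < self.threshold
            if stabilized:
                # Stabilization: restore the original size immediately.
                self._size = float(self.base_size)
            else:
                # Otherwise grow geometrically, capped at the original size.
                self._size = min(self._size * self.alpha, float(self.base_size))
            if self._size >= self.base_size:
                self.adapting = False
        return int(self._size)
```

In use, the controller would be called once per processed chunk with the chunk's accuracy and the drift detector's verdict, and the returned value would be the number of samples to draw for the next chunk. The base learner itself is left untouched, which mirrors the claim that only the chunk size is affected.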

Our method only affects the size of the chunk. All other factors, such as the number of features or classifiers in the ensemble, are the same as in the basic approach. For this reason, we focus here only on the impact of chunk size on memory and time complexity. With respect to memory, our method could affect only the size of the buffers for storing samples from a stream. When no drift is detected, the standard chunk size is used, which dictates the required buffer size. For this reason, the memory complexity for storing samples is O(c). CAR works the same way as the base method when no drift is detected and the data stream is stable. Therefore, in this case, the time complexity is the same as in the base method. When drift is detected, the sizes of subsequent chunks change; let g(c) denote the cost of the base method on a chunk of size c. When CAR is enabled and concept drift is detected, the chunk size is changed to c_d. Each consecutive chunk at time t has size c_t = α^t c_d, with t = 0 directly after the drift was detected. The chunk size grows until stabilization is detected or the current chunk size is restored to the original size c. For simplicity, we skip the case when stabilization is detected. With this assumption, we write the condition for restoring the original chunk size:

$$\alpha^{t_s} c_d = c,$$

where t_s is the time when the chunk size is restored to its original value. From this equation we obtain t_s directly:

$$t_s = \log_{\alpha} \frac{c}{c_d}.$$

The number of operations required by CAR after concept drift was detected is

Using big-O notation:

Therefore CAR time complexity depends only on chunk size and computational complexity of used models.
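As a worked illustration of the growth schedule c_t = α^t c_d, the sketch below computes how many chunks are needed before the chunk size returns to c when stabilization is never detected, and how many samples are consumed in the meantime. The function names are hypothetical; the example values α = 1.1 and c_d = 30 are the ones reported in the experimental setup of this paper, and c = 1000 is one of the base chunk sizes used there.

```python
import math


def restoration_steps(base_size, drift_size, alpha):
    """Smallest integer t with alpha**t * drift_size >= base_size."""
    return math.ceil(math.log(base_size / drift_size) / math.log(alpha))


def samples_during_growth(base_size, drift_size, alpha):
    """Samples consumed by the reduced chunks before the base size is restored."""
    steps = restoration_steps(base_size, drift_size, alpha)
    return sum(min(drift_size * alpha ** t, base_size) for t in range(steps))


if __name__ == "__main__":
    steps = restoration_steps(1000, 30, 1.1)
    print(steps)                                   # 37 reduced chunks
    print(samples_during_growth(1000, 30, 1.1))    # samples used by those chunks
    print(steps * 1000)                            # samples used by 37 full-size chunks
```

Under these assumptions, 37 model updates after a drift cost roughly a quarter of the samples that 37 fixed-size chunks of 1000 would require. Because the logarithms are evaluated in floating point, ratios that are exact powers of α may need an explicit check in production code.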

Restoration time cannot be directly utilized in this work, as we do not have access to pure streams with separate concepts. For this reason, we introduce a new Sample Restoration (SR) metric to evaluate the Chunk-Adaptive Restoration performance compared to standard methods used for learning models on data streams with concept drift. We assume that there is a sequence of N chunks between two stabilization points. Each element of such a sequence is determined by the chunk size c_t and the achieved model's accuracy acc_t. Let us define the index of the minimum accuracy as:

$$t_{min} = \arg\min_{t \in \{1, \dots, N\}} acc_t,$$

and the restoration threshold is given by the following formula:

$$h = p \cdot \max_{t \ge t_{min}} acc_t,$$

where p ∈ (0, 1) is the percentage of the performance that has to be restored, and the multiplier is the maximum accuracy score of our model after the point when it achieved its minimum score. Finally, we look for the lowest index t_r after which the model exceeds the assumed restoration threshold:

$$t_r = \min \{\, t \ge t_{min} : acc_t \ge h \,\}.$$

Sample Restoration is computed as the sum of chunk sizes from the concept drift's beginning to t_r:

$$SR = \sum_{t=1}^{t_r} c_t.$$

In general, SR is the number of samples needed to obtain p percent of the maximum performance achieved on the subsequent task.
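The verbal definition of Sample Restoration can be turned into a short function. This is a sketch under stated assumptions: the sequence starts at the first chunk after the drift, the maximum is taken over chunks from the minimum-accuracy index onward, the threshold is considered reached when accuracy is greater than or equal to it, and the default p = 0.8 is a placeholder rather than a value taken from the paper. The function name is hypothetical.

```python
def sample_restoration(chunk_sizes, accuracies, p=0.8):
    """Number of samples consumed until accuracy recovers to p * (max after the minimum)."""
    if len(chunk_sizes) != len(accuracies) or not accuracies:
        raise ValueError("chunk_sizes and accuracies must be non-empty and equally long")
    # Index of the minimum accuracy in the sequence between two stabilization points.
    t_min = min(range(len(accuracies)), key=accuracies.__getitem__)
    # Restoration threshold: fraction p of the best accuracy reached from t_min onward.
    threshold = p * max(accuracies[t_min:])
    # First chunk at or after t_min whose accuracy reaches the threshold.
    t_r = next(t for t in range(t_min, len(accuracies)) if accuracies[t] >= threshold)
    # Sum of chunk sizes from the beginning of the drift up to and including t_r.
    return sum(chunk_sizes[: t_r + 1])
```

For the same accuracy trajectory, a run with reduced chunk sizes yields a smaller SR than a run with a fixed chunk size, which is the effect the metric is meant to expose.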

Data streams. Experiments were carried out using both synthetic and real datasets. The stream-learn library [1] was employed to generate the synthetic data containing three types of concept drift: abrupt, gradual, and incremental, all generated with recurring or unique concepts. We tested parameters such as chunk sizes and the stream length for each type of concept drift. All streams were generated with 5 concept drifts, 2 classes, and 20 input features, of which 2 were informative and 2 were redundant. In the case of incremental and gradual drifts, the concept sigmoid spacing was set to 5. Apart from the synthetic streams, we employed the Usenet [2] and Insects [10] data streams. Unfortunately, the original Usenet dataset contains a small number of samples, so two selected concepts were repeated to create a recurring-drifted data stream. Each chunk of the Insects data stream was randomly oversampled because of the significant imbalance ratio.

Drift detector. FHDDM [74] was employed as a concept drift detector. We used the implementation available in a public repository [78]. The window size in FHDDM was equal to 1000, and the allowed error probability was δ = 0.000001.

Classifier ensembles. Three classifier ensembles dedicated to data stream classification were chosen for comparison:

• Weighted Aging Classifier (WAE) [76],

• Accuracy Weighted Ensemble (AWE) [3],

• Streaming Ensemble Algorithm (SEA) [6].

All ensembles contained 10 base classifiers.

Experimental protocol. In our experiments, we apply the models mentioned above to selected data streams with concept drift and measure Sample Restoration. These results are reported as a baseline. Next, we apply Chunk-Adaptive Restoration and repeat the experiments to establish the influence of the proposed method on the ability to handle concept drift quickly. As the experiments were conducted with balanced data, accuracy was used as the only indicator of the model's performance. The Test-Then-Train experimental protocol was employed [77].

Statistical analysis. Because Sample Restoration can be computed for each drift and concept drift can occur multiple times, we report the average Sample Restoration for each stream with its standard deviation. To assess the statistical significance of the results, we used a one-sided Wilcoxon signed-rank test in a direct comparison between the models with the 95% confidence level.

Reproducibility. To enable independent reproduction of our experiments, we provide a GitHub repository with code 1 . This repository also contains detailed results of all experiments. The stream-learn [1] implementation of the ensemble models was utilized with Gaussian Naïve Bayes and CART as base classifiers from sklearn [83]. Detailed information about the packages used is provided in the yml file with the specification of the conda environment.
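The Test-Then-Train protocol mentioned above can be sketched for chunked data as follows: every chunk except the first is used to test the current model before the model is trained on it. The sketch is library-independent; `MajorityClassifier` is a hypothetical stand-in for a real ensemble, and the `predict`/`partial_fit` interface is an assumption chosen to resemble common incremental-learning APIs, not the stream-learn evaluator used in the experiments.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, X, y):
        self.counts.update(y)
        return self

    def predict(self, X):
        label = self.counts.most_common(1)[0][0] if self.counts else 0
        return [label] * len(X)


def evaluate_test_then_train(chunks, model):
    """Return the accuracy per chunk; each chunk is tested first, then used for training."""
    scores = []
    for index, (X, y) in enumerate(chunks):
        if index > 0:  # the first chunk can only be used for training
            predictions = model.predict(X)
            correct = sum(p == t for p, t in zip(predictions, y))
            scores.append(correct / len(y))
        model.partial_fit(X, y)
    return scores
```

Because each chunk is evaluated before the model sees its labels, the reported accuracy drops immediately when a new concept arrives, which is what makes restoration measurable.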

In our first experiment, we examine the impact of the chunk size on the model performance and the general capability for handling data with concept drift. We train the AWE model on a synthetic data stream with different chunk sizes to evaluate these properties. The stream consists of 20 features and 2 classes, and it contains only 1 abrupt drift. Results are presented in Fig. 3. As expected, chunk size has an impact on the maximal accuracy that the model can achieve. It is especially visible before the drift, where models with larger chunks obtain the best accuracy. Also, with larger chunks, the variance of accuracy is lower. In ensemble-based approaches, a base classifier is trained on a single chunk. A larger chunk means that more data is available to the underlying model and therefore allows for training a more accurate model. Interestingly, we can see that for all chunk sizes, performance is restored at roughly the same time. Regardless of the chunk size, a similar number of updates is required to bring back the model performance. Please keep in mind that the x-axis in Fig. 3 is the number of chunks. It means that models trained on larger chunks require a larger number of learning examples to restore accuracy. These results give the rationale behind our method. When drift is detected, we reduce the chunk size to decrease the number of learning examples required for restoring accuracy. Next, we gradually increase the chunk size to improve the maximum possible performance when the model recovers from drift. This allows for a quick reaction to drift and does not limit the model's maximum performance. In principle, not all models are compatible with a changing chunk size. Also, the batch size cannot be decreased indefinitely. The minimal chunk size should be determined case by case, depending on the base learner used in an ensemble or on the model used in general.
Later in our experiments, we use chunk sizes of 500, 1000, and 10000 to obtain a reliable estimate of how our method will perform in different settings.

After the chunk size was selected, we fine-tuned the other hyperparameters and then proceeded to further experiments. First, we set two values manually, based on our observations. The first is α (i.e., the constant that determines how fast the chunk size grows after drift is detected), equal to 1.1. The second is the drift chunk size, equal to 30, as it is a typical window length in drift detectors.

Next, we find the best values for the stabilization window size and the stabilization threshold. We conduct a grid search with window size values 30, 50, 100, and stabilization thresholds 0. From the provided data, we can conclude that the smaller the drift chunk size, the lower the SR is. This observation is in line with the intuition behind our method. A smaller drift chunk size provides a larger benefit during drift compared to the normal chunk size. The same dependency can be observed for the stabilization threshold. Intuitively, a lower threshold means that stabilization is harder to reach. We argue that this can be beneficial in some cases when working with gradual or incremental drift. In this scenario, if stabilization is reached too fast, then the chunk size is immediately brought back to the standard size, and there is no benefit from a smaller chunk size at all. Lowering the stabilization threshold could help in these cases. In later experiments, we use a stabilization window size equal to 30 and a variance stabilization threshold equal to 0.0001.

In this part of the experiments, we compare the performance of the proposed method to baseline. Results were collected following the experimental protocol described in the previous sections. To save space, we do not provide results for all models and streams. Instead, we plot accuracy achieved by models on selected data streams. These results are presented in Fig. 4 , 5, 6, and 7. All learning curves were smoothed using a 1D Gaussian filter with σ = 1.

From the provided plots, we can deduce that the largest gains from employing the CAR method can be observed for streams with abrupt drift. In streams with gradual and incremental drifts, there are few or no sudden drops of accuracy that the model can quickly react to. For this reason, the CAR method does not provide a large benefit for these kinds of concept drift. During a more detailed analysis of the obtained results, we observed that stabilization for gradual and incremental drifts is hard to detect. Many false positives usually cause an early return to the original chunk size, influencing the performance achieved on those two types of drift. FHDDM caused another problem related to the early detection of gradual and incremental concept drifts. Usually, early detection is a desired feature. In our method, however, early drift detection initiates the chunk size change when two data concepts are still overlapping during stream processing. As the transition between two concepts takes much time, by the time one concept starts to dominate, the chunk size could already have been restored to its original value, affecting the achieved results.

We also observe larger gains from applying CAR on streams with bigger chunk size. To illustrate please compare results from Fig. 4 to Fig. 5 . One possible explanation behind this trend is that gains obtained from employing CAR are proportional to the difference in size between the base and drift chunk size. In our experiments, drift chunk size was equal to 30 for all streams and models. This explanation is also in line with the results of hyperparameter experiments provided in Tab. 2.

We conclude this section by providing a statistical analysis of our results. Tab. 3 shows the results of the Wilcoxon test for the Naïve Bayes and CART base models. We observe significant differences in Sample Restoration between the baseline and the CAR method for all models.

Real-world data often contain noise in labeling. For this reason, we evaluate whether the proposed method can be used for data with varying amounts of noise in labels. We generate a synthetic data stream with two classes, base chunk size 1000, drift chunk size 100, and a single abrupt concept drift. We randomly select a predefined fraction of samples in each chunk and flip the labels of the selected learning examples. Next, we measure the accuracy of the AUE model with the Gaussian Naïve Bayes base model on the generated dataset with noise levels 0, 0.1, 0.2, 0.3, and 0.4. Results are presented in Fig. 8. We note that for low levels of noise, i.e., up to 0.3, restoration time is shorter. With a larger amount of noise, there is no sudden drop in accuracy, and therefore CAR has no impact on the speed of reaction to drift. It should be noted that results for CAR with noise levels 0.2, 0.3, and 0.4 were generated with the stabilization detector turned off. With a higher amount of noise, stabilization was detected very fast, so the chunk size was quickly set back to the base value. In this case, there was no benefit from applying CAR. This indicates that the stabilization method should be refined to handle noisy data well.
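The label noise procedure described above (select a fixed fraction of samples in a chunk and flip their labels) can be sketched for binary labels as follows. The function name `flip_labels` and the use of a seeded `random.Random` instance are assumptions for this illustration; the experiments in the paper may have used a different sampling routine.

```python
import random


def flip_labels(labels, noise_level, rng):
    """Return a copy of binary labels with a fixed fraction of them flipped."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    n_flip = int(round(noise_level * len(labels)))
    noisy = list(labels)
    # Choose distinct positions so that exactly n_flip labels change.
    for index in rng.sample(range(len(labels)), n_flip):
        noisy[index] = 1 - noisy[index]
    return noisy


if __name__ == "__main__":
    chunk_labels = [0, 1] * 500               # one chunk of 1000 binary labels
    noisy = flip_labels(chunk_labels, 0.1, random.Random(42))
    print(sum(a != b for a, b in zip(chunk_labels, noisy)))
```

Applying the function to every chunk before the Test-Then-Train step keeps the noise level constant across the stream, so any change in accuracy can be attributed to the drift rather than to a varying amount of noise.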

First, we evaluated the impact of chunk size on the learning process in a data stream with a single concept drift. We learned that models with a larger chunk size can reach a higher maximum accuracy, but the number of updates required to restore accuracy is similar regardless of chunk size (RQ1 answered). The main goal of introducing Chunk-Adaptive Restoration was to demonstrate its advantages in reducing the number of samples needed during the restoration period when dealing with abrupt concept drift. The statistical tests have shown a significant benefit of employing it in different stream learning scenarios (RQ2 answered). The highest gains from employing the method were observed when a large original chunk size was used. With a bigger chunk size, there are fewer model updates, which delays the reaction to concept drift.

The number of samples that can be saved depends on the drift type and the original chunk size. When dealing with abrupt drift, the sample restoration time can be around 50% better than the baseline (RQ3 answered). We noticed that for each of the analyzed classifier ensemble methods, CAR reduced the restoration time and achieved better average predictive performance. It is worth noting that the simpler the algorithm, the greater the profit from using CAR. The most considerable profit was observed for SEA and AWE, while in the case of WAE, the native version sometimes outperformed CAR on the Average Sample Restoration metric (RQ4 answered). When a small amount of noise is present in the labels, CAR can still be useful; however, in some cases the stabilization detector should not be used. With a larger amount of noise, there is no gain from using the proposed method (RQ5 answered).

This work focused on the Chunk-Adaptive Restoration framework, which is dedicated to chunk-based data stream classifiers and enables better recovery from concept drifts. To achieve this goal, we proposed new methods for stabilization detection and chunk size adaptation. Their usefulness was evaluated in computer experiments conducted on real and synthetic data streams. The obtained results show a significant difference between the predictive performance of the baseline models and the models employing CAR. Chunk-Adaptive Restoration is strongly recommended for abrupt concept drift scenarios because it can significantly reduce model downtime. The performance gain is not visible for other types of concept drift, but the method still achieves acceptable results. Future work may focus on:

• Improving the Chunk-Adaptive Restoration behavior for gradual and incremental concept drifts.

• Adapting the Chunk-Adaptive Restoration to the case of limited access to labels using semi-supervised and active learning approaches.

• Proposing a more flexible method of changing the data chunk size, e.g., based on the model stability assessment.

• Adapting the proposed method to the imbalanced data stream classification task, where changing the data chunk size may be correlated with the intensity of data preprocessing (e.g., the intensity of data oversampling).

• Improving the stabilization method to better handle data streams with noise.


This work is supported by the CEUS-UNISONO programme, which has received funding from the National Science Centre, Poland under grant agreement No. 2020/02/Y/ST6/00037.