authors: Korycki, Lukasz; Krawczyk, Bartosz
title: Instance exploitation for learning temporary concepts from sparsely labeled drifting data streams
date: 2020-09-20

Continual learning from streaming data sources becomes more and more popular due to the increasing number of online tools and systems. Dealing with dynamic and everlasting problems poses new challenges for which traditional batch-based offline algorithms turn out to be insufficient in terms of computational time and predictive performance. One of the most crucial limitations is that we cannot assume having access to a finite and complete data set - we always have to be ready for new data that may complement our model. This poses a critical problem of providing labels for potentially unbounded streams. In the real world, we are forced to deal with very strict budget limitations, therefore, we will most likely face the scarcity of annotated instances, which are essential in supervised learning. In our work, we emphasize this problem and propose a novel instance exploitation technique. We show that when: (i) data is characterized by temporary non-stationary concepts, and (ii) there are very few labels spanned across a long time horizon, it is actually better to risk overfitting and adapt models more aggressively by exploiting the only labeled instances we have, instead of sticking to a standard learning mode and suffering from severe underfitting. We present different strategies and configurations for our methods, as well as an ensemble algorithm that attempts to maintain a sweet spot between risky and normal adaptation. Finally, we conduct a complex in-depth comparative analysis of our methods, using state-of-the-art streaming algorithms relevant to the given problem.
Contemporary data has moved beyond being small and static collections of information. Nowadays, we need to tackle not only the massive volume of data sets, but also their velocity, as new data may arrive anytime, continuously flooding the system [1]. If we focus only on storing the information and analyzing it in an offline fashion, then we may face the problem known as data tombs [2]. This term is used to describe enormous data collections, holding potentially valuable knowledge that is never analyzed due to its size and time constraints. Even if we alleviated the data tomb problem by using high-performance computing environments, we would still be introducing a delay between the moment when information arrived and when it was analyzed. In such a setting, we cannot respond immediately, and by the time we finish the analysis of our data collection it may turn out to be outdated. In order to extract the most current and valuable knowledge, modern data must be analyzed in (near) real-time. The most recent example of such dynamic data is the information about the spread of COVID-19 [3]. Here, we are dealing with modeling a completely new event that at the given stage offers limited information. Any new data is therefore of extremely high value and must be incorporated into the predictive system as soon as possible. The underlying learning model must be capable of flexible adaptation, as even a small amount of data gathered in the last couple of hours may completely invalidate previously made predictions.
COVID-19 showed why predictive models should be updated in an online manner and how predicted dynamics may change rapidly whenever a new batch of data becomes available [4]. When dealing with such an impactful event, one cannot delay obtaining vital conclusions by using offline models. Such problems gave rise to the field of data stream mining. Here, we assume that new information arrives continuously and our learning model must be capable of making quick decisions and incorporating new data on-the-fly. Streaming algorithms continuously update themselves in order to capture the most current state of a stream and its dynamics [5]. This is necessary, as data streams are subject to a phenomenon known as concept drift [6]. It assumes that the distribution and characteristics of analyzed data will change over time, generating temporarily valid concepts and rendering previously trained models irrelevant. When a drift occurs, old information stored in the model should be discarded and learning over a new concept should be prioritized. Due to the varying nature of concept drift, it may affect different properties of data (feature distribution, class labels, or proportion of instances in classes) under different rapidity of changes (sudden or gradual). These factors must dictate the way in which the model adapts to drifts, making understanding the dynamics of change a vital task [7, 8].
Data stream mining poses additional challenges beyond creating accurate predictive models. We must assume working under time and memory constraints. Designing machine learning methods for data streams can be viewed as a multi-criteria optimization, where one aims at reaching a trade-off between several factors that define efficacy in streaming environments [9]. As we cannot store the entire data stream, we should carefully choose which instances to keep and which to discard (or, in the case of online learning, assume that each instance must be discarded after a single pass). The trained model must quickly provide predictions for newly arriving instances to avoid bottlenecks [10]. Furthermore, when faced with a concept drift, we need not only to detect it as soon as possible but also to adapt the model to the new state of the stream with the lowest possible latency. The time between the occurrence of the concept drift and classifier re-training is known as the drift recovery rate [11]. Reducing the time when our classifier is incompetent (i.e., not adapted to the new concept) should be seen as one of the main goals of learning from data streams. The speed of adaptation can be affected by the training procedure of the classifier, as well as by the access to data from the new concept - the more instances representative of the new concept we have, the more likely we are to reduce the recovery time. However, we must be aware that fully supervised learning from data streams is an unrealistic scenario. It would require access to an oracle providing class labels for every single instance in the stream. As such labels usually come from a domain expert, we cannot assume that we can obtain all of them - we are restricted by both the budget (i.e., time of the domain expert is costly) and human limitations (instances arrive with such a speed and in such a volume that it is impossible for a human to tackle them all). A realistic scenario for data stream mining assumes either limited access to ground truth [12], or delayed labeling [13].
This restricted access to labeled data strengthens another problem known as underfitting, making effective drift adaptation even more challenging. In this paper, we address an extremely important challenge of data stream mining: how to improve learning from drifting data while having only strictly limited access to class labels. While many of the existing works focus on how to reasonably choose instances for labeling, in this work, we aim at providing further improvements based solely on the few labeled instances given to us and at no additional cost. Our assumption is that any obtainable improvement under strict labeling constraints is vital for the underlying classifier, offering faster adaptation to new concepts. The main contributions of our work include the following.
• Enhancing drift adaptation with instance exploitation. Our idea is to intensively exploit labeled instances at our disposal, aiming at reinforcing adaptation to non-stationary concepts and providing a faster classifier recovery. By reusing the labeled instances we should be able to avoid underfitting and offer faster and more accurate reactions to concept drift.
• Exploitation techniques for improved drift adaptation. We introduce three techniques for instance exploitation to empower active learning from sparsely labeled drifting data streams. They are based on a reactive sliding window that uses one of three probabilistic sampling techniques to select instances for a more aggressive exposure to the online classifier.
• Ensemble architectures to avoid overfitting. In order to minimize the risk of overfitting, we propose two simple, yet effective ensemble architectures. They are based on two paired learners, one preferring aggressive instance exploitation and the other learning in a standard way. Our architectures are capable of dynamic switching between these models, with an added procedure for improving the weaker of the two learners.
• Flexible framework. Our proposal is a universal wrapper that can be used to enhance any online active learning algorithm dedicated to data streams.
• Insights into the role of enhanced drift adaptation. We offer an exhaustive experimental study on both artificial and real data streams, paired up with an in-depth analysis that offers unique insights and a better understanding of how to efficiently improve the adaptation to concept drift by using already labeled instances without any additional labeling costs.
A data stream is defined as a sequence ⟨S_1, S_2, ..., S_n, ...⟩, where each element S_j is a new instance. In this paper we assume the (partially) supervised learning scenario with a classification task, and thus we define each instance as S_j ∼ p_j(x_1, ..., x_d, y) = p_j(x, y), where p_j(x, y) is a joint distribution of the j-th instance, defined by a d-dimensional feature space and assigned to class y. Each instance is independent and drawn randomly from a probability distribution Ψ_j(x, y).
Concept drift. When all instances come from the same distribution, we deal with a stationary data stream. In real-world applications data very rarely falls under stationary assumptions [14]. It is more likely to evolve over time and form temporary concepts, being subject to concept drift [6]. This phenomenon affects various aspects of a data stream and thus can be analyzed from multiple perspectives. One cannot simply claim that a stream is subject to drift. It needs to be analyzed and understood in order to be handled adequately to the specific changes that occur [7, 8].
More precise approaches may help us achieve faster and more accurate adaptation [11]. Let us now discuss the major aspects of concept drift and its characteristics.
Influence on decision boundaries. Firstly, we need to take into account how concept drift impacts the learned decision boundaries, distinguishing between real and virtual concept drifts [15]. The former influences previously learned decision rules or classification boundaries, decreasing their relevance for newly incoming instances. Real drift affects posterior probabilities p_j(y|x) and additionally may impact unconditional probability density functions. It must be tackled as soon as it appears, since it negatively impacts the underlying classifier. Virtual concept drift affects only the distribution of features x over time:
p_j(x) = ∫_Y p_j(x, y) dy,
where Y is the set of possible values taken by y. While it seems less dangerous than real concept drift, it cannot be ignored. Despite the fact that only the values of features change, it may trigger false alarms and thus force unnecessary and costly adaptations.
Locality of changes. It is important to distinguish between global and local concept drifts [16]. The former affects the entire stream, while the latter affects only certain parts of it (e.g., regions of the feature space, individual clusters of instances, or subsets of classes). Determining the locality of changes is of high importance, as rebuilding the entire classification model may not be necessary. Instead, one may update only certain parts of the model or sub-models, leading to a more efficient adaptation.
Speed of changes. Here we distinguish between sudden, gradual, and incremental concept drifts [6].
• Sudden concept drift is a case when the instance distribution abruptly changes with the t-th example arriving from the stream: S_{t-1} ∼ p_old(x, y) and S_t ∼ p_new(x, y), where p_old ≠ p_new.
• Incremental concept drift is a case when we have a continuous progression from one concept to another (thus consisting of multiple intermediate concepts in between), such that with every step of the transition the distance from the old concept is increasing, while the distance to the new concept is decreasing.
• Gradual concept drift is a case where instances arriving from the stream oscillate between two distributions during the duration of the drift, with the old concept appearing with decreasing frequency: S_t ∼ (1 - δ) p_old(x, y) + δ p_new(x, y), where δ ∈ [0, 1] is a random variable.
Recurrence. In many scenarios it is possible that a previously seen concept from the k-th iteration may reappear, D_{j+1} = D_{j-k}, over time [17]. One may store models specialized in previously seen concepts in order to speed up recovery rates after a known concept reemerges [18].
Presence of noise. Apart from concept drift, one may encounter other types of changes in data. They are connected with the potential appearance of incorrect information in the stream and are known as blips or noise. The former stands for singular random changes in a stream that should be ignored and not mistaken for a concept drift. The latter stands for significant corruption in the feature values or class labels and must be filtered out in order to avoid feeding false [19] or even adversarial information to the classifier [20].
Feature drift. This is a type of change that happens when a subset of features becomes, or ceases to be, relevant to the learning task [21]. Additionally, new features may emerge (thus extending the feature space), while the old ones may cease to arrive.
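To make these definitions concrete, the following minimal Python sketch simulates a stream drifting from one concept to another. The linear ramp for δ is a simplifying assumption on our part (generators such as those in MOA use a sigmoidal transition, discussed later in the experimental setup), and the concept callables are hypothetical placeholders that each return a single (x, y) pair.

    import random

    # Minimal sketch of a stream with a drift from concept_a to concept_b.
    # With s = 0 the change is sudden at t0; with s > 0 instances oscillate
    # between both concepts, the new one appearing more and more often
    # (a gradual drift governed by the random variable delta).
    def drifting_stream(concept_a, concept_b, t0=5000, s=1000, n=10000):
        for t in range(n):
            if s == 0:
                delta = 1.0 if t >= t0 else 0.0           # sudden switch
            else:
                delta = min(max((t - t0) / s, 0.0), 1.0)  # linear ramp
            source = concept_b if random.random() < delta else concept_a
            yield source()  # each concept callable returns one (x, y) pair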
Classifiers for drifting data streams. In order to be able to adapt to evolving data streams, classifiers must either have explicit information on when to update their model, or use continuous learning to follow the progression of a stream. Concept drift detectors are external tools that can be paired with any classifier and used to monitor the state of the stream [22]. Usually, this is based on tracking the error of the classifier [23] or measuring the statistical properties of data [24]. Drift detectors emit two-level signals. The warning signal means that changes start to appear in the stream and recent instances should be stored in a dedicated buffer. This buffer is used to train a new classifier in the background. The drift signal means that changes are significant enough to require adaptation and the old classifier must be replaced with the new one trained on the most recent buffer of data. This reduces the cost of adaptation by lowering the number of times when we train the new classifier, but may be subject to costly false alarms or to missing changes appearing locally or at a smaller magnitude. Alternative approaches assume using classifiers that are capable of learning in an incremental or online manner. Sliding windows storing only the most recent instances are very popular, allowing for natural forgetting of older instances [25, 26]. The size of the window is an important parameter and adapting it over time seems to yield the best results [27]. Online learners are capable of learning instance by instance, discarding data after it has passed the training procedure. They are very efficient on their own, but need to be equipped with a forgetting mechanism in order not to endlessly grow their complexity [28]. Adaptive Hoeffding Trees [29] and gradient-based methods [30] are among the most popular solutions.
Ensemble approaches. Combining multiple classifiers is a very popular and powerful approach for standard learning problems [31]. The technique transferred seamlessly to data stream mining scenarios, where ensemble approaches have displayed great efficacy [9]. They not only offer improved predictive power, robustness, and reduction of variance, but also can easily handle concept drift and use it as a natural way of maintaining diversity. By encapsulating new knowledge in the ensemble pool and removing outdated models, one can assure that the base classifiers are continuously mutually complementary, while adapting to changes in the stream. There are two main approaches for ensemble design in streaming environments: (i) updating base classifiers and (ii) updating the ensemble set-up. The first approach assumes that our ensemble uses classifiers capable of incremental or online learning. New instances arriving from the stream are used to continuously update those classifiers, without adding or removing any models. This provides natural adaptation to changes in the stream and retains knowledge from previous concepts. A diversity assurance method is required, as feeding the entire stream to all base classifiers would force them to converge to a similar, if not identical, model. Popular solutions are based on the usage of online versions of bagging [32], boosting [33], Random Forest [34], or instance-based clustering [35]. The second approach to the ensemble design assumes that we can add new classifiers to the ensemble pool and that outdated or irrelevant classifiers can be removed via a pruning procedure.
Classifiers are combined with weights reflecting the time they have spent in the ensemble and their current accuracy [36]. Accuracy Weighted Ensemble [37] and Accuracy Updated Ensemble [38] enable natural encapsulation of changes in data in the form of new classifiers and allow using any base learner, as incremental mode is not required. However, methods using the dynamic ensemble set-up react more slowly to concept drift than their online counterparts, as they need to gather enough new instances for classifier training. As both approaches have their advantages and drawbacks, hybrid solutions emerge. Kappa Updated Ensemble [39] is one of the most recent examples, where all base classifiers are updated in an online manner to offer high reactivity to concept drift, while new classifiers are trained in the background and used to replace the current ones whenever changes in the data are too severe for online updates to handle. This allows tackling both sudden and gradual changes in streaming data. While a plethora of new approaches and algorithms have been proposed for continual learning from data streams, the predominant part of them share a common weakness - they are based on a naive assumption that we have unrestrained access to labeled instances that represent a given problem in a sufficient way [9]. Authors usually omit the aspect of acquiring annotated data for their methods and focus on improving their capacity to adapt, given a sufficient amount of supervision. Such an assumption is highly unfeasible in real-life scenarios [40]. Instance labeling is usually connected with a monetary cost, which quickly becomes prohibitive if one wants to label a potentially infinite stream and does not have an "unlimited" budget. Even if the cost is not an issue, human experts cannot work non-stop and require a certain amount of time to analyze each instance, while being subject to various limitations, such as availability and throughput [13, 41]. Finally, living in the era of deep learning, we must acknowledge the fact that such models require significantly higher amounts of labeled data than their shallow counterparts [42]. This has led to a significant growth of labeling services [43]. If we encounter such problems while working with offline models, then we should expect them to be even more severe in online systems. Analogously to the traditional batch-based machine learning approaches, this critical obstacle has been addressed by some semi-supervised and unsupervised methods, which aim at alleviating the lack of supervision by utilizing unlabeled instances. The former group is represented mainly by algorithms based on the incremental cluster-then-label approach [44, 45] and online self-labeling [35]. Other methods involve graph-based label propagation techniques [46] or online co-training [47]. The unsupervised methods present a different, proactive approach, in which models try to anticipate changes in decision boundaries without labeled instances. This involves capturing some general characteristics of a stream just from initially available labeled data and simulating their most probable evolution over time [48], or very interesting attempts to model the dynamics of a decision space [49]. While utilizing unlabeled data seems perfectly adequate for scenarios involving streams, it may be surprising why the actual number of works using it is so limited. It is possible that obtaining significant improvements from unreliable unlabeled data is very difficult in dynamic environments.
Indeed, one of the main assumptions of semi-supervised learning is that, in order for it to work correctly, unlabeled data have to be drawn from the same, stable distributions as the observed labeled instances [50]. This may be prohibitive in many streaming scenarios, especially if we add the fundamental cluster and smoothness requirements. In fact, results from some publications suggest that feasible improvements from unlabeled data are either barely significant or very unstable and highly dependent on the characteristics of a specific stream [51]. As a result of these potential weak spots, it is not a coincidence that significant attention has been given to methods shifted towards supervised approaches, which are more likely to provide reliable information regardless of the state of a stream. One of the attractive solutions is to focus on a careful selection of instances to be labeled - namely active learning [52]. In static scenarios, those methods look for the most representative objects which can be used for effective concept modeling. In dynamic environments, a crucial aspect of proper stream labeling is a swift adaptation to changes in class characteristics [12]. It is not only about finding the most valuable objects for one batch of data, but also about proper exploration of a decision space in order to find subspaces that are no longer valid and to sample a sufficient number of instances from those regions. Dynamic active learning ensures a more efficient discovery of changes and a faster recovery rate after a concept drift, which prevents a classifier from prolonged drops in performance [11]. The most popular strategies use adaptive thresholds on classifier certainty (or support functions) [12] or density information [53]. The problem of obtaining class labels becomes even more difficult when dealing with imbalanced class distributions in streams [43, 54]. All of the existing active learning strategies are significantly impaired when the label query budget (i.e., the number of instances allowed to be labeled) is small. In many real problems labeling even as little as 1% of instances from the stream may be too costly. As a consequence, even if a strategy picks only valuable objects to be labeled and used, the effect of a single update may still be insufficient. Furthermore, it is crucial to remember that when we deal with evolving data streams and only low budgets are available, we usually get very sparse snapshots of observed concepts. Arriving labeled objects often come from distant parts of the stream, so if the data is non-stationary, they likely represent different distributions constituting temporary concepts (Fig. 1), especially when the rate of incoming instances is low. As a result of these two observations, it is very likely that, while working with dynamic streams and a substantially limited labeling budget, we will encounter the underfitting problem. In fact, we may even completely overlook some of the ephemeral concepts, leaving them unnoticed. Thus, the very few labeled objects we obtain are essential for keeping our models up-to-date and we should exploit them as much as possible to avoid wasting their potential benefits. Finally, since active learning methods can be seen as exploration methods and semi-supervised ones as regularization-exploitation, approaches combining both of them are a promising research direction [48, 51].
However, due to the mentioned problems with dynamic environments and using unsupervised input, in this work, we focus on investigating how significant improvements can be obtained solely from the exploitation of scarce but reliable supervised information, provided with actively selected instances. To the best of our knowledge, it is the first such approach to the problem. Our hypothesis is that when dealing with temporary and evolving concepts under strictly limited access to labeled data, it is reasonable to aggressively exploit the data we currently have in order to overcome underfitting and obtain efficient up-to-date models. Let us consider the following example. In Fig. 2 we can see an adaptation process for a classifier while recognizing two classes (marked as red and blue points). Fig. 2a presents a state just before a drift. The classifier has learned the previous concept with sufficient accuracy, however, after the change (Fig. 2b) its model becomes practically useless. There is a need to update it, but due to a limited budget only a few labeled instances selected by the active learning strategy are available (yellow). Since we deal with data streams, we can assume that the data points are currently stored in a sliding window. In the standard active learning approach, the chosen objects are used only once (λ = 0) and the effect is unacceptable - the class boundary did not change at all. This observation is connected with the nature of many incremental parameters maintained by the classifiers. Their internal estimators cannot be updated efficiently using only a few new values. The most straightforward solution is to reuse the limited labeled instances several times to influence the metrics in a more significant manner and boost the adaptation. We can clearly see that after some number of repetitions λ the decision boundary is reshaping towards a proper model - we exploit the acquired knowledge (Fig. 2c-2f). The final result is highly dependent on the representativeness of the instances selected by active learning, but this has already been investigated in other publications [12]. While we are aware of the fact that such an approach may potentially cause overfitting to the few instances, our assumption is that it is less important than the underfitting problem encountered before the risky adaptation was performed. Such a wrapper can work with many online base learners, especially with those dedicated to evolving data streams. The currently most important online algorithm - the Adaptive Hoeffding Tree (AHT) - may benefit from such a method, since its internal splits are rearranged only if a significantly large amount of data differs from a previous concept [29]. In the case of the stochastic gradient descent classifier (SGD) or neural networks, by exposing wrongly classified instances several times, we can speed up the gradient-based optimization of a loss function. Of course, while using such algorithms we can also directly focus on the criterion, for example by adjusting the learning rate or momentum, however, such solutions are limited only to gradient-based classifiers [55]. Finally, Naïve Bayes may also work with the introduced method, since its internal estimators can be updated incrementally. It is also worth noting that we may avoid reusing instances several times by employing weight-sensitive classifiers, which can be straightforwardly adapted for this approach. In this work, due to their greater flexibility, we focus only on the more generic instance-based algorithms.
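As a minimal illustration of this mechanic - a sketch assuming scikit-learn's SGDClassifier as the online learner, with illustrative names and an arbitrary λ, not the exact implementation used in our framework - a labeled instance can simply be exposed to the incremental model multiple times:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(loss="log_loss")
    classes = np.array([0, 1])

    def exploit_instance(model, x, y, lam=100):
        # One standard update plus lam additional exploitation updates,
        # all using the same labeled instance (x, y).
        X = x.reshape(1, -1)
        Y = np.array([y])
        for _ in range(1 + lam):
            model.partial_fit(X, Y, classes=classes)

For a weight-sensitive learner, a single update with a sample weight of 1 + λ would have a similar effect without the repeated passes.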
This approach can be associated with a mechanism similar to experience replay, which is a method of tackling catastrophic forgetting in deep neural networks [56]. The fundamental difference is that instead of reusing instances for stabilizing models of incoming classes, we want to erase outdated knowledge more quickly and practically "overfit" an online model to current concepts. The method is also related to resampling, known from imbalanced learning, which also utilizes some instances multiple times [57]. One may notice that another popular approach that allows the most current instances to significantly affect a maintained model is to simply retrain the model from scratch, using a collected batch of data [58]. However, while such methods work well with sudden concept drifts, they are not suitable for incremental changes. In addition, forgetting all previously gathered information may be a wrong idea if the budget is limited and if we take into consideration the fact that streams may be characterized by hybrid drifts or changes affecting only some parts of the analyzed space. In the example above (Fig. 2), we would lose all the information about the red points at the bottom, which were not influenced by the concept drift. We do not want to completely forget previous knowledge, just quickly and incrementally adapt to the current situation when lacking labels. One well-known way to overcome this problem is to use ensembles that consist of classifiers constructed on the basis of the most current batches [37]. Still, such an approach is much more time and memory consuming than single-classifier frameworks. Another problem is that highly limited budgets may be prohibitive in the context of building a diverse pool of classifiers. Finally, all the strategies using the retraining approach are highly dependent on effective concept drift detectors, which increase the overall complexity of a system and are often affected by the limited access to labeled instances as well. We propose a simple, generic, single-classifier framework that uses aggressive online updates based on instances selected for labeling by an active learning strategy. Our implementation utilizes a reactive sliding window that stores the most recent labeled instances and exposes them several times to the online classifier in order to exploit the knowledge that comes with them. We present three different probabilistic sampling methods that determine how the instances are selected from the window. The general assumption is that while active learning will effectively explore a decision space, by selecting the most representative seeds, the instance exploitation wrapper will sufficiently utilize them in order to enhance the adaptation of a classifier using only a few labels. That should help us tackle the underfitting problem. The generic wrapper framework is presented in Alg. 1. When a new instance x appears, the actual budget spending b̂ is checked. Since the data stream is infinite by definition, the current spending has to be estimated as the ratio of objects that have already been labeled to the total number of registered instances [12]. If the value is below the available budget B, we use the active learning strategy to check if the instance is worth labeling. Some of the online strategies that can be used here were presented in [12]. These methods focus on selecting instances close to the decision boundary, based on the uncertainty of a classifier.
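A minimal sketch of such an uncertainty-based query routine is given below, in the spirit of the variable-threshold strategies from [12]; the class name, the multiplicative threshold updates, and the randomization factor are our assumptions for illustration, not the exact RandVar algorithm:

    import random

    class UncertaintyQuery:
        def __init__(self, threshold=0.9, step=0.01):
            self.threshold, self.step = threshold, step

        def should_label(self, posterior_max):
            # Randomize the threshold so that querying occasionally
            # reaches beyond the close neighborhood of the boundary.
            eta = self.threshold * (1.0 + self.step * random.gauss(0.0, 1.0))
            if posterior_max < eta:
                # Uncertain prediction: query, then tighten the threshold
                # to balance future budget spending.
                self.threshold *= 1.0 - self.step
                return True
            # Confident prediction: skip, but loosen the threshold so
            # that labeling does not stall on stable streams.
            self.threshold *= 1.0 + self.step
            return False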
The idea behind such uncertainty-based selection is that those regions close to the decision boundary are most likely to be affected by drifts. In order to make the strategies more exploratory and sensitive to changes in other subspaces, randomization techniques can be utilized. One of the methods implementing such hybrid comprehensive querying is the RandVar algorithm. In addition to the complex sampling, the strategy uses a variable threshold balancing budget spending. We select this method as our default strategy due to its theoretically sound design. After receiving a positive response, the object is queried (we obtain its true label y), the current labeling expenses are increased and the classifier is updated with the labeled instance. Then, we activate our instance exploitation strategy, which selects indices I of previously acquired labeled instances in order to reuse them and boost the adaptation process.
Algorithm 1: Generic instance exploitation wrapper.
Data: labeling budget B, ActiveLearningStrategy, ExploitationStrategy, window size ω_max
Result: classifier L at every iteration
repeat
  receive the incoming instance x;
  if b̂ < B and ActiveLearningStrategy(x) = true then
    request the true label y of instance x;
    update labeling expenses b̂;
    update classifier L and sliding window W with (x, y);
    I ← ExploitationStrategy(W);
    for each i ∈ I: update classifier L with w_i;
until stream ends;
Since we want to adapt a predictive model to current concepts, our exploitation methods work with a batch of the most recent instances. We maintain a sliding window W that is updated with every labeled instance we receive, before using an exploitation strategy. We define the window W as a sequence of ω ∈ [1, ω_max] instances w_i ∈ W, where i ∈ [1, ω], the oldest instance is w_1 and the newest one is w_ω. It is important to remember that for very low budgets, wide windows may not be able to represent the actual concept properly, since the relatively small number of labeled instances will cover only a few batches. For example, if B = 1%, ω_max = 1000 and a stream consists of 100 000 instances, then the window will not slide at all, since all labeled instances will cover exactly one such window. Therefore, in order to have a reactive sliding window, when using a low budget or dealing with a low-rate stream, its size must be adequate. Our exploitation methods consist of strategies that simply reuse instances selected from the sliding window. In practice, they generate a multiset of indices I = {i | i ∈ [1, ω]}, where |I| = λ. The λ parameter determines the intensity of the exploitation process. A single index is selected using the formula i = φ(ω, r) = ⌈rω⌉, where r is a random variable r ∼ P(0, 1). As a result, each strategy is modeled by a probability distribution P(0, 1) - it determines on which parts of the window a strategy focuses. We may see the sliding window as a probabilistic or fuzzy one.
Uniform Window (UW) - the most straightforward solution is to select the instances uniformly, which results in each instance having an equal chance of being chosen (Fig. 3a). The probability value is given by the uniform distribution, so a single object has a 1/ω chance of being used again, where ω is the number of instances within the window after a new object is added. Therefore, the indices i are drawn by the strategy using r = u ∼ U(0, 1), where u is a random variable sampled from the uniform distribution U(0, 1). This approach keeps adapting a classifier to the wider context of the maintained batch of data and by doing so it may reduce the risk of falling into local minima.
On the other hand, if the sliding window is insufficiently reactive, this method may fail at handling more rapid changes, since, in such a case, the strategy will keep providing outdated instances. We assume that this method may work well while dealing with gradual concept drifts, when instances from both older and newer concepts should contribute to updates. The complexity of this method is based solely on the intensity λ, so it is O(λ).
Exponential Window (EW) - instead of selecting instances with equal probability, we can focus more on the newer objects, while leaving a non-zero sampling rate for the rest of the window (Fig. 3b). We model this behavior using the normalized exponential distribution r = e ∼ E_n(0, 1). Utilizing a truncated exponential transform, we have e = -ln(u)/γ, where γ is the parameter of the exponential distribution and u ∼ U(0, 1) [59]. This strategy assumes that a sliding window may imperfectly represent the current concepts, e.g., by having an inadequate size or by insufficient exploitation of the newer instances. It attempts to handle this problem on its own by increasing the chances for the more recent data. Theoretically, this approach may be adequate for more rapid incremental changes and, at the same time, acceptable for gradual and sudden concept drifts. We set γ = 4 as our default value, which provides a functionality balanced between the two other strategies. The complexity is the same as for the previous approach - O(λ).
Single Exposition (SE) - the most extreme method is reusing only the latest instance w_ω (Fig. 3c). Obviously, the probabilities for such a strategy are: P(i = ω) = 1 and P(i < ω) = 0. One of the advantages of this method is that, in practice, we do not even have to maintain a sliding window. It is enough to update the classifier with the instance several times and proceed to the next incoming objects. The strategy is a fully online one. While it has indisputable advantages regarding processing performance, it is also possible that such an intense focus on a single instance may more likely lead to detrimental overfitting, which is the opposite of the problem we want to solve (we spend more time on this aspect in the next sections). On the other hand, it may be a good choice for sudden drifts. The complexity of this strategy is also O(λ).
It is easy to notice that the two most important parameters of the strategies are: the intensity λ and the size of the sliding window ω_max. Generally, we may want to keep λ relatively high for low budgets, low-rate data and unstable temporary concepts to compensate for limited exploration, and low for the opposite, since having an abundance of labeled data and a sufficiently updated model, we can rely on the mechanism less. In addition, we have to control the computation time. When it comes to ω_max, we most likely should do the opposite - keep the size of the window low for limited budgets, low-rate data and drifting concepts, and high for more labeled instances and stable streams. These two heuristics should give us more reactivity to changes (plasticity) in the former case and provide us with more comprehensive and stable generalization (stability) in the latter one. In other words, depending on the situation we may either need to entirely focus on tackling underfitting, or be a bit more careful not to end up with overfitting.
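The three strategies reduce to different choices of the random variable r fed into φ(ω, r) = ⌈rω⌉. The sketch below puts them next to each other; it is a hedged illustration, not the reference implementation - in particular, mapping the exponential sample onto the newest end of the window is our reading of the intended behavior, and learn_one stands for any hypothetical single-instance update routine:

    import math
    import random

    def phi(omega, r):
        # Maps r in (0, 1] to a window index in [1, omega].
        return max(1, math.ceil(r * omega))

    def sample_indices(strategy, omega, lam, gamma=4.0):
        indices = []
        for _ in range(lam):
            if strategy == "SE":            # only the newest instance
                r = 1.0
            elif strategy == "UW":          # uniform over the whole window
                r = random.random()
            else:                           # "EW": truncated exponential
                u = 1.0 - random.random()   # u in (0, 1]
                e = min(-math.log(u) / gamma, 1.0)
                r = 1.0 - e                 # small e -> near the newest end
            indices.append(phi(omega, max(r, 1e-12)))
        return indices

    # Exploitation step of the wrapper (Alg. 1): after the standard update
    # with a freshly labeled instance, replay lam sampled instances.
    def exploit(model, window, strategy, lam):
        for i in sample_indices(strategy, len(window), lam):
            x_i, y_i = window[i - 1]        # w_1 is the oldest instance
            model.learn_one(x_i, y_i)       # hypothetical online update

Here, window can be any indexable buffer of (x, y) pairs bounded by ω_max, e.g., a collections.deque(maxlen=omega_max).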
In our work, we empirically analyze how the values of these parameters should change depending on the amount of labeled data, and we provide a dynamic control technique for adjusting the values adequately to the state of a stream during the learning process. The idea is that since a model should quickly adapt to new data when it suffers from underfitting or after a concept drift, we can control λ and ω_max based on the value of the current error, which should be high in both situations and which is a reliable state-of-the-art input in many drift detectors [22]. Therefore, if ε ∈ [0, 1] is the currently registered error of a model learning from a stream, for the intensity we can use:
λ = ⌈ελ_max⌉, (6)
and for the size of the window:
ω = ⌈(1 - ε)ω_max⌉, (7)
where λ_max and ω_max are fixed maximum values selected for a given budget. Obviously, a new problem emerges - how should we determine the current value of ε? The most straightforward solution is to calculate the error within a sliding window. However, in such a case, we will, again, struggle with finding a proper window size ω that should depend on the amount of available data and the state of a stream. Fortunately, this problem has already been addressed by the well-known ADWIN algorithm [27], which provides an adaptive window capable of calibrating its size adequately to incoming data. At its core, the algorithm checks for each pair of subwindows W_0 and W_1 whether
|μ̂_{W_0} - μ̂_{W_1}| ≥ θ_cut,
where θ_cut is a significance threshold based on the Hoeffding bound [60] using a specified confidence value α_θ. If the condition holds, the algorithm shrinks the window in order to keep only the most recent consistent values, which are used for the calculation of the estimated average ε̂_ADW. We use this indicator as a control signal in our methods, so we have ε = ε̂_ADW. Finally, since ADWIN provides the optimal window size in an online way, we can set ω_max = ω_ADW. We empirically show that all of these choices are well-reasoned, also in our scenario involving strict limitations on supervision. The presented approach to handling underfitting while learning on a budget has one significant potential weakness - it may turn one problem into its opposite. Since our method prefers aggressive updates using a few labeled instances, it becomes more likely that at some point we may encounter the overfitting problem instead of underfitting, even if we try to adequately adjust the dynamic parameters λ and ω_max or when we choose a safer exploitation strategy. To alleviate it, we propose an ensemble of two paired learners, in which one of them performs risky adaptation with instance exploitation (L_r) and the other one learns in a standard way, without any additional enhancement (L_s). Such a simple ensemble can learn in two modes.
Switching - in this mode, we maintain both learners in parallel (Alg. 2) and use only the currently better one for prediction, based on their respective errors ε_r and ε_s (Alg. 3). The assumption is that for some parts of a stream the learner intensively exploiting instances will take the risk in the right direction, providing an efficient up-to-date model more quickly, while for the other parts (mostly more stable ones) it will fall into local minima, giving way to the basic learner. Since the latter tries to learn in a different way, it is possible that its perspective will allow it to temporarily outperform its counterpart, preventing overfitting.
Elevating - in the alternative mode, we not only temporarily switch between the better approaches (Alg.
3), but also elevate the worse model (replacing it with the better one), if the difference between them is significant according to a chosen test δ(ε_r, ε_s, α_e) (Alg. 4), where α_e determines the significance level of the test. In this case, the idea is that instead of waiting for one model to catch up with learning, we simply bring it up to speed instantly by replacing it with the more effective model. One should be careful, however, since having a better model for one timestamp does not mean that it will be easier to adapt it to a new concept from there. To test whether two classifiers are significantly different we can use Welch's test, which assumes that the two populations (for ε_r and ε_s) have unequal variances [61]. We find this assumption reasonable in our case. The statistic t required for the test is obtained using the formula:
t = (ε̄_r - ε̄_s) / √(σ_r²/N_r + σ_s²/N_s),
where the required average errors ε̄, their variances σ² and the numbers of samples N can be easily calculated within a sliding window, including ADWIN, which we use as the default estimator. The same values are required for the degrees of freedom ν, which can be approximated using the Welch-Satterthwaite equation:
ν ≈ (σ_r²/N_r + σ_s²/N_s)² / (σ_r⁴/(N_r²ν_r) + σ_s⁴/(N_s²ν_s)),
where ν_r = N_r - 1 and ν_s = N_s - 1 are the degrees of freedom associated with the variance estimates. Since Welch's test can be easily implemented in our setting, we use it as our default test. It is worth noting that the effectiveness of both modes highly depends on the precision of determining the true value of the current error, especially when supervision is strictly limited. This problem has already been mentioned in Sec. 4.2.3 and we empirically investigate it in the context of the proposed ensembles.
In this section, we present a comprehensive empirical evaluation of the proposed approaches. Our general goal was to provide a complex and, at the same time, in-depth analysis of the strategies and their parameters in different settings. We focused on showing how limiting the labeling budget affects learning from data streams with or without the exploitation wrappers. The specific research questions we asked are given as follows.
• RQ1: Does instance exploitation improve classification while learning from sparsely labeled non-stationary data streams?
• RQ2: How should the proposed methods be configured depending on the available labeling budget and the stability of a stream?
• RQ3: Is the ensemble technique, using the standard and risky classifier, capable of alleviating potential overfitting problems that may occur as a result of using the approach? Does it improve the overall performance?
• RQ4: Are our methods competitive when compared with some of the state-of-the-art classifiers dedicated to data streams?
In order to improve the reproducibility of this study, the source code for the experiments presented in the following sections, along with information about benchmarks and details of configurations, has been uploaded to our public repository: github.com/mlrep/ie-20 . For the purpose of the experiments, we utilized two groups of data stream benchmarks. Each of them was dedicated to a different goal of this study.
Synthetic streams. The first one consists of streams that were created using artificial drift generators. Such benchmarks are commonly used while evaluating algorithms on dynamic data [22, 39]. Their well-defined characteristics can be utilized to evaluate the algorithms in very specific situations, for example, when adapting to stable or unstable concepts, or to different types of drifts.
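Returning briefly to the paired-learner test above, it is simple enough to state directly; a minimal sketch follows, where the function name is illustrative and a full implementation would compare t against a critical value of the t-distribution (e.g., scipy.stats.t.ppf) at level α_e with ν degrees of freedom:

    import math

    def welch_statistic(mean_r, var_r, n_r, mean_s, var_s, n_s):
        # Welch's t statistic for two error estimates with unequal
        # variances, plus the Welch-Satterthwaite degrees of freedom.
        se_r, se_s = var_r / n_r, var_s / n_s
        t = (mean_r - mean_s) / math.sqrt(se_r + se_s)
        nu = (se_r + se_s) ** 2 / (se_r ** 2 / (n_r - 1) + se_s ** 2 / (n_s - 1))
        return t, nu

In the elevating mode, a significant result in favor of one learner triggers replacing the other with its copy; otherwise both learners keep running and only the prediction switches.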
The streams were created using tools available in MOA [62], which provides some state-of-the-art generators that synthesize drifting streams by simulating transitions between different concepts. They are based on the following sigmoidal formula:
f(t) = 1 / (1 + e^(-4(t - t_0)/s)),
where s controls the duration of the change and t_0 is its peak. There are several different concepts that can be generated - gaussian clusters (RBF), decision spaces modeled by a random decision tree (TREE), linearly separated subspaces (SEA) or concepts defined by some fixed formulas (STAG). There is also a simulation of a rotating hyperplane (HYPER), which does not use the function given above. We generated 14 synthetic streams (Tab. 1), using different types of concepts and drifts. The latter include very sudden, less or more gradual, and very slow changes. For the HYPER concept we used mediocre incremental rotations (defined by the rate parameter ρ). This diversification of benchmarks should provide a reliable generalization of observations. Due to the deterministic character of the synthetic streams, we decided to use them in the first part of our experiments, in which we thoroughly investigate the influence of different configurations and the potential usability of the presented heuristics.
Real streams. While the artificial data provides a well-defined environment for evaluation, real-world streams may differ from them in some unspecified, unknown aspects. There is still very little research done on the general dynamics of streams, therefore, it is difficult to say how realistic the concepts and drifts generated by artificial sources are. For this reason, it is always important to evaluate streaming algorithms on real examples. In our experiments, we used 14 state-of-the-art real streams that are widely used in data stream mining [32, 39]. Analogously to the synthetic streams, the chosen set represents a variety of different problems with diverse characteristics (Tab. 1). Most of the selected streams are snapshots of some continuous observations registered in industrial systems or laboratories. All of them can be found either in the UCI repository, Kaggle competitions or in popular related publications. As opposed to the synthetic data, most of the real streams are not transparent when it comes to their internal characteristics. It means that usually we cannot say when exactly a drift occurs or what its specific type is. On the other hand, they provide a more realistic general validation. Because of that, we utilized the real streams in the second part of our experiments, conducting the final comparison between our methods (tuned in the previous step), baseline models without instance exploitation and some state-of-the-art streaming classifiers. By doing so, we also avoided fitting the parameters of our algorithms to a specific testing set and fairly evaluated the proposed default settings.
In this subsection, we present details of the set-up used in the experiments. For any very specific parameters or subroutines, please refer to the source code given in the repository in Sec. 5.
Overview. The experiments were conducted in the following order. First, using the synthetic streams, we evaluated how the exploitation strategies should be configured. We checked how different values of intensity (λ_max) and window size (ω_max) affect the learning process. We also investigated whether adjusting these values dynamically provides any improvements in terms of performance and utilized resources.
As the last part of the tuning stage, we analyzed the behavior of the supporting ensembles, focusing on the correctness of deciding which approach (standard or risky) is currently the better one, depending on the significance threshold (α_e). Finally, we prepared the final comparison between the proposed algorithms, the baseline and other well-known classifiers learning in a conservative way. It was conducted using the real streams. Last but not least, since at its core our work emphasizes the problem of limiting the access to supervision, the whole analysis was considered mainly from the perspective of having different labeling budgets (B).
Evaluated parameters. Below, we enclose the values of parameters that were considered in our experiments. We evaluated all the values given in the third column. The last one presents other parameters that were predetermined before evaluating the main parameter. While the choice of the default value for ω_max and the input error was somewhat justified theoretically and empirically [27], we chose λ either arbitrarily, for the initial tests (ω_max), or based on empirical observations, for the experiments involving ensembles (α_e), which should use reasonably tuned base learners. Due to significant differences, we decided to set an individual λ_max for each of the two classifiers. Multiple values in the last column mean that different settings were used for different labeling budgets. In our experiments, we considered B = {50%, 20%, 10%, 5%, 1%}, and B = 100% where it was necessary, including the final comparison. In addition to the given fixed values, we used α_θ = {0.05, 0.1, 0.1, 0.2, 0.2}, and 0.002 for B = 100% (its default value), in all cases. It is the only parameter that has to be set for ADWIN. For the final comparison, we extracted recommended default configurations, based on the obtained results. We present them after the tuning phase, at the beginning of the final evaluation.
Classifiers. We used two different base learners: AHT and SGD, in order to show that our framework is, indeed, able to work as a flexible wrapper. Since these classifiers use completely different learning approaches, they seem to be good candidates for providing a more general insight. As our baseline (Base) we used AHT and SGD combined with RandVar (ALRV) - the active learning strategy mentioned in Sec. 4.2. Due to the fact that we utilize the same algorithm as the default query strategy in our framework, the proposed methods differ from the baseline only by applying the exploitation strategy. During the final comparison, besides the mentioned baseline, we tested two additional variants using other active learning approaches: random query (ALR) and selective sampling (ALS) [12]. Furthermore, since our methods involve ensemble techniques, we compared them with some of the state-of-the-art committees for streaming data: online bagging (OBAG) [33], incremental bagging with ADWIN (ABAG), adaptive online boosting (ADOB) [63], DWM [36], LNSE [64], AUC [38] (only for AHT due to implementation constraints) and AWE [37] (only for SGD). It is worth noting that the given ensembles represent diverse approaches to adaptation, as mentioned in Sec. 2, which, to the best of our knowledge, have not been evaluated in the context of limited supervision. All of the mentioned classifiers used the recommended parameter settings given in MOA.
Metrics. Most of our experiments measured the predictive performance of the considered algorithms. For this purpose, we collected kappa values in two ways.
The first one involved obtaining averaged global results for whole streams (aggregated predictions), based on the test-then-train procedure [58]. The second approach was focused on registering temporal performance within a sliding window, using the prequential evaluation technique [65]. In the case of synthetic streams, it allowed us to distinguish between stable and drift periods, and to report them separately. It is important to keep in mind that the global average tends to be biased towards stable periods, which are longer in most of the streams. For this reason, we also reported the average of the two separated metrics, which is more balanced. Analogously to the discussion in Sec. 4.2.3, we used the ADWIN-based metrics, as recommended in [65]. In the final phase of the experiments, we employed the Bonferroni-Dunn rank test (α = 0.05) to analyze the statistical generalization and significance of the results. Finally, we find the more specific metrics used in this study self-explanatory. They are briefly introduced with their presentation.
As we mentioned in the previous sections, the first part of the study presents different configurations and versions of the proposed algorithms, evaluated using fully controlled synthetic streams. It should provide the reader with some intuition about what to expect from those techniques under specific levels of supervision. After that, we proceed to the final evaluation and the conclusions that can be drawn based on all the conducted experiments.
Fixed intensity. The first evaluation focused on studying how intense the instance exploitation should be, meaning how many labeled instances from the sliding window should be sampled at each step, after receiving a new object. In Tab. 3 we can see kappa values averaged over all streams for the given budgets and both classifiers. One distinctive observation is that for different classifiers we can observe significantly different preferences. While with AHT the majority of the best results were obtained for λ < 100, SGD worked best with the most intensive exploitation for λ = 1000. It suggests that, in general, AHT adapts to streams better even under limited supervision, while SGD severely suffers from underfitting, leaving plenty of room for improvement. Another pattern that the results suggest is that for AHT using UW and EW there is a trend in the relation between the required intensity and the available labeling budget. Indeed, if we look closer and analyze Fig. 4, which presents the ratio between our strategies and the baseline, we will notice that as supervision is getting more and more limited, we gain more and more from exploiting the labeled instances we have (from about 1.05 to 1.2 on average). Furthermore, one should notice that for lower budgets, below B = 10%, the difference between safer exploitation (λ ≤ 10) and the more risky one (λ ≥ 100) becomes much more significant in favor of the latter approach. Finally, the gain that comes with applying the strategies is higher during drifts (1-1.4), when underfitting is more likely to occur regardless of the budget, than during stable periods (1.02-1.15). One can also notice that the EW strategy (in Fig. 4, the red line represents the best value for a given budget) was slightly better than UW, especially for the lowest budget B = 1%. It is not surprising, since in such scenarios even a reactive sliding window may contain some relatively old instances, while it is more important to focus on the strictly limited amount of the most recent information.
The EW strategy provides this additional focus. Interestingly, in the case of AHT, the SE strategy was able to provide some small improvements (1-1.1 on average) only for lower budgets and only with λ = 1, so when using the newest instance twice. It shows that this strategy may tend to overfit more than UW or EW, as it focuses entirely on one instance without replaying a wider context of instances. Nevertheless, even with λ = 1 the SE strategy provides competitive results for budgets higher than B = 5%, when exploitation becomes less important. Trends presented in Fig. 5 confirm that SGD requires much more attention than AHT, especially for lower budgets. One can easily see that the baseline using this classifier benefits enormously from employing instance exploitation. Regardless of the strategy used, our wrapper was able to improve the learning process 2-3 times over the baseline. The characteristic curvature of the trends comes from the fact that there is a sweet spot between possible improvements and available supervision. Analogously to the results for AHT, the EW strategy seems to be the best choice.
Dynamic intensity. In our next experiment, we investigated whether the proposed heuristic for controlling intensity in a dynamic way - increasing it during drifts and decreasing it for stable concepts, based on the current error - may provide any improvements. Tab. 4 presents the average performance of the strategies using the dynamic control. We can see that the general trends and relations remained the same for both classifiers.
Figure 5: Improvement over Base given a budget for SGD using fixed intensities: 1k, 100, 10, 1, best EW.
More useful information is given by Fig. 6, presenting the trade-off between the obtained change in performance and computational time for less risky (λ_max = {1, 10}) and more risky (λ_max = {100, 1000}) exploitation, compared with its fixed-intensity counterparts. It is clear that those adjustments provide a speed-up approximately proportional to the error used as the control signal (Eq. 6). In the case of AHT, it is the most significant for the worst-performing risky SE (more than 2.5 times faster) and lower for the more reliable strategies and configurations (1.2-1.6). For SGD the speed-up is extremely high (1.5-5.5) due to the generally worse performance of this classifier. We can see that the speed-up for both base learners increases as the budget gets smaller, since the performance gets worse on average when we limit supervision. On the other hand, one should notice that the mentioned improvement in running time comes at some cost in predictive performance. It is significant for SGD, especially on very low budgets (about 0.9 of the kappa obtained for fixed intensity), but barely noticeable for AHT with B ≤ 10% (0.98 at most) and practically negligible for higher budgets. In fact, it even provided some improvements for SE, which exhibited the worst performance so far (which we can most likely attribute to overfitting). The explanation of these observations is simple - since the dynamic control may only lower the intensity, and since SGD gained a substantial amount of enhancement from enabling very intensive instance exploitation, any reduction of it will cause drops in the obtained improvements. Due to the fact that AHT is less reliant on our strategies, its adaptation will be impaired to a lesser extent.
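The dynamic control evaluated here is a direct application of Eq. 6 and Eq. 7; a minimal sketch follows (the clamping to at least one instance is our assumption), where epsilon would come from an ADWIN-style error estimator:

    import math

    def dynamic_lambda(epsilon, lambda_max):
        # Eq. 6: exploit more aggressively when the current error is high.
        return max(1, math.ceil(epsilon * lambda_max))

    def dynamic_omega(epsilon, omega_max):
        # Eq. 7: shrink the window when the current error is high,
        # trading generalization for reactivity.
        return max(1, math.ceil((1.0 - epsilon) * omega_max))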
Fixed window size. In this subsection, we investigate how the size of the sliding window of labeled instances should be configured in order to provide the necessary reactivity to new concepts and, as a consequence, better performance. The average results presented in Tab. 5 outline a clearly visible trend for both classifiers - the less labeled data we have, the smaller the sliding window should be. The differences between the considered sizes for a given budget are more significant and dynamic for AHT, which is visualized in Fig. 7 showing the obtained kappa for both UW and EW. It is clear and intuitive that larger windows (ω_max = 10000 and ω_max = 1000) are more reliable when we have more labeled data and when concepts are stable, since more labeled instances will better generalize a given problem and help prevent overfitting. Once we limit the available supervision, the rate of change between subsequent instances is more likely to increase (Sec. 4). Therefore, in order to keep the representation of current concepts up-to-date, we need to replace older instances with newer ones faster, so the window size should be smaller. For example, we can see that while ω_max = 10000 is the optimal size if we can label half of the whole stream, it becomes completely useless once we limit the budget to B = 1%, when even ω_max = 1000 becomes less adequate than ω_max = 100. The observation that the size has to be even smaller for drifting concepts is not surprising - the problem of developing reactive windows for drifting data has already been covered in many related publications [27]. In the case of SGD, the differences between window sizes are slightly smaller (Fig. 8), since the overall performance of this classifier is predominantly reliant on a proper choice of the level of intensity. Also, it seems that SGD works best with ω_max ≤ 100, which suggests that it prefers intensive updates based on the most recent data. Finally, for both classifiers, the EW strategy was able to reduce the negative impact of too-large window sizes for smaller budgets - one can notice that the curves for ω_max = 10000 and ω_max = 1000 are slightly elevated compared with UW. The reason for this is the inherent focus of EW on the most recent instances, which alleviates the problem of storing too-old instances, as we assumed in Sec. 4.2.2. Regardless of that, we did not find any significant difference between EW and UW in this experiment, which suggests that intensity is the more impactful factor.

Dynamic window size. Analogously to the experiments focused on intensity, we studied the impact of controlling the window size in a dynamic way. In Tab. 6 one can see that while the general trends and relations remain, again, unchanged, using the size returned by ADWIN is definitely the most reliable solution for AHT, providing from 0.02 to more than 0.15 higher kappa values than any other window. For SGD, the differences between smaller sizes and ADWIN are hardly significant for B ≥ 20% and in favor of the former for B ≤ 10%. The trade-off between the quality of classification and memory consumption is presented in Fig. 9. The results for EW (UW exhibits analogous characteristics) show that while combining the dynamic adjustments with smaller windows lets us use less memory (about 1.4-1.7 times less) at the cost of impaired performance (about 0.94), doing the same with larger windows improves both (about 1.1-1.7 for memory and 1.01-1.05 for kappa).
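A minimal sketch of one plausible form of this error-driven shrinking (cf. Eq. 7, whose intuition is discussed next) follows; the proportional form, the lower cap, and the helper names are our assumptions.

from collections import deque

def dynamic_window_size(err, omega_max, omega_min=10):
    """One plausible reading of Eq. 7: keep the full window while the tracked
    error is low (stable concept) and shrink it as the error grows, so that
    instances from an outdated concept are dropped faster."""
    return max(omega_min, int((1.0 - err) * omega_max))

def shrink_window(window: deque, target_size: int):
    """Discard the oldest labeled instances until the window fits the target."""
    while len(window) > target_size:
        window.popleft()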
This behavior is intuitive: the shrinking heuristic (Eq. 7), by making small windows (containing 10-100 instances) even smaller, is more likely to increase the chance of overfitting than to efficiently improve reactivity. On the other hand, making too-large windows more flexible provides enhanced reactivity to new concepts without too significant a loss in generalization, especially for smaller budgets B ≤ 10%, for which concepts evolve at a higher rate. When it comes to using a window size based on ADWIN, besides the good predictive performance, we also observed that this approach tends to aggregate relatively large numbers of instances, especially for high budgets. Since there is no fixed counterpart for ADWIN, in Fig. 9 we present the ratio between ADWIN and a standard window storing a comparable number of instances, which was ω_max = 10000. Based on that, we can see that the ADWIN-driven control provides the presented performance at a substantial memory cost, which decreases with the budget. It is also important that even though ADWIN stores plenty of instances, it maintains a good quality of classification regardless of the number of labeled instances, doing so better than ω_max = 10000 and ω_max = 1000 for AHT and competitively for SGD.

Table 6: Average kappa for AHT and SGD using dynamic sliding windows.

Significance level and ensembles. In the final section of the first part of the experiments, we analyze the significance level α_e, which is the only parameter that has to be set for the elevating strategy. Simultaneously, we also use this section to investigate the potential usefulness of the proposed ensemble techniques. All results for the elevating ensemble (EWE, SEE) using different α_e, along with results for the switching committee (EWS, SES), the baseline, and the single-classifier exploitation methods (EW, SE), are presented in Tab. 7. Since EW performed at least as well as UW in all previous evaluations, we decided to omit the latter in the rest of the study. The presented results clearly indicate that there is no significant difference when changing the α_e value. Furthermore, it seems that for synthetic streams the ensemble methods were not able to provide significant improvements over tuned EW or SE for either classifier. The only exception is SE, for which we purposely set a non-optimal intensity λ_max = 10, slightly higher than the previously obtained results suggested.

Table 7: Average kappa for AHT and SGD ensembles compared with Base and single-classifier models.

This allowed us to check whether the ensembles are able to surpass an ineffective exploitation strategy by using the alternative standard base learner when it is needed. In Fig. 10 we can see that, indeed, the improperly configured SE failed for budgets B ≤ 10%, performing worse than the baseline on stable concepts as a result of overfitting. However, when combined with switching or elevating, the whole method was able to achieve performance at least as good as the baseline, or even to improve upon it for B ≤ 5%. In addition, the ensembles boosted adaptation during concept drifts (by about 0.1), which is a very important observation, since dealing with changes under strictly limited supervision is an extremely challenging task. The elevation of the results obtained for SE is a good indicator that our ensembles should be able to raise the lower bound of our approach, by ensuring that it will never be worse than the baseline. After looking at the results in Fig. 11 we can understand that this is not a coincidence.
The bar plots present how many times, on average, either the standard base learner (Stand-TP, Stand-FP) or the risky one (Risky-TP, Risky-FP) was correctly (true positive, TP) or incorrectly (false positive, FP) elevated with respect to the available budget and the significance level α_e used. Correctness depended on the precision of the error estimate obtained from the partial information provided by labeled instances. We can easily notice that the vast majority of elevations were done correctly. The only configuration that stands out with a visible number of false positives is the one using α_e = 0.2, which is, in fact, an unusual value for Welch's test. Nevertheless, it still did not affect performance in a meaningful way. The results show that we can effectively track the error even under strictly limited supervision and utilize it in the switching or elevating techniques. A more careful analysis of the average number of elevations yields some additional observations. Firstly, the total number of elevations decreases with the budget and the significance level α_e, which is intuitive. Secondly, in most cases, if an exploitation strategy is properly configured, we will mainly be replacing the standard learner with the risky one. This can mostly be seen for EW; however, we would observe the same relation for SE if it were properly configured for AHT, or for SGD with the exclusion of the SEA stream. This was the only benchmark for which the SE strategy was constantly failing when combined with the latter classifier, resulting in more than 500 elevations and biasing the average value. We could have used the median instead; however, we kept the average to show that the SE strategy tends to be more sensitive to improper configuration and overfitting than EW or UW, taking into consideration the results for both AHT and SGD. Finally, one may wonder why, even though we replace EW or SE with a standard learner multiple times, we do not obtain significant improvements over them (except for the specified cases for AHT with SE). The reason may be that even if we elevate the exploiting learner, it keeps falling into the same pitfalls again and again, as the internal characteristics of the streams do not allow for more improvement than the strategies already provide themselves in other parts of the streams. For a similar reason, we cannot see additional enhancements when using switching. Fortunately, the differences are more visible for real data streams, in favor of the ensembles, which makes them more than just safe lower-bound providers.
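Since the elevation decision hinges on Welch's test at level α_e, a minimal sketch of such a check is given below. We use SciPy's two-sided unequal-variance t-test combined with a mean comparison to obtain a directional decision; the bookkeeping of per-instance error indicators and all names are our assumptions, and the paper may use a one-sided variant.

from statistics import mean
from scipy.stats import ttest_ind

def should_elevate(stand_errors, risky_errors, alpha_e=0.05):
    """Return True if the risky (exploiting) learner should replace the
    standard one: its per-instance error indicators (0/1, collected for
    labeled instances only) must be significantly lower according to
    Welch's unequal-variance t-test at significance level alpha_e."""
    if len(stand_errors) < 2 or len(risky_errors) < 2:
        return False  # not enough labeled evidence yet
    _, p_value = ttest_ind(risky_errors, stand_errors, equal_var=False)
    return p_value < alpha_e and mean(risky_errors) < mean(stand_errors)

In the switching variant such a test would only determine which learner currently provides predictions; in the elevating variant the losing learner is additionally replaced with a copy of the winner, which is why elevating incurs the model-replacement cost mentioned later in the recommendations.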
Default configurations. Based on the observations made in the previous sections, we determined default settings for our strategies and used them in the final comparison, which was conducted on real streams. Since we found the trade-offs obtained for the dynamic heuristics reasonable, we decided to use dynamic intensity controlled by the ADWIN error (ϵ = ϵ_ADW), as well as a dynamic window size determined by the same algorithm (ω_max = ω_ADW). We used the same significance levels α_θ as for the synthetic streams. For our elevating ensembles, due to the lack of impact, we chose one of the widely used values, α_e = 0.05. Finally, in order to find a compromise between the evidence suggesting lower values of intensity for AHT and making sure that we fully utilize the potential of our strategies, we distinguished two groups of λ_max values: less risky λ_max^l and risky λ_max^r. In addition, since AHT and SGD exhibited substantially different preferences regarding the required intensity, we set λ_max^l = {1, 1, 1, 1, 1, 10} and λ_max^r = {100, 100, 100, 1000, 1000, 1000} for AHT, where each value corresponds to a budget ranging from B = 100% to B = 1%, respectively, and λ_max^l = 10, λ_max^r = 1000 for SGD.

Results. The average performance of all considered classifiers under varying budgets is presented in Tab. 8. It provides a general overview of the final comparison. The main observation is that the proposed instance exploitation strategies are capable of providing the best predictive quality (in bold) out of all the models, and that the risky configurations exhibited much better quality than the more conservative ones. Most importantly, our best strategies were able to outperform the baseline models using only active learning (ALR, ALRV, ALS) for both base learners (by about 0.1 kappa for AHT and 0.15 for SGD), regardless of the available budget. This shows that the depicted problem of underfitting while learning temporary drifting concepts under limited supervision is critical in real scenarios, and that active learning fails to solve it efficiently on its own. In Fig. 12 we can see the gain obtained from using our exploitation strategies compared with the best baseline model using only active learning (AL) for each budget. It provides more insight into the trends and relations present in the aggregated results. Firstly, as we already mentioned, the riskier approaches, applying more intense exploitation of labeled instances, outperform the safer configurations in almost all cases. We can see that EW_r and EWS_r provided about 0.1-0.2 improvement in gain over EW_l and EWS_l for both AHT and SGD. In the case of SE and SES, the enhancement remained the same for SGD and for AHT using SES. These relations are analogous for EWE and SEE. Interestingly, for AHT with SE_r we noticed that while for lower budgets the base learner did not work well with this extreme strategy, the switching technique SES_r empowered the strategy to become even more efficient than SE_l, which was initially better than SE_r. This shows that adopting even a very risky adaptation while having a good lower-bound backup may still be beneficial. On the other hand, the results for AHT using SE_r and SE_l indicate that we may need to be a bit more careful with more reactive classifiers. The statistical rank tests presented in Fig. 13 prove that the advantage of the riskier exploitation is significant, as the best risky strategies for AHT (EWS_r, EWE_r) and SGD (SE_r, SES_r, SEE_r) are significantly better than any less risky method. Secondly, Fig. 12 shows that, generally and analogously to the results for the synthetic streams, the improvement over the baseline models increases as we limit the available supervision. This is, again, intuitively correct, since if we have fewer labeled instances, the risk of encountering underfitting becomes higher. The only exception to this trend can be noticed when using the extremely limited budget B = 1%, and the reason for it is the same as the one given for the characteristic curve for SGD in Fig. 5. The increase is clearer for SGD, which, in general, adapts less efficiently than AHT, giving more occasions for improvements. Regardless of the budget, almost all strategies turned out to be significantly better than a model without instance exploitation (Fig. 13, ALRV, in bold), which proves the general usefulness of the proposed approaches.
Interestingly, since our methods were able to provide some improvements also for fully labeled streams (B = 100%), we can assume that they may enhance adaptation to unstable data in general - limiting the budget only makes the problem harder and more severe.

Figure 12: Improvement over the best AL given a budget for AHT and SGD using different strategies on real data streams.

Finally, based on Fig. 12 and Fig. 13, we can claim that the switching and elevating ensembles provide the intended improvements. In almost all cases they maintained performance at least as good as the best base learner, either a single-classifier strategy (SE, EW) or a standard model without exploitation (ALRV). As a consequence, the ensembles are always at least as good as the baseline, providing noticeable improvements in almost all cases, as opposed to the simpler strategies (SE_r for AHT and EW_l for SGD). Fig. 13 suggests that the improvements over the single-classifier methods are usually on the verge of significance. The only exception is, one more time, SE_r for SGD - since it worked outstandingly well with this classifier (especially for the lowest budgets), there was a low chance that the baseline could provide a better prediction and, as a consequence, any incorrect switch led to some reduction in improvements. This once again shows that even very extreme and risky exploitation makes sense. In addition, compared with the results obtained for synthetic streams, the ensembles were also able to improve upon SE and EW in some cases, even if the simpler strategies already provided gain over the baseline. This very likely shows that real streams are more unstable or complex than synthetic ones, thus requiring more intensive adaptation from classifiers. Last but not least, we did not obtain a significant difference between the switching and elevating techniques. As the final part of our experimental study, we compared our best strategies (the risky ones) with some state-of-the-art classifiers, including models supported by different active learning strategies and popular ensembles. Based on the results in Tab. 8 and the statistical tests in Fig. 14, we can claim that our best strategies (in bold in the latter) are capable of significantly improving upon all baseline models using only active learning and upon most of the considered ensembles. We can distinguish the EWE_r, EWS_r and SES_r methods. They outperformed all other classifiers on average, except for ABAG and DWM, which were significantly competitive, especially for AHT, even under strictly limited supervision; this demonstrates their generally good resilience to such limitations. The EW-based ensembles provided the best results with AHT, while the SE-based ones dominated the highest ranks with SGD. This shows that our ensembles are able not only to improve upon single-classifier models without instance exploitation, but also to compete with state-of-the-art ensembles designed for data streams. One should also keep in mind that while all the other committees use at least 10 base learners, our ensemble utilizes only two. The complementary results presented in Fig. 15 show the frequencies of the ranks occupied by each of the considered methods given high (B ≥ 20%) and low (B ≤ 10%) budgets. Unsurprisingly, we can see that the best strategies were most frequently the best ones (green bars), while very rarely ending up as the worst ones (red bars), regardless of the available supervision.
One interesting observation is that our ensembles were able to maintain safer lower bounds also for the presented ranks - they very efficiently reduced the number of lowest ranks compared with their simpler counterparts (SE_r, EW_r), bringing them down to zero for the best strategies (EWS_r and EWE_r for AHT, SES_r and SEE_r for SGD). This is important if one cares about the worst-case scenario.

Figure 15: Ranks for AHT and SGD using different strategies compared with other classifiers on real data streams.

In order to provide the reader with some specific examples of how our methods work in practice, in Fig. 16 and Fig. 17 we present accuracy series (instead of kappa, for readability) registered for all real streams when the budget was reasonably limited to B = 10%. The best exploiting strategy, the best baseline using active learning, and the best other ensemble were chosen for each stream separately. For most of the streams, we can easily notice that instance exploitation (green) was able to elevate the temporal performance of standard models that relied solely on active learning (red). The improvements have a diverse character. For example, our strategies were able to provide an overall more stable, saturated, and higher level of predictive performance for Activity-Raw, Covertype, EEG, Sensor, or Weather, among others, for both AHT and SGD. For other streams, the enhanced adaptation meant that instance exploitation was able to alleviate severe temporal drops in accuracy, most likely caused by concept drifts, for example, for Gas, Poker, or Spam. There were very few exceptions for which none of our strategies was able to provide improvements, such as Airlines and DJ30, or Crimes for SGD. Finally, when compared with the considered state-of-the-art ensembles (blue), we can also see that our strategies were, in most cases, competitive, improving adaptation in the same way as against the baseline models.

Figure 17: Accuracy series for SGD given B = 10%: the best strategy vs. the best AL and the best (not our) ensemble.

To summarize all the presented results and observations, we conclude our study with a list of the most important findings. Primarily, they address the research questions introduced at the beginning of the experimental study.

• Instance exploitation improves learning from drifting streams on a budget (RQ1). In most of the considered cases, almost all strategies provided significant improvements over the baseline using only active learning and adapting in a conservative way. Without more intensive learning, the base classifiers were not able to deal with strict supervision limitations and drifts at the same time, even when supported by more strategic label queries. They struggled to provide reactive, up-to-date models for dynamic temporary concepts and ended up with severe, unresolved underfitting. Our method addresses this problem by forcing the classifiers to take the risk of exploiting the only labeled instances they can use, enhancing the adaptation process as a result. The improvements were more impactful for real streams. Given the abundance of positive evidence collected from diverse measurements and statistical tests, and supported by the fact that a generally higher intensity of exploitation was preferred, we showed that the problem of underfitting is critical in realistic streaming scenarios and that risking overfitting by using instance exploitation is more reasonable in practice.

• No free lunch theorem for strategies and settings (RQ2).
We were not able to definitively select one of the three strategies (UW, EW, SE) as the best one. While EW_r worked best with AHT, SGD exhibited the best synergy with SE_r. Furthermore, although all the strategies generally worked better with the riskier settings using higher values of the intensity λ_max, we were forced to distinguish different values of this parameter for different classifiers. Also, AHT preferred larger sliding windows than SGD. The distinction is caused by the fact that some algorithms are more reactive to changes (AHT) than others (SGD); therefore, they may need different amounts of instance exploitation.

• Dynamic control comes with a reasonable trade-off (RQ2). The heuristics for controlling the intensity and the window size in a dynamic way do not provide improvements in predictive performance in most cases, even impairing it to some limited extent. On the other hand, they are able to reduce running time and memory consumption, respectively. For the latter, adjusting the window size based on ADWIN is a notable exception to these observations - this algorithm provides the best quality of classification for AHT and a reliable one for SGD, while requiring more memory than most of the other windows. We would recommend using a fixed intensity λ_max and a dynamic window size based on ADWIN (ω_max = ω_ADW) if resource utilization is not a concern. Otherwise, we suggest using dynamic intensity (Eq. 6) and a dynamic window (Eq. 7) with a limited size ω_max = 1000, while utilizing the ADWIN-based error as the control signal (ϵ = ϵ_ADW); these recommendations are condensed into a short configuration sketch after this list.

• Switching and elevating alleviate the risk of overfitting (RQ3). The ensemble-based techniques for avoiding turning one problem (underfitting) into another (overfitting) provided the intended improvements. The methods are able to guarantee a more reliable lower bound on performance. Since we can efficiently track errors even under limited supervision, the ensembles can correctly determine the temporarily better approach. As a consequence, in the vast majority of cases (except for SGD using SE_r) they are at least as good as the standard or risky learner, providing additional enhancements in many scenarios. One should, however, keep in mind that the advantage of using the ensembles is usually on the brink of significance. Furthermore, we did not observe a significant difference between switching and elevating; therefore, since the latter requires very time-consuming model replacements, we recommend using switching. Finally, taking into account all considered factors, we can distinguish EWS and SES as the best methods on average.

• Proposed strategies are competitive against other streaming classifiers (RQ4). The final comparison revealed that our algorithms, mainly the committees, not only improve upon the baseline using only active learning, but can also be at least competitive against some of the state-of-the-art streaming ensembles, including bagging and boosting, and outperform them in many cases. At the same time, while the other ensembles utilize 10 base learners, ours use only two. Finally, although we could extend this comparison to more ensembles, one should note that this study was not primarily focused on ensemble-based methods. In fact, there is a lot of potential for further improvements in this direction. We conducted only a limited comparison to place our algorithms in a recognizable context.
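As promised above, the following sketch condenses the dynamic-control recommendation, reusing the dynamic_intensity and dynamic_window_size helpers sketched earlier. The adwin object is assumed to track the classifier's error and expose estimation (current error estimate) and width (adaptive window length), as ADWIN implementations in MOA or the river library do; the exact attribute names, like everything else here, are assumptions rather than the authors' API.

def configure_wrapper(adwin, resources_limited):
    """Pick intensity and window size according to the recommendations above."""
    err = adwin.estimation                           # control signal eps_ADW
    if resources_limited:
        lam = dynamic_intensity(err, lam_max=1000)          # Eq. 6
        omega = dynamic_window_size(err, omega_max=1000)    # Eq. 7, capped size
    else:
        lam = 1000                                   # fixed lambda_max
        omega = adwin.width                          # omega_ADW: ADWIN-driven size
    return lam, omega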
In this paper, we have addressed a challenging yet crucial issue in the data stream mining domain - how to increase the adaptation capabilities of classifiers under concept drift while dealing with very limited access to class labels. Sparsely labeled data streams are predominant in a plethora of real-world applications, and the lack of labeled instances increases the already high difficulty of recovering from changes. We have proposed a novel and flexible framework for instance exploitation in order to boost learning from streaming data on a budget. By using a sliding window with probabilistic sampling to select instances for additional exposure to the online classifier, we were able to force faster adaptation rates, resulting in significantly improved robustness to concept drift. We developed three instance exploitation strategies to alleviate the problem of underfitting of classifiers to newly emerging concepts through various levels of exposition of the labeled instances obtained from active learning. In order to minimize the risk of overfitting, we have proposed two ensemble architectures that dynamically switch between learners based on aggressive and standard instance exposition. All our strategies are incorporated in a flexible wrapper framework capable of working with any active learning strategy and online classifier. Based on an extensive experimental study, we were able to show the tremendous gains from using our strategies while learning temporary concepts from data streams with limited access to class labels. The analysis of the results allowed us to reach in-depth and unique insights into the adaptation of classifiers under concept drift, showing how access (or the lack thereof) to an adequate number of labeled instances is of crucial importance in dynamic settings. We formulated a set of recommendations on how and when to use the proposed strategies in order to improve learning rates from sparsely labeled non-stationary data streams at no additional cost. Our future work will concentrate on using instance exploitation to leverage ensemble learning and control ensemble diversity under concept drift, as well as to improve learning from sparsely labeled imbalanced data streams.
References

Learning in Nonstationary Environments: A Survey
Databases, data tombs and dust in the wind
Dynamics of the COVID-19 Contagion and Mortality: Country Factors, Social Media, and Market Response Evidence From a Global Panel Analysis
COVID-19 Future Forecasting Using Supervised Machine Learning Models
Evolving rule-based classifiers with genetic programming on GPUs for drifting data streams
Learning under Concept Drift: A Review
Survey of distance measures for quantifying concept drift and shift in numeric data
PCA-based drift and shift quantification framework for multidimensional data
Ensemble learning for data stream analysis: A survey
A survey on data preprocessing for data stream mining: Current status and future directions
Recovery analysis for adaptive learning from non-stationary data streams: Experimental design and case study
Active Learning With Drifting Streaming Data
Handling delayed labels in temporally evolving data streams
Analyzing concept drift: A case study in the financial sector
GMM-VRD: A Gaussian Mixture Model for Dealing With Virtual and Real Concept Drifts
Learning with Local Drift Detection
SCR: simulated concept recurrence - a non-supervised tool for dealing with shifting concept
Employing dropout regularization to classify recurring drifted data streams
Online ensemble learning with abstaining classifiers for drifting and noisy data streams
Handling adversarial concept drift in streaming data
A survey on feature drift adaptation: Definition, benchmark, challenges and future directions
A drift detection method based on dynamic classifier selection
Unsupervised Drift Detector Ensembles for Data Stream Mining
Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark
Multi-Label Punitive kNN with Self-Adjusting Memory for Drifting Data Streams
Learning from Time-Changing Data with Adaptive Windowing
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map
Adaptive Learning from Evolving Data Streams
Variance-Reduced Stochastic Gradient Descent on Streaming Data
A survey of multiple classifier systems as hybrid systems
Leveraging Bagging for Evolving Data Streams
Online Bagging and Boosting
Adaptive random forests for evolving data stream classification
Clustering-Driven and Dynamically Diversified Ensemble for Drifting Data Streams
Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
Mining concept-drifting data streams using ensemble classifiers
Reacting to Different Types of Concept Drift: The Accuracy Updated Ensemble Algorithm
Kappa Updated Ensemble for drifting data stream mining
Sentiment analysis on big sparse data streams with limited labels, Knowledge and Information Systems
Classification of Evolving Data Streams with Infinitely Delayed Labels
A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective
Learning from crowdsourced labeled data: a survey
Semi-supervised learning in nonstationary environments
Classification of Data Streams by Incremental Semi-supervised Fuzzy Clustering
Semi-Supervised Learning on Data Streams via Temporal Label Propagation
Co-training Semi-supervised Learning for Single-Target Regression in Data Streams Using AMRules
COMPOSE: A Semisupervised Learning Framework for Initially Labeled Nonstationary Streaming Data
Learning Dynamics of Decision Boundaries without Additional Labeled Data
Semi-Supervised Learning
Combining Active Learning and Self-Labeling for Data Stream Mining
On-line active learning: A new paradigm to improve practical useability of data stream modeling methods
A Bi-Criteria Active Learning Algorithm for Dynamic Data Streams
Active Learning with Abstaining Classifiers for Imbalanced Drifting Data Streams
New Class Adaptation Via Instance Generation in One-Pass Class Incremental Learning
Advances in Neural Information Processing Systems
Learning from imbalanced data: open challenges and future directions
A survey on concept drift adaptation
On the expectation of the maximum of IID geometric random variables
Probability Inequalities for Sums of Bounded Random Variables
The Generalization of 'Student's' Problem when Several Different Population Variances are Involved
MOA: Massive Online Analysis
Machine Learning and Knowledge Discovery in Databases
Incremental Learning of Concept Drift in Nonstationary Environments
Efficient Online Evaluation of Big Data Stream Classifiers