Differential Replication in Machine Learning
Irene Unceta, Jordi Nin, Oriol Pujol
2020-07-15

When deployed in the wild, machine learning models are usually confronted with data and requirements that constantly vary, either because of changes in the generating distribution or because external constraints change the environment where the model operates. To survive in such an ecosystem, machine learning models need to adapt to new conditions by evolving over time. The idea of model adaptability has been studied from different perspectives. In this paper, we propose a solution based on reusing the knowledge acquired by already deployed machine learning models and leveraging it to train future generations. This is the idea behind differential replication of machine learning models.

A machine learning system's environment is prone to change in time. The Gartner Data Science Team Survey [2] found that over 60% of machine learning models developed in companies are never actually put into production, mostly due to a lack of alignment with environmental demands. To survive in such an ecosystem, machine learning models need to adapt to new conditions by learning to evolve over time. This notion of adaptability has been present in the literature since the early days of machine learning, as practitioners have had to devise ways to adapt theoretical proposals to their everyday scenarios [3, 4, 5]. As the discipline has evolved, so have the techniques available to this end.

Consider, for example, situations where the underlying data distribution changes, resulting in concept drift. Traditional batch learners are incapable of adapting to such drifts. Online learning algorithms were devised [6] to succeed in this task by iteratively updating their knowledge according to the data shifts. In other situations, it is not the data that change but the business needs themselves. For instance, fraud detection algorithms [7] are regularly retrained to incorporate new types of fraud. Commercial machine learning applications are designed to answer very specific business objectives that may evolve in time. Take, for example, the case where a company wants to focus on a new client portfolio. This may require evolving from a binary classification setting to a multi-class configuration [8]. Under such circumstances, the operating point of a trained classifier can be changed to adapt it to the new policy. Alternatively, it is also possible to add patches in the form of wrappers to already deployed models. These endow models with new traits or functionalities that help them adapt to the new data conditions, either globally [9] or locally [10].

In contrast, when it is the very nature of the system that changes, alternatives are scarce. When change affects the entire ecosystem of a model, most existing solutions are not applicable. Say, for example, that one of the original input attributes is no longer available, that a deployed black-box solution is now required to be interpretable, or that updated software licenses require moving to a new production environment. It is such drastic changes in the demands of a machine learning system that this article is concerned with.
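Before turning to such drastic changes, the following minimal sketch illustrates one of the lighter forms of adaptation mentioned above: moving the operating point of an already trained classifier to reflect a new business policy, without retraining. The dataset, the logistic regression model, and the target precision of 90% are illustrative assumptions introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical deployed classifier: a logistic regression on synthetic data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
deployed_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Assumed new business policy: operate at >= 90% precision.
# Instead of retraining, move the operating point by picking a new threshold.
probs = deployed_model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, probs)
candidates = np.where(precision[:-1] >= 0.90)[0]
new_threshold = thresholds[candidates[0]] if len(candidates) else 0.5

def adapted_predict(X_new):
    """Wrapper around the deployed model implementing the new policy."""
    return (deployed_model.predict_proba(X_new)[:, 1] >= new_threshold).astype(int)

print("New operating threshold:", round(float(new_threshold), 3))
print("Validation predictions under new policy:", adapted_predict(X_val)[:5])
```

The wrapper leaves the trained model untouched and only changes how its scores are turned into decisions, which is why this kind of adaptation is cheap but limited in scope.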
A straightforward solution in this context is to discard the existing model and re-train a new one. However, by discarding the given solution, we lose all the acquired knowledge and have to rebuild and validate the full machine learning stack. This is seldom the most efficient, nor the most effective, way of tackling this challenge. In this paper, we explore this problem and advocate for imitating the way in which biological systems adapt to changes. In particular, we stress the importance of reusing the knowledge acquired by the already existing machine learning model in order to train a second generation that is better adapted to the new conditions.

Adaptation of machine learning models to new scenarios or domains is a well-known problem in the machine learning literature. The best known research branches are transfer learning and domain adaptation. Domain adaptation usually deals with changes in the data distributions as time evolves, or with learning knowledge representations for one domain such that they can be transferred to another, related target domain. For example, due to the COVID-19 pandemic, several countries decided to accept card payments without introducing the PIN up to 50 euros, instead of the previous 20 euro limit, in order to minimize the interactions of card holders with the points of sale. This domain modification may affect card fraud detection algorithms, requiring modifications in their usage. Transfer learning, in turn, refers to the specific case where the knowledge acquired when solving one task is recycled to solve a different, yet related task [11]. In general, the problem of transfer learning can be mathematically framed as follows. Given source and target domains D_s and D_t and their corresponding tasks T_s and T_t, the goal of transfer learning is to build a target conditional distribution P(y_t|x_t) in D_t for task T_t from the information obtained when learning D_s and T_s, where T_s ≠ T_t or D_s ≠ D_t. Compared with traditional learning, the advantages of this kind of learning are that learning is performed much faster, requires less data, and can even achieve better accuracy.

In this work we focus on a different adaptation problem. In our described scenario, the task remains the same but the existing solution is no longer fit because of changes in the environmental constraints. We can frame the problem at hand using the former notation as follows. Given a domain D_s, its corresponding task T, and the set of original environmental constraints C_s that make the solution of this problem feasible, we assume a scenario (Scenario I) where a hypothesis space H_s has been defined. In this context, we want to learn a new solution for the same task and for a new target scenario (Scenario II) defined by the set of feasibility constraints C_t, where C_t ≠ C_s. This may or may not require the definition of a new hypothesis space H_t. In a concise form, and considering an optimization framework, this can be rewritten as

    Scenario I:   h_s = arg max over h in H_s of P(y|x; h), subject to h fulfilling C_s,
    Scenario II:  h_t = arg max over h in H_t of P(y|x; h), subject to h fulfilling C_t.

This problem corresponds to that of environmental adaptation. Under this notation, the original solution corresponds to a model h_s that belongs to the hypothesis space H_s defined for the first scenario. This is a model that fulfills the constraints C_s and maximizes P(y|x; h) for a training dataset S = {(x, y)}, defined by the task T on the domain D_s.
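To make the abstract ingredients H, C_s and C_t more tangible, the following toy sketch represents environmental constraints as predicates over candidate models and checks whether the currently deployed solution remains feasible when the constraint set changes. All concrete names (the latency budget, the interpretability flag, the candidate models themselves) are illustrative assumptions introduced only for this example, not part of the original formulation.

```python
from dataclasses import dataclass

# Toy description of a trained model and the properties the constraints refer to.
@dataclass
class ModelCard:
    name: str
    interpretable: bool
    latency_ms: float
    uses_feature_age: bool

# Scenario I constraints C_s: predictions must be served in under 50 ms.
C_s = [lambda m: m.latency_ms < 50.0]

# Scenario II constraints C_t: same latency budget, but the model must now be
# interpretable and the attribute "age" is no longer available.
C_t = [
    lambda m: m.latency_ms < 50.0,
    lambda m: m.interpretable,
    lambda m: not m.uses_feature_age,
]

def feasible(model, constraints):
    """A model is feasible in a scenario if it satisfies every constraint."""
    return all(c(model) for c in constraints)

# h_s: the solution deployed under Scenario I (say, a kernel SVM).
h_s = ModelCard("rbf_svm", interpretable=False, latency_ms=12.0, uses_feature_age=True)

print("h_s feasible under C_s:", feasible(h_s, C_s))   # True
print("h_s feasible under C_t:", feasible(h_s, C_t))   # False -> adaptation needed

# If no model in the original hypothesis space H_s satisfies C_t,
# environmental adaptation requires defining a new hypothesis space H_t.
H_s = [h_s, ModelCard("deep_ensemble", False, 80.0, True)]
print("Any feasible model in H_s:", any(feasible(h, C_t) for h in H_s))
```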
Adaptation involves transitioning from this original scenario to Scenario II, a process which may be straightforward, although in general it is not. The solution obtained for the first scenario may be unfeasible in the second, thus effectively ending the lifespan of the learned model. This happens when h_s lies outside of the feasible set defined by C_t. In the most benign case, there may exist another solution from the original hypothesis space, H_s, that fulfills the new constraints. However, there might be no overlap between the feasible set defined by C_t and the set of models defined by H_s. This implies that no model of that hypothesis space can be considered a solution. In such cases, adaptation involves defining a new hypothesis space H_t altogether. Take, for example, the case of an application based on a support vector machine with a multivariate Gaussian kernel. Assume that, due to changes in the existing regulation, models are required to be fully interpretable in the considered application. The new set of constraints is not compatible with the original scenario and hence we would require a complete change of substrate.

In this context, we introduce the notion of differential replication of machine learning models as an efficient approach to ensure model survival in a new, demanding environment, by building on knowledge acquired in previous generations. This amounts to solving Scenario II considering the solution obtained for Scenario I. Differential replication ensures the environmental adaptation of machine learning models when they are subjected to constant changes.

"When copies are made with variation, and some variations are in some tiny way "better" (just better enough so that more copies of them get made in the next batch), this will lead inexorably to the ratcheting process of design."

Solving the former environmental adaptation problem can sometimes be done straightforwardly by discarding the existing model and re-training a new one. This is possible when the new hypothesis space H_t is defined so as to provide feasible solutions for the constraint set C_t. However, it is worth considering the costs of this approach. In general, rebuilding a model from scratch (i) implies obtaining clearance from the legal, business, ethical, and engineering departments, (ii) does not guarantee that a good or better solution of the objective function will be achieved¹, (iii) requires a whole new iteration of the machine learning pipeline, which is costly and time-consuming, and (iv) assumes full access to the training dataset, which may no longer be available or may require a very complex version control process. Moreover, in many companies machine learning solutions are kept up to date using automated systems that continuously evaluate and retrain models, a technique known as continuous learning. Note, however, that this may require huge storage space, due to the need to save all the new incoming information. Hence, even in the best case scenario, re-training is an expensive and difficult approach that assumes a certain level of knowledge that is not always guaranteed. Nonetheless, this is the most commonly used technique for solving the environmental adaptation problem. In what follows we consider other techniques.

Environmental adaptation is well known in living beings. Under the theory of natural selection, adaptation relies on changes in the phenotype of a species over several generations to guarantee its survival and evolution. This is sometimes referred to as differential reproduction.
Along the same lines, we define differential replication of a machine learning model as a cloning process in which traits are inherited from generation to generation, while at the same time variations are added that make descendants more fit to the new environment. More formally, differential replication refers to the process of finding a solution h_t that fulfills the constraints C_t, i.e. a feasible solution, while preserving or inheriting features from h_s. Note that, in general, P(y|x; h_t) ∼ P(y|x; h_s). In the best case scenario, we would like to preserve or improve the performance of the source solution h_s, also known as the parent. However, this is a requirement that cannot always be met. In a biological simile, requiring a cheetah to be able to fly may imply losing its ability to run fast.

In this section, we consider existing approaches to implementing differential replication in its attempt to solve the problem of environmental adaptation. The notion of differential replication is built on top of two concepts. First, there is some inheritance mechanism that is able to transfer key aspects from the previous generation to the next. This accounts for the name replication. Second, the next generation should display new features or traits not present in its parents. This corresponds to the idea of differential. These new traits should make the new generation more fit to the current environment, enabling environmental adaptation of the offspring.

Particularizing to machine learning models, implementing the concept of differential may involve a fundamental change in the substratum of the given model. This means we need a new hypothesis space that ensures that at least part of the models in that space fulfill the constraints of the new environment, C_t. Consider, for example, the case of a large ensemble of classifiers. In highly time-demanding tasks, this model may be too slow to provide real-time predictions when deployed in production. Differential replication of this model enables moving from this architecture to a simpler, more efficient one, such as that of a shallow neural network [12]. This "child" network can inherit the decision behavior of its predecessor while at the same time being more fit to the new environment.

Conversely, replication requires that some behavioral aspect be inherited by the next generation. Usually, it is the model's decision behavior that is inherited, so that the next generation replicates the parent's decision boundary. Replication can be attained in many different ways. As shown in Fig. 1, depending on the amount of knowledge that is assumed about the initial data and model, mechanisms for inheritance can be categorized as follows:

• Inheritance by sharing the dataset: Two models trained on the same data are bound to learn similar decision boundaries. This is the weakest form of inheritance possible, where no actual information is transferred from source to target. Here the decision boundary is reproduced indirectly, mediated through the data themselves. Re-training falls under this category. This form of inheritance requires no access to the parent model, but assumes knowledge of its training data.

• Inheritance using edited data: Editing is a methodology for selecting data for training purposes [13, 14, 15]. Editing can be used to preserve those data that are relevant to the decision boundary learned by the original solution and use them to train the next generation.
Take, for example, the case where the source hypothesis space corresponds to the family of support vector machines. In training a differential replica, one could retain only those data points that were identified as support vectors. This mechanism assumes full access to the model internals, as well as to the training data.

• Inheritance using model-driven enriched data: Data enrichment is a way of adding new information to the training dataset through either the features or the labels. In this scenario, each data sample in the original training set is augmented using information from the parent's decision behavior. For example, a sample can be enriched by adding additional features based on the prediction results of a set of classifiers. Alternatively, instead of learning hard targets, one can use the parent's class probability outputs or logits as soft targets; this richer information can be exploited to build a new generation that is closer in behavior to the parent (see the sketch after this list). Under this category fall methods like model distillation [12, 16, 17, 18], as well as techniques such as label regularization [19, 20] and label refinery [21]. In general, this form of inheritance requires access to the source model and is performed under the assumption of full knowledge of the training data.

• Inheritance by enriched data synthesis: A similar scenario is that where the original training data are not accessible, but the model internals are open for inspection. In this situation, the use of synthetic datasets has been explored [12, 22]. In some cases, intermediate information about the representations learned by the source model is also used as a training set for the next generation. This form of inheritance can be understood as a zero-shot distillation [23].

• Inheritance of the model's internal knowledge: In some cases, it is possible to access the internal representations of the parent model, so that more explicit knowledge can be used to build the next generation [24, 25]. For example, if both parent and child are neural networks, one can force the mid-layer representations to be shared between them [26]. Alternatively, one could use the second-level rules of a decision tree to guide the next generation of rule-based decision models.

• Inheritance by copying: In highly regulated environments, access to the original training samples or to the model internals is not possible. In this context, experience can also be transmitted from one model to its differential replica using synthetic data points labelled according to the hard predictions of the source model. This has been referred to as copying [27] (also illustrated in the sketch after this list).

Note that, on top of assuming a certain level of knowledge about either the data or the model, or both, some of the techniques listed above also impose additional restrictions on the considered scenarios. Techniques such as distillation, for example, assume that the original model can be controlled by the data practitioner, i.e. the internals of the model can be tuned to force specific representations of the given input throughout the adaptation process. In certain environments this may be possible, but generally it is not.
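The following minimal sketch contrasts two of the inheritance mechanisms listed above under assumed data and models: a "parent" ensemble transfers its decision behavior to a simpler "child" either through soft targets computed on the original training data (model-driven enriched data) or through hard labels assigned to synthetic points when neither the data nor the internals are available (copying). The dataset, model families and sample sizes are illustrative choices, not those of the original paper.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Parent generation: an ensemble trained on the (assumed) original data.
X, y = make_moons(n_samples=1000, noise=0.25, random_state=0)
parent = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# (a) Inheritance through model-driven enriched data: the parent's class
#     probabilities act as soft targets for a simpler child, here fitted as a
#     regressor on those probabilities (a rough stand-in for distillation).
soft_targets = parent.predict_proba(X)[:, 1]
child_soft = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000,
                          random_state=0).fit(X, soft_targets)
predict_soft = lambda X_: (child_soft.predict(X_) >= 0.5).astype(int)

# (b) Inheritance by copying: no access to the original data or internals,
#     only to the parent's hard predictions on synthetic query points.
X_synth = rng.uniform(low=X.min(axis=0) - 1, high=X.max(axis=0) + 1, size=(5000, 2))
y_synth = parent.predict(X_synth)                     # hard labels from the parent
child_copy = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_synth, y_synth)

# How closely does each child replicate the parent's decision behavior?
agreement_soft = (predict_soft(X) == parent.predict(X)).mean()
agreement_copy = (child_copy.predict(X) == parent.predict(X)).mean()
print("Soft-target child agreement with parent:", round(float(agreement_soft), 3))
print("Copy child agreement with parent:", round(float(agreement_copy), 3))
```

In both cases the child inherits the parent's decision boundary rather than the original labels, which is what makes it a differential replica rather than an independently re-trained model.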
To illustrate the utility of differential replication, we describe six different scenarios where it can be exploited to ensure a devised machine learning solution adapts to different changes in its environment.

A widely established technique to provide explanations is to use linear models, such as logistic regression. The model parameters, i.e. the linear coefficients associated with the different attributes, can be exploited to provide explanations to different audiences. Although this approach works in simple scenarios where the variables do not need to be modified or pre-processed, this is never the case in real-life applications, where variables are usually redesigned before training and new, more complex features are often introduced. This is even worse when, in order to improve model performance, data scientists create a large set of new variables, such as bivariate ratios or logarithm-scaled variables, to capture non-linear relations between the original attributes that linear models cannot handle during the training phase. This results in the new variables being obfuscated and therefore often unintelligible to humans. A straightforward solution using differential replication is to replace the whole predictive system, composed of both the pre-processing/feature engineering step and the machine learning model, by a copy that treats both steps as a single black-box model [28]. In doing so, we are able to deobfuscate the model variables by training copies to learn the decision outputs of trained models directly from the raw data attributes, without any pre-processing.
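A sketch of how this scenario might be realized, under assumed data and models: the deployed system is an opaque pipeline (feature engineering followed by a complex classifier), and a copy is trained on its end-to-end predictions directly from the raw attributes, so that the copy's coefficients refer to the original, intelligible variables. The dataset, the transformations and the model choices are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Deployed predictive system: obfuscating feature engineering + a complex model.
X_raw, y = make_classification(n_samples=2000, n_features=6, random_state=0)
deployed_system = Pipeline([
    ("engineered", PolynomialFeatures(degree=2, include_bias=False)),  # ratios, interactions, ...
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=0)),
]).fit(X_raw, y)

# Differential replica: learn the system's end-to-end decisions directly from
# the raw attributes, treating pre-processing plus model as one black box.
system_decisions = deployed_system.predict(X_raw)
copy = LogisticRegression(max_iter=1000).fit(X_raw, system_decisions)

# The copy's coefficients now refer to the original, human-readable attributes.
print("Agreement with the deployed system:",
      round(float((copy.predict(X_raw) == system_decisions).mean()), 3))
print("Coefficients over raw attributes:", np.round(copy.coef_, 2))
```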
In-company infrastructure is subject to continuous updates due to the rapid pace at which new software versions are released to the market. Changes in the organizational structure of a company may also drive the engineering department to change course. Say, for example, that a company whose products were originally based on Google's Tensorflow package [29] makes the strategic decision of moving to Pytorch [30]. In doing so, they might decide to re-train all models from scratch in the new environment. This is a long and costly process that can even result in a loss of performance, especially if the original data are not available or the in-house data scientists are new to the framework. Alternatively, using differential replication, the knowledge acquired by the existing solutions could be exploited in the form of hard or soft labels, or as additional data attributes for the new generation.

As a model is tested against new data throughout its lifespan, some of its learned biases may become apparent. Under such circumstances, one may wish to transition to a new model that inherits the original predictive performance but which ensures non-discriminatory outputs. A possible option is to edit the sensitive attributes to remove any bias, reducing in this way the disparate impact in the task T, and then train a new model on the edited dataset. Alternatively, in very specific scenarios where the sensitive information is not leaked through additional features, it is possible to build a copy by removing the protected data variables [31]. Or, going further, one can redesign the hypothesis space, considering a loss function that accounts for the fairness dimension when training subsequent generations.

Batch machine learning models are rendered obsolete by their inability to adapt to a change in the data distribution. When this happens, the most straightforward solution is to wait until there are enough samples of the new distribution and re-train the model. However, this solution is time-consuming and often expensive. A faster solution is to use the idea of differential replication to create a new, enriched dataset able to detect the data drift, for example by including the soft targets and a timestamp attribute in the target domain D_t. A new model is then trained on this enriched dataset to replicate the decision behavior of the previously trained classifier. Finally, we also need the new model to accept and learn from new incoming data samples. This second characteristic can be added by incorporating the online requirement into the constraints C_t of the differential replication process [32].
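One way to realize this last scenario, sketched under assumed data and models: the original batch classifier's soft scores and a timestamp are appended to the training attributes, and an online learner is fitted on this enriched dataset so that it replicates the batch model's behavior while remaining able to ingest new samples incrementally. All concrete choices (the data, the SGD-based child, the form of the timestamp) are illustrative, not prescribed by the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# Original batch model, trained on the (assumed) historical data.
X_hist, y_hist = make_classification(n_samples=3000, n_features=8, random_state=0)
batch_model = RandomForestClassifier(random_state=0).fit(X_hist, y_hist)

# Enriched dataset: raw attributes + the parent's soft score + a timestamp.
soft = batch_model.predict_proba(X_hist)[:, [1]]
timestamp = np.linspace(0.0, 1.0, len(X_hist)).reshape(-1, 1)
X_enriched = np.hstack([X_hist, soft, timestamp])

# Child model with the online constraint in C_t: it replicates the parent's
# decisions on the enriched data and can keep learning from new batches.
child = SGDClassifier(loss="log_loss", random_state=0)  # "log" in older scikit-learn
child.partial_fit(X_enriched, batch_model.predict(X_hist), classes=np.array([0, 1]))

# When drifted data arrives, the child is updated incrementally instead of
# waiting to accumulate enough samples for a full re-training.
X_new, y_new = make_classification(n_samples=200, n_features=8, random_state=1)
soft_new = batch_model.predict_proba(X_new)[:, [1]]
ts_new = np.full((len(X_new), 1), 1.1)          # later timestamps
child.partial_fit(np.hstack([X_new, soft_new, ts_new]), y_new)

print("Child agreement with the batch parent on historical data:",
      round(float((child.predict(X_enriched) == batch_model.predict(X_hist)).mean()), 3))
```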
Developing good machine learning models requires abundant data. The more accessible the data, the more effective a model will be. In real applications, training machine learning models usually requires collecting a large volume of data from users, often including sensitive information. Directly releasing models trained on user data could result in privacy breaches, as the risk of leaking sensitive information encoded in the public model increases. Differential replication can be used to avoid this issue by training another model, usually a simpler one, that replicates the learned decision behavior but which preserves the privacy of the original training set by not being directly linked to these data. The use of distillation techniques in the context of teacher-student networks, for example, has been reported to be successful in this task [33, 34].

Auditing machine learning models is not an easy task. When an auditor wants to audit several models under the same constraints, all audited models are required to fulfill an equivalent set of requirements. Those requirements may limit the use of certain software libraries, or of certain model architectures. Usually, even within the same company, each model is designed and trained on its own basis. As research in machine learning grows, new models are continuously devised. However, this fast growth in available techniques hinders the possibility of having a deep understanding of those models and makes the assessment of some auditing dimensions a nearly impossible task. In this scenario, differential replication can be used to establish a small set of canonical models into which all others can be translated. In this sense, deep knowledge of this set of canonical models would be enough to drive auditing tests. For example, let us consider that the canonical model is a deep learning architecture with a certain configuration. Any other model can be translated into this particular architecture using differential replication². The auditing process then only needs to consider how to probe the canonical deep network to report an impact assessment.

In this paper we have described a general framework based on knowledge reuse from generation to generation to extend the useful life of machine learning models by adapting them to their changing environment. To tackle this issue of environmental adaptation, we have proposed a mechanism inspired by how biological organisms evolve: differential replication. Differential replication allows machine learning models to modify their behavior to meet the new requirements defined by the environment. We envision this replication mechanism as a projection operator able to translate the decision behavior of a machine learning model into a new hypothesis space with different characteristics. This allows traits of a given classifier to be inherited by another that is more suitable under the new premises. We have listed different inheritance mechanisms to achieve this goal, depending on specific knowledge availability scenarios. These range from the more permissive inheritance by sharing the dataset to the more restrictive inheritance by copying, which is the solution requiring the least knowledge about the parent model and training data. Finally, we have provided examples of how differential replication applies in practice in six different real-life scenarios.

References

[1] On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life.
[2] Magic quadrant for data science and machine learning platforms.
[3] Engaging the ethics of data science in practice.
[4] Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data.
[5] The fallacy of inscrutability.
[6] Large scale online learning.
[7] End-to-end neural network architecture for fraud scoring in card payments.
[8] Online error-correcting output codes.
[9] Uncertainty estimation for black-box classification models: a use case for sentiment analysis.
[10] "Why should I trust you?": Explaining the predictions of any classifier.
[11] A survey on transfer learning.
[12] Model compression.
[13] Application of proximity graphs to editing nearest neighbor decision rule.
[14] Geometric decision rules for instance-based learning algorithms.
[15] Application of the Gabriel graph to instance-based learning.
[16] Distilling the knowledge in a neural network.
[17] Rethinking the Inception architecture for computer vision.
[18] Knowledge distillation in generations: More tolerant teachers educate better students.
[19] When does label smoothing help?
[20] Revisiting knowledge distillation via label smoothing regularization.
[21] Label refinery: Improving ImageNet classification through label progression.
[22] Using a neural network to approximate an ensemble of classifiers.
[23] Zero-shot knowledge distillation in deep networks.
[24] Knowledge distillation from internal representations.
[25] Explaining knowledge distillation by quantifying the knowledge.
[26] Transactional compatible representations for high value client identification: A financial case study.
[27] Copying machine learning classifiers.
[28] Towards global explanations for credit risk scoring.
[29] TensorFlow: A system for large-scale machine learning.
[30] Automatic differentiation in PyTorch.
[31] Using copies to remove sensitive data: A case study on fair superhero alignment prediction.
[32] From batch to online learning using copies.
[33] Patient-driven privacy control through generalized distillation.
[34] Private model compression via knowledge distillation.

Acknowledgements

This work has been partially funded by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE), and by AGAUR of the Generalitat de Catalunya through the Industrial PhD grant 2017-DI-25.