title: Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning
authors: Wang, Zijie; Zhou, Lixi; Das, Amitabh; Dave, Valay; Jin, Zhanpeng; Zou, Jia
date: 2020-10-15 (arXiv:2010.07586 [cs.DB])

Abstract: Data is king in the age of AI. However, data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to automating the end-to-end data integration process. Although mechanisms such as query discovery and schema modification languages exist to handle the problem, these approaches only work under the assumption that the schema is maintained by a database. However, we observe diversified schema changes in heterogeneous data and open data, most of which have no schema defined. In this work, we propose to use deep learning to automatically deal with schema changes, through a super cell representation and the automatic injection of perturbations into the training data that makes the model robust to schema changes. Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios: coronavirus data integration and machine log integration.

It was reported in 2018 that data scientists spent 80-90% of their effort on the data integration process [8], [69], [74]. Schema changes, which impact applications and cause system downtimes, are a main factor behind the tremendous human-resource overhead required for data integration. Schema changes are often caused by software evolution, which is pervasive and persistent in agile development [36], or by the diversity of data formats due to the lack of standards [13].

Example: coronavirus disease 2019 (COVID-19) data integration. To predict the coronavirus outbreak, we integrate the coronavirus data repository at Johns Hopkins University (JHU) [4] and the Google mobility data [3]. The JHU data repository maintains the world coronavirus cases on a daily basis as a set of CSV files. However, we find that the schema of the data changes frequently. As illustrated in Fig. 1(a), the changes include attribute name changes (e.g., Longitude → Long), addition and removal of attributes (e.g., from six attributes initially to 15 attributes), attribute type changes (e.g., date formats), and key changes (e.g., from (country/region, province/state) to (combined key), (FIPS), and (country region, province state, Admin2)). The data scientist's Python code for parsing these files easily breaks at each schema change and requires manual effort to debug and fix the issues. This has become a ubiquitous pain for users of the JHU data repository, and people even launched a project to periodically clean the JHU coronavirus data into a stable R-friendly format (https://github.com/Lucas-Czarnecki/COVID-19-CLEANED-JHUCSSE). However, such cleaning is purely based on manual effort and obviously not scalable.

Prior arts. Schema evolution for data managed in relational databases, NoSQL databases, and multi-model databases is a well-established research topic. The fundamental idea is to capture the semantic mappings between the old and new schemas, so that legacy queries can be transformed and/or legacy data can be migrated to work with the new schemas. There are two general approaches to capture the semantic mappings: (1) to search for queries that can transform the old schema to the new schema [9], [25], [30], [51], [61], [66];
and (2) to ask the database administrators (DBAs) or application developers to use a domain-specific language (DSL) to describe the schema transformation process [11], [17], [31]-[33], [41], [53], [55], [56], [64]. However, these approaches are not applicable to unmanaged data, including open data such as publicly available CSV, JSON, HTML, or text files that can be downloaded from a URL, and transient data that is directly collected from sensor devices or machines in real time and discarded after being integrated. That is because the history of schema changes for such data is totally lost or opaque to the users. There is an urgent need to automatically handle schema changes for unmanaged data without interrupting applications and without any human intervention. Otherwise, with the rapid increase in the volume and diversity of unmanaged data in the era of Big Data and the Internet of Things (IoT), a huge amount of time and human resources will inevitably be wasted on manually handling the system downtimes incurred by schema changes.

A deep learning approach. In this work, we argue for a new data integration pipeline that uses deep learning to avoid interruptions caused by schema changes. In the past few years, deep learning (DL) has become the most popular direction in machine learning and artificial intelligence [46], [65], and has transformed many research areas, such as image recognition, computer vision, speech recognition, and natural language processing. In recent years, DL has been applied to database systems and applications to facilitate parameter tuning [47], [71], [76], [81], indexing [21], [43], partitioning [34], [86], cardinality estimation and query optimization [39], [44], and entity matching [24], [37], [42], [57], [73], [82]. While predictions based on deep learning cannot guarantee correctness, in the Big Data era errors in data integration are usually tolerable as long as most of the data is correct, which is another motivation for our work. To the best of our knowledge, we are the first to apply deep learning to learn the process of join/union/aggregation-like operations with schema changes occurring in the data sources.

However, this is not an easy task, and the specific research questions include: (1) It is not straightforward to formulate a data integration task, which is usually represented as a combination of relational or dataflow operators such as join, union, filter, map, flatmap, and aggregate, as a prediction task. What are effective representations for the features and labels? (2) How should the training process be designed to make the model robust to schema changes? (3) Different model architectures, for example, simple and compact sequence models such as Bi-LSTM versus complex and large transformers such as GPT-2 and BERT, may strike different trade-offs among accuracy, latency, and resource consumption. What are the implications for model architecture selection in different deployment environments? (4) Annotating data to prepare training data is always a major bottleneck in the end-to-end lifecycle of model deployment for production. How can training data preparation be automated for the aforementioned prediction tasks?

Uninterruptible integration of fast-evolving data. In this work, we first formulate a data integration problem as a deep learning model that predicts the position in the target dataset for each group of related data items in the source datasets.
We propose to group related items in the same tuple or object that will always be processed together, and abstract each group into a super cell concept. We further propose to use source keys and attributes as features describing the context of each cell, and to use the target keys and attributes as labels describing the target position to which the super cell is mapped. The features and labels can be represented as sentences, or sentences with masks, so that the representation is applicable to state-of-the-art language models, including sequence models like Bi-LSTM and transformers like GPT-2 and BERT.

Then, to seamlessly handle various schema changes, inspired by adversarial attacks [28], [45], which are a hot topic in DL, we view most types of schema changes as obfuscations injected into testing samples at inference time, which may confuse a model that is trained without noise. Therefore, just like adversarial training [28], [45], [68], we address the problem by adding specially designed noise to the training samples to make the model robust to schema changes. The techniques we employ include replacing words with randomly changed words and with synonyms sampled from the Google Knowledge Graph. In addition, we propose to add an aggregation mode label to indicate how to handle super cells that are mapped to the same position, which can handle the key expansion schema change in the example given earlier.

Based on the above discussions, we propose a fully automated end-to-end process for uninterruptible integration of fast-evolving data sources, as illustrated in Fig. 1(b). In Step 1, the system leverages our proposed Lachesis intermediate representation (IR) [86] to automatically translate the user's initial data integration code into executable code that creates training data based on our proposed representation. In Step 2, obfuscations are automatically injected into the training data to make the model robust to various schema changes. In Step 3, different model architectures are chosen to train the predictive model that will perform the data integration task, depending on the model deployment environment. Due to space limitations, this paper focuses on Steps 2 and 3, while we also discuss Step 1 as well as other techniques for reducing training data preparation overhead, such as crowdsourcing and model reuse.

The contributions of this work include: (1) To the best of our knowledge, we are the first to systematically investigate the application of deep learning and adversarial training techniques to automatically handle schema changes occurring in the data sources. (2) We propose an effective formulation of the data integration problem as a prediction task, as well as a flexible feature representation based on our super cell concept (Sec. II). We also discuss how to alleviate the human costs involved in preparing training data for the representation (Sec. V). (3) We represent the common schema changes as various types of obfuscations, which can be automatically injected into the training data to make the model training process robust to these types of schema changes (Sec. III). (4) We compare and evaluate the trade-offs made by different model architectures, including a simple sequence model and complex transformer models, for two different data integration tasks involving semi-structured and unstructured data, respectively (Sec. IV and Sec. VI).
We assume a typical data integration scenario that converts a set of source datasets into one target dataset, where each source dataset may have a heterogeneous format such as CSV, JSON, or text and may not be managed by any relational or NoSQL data store. However, the target dataset must be tabular, so that each cell in the target dataset can be uniquely identified by its tuple identifier and attribute name.

Given a set of objects, where each object may represent a row in a CSV file, a JSON object, a time series record, or a file, there are a few candidate representations for formulating the predictive task, including a dataset-level representation, an object-level representation, an attribute-level representation, a cell-level representation, and our proposed super cell based representation. The coarser-grained the representation, the fewer inferences are required and the more efficient the prediction. However, a coarser-grained representation also means that the prediction task is more complex and it is harder to train a model with acceptable accuracy, because a training sample is larger and more complex than in finer-grained representations, and the mapping relationship to learn naturally becomes more complicated. The various levels of representation for the motivating example are illustrated in Fig. 2. The dataset-level representation is the coarsest, and it is difficult to collect sufficient training data represented at the dataset level. The object-level representation groups all attributes and is not expressive enough to describe the different transformations applied to each attribute. Similarly, the attribute-level representation assembles all values in the same attribute and is not efficient at expressing logic like filtering or aggregation. The cell-level representation, referring to a value of a single attribute in one tuple, is at the finest granularity, which simplifies the learning process. However, it may incur too many inferences and waste computational resources, particularly if there exist multiple cells in one object that will always be mapped/transformed together.

Therefore we propose and argue for a super cell representation. A super cell is a group of cells in an object that will always be mapped to the target table together in a similar way, such as the values of the confirmed attribute and the recovery attribute, as shown in Fig. 2. Because the flexible granularity of a super cell lies between the object level and the cell level, it balances expressiveness against the performance of the training and testing process.

For each super cell, we represent its features as a sentence that concatenates the keys, as well as the value and attribute of each cell in the super cell. In particular, the keys in the context features should include both the join key, if applicable, and the tuple/object identifier of the local table. We find that for 1-1 joins, the join key usually also serves as the local tuple/object identifier; for 1-N and N-M joins, the join key is not necessarily the local key; and for non-join operations (e.g., union, filter, aggregate), no join key is required. We assume the target dataset is always tabular and propose a label representation that includes the tuple identifier (i.e., the key of the target table) of the super cell, the target attribute name of each cell, and an aggregation mode that specifies how values mapped to the same position should be aggregated into one value, e.g., add, avg, max, min, count, replace old, discard new, etc.
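To make the feature and label representation concrete, the following minimal Python sketch (our own illustration; the attribute names, key values, and serialization format are assumptions based on the COVID-19 example, not the paper's exact encoding) shows a super cell, its target-position label, and the feature sentence derived from it:

```python
# A super cell grouping the "Confirmed" and "Recovered" cells of one JHU tuple.
super_cell = {
    "keys": ["2020-10-06", "Arizona", "US"],          # join key + local tuple identifier
    "attributes": ["Confirmed", "Recovered"],
    "values": ["2524", "1981"],
}

# Label: the target position plus an aggregation mode for colliding values.
label = {
    "target_keys": ["2020-10-06", "Arizona", "United States"],
    "target_attributes": ["confirmed_cases", "recovered_cases"],
    "agg_mode": "replace_old",
}

def to_feature_sentence(cell):
    """Concatenate the keys, then each (attribute, value) pair, into one token sequence."""
    tokens = list(cell["keys"])
    for attr, val in zip(cell["attributes"], cell["values"]):
        tokens += [attr, val]
    return " ".join(tokens)

print(to_feature_sentence(super_cell))
# 2020-10-06 Arizona US Confirmed 2524 Recovered 1981
```

The sentence form is what allows the same sample to be fed to either a sequence model or a transformer-based language model.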
Suppose there are m source datasets, represented as D = {d_i} (0 ≤ i < m), where each source dataset is modeled as a set of n_i super cells, denoted as d_i = {s_ij} (0 ≤ i < m, 0 ≤ j < n_i). We further describe each super cell s_ij ∈ d_i as a triplet of three vectors: the keys shared by each cell in the super cell, the attribute names of each cell, and the values of each cell, represented as s_ij = (key_ij, attribute_ij, value_ij). We can further define a super set that describes the current state of the entire data repository: S = {s_ij ∈ d_i | ∀ d_i ∈ D}.

A super cell may be mapped to zero, one, or more than one position in the target dataset, depending on the operations involved in the data integration task. For example, as illustrated in Fig. 3, in a 1-N join operation, a super cell in the table on the left-hand side may be mapped to many positions in the target table. Thus we can represent the target positions to which the super cell is mapped as a list of triples: f(s_ij) = {(key^T_ij, attribute^T_ij, agg_mode)}, where each triple refers to one position that is indexed by the target keys key^T shared by all cells in the super cell, as well as the attribute name of each cell in the super cell, denoted as attribute^T.

Given a fast-evolving data repository S = {s_ij ∈ d_i | ∀ d_i ∈ D}, for a data integration request, the user input should specify the schema of the expected target table. The schema includes a list of p attributes denoted as A = {a_k} (0 ≤ k < p), where a_k represents the k-th attribute in the target table, as well as a list of q attributes that serve as the minimal tuple identifier (i.e., key), denoted as A_key = {a_l} (0 ≤ l < q). We further denote the set of all possible key values in the target table as R = {r_l | r_l ∈ a_l} (0 ≤ l < q). Then we need to find a model f: S → (A ∪ {NULL}) × (R ∪ {NULL}) that predicts a set of target positions, denoted as {(key^T_ij, attribute^T_ij, agg_mode)}, for each super cell s_ij ∈ S, where each element of the key vector satisfies ∀x ∈ [0, q), key^T_ij[x] ∈ R, and each element of the attribute vector satisfies ∀y ∈ [0, |s_ij|), attribute^T_ij[y] ∈ A, with |s_ij| denoting the number of cells in the super cell s_ij. If a super cell does not belong to the target table and should be discarded, we define r_l = NULL and a_k = NULL in this case.

We identify five basic types of data schema changes, which cover all of the relational or NoSQL schema changing patterns [11], [17], [31]-[33], [41], [53], [55], [56], [64] as well as schema changes that we have discovered in open data. We first discuss the impact of each type and then propose our approaches to handle these changes by adding perturbations to the training process.

(1) Domain pivoting. For example, originally the dataset was stored as three CSV files describing the daily confirmed coronavirus cases, daily recovery cases, and daily death cases; later the schema changed to a new set of CSV files such that each file describes all information (confirmed, death, recovery cases) for a specific date. We observed that such changes are prevalent in the coronavirus data [3], [4], [6] and the weather database hosted by the National Oceanic and Atmospheric Administration (NOAA) [7]. Such changes easily break a conventional Python-based data integration pipeline.

(2) Key expansion.
For example, the key of the dataset is lowered from the state level (country/region, province/state) to the county level ((combined key), (FIPS), and (country region, province state, Admin2)), which means a tuple in the original table (describing statistics for a state) is broken down into multiple tuples, each of which describes the statistics for a county. Such changes cannot be easily handled by conventional data integration methodologies.

(3) Attribute name and ordering change. For example, in the first CSV file added to the JHU COVID-19 daily report data repository on Jan 22nd, 2020, the third column name is "Last Update". But in the CSV file added to the same repository on Sept 24th, 2020, the same column has moved to the fifth position, and the name has slightly changed to "Last_Update". Such changes may interrupt a conventional program that joins two datasets on the "Last Update" column.

(4) Value type/format change. For example, in the daily COVID-19 file created on Jan 22nd, the "Last Update" column has values in the format "1/22/2020 17:00". However, in the file of Sept 24th, the value format has changed to "2020-09-25 04:23:21". A conventional exact join using "Last Update" as the join key cannot handle such a value format change, unless the programmer chooses to convert it into a similarity join, which is more complicated and much slower than an exact join and results in significantly higher development costs [79], [84].

(5) Addition or removal of non-key attributes. For example, the JHU COVID-19 global daily report has changed from six attributes initially to 14 attributes after a few months. This change may not have much influence if the affected attributes are not used by the user's data integration workloads. On the contrary, if a required column is totally removed, there is no way to handle the situation without interruption, even with a deep learning approach, so we mainly focus on the first four types of schema changes.

Perturbation based on schema changes. First, our super cell based representation is not affected by domain pivoting, attribute ordering changes, or the addition or removal of irrelevant attributes, because the context of any cell remains the same despite these schema changes. Therefore a model trained with our proposed representation is robust to these types of schema changes. Second, schema changes such as the renaming of an attribute and the reformatting of cell values are similar to adversarial attacks, which confuse pre-trained models by adding noise to the expected testing samples. Adding perturbations to training data is an effective way of training robust models against adversarial attacks [28]. Inspired by this analogy, we add specially designed perturbations to the training data to handle these kinds of schema changes. If we view each super cell representation as a sequence of words (i.e., a sentence), the training data is a corpus of sentences. We can then augment the training data by adding new sentences (i.e., perturbations), which are derived from existing sentences by randomly replacing a word with a synonym extracted from the Google Knowledge Graph [3], [4], [6], or with a randomly modified word created by removing one or more letters. We then train a character-based embedding on this augmented corpus using fastText [5], which maps related words to vectors that are close to each other so that the model can recognize the similarity of such words.
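The perturbation and embedding step can be sketched as follows (a minimal Python illustration under our own assumptions: the synonym table, corruption probabilities, and fastText hyperparameters are placeholders rather than the authors' exact settings):

```python
import random
import fasttext  # pip install fasttext

# Hypothetical synonym table; in the paper, synonyms are sampled from the Google Knowledge Graph.
SYNONYMS = {"Longitude": ["Long"], "Country/Region": ["Country_Region"]}

def corrupt_word(word, drop_prob=0.3):
    """With some probability, remove one random letter from a word."""
    if len(word) <= 2 or random.random() > drop_prob:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def perturb_sentence(sentence):
    """Replace tokens with synonyms or randomly modified spellings."""
    tokens = []
    for tok in sentence.split():
        if tok in SYNONYMS and random.random() < 0.5:
            tok = random.choice(SYNONYMS[tok])
        else:
            tok = corrupt_word(tok)
        tokens.append(tok)
    return " ".join(tokens)

def augment_corpus(sentences, copies=1, path="augmented_corpus.txt"):
    """Write the original feature sentences plus perturbed copies to a corpus file."""
    with open(path, "w") as f:
        for s in sentences:
            f.write(s + "\n")
            for _ in range(copies):
                f.write(perturb_sentence(s) + "\n")
    return path

corpus = augment_corpus(["2020-10-06 Arizona US Confirmed 2524 Recovered 1981"])
# Character n-gram (subword) embeddings keep misspelled or out-of-vocabulary tokens
# close to their clean counterparts in the embedding space.
emb = fasttext.train_unsupervised(corpus, model="skipgram", dim=150, minn=2, maxn=5, minCount=1)
```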
We observe through experiments that character-based embeddings achieve better accuracy and reliability than word-based embeddings and can smoothly handle out-of-vocabulary words. Our experiments also show that our locally trained embedding significantly outperforms embeddings pre-trained on Google News or Wikipedia. Third, to make the deep learning model robust to key expansion, as mentioned, we add a new label to the representation called the "aggregation mode". Each value of this label represents either an aggregation operator such as sum, avg, min, or max, which will be applied to the cells mapped to the same target position; replace, which means the cell will replace the old cell that exists in the same position of the target table; or discard, which means the new cell will be discarded if an older cell has already been mapped to the same target position.

In recent years, Natural Language Processing (NLP) has experienced several major advancements, including the bi-directional mechanism, the attention mechanism, and the transformer architecture. Existing works show that the final hidden state of a Bi-LSTM network cannot capture all important information in a long sentence. Therefore, the attention mechanism was introduced to address this problem by preserving information from all hidden states of the encoder cells and aligning them with the current target output. Later, this idea was integrated into transformer architectures, in which encoders have self-attention layers while decoders have encoder-decoder attention layers. Most recently, to make the transformer architecture more flexible for applications other than language translation, GPT-2, which uses only the decoder part, and BERT, which uses only the encoder part, were invented and have achieved great success on a broad class of NLP problems.

Our assumption is that, on one hand, more complicated models like GPT-2 and BERT may naturally achieve better accuracy than a simpler model like Bi-LSTM; on the other hand, these complex models may require significantly more storage and computational resources, as well as more training data. It is therefore important to understand the trade-offs among accuracy, latency, and resource consumption made by different model architectures. We mainly consider two types of language model architectures: (1) simple and compact sequence models based on a customized local character-based embedding and Bi-LSTM; and (2) complex and large pre-trained transformer models, such as GPT-2 [62] and BERT [20].

1) Sequence Model: Our Bi-LSTM model architecture includes an embedding layer with 150 neurons, a Bi-LSTM layer with 512 neurons, and a fully connected layer with 256 neurons.

2) Transformer Model: We also consider transformer models based on GPT-2 [62] and BERT [20]. We use a pre-trained GPT-2 small model or a pre-trained BERT base model as the backend, which connects to a frontend classifier composed of four convolutional layers and a fully connected layer. During the training process, the parameters of the GPT-2 small model and the BERT base model are frozen, and only the parameters of the frontend are updated. The pre-trained GPT-2 small has 117 million parameters, with 12 layers of transformers, each with 12 independent attention mechanisms, called "heads", and an embedding size of 768 dimensions.
The hidden vector output from the GPT-2 small model is reshaped to add a channel dimension and then passed to four convolutional layers: two max-pooled 2D convolution layers and two average-pooled 2D convolution layers, respectively. The two outputs are combined with a Hadamard product and then sent to a fully connected layer. The BERT base model has 110 million parameters, with 12 transformer blocks, each with 768 hidden neurons and 12 self-attention heads. It uses the same frontend classifier architecture as the GPT-2 small model. Although GPT-2 and BERT are both based on the transformer model, they use different units of the transformer: GPT-2 is built from transformer decoder blocks constructed with masked self-attention layers, while BERT utilizes transformer encoder blocks with self-attention layers. Although transformer models usually achieve better accuracy than sequence models through their attention mechanism, they also require significantly more storage space. For example, GPT-2 small, the smallest variant of the GPT-2 model, requires more than 500 megabytes of storage space, and the BERT base model that we use takes 450 megabytes of storage space. In contrast, the Bi-LSTM model is smaller than 1 megabyte.

For each super cell, the model predicts a set of target positions in the form {(key^T_ij, attribute^T_ij, agg_mode)}, as mentioned in Sec. II. Then, based on each super cell and its predicted positions, a general data assembler puts each value into the right place in the target table. Depending on the configuration, the assembler either works in local mode, buffering and writing one file that stores the target dataset locally, or dispatches the assembled tuples to the users' registered deep learning workers (i.e., when the target data is consumed by a deep learning application) once an in-memory buffer is full. In the latter case, on each deep learning worker's side, a client is responsible for receiving and assembling tuples into the target dataset. During the dispatching process, the output table is partitioned in a way that guarantees load balance and ensures the data is independent and identically distributed (IID) to avoid introducing bias.

An important objective of this work is to free human experts from all the dirty work of wrangling with schema changes. Therefore it is critical to reduce the human effort required in training data preparation, such as parsing and annotating data. We propose to automate training data creation by utilizing the conventional Python code developed for integrating an initial set of data sources. The users' Python code specifies how to transform the data sources (usually with heterogeneous formats) into a target table (usually in tabular format), which is exactly the information needed for creating the training data. This gives us an opportunity to translate users' data integration code into training data preparation code. For relational data, the integration logic can be fully expressed in SQL, which maps to relational algebra. It is then easy to generate code for the training data creation process based on the relational algebra. First, all key and join key information is well maintained and can be directly retrieved. Second, it is easy to identify which attributes of a table will always be processed similarly by analyzing the relational algebra expression, so that the values of these attributes in the same tuple can be grouped into a super cell.
For example, by analyzing a query coded for a data integration task, the system can derive, for each source super cell, the target position it is mapped to. This output can easily be transformed into a base set of training data, into which the perturbations will then be injected. However, because the integration code for open data is usually written in an object-oriented language such as Python, Java, or C++, the compiled code is opaque to the system, and it is hard to modify the code directly. One solution is to map the integration code to an intermediate representation (IR), such as the Weld IR [60], which is integrated with libraries like numpy and SparkSQL, or our proposed Lachesis IR [86]. Such an IR is usually a directed acyclic graph (DAG) and can be reasoned about by the system. In this DAG, each node is an atomic computation, and each edge represents a data flow or a control flow from the source node to the destination node. The atomic computations useful for data integration workloads can usually be composed from three categories of operators: (1) lambda abstraction functions, such as a function that returns a literal (a constant numerical value or string), a member attribute, or a member function of an object, and unary functions such as exp, log, sqrt, sin, cos, tan, etc.; (2) higher-order lambda composition functions, such as binary operators (&&, ||, &, |, <, >, ==, +, -, *, /) and the conditional operator (condition ? on_true : on_false); and (3) set-based operators such as scan and write, which read/write a set of objects from/to storage, as well as map, join, aggregate, flatten, filter, etc.

We propose to modify existing intermediate representations so that a super cell based processor can be derived from each atomic computation. We assume that each source dataset can be represented as Pandas dataframes. Then, by traversing the IR graph, the system can understand the keys and the super cell mapping relationship. The super cell based processor of each atomic computation transforms each super cell representation accordingly. For example, for a map operator that transforms a date cell "2020-10-06" into "Oct 6, 2020", the processor takes a super cell {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["2020-10-06"]} as input and outputs {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["Oct 6, 2020"]}, so that the contextual relationship between "Oct 6, 2020" and its source key and attribute name is preserved. The write operator's processor transforms each super cell into a (feature, label) representation, such as {"source super cell": {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["Oct 6, 2020"]}, "target position": {"keys": ["Oct 6, 2020", "Arizona", "United States"], "attributes": ["datetime"]}}. In this way, we can obtain training data automatically.

However, the limitation of the above approach is that it may not work if the input object is deeply nested and opaque and cannot be represented as a set of cells the way Pandas or Spark dataframes can. For example, a corpus of completely unstructured text files unavoidably requires human pre-processing effort. Therefore, we design the following approaches to further alleviate the problem: model reuse and crowdsourcing. According to Sec. V-A, if we are able to convert an unstructured dataset, e.g., a set of opaque and nested objects or a set of unstructured texts, into a Pandas dataframe or a similar structure, the code generation approach may be applicable to automate the training data creation process.
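The per-operator super cell processors described above (the map and write examples) can be sketched as follows; this is a minimal illustration in Python, and the function names and date-formatting logic are hypothetical rather than part of the Lachesis IR API:

```python
from datetime import datetime

def map_processor(super_cell, fn):
    """Apply a map operator's cell transformation while preserving the cell's context
    (its source keys and attribute names)."""
    out = dict(super_cell)
    out["cells"] = [fn(v) for v in super_cell["cells"]]
    return out

def write_processor(super_cell, target_keys, target_attributes):
    """Emit a (feature, label) training pair for the write operator."""
    return {
        "source super cell": super_cell,
        "target position": {"keys": target_keys, "attributes": target_attributes},
    }

def reformat_date(value):
    """Illustrative cell transformation: '2020-10-06' -> 'Oct 6, 2020'."""
    dt = datetime.strptime(value, "%Y-%m-%d")
    return f"{dt.strftime('%b')} {dt.day}, {dt.year}"

cell = {"keys": ["2020-10-06", "AZ", "US"], "attributes": ["Date"], "cells": ["2020-10-06"]}
mapped = map_processor(cell, reformat_date)
sample = write_processor(mapped, ["Oct 6, 2020", "Arizona", "United States"], ["datetime"])
```

Chaining such processors along the IR graph yields labeled (feature, label) pairs without any manual annotation.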
However, for unstructured data it is nontrivial to identify the parsing logic, perform such a conversion, and identify the keys; all of these tasks are hard to automate. We consider crowdsourcing as a potential approach to alleviate this burden on data scientists or domain experts. However, in our experiment of crowdsourcing 160 key identification tasks to 8 graduate students and 164 undergraduate students from an introductory database course, asking them to identify all keys, we find that the accuracy is merely 65.7%. First, some of the datasets, particularly scientific datasets, require domain-specific knowledge to tell the tuple identifier, because their attribute names are acronyms or terms that are not understandable to people outside the domain, and datasets usually do not ship with detailed explanations of each attribute. Second, for large datasets, it is impossible for a person who is not familiar with the dataset to tell the keys. Third, it is not easy to find many people who have database knowledge. Other tasks, such as identifying super cells and parsing unstructured datasets, are even more challenging for crowdsourcing platforms due to the expert knowledge they inherently require.

Another approach to reducing the human effort involved in preparing training data is to reuse models from similar data integration tasks. For this purpose, we design and develop a system, called ModelHub, which searches for reusable models for a new data integration task by comparing the attributes of the target dataset (created by the programmer's initial data integration code) with the target dataset of each existing data integration model. We leverage locality-sensitive hashing (LSH) based on MinWise hashing [19], [85] for text-based data and LSH based on JS-divergence [15], [48] for numerical data to accelerate the attribute-matching process. Another benefit of utilizing LSH is that, in the ModelHub platform, each model only needs to be uploaded with the LSH signatures of its target dataset's attributes, while the target dataset itself does not need to be submitted, which saves storage overhead and also addresses privacy concerns.

We mainly answer the following questions in this section: (1) How effective is our proposed deep learning representation for different data integration tasks? (2) How effective are the perturbations added to the training data for handling various types of schema changes? (3) How do different super cell granularities affect the accuracy and the overheads of the training, testing, and assembling processes? (4) How do different model architectures (complex and large models vs. simple and compact models) affect the accuracy and latency for different types of data integration tasks? (5) How does our approach to handling schema changes improve productivity and reduce programmers' effort?

Based on the proposed training data representation and training data perturbation methodology, we have created training data to train the Bi-LSTM, GPT-2 small, and BERT base models for two scenarios: coronavirus data integration and heterogeneous machine data integration. The first scenario mainly involves tabular source datasets in CSV format with the aforementioned schema changes.
However, the source datasets for the second scenario are mainly unstructured text data, in which similar terms are expressed very differently on different platforms (e.g., CPU user time is logged as "CPU usage: 14.90% user" in MacOS, "%Cpu(s): 14.9 us" in Ubuntu, and "400%cpu 86%user" in Android).

Model Architectures. We compare three neural networks: Bi-LSTM, GPT-2 small with a CNN frontend classifier, and BERT base with the same CNN frontend classifier. The model architectures are described in Sec. IV.

Model Training. For both scenarios, the Bi-LSTM model converges relatively slowly, requiring around 50 epochs, while the models leveraging the pre-trained GPT-2 small and BERT base converge much faster, requiring only around 5 epochs, as illustrated in Fig. 4.

Metrics. We evaluate and compare the accuracy, the storage overhead, and the end-to-end training and inference latency, with all types of schema changes mentioned in Sec. III applied at the inference stage. The accuracy of the data integration model is defined as the ratio of the number of super cells that have been predicted with correct target positions and aggregation actions to the total number of super cells in the testing data.

Hardware Platform. For all experiments, unless otherwise specified, we use one NVIDIA Tesla V100 GPU from Google Colab. All running times (e.g., training time, inference time) are measured as the average of multiple repeated runs.

B. Coronavirus Data Integration Scenario. 1) Experiment Setup: We evaluate our system in a COVID-19 data integration scenario that is close to the example in Sec. I. We predict the COVID-19 trend using daily and regional information regarding the numbers of various cases and mobility factors. Given a set of raw data sources, we need to create a 2-dimensional target dataset on a daily basis. In the target dataset, each row represents the coronavirus and mobility information for a state/province on a specific date, and the columns represent the state, country, number of confirmed cases, recovery cases, death cases, and the mobility factors regarding workplaces, grocery, transit, etc. The target dataset can be used as input to various curve-fitting techniques [1], [2] for COVID-19 prediction.

Datasets. We assume the user specifies/recommends a small set of initial data sources. For the first scenario, the user specifies the Johns Hopkins University COVID-19 GitHub repository [4] and the Google mobility data [3]. Statistics about these source tables are given in Tab. I. The JHU dataset contains 258 files, with each file representing COVID-19 statistics on a specific date. These files have tens of versions, growing from 36 rows and 6 attributes to 3956 rows and 14 columns.

Perturbations. We add perturbations such as random changes to attribute names and values, and the replacement of attribute names and value tokens with synonyms as described in Sec. III, to 58.3% of the attributes in the training data. In addition, we add key expansion changes, which account for 18.6% of the rows in the training data. We test the model using JHU COVID-19 data and Google mobility data collected from Feb 15, 2020 to Oct 6, 2020, as illustrated in Tab. I.

2) Overall Results: The overall results are illustrated in Tab. II,
which show that by employing a complex transformer like the pre-trained GPT-2 small or BERT base, we can achieve better accuracy, though the more complicated models require significantly more storage space and computational time for training one epoch or for inference. The results also show that as the granularity of super cells increases, the required training and testing time is significantly reduced, while the accuracy decreases.

3) Ablation Study: Using the Bi-LSTM model with the single-cell representation, we also conducted a detailed ablation study, as illustrated in Tab. III. It shows that handling value format changes (e.g., date format changes like 10-06-2020 vs. 06102020, and different abbreviations of regions and subdistricts like AZ vs. Arizona) is a main factor in accuracy degradation. Using a customized synonym dictionary to encode these format changes when adding perturbations to the training data greatly improves the accuracy compared with using synonyms extracted from the Google Knowledge Graph, as illustrated in Tab. IV. In addition, we also find that character-based embeddings significantly outperform word-based embeddings, as illustrated in Tab. V.

We developed data integration code using Python and Pandas dataframes to integrate the JHU COVID-19 data collected on Feb 15, 2020 and the time-series Google mobility data. After Feb 15, 2020, the first schema evolution of the JHU COVID-19 data that broke the integration code and caused system downtime happened on Mar 22, 2020. We invited an experienced software engineer, a Ph.D. student, and an undergraduate student to develop revisions independently and asked them to deliver the task as soon as possible. We recorded the time between task assignment and code submission, as well as the time they reported dedicating to fixing the issue. We find that although the reported dedicated time ranges from 15 to 25 minutes, the time between task assignment and code submission ranges from one to three days. This example illustrates the unpredictability of human resources. In contrast, our proposed data integration pipeline can smoothly handle schema changes without any interruptions or delays, and requires no human intervention at all.

Performance of Python-based Integration Code. We run our Python-based, human-coded data integration pipeline on the aforementioned daily JHU COVID-19 data and Google mobility data in a C4.xlarge AWS instance that has four CPUs and eight gigabytes of memory; it takes 417 seconds on average to integrate data for one day, not counting the time required to fix the pipeline after schema changes. 97% of the time is spent on removing redundant county-level statistics from the relatively large Google mobility file, which has 2.5 million tuples; otherwise, the co-existing state-level and county-level statistics in the Google mobility file would confuse the join processing. This observation indicates that, with the acceleration of a high-end GPU, the overall training and inference latency of the deep learning based pipeline is lower than that of the traditional human-centered pipeline, considering that the training process only needs to be carried out at the beginning and when concept drift [75] is detected.
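For reference, the hand-coded baseline is essentially of the following form (an illustrative Python sketch, not the authors' exact script; the file and column names follow the public JHU and Google mobility releases but should be treated as assumptions), and it is exactly this kind of code that breaks whenever a source schema changes:

```python
import pandas as pd

# Load one day of JHU statistics and the cumulative Google mobility report.
jhu = pd.read_csv("04-01-2020.csv")
mobility = pd.read_csv("Global_Mobility_Report.csv")

# Drop county-level rows: keep only rows whose county field is empty, otherwise the
# co-existing state-level and county-level statistics would confuse the join.
state_level = mobility[mobility["sub_region_2"].isna()]

# Exact join on the (state, country) pair; a renamed column or a changed date/key
# format in either source silently breaks this step.
merged = jhu.merge(
    state_level,
    left_on=["Province_State", "Country_Region"],
    right_on=["sub_region_1", "country_region"],
    how="inner",
)
```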
C. Machine Log Integration. 1) Environment Setup: Suppose a lab administrator developed a Python tool to integrate various performance metrics of a cluster of MacOS workstations, such as CPU utilization (user, system, idle, wait), memory utilization (cached, buffered, swap), network utilization (input, output), disk utilization (write, read), and so on. The tool collects these metrics by periodically reading the output of the omnipresent shell tool "top" and then performing a union operation over the time-series metrics collected from each machine. Now the lab has purchased four Ubuntu Linux servers. However, because the "top" tool's output in Ubuntu is very different from that in MacOS, the Python tool cannot work with these new Linux machines without additional coding effort. Such problems are prevalent in machine or sensor data integration, where different devices produced by different manufacturers may use different schemas to describe similar information.

2) Overall Results: The results are illustrated in Tab. VI and Tab. VII, showing that our approach can achieve acceptable accuracy. In particular, the transformer models achieve significantly better accuracy than the Bi-LSTM model, while the computational time they require for training (per epoch) and inference is significantly higher. For this case, with the increase in super cell granularity (i.e., a decrease in the number of super cells per target tuple), the accuracy of the Bi-LSTM network is improved, while the accuracy of the transformer-based models is slightly degraded. Also, the larger the super cell granularity, the fewer training and testing samples there are; therefore, the time required for training and testing is also significantly reduced as the super cell granularity increases.

In this section, we discuss the process of assembling prediction results into tabular files. We mainly measure how the sizes of the source datasets, the target datasets, and the granularity of super cells affect the overall latency of the assembling process. The results are illustrated in Fig. 5, which show that increasing the super cell granularity significantly reduces the assembling latency. This indicates that, if storage space is not the bottleneck, using a transformer-based model and the largest possible super cell granularity achieves acceptable accuracy while significantly reducing the computational time required for training, inference, and assembling.

Schema evolution in relational databases, XML, JSON, and ontologies has been an active research area for a long time [23], [63]. One major approach is model (schema) management [10], [12], which automatically generates executable mappings between the old and evolved schemas [51], [77], [80]. While this approach greatly expands the theoretical foundation of relational schema evolution, it requires application maintenance and may cause undesirable system downtimes [18]. To address the problem, Prism [18] was proposed to automate the end-to-end schema modification process by providing DBAs with a schema modification language (SMO) and automatically rewriting users' legacy queries. However, Prism requires data migration to the latest schema for each schema evolution, which may not be practical in today's Big Data era. Other techniques include versioning [40], [54], [67], which avoids the data migration overhead but incurs a version management burden and significantly slows down query performance.
There are also abundant works discussing the schema evolution problem in NoSQL databases, polystores, and multi-model databases [33], [35], [53], [70]. Most of these works mainly target enterprise data integration problems and require that each source dataset be managed by a relational or non-relational data store. However, the open data sources widely used by today's data science applications are often unmanaged, and thus lack schemas or metadata information [50]. A deep learning model, once trained, can handle most schema evolution without any human intervention, and does not require any data migration or version management overhead. Moreover, today's data science applications are more tolerant of data errors than traditional enterprise transaction applications, which makes a deep learning approach promising.

Data discovery is the problem of finding related tables in a data lake. Aurum [26] is an automatic data discovery system that proposes building an enterprise knowledge graph (EKG) to solve real-world business data integration problems. In the EKG, a node represents a set of attributes/columns, and an edge connects two similar nodes. In addition, a hyperedge connects any number of nodes that are hierarchically related. The authors propose a two-step approach to build the EKG using LSH-based and TF-IDF-based signatures. They also provide a data discovery query language, SRQL, so that users can efficiently query the relationships among datasets. Aurum mainly targets enterprise data integration. Recently, numerous works have been proposed to address open data discovery problems, including automatically discovering table unionability [59] and joinability [83], [85] based on LSH and similarity measures. Nargesian et al. [58] propose a Markov approach to organize navigation over a data lake as a DAG so that the probability of finding a table through any of its attributes is maximized. In this navigation DAG, each node represents a subset of the attributes in the data lake, and an edge represents a navigation transition. All of these works provide helpful insights, from both algorithmic and system perspectives, for general data discovery problems. In particular, Fernandez et al. [27] propose a semantic matcher based on word embeddings to discover semantic links in the EKG. Our work has the potential to integrate data discovery and schema matching into a deep learning model inference process. We argue that, in our target scenario, the approach we propose saves significant storage overhead, as we only need to store data integration models, which are significantly smaller than the EKG, and it can also achieve better performance for wide and sparse tables. We also show in this paper that the training data generation and labeling process can be fully automated.

Traditionally, to solve the data integration problem for data science applications, once related datasets are discovered, the programmer either manually designs queries to integrate these datasets or leverages a schema matching tool to automatically discover queries that perform the data integration. There are numerous prior arts in schema matching [29], [38], [51], [72], which mainly match schemas based on metadata (e.g., attribute names) and/or instances. Entity matching (EM) [16], which identifies data instances that refer to the same real-world entity, is also related. Some EM works also employ a deep learning based approach [24], [37], [42], [49], [57], [73], [82]. Mudgal et al.
[57] evaluate and compare the performance of different deep learning models applied to EM on three types of data: structured data, textual data, and dirty data (with missing values, inconsistent attributes, and/or misplaced values). They find that deep learning does not outperform existing EM solutions on structured data, but it outperforms them on textual and dirty data. In addition, to apply schema matching to heterogeneous data sources, it is important to discover schemas from semi-structured or unstructured data. We previously proposed a schema discovery mechanism for JSON data [78], among other related works [22], [52].

Our approach proposes a super cell data model to unify open datasets. We train deep learning models to learn the mappings between the data items in the source datasets and their positions, as well as aggregation modes, in the target table. If we view the context of a super cell in the source as an entity, and the target position of the super cell as another entity, the problem we study in this work shares some similarity with the entity matching problem. The distinction is that the equivalence of two "entities" in our problem is determined by the user's data integration logic, while the general entity matching problem does not have such constraints. Thirumuruganathan et al. [74] discuss various representations for learning tasks in relational data curation. Cappuzzo et al. [14] further propose an algorithm for obtaining local embeddings using a tripartite-graph-based representation for data integration tasks such as schema matching and entity matching on relational databases. We mainly target open data in CSV, JSON, and text formats and choose a super cell based representation. These works can be leveraged to improve the super cell representation and corresponding embeddings proposed in this work.

In this work, we propose an end-to-end approach based on deep learning for the periodic extraction of user-expected tables from fast-evolving open data sources. We further propose a relatively stable super cell based representation to embody the fast-evolving source data, and we train models that are robust to schema changes by automatically injecting schema changes (e.g., domain pivoting, attribute name changes, attribute addition/removal, key expansion/contraction, etc.) into the training data. We formalize the problem and conduct experiments on the integration of open COVID-19 data and machine log data. The results show that our proposed approach can achieve acceptable accuracy. In addition, by applying our proposed approach, the system is not easily interrupted by schema changes, and no human intervention is required to handle most schema changes.
References:
- Caltech covid-19 modeling
- Covid-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University
- Harvard covid-19 data: county age&sex with ann
- National Oceanic and Atmospheric Administration
- The Seattle report on database research
- A semantic approach to discovering schema mapping expressions
- Applying model management to classical meta data problems
- Model management 2.0: manipulating richer mappings
- Data warehouse scenarios for model management
- A plea for standards in reporting data collected by animal-borne electronic devices
- Creating embeddings of heterogeneous relational datasets for data integration tasks
- Locality-sensitive hashing for f-divergences: mutual information loss and beyond
- Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection
- Automating the database schema evolution process
- Graceful database schema evolution: the PRISM workbench
- Locality-sensitive hashing scheme based on p-stable distributions
- Pre-training of deep bidirectional transformers for language understanding
- ALEX: an updatable adaptive learned index
- Automatic generation of normalized relational schemas from nested key-value data
- Semantic integration research in the database community: a brief survey (AI Magazine)
- DeepER: deep entity resolution
- Clio: schema mapping creation and data exchange
- Aurum: a data discovery system
- Seeping semantics: linking datasets using word embeddings for data discovery
- Domain-adversarial training of neural networks
- Schema mapping discovery from data instances
- Clio: a semi-automatic tool for schema mapping
- CoDEL: a relationally complete language for database evolution
- Living in parallel realities: co-existing schema versions with a bidirectional database evolution language
- MigCast: putting a price tag on data model evolution in NoSQL data stores
- Learning a partitioning advisor for cloud databases
- Evolution management of multi-model data
- Data migration
- Low-resource deep entity resolution with transfer and active learning
- A collective, probabilistic approach to schema mapping using diverse noisy evidence
- Learned cardinalities: estimating correlated joins with deep learning
- Ontology versioning on the semantic web
- Evolution management of multi-model data
- Toward building entity matching management systems
- The case for learned index structures
- Learning to optimize join queries with deep reinforcement learning
- Adversarial machine learning at scale
- Deep learning (Nature)
- QTune: a query-aware database tuning system with deep reinforcement learning
- S2JSD-LSH: a locality-sensitive hashing schema for probability distributions
- A comprehensive benchmark framework for active learning methods in entity matching
- Open data integration
- Schema mapping as query discovery
- NoSE: schema design for NoSQL applications
- Query rewriting for continuously evolving NoSQL databases
- Managing and querying transaction-time databases under schema evolution
- PRIMA: archiving and querying historical data with evolving schemas
- Scalable architecture and query optimization for transaction-time DBs with evolving schemas
- Deep learning for entity matching: a design space exploration
- Organizing data lakes for navigation
- Table union search on open data
- Weld: a common runtime for high performance data analytics
- Mapping XML and relational schemas with Clio
- Language models are unsupervised multitask learners
- An online bibliography on schema evolution
- Managing schema evolution in NoSQL data stores
- Deep learning in neural networks: an overview
- Discovering queries based on example tuples (ACM SIGMOD International Conference on Management of Data)
- Non-blocking lazy schema changes in multi-version database management systems
- Learning from simulated and unsupervised images through adversarial training
- Data integration: the current status and the way forward
- NoSQL schema evolution and data migration: state-of-the-art and opportunities
- Using probabilistic reasoning to automate software tuning
- Schema mappings and data examples
- Reuse and adaptation for entity resolution through transfer learning
- Data curation with deep learning
- The problem of concept drift: definitions and related work
- Automatic database management system tuning through large-scale machine learning
- Preserving mapping consistency under schema changes
- Schema management for document stores
- Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
- Semantic adaptation of schema mappings when schemas evolve
- An end-to-end automatic cloud database tuning system using deep reinforcement learning
- Auto-EM: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning
- JOSIE: overlap set similarity search for finding joinable tables in data lakes
- Auto-Join: joining tables by leveraging transformations
- LSH ensemble: Internet-scale domain search
- Lachesis: automated generation of persistent partitionings for big data applications