key: cord-0555854-lmmlvexc authors: Santra, Abhishek; Komar, Kanthi; Bhowmick, Sanjukta; Chakravarthy, Sharma title: From Base Data To Knowledge Discovery -- A Life Cycle Approach -- Using Multilayer Networks date: 2021-05-24 journal: nan DOI: nan sha: 545c94f9731a67ec467ea70287ad5cc1b385b2dd doc_id: 555854 cord_uid: lmmlvexc

Any large complex data analysis to infer or discover meaningful information/knowledge involves the following steps (in addition to data collection, cleaning, and preparing the data for analysis, such as attribute elimination): i) modeling the data -- an approach for modeling and deriving a data representation for analysis using that approach, ii) translating analysis objectives into computations on the model generated; this can be as simple as a single computation (e.g., community detection) or may involve a sequence of operations (e.g., pair-wise community detection over multiple networks) using expressions based on the model, iii) computation of the expressions generated -- efficiency and scalability come into the picture here, and iv) drill-down of results to interpret or understand them clearly. Beyond this, it is also meaningful to visualize results for easier understanding. The Covid-19 visualization dashboard presented in this paper is an example of this. This paper covers all of the above steps of the data analysis life cycle using a data representation that is gaining importance for multi-entity, multi-feature data sets - Multilayer Networks. We use several data sets to establish the effectiveness of modeling using MLNs and analyze them using the proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis. The data sets used are US commercial airlines, IMDb, DBLP, and Covid-19. Our experimental analyses using the identified steps validate modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. We demonstrate the drill-down that is afforded by this approach (due to structure and semantics preservation) for a better understanding and visualization of results.

Further, we demonstrate, using several data sets and objectives, how the expressions corresponding to objectives are evaluated using an efficient decoupling-based approach and how results are drilled down to obtain actionable knowledge from the data set. Using the widely used Enhanced Entity Relationship (EER) approach for data representation, we demonstrate how to generate EER diagrams for data sets and further generate, algorithmically, MLNs as well as a relational schema for analysis and drill-down, respectively. Using communities and centrality for aggregate analysis, we demonstrate the flexibility of the chosen model to support a diverse set of objectives. We also show that, compared to current analysis approaches, a "divide-and-conquer" approach of MLNs is more appropriate as well as efficient, and, more importantly, preserves the structure and semantics of the results. For this computation, we need to derive expressions for each analysis objective using the MLN model. We provide guidelines to translate English queries into analysis expressions based on keywords. Finally, we use several data sets to establish the effectiveness of modeling using MLNs and analyze them using the proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis.
The data sets used are from US commercial airlines, IMDb (a large international movie data set), the familiar DBLP (or bibliography database), and the Covid-19 data set. Our experimental analyses using the identified steps validate modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. Furthermore, we demonstrate the drill-down that is afforded by this approach (due to structure and semantics preservation).

Big data analytics is predicated upon our ability to model and analyze disparate and complex data sets. Relational database management systems (or RDBMSs) have served well for modeling and querying data sets that need to be managed over long periods of time. Data warehouses and On-Line Analytical Processing (or OLAP) came about to improve the querying aspects of RDBMSs using more powerful multi-dimensional analysis queries that were not possible with SQL. This evolution has continued with NoSQL systems providing alternate data models and analysis for data that were difficult (or inefficient) to handle using RDBMSs. On the mining and knowledge discovery side, long-term data management is not an issue, but as the complexity of data increases, data models are needed for modeling the data in the best possible way to develop specific algorithms for mining. Although graph models have been used for this, we see a need for more powerful data models that can capture data semantics better. We see the applicability of Multilayer Networks (or MLNs) and their analysis potential as another important step in the evolution of aggregate analysis of complex data sets. Our focus is on the complete life cycle and automating all the steps, as much as possible, for knowledge discovery. Hence, we are not considering efficiency issues, benchmarks, or other approaches, such as neural networks. This helps us focus on the complete life cycle rather than comparisons of individual steps with alternatives. We are also delimiting our approach, in this paper, to data sets with diverse types of entities that are defined by multiple features and interact through different relationships. Although graphs are used as the basis, the analysis and computations performed in knowledge discovery are quite different from the ones addressed in NoSQL systems, such as Neo4J [21]. As an example, consider a data set about a group of actors and directors (as entities). Each person has some attributes associated with them, such as who they co-act with, which genre they direct, etc. (termed features). Actors and directors can also be connected (termed relationships) if an actor is directed by a director. Implicit relationships can also be inferred between two entities if they share similar features or transitivity holds. For this data set, finding a strong group of actors and directors, where actors co-act and directors direct similar genres, cannot be expressed as a query. It is a computation that requires finding communities that satisfy certain properties. A critical question is how to model multi-entity, multi-feature data sets that also involve explicit relationships among entities precisely, and how to express and analyze objectives clearly. Data set characteristics growing from a single entity type and feature and/or relationship to multiple features and relationships lead to the following new challenges: - Model Expressiveness.
With the presence of multiple types of entities and different relationships among the same and different types of entities, it is important to use a model that is expressive, i.e., preserves clarity of semantics as well as the structure of the data being modeled. - Flexible Analysis Alternatives. Analysis objectives on multi-feature data may be specified on a subset of entities/features. For example, given a data set about actors, movies in which they have acted, and directors who have directed the movies (see the IMDb data set used in this paper), the analysis can involve actors and their movie rating, or other actors with whom they work, or any combination thereof. Also, in the IMDb data set, although both actors and directors are people, they are considered separate types of entities, as they perform different roles. The challenge is to allow for flexibility of mixing and matching of entities/features for analysis, while avoiding loss of information or redundancy of computations. - Analysis Objectives to Expressions and Their Efficient Computation: Beyond modeling and flexibility of analysis, it is important to have a methodology for translating or mapping analysis objectives into computations on the model and further computing them efficiently. In this paper, we present an algorithmic approach to translate objectives to analysis expressions using computations available on the data model. For efficiency of computation, we use an approach that has been proposed for MLNs [72, 73]. Here, data structures, use of parallel/distributed approaches, and the ability to scale become important. - Drill-down and Visualization of Results. Finally, understanding the results (knowledge discovered) is extremely important. The model and computation using the model should not mask the structure and semantics of the results. This is where we also think the MLN approach has an edge as compared to earlier approaches to this problem. Visualization is likely to become easier once the structure and semantics of results are preserved. We have developed an architecture for visualizing both base data and analyzed results [71]. We show the results from that dashboard in this paper. The problem addressed in this paper can be stated as follows: for a given data set D with T entity types, F features, and a set of analysis objectives O, propose a life cycle framework that (i) generates an expressive model for the data that preserves structure and semantics, (ii) allows flexibility of selecting a combination of entities and features for analysis, (iii) algorithmically generates expressions for O and computes them efficiently, and (iv) supports drill-down and visualization of the knowledge discovered for easier understanding. Recently, a generalized framework (a model and computation using that model) has been proposed that addresses some of these challenges. Earlier work in multi-feature data analysis (see Section 2) was focused either on a specific application or a specific analysis technique. In this paper, we focus on the knowledge discovery life cycle using the new approach over the MLN data model. Overview of the Paper. Our main contribution in this paper is to develop a complete data analysis (or knowledge discovery) life cycle using a generalized framework for modeling and computation on multi-feature, multi-entity data sets and show its versatility and effectiveness. Based on the survey and comparison of the currently used techniques for modeling and analyzing multi-feature data, we have chosen the MLN approach for modeling the data and its analysis using the decoupling approach.
We have chosen the EER approach for modeling data and generating the MLN model. We also demonstrate that by combining these techniques the chosen framework can handle diverse data sets with multiple features and entity types as well as varied analysis objectives, i.e., the entire life cycle. - Expressive Modeling: In Section 4, we compare the advantages and disadvantages of several approaches for modeling multi-feature, multi-entity data, and show why multilayer networks (MLNs) are a better alternative with many advantages. - Analysis Life Cycle: In Section 5, we present the steps of the analysis life cycle from data set description and objectives to discovering knowledge corresponding to those objectives. Further, we demonstrate how drill-down of results for visualization can be accomplished using the same framework. - Efficient Analysis: In Section 6, we summarize the decoupling approach using which information about each feature/entity type is analyzed separately and then composed efficiently to obtain results for the objectives. - Algorithmic Translation of Objectives: In Section 7, we present an approach to translate analysis objectives into computation expressions using the generated MLN model characteristics and available computations. - Validation, Drill-down, and Visualization: In Section 8, we present drill-down and visualization of results. Where possible, we validate our experimental results with independently available ground truth. Further, several visualization approaches have been used in our dashboard, which is capable of visualizing both base data as well as analysis results. The goal is to facilitate a better understanding of the results. We present related work in Section 2. We present the data sets used along with expected analysis objectives in Section 3. Conclusions are in Section 9.

We present related work corresponding to the steps of the life cycle in this section. EER Modeling: Since the Seventies, the EER model [39] has served as a methodology for database design, representing important semantic information about real-world applications. Relational database modeling has clearly benefited from this body of work and has motivated UML (Unified Modeling Language) for OO (Object-Oriented) design. A good EER diagram, based on the data and queries to be supported, is critical and goes a long way toward an error-free relational database schema. Numerous tools [13, 12, 17, 24, 7, 9, 11] have been developed for creating the EER diagram and algorithmically mapping it into relations for different commercial DBMSs. However, with the emergence of data sets with relationships among entities and complex application requirements, such as shortest paths, important neighborhoods, dominant nodes (or groups of nodes), etc. [54, 43], the relational data model was not the best choice for modeling as well as analyzing them [38]. This led to the evolution of NoSQL data models including the graph data model [29]. In many applications, such as Facebook (friendship relationship), Movies (collaboration relationship) and Twitter (follower-followee relationship), relationships needed to be modeled explicitly using the graph model. This gave access to a wide variety of analyses that were available for these data models. Recently, there has been some work in the area of graph modeling from EER diagrams, but it is limited to simple attribute graphs only [70, 39, 67, 28]. Only recently, an approach [57] has been proposed for using the EER approach for generating MLNs using data sets and analysis objectives.
This approach has been adapted in this paper. A number of tools have been developed for creating EER diagrams, and these tools also translate the developed EER diagram into system-specific relational schemas for Oracle, DB2, etc. This makes the development of a model a bit easier, although the mapping from requirements to an EER diagram is still subjective, relies on experience, and hence is not unique. When it comes to mining, this modeling approach is not typically adopted. A decision to use a specific mining algorithm is based on the context and experience as well as the desired objectives, and, if needed, the data is transformed into a different format. For example, association rule algorithms use different representations of data than many other mining algorithms. However, when a graph is used as a data model, the choice of nodes, edges and their labels (if needed) becomes important, and there may be multiple alternative ways of creating them depending on the analysis objectives. Further, creating edges may need a similarity/proximity criterion which needs to be identified or specified. Our approach for data analysis stems from: i) the need for analyzing the same data in multiple ways, ii) a number of aggregate analysis alternatives (e.g., community, centrality, substructure, to name a few) that can be used, and iii) the need for generating expressions for analysis rather than a single computation as is typically done in mining. Hence, this approach and framework, although new to knowledge discovery, are effective, as we illustrate in Section 7. This problem requires further research, as we consider our contribution in this paper to be a starting point. MLNs can be classified into Homogeneous (HoMLNs), Heterogeneous (HeMLNs) and Hybrid (HyMLNs) depending upon the characteristics of entities/nodes in each layer and their connectivity to other layers. If the entities and their types are the same across all layers, it is a HoMLN, where the same entities across layers are assumed to be connected implicitly. If the entities and their types are different from layer to layer, explicit edges are used, as needed, to connect entities between two layers, and these correspond to HeMLNs. Hybrid MLNs (or HyMLNs) are a combination of these two. Analysis of HoMLNs: Community detection algorithms have been extended to HoMLNs for identifying tightly knit groups of nodes based on different feature combinations (reviews: [55, 51, 87, 61]). Algorithms based on matrix factorization [49], cluster expansion [60], Bayesian probabilistic models [88], regression [36] and spectral optimization of the modularity function based on the supra-adjacency representation [90] have been developed. Further, methods have been developed to determine centrality measures to identify highly influential entities [46, 78, 89]. However, all these approaches analyze an MLN by reducing it to a simple graph by aggregating all (or a subset of) layers, which is likely to lead to loss of semantics as the entity and feature type information is lost. Other approaches that consider the entire MLN as a whole result in increased complexity due to repeated traversals of individual as well as connected layers. Recently developed decoupling-based approaches combine partial analysis results from individual layers systematically in a loss-less manner to compute communities [72] or centrality hubs [73] for layer combinations. There is no aggregation of layers in this approach.
Due to the "divide and conquer" approach of decoupling, this method has been shown to be more efficient as it avoids re-computation of layer communities/hubs, and it also provides flexibility of analysis. Analysis of HeMLNs: The majority of HeMLN work (reviews in [76, 81]) focuses on developing meta-path based methods for similarity of objects [85], object classification [84], missing link prediction [91], ranking/co-ranking [75], and recommendations [77]. A few existing works propose methods to generate clusters of entities [64, 82]. Most of them concentrate mainly on inter-layer edges and not the networks themselves. Moreover, existing approaches (type-independent [48] and projection [31]) do not preserve the structure or the types and labels of nodes/edges without extensive mapping and unmapping before and after computation. The type-independent approach collapses all layers into a single graph keeping all nodes and edges (including inter-layer edges) sans their types and labels. Similarly, the projection-based approach projects nodes of one layer onto another layer and uses layer neighbors and inter-layer edges to collapse two layers into a single graph with one entity type instead of two. Drill-down of analysis results is critical, especially for complex data which has both structure and semantics. For example, it is not sufficient to know the identities of objects in a community; additional details of the objects are also needed. Similarly for a centrality hub. As we are using MLNs as the data model, we also need to know the objects across layers and their inter-connections, if it is a HeMLN. From a computation/efficiency perspective, minimal information is used for analysis, and the drill-down phase is used to expand upon it to the desired extent. Our algorithms, especially the decoupling-based ones, make it easier to perform drill-down without any additional mappings back and forth for recreating the structure. Our schema generation also separates information needed for drill-down (Relations) and information needed for analysis (MLNs) from the same EER diagram. It also generates the information needed for the translation of objectives into expressions. Visualization is not new either, and there exists a wide variety of tools for visualizing base data, results, and drilled-down information in multiple ways [5, 20, 25]. Our focus, in this paper, is to make use of available tools in the best way possible and not propose new ones. For example, we have experimented with a wide variety of tools including maps, individual graph and community visualization, animation of features in different ways, hovering to highlight data, and real-time data fetching and display based on user input as a menu. Perhaps the main contribution of visualization is our architecture with clearly defined modules for analysis, visual output generation, and user interaction. In addition to the efficiency aspects of the analysis module, we have also paid attention to the efficiency of visualization creation and its access by caching pre-generated results (to avoid re-generation) and the use of a hash lookup [71]. Finally, this paper is not about the efficiency of MLN analysis as that is discussed elsewhere [72, 73]. Also, we are not comparing graph-based analysis with other approaches, such as neural-network based knowledge discovery, to keep the paper focused on the life cycle. It is possible that expression generation and analysis could be replaced with other approaches that fit this framework.
We present a subset of the data sets we have analyzed using the approach presented in this paper, with their descriptions and analysis objectives. We have chosen data sets from different application domains to illustrate the versatility of our framework. While much larger data sets can be generated, we selected these because reliable ground truth from orthogonal sources was available for some. Due to space constraints, we are discussing a subset of analysis objectives driven by coverage. The data sets chosen cover all types of MLNs and illustrate the generality of the framework. 1. US Commercial Airlines: In this data set, two cities are connected if there is a direct flight between them. The data set is characterized by a single entity type (city). The multiple features are due to the presence of multiple airlines. A similar data set for European carriers has been analyzed in [37] in a different way. Analysis Objectives. Our analysis objectives are: i) For American, Southwest, Spirit, Delta, Frontier and Allegiant Airlines, rank the top five cities that provide the maximum coverage (A1), and ii) predict which city (taking its population into consideration) could be selected as the next hub(s) for Allegiant Airlines to expand its coverage and avoid competition with other airlines (A2). 2. Bibliography Database (DBLP): As most researchers know, the DBLP data set is a publicly available information repository about computer science publications in various conferences and journals. It contains author names, their affiliations, year of publication of papers, conference/journal names, and links to the papers [8]. Clearly, there are multiple entities that can be related based on different types of relationships. Analysis Objectives. Our objectives for this data set are: i) For each 3-year interval group, find the most actively publishing strong author collaboration groups (A3), ii) for each conference-based paper group, find the most popular author collaboration group and further, for each of them, identify their most active 3-year interval group(s) (A4), and iii) identify author collaboration groups who have published in conferences VLDB and SIGMOD, but have never published in conferences DASFAA and DaWaK (A5). 3. IMDb: The IMDb data set is publicly available and has information about movies, TV episodes, actors, directors, ratings and genres of the movies, etc. [19]. Here the entities are of different types, such as actors, directors, movies, etc. The features can also differ since actors can be connected based on co-acting or if they have worked in movies of the same genre. This data set can also be enriched by involving additional information about actors and directors from their social media presence, such as Facebook and Twitter. This is not elaborated due to space constraints in this paper. Analysis Objectives. Some of the analysis objectives for this data set are: i) Cluster actors who have acted together and have a similar average rating (A6), ii) find the groups of actors who have never acted together, but are highly rated on an average and have worked in similar genres (A7), iii) identify genre-based groups of actors and directors having strong collaborations (A8); and iv) identify, for each movie rating group, the genre-based most popular actor and most popular director groups. From this result, find the actor and director groups having strong collaborations (A9).
4. Covid-19 Data: Covid-19 data reported by the New York Times [6] (see https://www.nytimes.com/interactive/2020/us/about-coronavirus-data-maps.html for how this data is curated), along with census data [27] and data from other trusted sources, have been used to compile a data set for the 3141 counties in the US starting from February 2020. It includes features such as the number of daily new cases, number of daily new deaths, latitude-longitude of the county seat, the mean per capita income, population density by land area, total land area, educational qualifications, traffic movement and so on. Analysis Objectives. The inclusion of this data set is for two purposes: i) to demonstrate how MLN analysis can be applied to data sets such as Covid and ii) to visualize analyzed results instead of base data as is typically done. A number of similarity analyses can be done on different desired features. In this paper, we leverage the MLN aggregate analysis approach to visualize how Covid has spread geographically in two arbitrarily selected periods. This can be used for understanding the Covid situation corresponding to pre and post a major event, such as a vaccination drive, spring break, lockdowns, or long holiday weekends. Specifically, we want to visualize how counties with a similar percentage increase in Covid cases/deaths have changed across any two user-defined disjoint periods starting from February 2020, and combine that with other demographic information. The specific objective used in the visualization is given in Section 7 as (A10) and visualization results are shown in Section 8.4. Our selected data sets and analysis objectives are quite varied to illustrate the versatility of the approach. They range from analysis of finding the coverage of individual airlines and clusters of co-actors/co-authors to more complicated predictions like the next planned hub of an airline, future potential teaming of actors, high quality actor-director collaborations and active periods of the most popular co-authors. In addition, some of the data sets, based on the objectives, may come out either as homogeneous or heterogeneous or hybrid, further depicting the capability and completeness of the modeling approach.

A network (or graph) G is an ordered pair (V, E), where V is a set of vertices and E is a set of edges. An edge (v, u) is a 2-element subset of the set V. The two vertices that form an edge are said to be neighbors of each other. Here we consider graphs that are undirected (the vertices in an edge are unordered). A community in a graph consists of groups of vertices that are more connected to each other than to other vertices in the graph. Several algorithms have been proposed for community detection in a simple graph. This objective is achieved by optimizing network parameters such as modularity [40] or conductance [59]. Centrality metrics are used for measuring the importance of vertices. They include degree centrality (number of neighbors), closeness centrality (mean distance of the vertex from other vertices), betweenness centrality (fraction of shortest paths passing through the vertex), and eigenvector centrality (the number of important neighbors of the vertex) [65]. Choice of the metric is derived from analysis objectives.
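As a quick illustration of community detection and these centrality metrics on a simple graph, the following is a minimal sketch (for illustration only, not part of the proposed framework) using the networkx library on a toy undirected graph; the node names are hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy undirected graph: vertices are entities, edges are dyadic relationships.
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),
              ("c", "d"),
              ("d", "e"), ("d", "f"), ("e", "f")])

# Community detection by (greedy) modularity optimization.
communities = greedy_modularity_communities(G)         # e.g., [{a, b, c}, {d, e, f}]

# Centrality metrics defined above.
degree      = nx.degree_centrality(G)                  # normalized number of neighbors
closeness   = nx.closeness_centrality(G)               # inverse of mean distance to other vertices
betweenness = nx.betweenness_centrality(G)             # fraction of shortest paths through a vertex
eigenvector = nx.eigenvector_centrality(G)             # importance of a vertex's neighbors

# Rank vertices by a chosen metric (closeness here), as done later for objective (A1).
print(sorted(closeness, key=closeness.get, reverse=True))
```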
We discuss the different models using which multi-feature data can be represented as a graph and argue why using multilayer networks is a better alternative. Graphs are widely used for modeling data as rich collections of computations have been developed over the years. Their usage has become even more pronounced and important due to the Internet and social networking platforms, such as Twitter, Facebook, and LinkedIn. Newer systems such as Neo4J are a result of this trend. We only discuss graph alternatives in this paper due to their appropriateness for these data sets. However, for drill-down analysis, we use the traditional relational model, which is better suited for it. 1. Single Graphs: Here the data set is represented by a single network or graph. The vertices represent the entities and the edges represent the similarity of end points based on a feature or the dyadic relationships between them. At most one edge is assumed between nodes and labels may not be supported. Advantages. This way of modeling data as networks is very popular due to extensive research in this area and the availability of several algorithms, such as detecting cliques, communities, centrality metrics, mining subgraphs, motifs, search, etc. Disadvantages. A single graph, however, is not best-suited for representing multiple entities and features. Although labels can be used for different entities and features, it is difficult to combine features of different categories (e.g., numerical and categorical) in a meaningful way as one labeled edge. The problem compounds when the entities are also of different types. Moreover, as discussed in Section 6, when analyzing a subset of entities and/or associated feature types, separate graphs have to be generated for each such combination and analysis. 2. Attribute or Knowledge Graphs: Here additional features of the data sets can be represented by including node types in terms of labels (even multiple labels) and multiple edges, even self-loops, corresponding to relationships for different features. Advantages. Attribute graphs have been successfully used in subgraph mining [41, 58, 53], querying [54, 43, 42] and searching [52] over multi-entity-type and multi-feature data sets. They capture more semantic information than simple graphs, and can handle both multiple types of features and entities. Disadvantages. Algorithms for some key analysis functions, such as community and centrality detection, are not yet available for general attribute graphs. Hence, these graphs need to be converted to a monoplex for analysis. Although different features can be stored in the graph, for every subset of features the analysis has to be done separately. If a subset of entities/features is used for analysis, elaborate bookkeeping is needed before and after the analysis to identify node/edge semantics. In other words, structure is not well-captured in this representation. 3. Multilayer Networks: Given the pros and cons of the above options, we propose modeling multi-feature, multi-entity type data sets as multilayer networks (MLNs). Informally, MLNs are layers of single graphs (or monoplexes). Each layer, typically, captures the semantics of one particular feature along with associated entities. As in a monoplex, the graph vertices represent the entities of the data set and the edges represent similarity between the feature values or the dyadic relationship between the end point vertices. The vertices of two layers can also be connected. To differentiate, we term the edges within a layer as intra-layer edges and the edges across the layers as inter-layer edges. There exist, primarily, two distinct types of multilayer networks - homogeneous and heterogeneous. If each layer of an MLN has the same set of nodes (entities of the same type), it is termed a Homogeneous MLN (or HoMLN).
For a HoMLN, intra-layer edges are shown explicitly and inter-layer edges are not shown, as they are implicit. As an example, the US-Airlines data set can be modeled using a HoMLN. The nodes in each layer are the same (cities) and edges correspond to the flights between cities. Each layer captures a different airline. Within a layer, two nodes (cities) are connected if there is a direct flight between them for that airline. It is also possible to capture additional information, such as distance or number of flights per week, using edge labels in this model. Modeling of this data set using the EER diagram and the generation of the MLN for US-Airlines is discussed in Section 5. When the set and types of entities are different across layers, the MLN is termed a heterogeneous multilayer network (HeMLN). The IMDb data set is an example which generates a HeMLN. Each layer has a different entity type as its nodes (e.g., actors, directors, and movies). The graph of a layer is defined with respect to the chosen features and entity types. In the case of HeMLNs, the inter-layer links are defined explicitly based on feature semantics that correspond to an edge (e.g., directs-actor, directs-movie, acts-in-a-movie). It is also possible that a data set generates a combination of the above two, termed a hybrid multilayer network (or HyMLN). Note that whether a data set generates a HoMLN or HeMLN or HyMLN depends on the data set description and the objectives being analyzed. Our choice of four different data sets is to showcase the effectiveness of the approach for generating the appropriate MLN type needed for the analysis. Advantages. Compared to the other options, multilayer networks are more expressive and elegant for modeling data sets with multiple entities, features, and relationships. In MLNs, each chosen feature (or combination) is modeled in a separate layer, and thus this model can support both heterogeneous and homogeneous data sets. MLNs are also better suited from an information representation (i.e., structure and semantics) viewpoint and its visualization. Instead of cluttering all the entities and relationships in a single graph (or layer), they are logically separated and hence are easy to understand. The intra- and inter-layer relationships are also separated semantically. Each incremental change to a feature or relationship, as modeled by addition/deletion of vertices and edges, can be easily included without extensive re-modeling of the already created MLN. Unlike most currently used approaches, there is no need to convert an MLN representation to another one (simple or attributed) for analysis when the decoupling approach, discussed in Section 6, is used. Finally, a subset of the layers can be analyzed, making this model flexible from a selective analysis perspective. Challenges: Having argued for MLNs for modeling, the primary challenge is to develop new algorithms for MLNs for performing analysis. This needs to be done preserving the MLN structure and semantics, as much as possible, during analysis (i.e., without collapsing them as is done by current approaches). The difficulty with the alternative approaches that collapse is the reconstruction of final analysis results to understand them clearly. This requires two mappings: one before collapsing and one for reconstruction. This adds additional computation which is separate from drill-down. Semantics preservation is certainly needed for drill-down of results as shown in Sec. 8. Decoupling-based algorithms used in this paper, by definition, preserve both structure and semantics (removing the need for mapping entities and features back and forth), making it easier for drill-down analysis. Preservation of structure and semantics also directly facilitates visualization clarity of results. Both the community and centrality computation algorithms used in this paper keep the MLN structure intact while computing efficiently.
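To make the MLN representation concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of how the US-Airlines HoMLN described above could be held in memory, with one networkx graph per layer; the city names and routes are hypothetical.

```python
import networkx as nx

# Common node set (cities considered across all airline layers).
cities = ["Dallas", "Chicago", "Denver", "Atlanta", "Las Vegas"]

# One layer per airline; intra-layer edges are direct flights for that airline.
# Inter-layer edges are implicit in a HoMLN (the same city appears in every layer),
# so only the layers themselves need to be stored.
homln = {
    "American":  nx.Graph([("Dallas", "Chicago"), ("Dallas", "Atlanta")]),
    "Southwest": nx.Graph([("Dallas", "Denver"), ("Denver", "Las Vegas")]),
    "Allegiant": nx.Graph([("Las Vegas", "Atlanta")]),
}
for layer in homln.values():
    layer.add_nodes_from(cities)   # every layer shares the same node set

# Additional information (e.g., flights per week) can be attached as edge labels.
homln["American"]["Dallas"]["Chicago"]["flights_per_week"] = 42
```

A HeMLN would additionally store explicit inter-layer edge lists (e.g., directs-actor) between layers whose node types differ.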
Figure 1 shows the steps of the data analysis life cycle from gathering base data and analysis objectives to the discovery of knowledge, its drill-down, and visualization. Given a data set along with its description and the desired analysis objectives, the first step is to choose a model for representing the data. For this paper, we have chosen the multilayer network as the data model, based on the discussion in Section 4. This will be generated from the data set and the analysis objectives using the approach presented in Section 5.1. Once the model is generated, as shown in the figure, expressions for analysis objectives are generated using the MLN model characteristics (along with available computations for the chosen model) and only the analysis objectives. This is an important step and is currently done using keywords and descriptions of the objectives. Eventually, this step needs to be automated, as much as possible, using natural language processing of the objectives along with the model characteristics and heuristics. Following this step is the actual computation of the generated expressions to discover knowledge. Ideally, any available algorithms (for graphs and/or MLNs in our case) can be used for this purpose. Using model characteristics and algorithms as input, the approach used in this paper to generate the expressions is described in Section 7. The results obtained need to be drilled down further using additional attributes that were not retained as part of the MLN model. This is done using additional attributes specified in the EER diagram that are mapped to a relational model from the same EER diagram. For example, if a director id is used in the MLN models, details of the director in terms of location, number of movies directed, etc., may be needed for drill-down. Hence, capturing all information needed for analysis and drill-down is extremely important as part of the EER diagram creation step. We will illustrate the results of detailed drill-down in Section 8. An even better way is to visualize the drilled-down results to make it easier for a non-technical person to understand and interpret the analysis results. This is illustrated in Section 8.4.

The Enhanced/Extended Entity Relationship (EER) approach is widely used for modeling data (and the functionality desired), from which, typically, a database schema is derived. This has been used for all three types of traditional databases - hierarchical, network, and relational. This modeling technique predates the UML (Unified Modeling Language) widely used today for modeling object-oriented applications as well as algorithmic flow and activity diagrams. The purpose of the EER approach is to convert data requirements gathered during the knowledge acquisition phase into a more precise and unambiguous model/representation that, in our case, can be used for algorithmically generating the MLN schema. We have used the same underlying principles for creating an EER diagram of the data set and analysis objectives (queries are used, instead, in the database context), from which an MLN as well as a relational schema are generated.
The MLN schema/model is used for deriving expressions for knowledge discovery corresponding to the objectives, while the relational schema/model is used for drill-down of the knowledge discovered. Once the EER diagram is generated, generation of the MLN is done algorithmically. The details of the motivation and an algorithm to translate an EER diagram to an MLN can be found in [57]. Below, a brief overview is provided through examples using the data sets and analysis objectives described in Section 3. For generating EER diagrams, the typical heuristics used are: nouns as entities, verbs as relationships, and adjectives as attributes. Other considerations based on objectives (e.g., coverage of objectives in our case) are also taken into account. Creating an EER diagram for the US airline data set is relatively straightforward. Although the data set contains airports and their unique codes along with flight information between airports, for simplicity of analysis we have used cities instead of airports. Other information, such as flight number and number of flights per week, is also available. The data set contains direct flight information for American, Southwest, Spirit, Delta, Allegiant, and Frontier flights that were active in February 2018. The analysis objectives are (A1) and (A2) from Section 3. Since the objectives are to analyze each airline for maximum coverage, it is clear that each airline needs to be modeled separately in the EER diagram. For that, US City can be modeled as an entity. The direct airline flights are modeled as self-relationships between cities connecting them. A relationship is used for each airline in the EER diagram - American Direct-Flight, Southwest Direct-Flight, etc. The resulting EER diagram is shown in Figure 2. Since the analysis objectives are for these airlines, only the cities that are served by all the airlines are considered. If an individual airline is analyzed, all cities served by that airline can be included. The objectives also indicate the need for additional information for each hub for objective (A2), as only the hub information is not sufficient. Additional information about entities (e.g., population, per capita income, etc.) is modeled as attributes of the relation US City (see Figure 3(b)) that will be used for drill-down analysis and ranking of cities, as will be shown in Section 8. When the algorithm given in [57] is applied, the 6-layer homogeneous MLN shown in Figure 3(a) is generated. In addition, the relations shown in Figure 3(b) are also generated for drill-down analysis and additional computations, if any. Let us consider the DBLP data set along with the analysis objectives described in Section 3. An EER diagram for it is constructed as follows. Based on the data set description and analysis objectives ((A3)-(A5)), the EER diagram shown in Figure 4 has been developed by following the same steps used in the previous example: i) identifying Entities, ii) identifying Relationships, including self and binary relationships, and iii) Cardinality information. Author, Paper and Year come out as three entities, where some entity characteristics, such as institution and keywords, are modeled as attributes of the Author and Paper entities, respectively. Note that some of the objectives indicate specific analysis requirements, such as 3-year periods, while others are stated in general terms, such as collaboration groups. These are essentially parameters associated with the objectives that need to be resolved for generating layers prior to analysis.
Hence, these are modeled as parameters of the relationships whose values are needed for creating the layer graph prior to actual analysis, but are not needed for generating the MLN schema. This is important, as these parameters provide a way to perform new analyses on the same data set without changing the model and the expressions generated; only the layers that are affected by the parameters need to be re-generated. This adds a significant advantage for the breadth of analysis supported by the approach presented in this paper. The Total Papers attribute of the Author entity is shown as a derived attribute as it can be calculated using the Writes binary relationship. Collaborates-with, Same-Conference and Same-Interval are the three self-relationships that relate two authors if they have worked together on papers, two papers if they are published in the same conference, and two instances of years if they are in the same disjoint interval, respectively. The Collaborates-with and Same-Interval self-relationships are associated with the attributes num-of-papers and 'k'-year-interval-id, respectively, to capture the parameters implicitly specified in the objectives. The value of these relationship attribute parameters becomes the basis for relating two entities and connecting two nodes in the MLN layer graph. In addition to the Collaborates-with relationship, the Author entity is associated with 4 other self-relationships - Collaborates-in-VLDB, Collaborates-in-SIGMOD, Collaborates-in-DASFAA and Collaborates-in-DaWaK - that capture the collaboration relationship between two authors for specific conferences as required by objective (A5). Even these have the attribute parameter num-of-papers that defines the implicit notion of collaboration in the objective. Since Author and Paper are distinct entities by definition, a binary relationship is needed to capture paper authorship, which is a many-to-many relationship. Hence, the Writes, and similarly the Active-in and Published-in, binary relationships capture the information to indicate if an author has written a paper, whether an author was actively publishing in a year, and in which year a paper was published, respectively. Finally, the data characteristics and intuitive assumptions have been used to deduce the min-max cardinality. Once the EER diagram is developed using the data set and analysis objectives, the algorithm in [57] is used to generate the MLN schema shown in Figure 5(a), which happens to be hybrid. The expression generation phase will demonstrate how this hybrid MLN will be used appropriately for expression generation of the objectives. The five self-relationships with the Author entity (Collaborates-With, Collaborates-in-VLDB, Collaborates-in-SIGMOD, Collaborates-in-DASFAA and Collaborates-in-DaWaK) are mapped to five homogeneous AUTHOR layers. For generating these layers, two authors are connected if they have collaborated on at least 3 papers, that is, the num-of-papers parameter value is at least 3. The other two self-relationships (Same-Interval and Same-Conference) are mapped to two layers - YEAR-Same-Interval and PAPER-Same-Conference. Based on the requirement of analyzing 3-year periods, two year nodes are connected in the layer graph if they have the same value of the 'k'-year-interval-id attribute parameter, where k = 3. For instance, for the years 2001 to 2018 present in the data set, [2001-2003] is interval 1 and [2016-2018] is interval 6.
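The following is a minimal sketch (our illustration, with hypothetical paper records) of how such parameterized layer graphs could be generated: two authors are connected if their number of co-authored papers meets the num-of-papers threshold, and years in the same disjoint k-year interval form a clique.

```python
import itertools
from collections import Counter
import networkx as nx

# Hypothetical base data: one record per paper, listing its authors.
papers = [("p1", ["a1", "a2"]), ("p2", ["a1", "a2", "a3"]),
          ("p3", ["a1", "a2"]), ("p4", ["a2", "a3"])]

def author_layer(papers, min_papers=3):
    """AUTHOR layer: edge if two authors co-wrote at least min_papers papers."""
    counts = Counter()
    for _, authors in papers:
        for u, v in itertools.combinations(sorted(authors), 2):
            counts[(u, v)] += 1
    g = nx.Graph()
    g.add_edges_from(pair for pair, c in counts.items() if c >= min_papers)
    return g

def year_layer(years, k=3):
    """YEAR layer: years in the same disjoint k-year interval form a clique."""
    g = nx.Graph()
    by_interval = {}
    for y in years:
        by_interval.setdefault((y - min(years)) // k, []).append(y)
    for group in by_interval.values():
        g.add_edges_from(itertools.combinations(group, 2))
    return g

author_collaborates_with = author_layer(papers, min_papers=3)   # only the a1-a2 edge qualifies
year_same_interval = year_layer(range(2001, 2019), k=3)         # six 3-year cliques
```

Changing min_papers or k re-generates only the affected layers; the MLN schema and the expressions remain unchanged.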
The binary relationships Writes, Active-in and Published-in correspond to inter-layer edges between the corresponding layers representing the entities. A few inter-layer edges are not illustrated in the figure to maintain its visual clarity. Additionally, using the same EER diagram, a relational schema is also generated, as shown in Figure 5(b). These relations are used for drill-down analysis. For the rest of the data sets used in this paper, we will only show the MLN schema generated for them in Section 7 and not go into the details of the EER modeling due to space constraints. A similar process as illustrated above is used.

Since there are not many algorithms available for MLNs while there are a number of widely-used algorithms for single graphs for community, centrality, and substructure discovery, current approaches to MLN analysis take advantage of this by converting an MLN into a single graph. The basic idea is to map the multilayer network to an equivalent single graph in various ways [33, 56]. However, through this process, much of the information in the multilayer graphs can be lost if appropriate mappings are not created and used. In some cases mappings can become fairly complicated. There are mainly two approaches for converting an MLN into a single layer network. The first, used for homogeneous MLNs, is to aggregate the edges of the multilayer network. Specifically, given two vertices v and u, the edges between them from each layer are aggregated to form a single aggregated edge. This process is repeated for all the vertex pairs. Some typical aggregation functions are Boolean AND (intersection), OR (union) or linear functions when the edges are weighted. An example, from homogeneous MLNs, would be aggregating routes of different airlines [37] by applying OR (union). This will give rise to multiple edges between nodes or a single edge (if desired from an analysis perspective). The mapping has to capture this information clearly and be used before and after analysis. For heterogeneous MLNs, aggregation is performed in many ways. The first is type independent [48], that is, it ignores the different types of the entities (and labels) present, and essentially treats the MLN as a homogeneous MLN with a subset of vertices in each layer. The second method is projection-based [31]. Here, if two vertices in a layer are connected to a common vertex in another layer, then an edge is inferred between them. Such "projections" of one layer onto another layer produce inferred edges and then these edges are aggregated. A third approach, used for HeMLNs, is to transform the multilayer network into an attribute graph, where the vertices and edges are labeled based on their types. This graph is analyzed to find specified subgraphs, such as patterns of authors, papers and venues [81] or vulnerabilities in infrastructure networks [30].
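As an illustration of the projection-based approach described above, the following is a minimal sketch (our illustration; the layer contents are hypothetical) that infers an edge between two actors whenever they are connected to a common movie through inter-layer edges.

```python
import itertools
import networkx as nx

# Hypothetical inter-layer (acts-in) edges between an ACTOR layer and a MOVIE layer.
acts_in = [("actor1", "m1"), ("actor2", "m1"), ("actor2", "m2"), ("actor3", "m2")]

def project_movie_onto_actor(inter_layer_edges):
    """Two actors get an inferred edge if they share at least one common movie."""
    actors_by_movie = {}
    for actor, movie in inter_layer_edges:
        actors_by_movie.setdefault(movie, set()).add(actor)
    projected = nx.Graph()
    for actors in actors_by_movie.values():
        projected.add_edges_from(itertools.combinations(sorted(actors), 2))
    return projected

projected = project_movie_onto_actor(acts_in)
print(projected.edges())   # [('actor1', 'actor2'), ('actor2', 'actor3')]
```

Note that, after projection, the movie that induced each inferred edge is no longer represented, which is one form of the information loss discussed next.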
Issues. The single graph/network approach has the advantage that many analysis algorithms for community and hub detection are available (e.g., Infomap [35] and Louvain [32] being prominent ones for community detection). However, the aggregation approaches preserve neither the structure nor the semantics of MLNs (without explicit mapping and unmapping) as they aggregate layers. Importantly, aggregation approaches are likely to result in some information loss or distortion of properties [56] or hide the effect of different entity types and/or different intra- or inter-layer relationship combinations [45]. In cases where the multilayer network is converted to an attribute graph, algorithms for aggregate computations (e.g., community, hub) do not exist. Again, they have to be separated into simple graphs (with at most one edge between any pair of nodes) for analysis. This not only adds cost, but the purpose of modeling is also defeated to a large extent. Some approaches use the multilayer network as a whole [86, 61] and use inter-layer edges, but do not preserve the layer semantics completely. An alternative is to separate desired subgraphs and use single network algorithms, which defeats the purpose of modeling as attribute graphs and is inefficient.

Network decoupling is a method by which MLNs can be analyzed without being transformed. The decoupling approach preserves the structure and semantics of the layers natively in the result and at the same time can take advantage of the existing algorithms. The network decoupling approach developed in [72, 73] is the equivalent of "divide and conquer" for MLNs. This is summarized in Figure 6(b) and is applied as follows, for a given analysis function Ψ and composition function Θ: (i) First, the analysis function Ψ is applied to each layer independently to generate partial results. (ii) Second, for any two chosen layers, the composition function Θ is applied to compose the partial results from each layer and generate intermediate results; for more than two layers, Θ is applied on the results already computed. This is in contrast to current approaches described earlier. Figure 6(a) indicates aggregation-based approaches where structure and semantics are lost (sans mapping). Figure 6(c) illustrates MLN approaches where only inter-layer edges are used instead of all edges. Advantages: The decoupling approach has several advantages over the traditional methods. By using the aggregation approach, information pertaining to the individual layers is lost and it is difficult to measure their relative importance to the system as a whole. In contrast, network decoupling retains the semantic information of each layer and therefore their individual importance and contribution can be measured. The "divide and conquer" approach also facilitates the mix and match of the features and relationships. In the aggregation approach, each time a subset of features is selected, the analysis has to be recomputed, even when the subsets might have overlaps. This leads to redundant computations. Using the decoupling approach, redundant analyses are avoided, since each layer, corresponding to a particular feature, is analyzed separately and then combined. Finally and importantly, the structure and semantics are preserved in the results explicitly, as there is no conversion needed in this approach. Challenges. The decoupling approach can be applied to both HoMLNs and HeMLNs and hence to HyMLNs as well. Moreover, the success of this approach is dependent on correctly matching the analysis objectives using an appropriate Ψ as the analysis function and Θ as the composition function. In Section 7, we show how the network decoupling approach can be applied for our data sets and appropriately determine Ψ and Θ for the diverse analysis objectives (A1) through (A10). A number of algorithms that use the decoupling approach have been developed for community detection for both HoMLN [72] and HeMLN [74] as well as centrality detection [73] for HoMLN. An algorithm for substructure discovery on MLNs has been developed in [68]. There are also some algorithms that compute substructures [34] and community [61] directly on MLNs without collapsing or aggregating them.
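To make the decoupling idea concrete, the following is a highly simplified sketch (our illustration, not the algorithms of [72, 73]): Ψ is community detection applied to each layer independently, and Θ is a Boolean AND style composition that keeps vertices together only if both layers placed them together.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def psi(layer):
    """Analysis function: community detection on a single layer.
    Returns a node -> community-id map (the per-layer partial result)."""
    membership = {}
    for cid, nodes in enumerate(greedy_modularity_communities(layer)):
        for n in nodes:
            membership[n] = cid
    return membership

def theta_and(part1, part2, nodes):
    """Composition function (simplified AND): two nodes stay in the same combined
    community only if each layer's partial result grouped them together.
    Node labels are assumed comparable (e.g., strings)."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    for u in nodes:
        for v in nodes:
            if u >= v:
                continue
            same1 = u in part1 and v in part1 and part1[u] == part1[v]
            same2 = u in part2 and v in part2 and part2[u] == part2[v]
            if same1 and same2:
                g.add_edge(u, v)
    return list(nx.connected_components(g))

# The per-layer partial results psi(layer) are computed once and can be reused for
# any layer combination, which is the main benefit of decoupling, e.g.:
# communities_ab = theta_and(psi(layer_a), psi(layer_b), set(layer_a) | set(layer_b))
```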
In this paper, we use the decoupling-based algorithms as they cover the needs of all analysis objectives under consideration. As indicated in Section 5.1, the modeling of data sets and the generation of an MLN (HoMLN/HeMLN/HyMLN) depends mainly on the relationships identified on the entities in the data set. Typically, self-relationships generate HoMLNs, while n-ary (mostly binary, as we do not yet use/support hyper-edges in MLNs) relationships generate HeMLNs. The EER diagram is also converted to a relational schema for drill-down and additional processing as specified in the objectives. The details of creating edges in each layer come from the attributes (deemed parameters) of the relationships used for MLN generation. For example, for (A4), the 3-year period explicitly provided in the analysis objective is modeled in the EER diagram as an attribute of the relationship Same-Interval and is used for creating the layer graph. Each three consecutive years form a clique in that layer. If a threshold is needed for creating the edges of the Year layer to capture similarity, it becomes a parameter of the analysis alternatives. As an example, for (A3), 3 publications together has been used for the num-of-papers parameter attribute associated with the Collaborates-with relationship that generates the Author layer. This is a parameter that can be modified to perform a different set of analyses. Neither the MLN model nor the expressions change; only the graph of the layer generated based on this parameter value changes. In addition to these, Table 2 is generated, indicating possible Θ for each pair of layers for the DBLP data set. Similar tables are generated for each data set. This is dependent on the outcome of modeling. Table 1 indicates the available Ψ options for each layer. A similar table is generated for each layer produced during modeling of each data set. This depends on the algorithms available for computing expressions, as shown in Figure 1, and is independent of the modeling step. For this paper, we assume community and centrality (both node and closeness) for Ψ. Other possible Ψ options (shown in Table 1) can be any graph-based analysis approach like degree centrality, interesting substructure discovery and so on. The complement of a graph (unary NOT) is another Ψ option that produces a graph with the complement set of edges based on the input graph. For composing homogeneous layers, we assume the binary AND and OR compositions. For composing heterogeneous layers, we assume the Maximum Weighted Matching (MWM) bipartite approach [50] as discussed in [74]. Briefly, algorithms for computing 2-layer communities for HoMLNs use either Boolean AND or OR composition. Algorithms for these are in [72, 73]. The Boolean NOT operator (complement of a graph) can be used for any layer and further composed with other layers using AND or OR. Multiple layer community computation is done by applying the operators on the result of the previous step. The order of Boolean operator application can be user-specified (or generated, as we show in this paper). Composing HeMLNs for community detection is challenging since the entities are of different types. As described in [74], each community is considered to be a meta-node. Two meta-nodes in two different layers are connected if there is at least one inter-layer edge between them. The weight of these edges (meta-edges) between the meta-nodes is given by the number of inter-layer edges between them. This construction creates a bipartite graph. These meta-nodes (communities) in the bipartite graph are paired using the composition function (Θ) Maximum Weighted Matching (MWM), as proposed by Jack Edmonds [50]. Thus, the paired meta-nodes correspond to the heterogeneous MLN communities.
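A minimal sketch (our illustration, assuming networkx; the community memberships and inter-layer edges are hypothetical) of this bipartite meta-node construction and the MWM pairing:

```python
from collections import Counter
import networkx as nx

# Hypothetical community ids of nodes in two heterogeneous layers.
actor_comm    = {"a1": 0, "a2": 0, "a3": 1}
director_comm = {"d1": 0, "d2": 1}

# Hypothetical inter-layer (directs-actor) edges between the two layers.
inter_edges = [("a1", "d1"), ("a2", "d1"), ("a3", "d1"), ("a3", "d2")]

# Meta-edge weight = number of inter-layer edges between a pair of communities.
weights = Counter((("ACTOR", actor_comm[a]), ("DIRECTOR", director_comm[d]))
                  for a, d in inter_edges)

# Bipartite graph whose nodes are the communities (meta-nodes).
meta = nx.Graph()
for (u, v), w in weights.items():
    meta.add_edge(u, v, weight=w)

# Maximum Weighted Matching pairs communities across the two layers.
pairs = nx.max_weight_matching(meta)
# e.g., {(('ACTOR', 0), ('DIRECTOR', 0)), (('ACTOR', 1), ('DIRECTOR', 1))}
```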
The MLNs derived for the US-Airlines and the DBLP data sets (described in Section 5) are shown in Figures 3 and 5, respectively. In Figure 7, we show the MLNs derived after modeling the IMDb (Figure 7(a)) and Covid (Figure 7(b)) data sets as EER diagrams and applying the algorithm in [57]. The figures do not include a few inter-layer edges for the IMDb MLN to maintain clarity. The EER diagrams and other relations derived for drill-down follow the approach illustrated for the other two data sets. We use the characteristics of the MLNs generated to create a table for each data set for looking up Θ during translation. Another table is created for each layer in each data set to indicate what operations (Ψ) are available. The Ψ (Table 1) and Θ (Table 2) lookup tables have been shown for the DBLP MLN. In addition, as the current approach is based on extracted keywords and their interpretation (e.g., nouns as layers, verbs as Ψ, and conjunctions as Θ), another table is used for lookup during translation. This is shown in Table 3, which groups keywords and their possible synonyms in each category with the corresponding choice for computation. The scope of translation depends on the available Ψ and Θ. Objectives that cannot be mapped to available computable operations are indicated as such. Then, we convert each analysis objective for that data set to expressions that can be computed on the MLN model. That is, Ψ and Θ are inferred. Also, additional computations on the result (e.g., sorting, ranking, top-k, ...) may need to be performed based on the wording in the objectives, in addition to the translation. The challenge here is the automation of analysis expression generation given objectives in English that are meant for human consumption. From our experience in analyzing many data sets for diverse objectives, these objectives can be expressed in multiple ways and can be manually translated into several alternative expressions. This is due to multiple interpretations which lead to multiple expressions for the same objective. Figuring out the order of computations of expressions is another challenge. As explained earlier, we use a keyword-based heuristics approach in this paper as there is no general-purpose NLP-based approach that we are aware of. Below, we explain our approach and provide intuitive explanations of how the expressions are generated. Similar to the heuristics used for EER modeling, nouns, verbs, and conjunctions are identified from the objectives. We have highlighted parts of each objective used for expression generation. Phrases in the objective are underlined, italicized, and shown in bold below to indicate their use, respectively, for layer selection (using nouns), analysis function (Ψ) determination (using verb forms), and composition function (Θ) identification (using conjunctions) for the generation of the expression. These can be isolated by an NLP keyword/phrase analysis and used for looking up Ψ and Θ using the three tables indicated earlier. If a mapping is not possible (or a keyword is not properly recognized), the lookup of the tables will fail. Additional computations, as needed, are also inferred from the keywords/phrases in the objective, in the form of a FILTER operation.
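Before walking through the individual objectives, the following is a minimal sketch (our illustration of the idea, not the authors' implementation) of such keyword-driven lookup, using small dictionaries that stand in for the entries of Table 3:

```python
import re

# Keyword/phrase -> analysis function (Psi) or composition function (Theta),
# mirroring the kind of entries listed in Table 3.
PSI_KEYWORDS = {
    "group": "Community", "cluster": "Community",
    "coverage": "Closeness Centrality",
    "hubs": "Degree Centrality",
    "never": "Complement (NOT)", "not": "Complement (NOT)",
}
THETA_KEYWORDS = {"and": "AND", "but": "AND", "or": "OR", "for each": "MWM"}

def translate(objective: str):
    """Return the Psi/Theta choices whose trigger phrases occur in the objective."""
    text = objective.lower()
    psi = {w: f for w, f in PSI_KEYWORDS.items() if re.search(rf"\b{w}\b", text)}
    theta = {w: f for w, f in THETA_KEYWORDS.items() if re.search(rf"\b{w}\b", text)}
    return psi, theta

print(translate("Cluster actors who have acted together and have a similar average rating"))
# ({'cluster': 'Community'}, {'and': 'AND'})
```

Layer selection (nouns) and FILTER inference would require additional tables and rules, and, as noted below, keyword lookup alone cannot resolve every objective (e.g., (A2)).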
Coverage in (A1) corresponds to cities from which one can cover the most cities using the fewest flights. Hence, coverage translates the objective intent to closeness centrality as Ψ, as shown in Table 3. For airlines, a city with a high closeness centrality value has a low average distance (number of flights) to the other cities; ranking cities on this value and retaining the top 5 yields the hubs. The expression derived is shown below.
Expression: Ψ (Each layer), FILTER = top-5 using closeness value; Ψ = Closeness Centrality; Θ = N/A
(A2) Predict which city (taking its population into consideration) could be selected as the next hub(s) for Allegiant Airlines to expand its coverage and avoid competition with other airlines.
Analysis (A2) is more complicated and perhaps beyond the current approach. It is clear from the objective and the mapping table that Ψ is closeness centrality. However, the prediction is for cities that are not currently Allegiant hubs, which requires eliminating the hubs generated by the Ψ computation. Avoiding competition indicates non-overlap with cities that have a high closeness value for other airlines, resulting in the expression shown below. Ψ and Θ can be looked up, and sorting on population at the end is clear; however, with this simple approach it is difficult to derive the set differences. This is a good example of the challenge, mentioned earlier, that keyword-based heuristic translation cannot fully handle. Note that this objective can be specified and computed for any airline. In addition, other city attributes (besides population), such as mean/median income or combinations thereof, can be used. These do not affect either the model or the expression generation. Ordering the resulting cities based on population (or any other attribute) and choosing the top one is the last step.
As per Table 3, the required analysis is community detection (phrases = group, strong group), and the layer results have to be combined using MWM composition due to the identification of heterogeneous layers (the phrase for each serves as the conjunction), resulting in the expression shown below.
Here, community detection is again required (phrase = group). Moreover, all are homogeneous layers, which support Boolean composition, as looked up from Table 2. First, we compute the 2-layer community using AND (between Author-Collaborates-in-VLDB and Author-Collaborates-in-SIGMOD), which is again AND-composed with the NOT communities of the other two conferences (phrases = and, but), thus generating the final expression. Note that the expression generated for this objective can be rewritten to improve computation efficiency. We do not discuss that further in this paper, except to indicate that generated expressions can be optimized using any means available without affecting correctness.
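To make the Boolean (AND/OR/NOT) composition used in objectives such as (A5)-(A7) concrete, here is a minimal sketch over two homogeneous layers that share a node set. It is our own naive illustration using networkx; the paper's decoupling algorithms [72, 73] compose per-layer community results far more efficiently. The layer names, edges, and the louvain_communities call (networkx ≥ 2.8) are illustrative assumptions.

```python
# Naive Boolean composition of HoMLN layers followed by community detection.
import networkx as nx
import networkx.algorithms.community as nx_comm

def and_compose(layer_a: nx.Graph, layer_b: nx.Graph) -> nx.Graph:
    """An edge survives only if it is present in both layers."""
    g = nx.Graph()
    g.add_nodes_from(layer_a.nodes())
    g.add_nodes_from(layer_b.nodes())
    g.add_edges_from(e for e in layer_a.edges() if layer_b.has_edge(*e))
    return g

def or_compose(layer_a: nx.Graph, layer_b: nx.Graph) -> nx.Graph:
    """An edge survives if it is present in either layer."""
    g = nx.Graph()
    g.add_edges_from(layer_a.edges())
    g.add_edges_from(layer_b.edges())
    return g

def not_layer(layer: nx.Graph) -> nx.Graph:
    """Unary NOT: complement of the layer's edge set over the same nodes."""
    return nx.complement(layer)

# Hypothetical actor layers: co-acting and similar average rating.
acts_with = nx.Graph([("a1", "a2"), ("a2", "a3"), ("a1", "a3")])
similar_rating = nx.Graph([("a1", "a2"), ("a3", "a4")])

# Communities of actors who co-act AND have a similar average rating.
combined = and_compose(acts_with, similar_rating)
print(nx_comm.louvain_communities(combined, seed=42))
```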
(A6) Cluster actors who have acted together and have a similar average rating.
Expression generation for (A6) is quite straightforward with the identification of the ACTOR-Acts-with and ACTOR-Similar-AverageRating layers. The cluster keyword indicates community computation on the individual layers, which are combined further using the AND composition (conjunction = and).
Expression: Ψ (ACTOR-Acts-with) Θ Ψ (ACTOR-Similar-AverageRating); Ψ = Community; Θ = AND
(A7) Find the groups of actors who have never acted together, but are highly rated on an average and have worked in similar genres.
For (A7), the layers identified are ACTOR-Similar-AverageRating, ACTOR-Similar-Genre, and ACTOR-Acts-with. These are homogeneous layers as well, and community detection is required (phrase = group). Based on the objective, NOT is applied to layer ACTOR-Acts-with (phrase = never) before composing it with layer ACTOR-Similar-AverageRating, which is then finally composed with layer ACTOR-Similar-Genre using AND composition (phrases = but, and) as Θ throughout, leading to the generated expression. The FILTER sort-on-AverageRating needs to be inferred from the keyword highly in the objective to output the top-k results. The order of composition is inferred from the objective as given.
Table 4 gives the mapping of each analysis question (A1) to (A10) to its actual computation specification (in left-to-right order), analysis function (Ψ), and composition function (Θ). We computed the results for each analysis objective using the expressions derived and shown in Table 4 and compared them, where possible, with independently available ground truth. This helps validate both the modeling and analysis aspects of the proposed life cycle approach. We do not focus on the efficiency of the decoupling approach as it has been established elsewhere [72, 73]. The structure- and semantics-preserving aspects of the decoupling approach allow us to drill down and show details of the experimental results. An excerpt of Table 4 for the DBLP objectives is:
- (A3): YEAR-Same-Interval Θ AUTHOR-Collaborates-With; Ψ = Community; Θ = MWM
- (A4): PAPER-Same-Conference Θ1 AUTHOR-Collaborates-With Θ2 YEAR-Same-Interval; Ψ = Community; Θ1 = Θ2 = MWM
- (A5): (AUTHOR-Collaborates-in-VLDB Θ1 AUTHOR-Collaborates-in-SIGMOD) ...
MLN Details: Based on the direct flights that were active in February 2018, the US-Airline MLN layers are generated; their statistics are shown in Table 5. Based on the expression derived, we computed the closeness centrality for each layer and ranked the cities in each layer according to their closeness centrality value. The top 5 hubs (higher rank means fewer flights required for coverage, i.e., a more central city) were identified for each airline. For all 6 airlines, the ground truth obtained from [47] matched our results. In Table 6 we have listed the top 5 hubs for 4 airlines. As a drill-down byproduct, it is interesting to see common hubs (highlighted) between airlines, which is also verified by the ground truth.
On computing the expression, from the high closeness centrality cities of Allegiant Airlines, we eliminated all cities that also have high closeness centrality in each of the competitor airlines. From this set, we ranked the cities that are not currently Allegiant hubs based on their population. This information is available from the City relation (Figure 3 (b)) that was obtained as a by-product of the EER → MLN process. Table 7 shows the resulting set of cities where Allegiant Airlines can potentially expand its operations. We validated our result by the fact that Grand Rapids has been converted into a hub by Allegiant as of July 6, 2019 [44].
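As a minimal sketch of the (A1) computation (our own illustration, not the system's code): rank the cities of one airline layer by closeness centrality and keep the top 5 as the FILTER step. The edge list below is a hypothetical direct-flight layer.

```python
# Per-layer closeness centrality ranking with a top-k FILTER.
import networkx as nx

def top_hubs(airline_layer: nx.Graph, k: int = 5):
    """Return the k most central cities (highest closeness centrality)."""
    closeness = nx.closeness_centrality(airline_layer)
    return sorted(closeness, key=closeness.get, reverse=True)[:k]

# Hypothetical direct-flight edges for one airline layer.
layer = nx.Graph([
    ("ATL", "ORD"), ("ATL", "DFW"), ("ATL", "LAX"),
    ("ORD", "DFW"), ("DFW", "SEA"),
])
print(top_hubs(layer))
```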
(A3) For each 3-year interval group, find the most actively publishing strong author collaboration groups.
As per the expression, on applying MWM to the community bipartite graph created with all Year and Author communities, we obtained 6 community pairings for the co-author groups who have published the most in each 3-year period (shown in Figure 8 with a list of a few prominent authors). This visualization was accomplished by drilling down the raw results with the help of the relations obtained earlier in Figure 5 (b). The author community ids shown are those generated by the community detection algorithm in the Ψ phase of the decoupling approach. The quality of these results is validated by independently known facts about these author groups. Such insightful results can be further drilled down to find active periods of co-author subgroups, research labs, and universities.
(A4) For each conference-based paper group, find the most popular author collaboration group and further, for each of them, identify their most active 3-year interval group(s).
In order to generate the required communities, based on the expression in Table 4, the most popular author groups for each conference are obtained by MWM (first composition). The 6 matched author communities are carried forward to find the disjoint year periods in which they were most active (second composition); 6 communities are obtained (path shown by bold blue lines in Figure 9). A few prominent names have been drilled down and shown in Figure 9 based on citation count (from Google Scholar profiles). For example, for SIGMOD, VLDB, and ICDM, the most popular researchers include Srikanth Kandula (15188 citations), Divyakant Agrawal (23727 citations), and Shuicheng Yan (52294 citations), respectively, who were active in different periods over the past 18 years.
(A5) Identify author collaboration groups who have published in conferences VLDB and SIGMOD, but have never published in conferences DASFAA and DaWaK.
The drilled-down results are shown in Figure 10 for a few well-known groups, most of whose members had collaborated on a paper that was published in both VLDB and SIGMOD (high ranked), but never in DASFAA or DaWaK (low to medium ranked). There is a high probability that the work done by these groups is not only of good quality but also widely accepted. This claim is supported by independently available information about the groups shown in Figure 10.
MLN Details: For the IMDb MLN, we extracted, for the top 500 actors, the movies they have worked in (7500+ movies with 4500+ directors). The actor set was then repopulated with the co-actors from these movies. As explained for DBLP, the relationship attribute parameters in the EER model help in quantifying the similarity of actors and directors based on the movie genres they have worked in. For each actor/director, a vector was generated with the number of movies for each genre he/she has acted in/directed. In order to consider similarity with respect to the frequency of genres, two actors/directors are connected if the Pearson correlation between their corresponding genre vectors is at least 0.9 (other values can be used depending on the desired similarity strength). The widely used Louvain method [32] is used to detect layer-wise communities (Ψ). Table 9 provides the layer statistics of the generated MLN. Among the drilled-down (A6) results, one community of Bollywood actors acted together in many highly rated Bollywood movies, and Jackie Chan (along with other lesser-known actors) was among the prominent actors of the co-actor group from Hong Kong.
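A minimal sketch of the genre-similarity layer construction just described (our own illustration): connect two actors/directors when the Pearson correlation of their genre-count vectors is at least 0.9. The ids and vectors below are hypothetical.

```python
# Build a similarity layer from per-genre movie counts using Pearson correlation.
import networkx as nx
import numpy as np

def similarity_layer(genre_vectors: dict, threshold: float = 0.9) -> nx.Graph:
    """genre_vectors: actor/director id -> per-genre movie counts (same genre order)."""
    g = nx.Graph()
    g.add_nodes_from(genre_vectors)
    ids = list(genre_vectors)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            r = np.corrcoef(genre_vectors[u], genre_vectors[v])[0, 1]
            if r >= threshold:
                g.add_edge(u, v, weight=float(r))
    return g

# Hypothetical genre-count vectors (e.g., [Action, Comedy, Drama, Romance]).
layer = similarity_layer({
    "actor_1": [10, 1, 0, 2],
    "actor_2": [9, 2, 0, 1],
    "actor_3": [0, 1, 12, 8],
})
print(layer.edges(data=True))   # only actor_1 -- actor_2 meets the 0.9 threshold
```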
(A7) Find the groups of actors who have never acted together, but are highly rated on an average and have worked in similar genres.
Following the expression, we detected 900 groups of actors, most of whom have not worked together but have similar genre preferences and average ratings. From the results, we drilled down into the communities that correspond to a high average rating and have listed a few recognizable actors, along with the prominent genres, from those communities in Table 10. Out of these, as per reports in 2017, there had been talks of casting Johnny Depp and Tom Cruise in pivotal roles in Universal Studios' cinematic universe titled Dark Universe [79].
(A8) Identify genre-based groups of actors and directors having strong collaborations.
On computing the expression, 49 similar genre-based community pairs are obtained, where most actor-director pairs have interacted with each other at least once. Intuitively, a group of actors that prominently works in some genre (say, Drama, Action, Romance, ...) should pair up with the group of directors who primarily make movies in the same genre. In Figure 11 we have drilled down and visualized the community pairings for the Action and Comedy genres, with a few famous actors and directors from each community. Such pairings may help production houses sign up actors and directors for different movie genres. Recently, Vin Diesel signed up for Avatar 2 and 3 (Action movies), which are being directed by James Cameron, and this will be the first time they collaborate [80]. Interestingly, even though they have never worked together, our approach paired them in the groups corresponding to the Action genre on the basis of high interaction among other similar actors and directors.
(A9) Identify, for each movie rating group, the genre-based most popular actor and most popular director groups.
When finding the communities across three layers using the expression in Table 4, we first combine the results of each of the two layers (ACTOR-Similar-Genre, DIRECTOR-Similar-Genre) with those of the common layer (MOVIE-Similar-Rating) to find the most popular genre-based group for each movie rating. Figure 12 (a) shows the drill-down results of one such intermediate combination, where actors (community A144) and directors (community D91) are paired with movies (community M3). However, the most popular actor and director groups for the [6-7) movie rating (represented by M3) do not have many interactions among them, as they belong to different dominant genre groups. Finally, the interactions between DIRECTOR-Similar-Genre communities and ACTOR-Similar-Genre communities are calculated to complete the analysis expression listed in Table 4. Only one HeMLN community was obtained; it is drilled down and visualized in Figure 12 (b). The drill-down of Figure 12 (b) indicates that both popular groups for the [7-8) movie rating are from the Drama genre and that many of these actor-director pairs have collaborated on multiple movies, such as Leonardo DiCaprio and Kate Winslet with Sam Mendes for Revolutionary Road, and Sean Penn with Gus Van Sant for Milk. Thus, the popular groups A175 and D106 paired up with each other. Most importantly, by drilling down into the results it is possible to flesh out potential actor-actor or actor-director collaborations by identifying the missing links for high-degree nodes in the generated HeMLN communities. One such combination is DiCaprio-Swank-Mendes, who have never worked together even though most of their movies belong to the highly rated Drama genre.
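The missing-link drill-down just mentioned can be sketched as follows. This is a rough illustration of our own (not the paper's code): within a paired HeMLN community, list high-degree actor and director nodes that have no inter-layer edge yet, as candidate collaborations. The community members, inter-layer edges, and the min_degree threshold are hypothetical.

```python
# Candidate collaborations: missing inter-layer edges between high-degree nodes.
import networkx as nx

def missing_links(actor_comm, director_comm, inter_edges, min_degree=2):
    """actor_comm / director_comm: sets of node ids in the paired communities.
    inter_edges: existing actor-director edges (e.g., 'directed-by')."""
    bip = nx.Graph(inter_edges)
    candidates = []
    for a in actor_comm:
        for d in director_comm:
            if a not in bip or d not in bip:
                continue
            if not bip.has_edge(a, d) and \
               bip.degree(a) >= min_degree and bip.degree(d) >= min_degree:
                candidates.append((a, d))
    return candidates

# Hypothetical paired community.
print(missing_links(
    actor_comm={"DiCaprio", "Swank"},
    director_comm={"Mendes"},
    inter_edges=[("DiCaprio", "Mendes"), ("DiCaprio", "Z"),
                 ("Swank", "X"), ("Swank", "Y"), ("Mendes", "W")],
))   # [('Swank', 'Mendes')]
```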
MLN Details: Each node in a layer corresponds to a US county (3,141 nodes). County nodes are connected as a clique if they fall in the same band of change in the number of new Covid cases/deaths/hospitalizations/...; several bands are used, ranging from a spike (> 100% increase) to a big dip (100% decrease) with a few in between, so the layer graph varies with the chosen attribute and bands. Based on the analysis objective, 2 disjoint intervals (each ranging from 1 to 30 days) are selected, either arbitrarily or before and after an event (e.g., July 4th, Thanksgiving), to visually understand the effect of Covid between the two chosen intervals. This translates to generating one layer per interval, in which the 3,141 US counties with a similar number of new cases are connected. As per the expression, communities are detected (using Louvain [32]) on the individual layers to find the geographical regions (counties) that have the same percentage of increase or decrease (using bands), and 2 maps are displayed side by side, with counties within the same band having the same color. The colors range from a spike to a big dip in the number of new cases/deaths/hospitalizations/.... Different interval selections can be made around the event of interest to analyze and visualize the effects through the live dashboard that uses the multilayer network architecture underneath (see footnote 9). The effects of two major events visualized in this paper are discussed below. More details can be found in [71].
For the pre-spring-break layer, the 7-day intervals used were Feb 18-Feb 24 and Feb 25-Mar 2. For the post-spring-break layer, the 7-day intervals used were Mar 20-Mar 26 and Mar 27-Apr 2. The drilled-down results have been visualized in Figure 13, which shows how, after the spring break, there was a spike in the number of daily cases in counties across the US. Various reports attributed this massive surge to widespread travel to popular tourist destinations during the break, leading to crowds and non-adherence to social distancing norms [66, 63, 62].
(A10) (ii) Visualize the geographical regions corresponding to the clusters of US counties with a decline in daily confirmed cases before and after the start of the vaccination drive.
For the pre-vaccination-drive layer, the 3-day intervals considered were Sep 20-Sep 22 and Oct 21-Oct 23 in 2020. For the post-vaccination-drive layer, the 3-day intervals were Jan 20-Jan 22 and Feb 21-Feb 23 in 2021. The community (groups of counties) results have been drilled down from the individual layers, and the ones displaying a downward trend have been visualized in Figure 14. This illustration clearly shows that the vaccination drive has become one of the factors controlling the spread of Covid across the US in the past few months. This is also verified by independent sources that report how the administration of the vaccine has led to a decline in severe cases, hospitalizations, and deaths not only in the US but also in other parts of the world [83, 69].
More layers and decoupling-based compositions addressing the analysis objectives listed below are being developed, along with a revised version of the visualization that provides more interaction and choices.
- What is the effect of traffic movement on new cases across major (or centrally connected) counties? How should a county be chosen for lockdown so that it has maximum impact?
- Compare the increase/decrease in the number of new cases with respect to average education, per-capita income, mask usage, and population density.
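Referring back to the band-based layer construction and map coloring described at the start of this subsection, the following is a simplified sketch of assigning each county a band and a color from its percentage change in new cases between two intervals. It is our own illustration; the dashboard's actual band edges and colors may differ.

```python
# Band assignment for the side-by-side county maps: counties in the same band
# share a color (and, in the layer, are connected as a clique).
BANDS = [                          # (lower bound of % change, label, color)
    (100.0,  "spike",   "#67000d"),
    (25.0,   "rise",    "#fb6a4a"),
    (-25.0,  "flat",    "#fee0d2"),
    (-100.0, "big dip", "#2171b5"),
]

def band_of(pct_change: float):
    for lower, label, color in BANDS:
        if pct_change >= lower:
            return label, color
    return BANDS[-1][1], BANDS[-1][2]

def county_colors(cases_interval1: dict, cases_interval2: dict) -> dict:
    """cases_interval*: county FIPS code -> total new cases in that interval."""
    colors = {}
    for fips, before in cases_interval1.items():
        after = cases_interval2.get(fips, 0)
        pct = 100.0 * (after - before) / before if before else 0.0
        colors[fips] = band_of(pct)
    return colors

# Hypothetical counts for three counties across two intervals.
print(county_colors({"48439": 100, "06037": 400, "36061": 50},
                    {"48439": 260, "06037": 380, "36061": 10}))
```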
The success of any data analysis life cycle is predicated on our ability to: i) appropriately model the data set and automate schema generation as much as possible, ii) provide analysis alternatives using the chosen model, iii) map user-specified objectives to analysis expressions, iv) develop algorithms for computing expressions, preferably efficiently, and v) drill down and visualize results for ease of understanding and interpretation. This, as we know, is an iterative process. In this paper, our focus has been to address the complete life cycle for aggregate analysis using Multilayer Networks (or MLNs) as the underlying data model. We demonstrate how to create EER diagrams for "multi-entity, multi-feature" data sets and use an algorithm that maps the EER diagram into MLNs of different kinds, as appropriate. We have shown how user-specified objectives, in English, can be translated using keyword-based heuristics into aggregate analysis expressions for all types of MLNs. We have also demonstrated the applicability of the decoupling approach for the efficient analysis of complex data sets using multilayer networks. Drill-down and visualization of the results is an important but sometimes ignored component; we have shown how analysis results can be visualized using a general-purpose approach with a Covid-19 dashboard.
We are working on further automating the translation of objectives into analysis expressions by using natural language processing and model characteristics. We are also developing decoupling-based efficient algorithms for aggregate analysis on MLNs for centrality, substructure discovery, and motif detection, to name a few. Finally, we are broadening our analysis to include non-Boolean compositions for homogeneous MLNs, weighted graphs, and labeled graphs. This will further extend the expressive power of the MLN data model and the automation of the knowledge discovery life cycle.
References
- Allegiant Airlines routes
- American Airlines routes
- AWS Glue
- The Centers for Disease Control Covid dashboard
- DBLP dataset
- DbSchema
- IBM InfoSphere Data Architect
- Toad Data Modeler
- The property graph database model
- Survey of graph database models
- On robustness in multilayer interdependent networks
- A multilayer network approach for guiding drug repositioning in neglected diseases
- Fast unfolding of community hierarchies in large networks
- The structure and dynamics of multilayer networks
- Mining coherent subgraphs in multi-layer graphs with edge labels
- Community detection and visualization of networks with the map equation framework
- Mining hidden community in heterogeneous social networks
- Emergence of network features from multiplexity
- DB-Subdue: Database approach to graph mining
- The entity-relationship model: Toward a unified view of data
- Finding community structure in very large networks
- Duplicate reduction in graph mining: Approaches, analysis, and evaluation
- Plan before you execute: A cost-based query optimizer for attributed graph databases
- Query processing on large graphs: Approaches to scalability and response time trade-offs
- Grand Rapids is 'sweet spot' for airline base
- Navigability of interconnected networks under random failures
- Centrality in interconnected multilayer networks
- Major airline hubs
- Layer aggregation and reducibility of multilayer interconnected networks
- Clustering with multi-layer graphs: A spectral perspective
- Maximum matching and a polyhedron with 0,1-vertices
- Community structure in graphs
- Efficient keyword search on graphs using MapReduce
- Substructure discovery in the SUBDUE system
- Querying knowledge graphs by example entity tuples
- Community detection in multi-layer graphs: A survey
- Multilayer networks
- EER→MLN: EER approach for modeling, mapping, and analyzing complex data using multilayer networks (MLNs)
- Finding frequent patterns in a large sparse graph
- Community structure in large networks: Natural cluster sizes and absence of large well-defined clusters
- Scalable community discovery on textual data with relations
- Community detection in multiplex networks
- JUE Insight: College student travel contributed to local COVID-19 spread
- The costly toll of not shutting down spring break earlier
- Community structures in bipartite networks: A dual-projection approach
- Networks: An Introduction
- New study: College spring break helped spread the coronavirus
- Conceptual and database modelling of graph databases
- MLN-Subdue: Decoupling approach-based substructure discovery in multilayer networks (MLNs)
- Reduction in COVID-19 patients requiring mechanical ventilation following implementation of a national COVID-19 vaccination program - Israel
- Modeling graph database schema
- CoWiz: Interactive COVID-19 visualization based on multilayer network analysis
- Efficient community re-creation in multilayer networks using Boolean operations
- Hubify: Efficient estimation of central entities across multiplex layer compositions
- A new community definition for multilayer networks and a novel approach for its efficient computation
- Constrained-meta-path-based ranking in heterogeneous information network
- A survey of heterogeneous information network analysis
- Semantic path based personalized recommendation on weighted heterogeneous information networks
- Centrality rankings in multiplex networks
- Dark Universe: Johnny Depp and Javier Bardem join Tom Cruise in Universal's monster movie franchise
- Avatar 2 and 3: Vin Diesel joins cast of James Cameron's long-awaited sequels
- Mining heterogeneous information networks: A structural analysis approach
- RankClus: Integrating clustering with ranking for heterogeneous information network analysis
- Four reasons experts say coronavirus cases are dropping in the United States
- Text classification with heterogeneous information network kernels
- RelSim: Relation similarity search in schema-rich heterogeneous information networks
- Community extraction in multilayer networks with heterogeneous community structure
- Community detection in social networks
- A model-based approach to attributed graph clustering

Acknowledgements: For this work, Drs. S. Chakravarthy and A. Santra were partially supported by NSF Grant 1955798, and Dr. Bhowmick was partially supported by NSF Grant 1916084.