key: cord-1018869-sw41jdqw
authors: Rani, Asma; Goyal, Navneet; Gadia, Shashi K.
title: Big social data provenance framework for Zero-Information Loss Key-Value Pair (KVP) Database
date: 2021-11-09
journal: Int J Data Sci Anal
DOI: 10.1007/s41060-021-00287-9
sha: e555bb517977951c70cf2f47e7e0de5828ee46a5
doc_id: 1018869
cord_uid: sw41jdqw

Social media has been playing a vital role in information sharing at massive scale due to its easy access, low cost, and fast dissemination of information. Its capacity to disseminate content across a wide audience has raised a critical challenge: determining the social data provenance of digital content. Social data provenance describes the origin, derivation process, and transformations of social content throughout its lifecycle. In this paper, we present a Big Social Data Provenance (BSDP) framework for a key-value pair (KVP) database using the novel concept of a Zero-Information Loss Database (ZILD). In our proposed framework, a huge volume of social data is first fetched from social media (Twitter's network) through live streaming and simultaneously modelled in a KVP database using a query-driven approach. The proposed framework is capable of capturing, storing, and querying provenance information for different query sets including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries on big social data. We evaluate the performance of the proposed framework in terms of provenance capturing overhead for different query sets including select, aggregate, and data update queries, and average execution time for various provenance queries.

In the computing world, data are defined as factual information in digital form used as a basis for various qualitative and quantitative analyses. The twenty-first century will be known as the century of data, as it has witnessed an unprecedented growth of data in almost all domains [5, 6]. With the rapid evolution of social media and web-based communication, everyone has become more enthusiastic about sharing their thoughts, ideas, opinions and other content through social media platforms, causing an exponential growth in the size of social data [29]. Social media platforms are the major source of unstructured data in the current times. Unstructured data is characterized by an ad hoc schema and therefore cannot be stored in SQL databases. The growth of unstructured data has led to interest in NoSQL databases, as they are much better suited due to their flexible schema. Therefore, in this paper, our first motivation is to build a flexible KVP data model in Apache Cassandra using a novel query-driven approach to correlate big social data through relationships and dependencies. Over the past few years, social media has become a common platform for global conversation due to its giant size, wide availability, intense speed, and wide range of content. On the other hand, several illegitimate activities are engendered by misusing social content through social engineering [18, 19, 56] to accomplish various objectives. One of the main causes behind illegitimate activities on social media is the separation of digital content from its provenance [12]. In this paper, our second motivation is to explore the need for provenance information associated with digital content published on social media and to design an efficient social data provenance framework for a key-value pair (KVP) database.
Social Data Provenance [30] involves the following three dimensions, viz. "What", "Who", and "When". What provides the description of social media posts, Who describes the correlations among social media users, and When characterizes the evolution of users' behaviour over time. Like data provenance [24, 49], social data provenance also describes the ownership and origin of such information. The term "Big Data" is characterized by 7 V's, viz. Volume, Velocity, Veracity, Variety, Variability, Visualization, and Value. Veracity of big data, defined as the quality, accuracy and truthfulness of the source of data, is directly linked with data provenance. Currently, big data and social media have become synonymous with each other, as the major portion (over 90%) of the total data in the world is produced through several social media platforms such as Twitter, Facebook, Instagram, etc. This rapidly growing, large-scale human-generated data is known as Big Social Data (BSD) [37, 47]. One of the major challenges usually faced by big social data applications is designing a flexible data model in NoSQL databases, as traditional data modelling approaches are not suitable for correct and efficient data model design in such databases. Provenance about the derivation history of big social data is usually called Big Social Data Provenance (BSDP) [21]. In social data analytics, the credibility of an analysis generally depends upon the quality and truthfulness of the input data, which is assured by social data provenance [54]. In this way, social data provenance plays a major role in clarifying opinions to avoid rumours, supporting investigations, and explaining how and when a piece of information was created and by whom. However, distillation of provenance information from such a huge amount of complex data is an extremely tedious task due to its diverse formats. Barbier [3] identified the following issues to address the key challenges in capturing, storing and querying provenance for social data:

- Currently, no social media platform provides any provenance information to users to identify the originators or sources of published information.
- A wide variety of digital content including text, images, and multimedia files is dynamically generated through various social media sites. However, no common format of such data is available to understand the provenance information associated with it.
- No common application programming interface (API) or architectural solution is provided by developers to access and manage provenance data.
- There is no widely accepted mechanism that has the potential to trace provenance objects from such unstructured distributed data.

In addition, several other challenges, such as designing automatic provenance capturing mechanisms, minimizing provenance capturing and querying overhead, choosing the granularity levels at which provenance needs to be captured, and provenance data analysis through provenance visualizations, are also explored for provenance support in big data applications by different authors in [7, 14, 15, 23]. Because of these remarkable challenges, the necessity of capturing and querying provenance information associated with social data has attracted growing interest in the era of social data analytics.
In computer science, provenance has been studied mainly from the following two perspectives: first, database provenance or data provenance, and second, scientific workflow provenance or workflow provenance [50]. Workflow provenance is coarse-grained information that captures the process and the entities involved in it as a black box, while data provenance captures fine-grained information. It focuses on how any result is derived, what queries are executed, and what operations are performed on the data. In this paper, our main focus is on "Data Provenance". Social data provenance describes the origin, derivation and transformations of social content throughout its lifecycle. It is also categorized into the following two categories based on granularity level, viz. fine-grained and coarse-grained provenance [50]. Recently, social data provenance has gained a lot of attention, as it serves different purposes such as audit trail, data discovery, update propagation, incremental maintenance, rumour identification, justification of a query result, etc. Several web-based tools [30, 42] have been developed to capture pre-defined provenance attributes such as name, gender, religion, location, etc., from the different social networking accounts associated with a particular Twitter user. Although these attributes capture complete details of a social media user, they provide neither a provenance path nor a propagation history and updates of any social content published on a social media platform. To reconstruct and integrate provenance of messages in social media, a workflow provenance model, PROV-SAID [16, 51, 52], based on the W3C PROV data model has also been proposed for a small dataset. However, most of the existing approaches do not scale to track provenance metadata for social media efficiently. They are suitable for capturing workflow provenance at a coarse-grained level only. Further, although social data are constantly changing over time, no existing framework is capable of capturing provenance for historical queries, which is an essential requirement of social data provenance. Therefore, the viability of such a framework becomes a necessity to engender trust among social media users. To accomplish this, we propose a Big Social Data Provenance (BSDP) framework for a key-value pair database that is capable of capturing, storing and querying provenance information for different query sets including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries on a live streamed Twitter data set. Relational databases have been the mainstay of the data community for decades, starting from the mid-1980s. They are ideal for structured data and predictable workloads. But these databases are not scalable for handling big data, which encompasses not just structured data, but also semi-structured and unstructured data. Not only SQL (NoSQL) databases have been proposed as an alternative to SQL databases to handle the challenges posed by big data, as these databases efficiently support low latency, horizontal scalability, efficient storage, high availability, high concurrency, and reduced operational costs [8, 17, 35]. Apache Cassandra is one of the most popular key-value pair (KVP) databases belonging to the NoSQL family. The key strengths of Apache Cassandra [20] are its simplicity, scalability, and a very fast streamlined NoSQL architecture in which each column is a data structure that contains a key, a value, and a timestamp.
It is also used in application development by Facebook, Twitter, Cloudkick, Mahalo, etc. Social media is the major source of unstructured data in the current scenario. Therefore, Apache Cassandra is a good choice for handling such extremely high volumes of unstructured data. For applications related to auditing, security, and accountability, there is a need to replay all the operations performed on a database and reproduce the same results as their previous executions. This leads to the requirement of maintaining all the updates (i.e., insert, delete, and update operations) without any loss of information, as provenance data. But conventional/snapshot database systems do not maintain the history of all data objects and store only the current snapshot of the data; as a result, they are ill-suited for such applications. A Zero-Information Loss Database (ZILD) [4] is a special type of database which is based on a temporal database and maintains temporal data as a history of all updates, along with complete information on the operational activities performed in that database. Therefore, it is well suited for designing a provenance framework, especially for capturing provenance for update, insert, delete, and historical/standing queries. In this paper, we design and develop a Big Social Data Provenance (BSDP) framework for a key-value pair database [45]. The proposed framework is capable of answering the following questions: What type of provenance data should be reconstructed from social media? To what extent will it be useful? How can this provenance data be captured? How and where should provenance data be stored? How can provenance data be queried/analysed? In summary, the main contributions of this paper are:

1. BSDP, a novel provenance solution for live streamed big social data that integrates both online and offline modules and captures fine-grained and coarse-grained provenance.
2. Fine-grained provenance is captured in the form of Provenance Path Expressions consisting of the keyspace, column family, row key and column name that contributed towards each result tuple, while coarse-grained provenance is captured in the form of query statements with their execution times.
3. We introduce a novel query-driven data model design methodology for Apache Cassandra.
4. Social data are constantly changing over time; however, no existing approach is capable of capturing provenance for historical queries, which is an essential requirement of social data provenance. On the contrary, our framework aims to maintain all data update (i.e., insert, delete, and update) operations without any loss of information.
5. It supports historical data queries (i.e., querying a data element with a given time in the past or with a time range specified in the query statement) using User-Defined Constructs (UDCs) in CQL (Cassandra Query Language), and captures provenance for standing/historical queries (i.e., it traces the provenance for all the result tuples of a query executed in the past).
6. Our proposed framework is developed around the novel concept of a Zero-Information Loss Database (ZILD) [4]. By a Zero-Information Loss Database, we mean that no data value, no user, and no query and its result is ever lost. ZILDs are very useful in tracking any "data manipulations" that have taken place on social media.
7. Existing solutions for social data provenance are dedicated to a particular social media platform, with limited query support, and are suitable for a small data set.
On the contrary, our framework provides a generalized provenance solution capable of extracting real-life social data from different social media platforms through live streaming by using their supporting APIs, for instance the Graph API for the Facebook social graph and Twitter's Streaming API for Twitter's network.
8. We propose different provenance generation algorithms for select, aggregate, standing, and data update queries with insert, delete and update operations. All the captured provenance is stored in a Zero-Information Loss Key-Value Pair Database (ZILKVD). In addition, all the extracted social data and their provenance information are stored in a common keyspace of Apache Cassandra using a query-driven approach for fast read/write operations and efficient provenance visualization.

A case study of the Twitter social network is given to show the feasibility and usefulness of our proposed framework in capturing, storing and querying social data provenance. Social data analytics is an emerging research field that integrates social communications with data analytics. It extracts meaningful insights from extensively large data sets. It can be used to understand user behaviour and to model social interactions among social media users. Big Social Data [47] is mainly characterized by 3 V's, viz. volume, velocity, and variety, where volume refers to rapidly growing social data, velocity relates to the dissemination of information at tremendous speed, and variety refers to the diverse formats of social data. Nowadays, the volume, velocity, and variety of big social data pose challenges for capturing provenance [12] and evaluating the trustworthiness of social data [29]. Therefore, an efficient provenance data management system is required to trace provenance information through provenance capturing and querying for social data generated from various social media platforms. The importance of social data provenance in social media is also presented in [21, 36, 46], with several key challenges such as measuring the quality and truthfulness of social data, provenance storage, provenance querying, etc. [7, 9, 14, 15, 23, 49, 53]. Several research works have been carried out to identify the suitability of NoSQL databases for managing big social data with efficient storage, fast querying, and horizontal scalability [20, 33]. Different approaches have been proposed to model a huge volume of Twitter data in the Apache Cassandra NoSQL database for efficient querying [11, 26, 40]. A provenance data model for data-intensive workflows is proposed in [13] to capture provenance information for MapReduce workflows using the Kepler-Hadoop framework. The proposed provenance model is a good initiation for scientific workflows; however, it is not very efficient in terms of storage space and query execution overhead. In line with the provenance data model for scientific workflows, the RAMP model is proposed in [28, 39] for Generalized Map and Reduce Workflows (GMRWs), using a wrapper-based approach for provenance capturing and tracing. In this model, all the transformations are either map or reduce functions, rather than having one map function followed by one reduce function. Further, the HadoopProv model [2] is introduced for provenance tracking in MapReduce workflows, where provenance tracking takes place in the Map and Reduce phases separately, and construction of the provenance graph is deferred to the query stage to minimize the temporal overhead.
Several web-based tools [30, 42] have been developed to capture pre-defined provenance attributes such as name, gender, religion, location, etc., from the different social networking accounts associated with a particular Twitter user. Although these attributes capture complete details of a social media user, they provide neither a provenance path nor a propagation history and updates of any social content published on a social media platform. Further, a provenance path algorithm [25] has been proposed to capture the provenance path of a piece of information and explain how it propagates in a social network, but only to a few known recipients. To reconstruct and integrate provenance of messages in social media, a workflow provenance model, PROV-SAID [16, 51, 52], based on the W3C PROV data model is proposed. Although the proposed solution identifies posted tweets that are copied from other published tweets without giving credit to the original tweeter, like a retweet, it is suitable for a small dataset only. Applications of the standard PROV-DM model are proposed in [27] to manage provenance data for bioinformatics workflows in a cloud computing environment using different families of NoSQL databases. A provenance framework based on the algebraic structure of semirings for three specific graph algorithms is presented in [41], to compute provenance of regular path queries (RPQ) over graph databases by applying annotations such as labels and weight functions, which is a quite complex process. A provenance model for vertex-centric graph computation and a declarative Datalog-based query language is presented in [38], to capture and query graph analytics provenance in both online and offline modes. Further, a provenance model for stream processing systems (s2p) is proposed in [55]. Although this model is suitable for capturing fine-grained (operator level) and coarse-grained (process level) provenance through online and offline parts, it does not provide provenance support for historical queries. To satisfy the need for big data provenance, a rule-based framework for provenance identification and collection from log files is proposed in [22]. The proposed framework reduces source code instrumentation, yet raises several questions about the completeness of provenance information, as logs may not capture complete information, including the derivation process. Another big provenance framework is proposed in [10] for provenance collection and storage in an unstructured or semi-structured format for scientific applications. The proposed framework is lightweight and built on a multilayered provenance architecture that supports a wide range of provenance queries. A provenance model for Apache Cassandra, i.e. a key-value pair database, is proposed in [31, 32] to capture provenance information using provenance policies. In this model, provenance querying is performed through resource expressions and a set of predefined operators. The proposed model is implemented on a small-sized patient information system and uses legacy Thrift APIs rather than CQL3, which makes it difficult to write a query. Various change data capture (CDC) schemes are investigated in [48] for Apache Cassandra to track modifications in source data. The logic of each scheme is implemented in Cassandra by combining a MapReduce framework with distributed computing. A layer-based architecture for provenance collection and querying in scientific applications is presented in [1], which stores semi-structured provenance documents in MongoDB in a BSON format.
The proposed architecture performs well for simple queries but does not respond efficiently to complex queries. From the available literature, it is evident that most of the existing provenance models are suitable for capturing provenance for workflows at a coarse-grained level only, rather than at a fine-grained level. Secondly, some of them are not suitable for capturing provenance information for large-scale social media data covering all types of query sets. In this paper, we try to bridge this gap by designing an efficient big social data provenance framework on top of a key-value pair database for capturing, storing and querying provenance information for different query sets including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries on live streamed big social data. A summary of the different characteristics of existing provenance solutions for social data and our proposed BSDP framework is given in Table 1.

In this paper, we propose a Big Social Data Provenance (BSDP) framework built upon a Zero-Information Loss Key-Value Pair Database (ZILKVD) that efficiently captures provenance for all queries including select, aggregate, standing/historical, and data update queries with insert, delete, and update operations. ZILKVD [45] is developed based on the concept of the Zero-Information Loss Database (ZILD) [4, 43, 44, 46]. The proposed framework is very beneficial in tracing the origin and derivation history of a query result. It also supports provenance querying for historical data. The major steps involved in designing the proposed provenance framework are as follows:

1. Fetching a huge volume of real-life social data from Twitter's network through live streaming by using Twitter's Streaming APIs.
2. Modelling the fetched social data in a KVP database based upon a query-driven approach to correlate big social data through relationships and dependencies, in appropriate formats so that it makes sense for further analysis.
3. Designing the ZILKVD architecture with data version support to maintain all insert, delete, and update operations in the form of provenance data, which aids historical data queries and standing queries.
4. Proposing the following three provenance generation algorithms, viz. SelectProv, AggreProv, and StandProv, to generate provenance information for select, aggregate, and standing/historical queries, and to store the captured provenance in ZILKVD.
5. Providing provenance querying support for historical data and tracing the origin and derivation history of a query result.

In addition to the above tasks, a performance analysis of the proposed provenance capturing and querying algorithms is also presented for different query sets. Over the past few years, more than 90% of all data has been contributed through the extensive usage of various social media platforms. Several leading social media platforms such as Twitter, Facebook, Instagram, etc., are responsible for this mammoth volume of data. Among these platforms, Twitter is one of the richest mines of specific, publicly available and retrievable social data, allowing users to share their thoughts with a massive audience. It is tuned for very fast communication over the Internet, with more than 150 million active users publishing approximately 500 million tweets daily. A Twitter user can either create their own tweet or retweet information that has already been tweeted by some other user. A Twitter user can also choose to follow other users. For instance, if a user A follows user B, then user A can see B's tweets in his 'timeline'.
Twitter's popularity as a massive source of information has led to research in various domains [34]. Researchers can obtain this information from Twitter through publicly available Twitter APIs. These APIs fall into the following two categories: first, REST APIs for conducting specific searches, reading user profiles or posting new tweets, and second, Streaming APIs for collecting a continuous stream of public information. In our framework, we use the Streaming APIs to continuously stream tweets and related information whenever a new tweet is published, as shown in Fig. 1. Twitter provides an open standard for authorization known as Open Authentication (OAuth). This authentication mechanism allows controlled and limited access to protected information. A traditional authentication mechanism is vulnerable to theft, while the OAuth mechanism provides a more secure approach that does not expose the user's username and password. Using a three-way handshake, it allows users to grant a third party access to their data. As the user's password for his/her Twitter account is never shared with this third-party application, the user's confidence in the application is also improved. Twitter APIs can only be accessed by a Twitter application using the OAuth authorization mechanism. To get authorization for accessing the protected data, the user first creates a Twitter application, which is also known as a consumer. After this application is registered on Twitter, a consumer key and a consumer secret key are issued to the application by Twitter to uniquely identify it. Using the consumer key and consumer secret key, the application creates a unique Twitter link through which the user authenticates himself/herself to Twitter. After verifying the user's identity, Twitter issues an OAuth verifier to the user. The application uses this OAuth verifier to request an Access Token and an Access Token Secret that are unique to the user. The Twitter application then authenticates the user on Twitter using this Access Token and Access Token Secret, and makes API calls on behalf of the user, see Fig. 2. Using these access credentials, we fetched all the tweets related to a specific event through live streaming to design an efficient key-value pair data model, as explained in Algorithm 1 (i.e., TweetCassandra). The two inputs to the algorithm are (1) the Twitter API Access Credentials, i.e. Consumer Key (C_k), Consumer Secret Key (C_sk), Access Token (A_t), Access Token Secret (A_ts); and (2) the keywords describing the event whose tweets are to be tracked. A minimal client-side sketch of this streaming setup is given below.
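The paper does not name the client library used to implement Algorithm 1; the following is a minimal sketch of the OAuth and streaming setup described above, assuming the Twitter4J Java library. The credential strings are placeholders, and the tracked hashtag is a hypothetical example reusing a hashtag that appears later in this paper.

```java
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TweetStreamSketch {
    public static void main(String[] args) {
        // OAuth access credentials issued to the registered Twitter application (placeholders)
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("C_k")
                .setOAuthConsumerSecret("C_sk")
                .setOAuthAccessToken("A_t")
                .setOAuthAccessTokenSecret("A_ts");

        TwitterStream stream = new TwitterStreamFactory(cb.build()).getInstance();
        stream.addListener(new StatusListener() {
            @Override
            public void onStatus(Status status) {
                // Only the attributes needed by the data model are extracted here;
                // Algorithm 1 (TweetCassandra) would write them to Cassandra instead
                System.out.printf("%d | %s | %s%n", status.getId(),
                        status.getUser().getScreenName(), status.getText());
            }
            @Override public void onDeletionNotice(StatusDeletionNotice notice) { }
            @Override public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }
            @Override public void onScrubGeo(long userId, long upToStatusId) { }
            @Override public void onStallWarning(StallWarning warning) { }
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });

        // Live streaming filtered by an event-specific keyword (hypothetical example)
        stream.filter(new FilterQuery().track("#Vikramlander"));
    }
}
```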
The bulk proliferation of social data has imposed several challenges in the field of social data analytics, such as efficient data model design, querying techniques, etc. Traditional data management and processing tools are incapable of handling this limitless data. Relational/SQL databases are ideal for structured data and predictable workloads, but they do not scale for handling big data, which encompasses not just structured data, but also semi-structured and unstructured data. Social media platforms are the major source of unstructured data in the current times. Unstructured data is characterized by an ad hoc schema and therefore cannot be stored in SQL databases. The growth of unstructured data has led to interest in NoSQL databases, which are much better suited due to their flexible schema. NoSQL represents a family of databases in which each database is quite different from the others, having literally nothing in common. The only commonality is that they use a data model whose structure is different from the traditional row-column relational model of RDBMSs. Graph, Document, Column-oriented, and Key-value pair are the four kinds of NoSQL databases. The basic architecture of a KVP database consists of a two-column hash table in which each row contains a unique id known as a "key" and a "value" associated with this key. KVP databases are a good choice for handling extremely high volumes of data in a distributed processing environment, as they have built-in redundancy, which makes them capable of handling the loss of storage nodes. The key strengths of KVP databases are their simplicity, scalability, and a very fast streamlined NoSQL architecture. They have the capability to perform extremely fast read and write operations. Apache Cassandra is one of the most popular KVP databases under the ambit of NoSQL databases. It is a distributed column family store in which each column is a data structure that contains a key, a value, and a timestamp; therefore, it is also described as a key-value pair column-oriented data store, see Fig. 3. A brief introduction to the elementary components of information in Apache Cassandra is given below:

- Column: the smallest unit of information, containing a key, a value, and a timestamp.
- Super Column: a super column, or composite column, is a group of similar columns, or columns likely to be queried together, with a common name.
- Row: a group of orderable columns, i.e., columns stored in sorted order by their column names, with a unique row key or primary key that can uniquely identify the data.
- Column Family: similar to a table in a relational database but with no pre-defined schema; it also provides the flexibility to have different numbers of columns in different rows. Column families are stored in separate files on disk.
- Keyspace: the highest level of information in Apache Cassandra, analogous to a database in a relational database; it is a set of related column families. It also maintains information about data replication and the replication strategy on nodes.

A minimal sketch of how such a keyspace and column family can be created is given below.
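To make the keyspace and column family terminology concrete, the following is a minimal sketch using the DataStax Java driver (3.x, contemporary with the Cassandra 3.11.3 used in the evaluation). The column family definition is a simplified, hypothetical subset of the paper's user_details column family, extended with the valid_from/valid_to columns that the ZILKVD design relies on.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Keyspace: highest level of information, analogous to a relational database
            session.execute("CREATE KEYSPACE IF NOT EXISTS \"NewTwitter_Keyspace\" "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            // One illustrative column family; the real model has 20 column families
            // organized around the queries the framework must support
            session.execute("CREATE TABLE IF NOT EXISTS \"NewTwitter_Keyspace\".user_details ("
                    + " screen_name text PRIMARY KEY,"
                    + " user_name   text,"
                    + " location    text,"
                    + " valid_from  timestamp,"   // set on insert by the ZILKVD rewriter
                    + " valid_to    timestamp)"); // set instead of physical deletion
        }
    }
}
```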
Although Apache Cassandra is known for flexible data management and for managing the world's biggest datasets on clusters of several nodes deployed at different data centres, one of the major challenges that big social data applications face when choosing Apache Cassandra is data model design, which is significantly different from traditional data model design methodologies. The traditional data model design methodology (i.e., the one used in relational databases) is purely a data-driven approach. On the contrary, data model design for Cassandra begins with application-specific queries; it is purely a query-driven approach. Several SQL constructs such as data aggregation, table joins, etc., are not supported by the Cassandra Query Language (CQL). Therefore, data modelling in Cassandra relies on denormalization of the database schema, enabling a complex query to execute on a single column family only to retrieve the required information. In this way, data duplication is common in Cassandra column families to support a variety of queries. Database schema design for big social data in Cassandra requires not only an understanding of the relationships and dependencies among social data, but also an understanding of how this data needs to be accessed, through a query-driven approach. In this paper, we applied a query-driven methodology in the KVP data model design. By query-driven, we mean designing a data model on the basis of what types of queries our database will be required to support. This approach provides not only the sequence of tasks but also aids in determining what type of data will be needed and when. In our proposed framework, we designed a query-driven data model based on the frequent queries required to execute on the Twitter dataset. Initially, all the tweets posted by different Twitter users in response to a particular event are fetched through Twitter's Streaming APIs. However, not all of this information is useful for our data model; therefore, only the required information, viz. tweet id, tweet text, tweet published date, hashtags, user_name, screen_name, profile created date, twitter id, location, friend list, follower list, etc., is extracted from the input list of tweet objects. Simultaneously, pre-processing is performed on the extracted data to convert it into the required format. Afterwards, all the pre-processed data is stored in different column families of Apache Cassandra. A snapshot of the KVP data model design in Apache Cassandra is given in Fig. 4. The proposed data model contains a keyspace named "NewTwitter_Keyspace" that consists of 20 column families. The various column names of these column families, with their row keys, are also mentioned in Fig. 4. All 20 column families are organized on the basis of the social data set fetched from the Twitter network to support different query sets for capturing, storing, and querying provenance. Cassandra Query Language (CQL) is used for querying and communicating with Apache Cassandra. The proposed provenance framework for big social data is developed on top of a Zero-Information Loss Key-Value Pair Database (ZILKVD). ZILKVD is designed using the concept of a Zero-Information Loss Database [4], to maintain all the insert, delete, and update operations without losing any information, as provenance data. The architecture of ZILKVD consists of the following components, viz. Query Parser, Query Rewriter, Query Generator, Processing Module, and KVP Database, see Fig. 5. When a user issues a query, it is sent to the Query Parser to parse the query and identify its type, i.e. Insert (I), Update (U), or Delete (D) query.

- If the issued query is an "Insert Query" (i.e., to insert a new row in the database), then the parsed results are sent to the Query Rewriter, as mentioned in step I_1, and the corresponding Rewritten Insert Query (Q_i) is generated in step I_2. Here, the "valid_from" column of this new row in the corresponding column family is set to the current date/time, and the query is then sent to the KVP database for further execution.
- If the issued query is a "Delete Query" (i.e., to delete an existing row from the database), then the parsed results are sent to the Query Generator, as mentioned in step D_1, and a corresponding Update Query (Q_u) is generated in step D_2. Here, the value of the "valid_to" column of the row to be deleted from the corresponding column family is set to the current date/time, and the query is then sent to the KVP database for further execution.
- If the issued query is an "Update Query" (i.e., to update an existing row in the database), then the parsed results are sent to both the Query Generator and the Processing Module in steps U_1a and U_1b, respectively. Then, in step U_2, the corresponding Select Query (Q_s) generated by the Query Generator is executed on the KVP database to retrieve the following information: the values of the primary key columns of the row to be updated, the old value of the column before the update, and its write time in the database. This information is sent to the Processing Module in step U_3b to generate the corresponding Provenance Path Expression (ProvPathExp) in the following format, i.e., "Keyspace/Column_Family/RowKey/Update_Column_Name", and then sent back to the Query Generator in step U_4.
- Now, in step U_5, the Query Generator generates an insert query (Q_i) to insert the following information into the "update_provenance" column family, viz. the query statement, ProvPathExp, old_value, old_value writetime (i.e., its valid_from time), new_value, current date/time, etc., for further execution on the KVP database.
- Afterwards, both queries (i.e., the generated insert query Q_i and the issued update query U) are executed on the KVP database in steps U_5 and U_6, respectively, to maintain the complete history of data update operations. Finally, the following information, viz. Query Id, Query Statement, its time of execution, etc., is also inserted into the "query_table" column family through an insert query executed on the KVP database in step U_7.

A minimal sketch of the insert and delete rewrites is given below.
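The following is a minimal sketch of the insert and delete rewrites just described, assuming a heavily simplified query representation (the paper's Query Parser and Query Rewriter operate on full CQL statements, which are not specified at this level of detail). The table, columns, and values are illustrative, reusing names from this paper's examples.

```java
import java.time.Instant;

public class ZilkvdRewriteSketch {

    // Insert rewrite: the new row additionally carries valid_from = current date/time
    static String rewriteInsert(String table, String columns, String values) {
        return String.format("INSERT INTO %s (%s, valid_from) VALUES (%s, '%s')",
                table, columns, values, Instant.now());
    }

    // Delete rewrite: the row is never physically removed; its validity interval
    // is closed instead by setting valid_to = current date/time
    static String rewriteDelete(String table, String whereClause) {
        return String.format("UPDATE %s SET valid_to = '%s' WHERE %s",
                table, Instant.now(), whereClause);
    }

    public static void main(String[] args) {
        System.out.println(rewriteInsert("user_details",
                "screen_name, location", "'DDNewsAndhra', 'Delhi'"));
        System.out.println(rewriteDelete("user_details",
                "screen_name = 'DDNewsAndhra'"));
    }
}
```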
The high-level details of the implementation of the ZILKVD design are given in Algorithms 2 and 3. The two inputs to Algorithm 2 are (1) a KVP database (D_kv) and (2) a query Q (i.e., an insert, delete or update query), and the output of the algorithm is a ZILKVD database with its complete history maintained. According to Algorithm 2, the issued input query Q is first parsed to retrieve the required information, i.e., the parsed result R_p, and to identify the query type, i.e., Q_t, refer to line 1. If Q_t is an insert query, then a corresponding rewritten insert query Q_i is generated and sent for execution on D_kv, refer to lines 3 and 4. If Q_t is a delete query, then a corresponding update query Q_u is generated and sent for execution on D_kv, refer to lines 6 and 7. If Q_t is an update query, then Algorithm 3, i.e., UpdateCassProv, is called, refer to line 9. The following two inputs, i.e., the query Q and its parsed result R_p, are passed to Algorithm 3, and the provenance path expression (P_p) of the updated columns, together with the updated "query_table" and "update_provenance" column families, are obtained as the outputs of Algorithm 3. According to Algorithm 3, all the required information such as KS, CF, PK, CN_u, CV_u, CT_u, etc., is retrieved from R_p, see line 1.

- If Q contains a "Where Clause" in its query statement, then the value of V_pk is retrieved and assigned to RK to uniquely identify a row, refer to lines 2 to 4. Afterwards, a corresponding select query Q_s is generated and executed to retrieve the old value of the column before the update and its write time in the database, i.e., CV_o and CV_owt, respectively, refer to lines 5 and 6. Now, the provenance path P_p (i.e., KS/CF/RK/CN_u) is generated, and the column family "update_provenance" is updated with the values of Q_id, Q, P_p, CV_o, CV_owt, CV_u, CT_u, and the current date/time, refer to lines 7 and 8.
- Similarly, if Q does not contain a "Where Clause", then again Q_s is generated and executed to store all the query results in RS, refer to line 11. Now, for each result tuple r of RS, the values of the following parameters, i.e., V_pk, CV_o, CV_owt, etc., are retrieved, and the value of V_pk is assigned to RK. Afterwards, the corresponding provenance path P_p (i.e., KS/CF/RK/CN_u) is generated, and the column family "update_provenance" is updated with the values of Q_id, Q, P_p, CV_o, CV_owt, CV_u, CT_u, and the current date/time, refer to lines 12 to 16.
- Finally, Q is executed, and the column family "query_table" is also updated with the values of the following parameters, i.e., Q_id, Q, the current date/time, etc., refer to lines 19 and 20.

A minimal sketch of this old-value lookup and provenance insert is given below.
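The following is a minimal sketch of the core of Algorithm 3 for the where-clause case, assuming the DataStax Java driver and the simplified user_details column family sketched earlier. The update_provenance schema used here is a hypothetical subset of the one shown in Fig. 6, and the sketch assumes that column family already exists.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class UpdateProvSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Step U_2: fetch the old value and its write time before the update runs
            Row old = session.execute(
                    "SELECT location, writetime(location) AS wt "
                    + "FROM \"NewTwitter_Keyspace\".user_details "
                    + "WHERE screen_name = 'DDNewsAndhra'").one();
            String oldValue = old.getString("location");
            long oldWriteTime = old.getLong("wt"); // microseconds since epoch

            // Provenance path expression: Keyspace/Column_Family/RowKey/Update_Column_Name
            String provPath = "NewTwitter_Keyspace/user_details/DDNewsAndhra/location";

            // Step U_5: record the update in update_provenance (hypothetical subset schema)
            session.execute("INSERT INTO \"NewTwitter_Keyspace\".update_provenance "
                    + "(query_id, query, prov_path, old_value, old_value_writetime, "
                    + " new_value, update_time) VALUES (?, ?, ?, ?, ?, ?, toTimestamp(now()))",
                    "q1", "update user_details set location = 'Andhra' ...",
                    provPath, oldValue, oldWriteTime, "Andhra");

            // Step U_6: only now execute the issued update itself
            session.execute("UPDATE \"NewTwitter_Keyspace\".user_details "
                    + "SET location = 'Andhra' WHERE screen_name = 'DDNewsAndhra'");
        }
    }
}
```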
A demonstration of the above algorithms with illustrative Example Query 1 is given below.

Example Query 1: Update the location of the user with name "DDNewsAndhra".
Cassandra Query 1: update user_details set location = 'Andhra' where screen_name='DDNewsAndhra';

Initially, the above Example Query 1 is passed as the input query (Q) to Algorithm 2, where the query is parsed to identify its type (i.e., Update Query) and to retrieve the required information. Now, both the query (Q) and its parsed results (R_p) are passed as inputs to Algorithm 3. Here, the provenance path expression (i.e., ProvPathExp) of the updated tuples, along with the updated column families, viz. "query_table" and "update_provenance", of the underlying KVP database are obtained as the outputs of the above algorithm. A snapshot of the "update_provenance" column family is shown in Fig. 6.

We designed and implemented three provenance generation algorithms, for select, aggregate, and standing queries, respectively. The high-level details of all the algorithms, along with illustrative example queries, are given in the following subsections. The proposed framework supports capturing provenance information for select queries. The high-level details of the provenance generation algorithm for select queries, i.e., "SelectProv", are given in Algorithm 4. In the proposed algorithm, a select query (Q_s) and its query id (Q_id) are passed as inputs, and a comma-separated list of provenance path expressions (P) for each value in each result tuple of the query result, along with the updated column families "select_provenance" and "query_table", are obtained as the outputs of the algorithm. Initially, Q_s is parsed, and the following information, viz. KS, CF, PK, CN, etc., is retrieved from the query statement in the form of the parsed result R_p, refer to lines 1 and 2. Then, a rewritten select query Q_r is generated by appending a predicate (i.e., "valid_to") to the query statement, refer to line 3. The value of this predicate is set to null to retrieve the currently existing rows. Afterwards, Q_r is executed, and all its result tuples are stored in the record set (RS), refer to line 4. Now, for each result tuple r of the result set, a unique result tuple id is generated using Q_id, refer to lines 5, 7 and 17. Initially, the value of P for all columns of each result tuple is set to null, refer to line 8. Then, the value of V_pk is retrieved from the result tuple and assigned to RK, refer to lines 9 and 10. After that, for each non-key column C_i of r, a provenance path expression p_i is generated (i.e., KS/CF/RK/C_i), added to the corresponding r, and further appended to P, refer to lines 11 to 14. A provenance path expression consists of a keyspace name, column family, row key, and column name in the following form: "keyspace/columnfamily/rowkey/columnname". Provenance path expressions provide detailed provenance for each result tuple in the query result at different granularity levels, i.e., how a value in a result tuple is derived. Finally, the column families "select_provenance" and "query_table" are also updated, refer to lines 14 to 20. A minimal sketch of this path construction is given below.
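The following is a minimal, driver-free sketch of how SelectProv assembles the result tuple id (Q_id + 't' + k) and one provenance path expression per non-key column. The row key and column names reuse the values of the q6t1 example discussed later and are otherwise illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelectProvSketch {

    // One provenance path expression per non-key column of a result tuple:
    // "keyspace/columnfamily/rowkey/columnname"
    static List<String> pathsFor(String keyspace, String columnFamily,
                                 String rowKey, List<String> nonKeyColumns) {
        List<String> paths = new ArrayList<>();
        for (String column : nonKeyColumns) {
            paths.add(keyspace + "/" + columnFamily + "/" + rowKey + "/" + column);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Unique result tuple id r_tid = Q_id + 't' + k, for the k-th tuple of query q6
        int k = 1;
        String rtid = "q6" + "t" + k;
        List<String> paths = pathsFor("NewTwitter_Keyspace", "user_details",
                "Gagan4041", Arrays.asList("location"));
        // Stored in select_provenance as a comma-separated list against r_tid
        System.out.println(rtid + " -> " + String.join(", ", paths));
        // prints: q6t1 -> NewTwitter_Keyspace/user_details/Gagan4041/location
    }
}
```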
A demonstration of Algorithm 4 with illustrative Example Queries 2 and 3 is given below. The query result of the hashtag query is shown in Fig. 7, which shows that the user "mkzangid" used the hashtag "Vikramlander" in two of his tweets, with tweet ids "1181510817377767426" and "1181512471518990342". The provenance path expression under the column "Hashtag_Provenance" shows the derivation process of the value present in the result set, i.e., the value "Vikramlander" in the result set is derived from two different source rows.

The proposed framework supports capturing provenance information for aggregate queries too. The high-level details of the provenance generation algorithm for aggregate queries, i.e., "AggreProv", are given in Algorithm 5. According to this algorithm, an aggregate query (Q_a) with its query id (Q_id) is passed as input, and a comma-separated list of provenance path expressions pv[i] for each result tuple in the query result is obtained as output in a provenance vector (pv). The provenance path expression consists of all the source rows and column names of a column family in a keyspace that contributed to generating the corresponding result tuple. All the steps of this algorithm are very similar to Algorithm 4, i.e., "SelectProv", except for the concept of the provenance vector. Although the provenance path is generated in the same way as in Algorithm 4, iteration is performed over all the source rows that contributed to producing one result row in the result set, to generate pv[i] over all source rows, refer to lines 13 to 21. Further, the provenance of the result tuples and the corresponding aggregate query are stored in the "select_provenance" and "query_table" column families, respectively. A demonstration of Algorithm 5 with illustrative Example Queries 4 and 5 is given below.

Example Query 4: Display the total number of tweets posted by the user "sunilthalia" on "08/10/2019".
Cassandra Query 4: select count(tweet_body) from tweets_user_day where screen_name='sunilthalia' and published_day=8 and published_date>='2019-10-08' and published_date<'2019-10-09' group by screen_name allow filtering;

The above query is an example of an aggregate query that retrieves the total number of tweets posted by a specific user on a given day. This aggregate query executes efficiently on the "tweets_user_day" column family with the composite primary key "screen_name, published_day, published_date".

Example Query 5: Display the total number of tweets posted on each day of October 2019. The corresponding aggregate query executes on the "tweets_day" column family with the composite primary key "published_day, published_date" and counts the total number of tweets posted on each day of October 2019. A partial result of this aggregate query is shown in Fig. 9, where the total number of tweets posted on each day is shown under the column name "SYSTEM.COUNT(TWEET_BODY)", along with the day the tweets were posted and a comma-separated list of provenance path expressions for all the rows that contributed towards the aggregated result under the column name "SYSTEM.COUNT(TWEET_BODY)_PROVENANCE". A minimal sketch of how such a provenance vector is assembled is given below.
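The following is a minimal sketch of the provenance vector idea in AggreProv: all source rows sharing the same aggregate attribute contribute one path each to the vector stored with the aggregated result tuple. The row keys shown are hypothetical stand-ins for the rows counted in Example Query 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AggreProvSketch {
    public static void main(String[] args) {
        // Hypothetical row keys of the tweets_user_day rows counted for
        // screen_name='sunilthalia' on 2019-10-08 (the aggregate attribute group)
        List<String> contributingRowKeys = Arrays.asList(
                "sunilthalia:8:2019-10-08:t1",
                "sunilthalia:8:2019-10-08:t2");

        // Provenance vector pv for one aggregated result tuple
        List<String> pv = new ArrayList<>();
        for (String rowKey : contributingRowKeys) {
            pv.add("NewTwitter_Keyspace/tweets_user_day/" + rowKey + "/tweet_body");
        }

        // Stored in select_provenance as a comma-separated list next to the count
        System.out.println("SYSTEM.COUNT(TWEET_BODY) = " + contributingRowKeys.size());
        System.out.println("provenance: " + String.join(", ", pv));
    }
}
```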
The proposed framework also supports capturing provenance information for historical/standing queries using the data versioning support in ZILKVD. The high-level details of the provenance generation algorithm for standing queries, i.e., "StandProv", are given in Algorithm 6, where a standing query (Q_st) along with its time of execution (t) is passed as input, and a comma-separated list of provenance path expressions (p_i) for all result tuples in the query result is obtained as output. Initially, the query Q_st is parsed to retrieve the following information, viz. keyspace, column family, column names, primary key, etc., refer to lines 1 and 2. Afterwards, a rewritten select query (Q_r) is generated to retrieve the row key (RK), i.e., the values of the primary key columns of the column family, and each result tuple, with the predicate "valid_to". The value of this predicate is set to the time "t" (i.e., given in the input), and the query Q_r is then executed on the database, refer to lines 3 to 7. Now, for every value in the result set of Q_r, its "writetime" (time of existence in the database) is compared with "t". If the "writetime" is less than or equal to "t", then the provenance path expression (p_i) is generated with the corresponding source row and column that contributed towards its generation, and is added to the result tuple, refer to lines 9 to 11. But if the "writetime" is greater than "t", then the corresponding column value and provenance path are retrieved from the "update_provenance" column family, refer to lines 13 to 15. At the end, the column value and provenance path expression retrieved from the "update_provenance" column family are updated in the result set, and finally the updated result set, along with the provenance information, is obtained, refer to lines 16 to 21.

In our proposed framework, all the captured provenance is stored in the following three column families of Apache Cassandra for further analysis, viz. "query_table", "select_provenance", and "update_provenance", see Fig. 10. Provenance information for all the executed queries, with their query ids and times of execution, is stored in the "query_table" column family. Provenance path expressions for all the result tuples of select/aggregate queries are stored in the "select_provenance" column family along with their query statement, result tuple id and time of execution, as shown in Fig. 11. Similarly, the column family "update_provenance" keeps the provenance information about all the update operations along with the following attributes, i.e., query statement, provenance path expression, old value and its write time, new value, column type, and time of update (current date/time), see Fig. 6. A minimal sketch of the writetime test at the heart of StandProv is given below.
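The following is a minimal sketch of the decision StandProv makes for each value of a result tuple: compare the value's writetime against the standing-query time t, and recover the version valid at t from update_provenance when the live value is newer. The timestamps are illustrative.

```java
import java.time.Instant;

public class StandProvSketch {

    // For one column value of a result tuple, decide where its provenance
    // (and, if needed, its value as of time t) must come from
    static String sourceAt(Instant writeTime, Instant t,
                           String keyspace, String columnFamily,
                           String rowKey, String column) {
        if (!writeTime.isAfter(t)) {
            // Value already existed at time t: path points at the source row itself
            return keyspace + "/" + columnFamily + "/" + rowKey + "/" + column;
        }
        // Value was overwritten after t: the version valid at t, together with its
        // provenance path, is looked up in the update_provenance column family
        return "lookup in update_provenance (version valid at t)";
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2019-10-23T09:50:00Z"); // standing-query time
        System.out.println(sourceAt(Instant.parse("2019-10-20T11:00:00Z"), t,
                "NewTwitter_Keyspace", "user_details", "MemeBaaaz", "location"));
        System.out.println(sourceAt(Instant.parse("2019-12-16T05:02:34Z"), t,
                "NewTwitter_Keyspace", "user_details", "MemeBaaaz", "location"));
    }
}
```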
The captured provenance is used in source tracing, update tracking, and querying historical data. The proposed framework also supports querying provenance information for various purposes such as audit trail, update tracking, source tracing, data discovery, etc. Provenance querying on the captured provenance is carried out to achieve the following two objectives: first, how is any result tuple of a select query derived (i.e., querying provenance to learn the source of information), and second, how can all the updates performed on a given data item be tracked (i.e., querying provenance for historical data). The framework provides the following two column families to accomplish the above tasks, viz. "select_provenance" and "update_provenance". Provenance path expressions for all the result tuples of select/aggregate queries, along with their query statement, result tuple id and time of execution, are stored in the "select_provenance" column family. This provenance information is used in provenance querying to learn the source of information, as shown in Fig. 11. Similarly, the column family "update_provenance" stores the provenance information about all the update operations performed, along with the following parameters, i.e., query statement, provenance path expression, old value and its write time, new value, column type, and time of update (current date/time). This provenance information is used in provenance querying for historical data, see Fig. 6. In addition to the above column families, one more column family, i.e., "query_table", is also used in provenance querying to obtain information about all the queries executed up to a particular date, with their times of execution. Illustrative examples of provenance querying are given below.

Example Provenance Query 1: Explain how result tuple q6t1 of query q6 (as shown in Fig. 11) is derived.

The above query is executed on the "select_provenance" column family to retrieve the provenance path expressions for result tuple q6t1 of query q6, along with its time of execution. Here, the provenance path expression of the result tuple is "[NewTwitter_Keyspace/user_details/Gagan4041/location]" and the time of query execution is "2019-12-16 05:02:34.266000+0000". This indicates that the source keyspace name of the required tuple is "NewTwitter_Keyspace", the name of the column family is "user_details", the row key is "Gagan4041", the column name is "location", and the time of query execution is "2019-12-16 05:02:34.266000+0000". Now, the "user_details" column family is queried with this row key, column name and execution time to retrieve all the rows that contributed to producing result tuple t1 of query q6, which justifies the result tuple. However, if the source has been modified after query execution, the original source can still be derived by querying historical data. To support provenance querying for historical data, we designed the following four User-Defined CQL Constructs (UDCs), viz. "all", "instance", "validon now", and "validon date". These constructs are further categorized into the following two categories, viz. T1 ("all", "instance") and T2 ("validon now", "validon date"). A minimal sketch of how an extended query carrying these constructs is reduced to plain CQL is given below.
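The following is a minimal sketch of the first step of provenance querying for historical data: detecting the T1/T2 constructs in an extended query and stripping them to recover the plain CQL statement. The string-based parsing is a simplification; the paper's Query Parser is not described at this level of detail.

```java
public class UdcSketch {
    public static void main(String[] args) {
        String extended = "select all location from user_details "
                + "where screen_name='MemeBaaaz' validon now";

        // T1: "all" (full update history) vs "instance" (a single version)
        String t1 = extended.contains(" all ") ? "all" : "instance";
        // T2: "validon now" (as of now) vs "validon date" (as of a past date)
        String t2 = extended.contains("validon now") ? "validon now" : "validon date";

        // Plain CQL query Q: the extended query with both constructs removed
        String plain = extended.replace(" all ", " ")
                               .replaceAll("\\s*validon .*$", "");

        System.out.println("T1 = " + t1 + ", T2 = " + t2);
        System.out.println("Q  = " + plain);
        // With T1 = "all" and T2 = "validon now", Q runs on both user_details
        // and update_provenance to assemble the full location history (Algorithm 7)
    }
}
```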
The high-level details of the provenance querying algorithm for historical data, i.e., "QueryProv_HistData", are given in Algorithm 7, in which an extended query (Q_E) (i.e., a CQL query with UDCs) is passed as input, and the corresponding result set (RS) of historical data is obtained as output. In the beginning, Q_E is sent to the Query Parser to retrieve all the UDCs (T1 and T2) used in Q_E, along with the CQL query Q (i.e., the CQL query without UDCs) and the parsed result (R_p), refer to lines 1 and 2. In addition to this, other information such as the keyspace name (KS), column family (CF), primary key (PK), and column name (CN) associated with Q_E is also extracted from R_p, refer to line 3. Now, the query Q executes on the related column families to retrieve the required historical data as per the following conditions, mentioned in lines 4 to 16.

- If the UDCs T1 and T2 are "instance" and "validon now" type constructs, respectively, then the query Q executes on the column families mentioned in the issued query statement only, refer to lines 4 and 5.
- If the UDCs T1 and T2 are "instance" and "validon date" type constructs, respectively, then the "write time" of the current value is first fetched and compared with the "validon date". If the "write time" of the current value is less than the "validon date", then the query Q executes on the column families mentioned in the issued query statement only; otherwise, it executes on "update_provenance", refer to lines 6 to 10.
- If the UDCs T1 and T2 are "all" and "validon now" type constructs, respectively, then the query Q executes on both "update_provenance" and the column families mentioned in the issued query statement, to retrieve the complete history of all the updates of a column value, refer to lines 13 to 16.
- Similarly, if the UDCs T1 and T2 are "all" and "validon date" type constructs, respectively, then again the "write time" of the current value is fetched and compared with the "validon date". If the "write time" of the current value is less than the "validon date", then the query Q executes on both "update_provenance" and the column families mentioned in the issued query statement; otherwise, it executes only on "update_provenance", refer to lines 13 to 16.

Demonstrations of Algorithm 7 with illustrative Example Provenance Queries 2, 3, 4 and 5 are given below.

Example Provenance Query 2: Display all the location updates of a specific user named 'MemeBaaaz' till now.
Extended CQL Query Q_E: select all location from user_details where screen_name='MemeBaaaz' validon now;

The above Q_E is parsed first to retrieve all the UDCs used in this extended query, i.e., "all" and "validon now", respectively. Now, the CQL query Q is executed on the "user_details" and "update_provenance" column families to retrieve all the location updates of the given user "MemeBaaaz". The query result of the above provenance query is shown in Table 2.

Example Provenance Query 3: Display all the location updates of a specific user named 'MemeBaaaz' till 23/10/2019 9:50AM.

Example provenance query 4 generates the current location of the user as "Mumbai", which is valid from "2019-12-…".

To evaluate the performance of the proposed framework, all the experiments were performed on a single-node Apache Cassandra cluster on an Intel i7-8700 processor @ 3.20 GHz with 16 GB RAM and a 1 TB disk. Apache Cassandra version 3.11.3 has been used for the experiments. In the proposed framework, big social data are fetched from Twitter's network through live streaming and modelled in Apache Cassandra. This big social data consists of around 2.4 lakh (240,000) Twitter users, 2.1 lakh users' friends, 1.8 lakh users' followers, and related information such as tweet body, tweet id, tweeter's screen name, tweet created date, users' personal information, etc. The proposed key-value pair data model contains a keyspace named "NewTwitter_Keyspace" that consists of the 20 column families used to store this huge volume of social data. On the execution of each query, the provenance information is captured and stored in the following three column families, viz.
"select_provenance", "update_provenance", and "query table" that gradually increases the size of database. Java version 8 has been used as front-end programming language to interact with Cassandra, and Twitter's network. Cassandra Query Language (CQL) is used for querying and to communicate with Apache Cassandra. The performance analysis of proposed framework in terms of provenance capturing overhead and provenance query execution time for different query sets including, select, aggregate, data update and provenance queries are presented in the following subsections. To perform an experimental analysis on provenance capture, several query sets of different type of queries including select, aggregate, and data update queries are executed on ZILKVD architecture. A sample set of select queries are shown in Table 4 . Initially, all the queries are executed 12 times without provenance support and then, the same set of queries are again executed with provenance support. To calculate the average execution time of each query, we dropped the minimum and the maximum execution time and then taken the average of remaining 10 values. The execution performance of all the select queries in terms of average execution times is shown in Fig. 12 . The average execution time of select queries with provenance support is slightly larger than the select queries without provenance support. However, it indicates that the performance overhead of most of the select queries with provenance support is very minimal in respect to the select queries without provenance support, except query The proposed framework also provides the provenance support for aggregate queries with following aggregate functions such as count, max, min, etc. A sample set of aggregate queries are shown in Table 5 . The performance analysis of aggregate queries in terms of average execution time with and without provenance support is also shown in Fig. 13 . It indicates that the framework efficiently captures provenance for aggregate queries such as query Q1, Q2, and Q4. However, more execution time is measured for those queries in which aggregation is performed on a large number of input tuples such as query Q3, and Q5. For example, let's consider the query Q3, i.e., "count the total number of tweets posted in one month". Here, as the aggregation is performed on all the tweets of that month, which requires to capture the provenance for all such rows those are contributed to generate the result set, as a result it adds some measurable execution overhead. Provenance capturing for data update queries is also supported by the proposed framework using ZILKVD architecture. A sample set of data update queries are shown in Table 6 . The performance analysis of update queries in terms of average execution time with and without provenance support is shown in Fig. 14 . It also indicates that the framework efficiently captures provenance for update queries with minimum execution overhead. The captured provenance information for update queries is stored in "update_provenance" col- umn family. The following parameters such as "value_type", "old_value", "new_value", "old_value_writetime", and "provenance_path_expression", etc., are used to capture the provenance information. These parameters are further used for historical data queries, and queries executed in the past at any specific time, i.e., standing/historical queries as explained in Sect. 3.4.3. 
Finally, the overall performance of all types of queries with and without provenance support is shown in Fig. 15 (overall query performance without and with provenance), and the average query execution times for update, select, and aggregate queries with and without provenance support are summarized in Table 7. These results indicate that our proposed framework is very efficient in capturing provenance information for update and select queries, while only a very small overhead is measured in the case of aggregate queries.

The performance analysis of querying the provenance information stored in Apache Cassandra is presented in the following section. A set of different provenance queries is executed for the performance analysis of provenance querying. A sample set of provenance queries is shown in Table 8:

-Q1: Display all the rows that contributed to produce the result tuple of query Q2 of Table 5.
-Q2: Display the row keys of all the rows that contributed to produce result tuple t1 of query Q1 of Table 4.
-Q3: Display all the location updates of a specific user till now.

Initially, all the provenance queries are executed 12 times. To calculate the average execution time of each query, we dropped the minimum and the maximum execution times and took the average of the remaining 10 values, as sketched below. The execution performance of all the queries is shown in Fig. 17, with average execution times reported in milliseconds (ms). According to Fig. 17, the average execution time of the provenance queries varies from 1000 ms to 1800 ms. This shows that the proposed framework supports efficient provenance querying, both for justifying the answers of a query result and for historical data queries, at an acceptable level of precision.
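As a concrete illustration of the timing methodology used throughout this section, the following minimal Java sketch runs a query 12 times, drops the minimum and maximum timings, and averages the remaining 10 runs. The class and method names are ours; "session" is assumed to be an open DataStax driver Session.

    // Trimmed-average timing of a single CQL query, as described above.
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class QueryBenchmark {

        static double avgExecutionMillis(Session session, String cql) {
            List<Long> runs = new ArrayList<>();
            for (int i = 0; i < 12; i++) {
                long start = System.nanoTime();
                session.execute(cql);               // synchronous execution
                runs.add(System.nanoTime() - start);
            }
            Collections.sort(runs);
            long sum = 0;
            for (int i = 1; i < runs.size() - 1; i++) { // skip min and max
                sum += runs.get(i);
            }
            return (sum / 10.0) / 1_000_000.0;      // mean of 10 runs, in ms
        }
    }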
The proposed framework is beneficial in attempting to understand social processes and the behaviour of social media users. Some application scenarios are given below:

-It is a generalized framework that can provide provenance solutions for other social media platforms as well. The proposed algorithms can be used to fetch social data from other social networks such as Facebook and Instagram by using their supported APIs, for instance the Graph API for Facebook's social graph.
-The proposed framework is applicable to applications where progressive user profile maintenance is required. For example, a social media user frequently updates their profile by adding, removing, or changing information. In such cases, our framework maintains all the data updates performed, without losing any information.
-In the current COVID-19 pandemic, health-related data is provided by almost all countries across the world. This data is valuable, but it comes in diverse formats and is scattered across different portals on the Internet. BSDP can be applied to extract and analyse this data for a better understanding of the current situation and to help in the fight against the COVID-19 pandemic.

In this paper, we designed and implemented a Zero-Information Loss Key-Value Pair Database (ZILKVD), on top of which a Big Social Data Provenance (BSDP) Framework has been developed to capture and query provenance for a live-streamed Twitter data set. The proposed framework is capable of capturing fine-grained provenance for various query sets, including select, aggregate, and data update queries with insert, delete, and update operations. It also supports capturing provenance for historical/standing queries using the data version support in ZILKVD. The proposed ZILKVD architecture and KVP data model lead to an adequate design methodology that provides a flexible provenance management system for social data. The proposed framework is efficient in terms of average execution time for capturing and storing provenance for select and data update queries; however, a small execution overhead is measured for some aggregate queries, where the aggregation is performed on a larger number of input tuples. The proposed framework supports efficient provenance querying, both for justifying the answers of a query result and for historical data queries, at an acceptable level of precision. Our provenance capturing and querying algorithms prove to be promising, retrieving precise information with low latency. However, our framework has the following limitations. First, the proposed framework currently provides single-layer provenance support, i.e., it traces only the direct sources that contributed to a query result. Second, the BSDP framework is currently implemented on a single-node Apache Cassandra installation rather than on several distributed nodes in a cluster. In the future, we plan to extend the BSDP framework with multi-layer provenance support, i.e., tracing both direct and indirect sources that contributed to a query result, by using multi-depth provenance querying. We also plan to further extend our framework to a distributed environment where data is redundantly stored across multiple nodes in a cluster.