key: cord-0946798-gjsu5dki
authors: Manickam, Vijayalakshmi; Rajasekaran Indra, Minu
title: Dynamic multi-variant relational scheme-based intelligent ETL framework for healthcare management
date: 2022-03-21
journal: Soft comput
DOI: 10.1007/s00500-022-06938-8
sha: bf979a813e7c18f426e116853091691ae0dec1aa
doc_id: 946798
cord_uid: gjsu5dki

The growth of information technology has enabled organizations to maintain their data in many forms and at many volumes, increasing both the size and the dimensionality of the data being maintained. Organizations store these data on their own data servers or in cloud environments, and the data are used to generate intelligence for a variety of problems. This analysis draws on many kinds of data, which is where big data comes into play. Optimizing the ETL process can greatly help real-time data analysis. ETL optimization can be achieved through several factors, the simplest being increasing the frequency of the process; other routes include the choice of architecture, the programming model, intelligence in transformation, and security. To improve the performance of ETL, this article presents an efficient dynamic multi-variant relational intelligent ETL framework. The distributed approach maintains ontologies and data dictionaries that are dynamically updated by the different threads of the ETL process. The process starts with extraction, which pulls data from different sources and identifies the set of dimensions and their characteristics. The extracted data are verified against the data dictionary, and a relational score is measured for each data source against the existing ones. The method then computes a multi-variant relational similarity (MVRS) value for the data obtained from each source; this is performed by the different threads of the ETL process. According to the MVRS value, the method performs map reduce and merging: it selects a data node and merges the data for storage in the data warehouse. The ETL threads are capable of detecting changes in the data dictionaries and ontologies and iterating the transformation and loading steps accordingly. The method improves ETL performance while keeping time complexity low.

Modern organizations maintain a variety of data about their employees, customers, and business sectors. These data span many dimensions, and because they are maintained by many different organizations, their overall size is huge. This forms big data, which places no restriction on the size or volume of the data. In reality, the growing size and volume of data genuinely challenge organizations. The ultimate aim of maintaining such big data is to generate intelligence from it. For example, when the data relate to the purchase histories (Nagarajan et al. 2015) of an e-commerce system, they can be used to generate intelligence about business growth or the sale of different products. Similarly, when the big data relate to the medical domain, they can be used to generate intelligence and decisions for disease prediction systems, and likewise for a variety of other problems. Organizations keep the data in cloud data servers that can be accessed by their employees and users.
However, the volume of data grows at every moment, which challenges data analysts in generating intelligence from the big data available. Extract, transform, load (ETL) has been identified as a key concept in the modern big data environment, where the data are heterogeneous: it moves source data from transactional databases to an analytical database for analysis. Extraction pulls data from various sources and converts them to a meaningful form; transformation changes the data to a specific form and performs the map reduce process; loading moves the data into the warehouse. More precisely, extraction (E) is responsible for accessing the various sources and extracting the data selected for analysis. To be standardized, the data are then processed by a large number of transformation (T) tasks (cleansing, filtering, converting, merging, splitting, conforming, aggregation, etc.). Finally, the loading (L) tasks are in charge of loading the prepared data into the data warehouse (DW). The challenge behind this process is the varying volume as well as dimensionality of the data (Kartick Chandra Mondal 2020); the author presents the issues of warehousing traditional data with respect to data quality and availability.

For example, consider medical data comprising patient details, diagnostic details, scan images, cardiac graphs, and so on. Different medical organizations maintain the same information in different forms, each keeping additional data of its own and storing the data in its own structures. This increases the complexity of the ETL process and challenges the entire task of knowledge transformation in support of decision support systems. The problem of channel publication over real-time data streams and the join operation is studied in Mehmood (2020), which notes the need for an ETL process. Likewise, Adnan and Akbar (2019) identify and address many limitations of existing ETL techniques with respect to the type and variety of big data. Considering all of this, the ETL problem is approached here with a multi-variant relational model that improves the performance of data learning and intelligence generation. The data maintained by different organizations or data servers carry meta-data in different forms (Simpson and Nagarajan 2021). By treating such meta-data as an ontology, the relations between data sources and data can be exploited for intelligence generation. The proposed multi-variant relational scheme reads the entire data and ontology and measures the multi-variant relational similarity (MVRS) to perform transformation and loading in support of intelligence generation. The proposed model is detailed in the next section.

A number of approaches to ETL over big data exist and have been discussed earlier; this section reviews a set of methods around the problem. A meta-data-based scheme is designed in Wang (2020): a logical model for ETL that works by customizing the data cleaning process, evaluated with different experiments. The application of machine learning in ETL to meet real-time user requirements despite the diversity of data and poor availability is discussed in Kartick Chandra Mondal (2020).
Similarly, considering the diversity of data volume, different approaches are analysed in Manel Souibgui (2019), where the author considers the characteristics of ETL quality and the impact of data volume and quality on the process. In Dakrory et al. (2015), the author analyses data quality by designing a testing model that measures quality while varying the volume fed to ETL. Data volume has a great impact on the ETL process, and Raj and Souza (2020) discuss how data can be loaded into the Pig environment to support various predictions: the Pig model fetches the data, converts them to homogeneous forms, and transforms them for prediction, and it is designed to access the HDFS system. Similarly, to support the data analysts of power centres, an Informatica PowerCenter ETL tool is designed to support predictions (Gupta 2019). To improve query optimization, the author enforces different data quality techniques in the data warehouse in Gupta (2020), and to support the generation of business intelligence, the Talend Open Studio tool has been used to transform heterogeneous data into homogeneous data; the business intelligence is then generated from the integrated, transformed data (Shreemathi 2020).

Many journals are published every year across different research sectors, and the ETL process has been used to analyse the data in business journals. To this end, a visualization tool is designed in Grover and Kar (2017) that performs text mining to identify the theme of a journal and categorizes journals accordingly. Similarly, the concept of warehouse data has been applied to the ETL problem: using social media data together with data in the warehouse, a behaviour analysis model is designed that works on two classes to perform sentiment analysis (Moalla et al. 2017).

Data streaming is of great importance in ETL. An efficient DS-Join scheme is presented in Jeon et al. (2019), which operates under the micro-batch model: the method performs stream processing in a distributed way, with stream processing engines designed to execute such joins. Similarly, to handle unstructured and structured streams, a MongoDB model is discussed in Mehmood and Anees (2019), which performs ETL by joining semi-stream data coming from various sources with disk-based master data according to keys. The problem of distributed streaming in ETL is handled by the Striim model, which supports the rapid development and deployment of streaming applications; its architecture is designed to serve both developers and business users (Pareek et al. 2018). At the other end, to handle the real-time conditions of streaming, an efficient ETL model is designed that performs streaming and joining on demand (Biswas et al. 2019). To handle the cache inequality problem in semi-stream many-to-many equijoins, a semi-stream balanced join (SSBJ) is designed that handles join operations according to the service rate and enforces an equal reduction of memory (Naeem et al. 2019). Data partitioning is an important stage of the ETL process, and to support it, an efficient two-level data partitioning approach for real-time data warehouses, named 2LPA-RTDW, is designed in Hamdi et al. (2015); it handles the unbalanced data in each partition according to user requirements.
Spatial data are useful in many problems, and to integrate them into a single form, a GeoKettle tool is designed that integrates spatial data collected from different hotspots in Indonesia (Astriani and Trisminingsih 2016); the tool provides various interfaces for simple and adjustable modelling. Similarly, to simplify data integration, a rewrite/merge scheme is presented in Cuzzocrea and Furtado (2018), a novel data warehousing model that works in two modes: in the static phase it performs rewriting, and in the dynamic phase it performs merging to support real-time conditions. To handle the problem of synchronization in live streams, a live synchronization model is presented in Ma and Yang (2017) that uses the cache of a physical RDBMS through stream processing; the method constructs a Column Access-aware In-stream Data Cache (CAIDC) to support streaming in relational databases. Similarly, to support real-time streaming, an incremental model named Stream Cube handles real-time data acquisition, processing, and analysis to support decision making (Zheng et al. 2019). To handle smart transportation big data in an IoT environment, a three-phase model supports the management of big data and processes the data with a higher level of service management (Babar and Arif 2019). To improve the performance of data warehousing, a conceptual model is presented that defines different dimensionalities and stereotypes. Similarly, a Data on Demand ETL (DOD-ETL) model is proposed in Machado et al. (2019), which combines on-demand data streams and pipelines in distributed, parallel in-memory caches to support effective partitioning of data. An event-driven architecture is presented in Rieke et al. (2018) to manage spatial data obtained from different capturing devices and to maintain geospatial information according to real-time data (Bouali et al. 2019). All these approaches struggle to achieve high performance in the extract, transform, and load process in high-dimensional data environments.

The proposed dynamic multi-variant relational scheme-based extraction, transformation, and loading framework works according to the data dictionary as well as the ontology. First, the method reads the data locations and the meta-data about the given big data. Using the information available in the ontology, each data source is read and assigned to a different thread. Each thread is responsible for reading the ontology of its data source and the data dictionary; using them, the relational score and similarity values are measured to perform map reduce and merging in an iterative manner. The detailed working of the proposed model is presented in this section. The architecture of the proposed dynamic multi-variant relational model for the ETL process is presented in Fig. 1, and its functional components are detailed below.

The big data given for the ETL problem are read, and the module fetches the set of available data dictionaries and their indexed ontologies. From the available data sources, the method first identifies the set of features or dimensions. According to the identified features, the data points of the given data set are verified; if any incomplete data exist, they are eliminated from the data set.
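A minimal sketch of this extraction and validation step is given below; it is an illustration under assumptions, not the authors' implementation. The `DataSource` shape (a `name` field and a `read()` method returning one dict per data point) and the data-dictionary layout (source name mapped to its registered features) are hypothetical, while the one-thread-per-source layout mirrors the multi-threaded design described in the text.

```python
# Hypothetical sketch of the extraction step: read each source, look up its
# registered features in the data dictionary, and drop incomplete records.
from concurrent.futures import ThreadPoolExecutor

def extract_source(source, data_dictionary):
    """Read one source and keep only records with every registered feature present."""
    features = data_dictionary[source.name]   # dimensions registered for this source
    records = source.read()                   # one dict per data point (assumed shape)
    complete = [r for r in records
                if all(r.get(f) is not None for f in features)]
    return source.name, features, complete

def extract_all(sources, data_dictionary, max_threads=8):
    """One worker thread per source, mirroring the multi-threaded design in the text."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        jobs = [pool.submit(extract_source, s, data_dictionary) for s in sources]
        return [job.result() for job in jobs]
```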
Accordingly, the method spawns multiple threads, each responsible for measuring the different similarity values against the data present in each data node. The threads are capable of monitoring updates to the ontology and the data dictionary, and the process is iterated whenever the dictionaries change. The DMVR-ETL algorithm therefore monitors changes in the data dictionary while receiving the given big data. Using the big data and the data dictionary with its ontology, the method estimates the multi-variant relational similarity measure against the data dictionaries belonging to the different data nodes. According to the MVRS value, the method decides on merging and performs map reduce to support efficient intelligence generation.

The multi-variant relational similarity (MVRS) value is computed by the proposed model from the feature relevancy score (FRS), the relational score (RS), and the type-oriented relevancy (ToR). First, the method estimates the feature relevancy score (FRS) against the various data dictionaries belonging to the different data nodes. A data dictionary or data node contains X data items with Y features, and the input big data Bd contains a subset of y features belonging to the data node; the feature relevancy measure is therefore computed from the appearance and containment of the features. Let the data dictionary of node p be DD_p with feature set Y_f, and let the input big data BD have feature set T_f; the relevancy between the data points according to the feature values can then be measured, for instance as the containment ratio FRS(BD, DD_p) = |T_f ∩ Y_f| / |T_f|. In the same way, the feature relevancy of any given data can be measured against other data sets. Alternatively, relevancy can be measured by counting semantic relations using the available ontology; the semantic relation value is computed from the number of different meanings a feature has in the ontology. Each data set has its own dictionary as well as an ontology representing the relations between its features and data. Similarly, the method computes the type-oriented relevancy (ToR) between the given data and the data present in the data nodes. By considering these relations, the problem of transformation and loading can be improved.

The MVRS algorithm computes the ToR (type-oriented relevancy) of the features, the FRS (feature relevancy score), and the relational score (RS) between the given ontology and the data dictionary sets; from these values, it computes the MVRS value. The proposed DMVRS approach performs the map reduce operation over the feature dimensions rather than the features themselves. Since data sets vary in dimensions and features, the method computes a feature-centric overlap measure (FCOM) and a dimension-centric overlap measure (DCOM); according to these two values, it performs either map reduce alone or both map reduce and merging.
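A minimal sketch of the MVRS computation may clarify how the three scores combine. The paper does not publish closed-form definitions, so the following assumes set-overlap forms: FRS as the containment ratio suggested above, RS as the fraction of input features having at least one semantic relation in the ontology, and ToR as the share of common features whose declared types agree; the equal-weight average in `mvrs` is likewise an assumption.

```python
# Assumed formulations of FRS, RS, ToR, and their combination into MVRS.
def frs(bd_features: set, node_features: set) -> float:
    """Feature relevancy score: containment of the input features in the node's dictionary."""
    if not bd_features:
        return 0.0
    return len(bd_features & node_features) / len(bd_features)

def rs(bd_features: set, ontology: dict) -> float:
    """Relational score: fraction of input features with semantic relations in the ontology."""
    if not bd_features:
        return 0.0
    related = sum(1 for f in bd_features if ontology.get(f))
    return related / len(bd_features)

def tor(bd_types: dict, node_types: dict) -> float:
    """Type-oriented relevancy: share of common features whose declared types match."""
    common = set(bd_types) & set(node_types)
    if not common:
        return 0.0
    return sum(1 for f in common if bd_types[f] == node_types[f]) / len(common)

def mvrs(bd_features, bd_types, node_features, node_types, ontology) -> float:
    """Equal-weight combination of the three scores (an assumption of this sketch)."""
    return (frs(bd_features, node_features)
            + rs(bd_features, ontology)
            + tor(bd_types, node_types)) / 3.0
```

Each ETL thread would evaluate `mvrs` for its source against every data node's dictionary and ontology, then pick the best-matching node for the merge step that follows.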
The value of FCOM is measured from the number of feature values correlated between two data sets, while the value of DCOM is measured from the similarity of the dimensions of the two data sets. Once merging is performed, the corresponding data dictionary and ontology are updated. The proposed merging and map reduce algorithm estimates the feature-centric overlap measure and the dimension-centric overlap measure to perform merging and to update the ontology as well as the data dictionary; the ontology file is updated with the newly identified terms and relations (a minimal sketch of this decision rule is given after the data set description below).

The proposed dynamic multi-variant relational model-based ETL framework has been implemented, and its performance has been evaluated under various constraints: varying numbers of data nodes and varying numbers of ontologies and data dictionaries. This section presents the results obtained through the evaluation of the different approaches. The data used for the performance evaluation of the proposed dynamic multi-variant relational model are presented in Table 1. The Covid-19 data set provided by the WHO (World Health Organization) has been combined with other data sets covering diabetic, cardiac, clinical, and lung data. These data were collected from 20 sources through the web link https://data.world/datasets/covid-19 and fed to the proposed model to generate intelligence. The combined data set contains varying numbers of records per source and totals up to one million records. The data sets come from sources such as the American health departments, which publish the full details of Covid-19 patients and their medical records; the other sources are used similarly in the study.

The performance of the method has been evaluated according to the above details and is presented in this section. The features considered in this evaluation are drawn from the clinical, diabetic, cardiac, Covid-19 (https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset), and lung data sets. The clinical data set includes features obtained from the CBC (complete blood count), such as white and red blood cell counts, haemoglobin, haematocrit, and platelets; it also includes temperature, height, weight, age, sex, and so on. Each data set covers various features, so the analysis considers how the method performs with 100, 200, and 300 features. The analysis is performed on the Covid-19 data set, which is a collection of diabetic, cardiac, clinical, and other data. For example, the diabetic data set includes name, age, sex, weight, BMI, fasting sugar, post-meal sugar, burning sensation, and giddiness; the cardiac data set includes BP, heart rate, QRS value, P-wave value, R-wave value, etc.; the clinical data set includes the results of the CRC and CBR tests, which have hundreds of features; and the Covid-19 features include temperature, pressure, body pain, diarrhoea, vomiting, tiredness, contacts, transport, cough, headache, clinical visits, and so on. The features used for the performance analysis vary between 100 and 300.
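As noted above, here is a minimal sketch of the FCOM/DCOM-driven transformation decision. The concrete definitions and thresholds are assumptions, since the paper gives neither: FCOM is taken as the fraction of shared features whose observed value sets intersect, DCOM as the Jaccard overlap of the two dimension sets, and the 0.5 cut-offs are illustrative placeholders.

```python
# Assumed overlap measures and decision rule for map reduce vs. merge.
def fcom(values_a: dict, values_b: dict) -> float:
    """Feature-centric overlap: shared features whose value sets intersect."""
    common = set(values_a) & set(values_b)
    if not common:
        return 0.0
    matched = sum(1 for f in common if set(values_a[f]) & set(values_b[f]))
    return matched / len(common)

def dcom(dims_a: set, dims_b: set) -> float:
    """Dimension-centric overlap: Jaccard similarity of the two dimension sets."""
    if not dims_a and not dims_b:
        return 0.0
    return len(dims_a & dims_b) / len(dims_a | dims_b)

def transform_decision(values_a, values_b, dims_a, dims_b,
                       f_threshold=0.5, d_threshold=0.5):
    """Decide between map reduce alone and map reduce plus merging (assumed rule)."""
    f, d = fcom(values_a, values_b), dcom(dims_a, dims_b)
    if f >= f_threshold and d >= d_threshold:
        return "map_reduce_and_merge"  # merge, then update dictionary and ontology
    return "map_reduce_only"
```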
The ETL performance of the different approaches is presented in Table 2, where the proposed method achieves higher performance than the other approaches: 87%, 92%, and 97% in the 100-, 200-, and 300-feature cases, respectively. The ETL performance of the methods is compared in Fig. 2; the proposed DMVR-ETL produces the highest extraction, transformation, and loading performance in all the test cases considered.

The overlap ratio produced by the different methods for varying numbers of features is presented in Table 3; the proposed DMVR-ETL approach produces a lower overlap ratio than the other approaches. The overlap ratios are also compared in Fig. 3, where DMVR-ETL produces the least overlap in all test cases: by considering the multi-variant relational value between different data and features, the DMVR-ETL algorithm reduces the overlap compared to the 2LPA-RTDW, STRIIM, and DOD-ETL approaches.

The time complexity of the different methods in the ETL process is presented in Table 4 and Fig. 4; the proposed DMVR-ETL approach produces lower time complexity than the other methods in all cases.

In this paper, an efficient dynamic multi-variant relational similarity-based ETL framework is presented. The proposed approach reads the data sets and the dictionary with its ontology to perform the ETL process. The method first identifies the features and dimensions of the given data set and estimates the multi-variant relational similarity (MVRS) against the various data dictionaries and data nodes, based on the feature relevancy score, the relational score, and the type-oriented relevancy score. Based on these values, the method uses the MVRS value to perform merging and map reduce. The proposed method improves ETL performance across the different test cases.

Funding: We did not receive financial support from any organization for the submitted work.
Data availability: Not applicable.
Conflict of interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
Proposed techniques to optimize the DW and ETL query for enhancing data warehouse efficiency
A complete reference for Informatica PowerCenter ETL tool
An analytical study of information extraction from unstructured and multidimensional big data
Extraction transformation and loading (ETL) module for hotspot spatial data warehouse using GeoKettle
Real-time data processing scheme using big data analytics in Internet of Things based smart transportation environment
Efficient incremental loading in ETL processing for real-time data integration
Real-time data warehouse loading methodology and architecture: a healthcare use case
A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration
Automated ETL testing on the data quality of a data warehouse
Big data analytics: a review on theoretical contributions and tools used in literature
2LPA-RTDW: a two-level data partitioning approach for real-time data warehouse
Distributed join processing between streaming and stored big data under the micro-batch model
Role of machine learning in ETL automation
Column access-aware in-stream data cache with stream processing framework
DOD-ETL: distributed on-demand ETL for near real-time business intelligence
Data quality in ETL process: a preliminary study
Performance analysis of not only SQL semi-stream join using MongoDB for real-time data warehousing
Challenges and solutions for processing real-time big data stream: a systematic literature review
Data warehouse design approaches from social media: review and comparison
A memory-optimal many-to-many semi-stream join
CIMTEL-mining algorithm for big data in telecommunication
Real-time ETL in Striim
Implementation of ETL process using Pig and Hadoop
Geospatial IoT: the need for event-driven architectures in contemporary spatial data infrastructures
Data integration in ETL using Talend
An edge-based trustworthy environment establishment for internet of things: an approach for smart cities
Design of ETL tool for structured data based on data warehouse
Real-time intelligent big data processing: technology platform and applications