key: cord-0038931-fryrs7vx
authors: Brous, Paul; Janssen, Marijn; Krans, Rutger
title: Data Governance as Success Factor for Data Science
date: 2020-03-06
journal: Responsible Design, Implementation and Use of Information and Communication Technology
DOI: 10.1007/978-3-030-44999-5_36
sha: 543300ae579c6d2b470c51dd8395e7d897cce3ea
doc_id: 38931
cord_uid: fryrs7vx

More and more, asset management organizations are introducing data science initiatives to support predictive maintenance and anomaly detection. Asset management organizations are by nature data intensive to manage their assets like bridges, dykes, railways and roads. For this, they often implement data lakes using a variety of architectures and technologies to store big data and facilitate data science initiatives. However, the decision-outcomes of data science models are often highly reliant on the quality of the data. The data in the data lake therefore has to be of sufficient quality to develop trust by decision-makers. Not surprisingly, organizations are increasingly adopting data governance as a means to ensure that the quality of data entering the data lake is and remains of sufficient quality, and to ensure the organization remains legally compliant. The objective of the case study is to understand the role of data governance as success factor for data science. For this, a case study regarding the governance of data in a data lake in the asset management domain is analyzed to test three propositions contributing to the success of using data science. The results show that unambiguous ownership of the data, monitoring the quality of the data entering the data lake, and a controlled overview of standard and specific compliance requirements are important factors for maintaining data quality and compliance and building trust in data science products.

More and more, asset management organizations are introducing data science initiatives to support the digital transformation of their business processes [1]. However, in order for data science to be successful, it is vital that asset management organizations are able to trust the integrity of the digital environment [2, 3]. Managers have, in the past, found it difficult to trust data science products as, for example, the data is often found to be lacking the required quality [4] [5] [6] [7]. Furthermore, as suggested by Wallis et al. [7], data collections are only as valuable as the data they contain, and users need to be able to trust the data based on the integrity of the data systems and the intrinsic quality of the data. Managers need to be able to trust data science products before they are confident enough to use these products to support their business processes to make crucial decisions [6]. Examples of these decisions in the asset management domain are maintaining dykes or replacing a bridge. Decisions in these scenarios have long term implications and wrong decisions can be expensive and risky. A lack of trust in data science projects can often be attributed to the lack of data quality, and the success of data science projects is often highly reliant on the quality of the data being used [8] [9] [10]. There is no single factor defining the successful outcomes of a data science project [11, 12], but recently data governance has gained traction with many organizations as being important for ensuring quality and compliance in data science outcomes [11, 13]. However, it remains unclear how data governance contributes to the success of data science outcomes, leading to calls for more research in this area [11, 14, 15].

Data Governance can be defined as "the exercise of authority and control (planning, monitoring and enforcement) over the management of data assets" [16] (p. 67), and can provide direct and indirect benefits [17] . For example, Brous et al. [14] showed that adoption of data governance can improve operational efficiency, increase revenue, reduce risk (for example with regards to privacy violations), reduce costs, improve perception of how information initiatives perform, improve acceptance of spending on information management projects, and improve trust in information products.

The main objective of the paper is to understand the role of data governance as a factor for successful data science outcomes. Our main research question therefore asks how does data governance contribute to more successful data science outcomes? This paper analyses a case study in the asset management domain with specific regard for the role of data governance as success factor for data science outcomes. The case under study is managed by Rijkswaterstaat in the Netherlands. Rijkswaterstaat is part of the Dutch Ministry of Infrastructure and Water Management and is responsible for the design, construction, management and maintenance of the main infrastructure facilities in the Netherlands. The paper reads as follows. Section 2 presents the background of literature regarding the relationship between data governance, trust and the digital environment. In Sect. 3 the methodology of the research is described. Section 4 describes the findings of the case study. Section 5 discusses the findings of the case study and Sect. 6 presents the conclusions.

Although more attention has been paid to data governance in the literature in recent years, there have been several calls within the scientific community for more systematic research into data governance and its impact on the business capabilities of organizations [18] [19] [20] . Little evidence has been produced so far indicating what actually has to be organized by data governance and what data governance processes may entail [20, 21] , and many organizations find data governance difficult to implement [22, 23] . There appears to be no "one-size-fits-all" approach to data governance [24] and the nuances attached to various domains and organizational types have not yet been extensively described [25, 26] . Furthermore, evidence is scant as to the role data governance plays in ensuring the successful outcomes of data science initiatives [18, 19] .

Recent years have witnessed more and more asset management organizations adopting data science initiatives in order to support the digital transformation of their business processes [27, 28], and Van der Aalst [29] goes so far as to suggest that organizations without a data science capability may not survive. According to Provost and Fawcett [1] (p. 52), data science is "a set of fundamental principles that support and guide the principled extraction of information and knowledge from data". From this perspective, data science encompasses a broad range of knowledge and capabilities such as data-mining and machine learning, which are designed to extract knowledge from data and are important for creating value and moderating risk in data science initiatives. As such, data governance can help organizations make use of data as a competitive asset [21, 23]. Data governance aims at maximizing the value of data assets in enterprises [1, 37]. For example, capturing electric- and gas-usage data every few minutes benefits the consumer as well as the provider of energy. With active governance of big data, isolation of faults and quick fixing of issues can prevent systemic energy grid collapse [38].

Data science can improve asset management decision-making, which is needed to facilitate more efficient and secure asset management operations, as well as to meet the need for better situational awareness about network disturbances [10, 27]. Data science initiatives such as predictive maintenance modelling generally require big data [10, 30, 31]. Asset management organizations often choose to implement data lakes using a variety of architectures and technologies to store big data and to make this data available for use. A data lake is "a central repository system for storage, processing, and analysis of raw data, in which the data is kept in its original format and is processed to be queried only when needed" [32] (p. 456). Data lakes differ from traditional data warehouses, which often have their own native formats and structures, in that data is stored in its original, raw format [33, 34]. Often, the data processing systems which are required to allow the data to be ingested without compromising the data structure are also included in the definition [32, 34]. The data in the data lake is generally immediately accessible, allowing users to utilize dynamic analytical applications [34, 35]. This immediate accessibility, as well as the retaining of data in its original format, presents a number of challenges regarding management of the data lake, including data quality management, data security and access control [33, 36], as well as in maintaining compliance with regards to privacy [21, 36]. As such, data governance has increasingly gained popularity as a means of ensuring data quality and maintaining compliance.

Managing data quality is considered by many researchers to be an important reason for adopting data governance (e.g. [24, 37, 39]). However, big data can provide asset management organizations with complex challenges in the management of data quality. According to Saha and Srivastava [40], the massive volumes, high velocity and large variety of automatically generated data can lead to serious data quality management issues which can be difficult to manage in a timely manner [41]. For example, IoT sensors calibrated to measure the salinity of water may, over time, begin to provide incorrect values due to biofouling. Data science information products often rely on near real-time data to provide timely alerts, and, as such, problems may arise if these data quality issues are not detected and corrected in a timely manner.
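The sensor example can be made concrete with a small sketch. The code below is illustrative only and is not taken from the case study: it assumes that a trusted reference value is available (for instance from a periodic manual sample), and the function name `drift_alerts`, the window size and the tolerance are hypothetical choices. It flags the points at which the median of the most recent readings has moved away from the reference by more than the tolerance, which is one simple way in which gradual drift might be detected in near real time.

```python
from statistics import median


def drift_alerts(readings, reference, window=3, tolerance=0.5):
    """Return the indices at which the rolling median of the last
    `window` readings deviates from `reference` by more than `tolerance`.

    All parameters are illustrative assumptions, not values from the paper.
    """
    alerts = []
    for i in range(window - 1, len(readings)):
        recent = readings[i - window + 1 : i + 1]
        if abs(median(recent) - reference) > tolerance:
            alerts.append(i)
    return alerts


# Hypothetical salinity readings that slowly drift upwards.
SALINITY = [35.0, 35.1, 34.9, 35.0, 35.2, 35.6, 35.9, 36.2, 36.5, 36.8]

if __name__ == "__main__":
    print(drift_alerts(SALINITY, reference=35.0))
```

A rolling median is used rather than a single reading so that an isolated outlier does not trigger an alert; only a sustained shift does.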

As well as establishing data management processes which manage data quality, data governance should also ensure that the organization's data management processes are compliant with laws, directives, policies and procedures [42] . For example, Panian [43] states that establishing and enforcing policies and processes around the management of data should be the foundation of effective data governance practice as using big data for data science often raises ethical concerns. Automatic data collection may cause privacy infringements [44, 45] such as cameras used to track traffic on highways which often record personally identifiable data such as number plates or faces of persons in the vehicles. Data governance processes should ensure that these personally identifiable features are removed before data is shared or used for purposes other than legally allowed. Data governance should therefore establish what specific data privacy policies are appropriate [39] and applicable across the organization [38] . For example, Tallon [46] states that organizations have a social and legal responsibility to safeguard personal data, whilst Power and Trope [47] suggest that risks and threats to data and privacy require diligent attention from organizations.
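As an illustration of how personally identifiable features might be removed before data is shared, the sketch below applies an allow-list to a traffic camera record. It is a minimal example under stated assumptions: the field names (`camera_id`, `number_plate`, `face_image` and so on) and the function `shareable_view` are hypothetical and do not describe any system discussed in this paper.

```python
# Fields assumed to be approved for sharing (hypothetical names).
SHAREABLE_FIELDS = frozenset({"camera_id", "timestamp", "vehicle_class", "speed_kmh"})


def shareable_view(record, allowed=SHAREABLE_FIELDS):
    """Return a copy of `record` containing only explicitly approved fields.

    The input record is left unchanged.
    """
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical raw observation as it might arrive from a roadside camera.
RAW_OBSERVATION = {
    "camera_id": "cam-017",
    "timestamp": "2019-06-01T08:15:00",
    "vehicle_class": "truck",
    "speed_kmh": 82,
    "number_plate": "XX-123-X",
    "face_image": b"\x00\x01",
}

if __name__ == "__main__":
    print(shareable_view(RAW_OBSERVATION))
```

An allow-list is chosen here over a deny-list so that any new field added to the source is withheld by default until a data owner has approved it for sharing.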

In summary, asset management organizations often choose to implement data science initiatives such as predictive maintenance and anomaly detection, using methods such as data-mining and machine learning, in order to support the digital transformation of their business processes. Many modern data science methods require big data which is often stored and made available through data lakes. However, asset management organizations are increasingly being faced with challenges which impact the success of data science outcomes, often related to: 1. a lack of trust in the quality of data [40, 41], 2. whether or not the data is being used in an ethical way [46], and 3. whether or not the management and use of the data is compliant with relevant legislation and internal policies [47]. In order to tackle these challenges, data governance assigns responsibilities for decision-making [24], defines processes for monitoring and managing data quality [41], and defines policies for monitoring and maintaining compliance with relevant legislation [47].

The propositions of the research are based on the results of the background literature review as well as on existing theory regarding the principles of data governance in asset management organizations and the reasons why asset management organizations choose to implement data governance [13, 14, 48] . The propositions of the research therefore read as follows:

1. Defining clear roles and responsibilities for data management will result in easier generation of business value from data science efforts.
2. Monitoring and managing data quality will result in more useful outcomes from data science efforts.
3. Compliance monitoring and control is a required condition for data science.

As discussed above, the literature shows that many organizations have implemented data governance in an attempt to improve trust in data science efforts through the improved management of data quality and compliance to relevant legislation.

This paper describes a single case study using a multi-method approach to investigate the role of data governance as success factor for data science. Case study is a widely adopted method for examining contemporary phenomena such as the adoption of data governance [49, 50]. In this research we analyze a single case, following the design of an explanatory case study research proposed by Yin [51], including the research question, the propositions for research, the unit of analysis, and the logic linking the data to the propositions. Single case study was selected as being appropriate for this research as there is a need to investigate data governance as success factor for data science in greater detail. In this regard, single case studies may be more appropriate than multiple case studies, as a single case study provides the opportunity to have a deeper understanding of data governance in a specific context [51, 52], in this case, data science efforts in the asset management domain. As suggested by Eisenhardt [50], the research was contextualized by a review of background literature, identifying the generally accepted roles of data governance in a data science context. The literature background reveals that data science initiatives often face a number of challenges, and not all efforts lead to successful outcomes [15, 48, 53]. Facing these challenges has led many organizations to adopt data governance as a means of improving the outcomes of data science efforts [13]. However, data governance remains a poorly understood concept [22, 36] and its contribution to the success of data science has not been widely researched [36]. As discussed above, our main research question therefore asks how does data governance contribute to more successful data science outcomes?

Following Ketokivi and Choi [54] , deduction type reasoning augmented by contextual considerations provided the basic logic for the propositions to be tested in a particular context, namely data science in an asset management domain. The data analysis in this research utilizes "within case analysis" [55] . Within case analysis helped us to examine the impact of data governance on the success of data science in a single context. In this case, the unit of analysis was a single data science project in the asset management domain. The case selected was managed and implemented by Rijkswaterstaat, often abbreviated to RWS and referred to as such in this paper. RWS is the Directorate-General for Public Works and Water Management and an operational agency of the Ministry of Infrastructure and Water Management of the Netherlands. RWS is charged with the management and maintenance of the major highways, waterways and shipping lanes in the Netherlands. In order to prepare the organization for the case study research project, RWS was provided with information material outlining the objectives of the project.

Following the suggestions of Yin [51] , the case study was conducted using a multimethod approach and multiple data sources were used. Methods used are document analysis and face-to-face interviews. The interviews were conducted during 2019 taking the form of one-on-one, face-to-face interviews. The interviewees were mainly selected from RWS staff members directly involved in the data science project in various roles, but also included other staff members involved in the governance and management of the data and the monitoring of the data in order to ensure saturation. Secondary data sources included relevant internal documentation, including project reports, data governance workshop reports, and data and information technology strategy documents. Company websites which included relevant data governance information and reports on the data science case were also included. Triangulation of aspects of data governance which contribute to the successful outcome of the data science case was made by listing aspects of data governance found in internal documentation and testing these in the one-on-one interviews. In the interviews the interviewees were asked as to the contribution of these aspects of data governance towards the successful outcome of the project. In the interviews the interviewees were also asked to name other aspects of data governance that may have had a significant contribution to the successful outcome of the data science project but which may have been overlooked.

RWS is tasked with the management and maintenance of the national public infrastructure including the construction and maintenance of shipping lanes, major waterways (including flood prevention) and national roads and highways. RWS has a spend of approximately €200 million per annum on asphalt maintenance, with operational parameters traditionally focused on traffic safety. In the past this has led to increasing overspend due either to premature maintenance, or to expensive emergency repairs. The prediction of asphalt lifetime based on traditional parameters has been shown to be correct one third of the time. RWS is seeking to reduce these costs by extending the lifespan of asphalt where possible whilst reducing the number of emergency repairs made by adopting data science techniques for the purpose of predictive, "just-in-time" maintenance. Using available big data in a more detailed manner, such as raveling data collected by a Laser Crack Measurement System combined with Weigh-in-Motion data, has doubled the prediction consistency. According to RWS officials, improving the accuracy of asphalt lifetime prediction has enabled better maintenance planning which has significantly reduced premature maintenance, improving road safety and cost savings, and reducing the environmental impact due to reduced traffic congestion and a reduction in CO2 emissions. The data science model uses data related to traditional inspections, historical data generated during the laying of the asphalt, road attribute data and planning data, as well as automatically generated, streaming data such as weather data, traffic data, and IoT sensor data. The current model takes about 400 parameters into consideration. According to an RWS official, "this number will only grow, as the (project partners) continue to supply new data". According to RWS, the ultimate goal is a model that can accurately predict the lifespan of a highway.

With regards to defining roles and responsibilities, RWS has asked the data managers of each of the datasets used in the data science project to each appoint an executive sponsor or data owner. The data owner is a business sponsor. Once ownership is established, the current and desired future situations are assessed in terms of production and delivery. A roadmap is then established which is translated into concrete actions, and a delivery agreement is reached. RWS also uses "open" data from external sources. Due to its many open data partnerships, RWS has implemented a policy of providing knowledge, tools and a government-wide contact network in which best practices are shared with other government organizations. These best practices refer to organization of data management, data exchange with third parties, data processing methods and individual training. According to staff members, RWS has implemented data governance for their big data in order to remain "future-proof, agile and to improve digital interaction with citizens and partners". According to an RWS executive manager, "RWS wants to be careful, open and transparent about the way in which it handles big and open data and how it organizes itself". Furthermore, RWS has introduced the policy of assessing and publishing the monetary cost of data assets in order to raise awareness of the importance of data quality management. This means that every RWS process and every RWS organizational unit is encouraged to be aware of its data needs and the incurred costs.

With regards to data quality, RWS has implemented a data quality framework to improve their control of data quality. RWS staff believe that "the return (of the investment) stands or falls with the quality of data and information". As such, according to RWS staff, the underlying quality of the data and information is of great importance to work in an information-driven way. RWS staff members have suggested that, in the past, a significant amount of production time has often been lost due to inadequate data quality. The RWS data quality management process follows an eight-step process which begins by identifying: 1. the data to be produced, 2. the value of the data for the RWS primary processes, and 3. a data owner. RWS has developed an Automatic Auditing Tool (AAT) in combination with a Manual Auditing Tool (MAT) to monitor the quality of the data as a product in order to further improve its grip on data quality. According to RWS staff, the AAT and the MAT ensured that quality measurements were mutually comparable, provided tools for more focused management, and caused a change in the conscious use of data as a strategic asset. Alongside the AAT, the MAT is considered important as it is not yet possible to automate the monitoring of all data quality dimensions. Data quality measuring is centralized at RWS, the goal being to ensure a standardized working method. However, RWS maintained the policy that every data owner is responsible for improvements to the data management process and the data itself. The RWS data quality framework was based on fitness for use and data quality measurement was maintained according to 8 main dimensions and 47 subdimensions.

With regards to compliance, RWS has translated their data policies and principles into a data agenda in which the opportunities, risks and dilemmas of their data policies and ambitions are identified in advance and are made measurable and practicable. Terms and definitions have been coordinated with the Dutch legal framework related to the environment to ensure compliance. Responsibilities relating to compliance with privacy laws are centralized and RWS has assigned privacy officers to this role. The CIO has the final responsibility for ensuring that privacy and security are managed and maintained; however, business data owners are held accountable for ensuring compliance with dataset-specific policy and regulations.

Case study methodology was used in this research to identify the role that data governance plays as success factor for data science. The choice for an in-depth, single case study was based on the contemporary nature of both data science and data governance and the need to study data governance as success factor for data science in greater depth. The study was conducted as a single case study and the results should be regarded in this light. Single case study has been criticized in the past due to the difficulty of providing a generalizing conclusion [51, 56]. In order to overcome this, the data collection made use of multiple sources including reports, presentations and face-to-face interviews. More research is recommended in this area to test the applicability of the propositions in other domains and organizational types. The study was conducted in the asset management domain as asset management organizations by nature are often data rich due to the need to monitor the state of the infrastructure assets. This may limit the applicability of the study for domains which are less data intensive; however, the essence of generating value from data is likely to be the same in other domains.

Proposition 1 proposes that data science is likely to generate more business value if responsibilities for data management are clearly defined. RWS has many open data partners, as well as a large variety of sources from which the data is collected. As a result, RWS has experienced difficulties in managing responsibilities for data quality and data management processes. RWS has therefore assumed a leadership role in maintaining a government-wide contact network in which knowledge, tooling and best practices with regards to data management and data sharing are shared with other government organizations. Internally, RWS has assigned business sponsors to assume ownership of datasets so that roles and responsibilities of data management are clearly defined. In order to ensure that sufficient resources are made available for data quality management, RWS has also defined a "price" for each dataset so that business owners are aware of the value of each dataset. This allows the organization to treat the data as a business asset, promoting the need to maintain the expected quality of each dataset.

Proposition 2 proposes that data science is more likely to result in useful outcomes if data quality is monitored and controlled. RWS actively monitors their data inputs by means of an "automatic audit tool". RWS has assembled a library of business rules which form the input for the calculation of the data quality. The results of the calculations are displayed in the form of a dashboard which indicates whether the calculated values fall within acceptable limits or not. The acceptable limits are described in the RWS data quality framework which has standardized the calculation and description of data quality throughout RWS. The results of the data quality monitor are used to define which interventions need to be taken in order to achieve the desired levels of quality and also to monitor the effects of the interventions on the data quality. Traditionally, data quality projects at RWS were based on "hearsay" from staff whereby the general feeling was that the quality was below requirements. The AAT has allowed RWS to be more data driven with regards to their data management processes. According to RWS staff, the active monitoring of data quality has led to "identification of gaps in data governance, harmonization of processes across organizational departments, increased awareness and cost savings".
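The monitoring approach described above (a library of business rules, a calculated quality value per rule, and a comparison against acceptable limits) can be sketched in a few lines. The example below is not the RWS AAT, whose implementation is not described in this paper; the rule names, the record fields (`laid_year`, `raveling_score`) and the limits are hypothetical and serve only to show the general shape of rule-based data quality monitoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BusinessRule:
    name: str
    dimension: str                  # e.g. "completeness" or "validity"
    check: Callable[[dict], bool]   # True when a record satisfies the rule
    min_pass_rate: float            # acceptable limit (assumed value)


def evaluate(records, rules):
    """Compute the pass rate per rule and compare it with its limit."""
    report = {}
    for rule in rules:
        passed = sum(1 for record in records if rule.check(record))
        rate = passed / len(records) if records else 0.0
        report[rule.name] = {
            "dimension": rule.dimension,
            "pass_rate": rate,
            "within_limits": rate >= rule.min_pass_rate,
        }
    return report


# Hypothetical rule library and sample of road segment records.
RULES = [
    BusinessRule("laid_year_present", "completeness",
                 lambda r: r.get("laid_year") is not None, 0.95),
    BusinessRule("raveling_in_range", "validity",
                 lambda r: r.get("raveling_score") is not None
                 and 0 <= r["raveling_score"] <= 100, 0.75),
]

SAMPLE = [
    {"segment_id": "A1-001", "laid_year": 2009, "raveling_score": 12.5},
    {"segment_id": "A1-002", "laid_year": None, "raveling_score": 40.0},
    {"segment_id": "A2-001", "laid_year": 2015, "raveling_score": 3.0},
    {"segment_id": "A2-002", "laid_year": 2012, "raveling_score": 27.0},
]

if __name__ == "__main__":
    for name, result in evaluate(SAMPLE, RULES).items():
        print(name, result)
```

Keeping the rules as data, separate from the evaluation logic, means that the same calculation can be applied to every dataset, which is in the spirit of a standardized calculation and description of data quality; the resulting report could feed a dashboard that shows which rules fall outside their limits.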

Proposition 3 proposes that compliance with relevant legislation is a necessary and required condition for data science. RWS has had a central, IT-centered approach to data privacy to ensure that legal requirements and guidelines regarding the European General Data Protection Regulation (GDPR) are standardized and consistent throughout the organization. RWS has published a transparent list of systems in which personal data is collected, and has published detailed instructions as to how personal data may be viewed and, where necessary, deleted. RWS has appointed privacy and compliance officers to assume this responsibility and has appointed the CIO as the responsible executive sponsor. The monitoring of other compliance related activities is done using the AAT or the MAT. Responsibility for the actions flowing from the results of the AAT or the MAT lies with the data managers and ownership lies with the data sponsor. This hybrid approach allows RWS to standardize compliance processes where possible, whilst also being able to tailor customized solutions for particular data issues. Currently the feasibility of a nationwide data platform for asphalt pavement data is being explored in which easy data accessibility, authorization, storage, scalability, architecture, plateau planning, solution directions and cost estimations are addressed.

In this research paper we analyzed a case study regarding the governance of data in a data lake in the asset management domain to identify factors contributing to the success of using data science. The objective of the case study is to understand the role of data governance as success factor for data science. The case under study is a data science project which predicts the maintenance requirements of asphalt on national highways over time. Three propositions were defined on the basis of existing theory on data governance, namely: 1. defining clear roles and responsibilities for data management will result in easier generation of business value from data science efforts, 2. monitoring and managing data quality will result in more useful outcomes from data science efforts, and 3. compliance monitoring and control is a required condition for data science. These propositions were derived from the literature and confirmed in the case study, suggesting that data governance should be regarded as an important success factor for data science outcomes. The results show that clearly defined ownership of the data, monitoring the quality of the data entering the data lake, and a controlled overview of compliance requirements are important factors for successful data science outcomes. The results also show that efficient management of compliance may be performed by developing centrally managed, standardized solutions for privacy and security requirements. However, system-specific compliance requirements need to be developed by data managers and these requirements should be owned by a business sponsor who assumes responsibility for these requirements. As such, the results show that data governance is an important success factor for data science outcomes as it ensures that data quality and compliance are effectively managed.

Data science and its relationship to Big Data and data-driven decision making

Authenticity in a Digital Environment. Council on Library and Information Resources

Extreme trust: the new competitive advantage

The need for a data quality framework in asset management

Can we trust Big Data? Applying philosophy of science to software

Trust in data science: collaboration, translation, and accountability in corporate data science projects

Know thy sensor: trust, data quality, and data integrity in scientific digital libraries

Fault detection and explanation through Big Data analysis on sensor streams

Predictive maintenance of complex system with multi-level reliability structure

The role of Big Data in improving power system operation and protection

Big Data team process methodologies: a literature review and the identification of key factors for a project's success

An investigation into the implementation factors affecting the success of Big Data systems

Governing asset management data infrastructures

Coordinating decision-making in data management activities: a systematic review of data governance principles

Data reusers' trust development

DAMA-DMBOK: Data Management Body of Knowledge

Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program

Using the Bolman and Deal's four frames in developing a data governance strategy

The rise of 'big data' on cloud computing: review and open research issues

A morphology of the organisation of data governance

Big Data governance

Big Data has unique needs for information governance and data quality

Government data does not mean data governance: Lessons learned from a public sector application audit

A contingency approach to data governance

An integrated data analytics process to optimize data governance of non-profit organization

Data governance: a conceptual framework, structured review, and research agenda

Data science, predictive analytics, and big data: a revolution that will transform supply chain design and management

Digital transformation: opportunities to create new business models

Data Science in Action

Business intelligence and analytics: from Big Data to big impact

How 'big data' can make big impact: findings from a systematic review and a longitudinal case study

A mapping study about data lakes: an improved definition and possible architectures

The next information architecture evolution: the data lake wave

Big Data, fast data and data lake concepts

Big Data in cloud computing: a resource management perspective

Data science data governance

Data governance

Governing big data: principles and practices

Designing data governance

Data quality: the other face of big data

Data quality for data science, predictive analytics, and Big Data in supply chain management: an introduction to the problem and suggestions for research and applications

Data governance for SoS

Some practical experiences in data governance

Perceived internet privacy concerns on social networks in Europe

Governance of big data collaborations: how to balance regulatory compliance and disruptive innovation

Corporate governance of big data: perspectives on value, risk, and cost

The 2006 survey of legal developments in data management, privacy, and information security: the continuing evolution of data governance

Factors influencing adoption of IoT for data-driven decision making in asset management organizations

Investigating the research approaches for examining technology adoption issues

Building theories from case study research. Acad

Case Study Research: Design and Methods. Sage, Thousand Oaks

Single case studies vs. multiple case studies: a comparative study

A trust model for data sharing in smart cities

Renaissance of case research as a scientific method

Qualitative Data Analysis: An Expanded Sourcebook

Case study as a research method