Title Page

Article: Scalable Decision Support for Digital Preservation

*This version is a voluntary deposit by the author. The publisher's version is available at: http://dx.doi.org/10.1108/OCLC-06-2014-0025

Author Details

Author 1
Name: Christoph Becker
Department: Faculty of Information
University/Institution: University of Toronto
Town/City: Toronto
Country: Canada

Author 2
Name: Luis Faria
University/Institution: KEEP Solutions
Town/City: Braga
Country: Portugal

Author 3
Name: Kresimir Duretec
Department: Information and Software Engineering Group
University/Institution: Vienna University of Technology
Town/City: Vienna
Country: Austria

Acknowledgments: Part of this work was supported by the European Union in the 7th Framework Program, IST, through the SCAPE project, Contract 270137.

Abstract:

Purpose – Preservation environments such as repositories need scalable and context-aware preservation planning and monitoring capabilities to ensure continued accessibility of content over time. This article identifies a number of gaps in the systems and mechanisms currently available and presents a new, innovative architecture for scalable decision making and control in such environments.

Design/methodology/approach – The paper illustrates the state of the art in preservation planning and monitoring, highlights the key challenges faced by repositories in providing scalable decision making and monitoring facilities, and presents the contributions of the SCAPE Planning and Watch suite to provide such capabilities.

Findings – The presented architecture makes preservation planning and monitoring context-aware through a semantic representation of key organizational factors, and integrates this with a business intelligence system that collects and reasons upon preservation-relevant information.

Research limitations/implications – The architecture has been implemented in the SCAPE Planning and Watch suite. Integration with repositories and external information sources provides powerful preservation capabilities that can be freely combined with virtually any repository.

Practical implications – The open nature of the software suite enables stewardship organizations to integrate the components with their own preservation environments and to contribute to the ongoing improvement of the systems.

Originality/value – The paper reports on innovative research and development to provide preservation capabilities. The results enable proactive, continuous preservation management through a context-aware planning and monitoring cycle integrated with operational systems.

Keywords: Repositories, preservation planning, preservation watch, monitoring, scalability, digital libraries

Scalable Decision Support for Digital Preservation
Christoph Becker, Kresimir Duretec & Luis Faria

1. Introduction

Digital preservation aims at keeping digital information authentic, understandable, and usable over long periods of time and across ever-changing social and technical environments (Rothenberg 1995; Garret & Waters 1996; Hedstrom 1998). The challenge of keeping digital artifacts accessible and usable while assuring their authenticity surfaces in a multitude of domains and organizational contexts.
While digital longevity as a challenge is increasingly encountered in domains as diverse as high-energy physics and electronic arts, the repository is still the prototypical scenario where the concern of longevity is of paramount importance, and libraries continue to play a strong role in the preservation community. Repository systems are increasingly made fit for actively managing content over the long run so that they can provide authentic access even after the original creation context, both technical and social, is no longer available. In this process, they have to address two conflicting requirements: the need for trust, a fundamental principle that is indispensable in the quest for long-term delivery of authentic information, and the need for scalability, arising from the ever-rising volumes of digital artifacts deemed worthy of keeping.

Systems that address aspects of preservation include repository software, tools for identification and characterization of digital artifacts, tools for preservation actions such as migration and emulation, systems to address aspects of analysis and monitoring, and preservation planning. It is understood today that automating most aspects of an operational preservation system is a crucial step to enable the scalability required for achieving longevity of digital content on the scales of tomorrow. Such automation is required within components of a complex system, but also needs to address systems integration, information gathering, and ultimately, decision support. The core capabilities that an organization needs to establish cover

● preservation operations, i.e. preservation actions such as emulation, virtualization and migration of digital objects to formats in which they can be accessed by the users, but also object-level characterization, quality assurance and metadata management;

● preservation planning, i.e. the creation, ongoing management and revision of operational action plans prescribing the actions and operations to be carried out as means to effectively safeguard, protect and sustain digital artifacts authentically and to ensure that the means to access them are available to the designated community; and

● monitoring as a sine-qua-non of the very idea of longevity: most of the risks that need to be mitigated to achieve longevity stem from the tendency of aspects in the socio-technical environment to evolve and sometimes change radically. Without the capability to sustain a continued awareness of a preservation system and its environment, preservation will not achieve its ultimate goal for long. Monitoring focuses on analyzing information gathered from different sources, both internal and external to the organization, to ensure that the organization stays on track in meeting its preservation objectives (Becker, Duretec, et al. 2012).

Such awareness needs to be based on a solid understanding of organizational policies, which provide the context for preservation. In general terms, it can be said that policies "guide, shape and control" decisions taken within the organization to achieve long-term goals (Object Management Group, 2008; Kulovits et al. 2013b). Monitoring, policy making and decision making processes are guided by information on a variety of aspects ranging from file format risks to user community trends, regulations, and experience shared by other organizations. Sources that provide this kind of information include online registries and catalogues for software and formats, or technology watch reports of recognized organizations.
These are increasingly available online, but the variety of structures, semantics and formats prohibit, so far, truly scalable approaches to utilizing the knowledge gathered in such sources to provide effective decision support. (Becker, Duretec, et al. 2012) However, the key challenge confronting institutions worldwide is precisely to enable digital preservation systems to scale cost-efficiently and effectively in times where content production is soaring, but budgets are not always commensurate with the volume of content in need of safeguarding. Recent advances in using paradigms such as mapreduce (Dean & Ghemawat 2004) to apply distributed data- intensive computing techniques to the content processing tasks that arise in repositories show a promising step forward for those aspects that are inherently automated in nature. But ultimately, for a preservation system to be truly scalable as a whole, each process and component involved needs to provide scalability, including business intelligence and decision making. Here, the decision points where responsible stakeholders set directions and solve the tradeoff conflicts that inevitably arise need to be isolated and well-supported. Planning and monitoring as key functions in preservation systems have received considerable attention in recent years. The preservation planning tool Plato has shown how trustworthy decisions can be achieved (Becker et al. 2009). Its application in operational practice has advanced the community’s understanding of the key decision factors that need to be considered (Becker & Rauber 2011a), and case studies have provided estimates of the effort required to create a preservation plan (Kulovits et al. 2013a). Finally, the systematic quantitative assessment of preservation cases can provide a roadmap for automation efforts by prioritizing those aspects that occur most frequently and have the strongest impact (Becker, Kraxner, et al. 2013). However, creating a preservation plan in many cases still is a complex and effort-intensive task, since many of the required activities have to be carried out manually. It is difficult for organizations to share their experience in a way that can be actively monitored by automated agents and effectively used by others on any scale. Automated monitoring in most cases is restricted to the state of internal storage and processing systems, with little linking to preservation goals and strategies and scarce support for continuously monitoring how the activities in a repository and its overall state match the evolving environment. Finally, integrating whatever solution components an organization chooses to adopt with the existing technical and social environment is difficult, and integration of this context with strategies and operations is challenging (Becker & Rauber 2011c). This article presents an innovative architecture for scalable decision making and control in preservation environments, implemented and evaluated in the real world. The SCAPE Planning and Watch suite builds on the preservation planning tool Plato and is designed to address the challenges outlined above. It makes preservation planning and monitoring context-aware through a semantic representation of key organizational factors, and integrates this with a sophisticated new business intelligence tool that collects and reasons upon preservation-relevant information. 
Integration with repositories and external information sources provides powerful preservation capabilities that can be freely combined with virtually any repository or content management system. The new system provides substantial capabilities for large-scale risk diagnosis and semi-automated, scalable decision making and control of preservation functions in repositories. Well-defined interfaces allow a flexible integration with diverse institutional environments. The free and open nature of the tool suite further encourages global take-up in the repository communities.

The article synthesizes and extends a series of articles reporting on partial solution blocks to this overarching challenge (Becker & Rauber 2011c; Becker, Duretec, et al. 2012; Faria et al. 2012; Petrov & Becker 2012; Kulovits et al. 2013a; Kulovits et al. 2013b; Faria et al. 2013; Kraxner et al. 2013). Besides pulling together the compound vision and value proposition of the integrated systems and providing additional insight into the design goals and objectives, we conduct an extensive evaluation based on a controlled case study, outline the interfaces of the ecosystem to enable integration with arbitrary repository environments, discuss implications for systems integration, and assess the improvement of the provided system over the state of the art in terms of efficiency, effectiveness, and trustworthiness.

The article is structured as follows. The next section illustrates the state of the art in preservation planning and monitoring and highlights the key challenges faced by repositories to provide scalable decision making and monitoring facilities. Section 3 presents the key goals of our work and the main conceptual solution components that are developed to address the identified challenge. We outline common vocabularies and those aspects of design pertaining to future extensions of the resulting preservation ecosystem. Section 4 presents the suite of automated tools designed and developed to improve decision support and control in real-world settings. Becker et al. (2015) will discuss the improvements of the presented work and identified limitations, based on a quantitative and qualitative evaluation of the advances over the state of the art, including a case study with a national library.

2. Digital preservation: Background and challenges

2.1 Digital preservation and repositories

The existing support for active, continued preservation in the context of digital repositories can be divided into several broad areas: repository software, tools for identifying and characterizing digital objects, tools for preservation actions (migration and emulation) and quality assurance, and systems for preservation monitoring and preservation planning. The main goal of repository software is to provide capabilities for storing any type of content and accompanying metadata, managing those data, and providing search and retrieval options to the user community. Rapidly growing demands for storing digital material are shifting development trends towards more scalable solutions aiming to provide a fully scalable system capable of managing millions of objects. Even though it is a very important aspect, scalability is only one of the dimensions that need to be effectively addressed in digital repositories. Many repositories are looking for ways to endow their systems with the capabilities to ensure continued access to digital content beyond the original creation contexts. Replacing an entire existing repository system is rarely a preferred option.
It may not be affordable or sustainable, but also will often not solve the problem, since the organizational side of the problem needs to be addressed as well and preservation is inherently of continuous nature. In a recent survey, a majority of organizations looking for preservation solutions stated that they are looking for mix-and-match solution components that can be flexibly integrated and support a stepwise evolutionary approach to improving their systems and capabilities. There is a strong preference in the community for open-source components with well-defined interfaces, fitting the approach preferred by most organizations (Sinclair et al. 2009). The importance of file properties and file formats in digital preservation resulted in broad research and the development of tools and methods for file analysis and diagnosis. According to their functionality such tools can be divided into three categories: identification (identifying the format of a file), validation (checking the file conformance with the format specification) and characterization (extracting object properties) (Abrams, 2004). Probably the best known identification tool is the Unix command file. Further examples are the National Archives’ DROIDi tool and its siblings such as fidoii . Characterization tools differ in performance characteristics as well in feature and format coverage. Some of the most used and cited examples are the JSTOR/Harvard Object Validation Environment JHoveiii and its successor JHove2, and the eXtensible Characterization Languages (XCL) (Thaller 2009). Apache Tika iv combines fast performance with a coverage that extends beyond mere identification to cover extraction of various object features. Acknowledging that a single tool cannot cover all formats and the entire feature space, the File Information Tool Set (FITS) v combines other identification and characterization tools such as DROID and JHove and harmonizes their output to be able to cover different file formats and have a richer feature space as a result. Some efforts have been reported on aggregating and analyzing such file statistics for preservation purposes. Most approaches and tools demonstrated thus far are often focused solely on format identification (Knijff & Wilson 2011; Hutchins 2012). (Brody et al. 2008) describes PRONOM-ROAR, an aggregation of format identification distributions across repositories. Today, automatic characterization and meta data extraction is supported by numerous tools. The SCAPE project is packaging such components into discoverable workflows and by that providing a possibility to automatically discover, install and run those toolsvi. This encompasses migration actions, characterization components, and quality assurance. The latter refers to the ability to deliver accurate measures about the quality of digital objects, in particular to ensure that preservation actions have not damaged the authenticity of an object’s performance (Heslop et al. 2002). The SCAPE project is addressing that question by providing a number of tools for image, audio video and web quality assurance (Pehlivan et al. 2013). Furthermore it is packaging those components into discoverable workflows and by that providing the facilities to discover and invoke these tools. The metadata collected by different identification and characterization tools will help with managing the objects more efficiently and effectively. The real power of such data becomes especially visible when visual aggregation and analysis methods are used. 
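To make this aggregation step concrete, the following minimal Python sketch tallies a format distribution from a directory of per-object FITS reports. It is an illustration only: the assumption of one FITS XML report per object is ours, and the tag matching deliberately ignores XML namespaces so that it does not depend on a particular FITS version.

```python
import sys
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET

def format_of(fits_report: Path) -> str:
    """Return the first format name reported in a FITS output file.
    Matching on the local tag name keeps this independent of the namespace."""
    root = ET.parse(fits_report).getroot()
    for elem in root.iter():
        if elem.tag.endswith("identity"):       # <identity format="..." mimetype="...">
            return elem.get("format", "unknown")
    return "unidentified"

def main(report_dir: str) -> None:
    counts = Counter(format_of(p) for p in Path(report_dir).glob("*.xml"))
    for fmt, n in counts.most_common():
        print(f"{n:8d}  {fmt}")

if __name__ == "__main__":
    main(sys.argv[1])
```

A tally of this kind is the raw material that the aggregation and visualization approaches discussed next operate on.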
Jackson (Jackson 2012) presented a longitudinal analysis of the format identification over time in the UK web. Even though this study was only using identification data, the resulting statistics of the evolution of different formats over time can yield significant insights into format usage trends and obsolescence. There is a clear need of a broad systematic approach to large-scale feature-rich content analysis to support business intelligence methods in extracting important knowledge about the content and formats stored in a repository. This is also a key enabler for successful preservation planning, one of the six functional entities specified in the OAIS model (CCSDS 2002). Its goals are to provide functions for monitoring internal and external environments and to provide preservation plans and recommendations which will ensure information accessibility over a longer period of time. Viewed as an organizational capability, the two main sub capabilities are Operational Preservation Planning and Monitoring (Antunes et al. 2011). The planning tool Plato (Becker et al. 2009) is up to now the best known implementation of an operational preservation planning method. It provides a 14-step workflow that guides a preservation planner in making decisions about the actions performed on a digital content. The result of a planning process is a trustworthy and well documented recommendation which identifies the optimal action from a defined set of alternatives according to specified objectives and requirements. These plans are not strategic plans guiding the organization’s processes and activities, but operational specifications for particular actions to be carried out with exact directives on how they shall be carried out. Even though Plato offers a great deal of automation, some steps in the workflow require significant manual work. Kulovits (Kulovits et al. 2009) showed that in 2009, a typical use case involved several people for about a week, including a planning expert to coach them. Preservation monitoring shows a comparable gap of automated tool support. Current activities usually result in technical reports, as (Lawrence et al. 2000), DigiCULTvii and Digital Preservation Coalition periodic reportsviii, or file format and tool registries (PRONOM ix, Global Digital Format Registry (GDFR)x, Unified Digital Format Registry (UDFR)xi, the P2 registryxii, and others). Technical reports function on a principle of periodically publishing documents about available formats and tools. They are meant for human reading and support no automation. Registries such as PRONOM are shared and potentially large, but very often do not provide in-depth information. They have difficulties in ensuring enough community contributions and, where those contributions exist, they are often sparse and dispersed in different registries. Moderation of such contributions through a closed, centralized system has proven notoriously difficult, which has led to increasing calls for a more open ecology of information sources (Becker & Rauber 2011a; Pennock et al. 2012) xiii. An early attempt to demonstrate automation in preservation monitoring was PANIC (Hunter & Choudhury 2006). The goal was to provide a system which will periodically combine the metadata from repositories with the information captured from software and format registries in order to detect potential preservation risks and provide recommendations for possible solutions. 
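The essence of such registry-driven risk detection can be illustrated in a few lines of Python: a repository's format profile is cross-checked against locally cached registry annotations. The identifiers and risk flags below are placeholders for illustration only; a production service in the spirit of PANIC would query live registries rather than a hard-coded table.

```python
from collections import Counter

# Placeholder registry annotations (format identifier -> list of risk flags);
# a real service would obtain these from external registries such as PRONOM.
FORMAT_RISKS = {
    "fmt/example-1": [],
    "fmt/example-2": ["proprietary", "no actively maintained rendering software"],
}

def assess(profile: Counter) -> None:
    """Cross-check a repository format profile (identifier -> object count)
    against the cached registry entries and report noteworthy findings."""
    for fmt, count in profile.most_common():
        risks = FORMAT_RISKS.get(fmt)
        if risks is None:
            print(f"{fmt}: {count} objects, unknown to the registry - needs review")
        elif risks:
            print(f"{fmt}: {count} objects, flagged: {', '.join(risks)}")

assess(Counter({"fmt/example-1": 35000, "fmt/example-2": 120, "fmt/example-3": 3}))
```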
The initiative to develop an Automatic Obsolescence Notification Service (Pearson 2007) aimed at providing a service that would automatically monitor the status of file formats in a digital repository against format risks collected in external registries. Unfortunately, the dependency on external format registries to provide information for a wider range of file formats was a limitation for AONS, which caused it to monitor only a limited amount of information. Preservation actions need to be carefully chosen and deployed to ensure they in fact address real issues and provide effective and efficient solutions. There is an increasing awareness and understanding of the interplay of preservation goals and strategies, tools and systems, and digital preservation policies. Policies are often thought to provide the context of the activities and processes an organization executes to achieve its goals, and hence the context for the preservation planning and monitoring processes described. Yet, in Digital Preservation, the term “policies” is used ambiguously; often, it is associated with mission statements and high-level strategic documents (Becker & Rauber 2011c). Representing these in formal models would lead to only limited benefit for systems automation and scalability, since they are intended for humans. On the other hand, models exist for general machine-level policies and business policies. However, a deep domain understanding is required to bring clarity into the different levels and dimensions at hand. This should be based on an analysis of the relevant drivers and constraints of preservation. A driver in this sense is an “external or internal condition that motivates the organization to define its goals” (Object Management Group 2010), while a constraint is an “external factor that prevents an organization from pursuing particular approaches to meet its goals” (Object Management Group 2010). Common examples for preservation policies are on the level of statements in TRAC (OCLC and CRL 2007), ISO16363 (ISO 2010), or statements in Beagrie et al. (2008). These are well known, but their impact is not always well understood, and operations based on these can be quite complex to implement. Moreover, there is no recognized model for formalizing preservation policies in a standard way. Providing such context for preservation planning, monitoring and operations, however, is key to successful preservation. So far, context has been provided implicitly as part of decision making, adding a burden on decision makers and threatening the quality and transparency of planning and actions. These policies correspond to what the OMG standards call "business policies". The OMG has been active in modeling and standardizing this concept for many years and produced in particular two valuable standards: the Business Motivation Model (Object Management Group 2010) and the Semantics of Business Vocabulary and Business Rules (SBVR) (Object Management Group 2008). According to these, policies are non-enforceable elements of governance that guide, shape and control the strategies and tactics of an organization. An element of governance is an "element of guidance that is concerned with directly controlling, influencing, or regulating the actions of an enterprise and the people in it". Enforceable means that "violations of the element of governance can be detected without the need for additional interpretation of the element of governance" (Object Management Group 2008). 
There are various levels of policy statements required in a digital preservation environment. While the DP community has specified criteria catalogs for trustworthy preservation systems, these fail to separate concerns and distinguish between strategic goals, operational objectives and constraints, and internal process metrics. The relationship between these is often vague. Compliance monitoring in operational preservation systems is restricted to generic operations and does not align well with the business objectives of providing understandable and authentic access to information artifacts. The lack of clarity, separation of concerns, formalism and standardization in regulations for DP compliance means that operationalizing such compliance catalogs is very difficult, and that verification of compliance is manual and either limited to abstract high-level checks on a system's design or inherently subjective.

2.2 On trust and scalability

Preservation planning methods and tools such as Plato have evolved considerably from their origins (Strodl et al. 2006). It is worth recalling here the two fundamental dimensions along which such evolution could take place – dimensions set by the decision space in which these methods are designed to operate. The key requirements, not at all compatible at first sight, are trust and scalability.

Trust, a requirement that is hardly disputed, mandates that organizations strive for transparency, accountability, and traceability, traits evidently recommended by standards such as the Repository Audit and Certification checklist (ISO 2010). Achieving trust requires a carefully designed environment that promotes transparency of decisions, ensures full traceability of decision chains, and supports full accountability. Scalability, on the other hand, is mandated by the sheer volumes of content pouring into repositories in the real world, and calls for automation, reduced interaction, simplified decisions, and the removal of human interaction wherever possible. Scalability calls for automated actions applied in standardized ways with minimized human intervention. Trust, on the other hand, mandates that any automated action is fully validated prior to execution, providing an assessment trail, supported by real-world evidence, against the objectives specified by the organization.

Preservation planning methods and tools such as Plato have come a long way along the path of trustworthy decision making, but by the very nature of the task have difficulties in making progress on the dimension of scalability. Considerable effort is commonly required for taking trustworthy decisions, as well as for creating, structuring and analyzing the underlying information that is the input for the decision making process. Until now, this has often meant that organizations fail to move from hardly trustworthy ad-hoc decision making to fully trustworthy, well-documented preservation planning (Becker & Rauber 2011c; Kulovits et al. 2013a).

2.3 Challenges and goals

The preservation of digital content requires that continuous monitoring, planning and the execution of corrective actions work together towards keeping the content authentic and understandable for the user community and compatible with the external environment and restrictions. However, many institutions carry out these processes in a manual and ad-hoc way, completely detached from the content lifecycle and without well-defined points of interoperability.
This limits the ability to integrate and scale preservation processes in order to cope with the escalating growth of content volume and heterogeneity, and it undermines the capacity of institutions to provide continued access to digital content and preserve its authenticity. We observe that there are a number of gaps in the means currently available to institutions:

1. Business intelligence mechanisms are missing that address the specific needs of preservation over time and enable organizations to monitor the compliance of their activities to goals and objectives as well as risks and opportunities. Similarly, organizations lack the scalable tools to create feature-rich profiles of their holdings to support this monitoring and analysis process. There are no accepted ways to address the need for continuous awareness of a multitude of key factors prone to change, including user communities, available software technology, costs, and risks, to provide a unified view on the alignment of an organization's operations to goals and needs. While the community is eager to share the burden and promote collaboration, it is notoriously difficult for organizations to effectively do so.

2. Knowledge sharing and discovery at scale is not widely practiced, since there is no common language, no effective model, and little clarity as to what exactly can and should be shared and how. Hence, sharing is practiced on an ad-hoc and peer-to-peer basis, with little scalable value for the wider community.

3. Decision making efficiency needs to be improved without sacrificing transparency and trustworthiness. This requires not only more efficient mechanisms built into decision making tools, but also a more explicit awareness of an organization's context.

4. Preservation policies are a key factor to achieve this and have been notoriously difficult to pin down. In this context, it is important to understand policies as 'elements of guidance that shape, guide and control' (Object Management Group 2008) the activities of an organization, so that the core aspects can be formalized and understood by decision support and business intelligence systems.

5. Systems integration, finally, is chronically difficult and only successful where modular components with clearly defined purpose and well-specified interfaces are provided in the place of monolithic, custom-built solutions.

It becomes clear that establishing such capabilities cannot be achieved simply by introducing a new software tool, but requires careful consideration of the socio-technical dimensions of the design problem. Designing a set of means to address these issues requires a solid understanding of socio-technical environments and a flexible suite of methods and tools that can be customized, integrated and deployed in a real-world context to address the issues pertaining to a particular situation. The following section will discuss each of the design challenges in turn and derive a set of overarching design goals. Based on these, we will present the main concepts and solution components that form the main contribution of our work and discuss how they can be used in isolation or in conjunction to improve the state of the art in scalable decision making and control.
3. Scalable, context-aware Preservation Planning and Watch

3.1 Overview

Based on the observations outlined above, this section derives a number of design goals to be addressed in order to enable scalable decision making and control for information longevity, while further advancing the progress made on the path of trust, in a form that can make substantial real-world impact for a variety of organizations. Based on a new perspective that emphasizes the continuous nature of preservation, we describe an architectural design for trustworthy and scalable preservation planning and watch. Section 4 discusses the implementation of the architecture in the SCAPE Planning and Watch suite.

3.2 Design goals

Systematic analysis of digital object sets is a critical step towards preservation operations and a fundamental enabler for successful preservation planning: without a full understanding of the properties and peculiarities of the content at hand, informed decisions and effective actions cannot be taken. While large-scale format identification has been in focus for a while and tools for in-depth feature extraction exist, little work has been shown that combines in-depth analysis and large-scale aggregation into content profiles that are rich in information content and large in size.

G1: Provide a scalable mechanism to create and monitor large and rich content profiles.

For successful preservation operations, a preservation system needs to be capable of monitoring compliance of preservation operations to specifications, alignment of these operations with the organization's preservation objectives, and associated risks and opportunities that arise over time. Achieving such a business intelligence capability for preservation requires linking a number of diverse information sources and specifying complex conditions. Doing this automatically in an integrated system should yield tremendous benefits in scalability and enable sharing of preservation information, in particular risks and opportunities.

G2: Enable monitoring of operational compliance, risks and opportunities.

The preservation planning framework and tool Plato provides a well-known and solid approach to creating preservation plans. However, a preservation plan in Plato 3 is constructed largely manually, which involves substantial effort. This effort is spent in analyzing and describing the key properties of the content that the plan is created for; identifying, formulating and formalizing requirements; discovering and evaluating applicable actions; taking a decision on the recommended steps and activities; and initiating deployment and execution of the preservation plan. When automating such steps, trustworthiness must not be sacrificed for efficiency. Still, the efficiency of planning needs to be improved to the point that creating and revising operational plans becomes an affordable, routine part of the work of organizations responsible for safeguarding content and is understood well enough that it can potentially be offered as a service.

G3: Improve efficiency of trustworthy preservation planning.

For decision support and monitoring systems to be truly useful, they need to be aware of the context in which they are operating.
That includes an awareness of the organizational setting and the state of the repository so that they can assess risks and identify issues that need intervention, but it extends to an awareness of the world outside the repository to ensure these systems can provide this assessment also with respect to the larger context in which the repository operates. So far, it has been very difficult to make the organizational context known to the systems in a way that enables them to act upon it. The planning tool Plato 3, for example, requires the decision makers to model their goals and objectives in a tree structure; but it is not directly aware of other organizations’ goals and objectives. Similarly, the context awareness of systems such as PANIC is very limited. Most importantly, hence, preservation systems need to be endowed with an awareness of the context in which they shall keep content alive. This includes the organizational goals and objectives, constraints, and directives that shape and control the preservation operations of a repository. Such an awareness of the context requires a formalized representation of organizational constraints and objectives and a controlled vocabulary for representing the key entities of the domain. Given the evolutionary nature of the world in which preservation has to operate, such a vocabulary needs to be permanent, modular and extensible. G4: Make the systems aware of their context. Preservation planning focuses on the creation of preservation plans; Preservation Watch focuses on gathering and analyzing information; operations focus on actual processing of data and metadata. These methods and tools will in general be deployed in conjunction with a repository environment. This requires open interfaces and demonstrated integration patterns in order to be useful in practice. We hence need a system architecture that is based on open interfaces, well-understood components and processes, open data and standard vocabularies, but also able to be mixed and matched, extended and supportive of evolution over time. Components in an open preservation ecosystem need to use standards and appeal beyond digital preservation to enable growth and community participation. They should be built around a simple core, with the goal to connect and enable rather than impose and restrict. The preservation community is painfully aware how important sustainable evolution is for their systems, as emphasized by a recent discussionxiv. Correspondingly, the ecosystem in question should be built with sustainability in mind. G5: Design for loosely-coupled preservation ecosystems. Clearly, addressing the sum of these goals requires a view on the preservation environment that focuses on the continuous, evolving nature of information longevity as a sustained capability rather than a one-time activity. The following section presents such a view, focusing on the preservation lifecycle and its key components. 3.3 The preservation lifecycle Figure 1: Digital preservation lifecycle Figure 1 shows a view on the key elements in a preservation environment that relates the key processes required to successfully sustain content over time to each other. The preservation lifecycle naturally starts with the repository and its environment and evolves around the continuous alignment of preservation activities to the policies and goals of the organization. 
The Repository is an instance of a system which contains the digital content and may comprise processes such as ingest, access, storage and metadata management. The Repository may be as simple as a shared folder with files that represent the content, or as complex as dedicated systems such as DSpacexv, Eprints xvi and RODAxvii. The Repository refers not only to the generic software system but also its instantiation within an institution, related to an institutional purpose guided and constrained by policies that define objectives and restrictions for its content and procedures. In the context of this article, the preservation policies drive how the Repository must align to its context, environment and users, guiding digital preservation processes such as Watch, Planning and Operations. The alignment of the content and the activities of a repository to its context, environment and users is constantly monitored by Watch to detect preservation risks that may threaten the continuous and authentic access to the content. This starts by obtaining an understanding of what content the repository holds and what the specific characteristics of this content are. This process is supported by the characterization of content and allows a content owner to be aware of volumes, characteristics, format distributions, and specific peculiarities such as digital rights management issues and complex content elements. The characterization process feeds the aggregated set of key characteristics of the monitored content, i.e. the content profile, into the Watch process. This is depicted as the ‘monitored content’ in Figure 1. Repository events such as ingest or download of content are monitored by the Watch process, as they can be useful for tracking producer and consumer trends and uncover preservation risks. The Watch process cross-relates the information that comes from internal content characterization and repository events with the institutional policies and the external information about the technological, economic, social and political environment of the repository, allowing for the identification of preservation risks and opportunities. For example, checking the conformance of content with the owner’s expectations or policies, identifying format or technological obsolescence in content, or comparing the content profile with other repositories can reveal possible preservation risks, but also opportunities for actions and possibilities to improve efficiency or effectiveness. These possible risks and opportunities should be analyzed by Planning to devise a suitable response. The Planning process carefully examines the risks or opportunities, considering the institution’s goals, objectives and constraints. It evaluates and compares possible alternatives and produces an action plan that defines which operations should be implemented and which service levels have been agreed on, and documents the reasoning that supports this decision (Becker et al. 2009). This action plan is deployed to the Operations process that orchestrates the execution of the necessary actions on the repository content, if necessary in large-scale distributed fashion, and integrates the results back to the repository. These operations can include characterization, quality assurance, migration and emulation, metadata, and reporting. 
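The interplay described here can be pictured as a simple control loop. The sketch below is purely illustrative: the stub functions and the repository methods (content_profile, recent_events, store) are assumptions standing in for the Watch, Planning and Operations processes and the repository interfaces discussed in the following sections.

```python
import time

def watch(profile, events, policies):
    """Stub: cross-relate the content profile, repository events and external
    information with the policies; return detected risks and opportunities."""
    return []

def plan(risk, policies):
    """Stub: evaluate alternative actions against objectives and constraints
    and return an action plan documenting the decision."""
    return {"trigger": risk, "actions": []}

def execute(action_plan, repository):
    """Stub: orchestrate the plan's actions (migration, QA, ...) on the
    repository content and return the measured results."""
    return {"plan": action_plan, "qa_measurements": []}

def preservation_lifecycle(repository, policies):
    while True:                                             # preservation is continuous
        profile = repository.content_profile()              # characterization feeds Watch
        events = repository.recent_events()                 # ingests, downloads, ...
        for risk in watch(profile, events, policies):
            action_plan = plan(risk, policies)               # Planning resolves trade-offs
            results = execute(action_plan, repository)       # Operations act on content
            repository.store(results)                        # results merged back
            # QA measurements re-enter Watch on the next iteration
        time.sleep(24 * 60 * 60)                             # periodic re-assessment
```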
The Operations process should provide information about executed actions such as quality assurance measurements to the Watch process to be sure that the results conform to the expectations set out in the action plan. All the conditions about internal and external information considered as a decision factor by Planning should be continuously monitored so that the organization knows where active plans remain aligned and valid over time. Once a condition is detected that may invalidate a plan, Planning should be called upon to re-evaluate the plan. This perspective on digital preservation as a set of processes or capabilities that interact with each other to achieve the digital preservation objectives has evolved considerably over the last decade. From the oft-cited standard model in the domain, the OAIS (CCSDS 2002), which emphasizes a functional decomposition of elements in an archive, the perspective evolved to the capability- based view of the SHAMAN Reference Architecture (Antunes et al. 2011), which based the model strongly in Enterprise Architecture foundations and thus integrated the domain knowledge of preservation with a holistic view of the organizational dimensions. However, neither presents a specific view on how these processes can align with each other in practice, allowing the flow of information from one process to the next. The streamlined view illustrated in Figure 1 forms a lifecycle that ensures digital content on repositories is continuously adapted to the environment, the target users and institutional policies. Considering the above, it becomes clear that optimization of efficiency (whether of performance and cost or effort) must not only occur within each process, and not only consider scalable processing of data, but also at the integration points between each of the processes and in the decision functions themselves, so the whole preservation lifecycle becomes efficient and sustainable. Finally, many of the activities in these processes require sophisticated tool support to be applicable in a real-world environment. 3.4 An architecture for loosely-coupled preservation systems Achieving a full preservation lifecycle requires a set of components that implement the digital preservation processes and interoperate with each other in an open and scalable architecture. Figure 2 shows the set of components that are required to support and partially automate the processes necessary to sustain the preservation lifecycle. These need to be designed to be modular and flexible, have clearly distinguished functionalities, and fit the technical specifications of the institution context. Figure 2 Overall architecture of scalable planning and watch The Content profiler has the function of aggregating and analyzing content characteristics and producing a well-specified content profile that provides a meaningful and useful summary of the relevant aspects of the content. This component has to cope with large amounts of data in the content and support the watch and planning components by summarizing the important aspects to a content profile, exposed via the Content profile interface. The Watch component has the function of collecting this and other aspects in order to provide the business intelligence functionalities necessary for monitoring and alignment. By gathering information from diverse sources relevant for preservation, it enables the organization to monitor compliance, risks and opportunities, based on monitoring conditions that can be specified in the corresponding interface. 
For example, it provides the means to answer questions such as “How many organizations have content in format X?” or “Which software components have been tested successfully in analyzing objects in format Y?” (Becker, Duretec, et al. 2012). The component should be able to raise events when specified conditions are met. Interested clients provide a Notification interface to receive such events. The Planning component is hence informed about conditions that require mitigation. Its key function is to support the creation, revision and deployment of trustworthy and actionable preservation plans. In order to achieve this, it needs to retrieve the complete content profile, potentially access data from the repository for sample sets to experiment with for evaluation purposes, and will use the Plan Management interface to initiate the execution of the actions specified in the plan. A Repository should be able to integrate this preservation lifecycle architecture by implementing a set of interfaces xviii: Data Management enables basic operations on the data held by the repository to ensure controlled access, Event Reporting ensures that Watch can be informed about the status of operations and repository activities (Becker, Duretec, et al. 2012), and Plan Management provides the facilities to create and update preservation plans and initiate their deployment. To coordinate these sometimes complex activities and processes that are executed by operations, a Workflow engine can be used to execute the preservation action plan, i.e. the set of all actions and quality assurance tasks that compose the execution of a plan on the content. The Data management interface can also be used to merge the results of executing the action plan back to the repository. Interoperability between components is achieved via well-defined interfaces that allow the decoupling from the specific implementation of each component and also allow the reuse, replacement and easier maintenance of each of the components. The interfaces are open, in order to allow easy support of different component implementations, in particular different repository implementations. A key goal of such open interfaces is to enable continuous growth of systems by community participation. However, standardization in this area needs to go one step further and support semantic interoperability of the components. Components need to be aware of the context they are operating in, and this context need to be well communicated and mapped between each of components. Information exchanged between these components needs to be opened up to the community to build synergies, enable knowledge discovery, and move from static to dynamically growing information sources. The next section will describe the mechanisms designed to support this. 3.5 Policies as basis for preservation management When endowing components of a context-aware planning and watch system as envisioned here with an awareness of organizational context to create "policy-driven planning and watch", the idea cannot be that entirely non-enforceable elements drive something automatically, since the result would be random behavior. Instead, the idea is to relate non-enforceable high-level policies to practicable policies that are machine-understandable, but usually not specific enough to directly drive operations. The control of operations then is the responsibility of preservation planning, which creates enforceable preservation plans based on practicable policies. 
Corresponding to the observation that policies ‘guide, shape, and control’ the activities of an organization (Object Management Group 2010), we distinguish between the following levels. Guidance policies are non-enforceable governance statements that reside on the strategic (governance) level and often relate several high-level aspects of governance to each other. For example, they express value propositions to key stakeholders, commit to high-level functional strategies, define key performance indicators to be met, or express a commitment to comply with a regulatory standard. These policies are expressed in natural language and need to be interpreted by human decision makers. Automated reasoning on these is not generally feasible. The aspects to be included in such policy statements can be standardized and identified, but the statements can often not feasibly be expressed as machine language to a meaningful extent. In the preservation domain, typical examples can be seen in current regulatory compliance statements (ISO 2010), but also in preservation business policies (Beagrie et al. 2008). Control policies, on the other hand, are “practicable elements of governance that relate to clearly identified entities in a specified domain model … [and] constitute quantified, precise statements of facts, constraints, objectives, directives or rules about these entities and their properties.” (Kulovits et al 2013b). Practicable means that a statement is `sufficiently detailed and precise that a person who knows the element of guidance can apply it effectively and consistently in relevant circumstances to know what behavior is acceptable or not, or how something is understood.’ (Object Management Group 2008). Such policies can be fully represented in a machine-understandable model, but are often not directly actionable in the sense that it does not make sense to directly enforce them in isolation: The exact enactment will depend on the context and the relation of multiple control policies. For example, multiple control policies may be defined in isolation and contradict each other. The resolution of this contradiction in the decision making process (preservation planning) leads to a specified set of rules in the plan. This rule set is then actionable and enforceable. Some control policies will, on the other hand, be in principle enforceable. For example, constraints about data formats to be produced by conversion processes can be automatically enforced in a straightforward way. Control policies are practicable in the sense of the SBVR, but generally have to be specified by human decision makers in policy specification processes that refer to the guidance policies and take into account the drivers and constraints of the organization to create control policies. These processes can be standardized to a degree similar to standard business processes. The typical inputs and outputs as well as the stakeholders responsible, accountable, consulted and informed can be specified. Yet, it should not be prescribed to a particular organization in which way these policies have to be managed. By applying these levels, non-enforceable high-level policies can be related to practicable policies that are machine-understandable, but usually not specific enough to directly drive operations. The control of operations then is the responsibility of preservation planning, which creates enforceable preservation plans based on practicable policies. These preservation plans correspond to business rules. 
We note that if control policies are specified in a formal model, it should be possible to check instances of that model against formal constraints. Figure 3: Digital preservation policies need a well-defined domain model (Kulovits et al 2013b) An institutions’ specific policies should thus be specified following a well-defined vocabulary. In order to make such policies meaningful, a core set of domain elements has to be identified and named so that the properties of these concepts can be referred to, represented and measured. This is illustrated in Figure 3 and Figure 4. Ultimately, a preservation case arises, in analogy to a business case, from the identified value of a set of digital artifacts for a specified, more or less well-defined, set of users, called the user community. A preservation case hence concerns identified content and identified users and specifies the goals that should be achieved by preservation. Practically, the level of detail known in each specific instance about the users’ goals and means will vary greatly, but where there is no identified potential value in preserving a set of digital artifacts, it will likely be discarded. The scope of the preservation case thus corresponds closely to the statements of “preservation intent” discussed by Webb et al. (2013). In order to successfully preserve objects for a set of users, i.e. address a preservation case, goals will be identified and made explicit by specifying objectives. These are more explicit than a general preservation intent and represent the general goals for effective and efficient continued access to the intellectual content of the digital artifacts in precise statements: The objectives specify desirable properties of the objects with regards to authenticity, formats, and other aspects of representation (such as compression, codecs, or encryption); desired properties of the formats in which such objects shall be represented; desired properties of the preservation operations carried out to achieve preservation goals, in particular preservation actions to be applied (such as a preferred strategy of migration or an upper limit on costs); and access goals derived from knowledge about the user community. It can be seen that the core focus of this model is on continued accessibility and understandability on a logical level, emphasizing the continued alignment that is at the heart of preservation rather than the mere conservation of the bitstreams themselves, which is seen as a necessary precondition to be addressed independently. It is only through access (of whatever form) that preservation results in value; and it is only through a continued process that such understandability and access can be assured. Making the aspects that should be aligned explicit and measurable is the first step towards intelligent detection and reaction. Correspondingly, the core set of control policy elements is shown in Figure 4, taken from (Kulovits et al. 2013b) which describes the controlled vocabularies in more detail. Figure 4: The core set of elements in the vocabulary (Kulovits et al, 2013b) The ontology of core control policies and the ontology of the domain elements referenced in these statements are permanently accessible on http://purl.org/DP/control-policy and http://purl.org/DP/quality. 
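As an illustration, a single control-policy statement can be expressed as RDF against these vocabularies, for instance with Python's rdflib. The class and property names used below (Objective, measure, modality, value) and the trailing '#' on the namespace URIs are assumptions made for the sake of the example; the published ontologies at the URLs above define the authoritative terms.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Namespaces follow the published locations; the '#' termination is an assumption.
CP = Namespace("http://purl.org/DP/control-policy#")
QA = Namespace("http://purl.org/DP/quality#")
EX = Namespace("http://example.org/policies#")

g = Graph()
g.bind("cp", CP)
g.bind("qa", QA)

# Illustrative control policy: migrated images must keep their original width.
objective = EX.keepImageWidth
g.add((objective, RDF.type, CP.Objective))          # hypothetical class name
g.add((objective, CP.measure, QA.imageWidthEqual))  # hypothetical measure term
g.add((objective, CP.modality, Literal("MUST")))    # hypothetical modality encoding
g.add((objective, CP.value, Literal(True)))

print(g.serialize(format="turtle"))
```

Because such statements are ordinary RDF, instances can be validated against formal constraints or queried alongside content profiles and registry data, which is precisely what makes the vocabulary useful to automated components.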
While a detailed discussion of these elements is beyond the scope of this article, the next sections will show how this vocabulary enables the components of the implemented software suite to sustain an awareness of an organization's objectives and constraints and to monitor the alignment of operations to the preservation goals.

3.6 A preservation ecosystem

The standardization of the policy vocabulary and the domain model allows us to envision a digital preservation ecosystem that brings together the Organization, the Community environment, the Solution components and the Decision support and control tools that make up the loosely-coupled system presented in Section 3.4. The vocabularies allow all of these entities to share a common language and be able to interoperate. Figure 5 illustrates how the digital preservation vocabulary connects the ecosystem domains:

Figure 5: A common language connects the domains of the preservation ecosystem

● Organization. An organization has digital content and internal goals regarding its purpose and delivery which influence decisions on how to curate, preserve and reuse the content over time. People on behalf of the organization manage information systems and define policies that guide and constrain the selection and design of operations to be executed to preserve the content. The formulation of policy instances for the organization can follow a vocabulary that is widely understood by the other parts of the ecosystem.

● Community environment. Other organizations with particular concerns, not necessarily to preserve content, develop and populate systems that support various aspects of preservation directly or indirectly. These systems contain essential information on aspects relevant to preservation. The main building blocks in this domain include technical registries such as PRONOM, but increasingly extend to environments not originally emerging within digital preservation, such as the workflow sharing platform myExperiment xix or public open source software repositories such as github xx.

● Solution components comprise the services and tools, platforms and infrastructure components that support the necessary operations to address organizations' needs. These components relate to the pieces that need to be put together to address the organization's objectives in specific cases in cost-efficient ways. These tools must be selected considering the organization's policies (criteria and constraints) that define requirements for a solution. The main types of such solution components include software tools for file format identification, feature extraction, validation, migration, emulation and quality assurance. Solution components in this domain are generally developed, maintained, and distributed by commercial or noncommercial solution providers trying to meet market needs. Many of them are in fact created by members of the preservation community.xxi

● Decision support and control, finally, brings together those methods and systems that support the organization in choosing from the solution domain those elements that fit their policies and goals best, ensure most effectively that the content remains usable for the community, and support the organization in the continued task of monitoring.

Each software system requires information about certain domain entities. For example, content profiling needs to describe the objects it analyzes, and preservation tools need to report measures. Planning needs to discover preservation actions, evaluate actions, and describe plans.
Watch needs to collect measures on all these entities, detect conditions, and observe events. Finally, decision makers need to describe their goals and objectives in a way that is understandable by the systems, so that decision support can provide customized advice that befits their specific policies and constraints. The next section will outline how this ecosystem has been implemented and instantiated and show the preservation lifecycle in action within the ecosystem. We will outline the solution architecture, discuss the specific components of the architecture in turn, and then return to the preservation lifecycle and how the ecosystem increasingly supports scalable, context-aware preservation planning and monitoring and its integration into repository environments and the community.

4. The SCAPE Planning and Watch Suite

4.1 Overall solution architecture

The architecture outlined above has been implemented by a publicly available set of components and API specifications that can be freely integrated with any repository system. The suite of components aims to provide the tool support necessary to enable organizations to advance from largely isolated, ad-hoc decisions and actions reacting to preservation incidents to well-supported and well-documented, yet scalable and efficient preservation management. The following sections describe each of the key building blocks of this tool suite, focusing on the core design goals and features and pointing to references for further in-depth information. Note that the design is not limited to large-scale environments, but understands scalability as general flexibility with a focus on efficiency and automation. This is relevant in two ways: First, the tools do not require large-scale infrastructure, but are able to leverage it when present. Second, providing a loosely-coupled set of modular components enables organizations to adopt the suite incrementally, without large upfront investments. Figure 6 depicts the SCAPE software components supporting the preservation lifecycle and implementing the components and interfaces described above. The next sections describe each of these components in turn.

Figure 6: SCAPE software components supporting the preservation lifecycle

4.2 C3PO: Scalable Content analysis

Recent advancements in tool development for file analysis have resulted in a number of tools covering different functions such as identification, validation and characterization. A crucial challenge presented by these tools is the variance of their coverage in terms of file formats supported and features extracted. The characterization tool FITS addresses the problem of coverage by combining the outputs of different identification and characterization tools in one descriptor. This enables a rich characterization of a single file by using only one tool, which will in fact run the appropriate identification and characterization tools on the content and normalize their results into a well-defined XML output. While this comes at a performance cost, it is currently the only method that provides reasonable coverage of the feature space, covering both a variety of identification measures such as the PRONOM format ID and MIME types as well as in-depth feature extraction supported by an array of tools xxii. Using the output of characterization tools such as FITS and Apache Tika, the tool Clever Crafty Content Profiling of Objects (C3PO) xxiii enables a detailed content analysis of large-scale collections (Petrov & Becker 2012).
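To give a concrete sense of the kind of normalized descriptor such tools produce, the short Python sketch below runs FITS on a single file and pulls the reported format and MIME type out of the resulting XML. The command-line invocation and element names are assumptions based on typical FITS installations and output and may need to be adjusted to the version in use; the point is only to show how tool output becomes machine-readable input for profiling.

    import subprocess
    import xml.etree.ElementTree as ET

    def identify(path):
        """Run FITS on one file and return the (format, mimetype) pairs it reports."""
        # Assumes a FITS installation whose CLI accepts -i <input> and writes XML
        # to standard output when no output file is given; adjust as needed.
        xml_output = subprocess.run(["fits.sh", "-i", path],
                                    capture_output=True, text=True, check=True).stdout
        root = ET.fromstring(xml_output)
        results = []
        for elem in root.iter():
            # Match <identity> elements regardless of XML namespace.
            if elem.tag.rsplit("}", 1)[-1] == "identity":
                results.append((elem.get("format"), elem.get("mimetype")))
        return results

    print(identify("sample.tif"))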
Figure 7 provides a high-level overview of the process, which produces as its result a detailed content profile describing the key distribution characteristics of a set of objects. The process starts with running identification and characterization tools on a set of content. The metadata produced by those tools is collected and stored by C3PO, which currently supports the metadata schemas of FITS and Apache Tika. Support for other characterization tool output formats can easily be added by extending the highly modular architecture, which enables the integration of additional adaptors to support other metadata formats and gathering strategies. Using multiple metadata extraction tools on the same content will often result in conflicts, i.e. a state where two tools provide different values for the same feature. A common example is the file format, when two tools assign different format identifiers to the same file, either because of different interpretation logics or simply because they have a different way of representing the same format. C3PO offers the possibility to add rules which resolve those conflicts. These rules can range from simple conditions stating that two particular identifiers represent the same format to complex rules prioritizing certain tools or deriving values based on the presence of other features.

Figure 7: The key steps of content profiling

The architecture of C3PO decouples the persistence layer so that a variety of engines can be used. The default database provides strong scalability support by using the open-source, highly scalable MongoDB, which supports sharding (Plugge et al. 2010) and map-reduce (Dean & Ghemawat 2004) natively. This also enables users to provide their own analytics on the basis of this platform, similar to the built-in queries that are readily supported through a web user interface. These standard analytical queries calculate a range of statistics from the size of the collection to the triangular distributions of all numerical features and a histogram of all non-numerical features in the collection. The set of these statistics is the heart of the content profile. In addition to its processing platform, C3PO offers a simple web interface which allows dynamic exploration of the content. Part of it is shown in Figure 8, displaying a collection with about 42 thousand objects and an overall size of approximately 23 GB. Additional diagrams show the distribution of MIME types and formats. The user can create additional diagrams for any feature present in the set in order to visualize key aspects of the property sets. Advanced filtering techniques enable exploring the content in more detail. By clicking on a bar representing a certain format in the format distribution diagram, for instance, the user filters down to the corresponding object set to see details about that part of the collection only. This enables a straightforward drill-down analysis to see, for instance, how many of a set of TIFF files are valid or how many have a certain compression type.

Figure 8: C3PO visualizing a content profile

While C3PO can be readily used independently, it integrates with the remaining two components in the planning and watch suite, Scout and Plato. The integration with Scout offers the possibility to monitor the feature distributions of any number of collections over time. By creating a historic profile from a collection, its growth and changes in the distributions of key aspects such as formats can be revealed over time.
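The collection-level statistics described above can be computed directly on the scalable backend. The following sketch (Python with pymongo) derives a file format histogram in MongoDB using its aggregation framework. The database, collection and field names are hypothetical, and this is an illustration of the principle rather than C3PO's actual schema or built-in queries.

    from pymongo import MongoClient

    # Hypothetical connection and schema: one document per object, with the
    # (conflict-resolved) format name stored in a "format" field.
    client = MongoClient("mongodb://localhost:27017")
    objects = client["c3po_demo"]["objects"]

    # Group objects by format and count them, largest groups first.
    pipeline = [
        {"$group": {"_id": "$format", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]

    for row in objects.aggregate(pipeline):
        print(f'{row["_id"]}: {row["count"]} objects')

Because the aggregation runs inside the database, the same query scales from a small test set to a sharded collection of many millions of objects without changes to the client code.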
The integration with Plato exports the content profile for the whole collection or a subset of it in a well-defined format xxiv. This profile identifies and describes the set of objects contained and provides a statistical summary of file format identification and of the important features extracted. Plato understands this profile and uses it to obtain statistics about the content set for which a plan is being created. Finally, the profile can contain a set of objects that are considered representative of the entire set, to enable controlled experimentation on a realistic subset instead of the entire set of objects. This can increase the reliability of the sample selection and provide a substantial speedup, since without these heuristics, samples have to be selected by hand, a tedious and error-prone process (Becker & Rauber 2011c). As with the other modules, the heuristics used to select samples from this multidimensional view on the content set are flexible and configurable, and additional algorithms for sample selection can be added easily.

4.3 Scout: Scalable Monitoring

Scout xxv is an automated preservation monitoring service which supports the scalable preservation planning process by collecting and analyzing information on the preservation environment, pulling together information from heterogeneous sources and providing coherent, unified access to it. It addresses the need to combine an awareness of the internal state of an organization and its systems (internal monitoring) with an awareness of the environment in the widest sense (external monitoring) to enable a continued assessment of the alignment between the two (Faria et al. 2012). The information is collected by implementing different source adaptors, as illustrated in Figure 9. Scout places no restrictions on the types of data that can be collected. It is built to collect a variety of data from different sources such as format and tool registries, repositories, and policies. It already implements source adaptors for the PRONOM registry, content profiles from C3PO, repository events (ingest, access, and migration), policies and other specific sources. The combination of content profiles from C3PO with repository events from the Report API provides a complete overview of the current content in a repository and shows trends of how the overall set of content is evolving.

Figure 9: Scout information flows from sources to users

Continuous automated rendering experiments xxvi can be used to track the ability of viewing environments to display content and to verify whether it corresponds to the original performance (Law et al. 2012). Once information is collected, it is stored in a formally specified and normalized manner in the knowledge base (Faria et al. 2012). Built upon linked data principles, the knowledge base supports reasoning on and analysis of the collected data using standard mechanisms such as SPARQL xxvii. Such queries provide the mechanism for automatic change detection: when a user registers interest in a watch condition associated with such a query, the results are monitored periodically, and when the condition is met, a notification is sent to the user.
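A watch condition of this kind boils down to a query that is evaluated periodically against the knowledge base. The sketch below (Python with SPARQLWrapper) illustrates the principle: it asks a SPARQL endpoint how many objects in a monitored collection use a given compression type and flags a notification when the count is non-zero. The endpoint URL, graph vocabulary and property names are placeholders and do not reflect Scout's actual data model.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Placeholder endpoint and vocabulary; Scout's real knowledge base differs.
    endpoint = SPARQLWrapper("http://localhost:3030/scout/query")
    endpoint.setReturnFormat(JSON)
    endpoint.setQuery("""
        PREFIX ex: <http://example.org/watch#>
        SELECT (COUNT(?obj) AS ?n) WHERE {
            ?obj ex:partOfCollection ex:masterImages ;
                 ex:compressionType "lossy" .
        }
    """)

    result = endpoint.query().convert()
    count = int(result["results"]["bindings"][0]["n"]["value"])
    if count > 0:
        print(f"Watch condition triggered: {count} objects violate the compression policy")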
Conditions can cover arbitrary circumstances of relevance in the known domain, ranging from checks on content validity and on the conformance of a profile to certain constraints, to the question of whether any new tools have become available to measure a certain property in electronic documents, or whether a quality assurance tool that is in use for validating the authenticity of converted images is still considered reliable by the community. Upon receiving the notification, the user can initiate additional actions such as preservation planning to address any risks that have surfaced or to take advantage of opportunities that have been detected. Scout has a simple web interface which supports management operations such as adding new adaptors and triggers and browsing the collected data. This includes dynamically generated visualizations of data over time. By operating over a longer period, Scout is expected to accumulate a valuable collection of historical data. Figure 10 shows an example of the evolution of file formats over time. The resulting graph is based on an analysis of approximately 1.4 million files gathered in the period from December 2008 to December 2012 by the Internet Memory Foundation xxviii. Additional content sets that are gathered for historical analysis and shared publication include a set of over 400 million web resources collected in the Danish web archive over almost a decade and characterized using FITS xxix.

Figure 10: A format distribution of 1.4M files from the Internet Memory archive

Other specific adaptors demonstrate the capacity of Scout to incorporate new information and identify new preservation risks. Faria et al. (2013) describe a case study that demonstrates how information extraction technologies can be applied to crawled web content to extract specific domain facts, such as publisher-journal relationships, and integrate them into Scout for monitoring producers in journal repositories. Another specific adaptor feeds large-scale experiments on the renderability analysis of web pages into the knowledge base. Here, image snapshots of pages from web archives are taken with different web browsers, and the results are compared using image quality assurance tools. Expanding the comparison with structural information from the web page and cross-relating it with content profiles of the resources used by the page will give further insight into which formats, and which of their features, affect the renderability of pages in modern web browsers.

4.4 Plato: Scalable Decision making

Upon discovery of a risk or of a misalignment between the organization's content and actions on the one hand and its objectives on the other, a plan is needed to resolve the detected problem and improve the robustness of the repository against preservation threats. Creating such a plan is supported by the publicly available open-source planning tool Plato, which implements the preservation planning method described in detail in (Becker et al. 2009).

Figure 11: The planning workflow

The tool guides decision makers through a structured planning workflow and supports them in producing an actionable preservation plan for a defined set of objects. In doing so, they rely on a thorough, goal-oriented, evidence-based evaluation of the potential actions that can be applied.
Controlled experimentation on real sample content is at the heart of the four-phase workflow shown in Figure 11: Testing the candidate actions on real-world content greatly increases the trust that stakeholders put into the actions to be taken and ensures that the chosen steps are not simply taken from elsewhere and applied blindly, but will be effective and fit for the specific situation (Becker & Rauber 2011c).
1. Define requirements: In the first phase, the context of planning is documented, and decision criteria are specified that can be used to find the optimal preservation action. The specification starts with high-level goals and breaks them down into quantifiable criteria. The resulting objective tree provides the evaluation mechanism for choosing from the candidate preservation actions. To enable this, the set of objects to preserve is profiled, and sample elements are selected that will be used in controlled experimentation.
2. Evaluate alternatives: In an experiment step, empirical evidence is gathered about all potential candidate solutions by applying each to the selected sample content. The results are evaluated against the decision criteria specified in the objective tree.
3. Analyze results: For each decision criterion, a utility function is defined to allow comparison across different criteria and their measures. This utility function maps all measures to a uniform score that can be aggregated; relative weights model the preferences of the stakeholders on each level of the goal hierarchy (a minimal sketch of this aggregation is given further below). An in-depth visual and quantitative analysis of the resulting scores of the candidates leads to a well-informed recommendation of one alternative.
4. Build preservation plan: In this final phase, the concrete plan for action is defined. This includes an accurate and understandable description of which action is to be executed on which objects and how, and it specifies the quality assurance measures to be taken along with the action to ensure that the results are verified and correspond to the expected outcomes. Responsibilities and procedures for plan execution are defined. The finished preservation plan drives the activities in operations and Watch and will be reevaluated over time.

Figure 12: Plato visualizing criteria statistics from its knowledge base (Becker et al. 2013)

Plato has been used for operational preservation planning in different scenarios in recent years. The Bavarian State Library, for example, evaluated the migration options for one of its largest collections of scanned images of 16th-century books (Kulovits et al. 2009). A detailed discussion of this and several other case studies is given in (Becker & Rauber 2011b). At this point, creating a preservation plan was still an effort-intensive and complex task, since many of the required activities had to be carried out manually for each plan. However, the collected set of real-world cases enabled a systematic analysis of the variety of decision factors and a systematic categorization and formalization of the criteria used for decision making (Becker & Rauber 2011a; Kulovits et al. 2013b). Figure 12 shows Plato visualizing aggregated decision criteria collected in the knowledge base. This increasingly supports Plato in becoming context-aware and in automating many of the steps that have previously prevented large-scale, policy-driven preservation planning (Kraxner et al. 2013; Kulovits et al. 2013b).
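The aggregation logic behind phase 3 can be illustrated in a few lines of Python: each leaf criterion's measured value is mapped to a utility score, and the scores are rolled up using the stakeholders' relative weights. This is a minimal weighted-sum sketch with invented criteria and a 0-5 scale; Plato's actual implementation supports further aggregation modes and the visual analysis described above.

    # Utility functions map raw measures onto a uniform 0-5 scale (an assumed choice).
    def utility_megabytes(size_mb):           # smaller migrated files score higher
        return max(0.0, 5.0 - size_mb / 10.0)

    def utility_boolean(ok):                  # e.g. "embedded metadata preserved"
        return 5.0 if ok else 0.0

    # Hypothetical leaf criteria for one candidate action: (weight, utility score).
    # Weights on each level of the objective tree sum to 1.
    scored_criteria = [
        (0.5, utility_boolean(True)),         # authenticity: metadata preserved
        (0.3, utility_megabytes(12.0)),       # efficiency: output file size
        (0.2, utility_boolean(False)),        # format: no compression used
    ]

    overall = sum(weight * score for weight, score in scored_criteria)
    print(f"Aggregated score for this candidate: {overall:.2f}")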
Figure 13: Preservation operations are composed of multiple components (Kulovits et al. 2013b)

As part of the tool suite presented here, Plato has been integrated with Scout, C3PO, and an online catalogue of preservation components published as reusable, semantically annotated workflows on myExperiment xxx. An actionable preservation plan can contain a complex set of automated processing steps of different kinds of operations, linked through a pipeline of inputs and outputs that is best represented as a workflow, as shown in Figure 13. Specifying such a workflow in a standard manner, as opposed to a textual operations manual, greatly reduces the risk of operational errors and streamlines deployment. The integration of Plato with the Taverna workflow engine provides such possibilities (Kraxner et al. 2013). Plato is furthermore endowed with an awareness of the control policies encompassing the objectives and constraints to be followed. This understanding of the drivers and constraints of an organization is provided by the semantic policy model, which can be shared across members of the same organization. This removes much of the burden of clarifying contextual factors, which previously accounted for much of the difficulty in starting a planning process (Becker & Rauber 2011c; Kulovits et al. 2009). Together, this removes much of the effort required for preservation planning: The institutional context is provided and documented by a semantic model; content statistics, samples and technical descriptors are provided by the content profile; and the available actions that can mitigate risks such as obsolescence can be discovered on myExperiment. Finally, executable workflows can be deployed to the repository, removing risks of misunderstandings and misconfigurations and easing the burden of running operations in accordance with specifications (Kraxner et al. 2013). This awareness and the integration with an open and growing experiment sharing platform, plus an open controlled vocabulary, provide the basis for continued improvement of operations over time, as organizations can build on each other's work, show quantitative improvement of new solution components over those previously available, and discover which solution components are needed most urgently. As an example, consider the need to verify the quality of migration processes with respect to content authenticity: When converting even seemingly simple artifacts such as digital photographs, many conversion components introduce subtle errors by omitting embedded metadata, misinterpreting white balance and color settings, or using lossy compression methods where none was expected. Automated means are required to validate each conversion (Bauer & Becker 2011), but developing these is a heavy burden for each organization on its own. Instead, by showing that certain quality checks are required by multiple scenarios, efforts can be shared and focused on those aspects that are most frequent and at the same time critical for decision makers. The visual analysis shown in Figure 12 supports this by visualizing the quantified impact of each decision criterion and computing aggregated impact factors for arbitrary sets of criteria and preservation plans (Becker, Kraxner, et al. 2013).

4.5 Repository

The repository is defined here as the system that contains and manages content, providing ingest and access features. A repository may be as simple as a shared folder with files that represent the content, or as complex as a dedicated system such as DSpace or RODA.
There are many different types and implementations of repositories, each with different features and a focus on the needs of different types of institutions. Endowing a repository with digital preservation features should therefore be independent of the repository type and implementation. To achieve the integration with the tools described above, which effectively support the digital preservation processes, a set of repository integration APIs is defined: the Data Connector API, the Report API and the Plan Management API.

Data Connector API

The Data Connector API is an interface that allows access to and modification of content in the repository. Defined as a RESTful web service (Fielding 2000), it contains methods to
● Retrieve intellectual entities, metadata, representations, files and named bit streams,
● Ingest an intellectual entity (synchronously and asynchronously),
● Update an intellectual entity, a representation or a file, and
● Search intellectual entities, representations or files using the Search/Retrieval via URL protocol xxxi.
The SCAPE Digital Object Model defines how to represent the intellectual entities, metadata, representations, files and named bit streams listed above. It defines a METS xxxii profile that uses PREMIS xxxiii to specify the technical metadata, the rights associated with the object, and the digital provenance metadata. The Data Connector API specification and the SCAPE Digital Object Model are available xxxiv, and API reference implementations are provided by RODA and Fedora Commons 4.

Report API

The Report API is an interface that provides access to repository events such as
● Ingest started or finished,
● Descriptive metadata viewed or downloaded,
● Representation viewed or downloaded, or
● Preservation plan executed.
The Report API is defined as an OAI-PMH xxxv provider that uses PREMIS metadata to describe the repository events. The PREMIS Agent is used to define who triggered the event, the PREMIS Date/Time to define when the event occurred, and the PREMIS Details to describe what has happened. The OAI-PMH protocol allows harvesting of all events and filtering by date and type of event. A Scout Report API adaptor xxxvi harvests all events and creates aggregations of them. The Report API specification is available xxxvii, a reference implementation is available in RODA xxxviii, and a Fedora Commons reference implementation is being developed.

Plan Management API

This interface provides the facilities to deploy and manage preservation plans in the repository. Defined as a RESTful web service, it contains methods to
● Search and retrieve plans,
● Deploy a new plan,
● Retrieve or add a preservation execution state (e.g. in progress, success or fail), and
● Enable and disable a preservation plan.
The implementation of the Plan Management API (called the Plan Management Component) can use a workflow engine such as Taverna, which understands the workflow language in which the action plan is defined, to execute the workflow and run its preservation actions and quality assurance components. Finally, the Plan Management Component can use the Data Connector API to merge the results of preservation actions, such as migration, back into the repository. The Plan Management API specification is available online xxxix, and API reference implementations are being developed by RODA and Fedora Commons 4.
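To illustrate how a monitoring component such as Scout consumes the Report API described above, the sketch below (Python with requests) issues a standard OAI-PMH ListRecords request and prints the raw response. The base URL and the metadata prefix are placeholders; the actual prefix for PREMIS-encoded events is defined by the Report API specification.

    import requests

    # Placeholder repository endpoint exposing the Report API as an OAI-PMH provider.
    BASE_URL = "http://repository.example.org/report/oai"

    # Standard OAI-PMH ListRecords request, filtered by date; the metadataPrefix
    # value is an assumption and must match what the provider advertises.
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "premis-event",
        "from": "2014-01-01",
    }

    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    print(response.text[:1000])  # PREMIS-described repository events, as OAI-PMH XML

Because OAI-PMH is an established harvesting protocol, existing harvesters and libraries can be reused, and the same request pattern works against any repository that implements the Report API.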
4.6 Workflow Engine

Any complex set of operations such as those outlined in Section 4.4 will benefit from a workflow environment to support the coordinated execution on large amounts of content. The system design separates the implementation-level detail of such a workflow engine to enable integration of different platforms. However, strong tool support is available based on an integration of the workflow engine Taverna and the workflow sharing platform myExperiment, where fully annotated solution components can be published for sharing and discovery. Operational preservation plans can be created and specified as Taverna workflows and published using semantic annotations following the controlled vocabularies described above (Kraxner et al. 2013). Components published using this ontology can be discovered automatically and monitored for specific properties in Scout. The aggregated experience collected on their behavior can support early selection and recommendation of likely fits in the planning process. However, an organization that wishes to use a different workflow engine can replace Taverna with a platform of its own choice.

4.7 Automating the preservation lifecycle

To illustrate how the presented suite of tools can support the preservation lifecycle, consider the following scenario. An institution has a repository with content and policies in place. These policies might not be formalized, and some may even be documented only implicitly, but they represent the intentions of the organization and should guide and constrain all preservation processes. A first step requires the institution to define and formalize the purpose of the content and the digital preservation requirements associated with it. These requirements start with a high-level definition of the mission and objectives, such as "long-term availability and authenticity", and must be iteratively related to low-level requirements that refer to tangible and measurable facts, for example "no compression allowed". These more specific requirements, i.e. control policies, should be defined using the SCAPE policy model. By running characterization tools and C3PO on the repository and configuring the Scout adaptors for repository integration, which use C3PO and the Report API, Scout will be able to constantly monitor characteristics of the content and the repository events of importance for digital preservation. Scout provides the facility to upload the policies defined in the SCAPE policy model and to activate a set of triggers. It will then notify the users when policy conformance is not fulfilled. These triggers might need external information monitored by Scout, such as the content of format and tool registries, different classes of experiments, and even manually inserted human knowledge. Scout may, for example, detect that some content uses compression although this violates a defined policy, and hence send an email notification to the Planner. The third step is to decide which actions should be taken to mitigate this problem. The Planner can use Plato to support the creation of a well-described and traceable preservation plan that addresses the detected preservation risk. By knowing the defined preservation policies, Plato can pre-fill many of the necessary pieces of contextual information, supporting the reuse of the institution's objectives definition and greatly reducing the time needed to create a preservation plan (Kulovits et al. 2013a).
Furthermore, Plato can automatically find and retrieve solution alternatives by connecting to the myExperiment preservation components catalogue, and it can automatically conduct experiments on all alternatives discovered in myExperiment, applying them to the set of sample objects. The analysis of results is partially supported by quality assurance tools that provide an evaluation of the behavior of each alternative against the case requirements, which enables the decision maker to discover the best solution. The fourth step is to deploy the preservation plan into the repository via the Plan Management API. The Plan Management Component of the repository can use a workflow engine to execute the preservation action, including the quality assurance steps, and use the Data Connector API to merge the action results back into the repository. The results of the preservation action's quality assurance step are sent to Scout via the Report API, so that Scout can monitor whether the action performed as expected. Finally, the preservation plan contains triggers to be installed in Scout to automatically monitor whether the assumptions made in the decision-making step remain true. If the action plan does not execute as expected, or if the preservation plan needs to be reviewed because policies or the environment have changed, the Planner is again notified to re-evaluate the preservation plan, starting the cycle again.

5. Summary

Digital preservation is the set of activities and processes required to ensure continued, authentic access to digital content over time. Providing such information longevity across changing socio-technical environments poses a number of challenges, in particular in light of recently rising content volumes. Scalability for handling large amounts of data can be achieved with state-of-the-art technologies commonly used in the cloud. Additionally, scalable monitoring and decision making are required to support automated, large-scale operation of systems and tools. Scaling up decision making, policy definition, and processes for monitoring and actions requires a set of techniques that include scalable in-depth content analysis, intelligent information gathering, and efficient multi-criteria decision support. But it also requires loosely-coupled systems that are able to interact with each other and with the wider preservation context and are capable of evolution over time, and a set of common vocabularies that can be used to publish and discover knowledge about the evolving preservation ecosystems. This article presented the SCAPE Planning and Watch suite, a new, innovative system for scalable decision making and control in preservation environments. The Planning and Watch suite builds on Plato and extends it into a loosely coupled, extensible preservation planning and monitoring system that can be integrated with virtually any repository and content management system through open and standardized interfaces. While each of the components can be used and integrated independently of the others, this article focused on the compound value contribution that can be obtained from the set of systems and showed how the resulting SCAPE ecosystem can support organizations in managing their holdings more effectively, using policy-driven monitoring and well-supported decision making to provide scalable decision making and control capabilities in support of digital preservation objectives.
In Becker et al. (2015), we will conduct a systematic assessment of the system based on the design goals outlined in this article. We will discuss the improvements of the presented work and identified limitations, based on a quantitative and qualitative evaluation including a case study with a national library.

Acknowledgements

Part of this work was supported by the European Union in the 7th Framework Program, IST, through the SCAPE project, Contract 270137.

References

Abrams, S. L. (2004), "The role of format in digital preservation", VINE: The Journal of Information and Knowledge Management Systems, Volume 34, Number 2, pp. 49-55.
Antunes, G. and Borbinha, J. and Barateiro, J. and Becker, C. and Proenca, D. and Vieira, R. (2011), "Shaman reference architecture", version 3.0, SHAMAN project report.
Bauer, S. and Becker, C. (2011), "Automated Preservation: The Case of Digital Raw Photographs", in Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation. Proceedings of the 13th International Conference on Asia-Pacific Digital Libraries (ICADL 2011), Beijing, China, 2011, Springer-Verlag.
Beagrie, N. and Semple, N. and Williams, P. and Wright, R. (2008), "Digital Preservation Policies Study Part 1: Final Report", HEFCE.
Becker, C. and Kraxner, M. and Plangg, M. and Rauber, A. (2013), "Improving decision support for software component selection through systematic cross-referencing and analysis of multiple decision criteria", in Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS), 2013, Maui, USA, pp. 1193-1202.
Becker, C. and Duretec, K. and Petrov, P. and Faria, L. and Ferreira, M. and Ramalho, J.C. (2012), "Preservation Watch: What to monitor and how", in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Becker, C. and Duretec, K. and Faria, L. (2015), "Scalable Decision Support for Digital Preservation: An Assessment", to appear in OCLC Systems & Services, Volume 31, Number 1.
Becker, C. and Kulovits, H. and Guttenbrunner, M. and Strodl, S. and Rauber, A. and Hofman, H. (2009), "Systematic planning for digital preservation: evaluating potential strategies and building preservation plans", International Journal on Digital Libraries, Volume 10, Issue 4, pp. 133-157.
Becker, C. and Rauber, A. (2011a), "Decision criteria in digital preservation: What to measure and how", Journal of the American Society for Information Science and Technology, Volume 62, Issue 6, pp. 1009-1028.
Becker, C. and Rauber, A. (2011b), "Four cases, three solutions: Preservation plans for images", Technical report, 2011, Vienna University of Technology, Vienna, Austria.
Becker, C. and Rauber, A. (2011c), "Preservation Decisions: Terms and Conditions Apply. Challenges, Misperceptions and Lessons Learned in Preservation Planning", in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2011, Ottawa, Canada, pp. 67-76.
Brody, T. and Carr, L. and Hey, J. and Brown, A. and Hitchcock, S. (2008), "PRONOM-ROAR: Adding Format Profiles to a Repository Registry to inform Preservation Services", The International Journal of Digital Curation (IJDC), Volume 2, Issue 2, 2007, pp. 3-19.
CCSDS (2002), "Reference Model for an Open Archival Information System (OAIS)", retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf
Dean, J. and Ghemawat, S. (2004), "MapReduce: simplified data processing on large clusters", in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI), Berkeley, USA.
Faria, L. and Akbik, A. and Sierman, B. and Ras, M. and Ferreira, M. and Ramalho, J.C. (2013), "Automatic Preservation Watch using Information Extraction on the Web", in Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES) 2013, Lisbon, Portugal.
Faria, L. and Petrov, P. and Duretec, K. and Becker, C. and Ferreira, M. and Ramalho, J.C. (2012), "Design and architecture of a novel preservation watch system", in The Outreach of Digital Libraries: A Globalized Resource Network. Proceedings of the 14th International Conference on Asia-Pacific Digital Libraries (ICADL) 2012, Taipei, Taiwan, pp. 168-178.
Fielding, R.T. (2000), "Architectural Styles and the Design of Network-based Software Architectures", doctoral dissertation, University of California, Irvine.
Garret, J. and Waters, D. (1996), "Preserving digital information: Report of the task force on archiving digital information", The Commission on Preservation and Access and RLG.
Hedstrom, M. (1998), "Digital Preservation: A time bomb for digital libraries", Computers and the Humanities, 1997, Volume 31, Issue 3, pp. 189-202.
Heslop, H. and Davis, S. and Wilson, A. (2002), "An approach to the preservation of digital records", Green paper, National Archives of Australia, 2002, retrieved from http://www.naa.gov.au/Images/An-approach-Green-Paper_tcm16-47161.pdf.
Hunter, J. and Choudhury, S. (2006), "PANIC: an integrated approach to the preservation of composite digital objects using Semantic Web services", International Journal on Digital Libraries (IJDL), 2006, Volume 6, Issue 2, pp. 174-183.
Hutchins, M. (2012), "Testing software tools of potential interest for digital preservation activities at the National Library of Australia", Technical report, National Library of Australia, 2012.
ISO (2010), "Space data and information transfer systems - Audit and certification of trustworthy digital repositories (ISO/DIS 16363)", International Standards Organisation.
Jackson, A. (2012), "Formats over time: Exploring UK web history", in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Knijff, J. and Wilson, C. (2011), "Evaluation of characterization tools", Technical report, retrieved from http://www.scape-project.eu/wp-content/uploads/2012/01/SCAPE_PC_WP1_identification21092011.pdf.
Kraxner, M. and Plangg, M. and Duretec, K. and Becker, C. and Faria, L. (2013), "The SCAPE Planning and Watch suite", in Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES) 2013, Lisbon, Portugal.
Kulovits, H. and Rauber, A. and Kugler, A. and Brantl, M. and Beiner, T. and Schoger, A. (2009), "From TIFF to JPEG2000? Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings", D-Lib Magazine, 2009, Volume 15, Number 11/12.
Kulovits, H. and Becker, C. and Andersen, B. (2013a), "Scalable preservation decisions: A controlled case study", in Proceedings of Archiving 2013, Washington, D.C., USA, pp. 167-172.
Kulovits, H. and Kraxner, M. and Plangg, M. and Becker, C. and Bechhofer, S. (2013b), "Open Preservation Data: Controlled vocabularies and ontologies for preservation ecosystems", in Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES) 2013, Lisbon, Portugal.
Law, M.T. and Thome, N. and Gançarski, S. and Cord, M. (2012), "Structural and visual comparisons for web page archiving", in Proceedings of the 2012 ACM Symposium on Document Engineering (DocEng '12), New York, NY, USA, pp. 117-120.
Lawrence, G.W. and Kehoe, W. and Kenny, A.R. and Rieger, O.Y. and Walters, W. (2000), "Risk Management of Digital Information: A File Format Investigation".
Object Management Group (2010), "Business Motivation Model 1.1".
Object Management Group (2008), "Semantics of Business Vocabulary and Business Rules (SBVR)", Version 1.0.
OCLC and CRL (2007), "Trustworthy Repositories Audit & Certification: Criteria and Checklist".
Pearson, D. (2007), "AONS II: continuing the trend towards preservation software 'Nirvana'", in Proceedings of the 4th International Conference on Preservation of Digital Objects (iPRES) 2007, Beijing, China.
Pehlivan, Z. (2013), "Quality Assurance Workflow, Release 2 + Release Report", Technical report, retrieved from http://www.scape-project.eu/wp-content/uploads/2013/06/SCAPE_D11.2_UPMC_V1.0.pdf.
Pennock, M. and Jackson, A. and Wheatley, P. (2012), "CRISP: Crowdsourcing Representation Information to Support Preservation", in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Petrov, P. and Becker, C. (2012), "Large-scale content profiling for preservation analysis", in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Plugge, E. and Hawkins, T. and Membrey, P. (2010), "The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing", Apress, USA.
Rothenberg, J. (1995), "Ensuring the longevity of digital documents", Scientific American, Volume 272, Number 1, pp. 42-47.
Sinclair, P. and Billenness, C. and Duckworth, J. and Farquhar, A. and Humphreys, J. and Jardine, L. (2009), "Are you Ready? Assessing Whether Organisations are Prepared for Digital Preservation", in Proceedings of the 6th International Conference on Preservation of Digital Objects (iPRES) 2009, San Francisco, USA, pp. 174-181.
Strodl, S. and Rauber, A. and Rauch, C. and Hofman, H. and Debole, F. and Amato, G. (2006), "The DELOS Testbed for Choosing a Digital Preservation Strategy", in Digital Libraries: Achievements, Challenges and Opportunities. Proceedings of the 9th International Conference on Asian Digital Libraries (ICADL), 2006, Kyoto, Japan, Springer-Verlag, pp. 323-332.
Thaller, M. (2009), "The eXtensible Characterisation Languages – XCL", Verlag Dr. Kovac, Hamburg, Germany, 2009.
Webb, C. and Pearson, D. and Koerbin, P. (2013), "Oh, you wanted us to preserve that?! Statements of Preservation Intent for the National Library of Australia's Digital Collections", D-Lib Magazine, 2013, Volume 19, Number 1/2.

i http://www.nationalarchives.gov.uk/information-management/projects-and-work/droid.htm
ii https://github.com/openplanets/fido
iii http://jhove.sourceforge.net/
iv http://tika.apache.org/
v http://code.google.com/p/fits/
vi http://www.scape-project.eu/tools
vii http://www.digicult.info/pages/techwatch.php
viii http://dpconline.org/advice/technology-watch-reports
ix http://www.nationalarchives.gov.uk/PRONOM/
x http://www.gdfr.info
xi http://udfr.cdlib.org/
xii http://p2-registry.ecs.soton.ac.uk
xiii http://fileformats.archiveteam.org/ is one example.
xiv http://blogs.loc.gov/digitalpreservation/2013/06/why-cant-you-just-build-it-and-leave-it-alone/
xv http://www.dspace.org
xvi http://www.eprints.org
xvii http://www.roda-community.org
xviii https://github.com/openplanets/scape-apis
xix http://www.myexperiment.org/
xx https://github.com/
xxi http://www.scape-project.eu/tools
xxii https://code.google.com/p/fits/
xxiii http://peshkira.github.io/C3PO/
xxiv https://github.com/peshkira/C3PO/blob/master/format/C3PO.xsd
xxv http://openplanets.github.io/scout/
xxvi http://wiki.opf-labs.org/display/SP/Comparison+of+Web+Snapshots
xxvii http://www.w3.org/TR/rdf-sparql-query/
xxviii http://internetmemory.org
xxix http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
xxx http://myexperiment.org
xxxi http://www.loc.gov/standards/sru/
xxxii http://www.loc.gov/standards/mets/
xxxiii http://www.loc.gov/standards/premis/
xxxiv https://github.com/openplanets/scape-apis/
xxxv http://www.openarchives.org/pmh/
xxxvi https://github.com/openplanets/scout/tree/master/adaptors/report-api-adaptor
xxxvii https://github.com/openplanets/scape-apis
xxxviii https://github.com/openplanets/roda
xxxix https://github.com/openplanets/scape-apis