Scalable Decision Support for Digital Preservation: An Assessment

*This version is a voluntary deposit by the author. The publisher's version is available at: http://dx.doi.org/10.1108/OCLC-06-2014-0026.

Author Details

Author 1
Name: Christoph Becker
Department: Faculty of Information
University/Institution: University of Toronto
Town/City: Toronto
Country: Canada

Author 2
Name: Luis Faria
University/Institution: KEEP Solutions
Town/City: Braga
Country: Portugal

Author 3
Name: Kresimir Duretec
Department: Information and Software Engineering Group
University/Institution: Vienna University of Technology
Town/City: Vienna
Country: Austria

Acknowledgments: Part of this work was supported by the European Union in the 7th Framework Program, IST, through the SCAPE project, Contract 270137, and by the Vienna Science and Technology Fund (WWTF) through the project BenchmarkDP (ICT12-046).

Structured Abstract:

Purpose – Scalable decision support and business intelligence capabilities are required to effectively secure content over time. This article evaluates a new architecture for scalable decision making and control in preservation environments for its ability to address five key goals: (1) scalable content profiling, (2) monitoring of compliance, risks and opportunities, (3) efficient creation of trustworthy plans, (4) context awareness, and (5) loosely coupled preservation ecosystems.

Design/methodology/approach – We conduct a systematic evaluation of the contributions of the SCAPE Planning and Watch suite to provide effective and scalable decision support capabilities. We discuss the quantitative and qualitative evaluation of advancing the state of the art and report on a case study with a national library.

Findings – The system provides substantial capabilities for semi-automated, scalable decision making and control of preservation functions in repositories.
Well-defined interfaces allow a flexible integration with diverse institutional environments. The free and open nature of the tool suite further encourages global take-up in the repository communities.

Research limitations/implications – The article discusses a number of bottlenecks and factors limiting the real-world scalability of preservation environments. This includes data-intensive processing of large volumes of information, automated quality assurance for preservation actions, and the element of human decision making. We outline open issues and future work.

Practical implications – The open nature of the software suite enables stewardship organizations to integrate the components with their own preservation environments and to contribute to the ongoing improvement of the systems.

Originality/value – The paper reports on innovative research and development to provide preservation capabilities. The results of the assessment demonstrate how the system advances the control of digital preservation operations from ad-hoc decision making to proactive, continuous preservation management, through a context-aware planning and monitoring cycle integrated with operational systems.

Keywords: Repositories, preservation planning, preservation watch, monitoring, scalability, digital libraries.

Scalable Decision Support for Digital Preservation: An Assessment

1.
Introduction

This article continues the discussion in Becker et al (2014) and presents a systematic assessment and evaluation of the SCAPE decision support environment comprising PLATO, SCOUT and c3po. We discuss the improvements and identified limitations of the presented system. We furthermore discuss the quantitative and qualitative evaluation of advancing the state of the art and report on a case study with a national library. Finally, we summarize the contributions and provide an outlook on future work.

2. Evaluation and assessment

While some of the questions that are raised by the design goals discussed in Becker et al (2014) can be readily evaluated using standard metrics, others require a detailed qualitative assessment. This section discusses how to systematically assess improvements on the dimensions of trust and scalability. We report on a typical case study conducted with the State and University Library Denmark, discuss key metrics that can be used for evaluation, apply them to assess recent advances, and discuss a set of limitations. We further discuss how these findings can be applied on a wider scale.

2.1 Evaluation dimensions and challenges

Five major design goals have been proposed in Becker et al (2014):

● G1: Scalable content profiling is required to create and maintain an awareness of the holdings of an organization, including the technical variety and the risk factors that cause difficulties in continued access and successful preservation.
● G2: Monitoring compliance, risks and opportunities is a key enabler to ensure that the continued preservation activities are effective and efficient.
● G3: Efficient creation of trustworthy plans is required so that preservation can function as a normal part of an organization's processes in a cost-efficient way.
● G4: Context awareness of the systems ensures that they can adapt to the specific situation rather than provide generic recommendations or require extensive manual configuration.
● G5: Loosely-coupled preservation ecosystems, finally, enable organizations to follow a stepwise adoption path and support continuous evolution of the preservation systems as new solutions and improved systems emerge.

Given this set of design goals, it is clear that a systematic evaluation has to be based on both qualitative and quantitative criteria and account for the various socio-technical dimensions of the design problem. Scalable content profiling requires, first and foremost, efficiency in the data processing system. This can be measured in terms of the amount of data processed in a certain timeframe using a defined set of resources. This applies to the content profiling tool C3PO. The efficiency of decision making, on the other hand, can be measured in controlled experiments. This, however, has to be done in a real-world environment to be meaningful, which creates additional challenges for a large-scale assessment and has to be interpreted with caution.

The effectiveness of a preservation system composed of several heterogeneous and asynchronous processes, collaborating over time and controlled by decision makers in a real organization, is much harder to measure, since very often what needs to be measured in terms of the effects is time-delayed, and to a large extent defies objective measures in the present time. Similarly, trust is extremely hard to measure, and the preservation community has for a decade discussed different ways of assessing the trustworthiness of a repository (Ross & McHugh 2006; OCLC and CRL 2007). The resulting criteria catalogue ISO 16363 (ISO 2010) provides a useful checklist for assessing the assumed trust of an organization and hence can form a guideline for evaluation, but does not apply to the actual operations and the preservation lifecycle on the operational level.
The Plato planning approach that forms the basis for the architecture presented here has been designed with these criteria in mind and evaluated for adherence with and support of these criteria (Becker et al. 2009). However, it can be argued that more holistic perspectives are required to assess and improve an organization's trustworthiness, perspectives that emphasize enterprise governance of IT and the maturity of organizational processes (Becker et al. 2011).

The following discussion is designed loosely along the Goal-Question-Metric paradigm (Basili et al. 1994). Each goal is associated with a set of questions corresponding to the objectives outlined in Becker et al (2014). The answers to these should support an assessment as to how far the goal has been achieved. To this end, each question is further linked to a set of metrics that provide objective indicators to support an answer to the question. We discuss each of the design goals in turn and discuss the specific questions that need to be answered to provide an assessment of how the state of the art is improved with the proposed system design and implementation. This forms the basis of a systematic discussion, taking into account the quantitative indicators and the qualitative discussion of the state of the art.

2.2 Evaluation of design goals

G1 Provide mechanisms for scalable in-depth content profiling

Figure 14: Scalable profiling goals

Figure 14 poses the key questions we need to answer to evaluate the scalability and quality of in-depth profiling. Content profiles need to be meaningful, i.e. cover the interesting features that are known to be relevant for preservation, and trustworthy. Clearly, a profile covering only the size of files will be less meaningful than a profile including mime-types, formats, and validity.
Additionally, a plethora of features influence the success of continued access, ranging from the presence of Digital Rights Management settings to the number of embedded tables in electronic documents and other dependencies. An in-depth characterization process thus needs broad coverage in terms of supported file formats and extracted features, use of a common vocabulary for identifying formats, feature names and their values, and reasonably low resource consumption, so that it can be used in large-scale frameworks.

By relying on the FITS file information toolset, the C3PO profiler maximizes the coverage of features, arguably providing the highest feature coverage that can currently be achieved (Petrov & Becker 2012). The correctness of the aggregation itself can be verified in a straightforward way, since the operations are basic statistical calculations, and the correctness of general map-reduce based operations in themselves can be assumed. On the other hand, the correctness of characterization components in providing accurate feature descriptors for arbitrary input is far from proven. In fact, current data sets are entirely insufficient for proving the correctness of the complex interpretation processes that take place. This, however, is a problem on the level of operations and cannot be attributed to the aggregation step of content profiling. Separate efforts are underway to verify the correctness of characterization tools using model-driven engineering to generate annotated test data (Becker & Duretec 2013).

To enable future evolution, yet another aspect of scalability, a meaningful content profiler must be flexible enough to work on arbitrary property sets. C3PO supports this by relying on a generic data model, so that any additional property sets can be profiled. This also supports the integration of further characterization tools.
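To make the notion of a generic data model concrete, the following minimal Python sketch (illustrative only; the class and function names are hypothetical and do not reflect C3PO's actual implementation) shows how arbitrary property sets can be aggregated into a profile without a fixed schema:

```python
from collections import Counter, defaultdict

class Element:
    """One content item with an arbitrary set of extracted properties."""
    def __init__(self, uid, properties):
        self.uid = uid
        self.properties = dict(properties)  # e.g. {"format": "PDF", "valid": "true"}

def build_profile(elements):
    """Aggregate value distributions per property, for whatever properties occur."""
    profile = defaultdict(Counter)
    for e in elements:
        for prop, value in e.properties.items():
            profile[prop][value] += 1
    return profile

elements = [
    Element("1", {"format": "PDF", "size": "small"}),
    Element("2", {"format": "TIFF", "drm": "none"}),  # a new property set just works
    Element("3", {"format": "PDF"}),
]
profile = build_profile(elements)
print(profile["format"])  # Counter({'PDF': 2, 'TIFF': 1})
```

Because no property set is fixed in advance, a new characterization tool that emits additional features can be integrated without changing the aggregation logic.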
While FITS fulfils the coverage and vocabulary requirements, it consumes considerable resources and takes a substantial amount of time to execute [i]. Hence, C3PO also supports Apache Tika, which in experiments showed far better resource consumption and performance, with a throughput of up to 18 GB per minute [ii] when used with large-scale platforms such as Apache Hadoop [iii]. While in this case Apache Tika was only used for file format identification, it supports feature extraction and has good coverage of file formats [iv], but does not yet use a well-defined vocabulary for the identification of extracted features.

The objectively measurable throughput and resource usage in profiling, then, is the crucial final question. To measure the time and resource behavior of C3PO, a set of controlled experiments was conducted. The first measured the throughput of C3PO on a single standard machine, while the second employed a server with strong hardware and explored the boundaries of scalability by attempting to profile up to 400 million resources (12 Terabyte) in a single profile, enabling a further extrapolation of these results to the entire set of 300 TB in this collection. The third examined the limits of the web visualization platform in coping with these amounts of data.

The first experiment (Petrov & Becker 2012) tested the performance on limited resources and showed that on a standard computer with 4 GB of RAM and a 2.3 GHz CPU, the ingestion and generation of a profile of 42 thousand FITS files takes about 1.5 minutes. Large-scale tests were performed by Niels Bjarke Reimer from the Danish State and University Library [v]. A 12 TB sample was taken from a dataset with 300 TB of the Danish web archive. FITS was run on the sample content, resulting in 441 million FITS files. This characterization process took about a year to complete [vi].
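For perspective, a rough back-of-the-envelope calculation shows what the reported Tika throughput would mean for the 12 TB sample. This assumes 1 TB = 1024 GB, and the comparison is only indicative: the 18 GB per minute figure covers identification on a Hadoop cluster, whereas the year-long FITS run performed full characterization.

```python
# Reported Tika throughput: up to ~18 GB per minute on a Hadoop cluster.
# The same 12 TB sample took FITS about a year to fully characterize.
sample_gb = 12 * 1024          # assumption: 1 TB = 1024 GB
minutes = sample_gb / 18       # minutes at 18 GB/min
print(round(minutes / 60, 1))  # ≈ 11.4 hours
```

Even allowing for the narrower scope of the measurement, the gap of several orders of magnitude illustrates why tool performance dominates the scalability of in-depth profiling.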
For content profiling, two processing parts need to be considered: 1) the gathering of files into the internal data structure and 2) the analysis of that data set using map-reduce queries. The experiments were executed on a single machine with the specifications described in Table 1.

Processor: 2 x Intel Xeon X5660 2.8 GHz (12 cores)
RAM: 72 GB
Storage: Isilon storage box with 20 TB storage and 400 GB SSD, connected by a 1 Gbit/s Ethernet network
Operating system: Linux x86 64-bit
MongoDB version: 2.4
Application server: Apache Tomcat version 7

Table 1: C3PO scalability test machine specifications

The first step, which ingests the FITS files into a MongoDB server, was tested with the 441 million FITS items. The graph depicted in Figure 15 shows the import time for samples of around 3,600 files. The Y-axis unit is time in milliseconds and the X-axis unit is the sample number, which can also be considered a timeline.

Figure 15: Performance of C3PO import process using FITS metadata (Reimer et al. 2013)

The complete import process took less than 80 hours, with an average execution time of 0.65 milliseconds per FITS file. This import time is quite constant with only a few outliers, which implies that the platform and the software are acceptable for importing large amounts of data. The second step, the analysis of the data using map-reduce queries, was tested with a data set of about 12 million FITS files and took 15 hours and 18 minutes, which is about 4.63 milliseconds per FITS file. Using sharding and map-reduce technologies, the processing time of the second step should also be linear. In conclusion, both steps are linear and together take on average 5.28 milliseconds per FITS file. This means that processing the current 300 TB dataset would take about 16,170 hours, or about 674 days, on a single machine.
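The extrapolation above can be reproduced directly from the reported per-file timings; the following quick sanity check uses only figures stated in the experiments, not new measurements:

```python
# Reproducing the single-machine extrapolation: 0.65 ms/file (import) plus
# 4.63 ms/file (map-reduce analysis), scaled from the 12 TB / 441 million-file
# sample to the full 300 TB collection.
ms_per_file = 0.65 + 4.63                 # = 5.28 ms per FITS file
files_300tb = 441_000_000 * (300 / 12)    # ≈ 11 billion files
hours = files_300tb * ms_per_file / 1000 / 3600
print(round(hours))  # ≈ 16170 hours
```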
As both processes are massively parallelizable and the MongoDB platform already supports sharding and map-reduce, the processing time can be greatly reduced by distributing the load across several servers. Substantial resources may be needed to bring the processing time down to a practical level, but this profile does not have to be re-generated frequently.

The C3PO tool also provides a web interface that supports real-time analytics on the gathered data. While this is not one of the requirements strictly required for automated monitoring, it provides interesting insights into content profiles that are considered highly valuable by the decision makers. However, in this scenario the limits of the web application are revealed. Several test runs were made with different data set sizes to ascertain the limits of the application. For each test run, two manual procedures on the web interface were performed: 1) opening the overview page, which calculates, in real time, distributions of several extracted features, and 2) drilling down into the characteristics of a subset of the collection, such as all the files of a given format.

Test run | # FITS files | Elements size (GB) | Number of properties | Overview processing time | Drill-down processing time
1 | 13,962 | 0.03 | 80 | fast | fast
2 | 108,348 | 0.26 | 96 | 18 sec | 11 sec
3 | 363,991 | 1.00 | 106 | 30 sec | 34 sec
4 | 1,020,514 | 2.46 | 113 | 2 min 25 sec | 1 min 42 sec
5 | 1,639,842 | 3.95 | 119 | 3 min 52 sec | 2 min 50 sec
6 | 2,683,596 | 6.44 | 119 | 6 min 28 sec | 4 min 25 sec
7 | 11,905,935 | 28.63 | 211 | not finished within 3 hours | N/A
8 | 441,923,560 | 1183.50 | 5122 | N/A | N/A

Table 2: Testing the limits of real-time analytics in the C3PO web application (Reimer et al. 2013)

Table 2 shows the results of the tests, which show acceptable results up to around 2.5 million files, with a waiting time of around 6.5 minutes. Above this limit, the web system does not respond within 3 hours, which is considered unacceptable.
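The reason both steps distribute well can be illustrated with a toy map-reduce over content shards. This is a schematic stand-in for the MongoDB sharding and map-reduce mentioned above, not C3PO's actual implementation: each shard is profiled independently (and could sit on a separate server), and the partial counts are merged cheaply at the end.

```python
from collections import Counter

def count_formats(shard):
    """Map step: profile one shard locally; shards are independent of each other."""
    return Counter(record["format"] for record in shard)

shards = [
    [{"format": "PDF"}, {"format": "TIFF"}],
    [{"format": "PDF"}, {"format": "PDF"}],
]
partials = [count_formats(s) for s in shards]  # embarrassingly parallel
total = sum(partials, Counter())               # reduce step: cheap merge of counts
print(total)  # Counter({'PDF': 3, 'TIFF': 1})
```

Because the reduce step only merges small per-shard aggregates, adding servers shortens the map phase almost linearly.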
Hence, it is not feasible to perform real-time analytics with the current solution on the set of 440 million FITS files of the 12 TB data set. It should be noted that the analysis of the entire dataset provides a view of over 400 million rows in a table with over 5,000 columns, with a resulting database size of over a terabyte.

G2 Enable monitoring of operational compliance, risks and opportunities

This section analyzes how the mechanisms presented in Becker et al (2014) can be used to accomplish proactive monitoring of operational compliance, risks and opportunities in a preservation environment.

Figure 16: Monitoring

As outlined in Figure 16, the key questions relate to the identification of aspects that need to be monitored and to the coverage of measures available to provide indicators related to these aspects. Based on an analysis of a reference model for drivers and constraints (Antunes et al. 2011), which classifies each of the influencers a preservation organization should be aware of, the discussion in (Becker, Duretec, et al. 2012) showed that relevant questions and measures can be derived for each of the influencers of interest. This enables the development of appropriate adaptors for measuring specific indicators pertaining to each driver. Table 3 shows key examples, while a full discussion and detailed table is provided in (Becker, Duretec, et al. 2012).

Driver | Question | Indicator | Sources
Content | Is the content volume growing unexpectedly? | Rate of growth changes dramatically in ingest | content profile, Repository Report API
Operations | Are our content profiles policy-compliant? | Mismatch between content profiles and policy statements | content profiles, control policy statements
Format | How many organizations have content in this format? | Number of shared content profiles containing a format | content profiles shared by organizations
Format | What is the predicted lifespan of format X? | Lifespan estimates based on historic profiles | model-based simulation

Table 3: Selected preservation drivers and related information sources (Becker, Duretec, et al. 2012)

In practice, the achieved coverage of measures is by no means complete, but increasing. Currently supported sources include format registries, semantic policies, content profiles, and an automated rendering and comparison tool (Law et al. 2012). A prioritization approach is taken to target first and foremost those aspects that are perceived as most critical. The open nature of the adaptor design, the data model, and the licensing model means that additional sources can be integrated by anybody in the preservation community, and the coverage is rising steadily.

Figure 17: Evaluation of monitoring compliance

Monitoring of operational plans is illustrated schematically in Figure 17 on a simplified example. Consider a preservation plan that evaluates four potential actions ("alternatives") against a set of four decision criteria. These criteria evaluate the important aspects of the data to be preserved, the environment, and the actions to be applied. Based on these criteria, the preservation actions in question are evaluated and a ranking is calculated. The planner then chooses the best-suited action and adopts it. In this example, a check mark denotes best-in-class performance, a tilde denotes acceptable performance, and a cross reflects unacceptable performance, for example a process that did not terminate or an image conversion that shows a distorted image. We can see that two alternatives have been rejected, and alternative 1 has the highest score and will be selected. [vii]

Since the decision criteria identified during planning lead to the adoption of a certain action, they must be monitored during operational executions as well to enable the organization to track whether the action keeps performing according to expectations. This is shown on the bottom left of the figure.
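The adaptor idea behind Table 3 can be sketched as follows. This is a hypothetical illustration; the class and function names do not reflect Scout's actual API. An adaptor delivers indicator values from a source, and a condition compares them against control policy statements:

```python
# Hypothetical sketch of the adaptor pattern: a source adaptor delivers
# measures (here, values from a content profile), and a monitoring condition
# reports mismatches against policy statements.
class ContentProfileAdaptor:
    """Pulls indicator values from one source, e.g. a repository's content profile."""
    def __init__(self, profile):
        self.profile = profile

    def measure(self, prop):
        return self.profile.get(prop)

def check_policy_compliance(adaptor, policy):
    """Return the properties whose measured value mismatches a policy statement."""
    return [p for p, expected in policy.items() if adaptor.measure(p) != expected]

profile = {"format": "TIFF", "valid": False}   # measured state of the holdings
policy = {"format": "TIFF", "valid": True}     # control policy: files must be valid
print(check_policy_compliance(ContentProfileAdaptor(profile), policy))  # ['valid']
```

A mismatch such as the one reported here would correspond to the "Are our content profiles policy-compliant?" question in Table 3 and trigger a notification to the planner.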
However, this is not the only aspect prone to evolution:

(1) New alternatives will emerge over time that may perform better than the chosen alternative. In cases where no alternative was acceptable, this will sometimes be the only aspect monitored, since the organization would wait for a better solution to become available before embarking on premature preservation actions. For example, this was the case in (Kulovits et al. 2009).
(2) Updated or new quality assurance tools can emerge that provide more reliable or more efficient measures, or even the first automated way to measure a relevant quality. For example, these could be of the kind described in (Jurik & Nielsen 2012) or (Bauer & Becker 2011).
(3) Related to this, experiments including certain criteria may be conducted by other individuals or organizations and can reveal risks and opportunities related to this plan. For example, the chosen quality assurance tool might be shown to malfunction on similar objects, which poses a major risk (Bauer & Becker 2011).
(4) Finally, the organization's objectives themselves may shift over time as goals change. This would be reflected by a change in the control policies.

The tool suite described in this article is designed to provide full support for this monitoring scenario. The upcoming release of Plato generates specifications describing the expected quality of service (QoS), similar to a service-level agreement (SLA), for the set of decision criteria considered, linked to the corresponding organizational policies, and deposits corresponding monitoring conditions in Scout upon deployment of a preservation plan. Such QoS specifications are created for those criteria in a tree which are influenced by the dynamic behavior of the service, i.e. the components. That means that they are not created for aspects relating to the format, such as the ISO standardization of PDF versions, but they do include criteria such as whether the created files are well-formed.
QoS is then measured within executable workflows and monitored for fulfillment. Aspects pertaining to the format and other non-dynamic aspects are monitored as risks and opportunities using Scout.

While Scout is able to collect a wide variety of measures, these are naturally limited by the availability of operations that support such measures. The controlled vocabulary encourages developers to declare which measures their tools deliver to support discovery, but the coverage of measures will naturally vary across different scenarios. It is important to note, however, that any required measures can be integrated by any organization due to the open nature of the ecosystem. Finally, transparency of the monitoring process is achieved through the usage of the permanent shared vocabulary and the explicit declaration of tolerance levels in the QoS, corresponding to the specified acceptance thresholds that are derived from the organization's control policies.

G3 Improve planning efficiency

Previous work has shown that the key challenge in planning is to make the decision making process more efficient (Kulovits et al. 2009; Becker & Rauber 2011c). In Becker et al (2014), we reflected on the dimension of trust, which should not be sacrificed along this quest. Correspondingly, the key questions shown in Figure 18 relate to the aspect of effort: How long does it take to create one preservation plan now, and how much further improvement is possible?

Figure 18: Efficient creation of trustworthy plans

Previous discussions have shown the trustworthiness of plans produced by Plato (Becker et al. 2009), which is based on measures of decision criteria directly linked to organizational goals and grounded in factual evidence, documented with full change tracking assigned to acting users. These strengths continue to form the backbone of trustworthy planning. While it is clear that fully automated, i.e.
autonomous, preservation planning is contradicting the goal of trustworthiness in this domain, the goal nevertheless must be to achieve a substantial increase in efficiency (Becker & Rauber 2011c). We will focus our discussion on measurements of effort on a controlled case study conducted with the Danish State and University Library, described in detail in (Kulovits et al. 2013). In this study, a set of responsible decision makers and experts from the library set out to create a preservation plan, with the assistance of a planning expert and a moderator who kept time of all activities throughout the planning process. The goal of planning was to create a preservation plan for a large set of audio recordings; the drivers and events motivating the plan included the goal to homogenize the formats of the library’s holdings and provide well-supported and efficient access to authentic content. The team at the library has comprehensive expertise in all relevant areas, which range from technical knowledge on audio formats and quality assurance mechanisms for comparing audio files to a documented understanding of the designated communities and preservation purpose of the content set at hand. The preservation plan was created using the then-current version 3 of the planning tool Plato, which is the precursor of the solution presented here. The goal was to identify the major areas of decision making effort and measure the potential improvement that can be realistically achieved. The total time required to create a preservation plan amounted to 35.5 person hours, completed over a period of two days. This shows on the one hand that efficient teams in well-established settings can already plan quite efficiently. Nevertheless, the effort must be further reduced to make planning truly a part of “business-as-usual” preservation in practice. 
To contextualize the effort required in this case, it is important to understand that this effort strongly depends on a well-defined understanding of the decision making context, including the understanding of the goals and constraints; the expertise of decision makers; and the technical proficiency of the staff carrying out the experimental steps of preservation planning. Finally, a strong driver for cost is the homogeneity of content: For large object sets that are very diverse, several preservation plans will have to be created, each respecting to a certain degree the specific aspects of a subset of the content and the means available to ensure access to this subset.

Figure 19: Distribution of effort across activities in preservation planning (Kulovits et al. 2013)

Figure 19 shows the distribution of effort across each of the types of activities that were part of this planning process. It should be noted that several of these activities were in fact on the upper end of the efficiency range, for several reasons:

● Experiment execution often takes more time. The experiments conducted were highly efficient due to the small number of alternatives evaluated, the high technical proficiency of staff, the homogeneity of content, and the quality assurance mechanisms employed. In many cases, the experimentation process consumes a multiple of this time. The integration of Taverna workflows and myExperiment can reduce this massively, since potential components can be discovered and automatically invoked within planning. This automation is similar to an existing integration of automated measures in Plato (Becker & Rauber 2011a), but makes these mechanisms available on an open, standardized and easily extensible basis.
● Background information often is unavailable. This applies in particular to the user communities and the statements of preservation intent that many organizations are only now beginning to document systematically (Webb et al. 2013).
The organization in question, however, has a stable and well-supported definition of collections and user communities, from which the preservation goals could be derived rather efficiently. Formal policy specification makes this background explicit and known to the systems, so that the effort can be further reduced.

● Analysis and verification is complex. Even with the support of a planning expert, 14% of the time was spent in sense-making: analyzing the completed set of evidence and assessments in the decision making tool to arrive at a conclusion that was well understood by the stakeholders. This points to the need for improving the decision support tool to visualize results in a more easily understandable and user-friendly way. Improved summaries in Plato are planned to this end.
● Entering data into the system is tedious, in particular for users not familiar with the tools. This was alleviated by the involvement of a planning expert familiar with the tool. Similar to other aspects, this benefits greatly from the integration of the tool with workflows and from the explicit endowment of Plato with an understanding of the policy models of organizations. A subsequent controlled experiment showed that Plato 4 reduced this effort by over 50% (Kulovits et al. 2013).

In an ideal case, the effort required to cover the above aspects (software testing, background information, analysis and verification, and data input) can be removed almost entirely. Still, 50% of the time in this case would be spent discussing requirements. However, the majority of these requirements concern objectives about formats, significant properties, and technical encoding or representation (Becker & Rauber 2011a). For all of these aspects, standard definitions are now available as part of the controlled vocabulary that enable decision makers to reuse definitions and formalize these aspects on a policy level, removing this activity from the operational planning process.
This applies to the designated community and preservation intent statement as well as to format and risk factors and technical constraints. The control policy statements thus can reduce this effort by enabling reuse of these goals and constraints across plans. As the discussion on Goal 4 will show, the context awareness of Plato can eliminate the need for in-depth discussions of requirements as part of planning almost entirely.

For an organization that establishes planning as a proper function in its roles and responsibilities and possesses a solid skills and expertise base, we estimate that preservation planning should on average take about one to two person-days per plan, provided that policies and content profiles are known and documented. However, a large variance across organizations is to be expected. This estimate will strongly depend on a variety of specific factors and certainly needs to be further validated in longer-term empirical studies. These should in particular also cover the question of the homogeneity of content sets covered in a plan: How many plans are required to safeguard a particular heterogeneous set of objects? A detailed discussion of the activities in this process and the relevant skills and expertise is presented in (Kulovits et al. 2012).

G4 Make systems aware of their context

Figure 20: Context awareness

By providing a model that enables decision makers to formulate policies so they can be understood by automated processes, the systems can understand their context and stay informed about its state. To assess the context awareness of the systems in question, we investigate three distinct aspects: First, the context needs to be well understood and modeled in order to ensure a solid approach has been taken. Second, each of the systems needs to demonstrate that it can use the part of the context that is relevant for its function appropriately.
Finally, it is crucial to ensure that this does not come at the cost of coupling the context too closely to the systems. To this end, we discuss how this context can evolve independently of each of the systems. This is illustrated in Figure 20. The modular approach of the semantic models has been discussed in Becker et al. (2014). A detailed documentation of the model is provided in (Kulovits et al. 2013b). The models are based on W3C-approved standards and follow established Linked Data principles. At the heart of the model is the Resource Description Frameworkviii (RDF), a standard model for representing data and metadata as subject-predicate-object triples. The Web Ontology Languageix (OWL) provides the mechanisms for the description of vocabularies, defining classes and properties. These are used to annotate, describe and define resources. Having well-defined semantics, OWL facilitates reasoning, ontology management and the querying of data. The model is represented as an RDF graph and queried using SPARQL. The vocabulary domains have permanent identifiers according to the following ontologies:
● http://purl.org/DP/preservation-case contains the basic elements that link a preservation case together.
● http://purl.org/DP/quality describes the quality ontology, linking attributes and measures in a domain-specific quality model.
● http://purl.org/DP/quality/measures contains the vocabulary individuals that are used for annotating, describing and discovering measures and the mechanisms for measuring.
● http://purl.org/DP/control-policy, finally, defines the classes of objectives relevant for making a preservation case operational.
Each of the systems presented is aware of those parts of the model that are relevant for its domain. Correspondingly, each system shows its awareness of this model in different ways. Plato uses the control policy model in several ways. First, the preservation case provides the basic cornerstones of planning.
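To make the triple-based model concrete, the sketch below represents a preservation case and one control-policy objective as subject-predicate-object triples and answers a simple query in plain Python, so that it stays self-contained. The property names (hasObjective, measure, acceptableValue) are illustrative assumptions rather than the published vocabulary; in practice the graph lives in a triple store and is queried with SPARQL.

```python
# Illustrative triples in the spirit of the purl.org/DP ontologies. The
# property names below are assumptions for this sketch, not the published
# vocabulary; RDF triples always take the form (subject, predicate, object).
CP = "http://purl.org/DP/control-policy#"
PC = "http://purl.org/DP/preservation-case#"
QM = "http://purl.org/DP/quality/measures#"

triples = {
    (PC + "case1", "rdf:type", PC + "PreservationCase"),
    (PC + "case1", CP + "hasObjective", CP + "obj1"),
    (CP + "obj1", CP + "measure", QM + "55"),
    (CP + "obj1", CP + "acceptableValue", "none"),  # e.g. no compression allowed
}

def match(s=None, p=None, o=None):
    """Match a single triple pattern; None acts as a wildcard, which is
    roughly what one SPARQL triple pattern does."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# SPARQL equivalent: SELECT ?obj WHERE { pc:case1 cp:hasObjective ?obj }
objectives = [o for (_, _, o) in match(s=PC + "case1", p=CP + "hasObjective")]
```

Because every statement is a uniform triple, each tool can extract just the subgraph relevant to its function, which is what enables the selective awareness described next.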
Instead of providing the documentation of the planning context in textual form, as was previously standard practice (Becker & Rauber 2011c), a planner who has specified the policy model selects a preservation case to start planning, and the contextual information for this case is extracted from the policy model. Additionally, the objectives and measures specified in the control policy enable the decision support tool to derive the complete goal hierarchy automatically from the model, leaving it to the decision maker only to revise, verify and confirm the decision criteria to be used for the experimental evaluation. In the case study discussed above, this requirements specification alone accounted for 30% of the effort. While the policies of course require a similar discussion, much of the objective specification has to be discussed only once and can then be carried forward across preservation cases, which represents a substantial efficiency gain as soon as more than one plan is created. Similarly, the acceptable values, and hence the utility functions associated with each measure, can be computed in a straightforward way from the objectives specified in the control policies, which represented a potential gain of another 17% in the case study discussed above. C3PO uses the vocabulary relevant to characterization in the content profile, referencing elements from the quality measures catalogue (such as http://purl.org/DP/quality/measures#55). Given that it only provides objective analytics of factual statements about the domain elements, it has no understanding of the policy model and does not require any. Scout, finally, leverages the policy model to monitor the alignment of operations and plans with the policies, and also monitors the policy itself: if it is updated, affected plans should be re-evaluated. Specific standard queries are provided as templates that monitor policy compliance. These can be activated by the user.
For example, Figure 21 shows Scout starting a monitoring activity on the policy conformance of a specific content set (identified by a collection key). In this case, a preview shows that the property "compression scheme" is violated by 3 entries, and Scout provides the option to create a continuous monitoring process by specifying a trigger with a condition and an event. The model of the context is thus shared between the tools, with the decision maker updating the ontology independently of them. A crucial requirement is that the context model can evolve independently of the systems. This is especially important considering that the current model is strongly focused on operational support and could benefit greatly from being expanded to cover aspects of decision making that are further removed from operations. Similarly, it can be expected that meaningful linkages will surface connecting the existing ontologies to emerging ontologies from neighboring areas, ranging from software quality and ontologies describing software dependencies and platforms to preservation metadata and related policies. The potential for such evolution is guaranteed by the choice of representation and languages, since the Linked Data principles that the model adheres to are designed with these very goals in mind.
Figure 21: Checking collection policy conformance in Scout
G5 Design for loosely-coupled preservation ecosystems
The design goal of loosely-coupled systems is relevant for several reasons. On the one hand, it is crucial to enable the stepwise adoption approach preferred by many organizations (Sinclair et al. 2009). On the other hand, it ensures that evolution can take place independently, enabling each organization to replace parts of its system without negatively affecting continued operations, and enables each component of the ecosystem to be sustained independently (to a degree) of the others.
Figure 22: Loosely-coupled preservation ecosystems
Figure 22 relates these goals to more specific questions. While it is clear that the components are open source, licensed under OSI-approved conditionsx, and highly modular, it is useful to consider both the functional specifications and the data structures closely. The API specifications for the SCAPE Planning and Watch suite are in the process of being published openly on github. Data exchanged between components is standardized and supported by schemas, as shown in Table 4.
● All functional interfaces openly published — Plato: in progress; Scout: in progress; C3PO: in progress.
● All data structures documented using standards and schemas — Plato: XML schemas published for each version; Scout: Linked Data model, policy model; C3PO: XML schema published.
● Component is used independently — Plato: yes; Scout: yes; C3PO: yes.
● Component follows the controlled vocabulary — Plato: objectives, measures, control policies, preservation cases; Scout: objectives, measures, control policies; C3PO: measures.
Table 4: Interoperability of components
The controlled vocabulary, as the glue that connects much of the ecosystem, is maintained on githubxi. Curating this vocabulary over the long term will be sustained by a community effort. Recent discussions in the metadata and preservation communities have brought forward long-term requirements for such evolution that will be considered carefully (Gallagher 2013a; Gallagher 2013b). The components are functionally independent in that every component can be, and actually is, used independently. Nevertheless, it is clear that the compound value proposition is larger than the sum of its parts, serving to encourage take-up of the suite as a whole. Similarly, the usage of this tool suite benefits greatly from integration with the workflow development, execution and sharing platforms Taverna and myExperiment, whose latest releases provide specific support for semantic annotation, driven by the requirements outlined in this article.
Since such an ecosystem should be built with sustainable evolution in mind, we consider a recent discussion that identified eleven factors affecting the sustainability of a modular preservation system (Gallagher 2013a; Gallagher 2013b). Table 5 shows how our system performs on each of these criteria.
Sustainability factorsxii — How does the SCAPE Planning and Watch suite perform?
● Ability to view and modify source code: All components are openly licensed, and all source code is freely available in a github repository.
● Widely used: C3PO and Scout are relatively new but enjoying quick take-up in the community, while Plato has grown to over 1000 user accounts since its first publication in 2008. However, usage so far has been limited to prototypical evaluation rather than production-level deployment, mostly due to the level of effort involved.
● Well tested, few bugs or security flaws: All tools support automated tests and have an active ticketing system, and the major releases are considered very stable. No security incident has been reported so far.
● Actively developed, supported: All tools are part of an active development community, continuously supported, and the development platform is hosted by the Open Planets Foundationxiii.
● Standards aware: All components follow standards on multiple levels wherever possible, ranging from standard technologies such as Java Server Faces to XML Schema declarations and Linked Data principles.
● Well documented: All components have extensive code documentation, manuals, built-in help and tutorials, as well as scientific publications explaining the theoretical foundations and practical implications of the software.
● Unrestricted licensing: All software components are licensed under OSI-approved open licenses such as LGPL and Apache Software License 2.0. All documentation is licensed under a Creative Commons license.
● Ability to import and export data and code: Preservation plans, executable plans and content profiles can be freely imported, exported, and shared between users. The Scout knowledge base is a Linked Data triple store and hence equally portable.
● Compatible with multiple platforms: Being based on standard server technologies, all components are compatible with multiple platforms. Plato even integrates with multiple platforms at once in the case of preservation action discovery (Kraxner et al. 2013).
● Backward compatible: This is particularly relevant for Plato, which has been an online service since 2008. There is full backward compatibility, with a fully traceable forward conversion upon import of legacy preservation plans. All plans created on the online service have been automatically migrated for all releases. Similarly, the knowledge base of Scout is designed to keep growing incrementally, without disposing of accumulated historical data.
● Minimal customization: Almost no customization is required, since all contextual adaptation of the systems' behavior can be achieved through the configuration of API endpoints and the corresponding definition of control policies.
Table 5: Sustainability evaluation of the SCAPE Planning and Watch suite
While the ecosystem is well positioned for future sustainability, there is still room for improvement. This includes the development and publication of Technical Compatibility Kits that can automatically test the functional compliance of a component with an API specification, as has been done for the Data Connector APIxiv, but also the long-term evolution of vocabularies and any future extensions of the tool suite.
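The idea of such a compatibility kit can be illustrated with a minimal sketch: given an API specification, reduced here to a set of required operations plus one behavioral contract, the kit checks whether a component complies. The operation names and the in-memory component are hypothetical stand-ins, not the actual SCAPE APIs or the Data Connector TCK.

```python
# Sketch of a Technical Compatibility Kit (TCK). The required operations are
# illustrative assumptions; a real TCK would exercise a deployed endpoint of
# the API under test rather than a local object.
REQUIRED_OPERATIONS = {"list_plans", "get_plan", "deploy_plan"}

class InMemoryPlanManager:
    """A stand-in component implementing the hypothetical plan management API."""
    def __init__(self):
        self._plans = {}

    def list_plans(self):
        return sorted(self._plans)

    def get_plan(self, plan_id):
        return self._plans[plan_id]

    def deploy_plan(self, plan_id, plan):
        self._plans[plan_id] = plan

def check_compliance(component):
    """Check structural compliance (all operations exist) and one behavioral
    contract (a deployed plan is retrievable and listed)."""
    missing = sorted(op for op in REQUIRED_OPERATIONS
                     if not callable(getattr(component, op, None)))
    if missing:
        return False, missing
    component.deploy_plan("p1", {"action": "migrate"})
    ok = (component.get_plan("p1") == {"action": "migrate"}
          and "p1" in component.list_plans())
    return ok, missing

compliant, missing = check_compliance(InMemoryPlanManager())  # (True, [])
```

The value of such a kit is that any implementer, not only the original developers, can verify that a replacement component will slot into the loosely-coupled ecosystem without breaking its neighbors.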
5.3 Practical adoption
Considering the preservation lifecycle outlined in Becker et al. (2014), what does the availability of the described system mean for an organization that has content and a preservation mandate, has set up a reasonable organizational structure and defined corresponding responsibilities, but has not yet ventured to create and maintain specific, actionable preservation plans? The exact measures to be taken will certainly depend on the specific institutional context, but essentially, such an organization can follow a series of steps.
1. Getting started entails several aspects.
a. Start content profiling: Run format identification and characterization components such as fits on the set of content to extract metadata, deploy the content profiling tool C3PO, gather the metadata, and conduct an analysis of the content profile.
b. Sign up with SCAPE Planning and Watch, either on the online servicexv or on an organization-specific deployment based on a public code releasexvi.
c. Connect the organization's repository with SCAPE Planning and Watch, either by configuring a standard adaptor or by implementing a specific adaptor.
2. Specify control policies based on a thorough analysis of the organization's collections, the user communities, and the preservation cases that are considered relevant.
3. Activate the monitoring of policies and content profiles in Scout to detect policy violations.
4. Create preservation plans to increase the alignment of the organization's content and operations with the goals declared in the policies. This planning is done by evaluating action components using characterization and QA components in Taverna workflows, all integrated in planning. The finished plans contain workflow specifications, including QoS criteria that can be monitored automatically.
5. Deploy the operational plans to the repository through the plan management API, connected to a workflow engine such as Taverna.
6.
Establish responsibility for continuous monitoring. This is supported by Scout, which will monitor the compliance of operations with plans and detect risks and opportunities connected to these plans and policies.
5.4 Limitations
From the discussion above, a number of limitations can be observed. These can be divided into limitations of the current capabilities of available tools, which can be expected to grow; more fundamental limitations of current approaches, which require new perspectives to be overcome; limitations of the problem space that set natural limits to further improvement; and limitations on the quantitative evaluation that can feasibly and meaningfully be conducted. This section discusses those limitations that are seen as central to the further advancement of the state of the art.
Coverage and correctness of available measurement techniques
The availability of tools and mechanisms to deliver objective and well-defined measures that are shown to be correct and reliable is a key challenge holding back operational preservation today (Becker & Rauber 2011c; Becker & Duretec 2013). Scout supports a growing set of adaptors to feed measures into the knowledge base and, by the nature of its design, alleviates some of the shortcomings and gaps in existing tools through the free combination of multiple information sources, but it is still limited by the availability of these information sources. Similarly, experiment automation in Plato and, equally important, the feasibility of large-scale preservation operations in general are entirely dependent on the existence of well-tested, efficient and effective mechanisms for quality assurance. Recent work is showing promising advances (Jurik & Nielsen 2012; Bauer & Becker 2011; Law et al. 2012), but a wide gap must still be addressed for preservation operations to be broadly supported.
It seems crucial that this gap is made explicit and shared with a wide community, so that efforts to close it can be based on a solid assessment of the shortcomings of existing tools rather than on the isolated, ad-hoc identification of application scenarios within single institutions, as is often practiced today. Scalable preservation operations are only possible with fully automated, reliable and trustworthy quality assurance; and such quality assurance is expensive to develop and difficult to verify. Only through coordinated community efforts based on solid experimentation can the evidence be constructed to make a convincing case for authenticity (Bauer & Becker 2011). The utter lack of solid, reliable and open benchmark data sets with full ground truth is a fundamental inhibitor to validating the correctness of such measures. To address this gap, we are investigating innovative approaches that turn the publication of test data sets around: from ex-post annotation, inherently plagued by unreliable ground truth and copyright problems, to an open, model-driven generative approach (Becker & Duretec 2013).
Scalable distributed and cost-efficient processing: How to profile a Petabyte?
As shown above, the content profiling tool C3PO provides support for scaling out on distributed platforms. However, it requires considerable resources if the content to be profiled approaches the Petabyte range, and visual analytics are not currently supported on such amounts of data. Yet, it is important to point out that the core goals of content profiling are achieved regardless of collection size: visual analytics is an additional capability on top of the processing activity.
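Since the core profiling goals reduce to aggregate statistics, a small-footprint sequential profiler can be sketched in a few lines: it streams per-object characterization records and keeps only running counts, so memory use is bounded by the number of distinct values rather than the collection size. The record fields here are illustrative assumptions, not the actual FITS or C3PO schema.

```python
# Minimal sketch of a streaming, single-pass content profiler: it aggregates
# format and validity counts from per-object characterization records without
# ever holding the collection in memory.
from collections import Counter

def profile(records):
    formats, validity = Counter(), Counter()
    total = 0
    for rec in records:  # one pass; memory bounded by distinct property values
        total += 1
        formats[rec.get("format", "unknown")] += 1
        validity[rec.get("valid", "unknown")] += 1
    return {"objects": total, "formats": formats, "validity": validity}

# A tiny stand-in for a stream of characterization records:
stream = iter([{"format": "PDF/A-1b", "valid": "true"},
               {"format": "TIFF", "valid": "true"},
               {"format": "PDF/A-1b", "valid": "false"}])
result = profile(stream)  # result["formats"]["PDF/A-1b"] == 2
```

Because the profiler touches each record exactly once and keeps no per-object state, the same logic can run sequentially on a laptop or be distributed as the combine step of a map-reduce job.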
To enable the cost-efficient creation of large content profiles without visual analytics requirements, we are exploring purely sequential profilers with a small footprint as a low-cost alternative, and we are investigating a set of techniques for feature-space pruning and dimensionality reduction prior to the more expensive processing steps. Similar considerations apply to preservation operations such as actions, characterization, and quality assurance. As noted, the execution of fits on the 440M resources of the Danish web archive took a year to complete, which clearly indicates the need for improvement. Similarly, automated QA mechanisms are computationally demanding (Bauer & Becker 2011). These processes need to be supported by parallel execution environments and more efficient algorithms to be truly applicable to large-scale volumes.
The element of human decision making
As observed above, trustworthy preservation should always be driven by careful decision making and factual evidence. While this element of human decision making can reasonably be minimized, replacing it entirely will only become possible once a solid, substantial knowledge base of real-world cases populates the ecosystem described above. Eventually, the human element can in the ideal case be reduced to a policy specification activity and a monitoring oversight function. This is clearly out of scope for this article, but it provides the logical next step in research on preservation planning and monitoring.
Trust and maturity
The assessment of complex socio-technical systems such as the one presented here is challenging. Arguably, it will not be complete without an enterprise governance view incorporating a set of dimensions on the level of organizational process performance and maturity. A first view on this perspective has been presented in (Becker et al.
2011), where a process and maturity model for preservation planning was outlined and aligned with the IT Governance framework COBIT (IT Governance Institute 2007). Current efforts are building on this work to develop a full-fledged process and capability maturity model that will support organizations in the systematic improvement of their preservation capabilities.xvii
5.5 Summary
This section discussed each of the key design goals of the architecture and system presented in Becker et al. (2014) and conducted a quantitative and qualitative evaluation of the key objectives for each of the goals. We showed that the system significantly improves on the existing state of the art in digital preservation by combining a context-aware business intelligence support tool with a scalable mechanism for content profiling, both integrated with a successor of the standard preservation planning tool Plato that shows substantial efficiency gains over previous solutions. While there are limits to the scale of content that can be profiled, analyzed and preserved in limited amounts of time, the improvements show that preservation planning and monitoring can realistically be advanced to a continuous preservation management function integrated with operational systems. This will provide a substantial step forward for the many organizations that are looking for ways to enable their repositories to truly support the long-term access promise that digital preservation has set out to deliver (Hedstrom 1998). We pointed out a number of limitations that currently hold back further progress, and outlined current efforts to tackle them.
3. Conclusion and Outlook
Ensuring the longevity of digital assets across time and changing social and technical environments requires continuous action. The volumes of today's digital assets make effective business intelligence and decision support mechanisms crucial in this enterprise.
While the purely technical scalability of data processing can be handled using state-of-the-art technologies, curators require specific decision support to enable the large-scale management of digital assets over time. This demands a set of systems and services that facilitate scalable in-depth content analysis, intelligent information gathering, and efficient decision support, designed as loosely-coupled systems that are able to interact and connect to the wider preservation context. This article presented a systematic assessment and evaluation of the SCAPE Planning and Watch suite presented in Becker et al. (2014). The results of the assessment demonstrate that full preservation lifecycle support can be deployed into preservation systems of real-world scale by adopting a loosely-coupled, open and extensible suite of preservation tools that each support particular aspects of the core preservation planning and monitoring capabilities:
1. Scalable content profiling is supported by the highly flexible and efficient content profiler C3PO, which has been tested on a data set of 441 million files.
2. Monitoring of compliance, risks and opportunities is supported by the monitoring system Scout, which provides an extensible open platform for drawing together information from a variety of sources to support the much-needed business intelligence insights that are key to continued preservation success.
3. Preservation planning efficiency is being continuously improved as the ecosystem grows, and recent advances show that planning can become a well-understood and managed activity of repositories.
4. Context awareness of each of the systems is supported by a shared permanent vocabulary set to grow over time through extensions with related ontologies, connecting the domains of solution components and the preservation community with the organizational policies and the decision support and control systems presented here.
5.
Loose coupling of the components in this ecosystem guarantees that organizations can follow an incremental approach to improving their preservation systems and capabilities.
We discussed the evaluation of key aspects of each tool as well as of the ecosystem as a whole, and outlined the key benefits and advances over the existing state of the art. Based on the limitations identified, we define a number of key goals for future research. These include the real-time profiling of very large data sets in the Petabyte range; the benchmarking of automated tools against solid, reliable ground truth in open, fully transparent experiments with shared data sets; and a systematic framework for assessing the performance of organizations in terms of process metrics and organizational maturity.
Acknowledgements
Part of this work was supported by the European Union in the 7th Framework Program, IST, through the SCAPE project, Contract 270137, and by the Vienna Science and Technology Fund (WWTF) through the project BenchmarkDP (ICT12-046).
References
Antunes, G. and Borbinha, J. and Barateiro, J. and Becker, C. and Proenca, D. and Vieira, R. (2011), “SHAMAN reference architecture”, version 3.0, SHAMAN project report.
Basili, V.R. and Caldiera, G. and Rombach, H.D. (1994), “The Goal Question Metric Approach”, Encyclopedia of Software Engineering, Volume 2, John Wiley, pp. 528–532.
Bauer, S. and Becker, C. (2011), “Automated Preservation: The Case of Digital Raw Photographs”, in Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Proceedings of the 13th International Conference on Asia-Pacific Digital Libraries (ICADL 2011), Beijing, China, Springer-Verlag.
Becker, C. and Antunes, G. and Barateiro, J. and Vieira, R. and Borbinha, J. (2011), “Control Objectives for DP: Digital Preservation as an Integrated Part of IT Governance”, in Proceedings of the ASIST Annual Meeting, 2011, New Orleans, USA: American Society for Information Science and Technology.
Becker, C.
and Kraxner, M. and Plangg, M. and Rauber, A. (2013), “Improving decision support for software component selection through systematic cross-referencing and analysis of multiple decision criteria”, in Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS), 2013, Maui, USA, pp. 1193–1202.
Becker, C. and Duretec, K. and Petrov, P. and Faria, L. and Ferreira, M. and Ramalho, J.C. (2012), “Preservation Watch: What to monitor and how”, in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Becker, C. and Duretec, K. and Faria, L. (2014), “Scalable Decision Support for Digital Preservation”, to appear in OCLC Systems & Services, Volume 31, Number 1.
Becker, C. and Kulovits, H. and Guttenbrunner, M. and Strodl, S. and Rauber, A. and Hofman, H. (2009), “Systematic planning for digital preservation: evaluating potential strategies and building preservation plans”, International Journal on Digital Libraries, Volume 10, Issue 4, pp. 133–157.
Becker, C. and Duretec, K. (2013), “Free Benchmark Corpora for Preservation Experiments: Using Model-Driven Engineering to Generate Data Sets”, in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2013, Indianapolis, USA, pp. 349–358.
Becker, C. and Rauber, A. (2011a), “Decision criteria in digital preservation: What to measure and how”, Journal of the American Society for Information Science and Technology, Volume 62, Issue 6, pp. 1009–1028.
Becker, C. and Rauber, A. (2011c), “Preservation Decisions: Terms and Conditions Apply. Challenges, Misperceptions and Lessons Learned in Preservation Planning”, in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2011, Ottawa, Canada, pp. 67–76.
Gallagher, M. (2013a), “Improving Software Sustainability: Lessons Learned from Profiles in Science”, in Proceedings of Archiving 2013, Washington D.C., USA, pp. 74–79.
Gallagher, M.
(2013b), “Why can’t you just build it and leave it alone?”, retrieved from http://blogs.loc.gov/digitalpreservation/2013/06/why-cant-you-just-build-it-and-leave-it-alone/.
Hedstrom, M. (1998), “Digital Preservation: A Time Bomb for Digital Libraries”, Computers and the Humanities, Volume 31, Issue 3, pp. 189–202.
ISO (2010), “Space data and information transfer systems – Audit and certification of trustworthy digital repositories (ISO/DIS 16363)”, International Standards Organisation.
IT Governance Institute (2007), COBIT 4.1 Framework.
Jurik, B. and Nielsen, J. (2012), “Audio Quality Assurance: An Application of Cross Correlation”, in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Kulovits, H. and Rauber, A. and Kugler, A. and Brantl, M. and Beiner, T. and Schoger, A. (2009), “From TIFF to JPEG2000? Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings”, D-Lib Magazine, 2009, Volume 15, Number 11/12.
Kulovits, H. and Becker, C. and Rauber, A. (2012), “Roles and responsibilities in digital preservation decision making: Towards effective governance”, in The Memory of the World in the Digital Age: Digitization and Preservation, 2012, Vancouver, Canada.
Kulovits, H. and Becker, C. and Andersen, B. (2013a), “Scalable preservation decisions: A controlled case study”, in Proceedings of Archiving 2013, Washington D.C., USA, pp. 167–172.
Kulovits, H. and Kraxner, M. and Plangg, M. and Becker, C. and Bechhofer, S. (2013b), “Open Preservation Data: Controlled vocabularies and ontologies for preservation ecosystems”, in Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES) 2013, Lisbon, Portugal.
Law, M.T. and Thome, N. and Gançarski, S. and Cord, M.
(2012), “Structural and visual comparisons for web page archiving”, in Proceedings of the 2012 ACM Symposium on Document Engineering (DocEng ’12), New York, NY, USA, pp. 117–120.
OCLC and CRL (2007), “Trustworthy Repositories Audit & Certification: Criteria and Checklist”.
Petrov, P. and Becker, C. (2012), “Large-scale content profiling for preservation analysis”, in Proceedings of the 9th International Conference on Preservation of Digital Objects (iPRES) 2012, Toronto, Canada.
Ross, S. and McHugh, A. (2006), “The Role of Evidence in Establishing Trust in Repositories”, D-Lib Magazine, 2006, Volume 12, Number 7/8.
Sinclair, P. and Billenness, C. and Duckworth, J. and Farquhar, A. and Humphreys, J. and Jardine, L. (2009), “Are You Ready? Assessing Whether Organisations Are Prepared for Digital Preservation”, in Proceedings of the 6th International Conference on Preservation of Digital Objects (iPRES) 2009, San Francisco, USA, pp. 174–181.
Webb, C. and Pearson, D. and Koerbin, P. (2013), “Oh, you wanted us to preserve that?! Statements of Preservation Intent for the National Library of Australia’s Digital Collections”, D-Lib Magazine, 2013, Volume 19, Number 1/2.
i http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
ii http://www.openplanetsfoundation.org/blogs/2012-11-06-running-apache-tika-over-arc-files-using-apache-hadoop
iii http://hadoop.apache.org
iv https://tika.apache.org/1.4/formats.html
v http://en.statsbiblioteket.dk
vi http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
vii In Plato, the scoring functions range between 0 and 5, with 0 being unacceptable, and are aggregated across the goal hierarchy. This is discussed in detail in (Becker et al. 2013).
viii http://www.w3.org/RDF/
ix http://www.w3.org/TR/owl2-overview/
x http://opensource.org/licenses
xi https://github.com/openplanets/policies
xii (Gallagher 2013a; Gallagher 2013b)
xiii http://openplanetsfoundation.org/
xiv https://github.com/fasseg/scape-tck
xv http://www.ifs.tuwien.ac.at/dp/plato/
xvi https://github.com/openplanets/plato
xvii http://www.benchmark-dp.org/