Annotation as a New Paradigm in Research Archiving
Two Case Studies: Republic of Letters - Hebrew Text Database

Dirk Roorda, Data Archiving and Networked Services (KNAW), P.O. Box 93067, 2509 AB Den Haag, Netherlands, dirk.roorda@dans.knaw.nl
Charles van den Heuvel, Huygens ING (KNAW), P.O. Box 90754, 2509 LT Den Haag, Netherlands, charles.van.den.heuvel@huygens.knaw.nl
KNAW = Royal Netherlands Academy of Arts and Sciences, knaw.nl

ABSTRACT
We outline a paradigm to preserve results of digital scholarship, whether they are query results, feature values, or topic assignments. This paradigm is characterized by using annotations as multifunctional carriers and making them portable. The testing grounds we have chosen are two significant enterprises, one in the history of science and one in Hebrew scholarship. The first one (CKCC) focuses on the results of a project in which a Dutch consortium of universities, research institutes, and cultural heritage institutions experimented for four years with language technology and topic modeling methods with the aim of analyzing the emergence of scholarly debates. The data: a complex set of about 20,000 letters. The second one (DTHB) is a multi-year effort to express the linguistic features of the Hebrew bible in a text database, which is still growing in detail and sophistication. Versions of this database are packaged in commercial bible study software. We state that the results of these forms of scholarship require new knowledge management and archive practices. Only when researchers can build efficiently on each other's (intermediate) results can they achieve the aggregations of quality data by which new questions can be answered and hidden patterns visualized. Archives are required to find a balance between preserving authoritative versions of sources and supporting collaborative efforts in digital scholarship. Annotations are promising vehicles for preserving and reusing research results.

Keywords
annotation, portability, archiving, queries, features, topics, keywords, Republic of Letters, Hebrew text databases.

INTRODUCTION
In the early modern history of Europe, letters were by far the most important means of communication and played a role in the emergence of scholarly communities. Although from the 1660s onwards this one-to-one means of exchanging knowledge was gradually replaced by a more public form of scholarly communication via learned periodicals, communication via letters did continue. Given the role of the letter in scholarly communication and the emergence of scientific communities in Europe, it is not surprising that the so-called "Republic of Letters" became a recurrent theme in the history of the humanities and sciences. With the introduction of digital tools, various new projects were set up to map the exchange of knowledge and to analyze the creation of scholarly networks in Europe. The beautiful visualizations of the project Mapping the Republic of Letters of the Stanford Humanities Research Center made headlines in the New Yorker. In Europe, Oxford University, in the Cultures of Knowledge project, is building a large repository to make research on the Republic of Letters available at an international level. Although we cooperate with these consortia to create a Digital Republic of Letters, one of the two projects discussed here, Circulation of
Knowledge and learned practices in the 17th century Dutch Republic: A Collaboratory around Correspondences (CKCC; project website: ckcc.huygens.knaw.nl; see also Roorda, Bos & van den Heuvel, 2010), is different in geographical range and in its analytical depth. First, CKCC does not follow the correspondences of all European scientists, but of those scholars who lived or sojourned extensively in the Low Countries. The scientific revolution of the 17th century was driven by discoveries at sea, in observatories, in workshops of artisans, and in libraries. The Dutch Republic, with its global trade network, its book printing industry, and its relative tolerance of religious differences, became a refuge for intellectuals from around Europe: an information society avant la lettre. As such, it is an interesting counterpart to traditional studies in which knowledge production is primarily described as a scientific revolution driven by protagonists in the Galileo-Descartes-Newton tradition. A second difference with the above-mentioned projects is in the depth of analysis. Instead of focusing on metadata to explore the exchanges of knowledge between scholars in Europe and overseas, CKCC focuses on the data, on the letters themselves. It tries not only to answer how knowledge was disseminated in correspondences, but also to establish how new information was picked up, processed, and finally accepted in scholarly communications. What is the impact of the correspondences, and how did new scientific topics and scholarly debates around them emerge? To answer those questions CKCC digitized the corpora of published editions and of unpublished letters from the scholars Caspar Barlaeus (1584-1648), Isaac Beeckman (1588-1637), René Descartes (1596-1650), Hugo Grotius (1583-1645), Constantijn Huygens (1596-1687), Christiaan Huygens (1629-1695), Dirck Rembrandtsz van Nierop (1610-1682), Johannes Swammerdam (1637-1680), and Anthoni van Leeuwenhoek (1632-1723). Software has been developed to analyze this machine-readable corpus of approximately 20,000 letters, to detect topics, and to visualize meaningful patterns in the networks of scholars that discuss them, using a combination of text mining, topic modeling, language technology, and visualization techniques. This analysis works in two different ways: a researcher can query the database with specific keywords, which returns a presentation of all the letters in which these words occur. Apart from the fact that all the queried keywords light up in the text, the computer generates the most frequent words (in a different color) in relation to them. In this way a researcher can test hypotheses about the expected outcomes of her/his queries, but at the same time serendipity has a better chance because of the computer-generated terms, which may convey unexpected meanings that have to be put into context by additional research. The complexity of this set is mainly caused by the multiple languages in historical forms that occur in the corpus: Latin, French, Italian, Dutch, and English, not to mention the spelling variants. After several years of experimenting we are entering a phase in which the database can be opened up in a web-based collaboratory and more data added. Moreover, the data is to be enriched with annotations. We state that the results of these experiments with topic modeling, language detection, and visualization require new knowledge management and archive practices.
To that end, we will formulate a new paradigm in which annotations play a key role.

KNOWLEDGE MANAGEMENT OF MIXED AND PARTIAL DOCUMENTS
The challenge of CKCC is to study the appropriation of knowledge in an international context and to recognize the development of themes of interest and debates between scholars or in larger networks distributed over space and time. In order to recognize meaningful patterns in the machine-readable corpus, topic modeling is used. It is based on the distribution of words over the text in the documents and is used to find similar words, similar documents, or documents similar to arbitrary text. It does this by calculating similarities between words and texts, which constitutes a statistical approach to topics. However, the specific characteristics of the corpus of about 20,000 letters not only complicate the analysis and visualization of meaningful patterns but also require particular management in pre-processing the datasets for use and re-use in digital humanities research (Roorda, Bos & van den Heuvel, 2010; Wittek & Ravenek, 2011). Letters often address more than one topic, and their rhetorical opening and closing phrases are seldom relevant to their content. For that reason it is important not only to be able to segment the letters down to the paragraph level but also to exclude certain phrases from content extraction. Not only are the various digitized corpora so different in format and coding that much data curation is needed to make them suitable for analysis, but the multilinguality and spelling variation in the letters also require additional operations. The choice of language is very inconsistent over the corpus as a whole. About 95% of all letters are written in Dutch, Latin, and French, the rest in German, Greek, Italian, and English. For some languages it is a profitable investment to use additional language resources and tools, but not for all of them. Moreover, the letters themselves are not monolingual: even inside sentences language switches occur. These 17th century letters exhibit much spelling variation. For instance, the name of Christiaan Huygens van Zuylichem is spelled in more than 320 different ways in the CKCC corpus. This requires additional language tools and methodologies, such as Named Entity Recognition, to improve the recall of queries and of computer-generated topics. In the first phase of CKCC, the topic model of Latent Dirichlet Allocation (LDA) was used (this method, together with Latent Semantic Analysis and Random Indexing, to be mentioned subsequently, and their application in CKCC are explained in Wittek & Ravenek, 2011). In this model, documents are considered as random mixtures over latent topics, where each topic is characterized by a distribution of words. Using LDA, the computer generated 100 strings of related words; each string was manually labeled by researchers within the team based on their domain expertise. After a year of developing the topic model, the database was tested by participants of an international workshop, Mathematical Life in the Dutch Republic (Lorentz Center, Leiden, 6-10 December 2010; tinyurl.com/lorentz-mat-life). We asked three test groups, totaling about 20 historians of science, to explore the possibilities of this tool and to inquire in what ways it could contribute to their historical research. Although the researchers acknowledged the potential of the database, they came up with two serious problems. They experienced the database as a black box that was hard to control, and their queries often had limited recall. To overcome these problems a mixed strategy was developed for the next phase of the project.
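As an illustration of the kind of model used in this first phase, the following minimal sketch trains an LDA topic model in Python with the gensim library and inspects its word distributions. The toy documents, the preprocessing, the number of topics, and all parameter values are purely illustrative and do not reflect the project's actual configuration.

    # Minimal sketch: training an LDA topic model on a toy corpus with gensim.
    # Documents, preprocessing, and parameters are illustrative only.
    from gensim import corpora, models

    documents = [
        "observation of the rings of saturn with a new telescope",
        "grinding lenses for telescopes and microscopes",
        "the motion of pendulum clocks at sea",
        "finding longitude with pendulum clocks on ships",
    ]

    # Tokenize; a real pipeline would add stemming, stop word removal,
    # and normalization of historical spelling variants.
    texts = [doc.split() for doc in documents]

    dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
    corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

    # Documents as random mixtures over latent topics,
    # each topic a distribution over words.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

    # Each topic is a weighted string of related words that researchers
    # could label manually ("astronomy", "navigation", ...).
    for topic_id, words in lda.print_topics(num_topics=2, num_words=5):
        print(topic_id, words)

    # Topic mixture of a single document.
    print(lda.get_document_topics(corpus[0]))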
Faceted search was improved to give researchers more ways to manipulate the results. Experiments were set up with two topic modeling methods in addition to Latent Dirichlet Allocation (LDA), namely Latent Semantic Analysis (LSA) and Random Indexing (RI), combined with language normalization. Researchers were involved in labeling the terms (20 in a subset of 300 letters) that were generated during these experiments, to enable an evaluation of the outcomes. It goes beyond the scope of this article to describe these experiments in detail, but the best results were achieved with RI in combination with stemming and removal of stop words. For the implementation, a combination of LSA and RI was used in two scenarios. (1) Query terms of users are forwarded to the LSA and RI models, which return a ranked list of keywords that are most relevant to the topic(s) underlying the query. (2) Text fragments specified by users are forwarded to both models, and the LSA and RI models return a ranked list of letters that are most relevant to this input. In short, full-text search can be enhanced with query terms suggested by the topic model, and it is possible to query for letters that are similar to a given text fragment. Despite its potential usefulness there is still a long way to go, given the multilingual situation and the spelling variants. To improve the recall of keywords, other experiments are being set up involving Named Entity Recognition. Once again, researchers play a role in the evaluation of the automatically generated terms, so that after several iterations of feedback the recall can be improved. Thus, enhancing the queries by topic modeling requires annotation, for the presentation of the results to the end user as well as for the experts' feedback to the software.

THE NEED TO ANNOTATE AND NEW PARADIGMS IN ARCHIVING NOTES IN HUMANITIES RESEARCH
Several studies have pointed to the different nature of data in the humanities. They are often multilingual, historically specific, geographically dispersed, and ambiguous in meaning (ACLS, 2006; Borgman, 2007). Humanities scholars are concerned with the problem of meaning: how it is created, communicated, manipulated, and perceived. In order to contextualize data, they require annotation. Contextualization by annotation has a long history. The history of the footnote by Grafton (1997) is famous, but its future is still unclear. "As the footnote reconfigures itself for the digital world, opportunity and danger are waiting side by side for it" (Zerby, 2003: 144). Bader stated in The New York Times: "Forget Footnotes. Hyperlink. Old Media, Meet New Media". She claimed that after their eviction by book publishers, footnotes would find a new home in the hyperlink construction of the World Wide Web. "Indeed the Web has not only revived the footnote, it has spawned a cross-referencing craze that renders the formerly complete media event into a […] wallflower waiting to be courted by the next available annotator" (Bader, 2000). The statements of Zerby and Bader reveal two problems. First, we no longer know what function the footnote, which played such an important role in the contextualization of (humanities) research, has in digital environments.
Second, we need new paradigms to preserve the digital counterpart of the footnote, the annotation by man or by machine, for re-use by researchers. The new role of the footnote in the virtual research environment has hardly been explored. An interesting exception is the study that presented the Multimedia Digital Annotation System, MADCOW (Bottoni et al., 2004), in which a functional taxonomy of "content" annotations (explanation, comment, question, example, etc.), as opposed to metadata annotations, was formulated based on Rhetorical Structure Theory (RST). Such a functional taxonomy can be developed to assign attributes for the contextualization of topics. Moreover, the MADCOW project signaled the problem that users either are limited to navigating through specific browsers with annotation facilities on a restricted set of contents, or have to disrupt their navigation to start an annotation application. The MADCOW tool allowed users to switch between navigating and annotating modalities with the Web content. Here we try to extend these modalities beyond Web content, to include in principle all sorts of documents, and to explore the implications for preservation practices.

ANNOTATIONS AND DIGITAL SCHOLARSHIP
The practice of annotating is a traditional ingredient of research. How can annotating support modern, digital forms of research? Is the digital version of an annotation versatile enough to express new results? How do digital annotations behave in the total workflow of exploration, hypothesizing, testing, publishing, and archiving? In order to gain practical insight into these matters we have considered two significant projects that are truly representative of digital humanities research. The first is CKCC, described above; the second is rather a programme than a project: Data and Tradition. The Hebrew Bible as a linguistic corpus and as a literary composition (DTHB; initiated by Eep Talstra, running from 2010-07-01 to 2014-06-30; see tinyurl.com/nwo-nl-dthb; more projects in the same programme are listed at tinyurl.com/nwo-nl-talstra). This work builds on a multi-decade effort to linguistically mark up the complete text of the Hebrew Bible. The result is a text database where morphemes, words, phrases, and higher-level text objects carry many features. A version of this database has been deposited into the DANS archive, where it is stored as a compressed SQL dump (Talstra, 2012). The deposit took place during the workshop Biblical Scholarship and Humanities Computing: Data Types, Text, Language and Interpretation (Lorentz Center, Leiden, 6-10 February 2012; tinyurl.com/lorentz-hum-comp), where an international group of experts reflected on how to bring these resources to better fruition in the digital age. Live versions of this database run on researchers' computers, where they can craft queries whose results may or may not support specific interpretations of the text. If a linguistic peculiarity shows up in a difficult passage, one can query the database and see whether it is a true exception to the known rules or just an instance of a regular but rare pattern, to name a typical use case. Hundreds of queries have been crafted, run, and studied, all in relation to interpretation issues. Both CKCC and DTHB have produced curated sources plus analytical results. Yet it is far from clear how these results can partake in a process of accumulation and sharing. Here lies our motivation to explore the power of annotations. The central statement of this part is that annotations are indeed a powerful carrier of digital scholarship and that they can bridge the gap between past and future research, provided they conform to a generic model that supports preservation and sharing.
In order to substantiate this statement, we have to argue that:
• there are frameworks for web-based, digital annotations;
• annotations are versatile: they can express queries, features, keyword and topic assignments;
• annotations can be made portable: they still make sense when their targets move or change;
• annotations must and can be managed with their metadata, provenance, and types;
• annotations can "drive" end-user applications.
Of course, we cannot rigorously prove these assertions. We will draw on our own experiences in building (demo) applications that are driven by queries and features as annotations in the DTHB case, and by topics and keywords as annotations in the CKCC case.

Open Annotation Collaboration
The realization that annotations are important carriers of scholarship, and the fact that in practice annotations tend to become locked up in the systems used to create them, have led to several attempts to standardize annotations and turn them into web resources. Two of those attempts, the Annotation Ontology (Ciccarese et al., 2011; see tinyurl.com/annot-ont) and the Open Annotation Model (henceforth OAM) (Sanderson & van de Sompel, 2011; see openannotation.org), are currently under consideration by the W3C Open Annotation Community Group (see tinyurl.com/w3-annot), with the aim of reconciling the two into a common specification based on RDF (the Resource Description Framework, the language of the Semantic Web, also known as Linked Data; see linkeddata.org). The guidelines in (Sanderson & van de Sompel, 2011) are particularly concise and revealing. To summarize even more: the OAM focuses on the basic structure of an annotation: a body is taken to comment on one or more targets, and the annotation binds them together. Annotations, bodies, and targets are all addressable as web resources. They can all have separate metadata, including authorship, but the metadata is not part of the model. The model is agnostic to specific protocols, platforms, and applications, with one exception: everything is geared to the architecture of the web with its HTTP protocol (HyperText Transfer Protocol; defined at tinyurl.com/ietf-http). The implicit consequence is that OAM annotations can be expressed as RDF and become part of the Semantic Web. So far, the guidelines reveal that very important goals are being achieved: annotations can be shared easily across applications, platforms, and institutions. They can be discovered, filtered by the metadata they are linked to, organized by the resources they target, and moved around and aggregated by discovery services. Yet the guidelines also point to challenges: (1) real annotations need to target fragments of resources, but how can these be specified in interoperable ways? (2) Resources tend to move and change, so how are the annotations that link to them, either by body or by target, to be maintained? (3) The basic model is bare, and lots of information about annotations has to be expressed in ways not prescribed by OAM, so how much interoperability can actually be achieved?
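To make the basic body/target structure concrete, the following sketch builds a single annotation in the spirit of the OAM as an RDF graph, in Python with the rdflib library. All URIs are invented, and the oa: terms follow the later Open Annotation community drafts; the 2011 beta vocabulary may differ in detail.

    # Sketch of an annotation in the spirit of the Open Annotation Model:
    # a body commenting on a target, bound together by an annotation resource.
    # All URIs are fictitious; the oa: terms follow later Open Annotation
    # drafts and may differ from the 2011 beta vocabulary.
    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF, DCTERMS

    OA = Namespace("http://www.w3.org/ns/oa#")

    g = Graph()
    g.bind("oa", OA)
    g.bind("dcterms", DCTERMS)

    anno = URIRef("http://example.org/annotations/42")
    body = URIRef("http://example.org/bodies/query-17")        # e.g. a stored query
    target = URIRef("http://example.org/letters/huygens-0123")  # e.g. a letter

    g.add((anno, RDF.type, OA.Annotation))
    g.add((anno, OA.hasBody, body))
    g.add((anno, OA.hasTarget, target))

    # Metadata is linked to the annotation but is not part of the core model.
    g.add((anno, DCTERMS.creator, Literal("some researcher")))
    g.add((anno, DCTERMS.created, Literal("2012-04-28")))

    print(g.serialize(format="turtle"))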
From the perspective of a research archive, which preserves resources past their active lifetime in an encapsulated form, in order to revive them when somebody is interested in them, exactly these two issues of addressing and metadata are of the utmost importance. In our view, (1) and (2) are fundamental issues that require additional concepts. We address them in the section Portable Annotations. As to (3), there is a general tendency in archives, repositories, and cultural heritage institutions to conform their metadata to the ontologies that are being designed on the Semantic Web, not only for the metadata profiles, but also for the actual values that metadata fields may take (Gradmann, 2010). OAM is very well poised to take advantage of these developments, since it is itself defined in Semantic Web terms.

Queries, Features, Topics and Keywords as Annotations
As discussed above, the results of CKCC and DTHB are predominantly queries and features (DTHB) and topics and keywords (CKCC). Here we explain how we translated all these items into annotations. We subsequently wrote two web applications that present these annotations next to the resources in one interface. QFA (Queries/Features as Annotations; application: tinyurl.com/demo-qaa, wiki: tinyurl.com/wiki-qaa; figure 1) is written for DTHB, and TKA (Topics/Keywords as Annotations; application: tinyurl.com/demo-taa, wiki: tinyurl.com/wiki-taa; figure 2) is written for CKCC material. The intention was to explore whether one could build usable interfaces that are driven by annotations, and with limited effort. To this end we developed two end-user applications that directly operate on sets of annotations using the abstract model, and connect them with the data sources they are about. We assume that both data sources and annotations have been previously imported into relational databases. (See further Portable Annotations below.)

Queries as Annotations
Queries are active, dynamic forays into landscapes of data. Annotations are passive, static comments on fragments of data. What do they have in common? One might expect that we are preserving queries with the aim of being able to run the query over and over again for the indefinite future. Or do we? It would require that we remain familiar with that version of the query language, and with the corresponding version of the database system, forever and ever. It will become increasingly difficult to compare those query results with later ones, because the modern query will not run on the old system and vice versa. The matter is not academic. In this particular case, the queries are expressed in MQL, which is an implemented version of QL, defined by Doedens (1994) as a query language specifically geared to text databases (the acronym QL may best be read as QUEST-like Query Language, and MQL stands for Mini QL; Appendix 1 of Doedens, 1994, contains a historical account). Although the implementation, Emdros (Petersen, 2004; see emdros.org), is open source, well documented, and a powerful solution for text databases, it is definitely not a mainstream application, and its life span is hard to guess. For preserving the results of scholarship, there is a better option. We can select the important queries, those that have been used to obtain new interpretations that have been published in journals. The query instruction is then the body of an annotation, and the query results are the (many) targets of that same annotation.
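A minimal sketch of what such a query annotation could look like as a plain data structure, before any export to RDF; the field names, the anchors, and the metadata values are hypothetical:

    # Hypothetical shape of a query-as-annotation record.
    # The anchors and all field values are invented examples.
    query_annotation = {
        "body": {
            "type": "query",
            "language": "MQL",
            "instruction": "<the MQL query text as it was last run>",
        },
        # Every hit of the query becomes a target, addressed by a work-level
        # anchor (book, chapter, verse, word number) rather than a database id.
        "targets": [
            "Genesis 1:1 w3",
            "Genesis 22:2 w7",
            # ... potentially thousands more
        ],
        # Metadata makes the annotation interpretable for future researchers.
        "metadata": {
            "research_problem": "verbal tense in a difficult passage",
            "author": "a DTHB researcher",
            "last_run": "2012-02-08",
            "publications": ["<journal article that used this query>"],
        },
    }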
Annotations will be linked to metadata specifying the related research problem, the author of the query, and the moment of its last run. That will give the future user a good picture of past research. In addition, in current research users can stumble upon query results as targets of annotations, so that these annotations lead them from passages to queries, exactly the opposite direction from the one usually followed with queries. It is the direction of serendipity.

Figure 1. Screenshot of Queries/Features as Annotations
Figure 2. Screenshot of Topics/Keywords as Annotations

Features as Annotations
In the DTHB case, features are linguistic properties of the form key=value that apply to text objects of nearly every granularity, from morpheme through part-of-speech up to book. These features are the product of many years of manual labor, combined with automatic processing. They have been checked and revised. They constitute a treasure trove. They live in the same implementation of text databases, Emdros, as the queries above. By transforming the features into annotations, we potentially unlock the value that is hidden here. In this case, we simply chose as bodies strings of the form key=value. The targets are the objects that carry that feature value. In our demo application QFA we give the user 70 key=value combinations at the word level to play with. As an example, a user can tell the application to show all verbs with tense=imperfect in blue and all verbs with tense=perfect in red. This helps to interpret narrative structures, even if you do not know Hebrew, although being a linguist helps. Again, this is a case of annotations with (very) many targets: the annotation with body gender=masculine has 101,335 targets! The number of targets of gender=feminine is left as an exercise for the reader.

Topics and Keywords as Annotations
Extracting topics from texts is as useful as it is challenging. Topics are semantic entities that may not have easily identifiable surface forms, so it is impossible to detect them by straightforward search. Topics live at an abstraction level that does not care about language differences, let alone spelling variations. Therefore, if one has a corpus with thousands of letters in several historical languages, and wants to know what they are about without actually reading them all, a good topic assignment is a very valuable resource indeed. There are several ways to tackle the problem of topic detection, and they vary in the quality of what is detected, the cost of detection, and the ratio between manual and automatic work. Several of these methods have been (and are being) tested in CKCC, as explained before. It is not the purpose of this paper to go into topic modeling in depth. Here we are concerned with gathering results, even intermediate results, and making them re-usable for subsequent attempts to uncover the semantic contents of the corpora involved. For our demo application, we gathered three kinds of intermediate results: (1) automatic keyword assignments, (2) manual keyword assignments, (3) automatic topic assignments detected by a specific algorithm. We used the complete corpus of letters from and to the 17th century Dutch scholar Christiaan Huygens (3,090 letters). The mapping from keyword assignments to annotations is simple: bodies are the keywords; targets are the letters to which the keywords are assigned. There is no fragment addressing here. Topics reveal two complications when translating them into annotations.
(1) A topic is not a single word but a complex object in itself. In this context, it is a collection of words that span a semantic field. Moreover, each word contributing to a topic does so with a certain relative weight. (2) When a topic is assigned to a letter, the assignment has a certain confidence, expressed as a number. This could be modeled as an extra annotation on top of the annotation that merely links a topic to a letter: the extra annotation has the confidence as body and the other annotation as target. In our application, however, we have opted to include topic and confidence in one body, as distinct fields. There are even more options, for which we refer to the wiki about TKA (tinyurl.com/wiki-taa-topics).

Portable Annotations
Beyond RDF
So our annotations are not coded in RDF, they have no URIs (URI: Uniform Resource Identifier, which can be dereferenced by means of the HTTP protocol; the definition of URI is at tinyurl.com/ietf-uri), and they do not conform to the Linked Data aspects of OAM. There are good reasons for this: neither the sources nor the annotations that result from CKCC and DTHB are currently web resources. Nevertheless, there is a sense in which we conform to OAM: the annotations reside in a different database than the sources do, and the link between annotations and their targets is strictly symbolic, not dependent on database modeling and technology (no foreign key constraints). One could say that we enforce modularity between sources and annotations, in the sense that annotations can be ported from one source to a comparable source. From here it is not a big step to completely conform to OAM: (1) import real RDF annotations to local database tables from where they drive local applications; (2) if a local application produces annotations that must be shared, export them as RDF. In both cases, local addresses must be translated into absolute URIs.

Usefulness of Porting Annotations
Now we arrive at a tempting picture: annotations that are portable. Many sources are available in several versions, in many copies, in different formats, in multiple languages, and in diverse media. Many annotations on a resource still make sense if one explores other variants of it. Here are some examples:
(1) (from DTHB) There are various authoritative versions of the Hebrew text. We have compared the Biblia Hebraica Stuttgartensia (BHS; tinyurl.com/bhs-browse) with the Westminster Leningrad Codex (WLC; tinyurl.com/tanach-tech). Most of the differences are different word divisions and different diacritical marks. That means that the vast majority of Feature and Query annotations based on the BHS also apply to the WLC. Moreover, there is a set of features, by a different enterprise (Groves & Lowery, 2006; the Westminster Hebrew Morphology, tinyurl.com/groves-whm), based on the WLC, which can also be applied to the BHS. Even the mismatches are interesting!
(2) (from DTHB) There are word-by-word translations of the Hebrew Bible into English. For non-Hebrew-readers, it might be interesting to see which words in such a translation derive from a masculine and which from a feminine word. Such an observation can easily be achieved if we can port the feature annotations from the Hebrew source to such a translation.
(3) (from CKCC) The manual topic assignments are a valuable resource. New attempts at topic modeling could make good use of them, for training or testing purposes. In those cases, it would be convenient to retrieve such annotations from an archive and then to be able to reapply them to new incarnations of the sources.
URIs, Anchors, FRBR
OAM requires that annotations point to their bodies and targets in the Linked Data way: by proper HTTP URIs. If the resources in question are stable and maintained by libraries, archives, and cultural heritage institutions, it becomes possible to harvest many sorts of annotations around the same sources. This is an organizing principle that is quite new and from which huge benefits for data mining and visualization are to be expected. In practice, however, there are several scenarios in which (fragments of) resources are not addressed in a stable way. This happens, for instance, when resources go off-line into an archive. If we want to restore those resources later on, the means of addressing them from the outside may have changed. Moreover, there might not be a unique, canonical restored incarnation of that resource. For that reason one needs anchors to resources that enable the re-use of annotations that have been archived in the past. The solution adopted in QFA and in TKA is to work with localized addresses. These are essentially relative addresses that point to (fragments of) local resources that are part of a local corpus. There is an ontological consideration involved here. The model of Functional Requirements for Bibliographic Records (IFLA, 1997-2009) makes a distinction between work, expression, manifestation, and item. Work is a distinct intellectual or artistic creation. As such, it is a non-physical entity. Expression, manifestation, and item point to increasing levels of concreteness, an item being a concrete entity in the physical world. Wikipedia (tinyurl.com/wikip-frbr) gives a nice example from music; see table 1.

FRBR concept    example                                          characteristic
work            Beethoven's Ninth Symphony                       distinct creation
expression      musical score                                    specific form
manifestation   recording by the London Philharmonic in 1996     physical embodiment
item            record disk                                      concrete entity
Table 1. FRBR's view of the world

The full refinement of these four FRBR concepts is probably not needed for our purposes. Yet a distinction between the work, which exists in an ideal, conceptual domain, and its incarnations, which exist in physical reality, is too important to ignore. It bears on the ways by which identifiers of works and incarnations can be kept stable. Identifiers of works identify within conceptual domains, but they have no function in physically locating works. These identifiers are naturally free of those factors that make a typical hyperlink such a flaky thing. So whenever annotations are about aspects of a resource that are at the work level, they had better target those resources by means of work identifiers. Moreover, the distinction between work and incarnation also applies to fragments of works. Most subdivisions of resources, such as volumes, chapters, and verses, exist at the work level, although some fragments, such as lines and pages, are typically products of the incarnation level. We can now define our anchors as identifiers at the work level, for resources and their fragments. This is in fact the nature of our localized addresses. Quite often, the sources themselves and their fragments have anchors that are recognized by whoever is involved with them. Take the books, chapters, and verses in the Bible, for example.
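A small sketch of how such work-level anchors make annotations portable across incarnations; the anchor format, the toy incarnations, and their internal identifiers are hypothetical:

    # Sketch: an annotation targets a work-level anchor ("Genesis 1:1 w1"),
    # not a byte offset or database key of one particular incarnation.
    # The anchor format and the toy incarnations are invented for illustration.
    annotation = {"body": "gender=masculine", "target": "Genesis 1:1 w1"}

    # Two incarnations of the same work, each with its own internal ids,
    # but both indexable by the shared work-level anchor.
    bhs = {"Genesis 1:1 w1": {"internal_id": 70001, "surface": "<spelling in BHS>"}}
    wlc = {"Genesis 1:1 w1": {"internal_id": 12345, "surface": "<spelling in WLC>"}}

    # The same annotation applies to either incarnation; only the lookup
    # table changes, the annotation itself does not.
    for incarnation in (bhs, wlc):
        word = incarnation[annotation["target"]]
        print(word["internal_id"], annotation["body"])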
Even where there are no universally recognized anchors, it is easier to translate between rival anchoring schemes than to maintain and multiply stable identifiers at the incarnation level. Lurking below the surface there is the question: to what extent are differing versions incarnations of the same work? Can we keep fragment identifiers stable under versioning? This is really a complex issue, and we plan to devote a completely new demo application to it in a new use case. (See the wiki on Portable Annotations, tinyurl.com/wiki-pa; beware that this is work in progress.)

Statement
Not all variance between sources can be productively addressed with time-based versioning. There are deeper reasons for variation and deeper reasons for identification than sequences of surface forms. If we ignore those reasons, and if we fail to base our identifiers on them, we will not have truly portable annotations.

Annotation management: metadata, provenance and types
The role of metadata for annotations is (at least) twofold. First, metadata make it possible to assess the quality, significance, and meaning of an annotation. Quality judgments can be made based on the provenance: who made the annotation, for which project, when? Significance can be gleaned from a list of publications that are associated with that (set of) annotations. Meaning can be retrieved from pointers to reference materials. As OAM is firmly integrated in the Semantic Web effort, there are no conceptual limitations on linking metadata to annotations. The second role of metadata is to enable annotation-driven applications to decide how best to filter and display the annotations. Here the typology of annotations comes in. We have exposed four not-so-ordinary types of annotation, each with its own requirements for display. The unlimited linking of metadata to annotations is problematic for generic applications. How do applications recognize what metadata is available and by which metadata they should let themselves be controlled? Here we find ourselves on the middle ground between the rigor of what is within the limits of OAM and the polymorphism of what lies outside it. For dedicated applications there is no problem: you can tell them where to look. Fully generic annotation-driven applications, however, will have difficulties here.

Annotation-driven applications
How difficult is it to develop an annotation-driven application that deals with significant amounts of data and annotations, and that presents a usable interface to the end user?

Design
The demo applications QFA and TKA are driven by a database containing the source materials and a separate database with the annotations. There is no mingling or tight coupling between the sources and the annotations. The only links are the anchors: symbolic expressions in the annotation targets that refer to fragments of the sources.

Functionality
Both applications display the source material in a broad column, and the annotations in narrower columns next to the sources. The targets of the annotations can be highlighted in the sources, and the user has some control over the highlighting, depending on the type of annotation. We invite the reader to explore these applications to get a more detailed picture. In short, these applications visualize the annotations and the sources in basic, not too crude ways, adapted to the different kinds of annotation.
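A minimal sketch of the design described above, using SQLite purely for illustration: sources and annotations live in separate databases, and the only connection is the symbolic anchor stored in the annotation target. Table layouts, anchors, and values are invented.

    # Sketch of the design: one database for sources, another for annotations,
    # linked only by symbolic anchors, with no foreign key constraints.
    # Table layouts, anchors, and values are invented for illustration.
    import sqlite3

    sources = sqlite3.connect(":memory:")
    sources.execute("CREATE TABLE letters (anchor TEXT, content TEXT)")
    sources.execute("INSERT INTO letters VALUES ('huygens-0123', 'Mijn Heer, ...')")

    annotations = sqlite3.connect(":memory:")
    annotations.execute("CREATE TABLE annotations (body TEXT, target TEXT)")
    annotations.execute(
        "INSERT INTO annotations VALUES ('keyword: longitude', 'huygens-0123')"
    )

    # The application joins the two in code, by anchor, not in the database.
    for body, target in annotations.execute("SELECT body, target FROM annotations"):
        row = sources.execute(
            "SELECT content FROM letters WHERE anchor = ?", (target,)
        ).fetchone()
        if row is not None:
            print(body, "->", row[0])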
Implementation
In order to rapidly implement our ideas concerning annotations and sources, we needed a simple but effective framework on which we could build data-driven web applications. We found it in the shape of Web2py (Di Pierro, 2007-2011). We needed very little code on top of the framework, just a few hundred lines each of Python and Javascript. Deployment of these apps is completely web-based and only takes seconds. Most work went into the data preparation stage, where we used Perl and shell scripts to compile data from various origins into SQL imports for sources and annotations. These scripts were also in the few-hundred-lines range.

Missing Link
What these demos still lack is full RDF capability. Once these sources are truly web resources, we expect that it will be easy to build an import/export facility that turns database annotations into real RDF annotations. How to translate our fragment anchors into HTTP URIs is still an open question. Finally, work remains to be done to get the best of both worlds, relational databases and linked data; see e.g. (Baron & Di Pierro, 2010).

CONCLUSION
We have investigated the feasibility of using annotations as portable carriers for diverse results of scholarship in the humanities. We found that annotations are versatile enough to carry the products of digital scholarship such as query results, features, topics, and keywords. The Open Annotation Model represents annotations as web resources, which makes them easy to share beyond the systems in which they originated. Annotations can be managed by unlimited association of metadata. The development of annotation-driven applications is doable: the focus remains on the data and does not shift to the software. Yet the web-based model for annotations is not fully compatible with the process of archiving and re-use. This would be greatly improved if we could make annotations more portable across variant resources. And that, in turn, boils down to using anchors for targeting resources and their fragments. Anchors are identifiers at the work level in the FRBR sense. Let us briefly consider what this outcome means for digital humanities in general. In the non-digital ages before us, scholars relied on harmonization efforts such as standard editions of historical texts, because the source materials were simply too complex to deal with in their raw form. This had the character of projecting the data onto a one-dimensional space. Now there is a growing pressure to investigate (again) the raw data, find new perspectives, and preserve the connections between interpretations and data in a much more transparent way. This shift in research paradigm can only succeed if it is matched by a shift in archiving methods. Annotations have the potential to unlock data that is behind the barriers of application interfaces and data models. They facilitate deep linking to fragments. They can be instrumental in identifying interesting slices of the data that could not be accessed as such before. This is particularly useful in disciplines whose business it is to make distinctions between objective data and many layers of interpretation, where those interpretations are based on the data themselves in combination with any amount of data from the context. The fabric of objects and meanings that humanities research is creating must be taken care of in such a way that it remains navigable from all imaginable entry points in all conceivable directions. We have shown that annotations are up to the task.
Their way into the web of linked open data is being paved. If, in that process, they can play nicely with the distinction between concept and realization, they constitute a new archiving paradigm.

ACKNOWLEDGMENTS
We thank Walter Ravenek (Huygens ING) for helpful comments on topic modeling; Eko Indarto (DANS) for helping to develop a first version of QFA in a very short time; Andrea Scharnhorst (DANS) for granting additional time for research; and Joris van Zundert (Huygens ING) for facilitating an inspiring Interedition bootcamp (tinyurl.com/intered-lvn) which set me (Dirk) on the track of rapid development.

REFERENCES
ACLS (2006). Our Cultural Commonwealth: The Report of the American Council of Learned Societies' Commission on Cyberinfrastructure for Humanities and Social Sciences. Retrieved 2012-04-28 from http://www.acls.org/cyberinfrastructure/OurCulturalCommonwealth.pdf
Bader, J.L. (2000). Forget Footnotes. Hyperlink. The New York Times, Sunday 16 July 2000, Section 4, Week in Review.
Baron, C., Di Pierro, M. (2010). Publishing Linked Data Using web2py. School of Computing, DePaul University of Chicago. Retrieved 2012-04-28 from tinyurl.com/web2py-ld-article (pdf).
Borgman, C. (2007). Scholarship in the Digital Age. Information, Infrastructure and the Internet. Cambridge (Mass.), London: The MIT Press.
Bottoni, P., Civica, R., Levialdi, S., Orso, L., Panizzi, E., Trinchese, R. (2004). MADCOW: a Multimedia Digital Annotation System. In M.F. Costabile (Ed.), Proc. Working Conference on Advanced Visual Interfaces (AVI 2004) (pp. 55-62). New York: ACM Press.
Ciccarese, P., Ocana, M., Castro, L.J.G., Das, S., Clark, T. (2011). An Open Annotation Ontology for Science on Web 3.0. Journal of Biomedical Semantics, 2(Suppl 2):S4 (17 May 2011).
Di Pierro, M. (2007-2011). web2py. Full Stack Web Framework, 4th edition. Online book. Retrieved 2012-04-28 from web2py.com.
Doedens, C.F.J. (1994). Text Databases. One Database Model and Several Retrieval Languages. Language and Computers, Number 14. Amsterdam and Atlanta, GA: Editions Rodopi. ISBN: 90-5183-729-1.
Gradmann, S. (2010). Knowledge = Information in Context: on the Importance of Semantic Contextualisation in Europeana. White paper. Retrieved 2012-04-28 from tinyurl.com/europeana-gradmann (pdf).
Grafton, A. (1997). The Footnote. A Curious History. Cambridge (Mass.): Harvard University Press.
Groves, A., Lowery, K. (Eds.). (2006). The Westminster Hebrew Bible Morphology Database. Philadelphia: Westminster Hebrew Institute.
IFLA (International Federation of Library Associations and Institutions) (1997-2009). Functional Requirements for Bibliographic Records. Final Report. Retrieved 2012-04-28 from tinyurl.com/ifla-frbr (pdf).
Petersen, U. (2004). Emdros - a text database engine for analyzed or annotated text. Proceedings of COLING 2004, 1190-1193. Retrieved 2012-04-28 from tinyurl.com/emdros-coling (pdf).
Roorda, D., Bos, E.-J., van den Heuvel, C. (2010). Letters, Ideas and Information Technology. Using digital corpora of letters to disclose the circulation of knowledge in the 17th century. In Digital Humanities Conference Abstracts, King's College London, 7-10 July 2010 (pp. 211-214).
Sanderson, R., van de Sompel, H. (Eds.). (2011). Open Annotation: Beta Data Model Guide. Web document. Retrieved 2012-04-28 from openannotation.org.
Talstra, E., Sikkel, C., Glanz, O., Oosting, R., Dyk, J.W. (2012). Text Database of the Hebrew Bible. Dataset available from Data Archiving and Networked Services (DANS), with permission of the depositor. Retrieved 2012-04-28 from tinyurl.com/dans-wivu.
Wittek, P., Ravenek, W. (2011). Supporting the exploration of a corpus of 17th century scholarly correspondences by topic modeling. In B. Maegaard (Ed.), Proceedings of Supporting Digital Humanities 2011: Answering the unaskable. Copenhagen.
Zerby, C. (2003). The Devil's Details: A History of Footnotes. New York: Touchstone.