Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles

Silvio Peroni (1), Francesco Osborne (2), Angelo Di Iorio (1), Andrea Giovanni Nuzzolese (3), Francesco Poggi (1), Fabio Vitali (1) and Enrico Motta (2)

1 Digital and Semantic Publishing Laboratory, Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
2 Knowledge Media Institute, Open University, Milton Keynes, United Kingdom
3 Semantic Technologies Laboratory, Institute of Cognitive Sciences and Technologies, Italian National Research Council, Rome, Italy

Submitted 9 October 2016. Accepted 7 August 2017. Published 2 October 2017.
Corresponding author: Silvio Peroni, silvio.peroni@unibo.it
Academic editor: Ciro Cattuto
Additional Information and Declarations can be found on page 31
DOI 10.7717/peerj-cs.132
Copyright 2017 Peroni et al. Distributed under Creative Commons CC-BY 4.0. Open access.

ABSTRACT
Purpose. This paper introduces the Research Articles in Simplified HTML (or RASH), which is a Web-first format for writing HTML-based scholarly papers; it is accompanied by the RASH Framework, a set of tools for interacting with RASH-based articles. The paper also presents an evaluation that involved authors and reviewers of RASH articles submitted to the SAVE-SD 2015 and SAVE-SD 2016 workshops.
Design. RASH has been developed aiming to: be easy to learn and use; share scholarly documents (and embedded semantic annotations) through the Web; support its adoption within the existing publishing workflow.
Findings. The evaluation study confirmed that RASH is ready to be adopted in workshops, conferences, and journals and can be quickly learnt by researchers who are familiar with HTML.
Research Limitations. The evaluation study also highlighted some issues in the adoption of RASH, and in general of HTML formats, especially by less technically savvy users. Moreover, additional tools are needed, e.g., for enabling additional conversions from/to existing formats such as OpenXML.
Practical Implications. RASH (and its Framework) is another step towards enabling the definition of formal representations of the meaning of the content of an article, facilitating its automatic discovery, enabling its linking to semantically related articles, providing access to data within the article in actionable form, and allowing integration of data between papers.
Social Implications. RASH addresses the intrinsic needs related to the various users of a scholarly article: researchers (focussing on its content), readers (experiencing new ways for browsing it), citizen scientists (reusing available data formally defined within it through semantic annotations), publishers (using the advantages of new technologies as envisioned by the Semantic Publishing movement).
Value. RASH helps authors to focus on the organisation of their texts, supports them in the task of semantically enriching the content of articles, and leaves all the issues about validation, visualisation, conversion, and semantic data extraction to the various tools developed within its Framework.

How to cite this article: Peroni et al. (2017), Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles. PeerJ Comput. Sci. 3:e132; DOI 10.7717/peerj-cs.132

Subjects: Digital Libraries, World Wide Web and Web Science
Keywords: Document conversion, XSLT, RASH, Semantic Publishing, Digital Publishing, Semantic Web

INTRODUCTION
In the last months of 2014, several posts within technical mailing lists of the Web (https://lists.w3.org/Archives/Public/public-lod/2014Nov/0003.html) and Semantic Web (https://lists.w3.org/Archives/Public/public-lod/2014Oct/0058.html) community discussed an evergreen topic in scholarly communication, i.e., how authors of research papers could submit their works in HTML rather than, say, PDF, MS Word or LaTeX. Besides the obvious justification of simplification and unification of data formats for drafting, submission and publication, an additional underlying rationale is that the adoption of HTML would ease the embedding of semantic annotations, thus improving research communications thanks to already existing W3C standards such as RDFa (Sporny, 2015), Turtle (Prud'hommeaux & Carothers, 2014) and JSON-LD (Sporny, Kellogg & Lanthaler, 2014). This opens the complex and exciting scenarios that the Semantic Publishing community has promised us in terms of increased discoverability, interactivity, openness and usability of scientific works (Bourne et al., 2011; Shotton et al., 2009).

Nonetheless, HTML is still primarily used as an output format only: the authors write their papers in LaTeX or MS Word and submit the sources to the typesetters, who are responsible for producing the final version that will eventually be published and read on the Web. Appropriate tools in the publishing toolchain are used to convert papers among multiple formats.
The interest in Web-first research papers, i.e., papers that are natively designed, stored and transferred in HTML, is increasing. Just to cite a few research efforts: Scholarly HTML (http://scholarlyhtml.org) defines a set of descriptive rules for adopting a defined subset of HTML to describe the metadata and content of scholarly articles; Dokieli (http://dokie.li) is a Web application that allows authors to create HTML-based scholarly articles directly in the browser, adding annotations and many other sophisticated features.

This paper introduces a novel approach towards the same goal: providing authors with a subset of HTML for Web-first papers. The format is called RASH, Research Articles in Simplified HTML, and consists of 32 HTML elements only. This format is also accompanied by the RASH Framework, a set of specifications and tools for RASH documents (Peroni, 2017).

There are two key differences between RASH and other similar proposals. First of all, RASH adopts a simplified pattern-based data model. The number of markup elements to be used by authors was reduced to the bare minimum, and the elements themselves were chosen in order to minimize the cognitive effort of authors when writing documents. Secondly, RASH does not come with a full authoring environment but is expected to be produced from MS Word, ODT and LaTeX sources. The basic idea is to allow authors to keep using the word processors on which they routinely write their papers and to provide them with multi-format converters. These converters are included in the RASH Framework, whose architecture is modular and extensible for handling new formats in the future.
RASH is in fact intended to help authors focus on the organisation of their texts and to support them in the task of semantically enriching the content of articles, delegating all the issues about validation, presentation and conversion of RASH documents to the various tools developed within its Framework. This is a well-known principle in scientific publishing, even if not yet fully applied: clear separation of concerns. The authors should focus on organising the content and structure only, and the format should not require authors to worry about how the content will be presented on screen and in print. The publishers will then take care of creating the final formatting to best render the content in the style of their publications, or authors could use self-publishing platforms as promoted by Linked Research (http://linkedresearch.org).

Such a separation of concerns can be pushed much further. Pettifer et al. (2011) explained well the difference between an article as "an instance of scholarly thought" and "a representation for consumption by human or machine", and showed how multiple representations can be combined, integrated with external data, enhanced and interacted with in order to provide scholars with sophisticated tools directly within their articles.

Another critical requirement for any HTML-based language used for scientific writing is good rendering and acceptance by the publishers. Any new HTML-based format should be beneficial for publishers as well. Of course, publishers and the organisers of conferences and workshops would like to manage new formats in the same way as the formats they already support, such as LaTeX. To this end, these formats should be supported by tools for their conversion and for rendering the content in specific layouts, such as ACM ICPS (http://www.acm.org/sigs/publications/proceedings-templates) and Springer LNCS (http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0).
RASH adopts a pragmatic approach to this issue: while we are interested in a full-fledged native RASH authoring environment, we implemented a set of converters, in the RASH Framework, that are easily integrable (and were integrated) with existing publishing platforms.

The goal of this paper is, in fact, to describe the outcomes of some experiments on the use of RASH, so as to understand:
1. if it can be adopted as HTML-based submission format in academic venues (workshops, conferences, journals);
2. if it is easy to learn and use;
3. if it can be used to add semantic annotations, and which vocabularies are the most widely adopted in RASH papers.

The rest of the paper is structured as follows. In 'Related Works' we introduce some of the most relevant related works in the area, providing a functional comparison of the various works. In 'Which "Web-first" Format for Research Articles?' we introduce the rationale for the creation of a new Web-first format for scholarly publications, discussing the importance of minimality. In 'Writing Scholarly Articles in HTML with RASH' and 'The RASH Framework' we introduce the theoretical background of RASH, and then provide an introduction to the language and the main tools included in its Framework. In 'RASH and SAVE-SD: an Evaluation' we present, as a case study, an analysis of the adoption of RASH at the SAVE-SD 2015 (http://cs.unibo.it/save-sd/2015/index.html) and SAVE-SD 2016 (http://cs.unibo.it/save-sd/2016/index.html) workshops. Finally, in 'Conclusions' we conclude the paper by sketching out some future developments.
RELATED WORKS
The growing interest in the publication of Web-first research papers has resulted in the release of some interesting projects related to RASH. In the following subsections, we discuss the most important contributions in this area by splitting them into two main categories: (i) HTML-based formats and (ii) WYSIWYG editors for HTML documents. Note that we do not discuss in detail some other recent efforts based on non-HTML languages, even if they are equally relevant for the community. ScholarlyMarkdown (http://scholarlymarkdown.com/) (Lin & Beales, 2015), for instance, is a syntax to produce scholarly articles from a Markdown (http://daringfireball.net/projects/markdown/) input. ShareLaTeX (https://www.sharelatex.com/) is a Web-based real-time collaborative editor for LaTeX documents.

In Table 1 we briefly summarise the features and capabilities of the formats presented, in order to highlight the main differences between them.

HTML-based formats
One of the first documented contributions that proposed an HTML-based format for scholarly articles was Scholarly HTML (http://scholarlyhtml.org). It is not defined as a formal grammar, but as a set of descriptive rules that restrict authors to a reduced number of HTML tags for describing the metadata and content of a scholarly article. It is the main intermediate format used in ContentMine (http://contentmine.org) for describing the conversion of PDF content into HTML.

Along the same lines, PubCSS (https://github.com/thomaspark/pubcss/) is a project which aims at pushing the use of HTML+CSS for writing scholarly articles. It does not define a formal grammar for the HTML element set to use. Rather, it provides some HTML templates according to four different CSS styles, which mimic four LaTeX styles for Computer Science articles, i.e., ACM SIG Proceedings, ACM SIGCHI Proceedings, ACM SIGCHI Extended Abstracts, and IEEE Conference Proceedings.
HTMLBook (https://github.com/oreillymedia/HTMLBook/) is an O'Reilly specification for creating HTML documents (books, in particular) by using a subset of all the (X)HTML5 elements. This is one of the first public works by a publisher for pushing HTML-like publications, even if the status of its documentation (and, consequently, of its schema) is still "unofficial".

Another project, which shares its name with one of the previous ones, Scholarly HTML (https://github.com/scienceai/scholarly.vernacular.io), is a work by the science.ai (http://science.ai) company that aims at providing a domain-specific data format based on open standards (among which HTML5) for enabling "the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers" (Berjon & Ballesteros, 2015). While the format is not defined by any particular formal grammar, it has well-written documentation (Berjon & Ballesteros, 2015) that teaches how to produce scholarly documents by using a quite large set of HTML tags, accompanied by schema.org (http://schema.org) annotations for describing specific structural roles of documents as well as basic metadata of the paper. The company also provides services that enable the conversion of Microsoft Word documents into the ScholarlyHTML format.

One of the authors of the previous work is also the chair of a W3C community group called "Scholarly HTML" (https://www.w3.org/community/scholarlyhtml/) which aims at developing an HTML vernacular (https://github.com/w3c/scholarly-html) for the creation of a Web-first format for scholarly articles. It involves several people from all the aforementioned specifications (including RASH), and the group work should result in the release of a community-proposed interchange HTML format. As of September 22, 2017, the online documentation (https://w3c.github.io/scholarly-html/) is mainly a fork of the Scholarly HTML specification proposed by science.ai discussed above.

Table 1. A comparison among existing HTML-oriented formats for scholarly papers according to seven distinct categories. Letters in parentheses refer to the notes below the table.

Format: RASH (a)
  Syntax: HTML
  Doc: Available online (b)
  Formal grammar: RelaxNG (c)
  Semantic annotations: RDFa, RDF/XML, Turtle, JSON-LD
  CSS for different formats: Web-based and Springer LNCS
  WYSIWYG editor: Apache OpenOffice, Microsoft Word, RASH Javascript Editor (RAJE)
  Conversion from: ODT, DOCX
  Conversion to: LaTeX: ACM ICPS, ACM Journal Large, PeerJ CS, Springer LNCS

Format: Scholarly HTML (2011) (d)
  Syntax: HTML
  Doc: Available online (e)
  Formal grammar: None
  Semantic annotations: RDFa
  CSS for different formats: None
  WYSIWYG editor: None
  Conversion from: PDF (via ContentMine Norma (f))
  Conversion to: None

Format: PubCSS (g)
  Syntax: HTML
  Doc: Available online (h)
  Formal grammar: Informal (via HTML templates)
  Semantic annotations: None
  CSS for different formats: ACM SIG Proceedings, ACM SIGCHI Proceedings, ACM SIGCHI Extended Abstracts, and IEEE Conference Proceedings
  WYSIWYG editor: None
  Conversion from: None
  Conversion to: PDF (via browser interface)

Format: HTML Books (i)
  Syntax: HTML
  Doc: Available online (j)
  Formal grammar: XML Schema (k)
  Semantic annotations: None
  CSS for different formats: CSS files for PDF printing and EPUB/MOBI-compatible device visualisations
  WYSIWYG editor: None
  Conversion from: None
  Conversion to: None

Format: Scholarly HTML (2015) (l)
  Syntax: HTML
  Doc: Available online (m)
  Formal grammar: None
  Semantic annotations: RDFa, JSON-LD
  CSS for different formats: Web-based
  WYSIWYG editor: Microsoft Word (as referenced online (n)) and their online platform (free access not guaranteed as of 20 June 2017)
  Conversion from: DOCX
  Conversion to: None

Format: Scholarly HTML (2016) (o)
  Syntax: HTML
  Doc: Available online (p)
  Formal grammar: None
  Semantic annotations: RDFa, JSON-LD
  CSS for different formats: Web-based
  WYSIWYG editor: None
  Conversion from: None
  Conversion to: None

Format: dokieli format
  Syntax: HTML
  Doc: Available online (q)
  Formal grammar: Informal (via HTML templates and patterns)
  Semantic annotations: RDFa, Turtle, JSON-LD, TRiG
  CSS for different formats: Web-based (Native and Basic), Springer LNCS, ACM ICPS
  WYSIWYG editor: dokieli (r)
  Conversion from: None
  Conversion to: PDF (via browser interface)

Format: Fiduswriter format
  Syntax: HTML
  Doc: None
  Formal grammar: None
  Semantic annotations: None
  CSS for different formats: Web-based
  WYSIWYG editor: Fiduswriter (s)
  Conversion from: None
  Conversion to: HTML, EPUB, LaTeX

Format: Authorea format
  Syntax: HTML
  Doc: None
  Formal grammar: None
  Semantic annotations: None
  CSS for different formats: Web-based
  WYSIWYG editor: Authorea (t)
  Conversion from: DOCX, LaTeX
  Conversion to: DOCX, LaTeX (according to several stylesheets), PDF, zipped structure with HTML

Notes.
a https://github.com/essepuntato/rash/
b http://cs.unibo.it/save-sd/rash
c https://raw.githubusercontent.com/essepuntato/rash/master/grammar/rash.rng
d http://scholarlyhtml.org/
e http://scholarlyhtml.org/core-specification/
f https://github.com/ContentMine/norma
g https://github.com/thomaspark/pubcss/
h http://thomaspark.co/2015/01/pubcss-formatting-academic-publications-in-html-css/
i https://github.com/oreillymedia/HTMLBook/
j http://oreillymedia.github.io/HTMLBook/
k https://raw.githubusercontent.com/oreillymedia/HTMLBook/master/schema/htmlbook.xsd
l https://github.com/scienceai/scholarly.vernacular.io
m http://scholarly.vernacular.io/
n https://science.ai/overview
o https://github.com/w3c/scholarly-html
p https://w3c.github.io/scholarly-html/
q https://dokie.li/docs
r http://dokie.li
s https://www.fiduswriter.org
t https://www.authorea.com

HTML-oriented WYSIWYG editors
One of the most important and recent proposals, which is compliant with the principles introduced as part of the Linked Research (http://linkedresearch.org) project (see footnote 1), is dokieli (https://dokie.li) (Capadisli et al., 2017). Dokieli is a Web application (still under development) that allows the creation of HTML-based scholarly articles directly in the browser, and implements several features, among which are annotations (in RDF) and a notification system. The application also makes available some HTML templates and a series of widgets for navigating, visualising (in different formats) and printing research documents easily by using common browsers.

[Footnote 1: The main aim of the LinkedResearch project is to propose principles for enabling researchers to share and reuse research knowledge by means of existing Web and Semantic Web technologies towards a future world where researchers can publish and consume human-friendly and machine-readable (e.g., by using RDFa (Sporny, 2015)) scholarly documents.]
Fidus Writer (https://www.fiduswriter.org/) is another Web-based application for creating HTML scholarly documents by means of a word-processor-like interface. While the particular format used is not explicitly specified, it allows the conversion of the HTML documents created within the application into two different formats, i.e., EPUB and LaTeX (alongside HTML).

Authorea (https://www.authorea.com) is a Web service that allows users to write papers by means of a clear and effective interface. It enables the inclusion of the main components of scientific papers, such as inline elements (emphasis, quotations, etc.) and complex structures (figures, equations, etc.), and allows the use of Markdown and LaTeX for adding more sophisticated constructs. In addition, Authorea is able to export the document in four different formats (PDF, LaTeX, DOCX, and zipped archive with several HTML files) and according to a large number of stylesheets used in academic venues.

WHICH "WEB-FIRST" FORMAT FOR RESEARCH ARTICLES?
The term "Web-first" format indicates the possibility of using HTML as a primary format to write, store and transfer research articles, and not only to make these articles available on the Web. Some questions naturally arise in this context: shall we use the full HTML? If we impose a limited subset, which elements should we consider? Shall we demand specific rules for using the language?

[Footnote 2: Note that accepting HTML as format for submissions in conferences/workshops is a totally different issue, since this choice is normally taken by the organisers. For instance, see the SAVE-SD 2015 call for papers (http://cs.unibo.it/save-sd/2015/submission.html) and the various editions of SePublica (http://ceur-ws.org/Vol-1155/).]

Some works, e.g., Capadisli, Riedl & Auer (2015), suggest not to force any particular HTML structure for research papers. This choice would allow authors to use whatever HTML structure they want for writing papers and would reduce (or even eliminate) the fear of the template bottleneck, i.e., the fact that users may not adopt a particular language if they are compelled to follow specific rules.

On the other hand, leaving authors free to use, potentially, the whole HTML specification may affect the whole writing and publishing process of articles. The author could adopt any kind of HTML linearisation, e.g., using elements div instead of elements section, using elements table for their presentational behaviour (i.e., how they are rendered by browsers or other software readers) and not for presenting tabular data, and the like. This freedom could result in two main kinds of issues:
• visualisation bottleneck: it may prevent the correct use of existing, well-developed and fairly standard CSSs (e.g., Capadisli's CSSs developed for Dokieli (https://dokie.li)) for both screen and print media, making it necessary to write new code to handle paper visualisation correctly;
• less focus on the research content: the fact that a certain paper is not visualised very well in a browser (or, worse, in a way that is not the one the author expects) could lead the author to work on the presentation of the text, rather than focussing on its actual research content.

Another point against the use of any HTML syntax for writing papers concerns the possibility of enabling an easy way for sharing the paper with others (e.g., co-authors) who, potentially, may not use HTML in the same way.
If all the co-authors of a paper are able to use the full HTML, they may not understand other users' specific use of some HTML tags: "why did they use the elements section instead of div?"; "what is this freaky use of elements table?". Hence, the advantages of using a common HTML format are quite evident: only one syntax and only one possible semantics.

There is a further issue worth mentioning. Having a shared, unambiguous and simple format would facilitate conversions from/into other complex ones (e.g., ODT (JTC1/SC34 WG 6, 2006), OOXML (JTC1/SC34 WG 4, 2011), DocBook (Walsh, 2009), JATS (National Information Standards Organization, 2012)), thus enabling authors to use their own text editors or word processors to modify the articles. The conversion is instead much more complex, error-prone and imprecise on the full HTML.

To complicate an already complex scenario there is the necessary involvement of publishers. Allowing the authors to use their own HTML format could be counterproductive from a publisher's perspective, in particular when we speak about the possibility of adopting such HTML formats for regular conference/journal camera-ready submissions. From a recent discussion on the Force11 mailing list (https://groups.google.com/forum/#!topic/forcnet/g4BNAOOMjMM), it emerges that publishers are willing to adopt HTML for submissions if and only if it is a clear community need. It means that they will include HTML formats in the publishing workflow only once a number of conference organisers decide to include HTML among the accepted formats for paper submissions (see footnote 2). However, using one clear Web-first format, rather than a plethora of possible variations allowed by the full HTML schema, would certainly lighten the burden of publishers for including HTML within their publishing workflow. This inclusion could be additionally favoured by the availability of services (e.g., editors, converters, enhancers, visualisers) for facilitating the use of such a Web-first format within the existing publishing environments.

[Footnote 3: OASIS LegalDocumentML is the standardisation of AkomaNtoso (http://www.akomantoso.org/), which is a set of simple technology-neutral electronic representations in XML format of parliamentary, legislative and judiciary documents, and has been already adopted by several parliaments in the European Union, Africa, and South America.]

Last but not least, using a controlled subset of HTML is more appropriate for Semantic Publishing applications (Shotton et al., 2009; Peroni, 2014b). The development of scripts and applications to extract, for instance, RDF statements directly from the markup structure of the text becomes a nightmare if different authors use HTML in different manners. For instance, what happens when trying to extract the rhetorical organisation of a scientific paper according to the Document Component Ontology (DoCO) (http://purl.org/spar/doco) (Constantin et al., 2016) from two HTML documents that use HTML tags in different ways? Is an HTML element table an actual table (containing tabular data)? Which are the tags identifying sections? These analyses are all easier within a controlled and unambiguous subset of HTML.
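To make the last point concrete, the following Python sketch shows how little code is needed to recover a section outline when sections are always marked up with the element section. It is only an illustration under stated assumptions: the class name SectionOutline, the function outline and the sample markup are hypothetical and are not part of the RASH Framework. The second half of the sample shows how a section encoded with div silently escapes the same extraction.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionOutline(HTMLParser):
    """Collect (nesting level, heading text) for headings found inside <section>."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # current nesting level of <section> elements
        self.in_heading = False  # True while reading a heading inside a section
        self.outline = []        # list of (level, heading text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.depth += 1
        elif tag in HEADINGS and self.depth > 0:
            self.in_heading = True
            self.outline.append((self.depth, ""))

    def handle_endtag(self, tag):
        if tag == "section":
            self.depth -= 1
        elif tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            level, text = self.outline[-1]
            self.outline[-1] = (level, text + data)


def outline(html):
    """Return the section outline of an HTML string."""
    parser = SectionOutline()
    parser.feed(html)
    parser.close()
    return [(level, text.strip()) for level, text in parser.outline]


# Hypothetical sample: two proper sections and one "section" written as a div.
SAMPLE = """
<body>
  <section><h1>Introduction</h1><p>Text.</p>
    <section><h1>Motivation</h1><p>More text.</p></section>
  </section>
  <div class="sec"><h1>Results</h1><p>Marked up with div.</p></div>
</body>
"""

print(outline(SAMPLE))  # [(1, 'Introduction'), (2, 'Motivation')]
```

The "Results" heading is lost because its container is a div: a tool that had to cope with arbitrary HTML would need heuristics for every such variation, which is exactly the cost a controlled subset avoids.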
WRITING SCHOLARLY ARTICLES IN HTML WITH RASH
The subset of HTML we propose in RASH is strictly compliant with a theory of patterns we have developed over the past few years. Patterns are widely accepted solutions to handle recurring problems. First introduced for architecture and engineering problems (Alexander, 1979), they have been successfully deployed in computer science and in particular in software engineering (Gamma et al., 1994). In this section, we briefly introduce our patterns for document engineering and then we go into the details of RASH.

Theoretical foundations: structural patterns
While we have plenty of tools and languages for creating new markup languages (e.g., RelaxNG (Clark & Makoto, 2001) and XMLSchema (Gao, Sperberg-McQueen & Thompson, 2012)), these usually do not provide any particular guideline for fostering the development of robust and well-shaped document languages. In order to fill that gap, in the last decade we have experimented with the use of a theory of structural patterns for markup documents (Di Iorio et al., 2014), which has since been applied in several national and international standards, among which OASIS LegalDocumentML (https://www.oasis-open.org/committees/legaldocml/) (see footnote 3), a legal document standard for the specification of parliamentary, legislative and judicial documents, and for their interchange between institutions in different countries.

The basic idea behind this theory is that each element of a markup language should comply with one and only one structural pattern, depending on whether the element:
• can or cannot contain text (+t in the first case, −t otherwise);
• can or cannot contain other elements (+s in the first case, −s otherwise);
• is contained by another element that can or cannot contain text (+T in the first case, −T otherwise).

By combining all these possible values, i.e., ±t, ±s, and ±T, we obtain eight core structural patterns, namely (each accompanied by a plausible example among the HTML elements):
1. inline [+t+s+T], e.g., the element em;
2. block [+t+s−T], e.g., the element p;
3. popup [−t+s+T], e.g., the element aside;
4. container [−t+s−T], e.g., the element section;
5. atom [+t−s+T], e.g., the element abbr;
6. field [+t−s−T], e.g., the element title;
7. milestone [−t−s+T], e.g., the element img;
8. meta [−t−s−T], e.g., the element link.

Instead of defining a large number of complex and diversified structures, the idea is that a small number of structural patterns is sufficient to express what most users need for defining the organisation of their documents. Therefore, the two main aspects related to such patterns are:
• orthogonality: each pattern has a specific goal and fits a specific context. It makes it possible to associate a single pattern to each of the most common situations in document design. Conversely, for every situation encountered in the creation of a new markup language, the corresponding pattern is immediately selectable and applicable;
• assemblability: each pattern can be used only in some contexts within other patterns. This strictness provides expressiveness and non-ambiguity in the patterns. By limiting the possible choices, patterns prevent the creation of uncontrolled and misleading content structures.

Such patterns allow authors to create unambiguous, manageable and well-structured markup languages and, consequently, documents, fostering increased reusability (e.g., inclusion, conversion, etc.) among different languages.
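The eight combinations listed above can be written down as a small lookup table. The Python sketch below is only an illustration of the classification: the names Traits, PATTERNS, EXAMPLES and pattern_of are hypothetical, and the example elements are the ones given in the list.

```python
from typing import NamedTuple


class Traits(NamedTuple):
    text: bool          # +t: the element can contain text
    elements: bool      # +s: the element can contain other elements
    mixed_parent: bool  # +T: the containing element can contain text


# One structural pattern per combination of (±t, ±s, ±T).
PATTERNS = {
    Traits(True, True, True): "inline",
    Traits(True, True, False): "block",
    Traits(False, True, True): "popup",
    Traits(False, True, False): "container",
    Traits(True, False, True): "atom",
    Traits(True, False, False): "field",
    Traits(False, False, True): "milestone",
    Traits(False, False, False): "meta",
}

# Example HTML elements for each pattern, as given in the list above.
EXAMPLES = {
    "inline": "em", "block": "p", "popup": "aside", "container": "section",
    "atom": "abbr", "field": "title", "milestone": "img", "meta": "link",
}


def pattern_of(text, elements, mixed_parent):
    """Return the name of the structural pattern for the given traits."""
    return PATTERNS[Traits(text, elements, mixed_parent)]


# A paragraph holds text and elements, and sits in a parent without text.
print(pattern_of(text=True, elements=True, mixed_parent=False))  # block
```

Because the three traits are independent booleans, the table has exactly 2 × 2 × 2 = 8 entries, and every element falls into one and only one of them, which is the orthogonality property described above.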
Also, thanks to the regularity they provide, it is possible to easily perform complex operations on pattern-based documents even when knowing very little about their vocabulary (automatic visualisation of documents, inferences on the document structure, etc.). In this way, designers can implement more reliable and efficient tools, can make hypotheses regarding the meanings of the document fragments, can identify singularities and can study the global properties of a set of documents, as described in Di Iorio et al. (2012) and Di Iorio et al. (2013).

HTML does not use the aforementioned patterns in a systematic way, as it allows the creation of arbitrary and, sometimes, quite ambiguous structures. To apply the structural pattern guidelines to RASH, we restricted HTML by selecting a subset of elements expressive enough to capture the typical components of a scholarly article while also being well-designed, easy to reuse and robust.

RASH: Research Article in Simplified HTML
The Research Articles in Simplified HTML (RASH) format is a markup language that restricts the use of HTML (http://www.w3.org/TR/html5/) elements to only 32 elements for writing academic research articles. It allows authors to use embedded RDF annotations. In addition, RASH strictly follows the Digital Publishing WAI-ARIA Module 1.0 (Garrish et al., 2016) for expressing structural semantics on the various markup elements used.

[Footnote 4: Please refer to the official RASH documentation, available at https://rawgit.com/essepuntato/rash/master/documentation/index.html, for a complete introduction of all the elements and attributes that can be used in RASH documents.]

[Footnote 5: The following prefixes are always mandatory in any RASH document:
• schema: http://schema.org/
• prism: http://prismstandard.org/namespaces/basic/2.0/]
All RASH documents begin as a simple HTML5 document (Hickson et al., 2014; see footnote 4), by specifying the generic HTML DOCTYPE followed by the document element html with the usual namespace (“http://www.w3.org/1999/xhtml”) and with additional (and mandatory) prefix declarations through the attribute prefix (see footnote 5). The element html contains the element head, for defining metadata of the document according to the DCTERMS (http://dublincore.org/documents/dcmi-terms/) and PRISM (http://www.prismstandard.org/) standards, and the element body, for including the whole content of the document. The element head of a RASH document must include some information about the paper, i.e., the paper title (element title) and at least one author, while other related information (i.e., affiliations, keywords and categories, included using the elements meta and link) is optional. The element body mainly contains textual elements (e.g., paragraphs, emphases, links, and quotations) for describing the content of the paper, and other structural elements (e.g., abstract, sections, references, and footnotes) used to organise the paper in appropriate blocks and to present specific complex structures (e.g., figures, formulas, and tables).
In the following subsection, we provide a quick discussion about usage patterns in RASH, and introduce the tools used for developing its grammar.

Development and patterns
The development of RASH started from the whole HTML5 grammar, and proceeded by removing and restricting the particular use of HTML elements, to make them expressive enough for representing the structures of scholarly papers and to have the language totally compliant with the theory on structural patterns for XML documents (Di Iorio et al., 2014) introduced in ‘Theoretical foundations: structural patterns’.
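Restricting HTML to a fixed set of elements makes a first conformance check straightforward. The sketch below is only an illustration: it hard-codes the element names listed in Table 2, assumes a well-formed XHTML input, does not look inside math and svg (whose content belongs to other vocabularies), and ignores the content-model and attribute rules that the actual RASH grammar also enforces. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names per structural pattern, as listed in Table 2.
PATTERN_ELEMENTS = {
    "inline": ["a", "code", "em", "math", "q", "span", "strong", "sub", "sup", "svg"],
    "block": ["figcaption", "h1", "p", "pre", "th"],
    "container": ["blockquote", "body", "figure", "head", "html", "li", "ol",
                  "section", "table", "td", "tr", "ul"],
    "field": ["script", "title"],
    "milestone": ["img"],
    "meta": ["link", "meta"],
}
ALLOWED = {name for names in PATTERN_ELEMENTS.values() for name in names}

# Assumption of this sketch: descendants of these elements are not inspected.
FOREIGN_ROOTS = {"math", "svg"}


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def disallowed_elements(xhtml_source):
    """Return the sorted names of elements that are not in the Table 2 set."""
    root = ET.fromstring(xhtml_source)
    found = set()
    stack = [root]
    while stack:
        element = stack.pop()
        name = local_name(element.tag)
        if name not in ALLOWED:
            found.add(name)
        if name not in FOREIGN_ROOTS:
            stack.extend(element)  # iterating an element yields its children
    return sorted(found)
```

The set built from Table 2 has 32 entries, which agrees with the number of elements stated for the format. A document that uses, say, div or h2 would have those names reported by disallowed_elements.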
The systematic use of these structural patterns is an added value in all stages of the documents’ lifecycle: they can be guidelines for creating well-engineered documents and vocabularies, rules to extract structural components from legacy documents, and indicators to study to what extent documents share design principles and community guidelines. All these characteristics have allowed us to simplify, at least to some extent, the handling in RASH of all the requirements introduced in ‘Introduction’ and ‘Which “Web-first” Format for Research Articles?’.

Table 2 The use of structural patterns in RASH.
Pattern      RASH element
inline       a, code, em, math, q, span, strong, sub, sup, svg
block        figcaption, h1, p, pre, th
popup        none
container    blockquote, body, figure, head, html, li, ol, section, table, td, tr, ul
atom         none
field        script, title
milestone    img
meta         link, meta

Table 2 shows the current pattern assignment for each element in RASH. Notice that we do not use two of the patterns presented in ‘Theoretical foundations: structural patterns’, namely atom and popup. The elements compliant with the former pattern are contained in discursive blocks (e.g., paragraphs) and contain only textual content with no additional elements. This is very infrequent in scholarly writing, since any element used for emphases, links, and other in-sentence elements can always contain additional elements (e.g., an emphasis can contain a link). The pattern popup is a different matter: it is meant to represent complex substructures that interrupt but do not break the main flow of the text, such as footnotes (Di Iorio et al., 2014). An element compliant with the popup pattern, while still not directly allowing text content inside itself, is found in elements with a mixed context [t+s+]. In particular, in developing RASH, we discussed which of the following two possible approaches for defining footnotes was more adequate to our needs. The first option was a container-based behaviour, also suggested by JATS (National Information Standards Organization, 2012) by means of the element fn-group and not included in HTML specifications, that allows the authors to specify footnotes (through the element ft) by using a tag that is totally separated from the main text from which it is referenced (usually through XML attributes), as shown in the following excerpt:
<!-- A paragraph referring to a footnote -->
In this paragraph there is an explicit reference to the
second footnote
This is a paragraph within a footnote.
This is a paragraph in another footnote.
All the footnotes are contained in a group , so as to collect them together.
In this paragraph the footnote That is
what we call popup -based behaviour !.
In this paragraph there is an explicit reference to the second footnote .
<-- The group containing all the footnotes -->This is the text of a footnote.
This is the text of another footnote.
Write here the reference entry.
RASH has been developed in order to allow anyone to add RDFa annotations to any element of the document.
...
In addition to RDFa, RASH makes available another way to inject RDF statements (Cyganiak, Wood & Lanthaler, 2014) into the document, by means of an element script (within the element head):
• with the attribute type set to “text/turtle” for adding plain Turtle content (Prud’hommeaux & Carothers, 2014);
• with the attribute type set to “application/ld+json” for adding plain JSON-LD content (Sporny, Kellogg & Lanthaler, 2014);
• with the attribute type set to “application/rdf+xml” for adding plain RDF/XML content (Gandon & Schreiber, 2014).
An example of the use of the script for Turtle and JSON-LD statements is shown in the following excerpt:
It is worth noticing that RASH does not require any particular vocabulary for introducing RDF statements, except three properties from schema.org (http://schema.org) for defining authors’ metadata (see the RASH documentation (https://rawgit.com/essepuntato/rash/master/documentation/index.html#metadata) for additional details). For instance, in this document (in particular, in its RASH version (https://w3id.org/people/essepuntato/papers/rash-peerj2016.html)) we mainly use CiTO (Peroni & Shotton, 2012) and other SPAR Ontologies (Peroni, 2014a) for creating citation statements about the paper itself, but alternative and/or complementary vocabularies are freely usable as well.

THE RASH FRAMEWORK
One of the issues we had to face, as anyone does when proposing a new markup language, was to provide tools for writing papers in RASH. It is undeniable that:
• not all the potential authors are able (or willing) to write scholarly articles in HTML, even within the Web community;
• not all the potential authors are able (or willing) to manually add semantic annotations, even within the Semantic Web community.
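The script-based way of embedding RDF statements described earlier (an element script whose attribute type is, for instance, “application/ld+json”) lends itself to tooling for authors who would rather not write annotations by hand. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the metadata in the example is invented, and the escaping of the characters <, > and & as JSON Unicode escapes is a design choice of this sketch so that the payload stays safe inside an XML-serialised document.

```python
import json


def jsonld_script(data):
    """Wrap a JSON-LD structure in a script element for the document head."""
    payload = json.dumps(data, indent=2, sort_keys=True)
    # These characters can only occur inside JSON strings, where a \uXXXX
    # escape is equivalent; escaping them keeps the markup well-formed.
    for char in "<>&":
        payload = payload.replace(char, "\\u{:04x}".format(ord(char)))
    return '<script type="application/ld+json">\n{}\n</script>'.format(payload)


if __name__ == "__main__":
    # Invented example data, for illustration only.
    example = {
        "@context": "http://schema.org",
        "@type": "ScholarlyArticle",
        "name": "A sample article title",
    }
    print(jsonld_script(example))
```

A parser reading the script content back obtains the original strings, since JSON decoders turn the escapes into the characters they stand for.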
The authorial activity of writing an article in RASH, or in any other new Web-first format, must be supported by appropriate interfaces and tools to reach broad adoption. A possible solution was to implement a native HTML authoring environment, so that authors did not have to deal directly with the new language. However, this solution would have forced all co-authors to use the same tool and would have introduced a variety of technical difficulties, since it is not easy to create and support a user-friendly and flexible work environment. We believe that a more liberal approach, which allows each author to keep using her/his preferred tools, even off-line, is more practical. This is the idea behind the RASH Framework (https://github.com/essepuntato/rash) (Peroni, 2017): a set of specifications and writing/conversion/extraction tools for writing articles in RASH. In this section, we give a brief description of all the tools we have developed in the framework. All the software components are distributed under an ISC License (http://opensource.org/licenses/ISC), while the other components are distributed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). A summary of the whole framework is shown in Fig. 1.

Figure 1 RASH Framework. The RASH Framework and its main components.

Validating RASH documents
RASH has been developed as a RelaxNG grammar (Clark & Makoto, 2001), i.e., a well-known schema language for XML documents.
All the markup items it defines are fully compatible with the HTML5 specifications (Hickson et al., 2014). In order to check whether a document is compliant with RASH, we developed a script (https://github.com/essepuntato/rash/blob/master/tools/rash-check.sh) that enables RASH users to check their documents simultaneously against the specific requirements in the RASH RelaxNG grammar and against the HTML specification through the W3C Nu HTML Checker (http://validator.w3.org/nu/). This will hopefully help RASH users to detect and fix any mistakes in their documents in a timely manner. This script also checks datatype microsyntaxes.
In addition to the aforementioned script, we developed a Python application (https://github.com/essepuntato/rash/tree/master/tools/rash-validator) that enables one to validate RASH documents against the RASH grammar. This application also makes available a Web interface for visualising all the validation issues retrieved in RASH documents.

Visualising RASH documents
The visualisation of a RASH document is rendered by the browser by means of appropriate CSS3 (http://www.w3.org/Style/CSS/specs.en.html) stylesheets (Atkins Jr, Etemad & Rivoal, 2017) and Javascript developed for this purpose. RASH adopts external libraries, such as Bootstrap (http://getbootstrap.com/) and JQuery (http://jquery.com/), in order to provide the current visualisation and include additional tools for the user. These include, for instance, the footbar with statistics about the paper (i.e., number of words, figures, tables and formulas) and a menu to change the actual layout of the page (see footnote 8), the automatic reordering of footnotes and references, the visualisation of the metadata of the paper, etc.
Note that this kind of automatic rendering of paper items, such as references to a bibliographic entry or a figure, reduces the cognitive effort of an author when writing a RASH paper. For instance, a piece of text referencing a table, e.g., “as shown in Table 2”, is created without caring about the particular text to specify for that reference (“Table 2” in the example), since RASH prescribes specifying just an empty link to the object one wants to refer to, as shown in the following excerpt:
For these objects, the Javascript scripts decide which is the most suitable text to put there according to the type of the item referenced.
8 The layouts currently available are Web-based and Springer’s Lecture Notes in Computer Science (http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0); the latter is based on the Springer LNCS CSS included in dokieli (http://dokie.li) (Capadisli et al., 2017).

Converting RASH into LaTeX styles
We spent some effort in preparing XSLT 2.0 documents (Kay, 2007) for converting RASH documents into different LaTeX styles, such as ACM ICPS (http://www.acm.org/sigs/publications/proceedings-templates) and Springer LNCS (http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0), among others.
We believe this is essential to foster the use of RASH within international events and to easily publish RASH documents in the official LaTeX format currently required by the organising committees of such events. Obviously, the full adoption of RASH or any other Web-first format would make these stylesheets unnecessary but, currently, they are fundamental for the adoption of the overall approach.

Producing RASH from ODT and DOCX
We also developed two XSLT 2.0 documents to perform conversion from Apache OpenOffice documents (https://github.com/essepuntato/rash/blob/master/xslt/from-odt.xsl) and Microsoft Word documents (https://github.com/essepuntato/rash/blob/master/xslt/from-docx.xsl) into RASH documents. The RASH documentation provides a detailed description of how to use Apache OpenOffice (https://rawgit.com/essepuntato/rash/master/documentation/rash-in-odt.odt) and Microsoft Word (https://rawgit.com/essepuntato/rash/master/documentation/rash-in-docx.docx) for writing scientific documents that can be easily converted to the RASH format. The standard features of these two editors (e.g., styles, document properties, etc.), elements (e.g., lists, pictures, captions, footnotes, hyperlinks, etc.) and facilities (e.g., mathematical editor, cross-reference editor, etc.) can be used to produce fully compliant RASH documents. A web-based service for converting documents online (presented in ‘ROCS’) and two Java applications for ODT (https://github.com/essepuntato/rash/tree/master/tools/odt2rash) and DOCX (https://github.com/essepuntato/rash/tree/master/tools/docx2rash) documents (that can be downloaded and used offline on the local machine) were developed to facilitate the conversion of Apache OpenOffice and Microsoft Word documents into the RASH format.
In the past few years, as a sort of alpha-testing, we have used these conversion approaches in many internal projects in the Digital and Semantic Publishing Laboratory of the Department of Computer Science and Engineering at the University of Bologna. Moreover, our co-authors and collaborators from different disciplines (e.g., business and management, humanities, medicine, etc.) have also successfully used this approach for producing their documents, giving us fruitful feedback, comments, and suggestions. In particular, we have been able to convert with reasonable success several ODT and DOCX files of research papers, PhD theses, documentation, and project proposals and deliverables.

Figure 2 ROCS. The architecture of ROCS.
ROCS
We created an online conversion tool called ROCS (RASH Online Conversion Service) (http://dasplab.cs.unibo.it/rocs) (Di Iorio et al., 2016) for supporting authors in writing RASH documents and preparing submissions that could be easily processed by journals, workshops, and conferences. ROCS integrates the tools introduced in the previous sections. The abstract architecture of the tool is shown in Fig. 2.
ROCS allows converting either an ODT document or a DOCX document, written according to specific guidelines, into RASH and, then, into LaTeX according to the following layouts: Springer LNCS, ACM ICPS, ACM Journal Large, PeerJ. Such guidelines, introduced in ‘Producing RASH from ODT and DOCX’, are very simple and use only the basic features available in Apache OpenOffice Writer and in Microsoft Word, without any external tool or plug-in.
9 The source code and binaries of SPAR Xtractor are available at https://github.com/essepuntato/rash/tree/master/sources/spar-xtractor and https://github.com/essepuntato/rash/tree/master/tools/spar-xtractor, respectively.
10 The prefix po: stands for the namespace http://www.essepuntato.it/2008/12/pattern#.
ROCS allows users to upload four kinds of file, i.e., an ODT document, a DOCX document, an HTML file compliant with RASH, and a ZIP archive which contains an HTML file compliant with RASH and related files (i.e., CSSs, javascript files, fonts, images). It returns a ZIP archive containing the original document plus all its converted versions, i.e., RASH, if an ODT/DOCX file was given, and the LaTeX file. The main advantage of having the paper both in RASH and in LaTeX is that it is fairly easy for RASH to be adopted by workshops, conferences or journals.
Since the program committee, the reviewers, and the editors also have access to a LaTeX or a PDF version of the paper, the RASH file is an addition that does not preclude any current workflow. Of course, the hope is that the inherent advantages of an HTML-based format such as RASH will eventually persuade stakeholders to adopt the HTML version whenever possible, keeping the alternatives as fall-back options.

Enriching RASH documents with structural semantics
Another development of the RASH Framework concerns the automatic enrichment of RASH documents with RDFa annotations defining the actual structure of such documents in terms of the FRBR-aligned Bibliographic Ontology (FaBIO) (http://purl.org/spar/fabio) and the Document Component Ontology (DoCO) (http://purl.org/spar/doco) (Constantin et al., 2016). In more detail, we developed a Java application called SPAR Xtractor suite (see footnote 9). SPAR Xtractor is designed as a one-click tool able to automatically add structural semantics to a RASH document. It takes a RASH document as input and returns a new RASH document where all its markup elements have been annotated with their actual structural semantics by means of RDFa. The tool associates a set of FaBIO or DoCO types with specific HTML elements. The set of HTML elements and their associations with FaBIO or DoCO types can be customised according to specific needs of expressivity. The default association provided by the current release of SPAR Xtractor is the following:
• the root html element is mapped to an individual of the class fabio:Expression (http://purl.org/spar/fabio/Expression). The class fabio:Expression identifies the specific intellectual or artistic form that a work takes each time it is realised;
• the body element is mapped to an individual of the class doco:BodyMatter (http://purl.org/spar/doco/BodyMatter). The class doco:BodyMatter is the central principal part of a document: it contains the real document content, and it is subdivided hierarchically by means of sections;
• p elements are represented as individuals of the class doco:Paragraph (http://purl.org/spar/doco/Paragraph), i.e., self-contained units of discourse that deal with a particular point or idea;
• figure elements containing the element img within a paragraph are represented as individuals of the class doco:FigureBox (http://purl.org/spar/doco/FigureBox), which is a space within a document that contains a figure and its caption;
• section elements are mapped to individuals of the class doco:Section (http://purl.org/spar/doco/Section), which represents a logical division of the text. Sections can be organised according to a variable level of nested sub-sections. Accordingly, SPAR Xtractor reflects this structural behaviour by representing the containment relation by means of the object property po:contains (http://www.essepuntato.it/2008/12/pattern#contains) (see footnote 10). For example, a certain section element with a nested section element produces two individuals of the class doco:Section (e.g., :section_outer a doco:Section and :section_inner a doco:Section) related by the property po:contains (e.g., :section_outer po:contains :section_inner).
In addition to these semantic annotations, which come from the actual structure of a document, the tool is also able to automatically detect sentences and annotate them as individuals of the class doco:Sentence (http://purl.org/spar/doco/Sentence). A doco:Sentence denotes an expression in natural language forming a single grammatical unit. For the sentence detection task, SPAR Xtractor relies on the sentence detection module of the Apache OpenNLP project (https://opennlp.apache.org/), which provides a machine learning based toolkit for the processing of natural language text. By default, SPAR Xtractor is released to support English only.
However, it is possible to extend it with new languages by adding their corresponding models for Apache OpenNLP, most of which are available with an open licence (http://opennlp.sourceforge.net/models-1.5/).
We remark that the object property po:contains is used for representing any kind of containment relation among the structural components that SPAR Xtractor deals with. Hence, the usage of this property is not limited to individuals of the class doco:Section. In fact, po:contains can be used, for example, for expressing the containment relation between a doco:BodyMatter and a doco:Section or between a doco:Section and a doco:Sentence.
For example, let us consider the following code snippets that provide a sample HTML document.
...
...This is a sentence. This is another sentence of this paragraph.
...
This is a sentence. This is another sentence of this paragraph.
...
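The default mapping and the use of po:contains described in this section can be illustrated with a short sketch. It is not SPAR Xtractor (which is a Java application and produces RDFa): it is a simplified Python approximation that returns Turtle-like triples, skips sentence detection, treats any figure with an img descendant as a doco:FigureBox, links each typed element to its nearest typed ancestor, and invents node identifiers such as :section_1. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Default element-to-class associations as listed in the text.
DEFAULT_TYPES = {
    "html": "fabio:Expression",
    "body": "doco:BodyMatter",
    "p": "doco:Paragraph",
    "section": "doco:Section",
}


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def type_of(element):
    """Return the FaBIO/DoCO class for an element, or None if unmapped."""
    name = local_name(element.tag)
    if name == "figure":
        # Simplification: any img descendant makes the figure a FigureBox.
        has_img = any(local_name(d.tag) == "img" for d in element.iter())
        return "doco:FigureBox" if has_img else None
    return DEFAULT_TYPES.get(name)


def structural_triples(xhtml_source):
    """Return (subject, predicate, object) tuples for types and containment."""
    root = ET.fromstring(xhtml_source)
    counters = Counter()
    triples = []

    def visit(element, parent_id):
        rdf_type = type_of(element)
        node_id = parent_id  # unmapped elements pass their ancestor through
        if rdf_type is not None:
            name = local_name(element.tag)
            counters[name] += 1
            node_id = ":{}_{}".format(name, counters[name])
            triples.append((node_id, "a", rdf_type))
            if parent_id is not None:
                triples.append((parent_id, "po:contains", node_id))
        for child in element:
            visit(child, node_id)

    visit(root, None)
    return triples
```

For a document with one section nested in another, the result includes two doco:Section individuals related by po:contains, mirroring the :section_outer and :section_inner example given in the text.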