key: cord-0432505-ipaqlu92
authors: Gualandi, Bianca; Pareschi, Luca; Peroni, Silvio
title: What do we mean by"data"? A proposed classification of data types in the arts and humanities
date: 2022-05-13
journal: nan
DOI: nan
sha: a269df7ea5f81ec8a7908c76be57d797af1bac63
doc_id: 432505
cord_uid: ipaqlu92

Objectives: We describe here the interviews we conducted in late 2021 with 19 researchers at the Department of Classical Philology and Italian Studies at the University of Bologna. The purpose has been to shed light on the definition of the word "data" in the humanities domain, as far as FAIR data management practices are concerned, and on what researchers think of the term. Methods: We invited one researcher for each of the disciplinary areas represented within the department and all 19 agreed to participate in the study. We divided participants into 5 main research areas: philology and literary criticism, language and linguistics, history of art, computer science, archival studies. The interviews were transcribed and analysed using a grounded theory approach. Results: A list of 13 research data types in the humanities has been compiled thanks to the information collected from participants; publications emerge as the most important one. The term "data" does not seem to be especially problematic, contrary to what has been reported elsewhere. Regarding current research and data management practices, methodologies and teamwork appear more central than previously reported. Conclusions: "Data" in the FAIR framework need to include all types of inputs and outputs that humanities researchers work with, including publications. Humanities researchers appear ready for a discussion around making their data FAIR: they do not find the terminology particularly problematic, while they rely on precise and recognised methodologies, as well as on sharing and collaboration. Future studies should include more disciplines to paint a more precise picture.

The start of the discussion around widening public access to research can be traced back to the Budapest Open Access Initiative (BOAI) […] FAIR is an acronym for Findable, Accessible, Interoperable and Re-usable (Wilkinson et al., 2016) and indicates a set of data management practices centred on machine actionability. In this scenario, the exact meaning of the word "data" is increasingly being discussed within the scholarly community. The use of this term, and the application of FAIR principles, within the arts and humanities domain are not without problems. As pointed out, among others, by Tóth-Czifra:

[…] by applying the FAIR data guiding principles to arts and humanities data curation workflows, it will be uncovered that contrary to their general scope and deliberately domain-independent nature, they have been implicitly designed according to underlying assumptions about how knowledge creation operates and communicates (Tóth Czifra 2019, p. 3).

In the same instance, the author calls for "iterated and large-scale surveys [...] to assess whether and to what extent the term data is still a dirty word" 3 (Tóth Czifra 2019, p.

The present work is a contribution towards this goal, and towards looking for a way forward for FAIR data principles within the arts and humanities without relying on assumptions drawn from other disciplines. This study addresses the following research questions:

1. How do we define "data" in the humanities?
2. What do humanities researchers think of the word "data"?
3. What are their attitudes towards open science?
4. What are current data management practices in the humanities?

Before approaching researchers, we analysed 5 studies involving the staff of different universities, at various levels and across different disciplines 4 . We decided to reuse part of the work done by Thoegersen (2018) because we particularly liked the questions and the structure of the interviews. They:

• begin by enquiring about the faculty members' current research projects,
• proceed to ask which research materials interviewees collect, generate, and use in their research, and how they manage them,
• introduce the term "data" only at the very end, asking whether interviewees consider this term applicable to their own research 5 .

We translated Thoegersen's questions into Italian and modified some passages. Table 1 lists our questions (translated back into English) and highlights the changes made, while also making the relationship with the research questions explicit. At the beginning of each interview, we asked for consent to record, for the purpose of transcribing, and to use the transcripts in the study. The recordings were transcribed and anonymised, editing out any content that could be used to identify the interviewees.

To analyse our interviews, we relied on a grounded theory approach (Glaser & Strauss, 1967) as it can generate rich descriptions, while at the same time rendering interpretive methods more codified and legible to a broad audience (Baumer et al., 2017). The grounded approach is not a codified method but includes several methodologies sharing common epistemological underpinnings (Glaser, 1992).

For the coding we used QualCoder version 2.8, a qualitative data analysis application written in Python 3 that uses an SQLite database to store coding data (Curtain 2022). The transcriptions, in the original Italian, and the GT coding are available for reuse on Zenodo (Gualandi et al., 2022).

In the analysis, we grouped participants into 5 main research areas:

• philology and literary criticism (12 participants) - includes all the researchers who have texts or documents as their main object(s) of study;
• language and linguistics (4) - includes all participants whose research is focused on languages, linguistic phenomena, and language teaching;
• history of art (1) - includes the only participant whose research revolves around art history;
• computer science (1) - includes the only participant who is a computer scientist;
• archival studies (1) - includes the only participant who focuses specifically on archives and archival documents.

To answer the first research question (How do we define "data" in the humanities?), we extracted all types of "research materials" mentioned by participants when answering questions 2 and 3a. They often used slightly different words or expressions to describe the same thing, so we normalised the terminology and then grouped materials into 13 different categories (e.g., webinars, conferences and exhibitions were all categorised as events -see also Table 2 below). We then used these categories to organise the coding so that all the codes pertaining to a specific data type could be grouped together.
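The normalisation and grouping step described above can be pictured as a simple lookup. The following is only a minimal sketch: the function name `group_materials`, the mapping table and the term "field notes" are hypothetical, and only the grouping of webinars, conferences and exhibitions under "events" comes from the text. The actual coding was done in QualCoder and is available on Zenodo.

```python
from collections import defaultdict

# Hypothetical lookup table: raw term used by an interviewee -> normalised category.
# Only the "events" grouping is taken from the text; everything else is illustrative.
TERM_TO_CATEGORY = {
    "webinar": "events",
    "webinars": "events",
    "conference": "events",
    "conferences": "events",
    "exhibition": "events",
    "exhibitions": "events",
}


def group_materials(mentions):
    """Group raw mentions by normalised category.

    Terms that are not in the lookup table are collected under
    "uncategorised" so that they can be reviewed by hand.
    """
    grouped = defaultdict(list)
    for term in mentions:
        key = term.strip().lower()
        grouped[TERM_TO_CATEGORY.get(key, "uncategorised")].append(term)
    return dict(grouped)


# "field notes" is a made-up mention used to show the fallback behaviour.
example = group_materials(["Webinars", "conference", "exhibitions", "field notes"])
```

Keeping unknown terms in a separate bucket, rather than discarding them, reflects the fact that new categories may still emerge while the interviews are being coded.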

As for the second research question (What do humanities researchers think of the word "data"?), we looked at the answers given to questions 7 and 8, using coding to categorise definitions and statements on data as "inclusive", when they described many different research materials as data, and as "restrictive", when they described data as something rather limited and precise. Both kinds of statements often coexisted, so we looked at each interviewee's answer separately, to weigh the different statements and rate their overall opinion on data on a scale from 5 (very inclusive, e.g., "everything is data") to 1 (very restrictive, e.g., "data are numerical"). A similar approach was taken to answer research question 3 (What are their attitudes towards open data and open science?), categorising statements as "mostly positive" or "mostly negative".
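As an illustration of how coded statements might be tabulated per interviewee before being weighed, here is a minimal sketch. The interviewee identifiers and the statements are placeholder data, not taken from the study, and the function name `tally_by_interviewee` is hypothetical. The 5-to-1 rating itself was assigned by judgement after reading each answer in full; this sketch only counts codes and does not compute a rating.

```python
from collections import Counter

# Placeholder data only: (interviewee id, code assigned to one statement).
coded_statements = [
    ("P01", "inclusive"),
    ("P01", "inclusive"),
    ("P01", "restrictive"),
    ("P02", "restrictive"),
]


def tally_by_interviewee(statements):
    """Count how many statements of each code every interviewee made."""
    tallies = {}
    for interviewee, code in statements:
        tallies.setdefault(interviewee, Counter())[code] += 1
    return tallies


tallies = tally_by_interviewee(coded_statements)
```

A table of this kind makes it easy to see where "inclusive" and "restrictive" statements coexist in the same interview, which is the situation that called for a qualitative rating rather than a simple majority count.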

Finally, in analysing current data management practices, we used GT coding to let information emerge organically from the interviews. Then, where possible, we grouped the codes into categories that loosely reflected the main topics raised through questions 3b to 6 (e.g., documentation, copyright and privacy, methodologies, etc.).

Research question 1: How do we define "data" in the humanities?

We extracted the "materials" study participants work with and we grouped them into 13 data types (see Table 2). […] understanding of how FAIR data management practices can be implemented in the arts and humanities 10 .

Research question 2: What do researchers think of the word "data"?

We asked each participant whether the research materials that had emerged earlier in the interview could be considered data:

• 14 researchers responded positively - of these, 2 expressed unease with the term, and 1 strongly objected,
• 2 researchers answered negatively,
• 3 said they did not know how to respond.

Interestingly, 12 researchers unwittingly used the term "data" before it was first introduced by us in questions 7 and 8 11 . Asked to define "data", most researchers (11) included several or all research materials in their definition, while a minority (5) defined them rather precisely as "information items", "raw materials", or "quantitative components". The most restrictive definitions seem to come from researchers belonging to the field of language and linguistics.

Research question 3: What are their attitudes towards open science?

Based on our inductive methodology, we counted 18 statements that we categorised as "mostly positive", and 10 that we categorised as "mostly negative". Across our interviews, positive attitudes towards open science appear prevalent.

Research question 4: What are current data management practices in the humanities?

We list below the most interesting aspects that emerged from the interviews:

• digital research tools and digital representations of cultural objects, especially manuscripts, are extensively used by study participants,
• the vast majority of them state that their discipline has a specific, recognised methodology,
• collaboration emerges as pivotal, not only in the shape of informal consultation with colleagues, but also of teamwork,
• data produced by our interviewees are mostly intended for a specialised audience (they can be freely available or not, depending on the type of data and funding),
• researchers store most of their data long-term, but only 2 of them mention using a repository,
• documentation and standards are familiar to a very small subset of participants.

After presenting the results of this study in a schematic fashion, in this section we are going to analyse them further, looking at how they fit into the current discussion around FAIR data management practices in the humanities. Where relevant, we will quote some passages from the interviews 12 .

12 All translations from Italian into English are ours.

The European Federation of Academies of Sciences and Humanities 13 (ALLEA) Working Group on E-Humanities, in its report "Sustainable and FAIR Data Sharing in the Humanities", states that the definition of data encompasses all inputs and outputs of research that are not publications (Harrower, Maryl et al. 2020, p. 6). It does recognise, however, that both texts and documents are data (Harrower, Maryl et al. 2020, pp. 8 and 14).

The CO-OPERAS Implementation Network (IN), which is part of GO FAIR 14 , aims at helping SSH communities:

build a bridge between SSH data and the EOSC, widening the concept of "research data" to include all of the types of digital research output linked to scholarly communication that are part of the research process. (OPERAS 2022)

In 2019-2020, they organised 5 workshops in Italy, Portugal, Germany, France, and Belgium, gathering more than 70 researchers from 32 different humanities and social sciences disciplines. They concluded that there was no consensus on what data in the humanities is and that any definition needs to "consider the process of data production, going as deep as possible in research practices" and be "linked to the workflow of knowledge creation" (CO-OPERAS & Giglia 2020, p. 1).

The OPERAS-P project 15 , working in synergy with the CO-OPERAS IN, released a "Report on innovative approach to SSH FAIR data and publications". The document consistently uses the expression "SSH data and publications" and confirms the importance of publications in the humanities, but its aim is ultimately to move from a definition of data based on data types and disciplines to the management of objects related to specific communities and aims (Avanço, Giglia and Gingold 2021, p. 9).

A literature review 16 conducted to investigate the definition of data in the SSH indicated that "every informational data is technically data" (Avanço, Giglia and Gingold 2021, p. 8):

The intersection between information science and technology has changed the status and the meaning of data: anything that can be transposed into binary bits constitutes data that in turn feeds and supports the entire digital environment. At first strictly scientific as a concept and a […]

Finally, the 2021 OPERAS "Report on the Future of Scholarly Communication" states that, within the FAIR framework, everything is data, or could be, with the concept of data "intended to be as universal as possible, including datasets, publications, software, etc." (Avanço, Balula et al. 2021, p. 22). At the same time, it remarks that FAIR awareness is still low in SSH, and that the SSH research environment is greatly diverse and complex (Avanço, Balula et al. 2021, p. 23).

We will quickly mention two attempts at classifying research data in the humanities.

First, the 2014-2015 survey conducted on print and electronic dissertations in SSH submitted to the University of Lille 3 between 1987 and 2013 (Prost, Malleret and Schöpfel 2015) . The authors found that 66% of the dissertations were accompanied by data, and that differences in data types were "more related to disciplinary methodologies than to support" (Prost, Malleret and Schöpfel 2015, p. 14) . They opted for "a large and pragmatic definition" of the term and found these data types:

As for our study, we extracted from the interviews all the materials that researchers report producing and using in their projects and categorised them into 13 different data types. Please see Table 2, above, for an overview, and the next paragraphs for more insight into what each category (highlighted in bold) includes.

(i) Publications are by far the most mentioned material 20 . Across the 21 projects described by our interviewees, the following types of publication have been produced (a single project may produce more than one):

• critical editions (8) 21
• journal articles (8)
• monographs or edited books (4)
• essays (3)
• reading editions (2).

Further, publications are used in research both as secondary sources (literature reviews on the subject and sources, reference texts, etc.) and as primary sources (novels or other literary texts).

Apart from publications, the most used (ii) "primary sources" 22 are:

• ancient manuscripts and early printed books (9)
• modern and contemporary unpublished materials 23 (4)
• archival documents (3)
• monuments, artworks and artifacts 24 (3).

Their (iii) digital representations include facsimiles, photographs and 3D models. Almost half of the study participants use them, although inspecting the objects in person is still considered fundamental. Of the 4 projects that produced a digital representation of some cultural object or event (e.g., videos), only 2 published them for the wider public.

19 These two categories were proposed in the discussion but ultimately excluded (Edmond 2019, pp. 2-3).
20 These correspond to Edmond's "print paradigm publications" (2019, p. 2).
21 Bear in mind that interviewees are mostly philologists and literary critics. We have not considered critical editions as monographs because of their fundamental differences (the edition of a critically reconstructed text in one case, the discursive discussion of a research topic in the other); it is worth noting that no philologist in our group worked on or with digital scholarly editions.
22 Corresponding to Edmond's "single or collected/curated primary sources" (2019, p. 2).
23 Such as drawings, notes and letters.
24 E.g., tablets, vases.

Six interviewees report having organised (iv) events 25 (some projects have produced more than one event):

• conferences (4)
• exhibitions (2)
• webinars (2)
• guided tour (1)
• teacher training (1).

We then have at least 3 different types of "electronic paradigm publications" (Edmond 2019, p. 2):

• (v) websites - describe and/or publicise the project, limited user interaction
• (vi) digital infrastructures - allow complex user interactions 26
• (vii) databases, catalogues and other search tools - widely used to study and locate primary sources 27 .

All 3 require (viii) software to be written and executed, which stands in a category of its own, both in Edmond's classification (2019, p. 2) and in ours.

Finally, we found these additional data types:

• (ix) documentation, consisting of deliverables for funders (2) and project notes (1)
• (x) standards - languages or other conventions officially recognised as the norm in a certain research field (e.g., for interoperability)
• (xi) corpora, a category specific to linguistics research
• (xii) personal data 28 , which are always protected and never shared
• (xiii) "born-digital artefacts" - by-products of users' online interaction with digital infrastructures (e.g., tags, associations, texts) that can be used as input for further research.

25 We defined events as one-off gatherings of people organised as a result of a research project, to share ideas, offer training, or present something to the public.

These 13 data types represent the "building blocks" of research in the humanities, as they emerged from the interviews. If we are looking to apply FAIR principles, these are the types of data we need to start reflecting on and developing infrastructures for.

Publications emerge as the most important type, in terms of research output but also in terms of input. In the humanities, open access to scientific literature seems to be an integral part of open data policies.

It is important to also note that, when we asked participants to define "data" […]

Humanities researchers and the term "data". How problematic is it?

We have mentioned earlier that Tóth-Czifra (2019, p. 3) points out that FAIR principles "have been implicitly designed according to underlying assumptions about how knowledge creation operates and communicates". The main assumptions are: that scholarly data or metadata is digital by nature, that scholarly data is always created and therefore owned by researchers, and that there is "a wide community-level agreement on what can be considered as scholarly data" (Tóth-Czifra 2019, p. 3). This is not the case in the humanities, where most scholarly data and metadata are held in galleries, libraries, archives, and museums: only a fraction of them has been digitised, and there are issues around their ownership. Also, data is a problematic concept in the humanities, with only a small community currently involved in data sharing and curation 30 (Tóth Czifra 2019, pp. 22-23).

Edmond describes how "many humanists resist the term 'data' as a descriptor for their primary and secondary sources, or indeed for almost anything they produce in the course of their research", because they "already have a much richer and more nuanced vocabulary" and "the manner in which the term 'data' is deployed in disciplines that are primarily data-driven may also be a part of the hesitation" (Edmond 2019, p. 6).

29 While we recognise workflows have a specific connotation, as an extremely systematic and streamlined representation of a methodology, we employ the two terms interchangeably here.
30 For lack of incentives, on the one hand, and for "the lack of social life of data", on the other (Tóth Czifra 2019, p. 5).

[…] Twin City's College of Liberal Arts whether they worked with "data" and 54% of the respondents had said they did, while 46% had preferred the expression "research materials" (Hofelich . In 2018, at another undisclosed US university, the majority of study participants 31 had stated they considered most of their "research materials" to be "data" (Thoegersen 2018).

When we asked the 19 participants in our study to think back to the materials listed earlier in the interview, and whether they would consider any of them to be data, 13 of them responded "yes", 2 answered "no", and 3 did not know whether to consider their own research materials as data. Among those who answered "no" or "I do not know", the concept of data was perceived as linked to the digital, or technological, sphere. Interestingly, 12 researchers used the term "data" well before it was introduced by us in question 7 32 .

Two interviewees expressed unease with the term "data". They stated that it is difficult to define, and that trying to do so inevitably brings up more questions 33 . Another objected strongly to the use of the word, stating that it is generic and useless, in addition to being reductive and linked to the commodification of research. Even so, we found less resistance to the term "data" than we would perhaps have expected.

When asked to define "data", most participants seemed to go back and forth between restricting the term to something "circumscribed, objective and definable" 34 , and taking a more inclusive view:

• publications are data 35 ;
• methodology and notes are data 36 ;
• interpretation and research results are data 37 ;
• everything is data 38 .

31 Exclusively from history, English, languages, and philosophy.
32 In the term "database" (6); referring to sensitive and personal data (2); referring to data gathering (2); several times in different contexts (2).
33 […] I'm answering your question or rather asking new ones".
34 E.g., "I have this idea of data as something circumscribed, objective and definable, but I don't know if it's correct"; "data are single information items referring to a problem"; "data are the raw material of empirical research".
35 E.g., "a scientific article can be considered data, of course it can be considered as data".
36 E.g., "methodology is data; and even better if it is innovative compared to those used beforehand, to show it is indeed data, intellectual energy sources that were previously not considered"; "reading a passage according to a certain critical framework, as proposed in an essay or an article is data, assuming that is done according to a serious methodology".

Considering each interviewee's answers in full, we weighed the different statements they made and rated their overall opinion on data on a scale from 5 (very inclusive definition, "everything is data") to 1 (very restrictive, "data are numerical and objective"). The results are shown in the table below.

Rating | Definition of data | Number of researchers
[…]
3 | open, tends to include either one among primary sources, interpretation, methodologies, research results | 5
2 | somewhat restrictive, but can include either primary sources, critical texts, and/or textual variants | 2
1 | restrictive: "information items", "raw materials", "quantitative components" | 3
n/a | no reply or could not confidently assign a rating | 3

Overall, most researchers gave a somewhat inclusive definition (11), while a minority (5) defined them restrictively as "information items", "raw materials", or "quantitative components" 39 . Two researchers openly stated they had changed their minds during our conversation and broadened their definition of data 40 .
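The totals above can be cross-checked against the per-rating counts with a few lines of arithmetic. This is a minimal sketch under two assumptions that are ours and not stated in the text: that the row described as "open" corresponds to rating 3, and that ratings 5, 4 and 3 together make up the 11 "inclusive" definitions while ratings 2 and 1 make up the 5 "restrictive" ones. The counts for ratings 5 and 4 are not itemised here, so only their sum is derived.

```python
TOTAL_PARTICIPANTS = 19

# Counts reported for the lower part of the scale
# (assumption: the "open" row corresponds to rating 3).
reported = {3: 5, 2: 2, 1: 3, "n/a": 3}

inclusive_total = 11   # assumed to cover ratings 5, 4 and 3
restrictive_total = 5  # assumed to cover ratings 2 and 1

# Ratings 5 and 4 are not itemised; only their combined count can be derived.
ratings_5_and_4 = inclusive_total - reported[3]

# Consistency checks on the reported figures.
assert reported[2] + reported[1] == restrictive_total
assert inclusive_total + restrictive_total + reported["n/a"] == TOTAL_PARTICIPANTS
```

Under these assumptions the figures are internally consistent: the restrictive ratings add up to 5, and the inclusive, restrictive and unrated groups together account for all 19 participants.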

Some researchers within the group expressed concerns around:

• the cost of article/book processing charges in open access publishing (1 researcher);
• the new rules around EU-funded projects, particularly the speed at which changes are being implemented and the imposition of open access for monographs, which limits researchers in their choice of a publisher (3);
• the difficulty that less methodologically innovative research projects encounter in accessing funding in the current landscape (3).

37 E.g., "the results or research can indeed be considered data"; "I understand data both as data emerged from research and as hypotheses"; "my judgment on the authenticity of a source can be data".
38 E.g., "In my opinion everything is data, everything I produce, handle, reuse - for me they are all data"; "In my case they are all data, the first look I take at a new manuscript, […] the reasoning on the stemma codicum to understand whether two manuscripts depend on a sub-archetype; everything I obtain from a serious, in-depth investigation is data (but it is a generic term, and as such it is useless)".
39 Interestingly, the most restrictive definitions all seem to come from researchers belonging to the field of language and linguistics.
40 In one case, the interviewee gave a restrictive definition of data as "objective" and "raw"; later, after reflecting on open data policies, they acknowledged that a wider view may also be taken. The other researcher began by considering only research results as data, but then added primary sources to the definition.

Others reported a positive attitude towards open data and open science, expressing:

• a willingness to share research data and a desire to "give back" to society the results of research (4);
• a desire for support from universities and public institutions, especially regarding research infrastructures, eliminating the need to rely on commercial third-party services (2);
• concerns about traditional publishers protecting academic works excessively and in fact preventing the circulation of research results (2);
• a desire to see primary sources digitised and made available for research (2).

Across all interviews, we counted 18 statements that we categorised as positive, and 10 that we categorised as negative. Positive attitudes therefore seem prevalent among participants. We will conclude by quoting 1 researcher who made an important point:

There can be a sense of fatigue for humanists, 'gosh I have to share my data', because to do that I have to follow a standardised schema. […] It is extra work, and I feel we should understand that, if that is the case, all projects need a bit more workforce and even people with specific expertise. We have very small projects.

These points echo those made several times across the literature (Tóth Czifra 2019; Edmond 2019; Harrower, Maryl et al. 2020): that it is necessary to work on current evaluation practices to make data curation worthwhile, and that institutions need to provide support to humanities researchers if we are to realise the FAIRification of research data (GO FAIR 2022).

When asked whether their work is based on a precise methodology, recognised across the field, 11 of the 12 philologists and literary critics in our sample immediately replied "yes" 47 . We found a multiplicity of methodologies in this area, which are reportedly mixed and matched according to the literary work under study 48 . One researcher highlighted what, in their view, is a fundamental difference between philology and criticism:

I am a philologist, so I'm the closest to science. Criticism is further away, there clearly is more freedom of choice […]. Those who work in philology are closer to scientists and move in a less extemporaneous way, we have templates 49 .

42 "The problem of image copyright is a catastrophe because the cost of one single image can derail any budget. […] A photograph can cost 200 €, and you can imagine [what happens] if one needs seven or eight of them".
43 E.g., "Digitisation of archival documents is really behind […]". "[…] only a small fraction of these materials is digitised, and online catalogues are also lacking". "[…] human resources are missing, competences are missing. But for us doing this kind of research it would be fundamental, it would be a turning point."
44 E.g., "This job has deeply changed in these last years. It is faster to find anything, you do not have to visit a thousand libraries".
45 "Our idea is to digitise everything and do way more advanced research".
46 E.g., "The digital part is very important and can be really useful, even to structure the edition and show things, but then it is necessary to keep printing and reading [on paper], not only for the pleasure of it, but because some in-depth analysis can only be done on that kind of support".
47 E.g., "Absolutely, the discipline has its standards that are shared, even very formal standards because classical philology is really interested in the formal aspects"; "We also have a method, some forget about it, but a 'bad' article is an article that has not followed the method, that advances hypotheses that cannot be demonstrated, that has logical loopholes".
48 Some belong to the philology of copy (centred on the text and on a multiplicity of witnesses, includes Lachmann's method, stemma codicum, etc.), some to authorial philology (single witness by the hand of the author), and so on.

As far as the description and formalisation of these methods are concerned, some participants indicated that these are found in university manuals, as well as in preceding works that have become "milestones"; others described methodologies as "deposited through practice".

Critical editions are described as deeply formalized research outputs 50 . They have a precise structure, although there are some "margins to experiment" since the primary sources being edited are extremely varied.

Language and linguistics scholars (4) also describe different methodologies, corresponding to different approaches 51 , together with a combination of qualitative and quantitative methods; 1 of them stated that best practices are often more important than theoretical considerations, due to the applied nature of the discipline.

Here, too, the formalisation of methodologies is scarce 52 . Linguistic data is rarely (but increasingly) shared together with aggregated results, and this hesitation is linked to privacy 53 concerns and the extra work involved.

The observations made by the 1 art historian in our group can hardly be considered representative but are nonetheless interesting. Here, too, there are different approaches, while audio-visual materials are of course particularly important. In addition, "visual philology" provides a "fundamental methodological basis":

I consider fundamental the aspect that I would call our philology, the evaluation […], it is a real visual analysis […]. I believe it is the fundamental basis of our discipline, we all share the same templates to recognise different styles, different languages […]. Because "visual philology" is proper philology, otherwise it is easy to make mistakes.

49 Interestingly, the only researcher in this disciplinary area who stated that they do not follow a method is indeed a literary critic, who questioned the very idea of a method: "I do not believe in a preconstituted method, I come from a school that did not believe in it […] so I believe that even the sciences do not have a pre-constituted method, that big discoveries happen through infringing on a method."
50 E.g., "Within a critical edition I expect to find the text presented in a specific way, an apparatus compiled in a specific way, and I know the type of information I'm going to find there". "[…] of course, there are parameters to evaluate if a critical edition is scientifically sound or not".
51 "[…] in psycholinguistics and neurolinguistics […] the protocols are close to those of the experimental sciences in the medical field […] historical linguistics has more traditional methods, only partly codified."
52 "It is still left to the good will of individual scholars". "I wouldn't say there is a standard […] but there certainly are recurring elements".
53 "The materials, the recordings, are interesting but I cannot ask [participants] to sign a blanket consent […] because I would risk decimating the, already few, people who are enthusiast to participate […] the dissemination will thus be limited".

The 1 researcher working in archival studies stated that shared methodologies exist but are not clearly formalised, one of the reasons being that all archives are unique, and some have been built over centuries.

Finally, the 1 computer science researcher in the group stated that there are of course clear methodologies and paradigms, which, again, are mixed and matched according to each project's needs. Methodological choices are documented in software documentation and deliverables, helping the reuse and modification of pre-existing code.

Collaboration seems to be pivotal in our sample of researchers. It takes place through email and telephone communications, cloud storage and file-sharing services (often Google Drive), or other specialised software (e.g., for corpus creation).

In many cases (9/19), it simply entails consulting other researchers, within the same disciplinary field, or not. This type of collaboration, informal and not recognised within academia beyond the occasional acknowledgment within the final publication, is described by 1 of our interviewees as "not 'real' sharing" because it comes down to asking for feedback and suggestions rather than opening up to real collaboration 54 .

However, 12 researchers in our sample, across all disciplinary areas, stated that they work, or have worked, in a team with others 55 . A couple of them mention that they "like working alone" or that "philologists often work alone", but they still described teamwork as "extremely useful and productive". Only 1 researcher, a literary critic, stood out from the rest of the sample by stating:

The fear that research ideas might be stolen before publication is mentioned by 5 interviewees in total 56 , 2 of whom explicitly link it to research evaluation 57 .

54 "[…] sharing is something else, is letting somebody else into your own research project".

55 Mostly colleagues, sometimes PhD and master's students.

56 E.g., "It is better to keep the unpublished archival original to yourself, otherwise somebody can publish it before you". "The material is shared with very few people and only partially, not in its entirety, to protect the originality of the research and its novelty, which has to be made public only when the research is finished and complete".

57 "I would share research results with anybody but [...] as long as hiring, career progression and evaluation within universities [...] keep happening with the current parameters I am forced to use copyright protection and keep results inaccessible"; "I like putting anything in common, but we face an issue of intellectual property since we mostly publish single-authorship articles, and our career is based on publications".

Specialised content, often inaccessible

Unsurprisingly, data produced by our interviewees are mostly intended for a specialised audience. Some are intended for a broader public (e.g., reading editions, websites, exhibitions and tours), while others are created with a very specific audience in mind, such as the project team (e.g., deliverables, project notes, personal data) 58 . Table 4 provides a summary assessment of the accessibility of these data, as derived from the interviews.

Table 4

| Research output | Assessment on accessibility in our sample |
| --- | --- |
| software; standards; websites | freely available online |
| born-digital artifacts (e.g., tags, associations); catalogues, databases and other search tools; digital infrastructures (e.g., mobile apps, web platforms); documentation | usually freely available online but access may be restricted |
| digital representation of cultural objects (e.g., facsimiles, photos) | if published at all, they are usually freely available online; otherwise, permission to publish must be sought and can be expensive |
| "primary sources" different from publications (e.g., manuscripts, artworks) | mostly held in conservation institutes; depending on rarity and state of conservation, they can be accessible to the public, or only to scholars under supervision |
| events (e.g., conferences, exhibitions) | may require registration and relevant specialisation (unfortunately, we did not gather any information on this point during the interviews) |
| publications | mostly published in closed access |
| personal data; corpora | for internal use only |

If we look in more detail at publications by our participants, we see that:

• the 8 critical editions and 2 reading editions mentioned in the interviews are published on paper, in closed access, and with a traditional publisher,

• only 1 out of 4 edited books has been published in open access, at the express wish of its authors,

• 2 out of 8 projects producing journal articles made them available in open access, according to funding requirements.

58 In most of the cases brought to our attention by the study participants, this depends on privacy concerns.

Finally, corpora produced by researchers we interviewed are associated with and/or contain personal and sensitive information and are not made available to anyone outside of the project team, due to privacy concerns. Our interviewees recognised the value of making these primary sources available to the research community and stated that the discipline is moving in that direction.

We have already mentioned that image copyright can be problematic, while a couple of researchers also lament the different copyright laws existing in different countries, which complicate international collaborations. Another participant describes the extra work done to avoid dealing with copyright issues 59 .

As far as privacy is concerned, projects dealing with personal data in connection with linguistic corpora or with born-digital artifacts have opted to restrict access to them and/or to have an advisor for privacy and ethical issues.

All researchers but 1 stated that they save most of the data they use and produce during a research project because they are potentially useful both in teaching and for further research 60 . However, only 2 participants mention using an (open) repository, in both cases Zenodo, while they all seem to be storing digital data on their own devices, using external hard drives, or on commercial cloud storage services 61 .

When asked whether the materials mentioned in the interview had any accompanying documentation, only 4 researchers responded affirmatively 62 . They mentioned:

• software documentation (1),

• deliverables for funders (2),

• documentation shared among the project participants (1) 63 .

The situation is not dissimilar if we look at standards. At the time of the interview, we chose not to define the word, leaving researchers free to interpret it as they pleased.

Most participants (11) understood it in a broad sense and proceeded to describe how their discipline does indeed have "standard" methodologies, recognised across the field (see the next section). Two researchers mentioned:

• standards such as JSON, RDF, XML and OWL in one case,

• cataloguing and technical standards to ensure interoperability between databases in the other 64 .

Unsurprisingly, both these data types are familiar only to a small subset of researchers, who do not belong to the philology and literary criticism group.

The discussion around what data is might seem redundant or pointless. At the same time, stretching the definition of data to also include publications, workflows or indeed anything researchers work with may seem dangerously vague and ultimately useless. But if we talk about FAIRifying research data, then this concept must include all materials used and produced in humanities research. We need to start looking closely at how researchers work day-to-day, "going as deep as possible in research practices" and linking the definition of data "to the workflow of knowledge creation" (CO-OPERAS & Giglia 2020, p. 1).

That is what we have tried to do through our interviews. Of course, this study has limitations: the sample of participants is small and represents only a sub-set of research areas in the arts and humanities. It should be replicated on a larger scale and include other disciplines to paint a picture of the humanities at large. However, to transition to open science, it is extremely important to engage academic communities (Armeni, Brinkman, et al. 2021), however small, and a project of this kind can indeed serve this purpose well.

63 One researcher talked about database tables containing metadata and constituting the main documentation of the project. This observation, although interesting, uses the term "documentation" in a slightly broader sense than the one we are adopting here: it refers to metadata rather than "paradata", or data describing the methods of data generation/collection.

64 Of the others, 3 interviewees simply responded there are no standards in their field, and 3 mentioned archival and museum standards in relation to the primary sources of their research.

We were able to recognise at least 13 data types we need to reflect on and develop infrastructures for. Publications emerged as the most important one, hinting at how open access to scientific publications and open data policies are strictly interconnected in the humanities.

We have encountered more than once the idea that humanists find the term "data" problematic, but we did not find it to be the case for our interviewees, with only 3 researchers expressing unease with or resistance to the term. Most researchers (11) also adopted an inclusive definition of "data", while a minority (5) defined them exclusively as "information items", "raw materials", or "quantitative components".

Attitudes towards open science are mostly positive across our pool of researchers. But, as stated by a handful of interviewees and as emerged clearly from the literature (CO-OPERAS, Bertino and Tóth-Czifra 2020; Edmond 2019; Tóth Czifra 2019), it is necessary to work on current evaluation practices, to make data curation worthwhile, and to support humanities researchers in the process of making research data FAIR.

Regarding other aspects of data management, our results mostly confirm what emerged from previous studies. One notable exception concerns the willingness to share research data with colleagues: we found collaboration to be extremely important for our interviewees, both in the shape of informal sharing (9/19) and teamwork (12/19).

Finally, when we asked researchers whether their work is based on a precise methodology, recognised across the field, the vast majority replied positively.

Although we are a long way away from workflows that are "explicit […] documented and, therefore verifiable" (Edmond 2019, p. 15), our analysis suggests that researchers might be ready to reflect on how to "characterize humanities workflows" and to "identify how such characterizations can be made useful" (Fenlon, 2019, p. 512).

Disciplinary differences in faculty research data management practices and perspectives

Towards wide-scale adoption of open science practices: The role of open science communities

Future of scholarly communication. Forging an inclusive and innovative research infrastructure for scholarly communication in Social Sciences and Humanities

OPERAS-P Deliverable D6.3: Report on innovative approach to SSH FAIR data and publications (DRAFT). Zenodo

Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence

CO-OPERAS-SSHOC -Research data in the SSH -wksp report -Goettingen 30012020

CO-OPERAS-SSH FAIR data -wksp report -Brussels 03032020

QualCoder version 2

Horizon Europe, open science. Early knowledge and data sharing, and open collaboration

Digital technology and the practices of humanities research

European research area (ERA)

About EOSC

Interactivity, distributed workflows, and thick provenance: a review of challenges confronting Digital Humanities research objects. 15th International Conference on eScience. eScience

FAIRification process

Basics of grounded theory analysis: Emergence vs forcing

The discovery of grounded theory: strategies for qualitative research

What do we mean by "data" in the arts and humanities? Interview transcripts (University of Bologna, FICLIT) and GT coding

Sustainable and FAIR data sharing in the Humanities: recommendations of the ALLEA working group E-Humanities. Digital Repository of Ireland

When data is a dirty word: a survey to understand data management needs across diverse research disciplines

Data management needs assessment -surveys in CLA, AHC, CSE, and CFANS

Berlin declaration on open access to knowledge in the sciences and humanities

Hidden treasures: opening data in PhD dissertations in social sciences and humanities

"Yeah, I guess that's data": data practices and conceptions among humanities faculty

The risk of losing thick description: data management challenges Arts and Humanities face in the evolving FAIR data ecosystem

Creating and analyzing multilingual parliamentary corpora: research data management workflows volume

The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018