Editorial: Journal Updates and a Call for Editors

Issue 55, 2023-01-20

Journal updates, recent policies, and a call for editors.

This is my second and last time serving as coordinating editor for the Code4Lib Journal. After seven years on the Editorial Committee, I am rotating off the committee to focus on other research projects.

Code4Lib has played a big part in my career. In 2012, I published my first article in the journal. After attending my first Code4Lib conference at North Carolina State University in 2014, funded by a Code4Lib diversity scholarship, I wanted to get more involved with this wonderful and supportive community. Since then, I have been co-convener of the local New York City chapter of Code4Lib, presented at two national pre-conferences, and served on a couple of Code4Lib national conference committees. Of all these Code4Lib-related activities, working with the Editorial Committee (EC) has been the most rewarding. I have learned a great deal from my fellow committee members, and for that I am immensely grateful. That learning covers everything from copy editing and the article review process to communicating and collaborating with authors and, most especially, managing a journal.

I would like to share two recent developments with the journal: a guest editorial policy and a retraction policy.

The EC has implemented a guest editor policy. Editorial members have a wide skill set reflective of library coders and technologists; however, some of the articles that we review are beyond our scope of expertise. In those situations, we feel it necessary to consult with experts outside of the EC. The guest editor policy is in place to make the roles of the author, the guest editor, and readers in the review process clear.

A retraction policy has also been implemented. This policy was developed so the EC can withdraw articles that may include work that violates ethical standards or may be unreliable. Retractions are not to be taken lightly, and as such, the journal will inform readers why an article was retracted. This becomes another part of the article's lifecycle post-publication.

Since there is now an opening on the Editorial Committee of the Code4Lib Journal, please respond to this call for editors. If you are interested in reading and learning about library information technology, as well as being part of a great team of editors, this is an excellent opportunity. Candidates from diverse communities are highly encouraged to apply.

I believe that every issue of the Code4Lib Journal has practical applications for almost any library, archive, museum, or other related space, and this issue is no exception. This issue includes:

A Fast and Full-Text Search Engine for Educational Lecture Archives outlines the development of a search engine for educational videos using Python in India.

Click Tracking with Google Tag Manager for the Primo Discovery Service explores how to track open access content through Unpaywall links.

Creating a Custom Queueing System for a Makerspace Using Web Technologies is a case study on streamlining a makerspace's queue process.

Data Preparation for Fairseq and Machine-Learning Using a Neural Network details sequence-to-sequence models and how, with appropriately formatted datasets, they can be applied to a variety of applications.
Designing Digital Discovery and Access Systems for Archival Description compares archival and bibliographic description and discusses the challenges of using discovery systems for born-digital materials.

DRYing Our Library's LibGuides-Based Webpage by Introducing Vue.js investigates how to streamline redundant HTML code in the popular LibGuides web content management system.

Revamping Metadata Maker for 'Linked Data Editor': Thinking Out Loud looks at using and evaluating the catalog record creation tool with linked data sources.

Using Python Scripts to Compare Records from Vendors with Those from ILS examines the use of Python to identify and synchronize out-of-sync vendor and ILS catalog records.

Data Preparation for Fairseq and Machine-Learning Using a Neural Network

Issue 55, 2023-01-20

This article aims to demystify data preparation and machine-learning software for sequence-to-sequence models in the field of computational linguistics. The tools, however, may be used in many different applications. We detail what sequence-to-sequence learning looks like using code and results from different projects: predicting pronunciation in Esperanto, predicting the placement of stress in Russian, and how open data like WikiPron (pronunciation data mined from Wiktionary) makes projects like these possible. With scraped data, projects can be started in automatic speech recognition, text-to-speech tasks, and computer-assisted language-learning for under-resourced and under-researched languages. We explain why and how datasets are split into training, development, and test sets, and discuss how to add features (i.e., properties of the target word that may or may not help in prediction). By scaffolding the tasks and using code and results from these projects, we hope to demystify some of the technical jargon and methods.

By John Schriner

Introduction

There are many tools in the field of natural language processing (NLP) and computational linguistics that help us to understand language better; find patterns that we cannot perceive; find word collocations (i.e., words that commonly occur near other words); improve text-to-speech; perform text summarization; perform information extraction; provide sentiment analysis; and perform machine translation. Some of these tools are the user-friendly, web-based Voyant Tools,[1] the Python software platform Natural Language Toolkit (NLTK),[2] and the Praat[3] phonetic software for examining sound. The NLP tool Linguistic Inquiry and Word Count (LIWC)[4] is a psycholinguistic black-box[5] tool that can provide sentiment analysis, language style matching, and many other metrics using over 100 dimensions of text. LIWC has been widely used for decades, is dictionary-based, and does not involve machine learning.
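As a concrete taste of these tools, the short sketch below uses NLTK, mentioned above, to surface word collocations in a plain-text corpus. It is a minimal illustration, not drawn from any of the projects discussed here, and the file name corpus.txt is a placeholder.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt")  # tokenizer models; only needed once

# Read and tokenize a plain-text corpus (corpus.txt is a placeholder path).
with open("corpus.txt", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read().lower())

# Find word pairs that occur together more often than chance would predict.
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore pairs seen fewer than three times

# Print the ten bigrams with the highest pointwise mutual information.
for first, second in finder.nbest(BigramAssocMeasures().pmi, 10):
    print(first, second)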
Although we may not see a lot of conspicuous use of machine learning in libraries at present, any project in library and information science that maps an input sequence to an output sequence could be improved with this technology; indeed, our discovery services and search engines embrace techniques identified in 1995 that can "analyze user queries, identify users' information needs, and suggest alternatives for search" (Chen, 1995, p. 1). Moving to the present day, Zhu & Lei (2022) use machine learning to classify research topics in COVID-19 research. They extract noun phrases from an experimental corpus of full-text articles indexed in Web of Science; these noun phrases numbered 19,240, with a minimum frequency of 10 per million words. Zhu & Lei (2022) identify research topics whose subject matter was increasing; these are labeled hot topics and categorized into larger categories such as biochemical terms, public health measures, symptoms and diseases, etc. Their methods are robust and they work with six different classification models, finding that a random forest classifier[6] yields the best results. In a similar vein, and apropos of information literacy, Sanaullah et al. (2022) offer a systematic review of COVID-19 misinformation research involving machine learning and deep learning. They selected 43 research articles and categorized them by misinformation type: fake news, conspiracy theory, rumor, misleading information, and disinformation (deceptive information, as opposed to merely inaccurate information in the case of misinformation) (Sanaullah et al., 2022). After a thorough discussion of methods, this survey finds that deep-learning methods are more efficacious than traditional machine-learning methods.

With known datasets, or datasets created from scraped web data, we can use modern machine-learning tools for any number of projects in different subfields of linguistics, like phonology (the study of linguistic sound), morphology (the study of words and how they are formed and used together), and even historical linguistics (the study of languages over time, including language families). This paper focuses on sequence-to-sequence models: the conversion of a sequence from one domain into a sequence of another domain. This could be, for example, Polish words converted to their pronunciation in the International Phonetic Alphabet (IPA): e.g., osłu 'donkey' converted to ɔswu. Such a model would effectively aid text-to-speech systems. Another example of sequence-to-sequence modeling could be to predict the correct inflection and placement of a stress marker given a word and its part of speech: training a model that, when given the Russian adjective эйфорически 'euphorically,' must successfully place the stress on the middle «и́» as in эйфори́чески. The idea is that we will use 80% of the data to train on, 10% as a development set with which to choose the best parameters and model, and 10% as the test set. It is easy to feel overwhelmed by these tools and their architectures. The aim of this paper is to help demystify this particular type of machine learning with a well-prepared dataset and clear project goals.

Importance of Open Data

Open data is essential for original research and replication studies. SPARC states that "despite its tremendous importance, today, research data remains largely fragmented—isolated across millions of individual computers, blocked by disparate technical, legal and financial restrictions" ("Open data," n.d.).
To combat this fragmentation, a call for open data would require that research data: "(1) is freely available on the internet, (2) permits any user to download, copy, analyze, re-process, pass to software or use for any other purpose; and (3) is without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself" ("Open data," n.d.). Open data can be found worldwide in GLAM labs such as the Data Foundry[7] at the National Library of Scotland and in linguistics repositories such as the Tromsø Repository of Language and Linguistics (TROLLing).[8] The Registry of Research Data Repositories[9] indexes nearly 3,000 research data repositories that provide databases, corpora, tools, and statistical and audiovisual data. With open and well-described data alongside open access papers, our research lives on in repositories, waiting to be replicated, rebutted, added to, or improved.

Projects like the one we describe below rely on data scraped by the WikiPron project (Lee et al., 2020), which provides phonological and morphological datasets coupled with frequency data, all regularly updated and open. The WikiPron project contains 1.7 million pronunciations from 165 languages. Better still, the project released its mining software so that anyone may mine the data themselves, and researchers "no longer depend on ossified snapshots of an ever-growing, ever-changing collaborative resource" (Lee et al., 2020, p. 4223). Under-researched languages like Adyghe or Urak Lawoi', or an endangered/moribund language like Wiyot, can benefit from projects that have access to open phonological data for language revitalization or preservation efforts. The WikiPron project even has 452 words from Old French (842 CE – ca. 1400 CE) that could be used to track sound change to modern French. Repositories and applications like WikiPron provide invaluable data that can be used in countless ways.

Projects with fairseq

Each project requires preparing the data in a way that can be used by the software. In this paper we use fairseq (Ott et al., 2019), a "Facebook AI Research Sequence-to-Sequence Toolkit written in Python."[10] The toolkit requires that characters be separated with a space if characters are what we are trying to sequence.[11] A WikiPron dataset may be downloaded as a tab-separated values (TSV) file. In this article we will look at two projects and how we would manipulate the data for fairseq.

Esperanto

Esperanto is a constructed language (conlang) created to be a universal auxiliary/second language to aid in international communication.[12] From the WikiPron project we first download the TSV file for Esperanto.[13] In Esperanto, each letter has only one pronunciation, so it should be trivial to convert characters to the IPA pronunciation, and our machine should be able to do this with great accuracy. Stress is not marked in the dataset, but in Esperanto stress always falls on the penultimate syllable. The data is in two tab-separated columns, with the grapheme (the written word) in the first column and the phoneme (the IPA representation of the pronunciation) in the second column:

Table 1. Example data from the TSV file from WikiPron.
aarono        a a r o n o
abadono       a b a d o n o
abateco       a b a t e t͡s o
abelmanĝulo   a b e l m a n d͡ʒ u l o
abortitaĵo    a b o r t i t a ʒ o

The TSV is shuffled using shuf and then split, using a Python script,[14] into three TSV files: an 80% training set, a 10% development set, and a 10% test set.

python3 split.py \
    --seed 103 \
    --input_path epo.tsv \
    --train_path epo_train.tsv \
    --dev_path epo_dev.tsv \
    --test_path epo_test.tsv

To prepare the data for fairseq, the important part of the code to note is that each of the three TSV files is then split into .g (for grapheme) and .p (for phoneme) files for training, dev, and test:

import contextlib
import csv

# Data was shuffled using `shuf` and split 80-10-10 using `split.py`.
train = "epo_train.tsv"
train_g = "train.epo.g"
train_p = "train.epo.p"
dev = "epo_dev.tsv"
dev_g = "dev.epo.g"
dev_p = "dev.epo.p"
test = "epo_test.tsv"
test_g = "test.epo.g"
test_p = "test.epo.p"

# Processes training data.
with contextlib.ExitStack() as stack:
    source = csv.reader(stack.enter_context(open(train, "r")), delimiter="\t")
    g = stack.enter_context(open(train_g, "w"))
    p = stack.enter_context(open(train_p, "w"))
    for graphemes, phones in source:
        print(" ".join(graphemes), file=g)
        print(phones, file=p)

# Processes development data.
with contextlib.ExitStack() as stack:
    source = csv.reader(stack.enter_context(open(dev, "r")), delimiter="\t")
    g = stack.enter_context(open(dev_g, "w"))
    p = stack.enter_context(open(dev_p, "w"))
    for graphemes, phones in source:
        print(" ".join(graphemes), file=g)
        print(phones, file=p)

# Processes test data.
with contextlib.ExitStack() as stack:
    source = csv.reader(stack.enter_context(open(test, "r")), delimiter="\t")
    g = stack.enter_context(open(test_g, "w"))
    p = stack.enter_context(open(test_p, "w"))
    for graphemes, phones in source:
        print(" ".join(graphemes), file=g)
        print(phones, file=p)

As shown above in Table 1, the second-column characters were already spaced correctly, so we needed to add spaces only to the first column. The result is two files for each set with spaced characters:

Table 2. Example of data ready for fairseq.

train.epo.g                      train.epo.p
s t a c i o                      s t a t͡s i o
o m a ĝ o                        o m a d͡ʒ o
ĉ i r k a ŭ f l a t a d i        t͡ʃ i r k a w f l a t a d i

The generated files are now ready for pre-processing in fairseq:

fairseq-preprocess \
    --source-lang epo.g \
    --target-lang epo.p \
    --trainpref train \
    --validpref dev \
    --testpref test \
    --tokenizer space \
    --thresholdsrc 2 \
    --thresholdtgt 2

This pre-processing creates a folder called data-bin with binaries and a log file that reports the number of tokens found. We can now start the training:

fairseq-train \
    data-bin \
    --source-lang epo.g \
    --target-lang epo.p \
    --encoder-bidirectional \
    --seed {choose a random whole number} \
    --arch lstm \
    --dropout 0.2 \
    --lr .001 \
    --max-update 800 \
    --no-epoch-checkpoints \
    --batch-size 3000 \
    --clip-norm 1 \
    --label-smoothing .1 \
    --optimizer adam \
    --criterion label_smoothed_cross_entropy \
    --encoder-embed-dim 128 \
    --decoder-embed-dim 128 \
    --encoder-layers 1 \
    --decoder-layers 1

With these parameters it took my machine[15] a half-hour to train. Tweaking the max-update value, the number of encoder layers, the architecture (e.g., transformer instead of lstm), or the optimizer will provide different, and perhaps better, results. Doubling the encoder and decoder layers, or doubling the encoder and decoder embedding dimensions to 256, slowed the processing time significantly without improving the model in this case.
The training part of these experiments is meant to help us decide which parameters, from many different options,[16] we hope will yield the best results. We will run this training several times with different parameters and choose three models. The dev portion (10%) of the experiment is used to choose the model that performs best on the dev set. Lastly, confident in our model, we will use that model on the test set, data it has not yet seen.

To determine how well each model is doing, we use fairseq-generate, which provides an error analysis detailing where our model came up short.

fairseq-generate \
    data-bin \
    --source-lang epo.g \
    --target-lang epo.p \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --beam 8 \
    > predictions.txt

The generated error analysis in predictions.txt is quite readable and shows where the predicted hypothesis differs from its target sequence:

S-17     i k t i o s a ŭ r o
T-17     i k t i o s a w r o
H-17     -0.14448021352291107     i k t i o s a w r o
D-17     -0.14448021352291107     i k t i o s a w r o
S-824    e k s i ĝ o n t a j
T-824    e k s i d͡ʒ o n t a j
H-824    -0.12416490912437439     e k s i d͡ʒ o n t a j
D-824    -0.12416490912437439     e k s i d͡ʒ o n t a j
S-1085   k a p t o ŝ n u r o
T-1085   k a p t o ʃ n u r o
H-1085   -0.15732990205287933     k a p t o ʃ n u r o
D-1085   -0.15732990205287933     k a p t o ʃ n u r o

The rows in predictions.txt are source, target, hypothesis (tokenized, meaning any punctuation symbols in a project with sentences would be space-separated), and detokenized (not broken into separate linguistic units). The number before the hypothesis is the log-probability of that hypothesis. For our project, if the target matches the hypothesis, the model has predicted correctly.

We use a script written by Dr. Kyle Gorman to parse the output of fairseq-generate and provide a word error rate (WER). Using this script, if any character in a word is incorrect, the word error rate is raised. As there should be no ambiguity in pronunciation and in the conversion of a character to a sound, we expected that our model would perform near-perfectly. Choosing the model that performed best, we can now give the model the test data. Predictably, the word error rate on the test data was 0.00, a perfect score.
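Dr. Gorman's script is not reproduced here, but a minimal sketch of the same idea might look like the following. It assumes the tab-separated predictions.txt format shown above and counts a word as an error whenever its hypothesis (H) line does not exactly match its target (T) line.

# Sketch only: compute a word error rate from fairseq-generate output,
# where any mismatch between target and hypothesis counts as a wrong word.
targets = {}
hypotheses = {}

with open("predictions.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if line.startswith("T-"):
            label, target = line.split("\t", 1)
            targets[label[2:]] = target
        elif line.startswith("H-"):
            label, _score, hypothesis = line.split("\t", 2)
            hypotheses[label[2:]] = hypothesis

errors = sum(1 for key, target in targets.items() if hypotheses.get(key) != target)
wer = errors / len(targets)
print(f"WER: {wer:.2f} ({errors} of {len(targets)} words incorrect)")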
Russian Stress

To explore how to add features to the model, we can look at experiments in Russian stress. Features are properties of the target that may or may not help in prediction. Features could include part of speech, frequency, animacy (whether a noun is sentient or not), or many other characteristics. As in the Esperanto project above, we have columns of data in a TSV file.

Table 3. Example data from the TSV file from Schriner (2022).

ямбам        я́мбам        1    ямб          N;MSC;INAN;PL;DAT
шихтовее     шихтове́е     1    шихтовой     A;CMPAR;PRED
щелкануть    щелкану́ть    0    щелкануть    V;PERF;INF
иноки        и́ноки        2    инока        N;FEM;ANIM;PL;NOM
стёсанном    стёсанном    2    стесать      V;PERF;DER;DER/PSTPSS;A;NEU;ANIN;SG;LOC

The first column is the word with no stress marker. The second column is the word with stress marked. The third column is a stress code derived from the placement of the stress in the word: reversing the text in place and counting from 0 at the end of the word, each word was given a stress code, and this code was added to the TSV as a column. Only vowels in Russian may carry stress, so deriving the stress code was simply a matter of counting vowels until a stress marker occurred. «ё» is always stressed, so the script stops and assigns a code when an «ё» is encountered.

The fourth column is the word's lemma (its root). The fifth column contains the full morphology of the word, including its part of speech, the tense for verbs, animacy, gender, grammatical number (whether a noun is singular or plural), and Russian case (e.g., nominative (NOM) case for the subject of a sentence, or dative (DAT) case for an indirect object). For the adjective (A) in Table 3, the word is comparative (CMPAR, as in more) and functions as an adjective predicate (PRED), linked to the subject of the sentence.

In this paper we will not process this dataset with fairseq, but some promising results may be found in Schriner (2022). This project is already significantly different from our Esperanto example in that stress in Russian has complicated patterns and ambiguous rules that will challenge a machine to place the stress correctly. Incorrectly stressed words may be unintelligible, or prove more difficult to place correctly given the existence of stress homographs such as о́рган 'organ of the body' and орга́н 'organ' (musical instrument) (Wade & Gillespie, 2011).

As in the Esperanto example, we have to format our text for fairseq and sequence-to-sequence modeling. To do this we will again convert space-separated characters into other space-separated characters. From Table 3, the word иноки 'others' will be converted to и́ноки, so our TSV file should have spaces: и н о к и will convert to и́ н о к и. We want our machine to learn that, given certain features, we can expect a certain outcome in training. The features in Table 3 are: stress code, lemma (the root of the word), and the full morphology including part of speech.

We can create several experiments from this data, including:

Given the word and its lemma, predict the stress code:

и н о к и инока   ← the feature added to the spaced characters
2                 ← the target will be the stress code, three vowels from the end, counting from 0

Given the word and its part of speech, predict the stress code:

и н о к и noun    ← the feature added to the spaced characters
2                 ← the target will be the stress code, three vowels from the end, counting from 0

Given the word and all of its morphological properties, predict the stress code:

и н о к и N;FEM;ANIM;PL;NOM   ← the feature added to the spaced characters
2                             ← the target will be the stress code, three vowels from the end, counting from 0

For the first experiment, the data in the TSV would be formatted as follows, with the feature (itself with no spaces) appended to the end of the first column:

Table 4. Formatting the TSV data.

Source (column 1)                 Target (column 2)
я м б а м ямб                     1
ш и х т о в е е шихтовой          1
щ е л к а н у т ь щелкануть       0
и н о к и инока                   2
с т ё с а н н о м стесать         2

The same methods used in the Esperanto example apply: we would train the model with fairseq on 80% of the data so that it can learn that words like иноки with the root инока have a stress code of 2. Once trained, we choose the model that performs best on the dev set (10%). Then we use that model on completely unseen data in the test set (10%). By examining and contrasting different experiments, we can see whether knowing the word's root helps in placing the stress, whether adjectives tend to have stress in particular places, or possibly even that the ambiguity in stress placement cannot be aided by this type of machine learning.
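As an illustration of the stress-code derivation described above (a sketch, not the project's own script), the following assumes the stressed forms mark stress with a combining acute accent (U+0301), as in и́, and that «ё» is always stressed. It counts vowels from the end of the word, starting at 0, and then shows how one training pair for the first experiment, with the lemma as the feature, could be written out.

# Sketch: derive the stress code by counting vowels from the end of the word.
RUSSIAN_VOWELS = set("аеёиоуыэюя")
ACUTE = "\u0301"  # combining acute accent, e.g. и followed by ACUTE renders as и́

def stress_code(stressed_word: str) -> int:
    code = -1
    for i in range(len(stressed_word) - 1, -1, -1):  # walk right to left
        char = stressed_word[i].lower()
        if char in RUSSIAN_VOWELS:
            code += 1
            followed_by_accent = (
                i + 1 < len(stressed_word) and stressed_word[i + 1] == ACUTE
            )
            if char == "ё" or followed_by_accent:
                return code
    raise ValueError(f"no stress marker found in {stressed_word!r}")

# Matches Table 3: и́ноки -> 2, я́мбам -> 1, щелкану́ть -> 0
print(stress_code("и́ноки"), stress_code("я́мбам"), stress_code("щелкану́ть"))

# Writing one fairseq-ready pair for the first experiment (lemma as the feature):
word, stressed, lemma = "иноки", "и́ноки", "инока"
source = " ".join(word) + " " + lemma   # "и н о к и инока"
target = str(stress_code(stressed))     # "2"
print(source, target, sep="\t")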
Experiments similar to these were conducted in Schriner (2022), showing that knowing the word's root led to the best predictions and the lowest word error rate, while adding the part-of-speech feature led to the worst results and the highest word error rate.

Conclusion

Preparing for experiments like those above requires hypotheses, planning, and formatting the data for the software. We used fairseq and found that, with our WikiPron data, the model we chose made no errors in predicting pronunciation in Esperanto, even on unseen data. In the Russian stress experiment we looked at how to prepare data in the same way but added features to the model's training. The fairseq framework makes it astonishingly easy to toggle and experiment with different parameters from the terminal and to work on experiments like those described above. With continued, collaborative, and open data, we can expect invaluable further research in this area.

About the Author

John Schriner is the E-Resources and Digital Initiatives Librarian at NYU Law School. His research tends to coalesce at the intersection of linguistics, cybersecurity, and librarianship.

References

Chen, H. (1995). Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. Journal of the American Society for Information Science, 46(3), 194–216. https://doi.org/10.1002/(SICI)1097-4571(199504)46:3<194::AID-ASI4>3.0.CO;2-S

Lee, J. L., Ashby, L., Garza, E., Lee-Sikka, Y., Miller, S., Wong, A., McCarthy, A., & Gorman, K. (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 4223–4228).

Open data. (n.d.). SPARC. Retrieved November 29, 2022, from https://sparcopen.org/open-data/

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., & Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (pp. 48–53). Minneapolis, Minnesota: Association for Computational Linguistics.

Sanaullah, A. R., Das, A., Das, A., Kabir, M. A., & Shu, K. (2022). Applications of machine learning for COVID-19 misinformation: A systematic review. Social Network Analysis and Mining, 12(1), 94. https://doi.org/10.1007/s13278-022-00921-9

Schriner, J. (2022). Predicting stress in Russian using modern machine-learning tools. https://academicworks.cuny.edu/gc_etds/4974/

Wade, T., & Gillespie, D. (2011). A comprehensive Russian grammar. Wiley-Blackwell.

Zhu, H., & Lei, L. (2022). A dependency-based machine learning approach to the identification of research topics: A case in COVID-19 studies. Library Hi Tech, 40(2), 495–515. https://doi.org/10.1108/lht-01-2021-0051
Endnotes

[1] https://voyant-tools.org/
[2] https://www.nltk.org/
[3] https://www.fon.hum.uva.nl/praat/
[4] https://www.liwc.app/
[5] Meaning simply that the input and output are visible but the inner workings and source code are closed.
[6] https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[7] https://data.nls.uk/
[8] https://dataverse.no/dataverse/trolling
[9] https://www.re3data.org/
[10] fairseq can be installed via pip from https://pypi.org/project/fairseq/
[11] This is specified in the pre-processing below.
[12] For a fascinating history of Esperanto from its beginnings through the early Soviet Union, please see Brigid O'Keeffe's Esperanto and Languages of Internationalism in Revolutionary Russia, 2021, Bloomsbury Academic.
[13] https://github.com/cuny-cl/wikipron/blob/master/data/scrape/tsv/epo_latn_narrow.tsv
[14] This script is agnostic to the data format and was written by Kyle Gorman and Jackson Lee. The script can be found here: https://github.com/cuny-cl/wikipron-modeling/blob/master/scripts/split.py
[15] Intel Core i7-6700 CPU @ 3.40GHz × 8 with 32GB RAM
[16] For all available training parameters, see: https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-train

Designing Digital Discovery and Access Systems for Archival Description

Issue 55, 2023-01-20

Archival description is often misunderstood by librarians, administrators, and technologists in ways that have seriously hindered the development of access and discovery systems. It is not widely understood that there is currently no off-the-shelf system that provides discovery and access to digital materials using archival methods. This article is an overview of the core differences between archival and bibliographic description, and discusses how to design access systems for born-digital and digitized materials using the affordances of archival metadata. It offers a custom indexer as a working example that adds the full text of digital content to an ArcLight instance and argues that the extensibility of archival description makes it a perfect match for automated description. Finally, it argues that building archives-first discovery systems allows us to use our descriptive labor more thoughtfully, better enable digitization on demand, and overall make a larger volume of cultural heritage materials available online.

By Gregory Wiedeman

Introduction

Archives are weird. Or at least that seems to be the perception of many library technologists. While archives are often part of larger research libraries, archival methodologies are often misunderstood by our administrator, technologist, and librarian peers. This confusion has become more problematic as archives continue to need and develop more complex access systems to make description, digitized materials, and born-digital objects available over the web.
Implementing these systems requires cross-domain partnerships, and the misunderstandings and miscommunications around archival description in particular have severely hindered the development of discovery and access systems for archives. Archives access systems do not work like library catalogs, or really anything else on the web, and currently have major usability barriers. To those who work mostly with the bibliographic description used by libraries and most of the web, it can be unclear why archives cannot just use the same systems, or why archives systems and practices seem so limiting for users. Archival methodology and its reasoning can easily be obscured by the more esoteric traditions of archives, like the celebration of famous men to demonstrate value to donors, Hollinger boxes, or finding aids. It is often hard to differentiate between the value and the dogma. Archivists themselves often find it hard to articulate why their needs differ from those of their librarian peers.

It can be challenging even for many archival practitioners to acquire strong expertise in archival description. In the United States, archival training is a concentration within a library credential, which can mean merely one or two archives-specific courses. You might get only a single class that discusses archival description, and even that is often taught by a faculty member with a research focus rather than extensive practitioner experience. Archival description skills often need to be learned on the job and seem to be most effectively passed on through peer groups, mentorship, or other types of informal professional development that not everyone has access to. Even archivists who have strong knowledge of archival description may not have a detailed understanding of how web applications or other technologies are designed or work in practice. While many archivists see firsthand the constant friction in current access systems, they often struggle to articulate how those systems could be designed better as web applications. The divide in domain knowledge between discovery systems and archival description is a challenging one to bridge.

I hope to clarify the core differences between archival and bibliographic description and outline a path towards more effective discovery systems. While bibliographic description is much more intuitive and commonplace in our web applications, archival methods free us to apply the valuable descriptive labor that is the main bottleneck in our digitization and born-digital acquisition programs more thoughtfully and appropriately.[1] If used properly, archival description could enable us to better provide digitization services on user request at scale and make these materials available online for future users. The extensibility of archival metadata also makes it a perfect fit for automated description, such as optical character recognition (OCR), entity extraction, or automated transcription, to enhance discovery, as it combines imprecise output with human-created records. I try to make it much clearer why archival metadata makes discovery so peculiar, highlight the cases where it can be advantageous, outline a path forward to increase the usability of archives access systems, and make the case for privileging archival description when planning and designing discovery systems.

The misunderstandings around archival description have hidden an enormous problem: there are no available off-the-shelf systems that provide access to digital materials using archival description.
Every digital repository, digital asset management system (DAMS), or institutional repository (IR) uses bibliographic description as an unrecognized design assumption. To illustrate this, I provide a case study of UAlbany's existing Hyrax and ArcLight implementations, which use archival description for discovery by linking data from these systems over APIs. This approach works functionally but has substantial usability and maintenance issues. In working to combine these systems into a single archives discovery system, I wrote a custom indexer that adds digital materials, full-text OCR, and extracted text content to ArcLight as a proof-of-concept example that I hope can illustrate a path towards designing access systems that work directly with archival methods. Finally, I will point to some ways we can experiment with how archival inheritance is indexed to potentially mimic bibliographic usability.

Archival vs. Bibliographic Description

By bibliographic description, I mean the creation of individual metadata records for each object with a set of descriptive fields. This has been the intuitive method of managing information going back beyond our relevant professional history. I'm sure you could go back thousands of years and find library workers creating some kind of discrete bibliographic record describing an individual item. Library catalog cards and online public access catalogs (OPACs) are canonical examples of bibliographic description. Each record has a set of descriptive fields and is self-contained – all of the available information is contained within the record. Dublin Core states this explicitly in its "one-to-one principle," where it declares that each discrete entity "should be described by conceptually distinct descriptions."[2] While linked data adds some complexity by potentially breaking up records into statements, data structures and descriptive practices usually remain the same. Most of the information on the web is displayed to users in a way that looks like bibliographic description. A search engine, a major e-commerce site, or Wikipedia will display records of objects to users that contain all the available information. These records often link to other records, but each record still describes an isolated object and is fully comprehensible by itself. The ubiquity of this format proves its intuitiveness and usability. I am sure this is to some degree an oversimplified caricature of bibliographic practices, but it is a useful contrast to help us better understand the impact of archival description.

While archives may appear to be just a specialized type of library, they have a fundamentally different methodology for managing and providing access to materials. Why did early archivists reinvent the wheel and develop incompatible practices that are less intuitive for both professionals and users? The answer is very practical: they simply had too much stuff. The early development of archival description in the United States illustrates how usability was a conscious and necessary tradeoff to adequately manage the scale of records archivists were working with. The American National Archives was first created in the 1930s, and, since the government had been functioning and creating records for over 150 years, records had previously been managed by individual departments and offices, often with a variety of different methods and techniques.
By 1941, archivists had accessioned 302,114 cubic feet of records from seventy-two different agencies.[3] These early American archivists actually wanted to use bibliographic methods to make all these records easily accessible in familiar ways. They made multiple attempts to use various forms of card catalogs to describe materials and established a classification division devoted to somehow providing subject-based discovery. However, "…given the diverse mass of materials in the national archives, classification demanded vastly more time and expense than the agency could afford," and the division was disbanded in 1941.[4] With truck after truck moving more and more records to the archives, all archivists could feasibly do was document the source of records and their existing arrangement. The provenance of each set of records was important because each source had a different arrangement system and discovery process. A user would have to use the "preliminary inventories" created by the archivists to find which office created the records they were seeking, and how that office arranged or maintained them, in order to navigate that file series or records component.[5] These "preliminary inventories" evolved into paper and online finding aids over time.[6] Of course, it would be simpler for users if all the records had a single discovery process, but to early American archivists that was obviously (if regretfully) infeasible. Usability was a conscious trade-off to make the enormous volume of materials even somewhat accessible.

As a rule of thumb, the approaches used by archivists are useful primarily because of the scale of the materials they manage. Got a large but manageable amount of stuff? Use bibliographic description. Got a seemingly never-ending mountain of materials? Use archival description. This is an oversimplification, as archival methods are also very good at retaining the context of materials, but scale alone is a useful distinction to show how archival systems are meaningfully different.[7] The reality is that in our current unlimited information environment, archives and libraries have larger collecting scopes and volumes of materials than they have descriptive resources, much like the early National Archives. Even with the additional catalogers and archivists that should be hired to address this, archival methods should be reassessed in order to make the line between the available and the inaccessible more gradual, and to make a larger body of materials open for use.

Archival Description in Practice

Most of our librarian and technologist peers understand that archival data is structured differently: it is hierarchical, with a tree structure of "components," such as collections, file series, folders, and perhaps items. However, the way archival description inherits is not widely understood and has really important implications for system design. Even archivists do not often articulate how the relationships between components of description work. For example, a repository might hold a folder called "Meeting minutes, 1989 July 26." This component has only a title and a date, which alone are not very helpful to users. Who was meeting? What were they discussing? Unlike a bibliographic record, not all of the available information is contained within the record, and the relationships to other records are very meaningful.
In this example, the file is part of a series titled "New Jersey Proportionality Review Project records," which is part of a collection titled the "Leigh B. Bienen Papers." Both higher-order components have fields where a user might learn the purpose of the meeting and its potential participants and outcomes.

Image 1. The file is part of a series titled "New Jersey Proportionality Review Project records," which is part of a collection titled the "Leigh B. Bienen Papers." https://archives.albany.edu/description/catalog/apap312aspace_c264f5e1f93f9d58e5b60483c32d76e9

Here is where we have to get into the weeds a bit. At all levels, components may use twenty-five elements that are described in the archives content standard, Describing Archives: A Content Standard (DACS). Eleven of these elements are required fields. The standard also outlines a set of requirements for multilevel description that articulates rules for the relationships between multilevel archival components like the above example. This section of DACS is particularly impactful, but it is challenging for non-experts to fully appreciate its meaning. What is not often understood here is that, while most of the eleven required elements are often used only at the collection level in practice, each component is required to be described by every one of these fields. Even the above "Meeting minutes, 1989 July 26" example needs to have a name of creator(s), a scope and content note, an access conditions note, etc. This example is in fact described by those elements; they are just stored outside of the record in higher-order components. Lower-level components only use DACS elements if they supersede or are more granular than the higher-order component; if that is not the case, the element from the higher-level component applies. Thus, the scope and content element from the New Jersey Proportionality Review Project records series component, and several elements from the Leigh B. Bienen Papers collection component, also describe the "Meeting minutes" file component. When archival repositories used paper finding aids, inherited elements were implicitly displayed using front matter, indents, and other design features that conveyed this relationship, but our current discovery systems do not account for this.

Archival description also provides us with a tremendous amount of flexibility, allowing for the discovery of full text, bibliographic records, and description automatically derived from digital materials within a single descriptive schema and discovery system. DACS allows archivists to use bibliographic metadata, such as Dublin Core fields, to further describe materials when there is a user-driven reason to do so. It just requires a clear and explicit relationship between these records and the archival component that describes them. This allows archivists to create high-quality descriptive records when appropriate. An archival collection can easily contain one series of lower-value or rarely used materials that are only generally described by the series description, and another series of high-value items containing the high-quality, detailed metadata for each item that you would expect in a library catalog. Instead of allocating a similar amount of descriptive labor to all materials, as bibliographic description often does, archival description empowers archivists to use their appraisal skills and spend their valuable time in proportion to the value of the materials they are working with.
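To make the inheritance rule above concrete for system designers, here is a hypothetical sketch (not DACS tooling, ArchivesSpace, or ArcLight code) of how a discovery system could resolve the record that actually describes a folder-level component: the component's own elements plus anything it does not supersede from its ancestors. The element names and values are illustrative only.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Component:
    title: str
    elements: dict = field(default_factory=dict)  # DACS element name -> value (illustrative)
    parent: Optional["Component"] = None

    def resolved(self) -> dict:
        """Merge ancestor elements, letting the closest (most specific) value win."""
        merged = {} if self.parent is None else dict(self.parent.resolved())
        merged.update(self.elements)
        return merged

collection = Component(
    "Leigh B. Bienen Papers",
    {"name_of_creator": "Leigh B. Bienen",
     "conditions_governing_access": "Open for research."},  # placeholder values
)
series = Component(
    "New Jersey Proportionality Review Project records",
    {"scope_and_content": "Placeholder scope and content note for the series."},
    parent=collection,
)
folder = Component(
    "Meeting minutes, 1989 July 26",
    {"date": "1989 July 26"},
    parent=series,
)

# The folder-level record a discovery system could index: its own title and date
# plus the creator, access, and scope notes it inherits from higher-order components.
print(folder.title, folder.resolved())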
For rarely used items of less value, materials can still be accessible, just at a higher usability cost.[8] Because archival methodology accommodates lower-quality descriptive records, it is also a perfect fit for automated approaches that derive description from digital materials. This includes full text, technical metadata, or the output of computational techniques such as entities extracted using natural language processing (NLP). Archival description is also a perfect fit for emerging machine learning techniques that extract meaningful information from digital images and documents for discovery, if these tools can be used without causing harm. There have been some experiments that have used automated approaches to describe special collections materials. However, no matter how sophisticated, automated methods alone produce lower-quality records that limit discoverability and usability in bibliographic systems.[9] In archival description, these records would always be linked with higher-level metadata created by a human professional. The flexibility of archival description also makes it easy to manually enhance automated description when needed. For lower-value materials that would not receive detailed description, automated description can also be better than nothing. Archivists are also welcome to use automated description at first while assessing its use, and potentially enhance the description later as appropriate. Yet, as I'll discuss later, while archival description encourages these practices, the current systems available for managing digital materials are designed to work only with bibliographic description, and thus they block the use of automated approaches in practice.

Archival methods do have significant drawbacks. This is an idealized vision of archival description. Systems that support the creation of quality archival description are a relatively new phenomenon, and a lack of training and support can mean that archival methods are sometimes inconsistently mixed with bibliographic approaches, or just poorly applied. Additionally, even if we design discovery systems that make use of archival description, there is a usability cost that may be unavoidable when we compare it to the simplicity of bibliographic description. When you compare catalog cards to finding aids, or OPACs to ArchivesSpace, bibliographic description is often more familiar and comfortable for most users. The usability problems of online finding aids and archives access systems are very well documented.[10] The more complex relationships in archival data are simply more challenging to navigate and display intuitively. Yet there are paths forward: if we design digital repositories to match the affordances of archival description, we may be able to improve the usability of discovery systems to the point where the advantages are well worth the costs to users.

The Current Landscape of Digital Repositories

When we apply a strong understanding of archival description to the current landscape of digital repositories, we see that there are several digital repositories available, but no system allows for the discovery of digital material using archival description. This is true across both open source and proprietary systems. Using archival methods for discovery is simply not currently possible without substantial customization.
Most repositories are designed either as digital asset management systems (DAMS), like CONTENTdm, for the upload and discovery of digital objects, or as institutional repositories (IRs), like Islandora, Samvera-based applications, or bepress Digital Commons, with built-in multi-user submission workflows. Every single one of these systems is designed with bibliographic description in mind. Each assumes that librarians or archivists will enter a set of descriptive metadata fields when uploading digital objects. Each tool also envisions itself as a self-contained system for this description. No complete off-the-shelf system expects description to be managed and made discoverable outside of its interface. Remember that if an item is described by an archival component, DACS requires a clear and explicit relationship between that item and its higher-level components so that users can use those inherited descriptive fields, and it is reasonable for a user to expect at least a navigable link here. A common workflow is for archivists to digitize an item that is already described by an archival component, but since all DAMS and IRs assume they are self-contained, the archivist then has to spend additional time and labor to create a separate set of Dublin Core or other bibliographic elements for a digital repository. This both duplicates effort and creates an obvious usability barrier. Users often must navigate both a system for archival description and a separate system for digital content. This problem is particularly acute for small repositories, as to make digital content available they are incentivized to change their local descriptive practices to match the system used by whatever consortial repository is available to them. It is probably correct to say that none of the current tools, including CONTENTdm, Islandora, DSpace, bepress Digital Commons, or Samvera-based systems like Hyrax or Hyku, are compliant with DACS. Archivists have no options. This is a major use case that is simply not being met with available tools, likely because of the divide in domain knowledge between archivists and administrators, librarians, and technologists. There is no off-the-shelf product that provides access and discovery for digital materials using archival repositories' existing description methods and systems.

Over the last decade or so, there has been a lot of progress in designing and developing systems to manage archival description, with the development of ArchivesSpace being a major success. However, ArchivesSpace, Access to Memory (AtoM), and ArcLight all only manage and/or provide access to description, not digital content. While these tools provide an important piece of the access puzzle, users want to access materials, not just descriptive records. In-person research will always be a key part of archival repositories, but more and more archival research is being done primarily or solely online, with the COVID-19 pandemic possibly being a major turning point. The closure of reading rooms finally forced many archives to regularly accommodate digitization requests on demand. This is a major advancement in user services, yet many of these materials are often sent directly to users and not uploaded into digital repositories for future use.
This is because these systems cannot accommodate items without additional descriptive labor, despite the items already having archival description and having already been discovered by a user.[11] Archival repositories need systems that manage digital content to do less – focus on asset management, file serving, and interoperability. Archivists are already able to create and manage complex archival description in tools like ArchivesSpace or AtoM. Archives need digital repositories to manage digital content while being interoperable with, and relying on, their existing description systems. The International Image Interoperability Framework (IIIF) is a great way to make these connections. There are some important roles that repositories should take on, such as processing or ingest workflows and technical metadata, but digital repositories as currently constituted cannot serve as the primary end-user discovery system for archival materials. It also could be advantageous to treat digital repositories and discovery systems as separate concerns, as repositories can better serve as "back-end" systems that may better provide, or be more interoperable with, preservation functions. In the future this may help us avoid design problems like the Samvera architecture, which too tightly coupled preservation and access functionality through ActiveFedora.[12] This separation may also make it easier for systems to manage access restrictions, as archivists need to manage and preserve digital materials that cannot currently be made publicly available; "virtual reading rooms" or limited or controlled access systems are another important piece of the access puzzle.[13] But most importantly, separating discovery from asset management may also provide us with the space and flexibility to design access systems that allow end users to discover and navigate that content using archival description.

UAlbany Case Study

A case study of the Espy Papers at UAlbany illustrates both the potential of using archival description to manage digital objects, particularly by enabling digitization on demand, and the practical challenges that arose when attempting this with current systems. M. Watt Espy spent most of his life documenting capital punishment in the United States. He dug up information on every death row inmate he could find from corrections records, county histories, court proceedings, and popular publications, and summarized each case on index cards, colorfully documenting victims, alleged perpetrators, and circumstances. At his height he had a large network of collaborators who sent him documentation sourced from all over the country. This collection represented the most complete documentation of executions in what is now the United States, dating back to European colonization. In 1984 the National Science Foundation (NSF) awarded a grant to the University of Alabama to create a computational dataset based on the materials, which was first released as Executions in the United States, 1608-1987: The Espy File.
On Espy's death, the original source materials, along with other papers, were donated to UAlbany's National Death Penalty Archive, and in 2010 the collection received detailed folder-level processing with funding from the National Historical Publications and Records Commission (NHPRC).[15] While the Espy File dataset became a canonical source for criminal justice researchers, abstracting the stories of these thousands of individuals onto a spreadsheet took away a lot of meaning and serviced only certain types of research. Some researchers had found issues with the dataset, and reference staff had heard a number of anecdotes from users about discrepancies they found between the index cards and the Espy File data. Seeing so many users willing to travel to see the index cards, along with the potential of leveraging the existing metadata from the dataset, made the collection a strong candidate for digitization, and in 2016 UAlbany was awarded a Council on Library and Information Resources (CLIR) Hidden Collections grant to digitize two file series and make them openly available online.[16]

Since the collection had previously received detailed folder-level processing and the materials were the source for an existing dataset, it seemed wasteful and duplicative to create additional item-level records with bibliographic metadata for what would be about 125,000 digital objects. The Espy File dataset was not created as descriptive metadata to our current standards and did not map to the paper materials in a machine-actionable way, so it was not useful as a drop-in replacement for bibliographic metadata in a DAMS. Thus, the collection seemed like an excellent candidate for using existing archival description to provide access to the digital scans, as this could make practical use of the problematic Espy File data. Our existing systems provided no way to use the existing description to provide discovery and access to digital scans. We had recently completed migrating our archival description to ArchivesSpace and were using the eXtensible Text Framework (XTF) and the Luna DAMS for access, but neither XTF nor Luna was interoperable or sustainably customizable, and no digital repository was available that used archival description for discovery out of the box. The ArchivesSpace REST API provided the potential to use archival description in new ways, and we were eager to fully leverage the descriptive labor already dedicated to the collection to benefit users and make our work more impactful. We decided to implement an open-source digital repository that would be more customizable, using folder-level description from ArchivesSpace along with the Espy File dataset. For much of the source materials series, we thought that the quality folder-level description that already existed should be sufficient to provide access. Also, if we could implement a successful process for using existing archival description for digitization, we hoped that we could do the same for other collections, and potentially even provide digitization services on request for single folders without having to create detailed bibliographic metadata. We decided to implement a lightly customized Hyrax repository, which uses the Samvera framework. Hyrax is not a "turnkey" system, but a fully featured set of open components that can be implemented as a digital repository. We hoped the openness of Hyrax would make it easily adaptable to our existing archival description.
over the course of our project, the arclight mvp project made arclight into a viable option for providing access to archival description. because it uses a ruby on rails stack similar to hyrax's, it was easier to implement arclight and integrate it with hyrax than to do a similar level of customization with the archivesspace public user interface (pui). we needed data to pass both ways, from hyrax to arclight and from arclight to hyrax, and both systems exposed json metadata through rest apis, a feature we could not have done without. since both systems used the same technology, much of what we learned customizing one system could also be applied to the other. we did not quite know what we were getting into. the project was significantly under-resourced in both outside funding and internal expertise. however, despite some delays, data problems, and the challenges of learning new technologies, the systems we implemented were a major success. the espy project execution records website provides open access and discovery to the espy papers. our university libraries also gained skills and capacity to implement and host open-source applications that would be applicable to other projects; we developed a more productive relationship with the university-level information technology services division; and we are better able to utilize our on-campus virtualized data center. the need to support these systems was successfully used in 2019 to justify filling a vacant technologist position that otherwise was not likely to have received university-level approval. the project enabled us to use existing archival description for digitized and born-digital items and allowed us to provide online access to a much greater volume of materials. on the hyrax side, we had to develop multiple custom data models to handle both legacy materials from our existing dams and objects that would rely only on a link to a component of archival description. it was relatively straightforward to create image and av models to handle the schemas used in our existing dams, but hyrax's use of linked data uris was a barrier to creating a sensible digital archival object (dao) model for archival description.[17] to make connections between digital objects and components of archival description, we used the 32-character ref_id generated by archivesspace and indexed into arclight. each folder-level component had a ref_id for itself, could have multiple ref_ids for higher-level series and subseries components, and always had a collection identifier for the top-level collection. we thus needed three identifier fields, one of which contained multiple ids where the order mattered, and each of which had a separate meaning. it also made sense to store the name of each component in the model as a string. this was challenging to model using linked data uris since hyrax requires a unique uri for each field. once we got a set of uris that hyrax accepted, we essentially ignored the uris downstream and relied on local meanings for the fields. i am skeptical that even a perfectly designed or customized ontology would have provided any value to this project, and trying to use any form of the records in contexts (ric-o) ontology currently being designed by the international council on archives expert group on archival description (ica egad) would have been a nightmare.[18] once the dao model was complete, we customized the workflow page where an archivist would upload and describe a digital object.
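before turning to that workflow, a minimal sketch may help make the identifier scheme described above concrete. this is not the project's actual hyrax model; the class and field names are illustrative assumptions, showing only the three kinds of identifiers plus component names:

# illustrative sketch only: the three identifier fields described above for a
# digital archival object (dao), plus human-readable component names.
# names are assumptions, not the actual hyrax/ualbany model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DaoRecord:
    ref_id: str                                                # 32-character archivesspace ref_id of the folder-level component
    parent_ref_ids: List[str] = field(default_factory=list)   # ordered ref_ids of higher-level series/subseries components
    collection_id: str = ""                                    # identifier of the top-level collection
    component_names: List[str] = field(default_factory=list)  # display names corresponding to parent_ref_ids

example = DaoRecord(
    ref_id="0f1e2d3c4b5a69788796a5b4c3d2e1f0",
    parent_ref_ids=["aaaa1111bbbb2222cccc3333dddd4444"],
    collection_id="example-collection-id",
    component_names=["example series"],
)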
to upload an object, the archivist would enter the ref_id for the component of archival description and the collection identifier, and click a "load record" button. this button would make a javascript ajax call to the arclight json api and automatically fill most of the descriptive fields. the archivist would then only be required to add a resource type and a license or rights statement before uploading the object. image 2. dao model. we also customized the display page for each object to pull relevant archival description from arclight, again using client-side javascript calls. when an object page loads, it uses the ref_id and collection identifier to query the archival description component and all of its parent components. the page then displays the names and links for all higher-level components as well as any scope and content notes. the use of client-side ajax calls is imperfect but allowed us to integrate the two systems without much more complex customization within the rails applications. if a worker was digitizing an item, they would just have to find the ref_id and collection number for the folder in archivesspace or arclight, and enter those fields in hyrax with a resource type and rights statement. for descriptive metadata, hyrax would then contain only a title (example: skandalon, vol. 3, no. 9) and date (example: 1965 march 10), which by themselves would not be very helpful to users. when a user accesses the item, hyrax will query and display scope and content notes for the skandalon and the university publications collection. a user could then read that this is a single issue from a bi-weekly journal of news and opinion published by campus christian council, which was part of an artificial collection of student publications. this minimal descriptive workflow, along with rapid lower-quality scanning, allowed for digitization on user request. we later implemented a new digital reproduction fee schedule that charged by the time required for digitization rather than page counts.[19] since we were using existing archival description for metadata and avoiding page count estimates with back-and-forth emails, in many cases we were able to digitize an item in about the same time as a traditional reference request and make requests that take under 30 minutes free to users. this practice improved user experience, allowed us to digitize a much larger volume of materials and make them accessible online, and has the added benefit of making our digitization labor more transparent to users. in this example, i received a request for one issue and digitized the whole run of 42 issues in an afternoon merely because i had some extra time and thought the materials were interesting and worth digitizing. in addition to digitizing individual items on request, we also developed a batch upload workflow for large sets of items sent to an outside vendor for digitization. the process relied mostly on spreadsheets. here we also used existing archival description so the materials did not require item-level bibliographic metadata. this proved to be really useful for university publications, for example, where we had existing volume and issue lists. we had an existing tool for exporting this metadata from the archivesspace api, so we added a process where an archivist could paste in the corresponding access file for each issue, and a script would generate another spreadsheet that could be uploaded into hyrax using a rake task.
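a rough sketch of that batch step (not the project's actual script; the column names and file layout are assumptions) might look like the following: given the exported issue metadata and a column of access file names pasted in by an archivist, it writes a new spreadsheet ready for a bulk-ingest task.

# illustrative sketch: join exported archivesspace issue metadata with the access
# file names an archivist pasted in, and write a spreadsheet for bulk ingest.
# column names ("ref_id", "collection_id", "title", "date", "access_file") are assumptions.
import csv

def build_upload_sheet(export_csv: str, output_csv: str) -> None:
    with open(export_csv, newline="", encoding="utf-8") as src, \
         open(output_csv, "w", newline="", encoding="utf-8") as dest:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dest, fieldnames=["ref_id", "collection_id",
                                                  "title", "date", "access_file"])
        writer.writeheader()
        for row in reader:
            if not row.get("access_file"):
                continue  # skip issues that have not been scanned yet
            writer.writerow({
                "ref_id": row["ref_id"],
                "collection_id": row["collection_id"],
                "title": row["title"],
                "date": row.get("date", ""),
                "access_file": row["access_file"],
            })

build_upload_sheet("issues_export.csv", "hyrax_batch_upload.csv")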
this workflow enabled us to rapidly digitize large collections or file series that were really valuable for reference use, such as student newspapers, university publications, commencement programs, university organizational charts, press releases, and university senate legislation. while additional descriptive care would have improved discoverability as always, making these materials discoverable using existing archival description plus full text ocr and extracted text was a major advancement. while our arclight and hyrax implementations were very successful in providing access to digital materials using archival description, they also have a number of practical limitations. the most obvious problem is that users still must navigate two separate systems, one for archival description and another for digital materials. we implemented a “bento” style discovery layer based on quicksearch to make search results from both arclight and hyrax available from a single search box but found that users still had trouble navigating back and forth between the two systems.[20] a redesign in early 2022 based on the duke university arclight implementation addressed some minor issues with this integration, but the core problem remains.[21] additionally, getting data from hyrax back into arclight is challenging. it was easy to modify arclight templates to point to hyrax for digital materials, but once an archivist uploaded a new object into hyrax, that uri had to be added to a new archivesspace digital object record. we were also storing separate preservation copies for each object outside of hyrax so we needed to download the object, store it as a local archival information package (aip), and add an identifier that references the aip into hyrax. since hyrax does not provide an api for this, we were only able to automate this using a very wonky script that queries the hyrax solr index, adds a new digital object in archivesspace, schedules it to be indexed to arclight, downloads and stores the object as an aip and adds the identifier to hyrax by literally scraping the hyrax login and edit pages and posting data to the edit form using the python requests module. it worked, but it was a hack. this process along with overall support for hyrax creates major sustainability risks. our library systems department has struggled to maintain hyrax without anyone with a strong ruby or rails background on staff. major cuts to library staff in 2020-2022 only minimally impacted applications support, but with overall library staff reduced by about 30% due to unfilled retirements, our long-term support for customized applications should be questioned, particularly when we are adapting systems like hyrax and not using them quite as they are intended. overall, there is a need for this setup to be simplified. a discovery system designed for archival description archivists need a discovery system for digital materials that uses archival description. a true archival discovery system would query archival description along with item-level bibliographic metadata and automated description derived from digital materials, such as extracted text, ocr text, and a/v transcripts in a single search interface. arclight has the potential to be this system. currently arclight is an access system for archival description based on blacklight. it does not manage digital assets but returns individual components of archival description and lets users navigate through connected records. 
since arclight merely displays data indexed in solr, just like blacklight, it also has the potential to display and return search results for digital objects, including full text. description_indexer is an experimental tool that overrides the default arclight indexing pipeline. out of the box, arclight uses traject to index archival description from ead-xml files, often exported from archivesspace. while traject is set up to be easily configurable to select which xpath to use for each solr field, it is not easily customizable to add the significant logic needed to index archival description or data from other sources. instead, description_indexer is a python library that uses archivessnake and pysolr to index archival description directly from the archivesspace api. this approach is potentially very useful for individual repository instances but may be less so for consortial aggregators because of the high permission levels currently needed to access the archivesspace api. description_indexer contains two very basic json data models, one for archival description and another for the arclight solr index. this extra layer of abstraction is useful, as any data source that can map to the archival description model would then be automatically indexable into arclight. the archival description model is very much a draft and is likely too simple to be comprehensive, but community consensus around a model like this is key to consistently representing digital materials in the arclight index. the description_indexer main branch is set up to be a "drop-in" replacement for the current traject indexer. the dao-indexing branch is designed to be a more experimental branch that flexibly indexes from digital repositories or other systems that manage digital assets. it is designed to be extensible: since individual implementations will likely need to index asset data from a number of different sources, you can write your own plug-in to index digital assets from your local system. once description_indexer is installed, you can add a custom class in a .py file in your home directory, or point to one using an environment variable, to allow local logic to override how digital objects are indexed. the ualbany example that is included queries json from our hyrax instance to index links to content and other item-level data not managed in archivesspace. description_indexer also contains multiple "extractors" for pulling content from digital files using apache tika and/or tesseract; however, running these during indexing is a challenge, and a better design would be to extract and store this data while processing digital files and make it available to the indexer via a file system or a rest service. this is also where there is potential to experiment with new tools for extracting useful information from documents for discovery, using nlp or models generated with machine learning. the data pipeline to the indexer needs further consensus and standardization. in writing description_indexer, i discovered that digital objects, files, and file versions are under-theorized in archival description, and archivists need to better define these objects and their relationships.
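before turning to those modeling questions, a rough sketch may give a feel for the kind of local plug-in logic described above. this is not the actual description_indexer api: the function name, solr field names, and hyrax json url are assumptions, and a real implementation would follow whatever arclight solr schema is in use.

# rough sketch of local digital-asset indexing in the spirit of the plug-in mechanism
# described above. not the description_indexer api: function name, solr field names,
# and the hyrax json endpoint are assumptions.
import pysolr
import requests
from asnake.client import ASnakeClient

solr = pysolr.Solr("http://localhost:8983/solr/blacklight-core", always_commit=True)
aspace = ASnakeClient(baseurl="https://aspace.example.edu/api",
                      username="apiuser", password="secret")
aspace.authorize()

def index_digital_object(ref_id: str, hyrax_work_url: str) -> None:
    """look up the described component, merge in asset data, and add a solr document."""
    found = aspace.get("repositories/2/find_by_id/archival_objects",
                       params={"ref_id[]": ref_id}).json()
    component = aspace.get(found["archival_objects"][0]["ref"]).json()
    # hypothetical json view of the hyrax work; extracted text assumed to be exposed there
    work = requests.get(hyrax_work_url + ".json", timeout=30).json()

    solr.add([{
        "id": ref_id,
        "title_ssm": [component.get("display_string", "")],   # assumed field name
        "digital_object_url_ssm": [hyrax_work_url],            # assumed field name
        "full_text_tesim": [work.get("extracted_text", "")],   # assumed field name
    }])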
the portland common data model (pcdm) provides helpful definitions of objects and files, and should be incorporated as much as possible, but the relationships between objects and archival components in lieu of pcdm collections are ill-defined, and current practice is inconsistent.[22] archivesspace attaches digital objects to archival components, but allows component attributes such as subjects and note fields to be attached to digital objects as well. digital objects also do not have href or url attributes but contain file versions, which have file uri attributes. both digital objects and file versions also have is_representative boolean attributes that are likely useful for digital objects. overall, it should be clearer that digital objects are an abstraction that do not necessarily correspond to a file, and digital objects should probably have a field for an international image interoperability framework (iiif) manifest, as that also can be an abstraction and should be the preferred method of linking archival description to digital materials. attributes for how files and versions are displayed in the absence of a iiif manifest are also likely necessary. overall, this was challenging to model, and broader and more complex community use cases are needed.[23] the biggest barriers to enabling the discovery of digital materials in arclight are establishing consensus data models and data pipelines. once content from digital materials is indexed into an arclight solr index, we can display those objects in arclight with only some minor customizations and a iiif-compliant image server. i implemented a simple demonstration application that illustrates what this could look like in practice. this system returns results based on both archival description and full-text content extracted from digital objects. this implementation has data and design limitations, but i hope that this can be a useful model that shows the potential for what arclight can be going forward. privileging archival description in discovery systems academic libraries and other cultural heritage institutions also manage digital objects using bibliographic description. to avoid implementing and maintaining multiple discovery systems, archival materials are often forced into off-the-shelf irs and dams designed for bibliographic description. a better understanding of archival description shows that it is actually more appropriate to do the reverse, and index bibliographic records into systems designed for archival materials. here, it might be helpful to see archival description as an organizational schema for managing materials that have many different organizational schemas. in the same way that the early national archives used archival description to manage different descriptive methods used by different government agencies, archival systems can also accommodate bibliographic metadata that provides more usable and familiar access. this provides the best of both worlds. we can have one discovery system that provides a strong user experience for higher-value materials while still providing some level of access for materials that do not receive wide interest and otherwise would not receive detailed descriptive care. this also works from a purely technical perspective. while it is possible to model archival description in digital repositories like hyrax, the more complex structure of archival data makes this very challenging. it is comparatively much simpler to model bibliographic metadata in archival systems than the reverse.
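as a rough illustration of what indexing bibliographic records into an archival system could mean in practice, a standalone bibliographic record can simply become another document in the arclight solr index, optionally linked to a parent archival component. this is a sketch only: the solr field names follow common blacklight/arclight dynamic-field conventions but are assumptions here, not arclight's actual schema, and the example values echo the skandalon issue discussed earlier.

# sketch: add a bibliographic record to an arclight-style solr index, optionally
# linked to an archival component. field names are assumptions for illustration.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/blacklight-core", always_commit=True)

bib_record = {
    "id": "bib-0001",
    "level_ssm": ["item"],                        # treated as an item-level description
    "title_ssm": ["skandalon, vol. 3, no. 9"],
    "normalized_date_ssm": ["1965 march 10"],
    "parent_ssim": ["aaaa1111bbbb2222cccc3333dddd4444"],  # ref_id of the parent folder (optional link)
    "collection_ssm": ["example collection"],
}

solr.add([bib_record])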
with well-defined data models, we can easily add bibliographic metadata to an arclight index, just like with a blacklight instance. these records could stand alone or also be linked to archival description. this provides arclight with the potential to unify bibliographic and archival metadata in a single user environment, offering the usability of detailed records with the extensibility of archival hierarchy. this would provide us with the full potential of archival description to flexibly allocate our descriptive labor based on the value of materials and user needs. navigating complex archival data structures for items with lesser value may still be challenging for users. if we can make decisions based on the value of materials, rather than systems limitations, this should actually be an effective allocation of our limited descriptive resources. there are also additional opportunities to improve the usability of archival description. since arclight is just an extension of blacklight, it presents description to users in search results as discrete units much like bibliographic metadata. what we can do is experiment with how archival tree structures are indexed to better match how dacs envisions inheritance. since dacs expects notes that are usually only applied at collection or series levels—like scope and content or historical notes—to apply to lower-level components as well, we can experiment with indexing these notes as part of lower-level components too and just return them with lower relevancy scores. arclight currently indexes parent access and use notes like this but does not use them to return search results. this has the potential to return better results for minimally-described materials, but would need to be part of an iterative usability testing process so that results are weighted appropriately. these are exciting possibilities, but we cannot do usability testing on archival discovery systems until they exist. conclusion archival description takes a very different approach to description than what is commonly used elsewhere – whether that be in library catalogs, digital repositories, or on the web. archival methodology has key strengths that make it very useful for managing the vast quantity of digital materials held by libraries and avoiding a digital divide in an era where pandemics and the emissions costs of travel may limit in-person research. our descriptive labor, no matter how extensive it is or should be, has limits. if academic libraries continue to prioritize bibliographic approaches to metadata and apply the same level of descriptive care to objects one by one regardless of value, there will always be a hard line between what is accessible and what is not. archival description provides flexibility that empowers us to apply that valuable descriptive care based on the needs of users and prepares us to experiment with automated metadata approaches and iterative workflows. archival methods simply more accurately and appropriately model our descriptive resources to our materials. unfortunately, it is currently very challenging to use archival description to manage and provide access to digital materials, as current digital repositories are not designed to work with archival description. 
archivists manage description for materials in systems like archivesspace that are designed for archival description, but dams and similar digital repository systems expect them to create additional bibliographic metadata for any digital material they manage, whether that is an appropriate use of resources or not. there is usually no easy way to link that metadata from two different systems together. in practice, this means that lower-valued items or the increasing number of items digitized by archives on user request are not made available or discoverable for future users because they do not have the value needed to receive detailed bibliographic description. this is silly considering archival description already exists for them. since archives data structures can accommodate bibliographic metadata, but the reverse is very challenging, discovery systems design must privilege archival description. currently, there is no easy way to integrate archival description from systems like archivesspace with digital materials managed in digital repositories into a single discovery point for users. ualbany’s approach of using a “bento” style discovery layer on top of these two systems works functionally but has substantial usability limitations and sustainability concerns. the misunderstandings around archival description have marginalized archival systems in academic libraries. because our digital access systems never have worked for archival methods, libraries long took shortcuts by establishing whole separate programs to manage unique digital materials and limiting archives and special collections to a very traditional understanding of their collecting scopes. instead of working with archives, libraries often worked around them – often causing needless duplication in metadata work, digitization, asset management, and digital preservation across different reporting structures.  arclight has the potential to unify discovery of archival and bibliographic description and provide a single discovery point for physical and digital materials that allows archivists to fully leverage the affordances of archival description. we need further community consensus on a data model for archival description – most notably for digital objects, files, and file versions. i hope description_indexer can be a helpful example that can be iterated upon, that further work can be done to index digital materials in the arclight index, and that we can experiment more with indexing archival description in general. while not really discussed here, archival description’s focus on agents and functions behind the creation of records has the potential for opening new patterns for discovery.[24] overall, we need examples of digital materials in arclight alongside archival and bibliographic description for iterative usability testing. about the author gregory wiedeman is the university archivist at the university at albany, suny where he helps ensure long-term access to the school’s public records. he manages the university archives and supports born-digital collecting, web archives, and systems implementation for the department’s outside collecting areas. he currently serves as co-chair of the technical subcommittee for describing archives: a content standard (ts-dacs). endnotes [1] joyce chapman, kinza masood, chrissy rissmeyer, dan zelner, “digitization cost calculator raw data,” digital library federation (dlf) assessment interest group (2015). https://dashboard.diglib.org/data/. amanda j. 
wilson, “toward releasing the metadata bottleneck: a baseline evaluation of contributor-supplied metadata,” library resources & technical services vol. 51, no. 1 (2007). https://journals.ala.org/index.php/lrts/article/view/5384/6604. [2] “dcmi: one-to-one principle,” dublin core metadata innovation. https://web.archive.org/web/20220627093857/https://www.dublincore.org/resources/glossary/one-to-one_principle/ [3] mccoy states that “…the national archives had to deal with the greatest volume of records in the world; the unparalleled diversity of their origins, arrangement, and types; and their widely scattered locations in 1935.” donald r. mccoy, the national archives: america’s ministry of documents 1934-1968 (chapel hill, nc: the university of north carolina press, 1978), 45, 69. [4] mccoy, 78-80. philip m. hamer, “finding mediums in the national archives: an appraisal of six years’ experience,” the american archivist, vol. 5, no. 2 (1942): 86-87. [5] the national archives, guide to the material in the national archives (washington, dc: united states government printing office, 1940), ix. [6] this process is discussed in more depth in gregory wiedeman, “the historical hazards of finding aids,” the american archivist, vol. 82, no. 2 (2019): 381-420. https://doi.org/10.17723/aarc-82-02-20. [7] in addition to working well at scale, archival description is also more effective at maintaining contextual relationships between records, their creators, and the activities that created them. this is further discussed in jodi allison-bunnell, maureen cresci callahan, gretchen gueguen, john kunze, krystyna k. matusiak, and gregory wiedeman, “lost without context: representing relationships between archival materials in the digital environment,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021). https://doi.org/10.25740/gg453cv6438. [8] this practice is best described in daniel a. santamaria, extensible processing for archives and special collections: reducing processing backlogs (chicago: neal-schuman, 2015). shan c. sutton also discusses the further extension of this to digitization in shan c. sutton, “balancing boutique-level quality and large-scale production: the impact of “more product, less process” on digitization in archives and special collections,” rbm: a journal of rare books, manuscripts, and cultural heritage vol. 13, no. 1 (2012). https://doi.org/10.5860/rbm.13.1.369. [9] paul kelly, “better together: improving the lives of metadata creators with natural language processing,” in code4lib journal issue 51 (june 14, 2021), https://journal.code4lib.org/articles/15946. kaldeli, eirini, orfeas menis-mastromichalakis, spyros bekiaris, maria ralli, vassilis tzouvaras, giorgos stamou, and evaggelos spyrou, “crowdheritage: crowdsourcing for improving the quality of cultural heritage metadata,” information vol. 12, no. 2 (february 2021). [10] christopher j. prom, “user interactions with electronic finding aids in a controlled setting,” american archivist 67, no. 2 (2004): 234–68, https://doi.org/10.17723/aarc.67.2.7317671548328620. anne j. gilliland-swetland, “popularizing the finding aid: exploiting ead to enhance online discovery and retrieval in archival information systems by diverse user groups,” journal of internet cataloging 4, nos. 3–4 (2001): 199–225, https://doi.org/10.1300/j141v04n03_12. luanne freund and elaine g. 
toms, “interacting with archival finding aids,” journal of the association for information science and technology 67, no. 4 (2016): 1007, https://doi.org/10.1002/asi.23436. wendy scheir, “first entry: report on a qualitative exploratory study of notice user experience with online finding aids,” journal of archival organization 3, no. 4 (2005): 49–85, https://doi.org/10.1300/j201v03n04_04. joyce celeste chapman, “observing users: an empirical analysis of user interaction with online finding aids,” journal of archival organization 8 (2010): 4–30, https://doi.org/10.1080/15332748.2010.484361. [11] james e. murphy, carla j. lewis, christena a. mckillop, and marc stoeckle, “expanding digital academic library and archive services at the university of calgary in response to the covid-19 pandemic,” ifla journal vol. 48, no. 1 (2021). https://doi.org/10.1177/03400352211023067. florence sloan, “special collections practice in response to the challenges of covid-19: problems, opportunities, and future implications for digital collections at the louis round wilson library at the university of north carolina at chapel hill,” masters thesis, university of north carolina at chapel hill school of information and library science (april 30, 2021). https://cdr.lib.unc.edu/concern/masters_papers/1z40m3313. the infeasibility of creating item level records is also discussed in stephanie becker, anne kumer, and naomi langer, “access is people: how investing in digital collections labor improves archival discovery & delivery,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021), 33. https://doi.org/10.25740/gg453cv6438. [12] esmé cowles, “valkyrie, reimagining the samvera community,” https://library.princeton.edu/news/digital-collections/2018-06-05/valkyrie-reimagining-samvera-community. [13] elvia arroyo-ramírez, annalise berdini, shelly black, greg cram, kathryn gronsbell, nick krabbenhoeft, kate lynch, genevieve preston, and heather smedberg, “speeding towards remote access: developing shared recommendations for virtual reading rooms,” the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021). https://doi.org/10.25740/gg453cv6438. [14]  m. watt espy, john ortiz smykla, executions in the united states, 1608-2002: the espy file (icpsr 8451), (ann arbor, mi: inter-university consortium for political and social research (distributor), 2016-07-20). https://doi.org/10.3886/icpsr08451.v5. [15] m. watt espy papers, 1730-2008. m.e. grenander department of special collections and archives, university libraries, university at albany, state university of new york. https://archives.albany.edu/description/catalog/apap301. “commission recommends $7 million in grants,” the u.s. national archives and records administration, 2010 june 1. https://web.archive.org/web/20220307211150/https://www.archives.gov/press/press-releases/2010/nr10-107.html. [16] blackman and mclaughlin summarize the widespread praise for espy’s work, while also highlighting some of the espy file’s limitations and criticizing its use for quantitative analysis. blackman and mclaughlin, “the espy file on american executions: user beware,” homicide studies vol. 15, no. 3 (2011): 209-227. [17] models for the ualbany hyrax instance. 
https://github.com/ualbanyarchives/hyrax-ualbany/tree/main/app/models. [18] egad – expert group on archival description, "records in contexts – ontology," july 22, 2021. https://www.ica.org/en/records-in-contexts-ontology. [19] "request items for digitization," m.e. grenander department of special collections & archives, university at albany, suny. https://archives.albany.edu/web/reproductions/. [20] "quicksearch," north carolina state university libraries. https://www.lib.ncsu.edu/projects/quicksearch. [21] sean aery, "arclight at the end of the tunnel," november 15, 2019. https://blogs.library.duke.edu/bitstreams/2019/11/15/arclight-at-the-end-of-the-tunnel/. [22] portland common data model (april 18, 2016), https://web.archive.org/web/20220912065008/https://pcdm.org/2016/04/18/models. [23] description_indexer experimental archival description model, https://github.com/ualbanyarchives/description_indexer/blob/dao-indexing/description_indexer/models/description.py. [24] the rockefeller archive center’s dimes access system is a really interesting step in this direction by emphasizing agent records and requiring users to click through archival components to convey description inheritance. renee pappous, hannah sistrunk, and darren young, "connecting on principles: building and uncovering relationships through a new archival discovery system," the lighting the way handbook: case studies, guidelines, and emergent futures for archival discovery and delivery, eds. m.a. matienzo and dinah handel (stanford, ca: stanford university libraries, 2021). the records in contexts – conceptual model (ric-cm) also has a very intriguing focus on agents and functions for discovery that deserves further practical exploration. "records in contexts – conceptual model." expert group on archival description (egad), https://web.archive.org/web/20221007020234/https://www.ica.org/en/records-in-contexts-conceptual-model.

the code4lib journal – revamping metadata maker for ‘linked data editor’: thinking out loud issue 55, 2023-1-20 revamping metadata maker for ‘linked data editor’: thinking out loud with the development of linked data technologies and launch of the bibliographic framework initiative (bibframe), the library community has conducted several experiments to design and build linked data editors. while efforts have been made to create original linked data ‘records’ from scratch, less attention has been given to copy cataloging workflows in a linked data environment. developed and released as an open-source application in 2015, metadata maker is a cataloging creation tool that allows users to create bibliographic metadata without previous knowledge in cataloging. metadata maker might have the potential to be adopted by paraprofessional catalogers in practice with new linked data sources added, including auto suggestion of virtual international authority file (viaf) personal name and library of congress subject heading (lcsh) recommendations based on the users’ text input.
this article introduces those new features, shares the user testing results, and discusses possible future steps. by greta heng, myung-ja han introduction libraries have been using machine readable cataloging (marc) as a tool to create bibliographic and authority data since the 1960s. while marc brought libraries a new way to organize information in the past, the evolving information landscape asks libraries to explore other means of information organization that can connect library collections with resources on the web. as a successor to marc, the bibliographic framework (bibframe) initiative was launched by the library of congress (lc) in 2012.[1] it is expressed in the resource description framework (rdf, a data model for structured data)[2] and based on three categories of abstraction (work, instance, item). as the library's new entity-relationship data model, bibframe is grounded in linked data techniques, which allow metadata creators to build relationships with web resources by facilitating shared structured data and uniform resource identifiers (uris). many national and research libraries have been exploring the possibility of converting marc format metadata to bibframe and, even further, creating metadata as linked data using a linked data/bibframe editor. libraries such as the swedish national library,[3] the french national library,[4] the german national library (dnb),[5] and the library of congress[6] have been involved in marc to linked data conversion, linked data based new discovery services, and linked data editor experiments. in addition, some external linked data management platforms are gaining popularity among glam (galleries, libraries, archives, and museums) institutions. wikidata,[7] an open, collaborative, and multilingual global linked data repository, is being used by libraries as an alternate source of name and subject data for bibliographic description. however, since wikidata is designed to represent all domains of knowledge and is not specific to library use, concerns about its capacity and suitability for describing library resources were raised by the wikidata community.[8] while there has been much discussion on and development of tools for creating full original linked data, less attention has been given to copy cataloging workflows (creating new short records by deriving from other records or creating minimum records) in linked data environments. developed and released as an open-source application in 2015, metadata maker[9] is a metadata creation tool that can be used by anyone regardless of their cataloging experience and knowledge, allowing them to create a minimum-level catalog record. metadata maker has been updated in several areas since then, including supporting different formats of resources (currently in ten modules) and a bibframe output service for monographs.[10] as more and more cataloging and metadata creation work relies on paraprofessional catalogers or non-catalogers[11] with language or subject expertise, the authors tried to revamp metadata maker with linked data authority services to test whether this tool and the updated functions can facilitate minimum record creation in a linked data cataloging environment. this paper shares the revamping process and issues found in linked data sources and their services, and discusses the user testing results of metadata maker and a bibframe editor. changing landscape the development of linked data technologies brought about a systematic change in libraries' cataloging production practice.
as van der werf said, "libraries used to be knowledge organizations and library professionals were trained in bibliographic description and authority control. now, authorities are called entities and the new description logic is about creating a 'knowledge graph of entities.'"[12] it is noticeable that the focus of metadata creation has gradually shifted from the curation of text strings to the management of entities (works, persons, corporate bodies, places, events, etc.), i.e., linking resources using uris and managing uris instead of name strings.[13] this revolution has triggered a discussion on linked data cataloging models, standards, and tools in libraries. changing library cataloging production practice libraries have carried out several initiatives to re-design cataloging workflows and devise transition plans from traditional cataloging to linked data cataloging, for example through the development of marc to bibframe conversion tools and bibframe editors. notably, the linked data for libraries (ld4l)[14] community made a series of significant efforts on linked data cataloging from 2014 to 2022, including linked data for libraries labs (ld4l labs),[15] linked data for production (ld4p),[16] linked data for production: pathway to implementation (ld4p2),[17] and linked data for production: closing the loop (ld4p3).[18] despite these new linked data cataloging tools, catalogers need to be versed in new linked data related knowledge and exercise new skills, such as rdf, sparql, the bibframe ontology, and more, to create library data as linked data. in addition, as linked data implementations in libraries are still under development, it is hard to keep up to date with the most current linked data application developments, e.g., bibframe editors, and it is challenging to identify the types of skills that catalogers need to develop. as a result, catalogers may feel overwhelmed by the new linked data technology, and administrators are experiencing challenges in designing and providing training for the ever-growing skill set and emerging linked data tools for catalogers.[19] the shifting roles of librarians and staff in technical services are an additional challenge in linked data training and planning. libraries used to depend on professional cataloging librarians to do original cataloging. copy cataloging was usually performed by paraprofessional catalogers. however, this is no longer true. with shrinking budgets, organizational restructuring, and changes in cataloging software and workflows, more paraprofessional staff are responsible for both original and copy cataloging tasks (el-sherbini & klim, 1997; zhu, 2012).[20] as van der werf articulated, the number of professional librarians is decreasing while the number of paraprofessional staff is increasing in cataloging departments.[21] in fact, not only is the number of professional librarians decreasing, but whole cataloging teams are also shrinking. while there are several options that can ease the staffing shortage, such as outsourcing to vendors, cooperative cataloging programs, and more productive cataloging workflows, libraries still lack staff with the expertise to catalog special collections and/or foreign language materials. the need for foreign language and special collection cataloging will not go away in a linked data environment, as libraries keep purchasing resources from foreign countries and working through perpetual backlogs.
bibframe editors and copy cataloging currently, there are three bibframe editors that are widely known and used: lc's bibframe editor,[22] marva,[23] and ld4p's sinopia.[24] all three editors seem to target experienced catalogers as their user group, not paraprofessional catalogers or non-catalogers. for one, they use resource description and access (rda) terms[25] as field names and bibframe's three categories of abstraction (work, instance, item) as record/data types. those cataloging terms, though commonly used by professional catalogers, may result in a learning curve for paraprofessional catalogers. for example, "parallel title" is not a common phrase, and the differences between work and instance are not self-explanatory for many. for another, some abbreviations that appear in the user interface as controlled vocabularies, including getty_aat, lcgft, and gac, are not familiar to paraprofessional catalogers. using the editor and adding appropriate values to those data fields requires training on rda, the bibframe ontology, authority work, and the editor itself at the very least. another challenge is a lack of clear definition as to what makes full-level and brief bibframe data. the core bibframe data fields are still under discussion by the program for cooperative cataloging (pcc) bibframe interoperability group (big).[26] as there are no clear guidelines, some bibframe editors mark required fields while some do not. for catalogers or users of bibframe editors, it seems that one needs to fill out all fields to create full-level bibframe data and provide values for the required fields, if any, to generate brief bibframe data. as there is no quick way of filling out the minimum data fields to produce brief bibframe data, the cataloging workflow used in the current bibframe editors might not meet libraries' needs for cataloging large volumes of perpetual backlogs with a shrinking cataloging team. lorimer (2022) stated that the notion of copy cataloging has broadened and expanded in a linked data environment,[27] which emphasizes reusing metadata rather than creating completely new metadata from scratch. some bibframe editors like sinopia indeed allow catalogers to search, load, and copy or clone existing bibframe data to revise and reuse those descriptions by sharing uris. this workflow would help reduce duplicate work-level bibliographic records and increase cataloging efficiency. yet, considering the reality and looking to the future, libraries, facing a shortage of professional catalogers and language/subject experts, will have to rely on non-catalogers and paraprofessional catalogers with limited linked data and cataloging knowledge to create records in bibframe editors. should users adapt to the bibframe editors, or should the editors be designed to be friendlier to their users? this dilemma raises a question: is it possible to build a linked data editor without cataloging jargon in the application interface? given the above-mentioned issues, this project is an attempt to build a straightforward linked data editor for non-catalogers that does not use rda terms, for the purpose of copy cataloging. libraries may benefit from adopting metadata maker as it does not require new hiring or training for catalogers and allows non-catalogers with the needed language/subject knowledge to create minimum-level cataloging records. the authors also conducted a small-scale survey to learn catalogers' opinions about metadata maker and a linked data editor.
revamping metadata maker metadata maker enables any user to create catalog records that are "good enough" (providing sufficient information to identify a bibliographic item and generate a basic bibliographic description)[28] in various formats, including marc, regardless of one's knowledge of or experience with cataloging standards, integrated library systems, or oclc. it now has ten different modules or templates (datasets[29], monographs[30], monographs (ld)[31], ebooks[32], government documents[33], maps[34], microfilms[35], scores[36], serials[37], theses and dissertations[38]). users can select a module based on the resource type, fill out basic information about the resource, and choose the download format, including marc binary, marcxml, metadata object description schema (mods), html, and bibframe.[39] for this phase, two new linked data features, virtual international authority file (viaf) personal name suggestions and library of congress subject heading (lcsh) suggestions, were added to the monographs (ld) module in metadata maker. the new functions support search and autocompletion of personal names in viaf and lcsh (keyword) generation based on user-provided text. uris of the controlled terms are added to the output metadata. figure 1. metadata maker interface screenshot. linked data input viaf name search the viaf personal name autocomplete dropdown list in fig. 2 uses the viaf auto suggest api[40] to retrieve the personal name's label, viaf uri, and library of congress name authority file (lcnaf) uri. when a name is selected, links to both uris, if they are available in viaf, will be presented in metadata maker. users have the option to verify the name entity's information on either the viaf or lcnaf page if so desired. the application then retrieves values of the 100 field subfields a to d from lcnaf whenever they are available. if no lcnaf uri is provided in viaf, the preferred label from dnb[41] is the alternative option if that can be found in viaf. the lcnaf and viaf uris are added to subfields 0 and 1, respectively, in the marc and marcxml 100 or 700 field based on the person's role. for other supported output formats, the uris and the label/preferred name are also inserted into the appropriate elements. if there is no satisfactory result in the autocomplete dropdown list, the form also allows users to manually input name strings. the code is available online.[42] figure 2. viaf auto suggest dropdown list.
// using the viaf auto suggest api to fetch personal names
(function($) {
  $.widget("oclc.viafauto", $.ui.autocomplete, {
    options: {
      select: function(event, ui) {
        alert("selected!");
        return this._super(event, ui);
      },
      source: function(request, response) {
        var term = $.trim(request.term);
        var url = "https://viaf.org/viaf/autosuggest?query=" + term;
        var me = this;
        $.ajax({
          url: url,
          dataType: "jsonp",
          success: function(data) {
            if (data.result) {
              response($.map(data.result, function(item) {
                if (item.nametype == "personal") {
                  var retlbl = item.term + " [" + item.nametype + "]";
                  var uri = "http://viaf.org/viaf/" + item.viafid;
                  if (item.lc) {
                    return {
                      label: retlbl,
                      value: item.term,
                      id: item.viafid,
                      viafuri: uri,
                      lcuri: "http://id.loc.gov/authorities/names/" + item.lc,
                      nametype: item.nametype
                    };
                  } else {
                    return {
                      label: retlbl,
                      value: item.term,
                      id: item.viafid,
                      viafuri: uri,
                      lcuri: "nolc",
                      nametype: item.nametype
                    };
                  }
                }
              }));
            } else {
              me._trigger('nomatch', null, {term: term});
            }
          }
        });
      }
    },
    _create: function() {
      return this._super();
    },
    _setOption: function(key, value) {
      this._super(key, value);
    },
    _setOptions: function(options) {
      this._super(options);
    }
  });
})(jQuery);

// get information for the user-selected name in the author input field
$(function() {
  $(".author").viafauto({
    select: function(event, ui) {
      var item = ui.item;
    }
  });
});

lcsh and fast suggest the second function that was added to metadata maker is lcsh suggestion using the annif api.[43] annif (http://annif.org/) is a subject suggestion tool for documents, originally developed by the national library of finland.[44] according to its webpage, annif can be trained through natural language processing and machine learning algorithms to support any kind of subject heading. to make annif support lcsh, the ld4p group used annif's built-in algorithms and training corpora from the ivyplus platform for open data (pod)[45] and share-vde (virtual discovery environment)[46] to train annif (hahn, 2022;[47] khan, 2020[48]).[49] upon request, the annif lcsh api returns a list of suggested lcsh labels, uris, and prediction scores. the list is sorted by score from high to low: the higher the score, the more relevant the subject heading is.
// annif lcsh api response
[
  { "label": "clothing and dress--china--history", "notation": null, "score": 0.06058865785598755, "uri": "http://id.loc.gov/authorities/subjects/sh2003012066" },
  { "label": "costume--china", "notation": null, "score": 0.014286939986050129, "uri": "http://id.loc.gov/authorities/subjects/sh85033251" },
  { "label": "costume--china--history", "notation": null, "score": 0.014127381145954132, "uri": "http://id.loc.gov/authorities/subjects/sh85033252" },
  { "label": "clothing and dress--history", "notation": null, "score": 0.011828765273094177, "uri": "http://id.loc.gov/authorities/subjects/sh2003012061" },
  { "label": "clothing and dress--social aspects", "notation": null, "score": 0.008354970254004002, "uri": "http://id.loc.gov/authorities/subjects/sh85027167" },
  { "label": "fashion--history", "notation": null, "score": 0.008040583692491055, "uri": "http://id.loc.gov/authorities/subjects/sh2008103592" },
  { "label": "fashion--history--20th century", "notation": null, "score": 0.007795797660946846, "uri": "http://id.loc.gov/authorities/subjects/sh2008103594" },
  { "label": "chinese poetry--translations into english", "notation": null, "score": 0.007471516728401184, "uri": "http://id.loc.gov/authorities/subjects/sh2008100615" },
  { "label": "medicine, chinese", "notation": null, "score": 0.0065437802113592625, "uri": "http://id.loc.gov/authorities/subjects/sh85083125" },
  { "label": "clothing and dress in literature", "notation": null, "score": 0.005863940808922052, "uri": "http://id.loc.gov/authorities/subjects/sh85033275" }
]

using the annif lcsh api, metadata maker can recommend ten lcsh terms given a book summary in any romance language. users can select zero to ten lcsh terms by checking the provided checkboxes. it is also possible to re-run the suggest function by updating the summary in the input box and clicking the suggest button. if users are not satisfied with the recommended keywords or are uncomfortable using lcsh, they can still use an autocomplete faceted application of subject terminology (fast) heading search box to add keywords. figure 3. keyword (summary suggest and keyword search box) screenshot.

// if a user clicks the #lcshsuggest button, lcsh suggestions based on the user's
// text input in the #summary box are generated in the #lcshresponse div container
$(function() {
  document.getElementById('lcshsuggest').onclick = function() {
    document.getElementById("lcshresponse").innerHTML = "";
    var summary = document.getElementById('summary').value;
    if (summary != null) {
      var requests = "text=" + summary;
      var url = "http://annif.info/v1/projects/upenn-omikuji-bonsai-en-gen/suggest";
      var xhr = new XMLHttpRequest();
      xhr.open("POST", url, false);
      xhr.setRequestHeader("content-type", "application/x-www-form-urlencoded");
      xhr.setRequestHeader("accept", "application/json");
      xhr.onreadystatechange = function() {
        if (xhr.readyState === 4) {
          var data = xhr.responseText;
          var jsonResponse = JSON.parse(data);
          console.log(jsonResponse);
          if (jsonResponse["results"] && jsonResponse["results"].length) {
            for (var i = 0; i < jsonResponse["results"].length; i++) {
              var lcshLabel = jsonResponse["results"][i]["label"];
              var lcshUrl = jsonResponse["results"][i]["uri"];
              // append a checkbox for each suggested heading so users can select terms;
              // the exact markup here is illustrative
              document.getElementById("lcshresponse").innerHTML +=
                '<input type="checkbox" name="lcsh" value="' + lcshUrl + '"> ' + lcshLabel + '<br>';
            }
          }
        }
      };
      xhr.send(requests);
    }
  };
});
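for testing the same suggestion endpoint outside the browser, a server-side call is straightforward. the short python sketch below mirrors the javascript above; the endpoint and project id are taken from that snippet and may change over time, so treat them as assumptions rather than a stable api.

# minimal sketch: request lcsh suggestions for a summary from the annif endpoint
# used in the javascript above. the endpoint and project id may change over time.
import requests

summary = "a history of clothing and dress in china"
resp = requests.post(
    "http://annif.info/v1/projects/upenn-omikuji-bonsai-en-gen/suggest",
    data={"text": summary},
    headers={"Accept": "application/json"},
    timeout=30,
)
for result in resp.json().get("results", []):
    print(round(result["score"], 4), result["label"], result["uri"])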
bibframe output with recent updates, the bibframe output data now includes uris of personal names, lcsh, and fast headings in the monographs (ld) module. in the output, for example, the lcnaf uri of "shakespeare" is added to the agent node of the contributor description, and both the viaf and lcnaf uris of "shakespeare" are added as the values of identifiers; the agent is labeled "shakespeare, william, 1564-1616". similarly, for subjects, the fast heading uri is added to the topic node, which is labeled "comedy plays [form/genre]". some considerations while developing new features for metadata maker, the authors found some issues with the apis and linked data sources. encoding viaf provides a single name authority file that combines name authority files from more than 40 organizations,[50] making it convenient for libraries to take advantage of linked data and obtain information about name entities from one source. yet, the aggregation process might cause some encoding issues in viaf records. for example, when one searches for "greta reyghere,"[51] the name includes empty boxes in the dropdown list returned by the api. the same issue also appeared in the viaf json record: the source of the name with empty boxes was dnb, according to the viaf json record (see fig. 5).[52] however, the dnb record did not have anything anomalous.[53] it seems that the empty boxes in the name label only exist in the viaf record; the aggregation process in viaf might be the reason. figure 4. empty boxes in viaf. figure 5. empty boxes in viaf json. name entities search scope when describing resources in bibframe editors, cataloging experts tend to use name authority files like lcnaf. however, non-catalogers or paraprofessional catalogers may not be aware of those sources and are more likely to rely on the linked data editor itself. bibframe editors are therefore expected to accommodate the different name entity search behaviors of experienced and nascent catalogers. specifically, there are two expectations for the name entity search function in linked data editors: (1) no constraints on the order of a name; and (2) support for variant name searching. as many non-professional catalogers may not receive identity management (authority) training, it is not intuitive for them to search names following the marc 100 field format: "last name, first name." it is also important to connect bibframe editors to various linked data sources for name entities on the web and to collect name variants from as many sources as possible. to meet the two expectations, metadata maker adopts the viaf auto suggest api for personal name searching. the viaf auto suggest api supports both preferred name and variant name searching without any name format or name order constraints. this flexibility allows non-catalogers to find the desired personal name in different ways. one bibframe editor that was tested for this project supports only preferred name label search. a korean author, han, shin-kap,[54] has the name variants "한신갑" and "shin-kap han" in his authority record. the bibframe editor only returned a result when the term "han, shin-kap" was searched, as it matched the existing lcnaf record's 100 field. the other two variant names did not return any results, as the selected editor does not support variant name search. the failed search may drive non-catalogers to create duplicate name entity records or to use strings instead of uris to represent the person. figure 6. search han, shin-kap in a linked data editor.
figure 7. search 한신갑 in a linked data editor. figure 8. search shin-kap han in a linked data editor. quality of authority data viaf authority data provided via json-linked data (json-ld) format does not always have detailed and granular information. viaf authority cluster endpoint allows catalogers to retrieve authority data in various formats.[55] the name-related elements in the json-ld representation of viaf authority records include family name, given name, alternative name, and name (full name). more complicated names may contain title, numeration, and other information about the entity. take “john paul ii, pope” as an example.[56] “pope” is the title of “john paul ii.” “john paul” is the papal name and “ii” is the numeration. however, in his viaf json-ld record (see below), “john paul ii” is treated as the family name and “pope” is treated as the given name, which is not correct. while this would not be a problem when using data models that do not require name parts information like bibframe, it could be a problem for schemas that have fields or attributes specifically designated for name part, e.g., first name and last name. // json-ld description of john paul ii, pope in viaf { ... "familyname" : [ "janis", "john paul ii", "juan pablo ii.", "jawién", "joannes paulus ii.", "ioannis pauli ii", "yūhạnnā būlus at-tanī", "ioann pavel ii", "wojytla", "jean paul ii.", "ויטילה", "vajtyla", "wojtila", "ii", "jean paul ii", "jean-paul ii.", "voitilah", "ján pavol ii.", "jános pál ii.", "ivan pavao ii.", "yuhạnnā-būlus at-tanī", "jawieň", "ṿoiṭilah", "juan pablo ii", "vojtyla", "ivan pavlo ii.", "ян павел ii", "johannes paulus ii.", "giovanni paolo‏ ii", "voityla", "jasien", "jasień", "yoḥanan paʾulus ha-sheni", "voitila", "xoán paulo ii", "ṿoiṭilah", "jasień", "ואיטילה", "gruda", "giovanni paolos ii.", "wojtyla", "johano paŭlo la dua", "войтыла", "jawień", "wojtiła", "johannes paul ii", "paulus", "yuḥannā-būlus at-tānī", "johannes paul ii.", "john paul ii.", "wojtyła", "보이티야", "アンジェイ", "jan paweł ii", "jean-paul ii", "yuhạnnā-būlus at-tanī", "보이티와", "janez", "jan paweł ii.", "jawien", "jan paweł", "jawień", "yūḥannā būlus at-tānī", "giovanni paolo ii", "janez pavel ii.", "ioannis paulus ii.", "yūḥannā būlus at-tānī", "vojtila", "iohannes paulus pp. ii", "yūhạnnā būlus at-tanī", "jan pavel druhý", "ioannes paulus ii.", "jānis pāvils ii.", "yuḥannā-būlus at-tānī", "joannes paulus ii" ], "gender" : "http://www.wikidata.org/entity/q6581097", "givenname" : [ "karal'", "pape", "papież", "karol józef", "al-bābā", "stanislaw andrzej", "carlo", "karols", "karol'", "stanisław a.", "ḳarol", "pope", "папа рымскі", "кароль", "קארול", "‏ papa", "karol joźef", "johannes", "andrzej", "pāvests", "papież", "ḳarol", "carol", "ヤヴィエニ", "karol j.", "카롤", "piotr", "saint", "lolek", "stanisław", "k.", "stanisław andrzej", "papa", "heiliger", "santo", "karolis", "karol jozef", "pavils", "pápa", "papa", "카롤 유제프", "karol józef", "karolʹ", "papst, heiliger", "papst", "al-bābā", "ii", "karol", "karel", "pavel", "pape", "john paul", "paus", "קרול" ], ... } testing after adding the viaf api into metadata maker, the authors did a very small scale unofficial usability testing in university of illinois with eleven participants: five paraprofessional catalogers who create original cataloging records as part of their responsibilities; two hourly catalogers who did not have cataloging knowledge but with language and subject knowledge; two graduate assistants; and two cataloging and metadata librarians. 
they were asked to create a record for a monograph book in both sinopia and metadata maker and to share their thoughts on two things: ease of use and the knowledge/skills required to use each tool. the survey also had a section where testers could add their own comments.[57]

ease of use

for the first question, testers could choose one answer from the following options: extremely hard; hard, but can follow through it; easy; very easy.

figure 9. survey result: ease of use.

eight participants said that metadata maker is easy to use (five chose "very easy" and three chose "easy"), while ten people said that sinopia is hard to use (five chose "extremely hard" and another five chose "hard, but can follow through it"). the survey reveals that the majority of participants prefer the simple interface of metadata maker to the relatively complex and verbose interface of sinopia. one person answered that metadata maker is "extremely hard" to use and two people chose "hard, but can follow through it". those who answered that metadata maker is hard to use are paraprofessional catalogers who create original records in oclc. during the follow-up interview, they expressed that they do not like the simple interface of metadata maker or the notion of creating short/minimal records; they want bibframe editors to be similar to oclc connexion, the tool they are familiar with, which allows them to create full-level cataloging records. an undergraduate student with language skills answered that sinopia is easy to use. the student added that while there is a lot to learn and it takes time, they could follow through sinopia by reading the information provided for each element. while sinopia allows users to view the output data in json-ld, turtle, n-triples, rdf table, and interface view formats, three participants commented that it is hard to check the outcome of their work in sinopia. it might be because those participants have not learned rdf data models and linked data serialization formats. metadata maker, however, allows records to be downloaded and viewed locally. those participants also added that it would be helpful to know the dataflow once the record is created in both editors.

knowledge and skills required to use the editors

the second multiple-choice question asked participants what kind of skills they thought were needed for the two bibframe editors, such as functional requirements for bibliographic records (frbr),[58] rda, bibframe, lcsh and other controlled vocabularies, name authority, linked data, and marc. however, the authors quickly realized that the jargon and acronyms in this question caused misunderstandings for many participants, as they did not know some or all of the options, especially the two non-catalogers who do not have cataloging knowledge or education. the staff members who routinely create original records are also not familiar with frbr, bibframe, and linked data. as a result, the answers to this question vary widely, as shown below:

table 1. answers from 11 participants: knowledge and skills required to use the editors.
sinopia | metadata maker
unsure | none
bibframe, marc, lcsh and other controlled vocabularies, name authority, linked data | marc, none
rda, bibframe, frbr, lcsh and other controlled vocabularies, linked data, need an extreme understanding of frbr terms and rda standards just to read/understand the interface | i feel like you don't actually need to know anything about cataloging standards to use this interface
marc, lcsh and other controlled vocabularies, name authority, linked data | marc
marc, lcsh and other controlled vocabularies, name authority, linked data | none
bibframe | none
rda, bibframe, frbr, linked data, i did not use it enough to know all that one needs to know, but this is meant for experienced (and very technically savvy) catalogers | none, if applicable, a non-english language
marc, lcsh and other controlled vocabularies | i do not know?
rda | basic book information
everything | basic book information
none | none

however, one thing that is clear is that while many participants said there are things that are necessary to learn in order to use the bibframe editor, the majority of participants said no knowledge is needed to use metadata maker.

discussion and next steps

the process of revamping metadata maker with linked data sources and bibframe output demonstrated the possibility of building a linked data editor, free of cataloging terminology, that can be used by anyone. the intuitive design, self-explanatory wording, and one-page web form lower the barriers to bibframe cataloging and allow non-professional catalogers and language/subject experts to get involved in linked data metadata creation. as metadata maker is designed for generating "good enough" records, it can also serve as a quick bibframe generation tool for paraprofessional catalogers. however, the authors heard some concerns from catalogers with regard to using this tool in practice, such as an oversimplified interface and an unclear dataflow. the authors were perplexed by the varying degrees of acceptance of metadata maker among survey participants: paraprofessional catalogers are inclined to use connexion-like editors with the option to describe detailed information about resources, whereas nascent catalogers might be more comfortable using linked data editors that do not require such prerequisite knowledge. the developers of linked data editors will need to balance those two needs.

while the library domain has made significant progress in the development of and experimentation with linked data and bibframe production, there are still many things that the library community has to think through and work together on to find solutions. first, a clear dataflow needs to be established. as of now, bibframe linked data created in the current bibframe editors is not automatically ingested into any integrated library system.[59] this was brought up by several staff members who tested sinopia. in addition, most vendors do not support bibframe import as of this writing. the authors acknowledge that the dataflow may require a new integrated library system that can work with metadata in different formats and with a different ontology. second, libraries may have a completely different data sharing method in the linked data environment compared with the current centralized shared database.[60] if that is the case, what would a data sharing model look like? if it is still possible to have a centralized linked data database, then who is going to manage it, and how is it going to be managed?
third, a discussion of the distribution of work between human catalogers and machines needs to start. as machines can do marc-to-bibframe conversion and authority reconciliation work rather effectively, libraries might want to think about what machines can do and what cataloging and metadata professionals should do. if there are tasks that machines can do better, it would be better to leave those to the machines and to identify what cataloging and metadata professionals should focus on in terms of linked data creation and workflows. fourth, according to fortier, pretty, and scott (2022),[61] the understanding and knowledge of bibframe among canadian libraries is still low after close to two decades of ongoing discussion and development efforts. while it is important to understand the underlying structure of bibframe and linked data, it would be worthwhile to think about how much training is adequate for cataloging professionals and how much integration of rda terms into the bibframe editors is necessary for the transition to linked data creation. or maybe what libraries really need is a linked data editor rather than a bibframe editor. if there are problems in understanding bibframe and rda among ourselves, it will be much more difficult for users on the web to understand what kind of data we are sharing.

about the authors

greta heng (orcid: 0000-0002-3606-6357) is cataloging and metadata strategies librarian at san diego state university. myung-ja (mj) k. han (orcid: 0000-0001-5891-6466) is a professor and metadata librarian at the university of illinois at urbana-champaign.

bibliography

[1] library of congress. bibliographic framework initiative. https://www.loc.gov/bibframe/.
[2] world wide web consortium (w3c). rdf. https://www.w3.org/rdf/.
[3] wennerlund, b., & berggren, a. (2017). leaving comfort behind: a national union catalogue transition to linked data. paper presented at: ifla wlic 2019 – athens, greece – libraries: dialogue for change in session s15 – big data. in: data intelligence in libraries: the actual and artificial perspectives, 22-23 august 2019, frankfurt, germany.
[4] french national library. semantic web and data model. https://data.bnf.fr/en/semanticweb.
[5] german national library. linked data service. https://www.dnb.de/en/professionell/metadatendienste/datenbezug/lds/lds_node.html.
[6] library of congress. marva editor. https://bibframe.org/marva/editor/.
[7] wikidata. wikidata main page. https://www.wikidata.org/wiki/wikidata:main_page.
[8] godby, j., smith-yoshimura, k., washburn, b., davis, k., detling, k., eslao, c., folsom, s., li, x., mcgee, m., miller, k., moody, h., thomas, c., & tomren, h. (2019). creating library linked data with wikibase: lessons learned from project passage (pp. 70). oclc research. https://doi.org/10.25333/faq3-ax08.
[9] han, m. k., ream-sotomayor, n. e., lampron, p., & kudeki, d. (2016). making metadata maker: a web application for metadata production. library resources & technical services, 60(2), 89–98. all of the source code is available on github: https://github.com/dkudeki/metadata-maker; metadata maker is still in the exploratory phase and currently only supports linked data cataloging for monographs.
[10] michael, b., & han, m. j. k. (2019). assessing bibframe 2.0: exploratory implementation in metadata maker. proceedings of the international conference on dublin core and metadata applications, 26-31.
[11] non-catalogers refer to people who do cataloging work but do not have adequate cataloging experience or may not need it as they do not pursue a career in cataloging. [12] van der werf, t. (2021, march 4). next generation metadata… it’s getting real! hanging together, oclc research blog. https://hangingtogether.org/next-generation-metadata-it-is-getting-real/. [13] dalgord,c. shared entity management infrastructure project update. oclc. https://www.loc.gov/bibframe/news/source/bibframe-from-home-oclc-update.pptx. [14] linked data for libraries. https://wiki.lyrasis.org/pages/viewpage.action?pageid=41354028. [15] linked data for libraries labs. https://wiki.lyrasis.org/pages/viewpage.action?pageid=77447730. [16] linked data for production. https://wiki.lyrasis.org/pages/viewpage.action?pageid=74515029. [17] linked data for production: pathway to implementation. https://wiki.lyrasis.org/display/ld4p2. [18] linked data for production: closing the loop. https://wiki.lyrasis.org/display/ld4p3. [19] lnenicka, m., kopackova, h., machova, r., & komarkova, j. (2020). big and open linked data analytics: a study on changing roles and skills in the higher educational process. international journal of educational technology in higher education, 17(1), 1-30. [20] el-sherbini, m. & klim, g. (1997). changes in technical services and their effect on the role of catalogers and staff education: an overview. cataloging & classification quarterly, 24(1-2), 23-33; zhu, l. (2012). the role of paraprofessionals in technical services in academic libraries. library resources & technical services, 56(3), 127-154. [21] van der werf, next generation metadata… it’s getting real! [22] library of congress, bibframe editor. https://bibframe.org/bfe/index.html. [23] library of congress, marva. https://bibframe.org/marva/editor/. [24] linked data for production: pathway to implementation. sinopia. https://sinopia.io/. [25] rda toolkit: https://www.rdatoolkit.org/. [26] bibframe interoperability group. (2022. april 15). terms of reference. https://www.loc.gov/aba/pcc/bibframe/taskgroups/big/big-tor.pdf. [27] lorimer, n.(2022, march 8). re-use or copy? redefining copy cataloging in a linked data environment. ala copy cataloging ig, online. https://docs.google.com/presentation/d/1ukxcdjea-cwmxnfixfibdpbcvn_jxvmymoojgzibi9o/edit?usp=sharing. [28] library of congress. appendix c – minimal level record examples. https://www.loc.gov/marc/bibliographic/bdapndxc.html. [29] http://quest.library.illinois.edu/marcmaker/dataset/. [30] http://quest.library.illinois.edu/marcmaker/. [31] aka, monograph (linked data), http://quest.library.illinois.edu/marcmaker/monoviaf/. [32] http://quest.library.illinois.edu/marcmaker/ebooks/. [33] http://quest.library.illinois.edu/marcmaker/govdocs/. [34] http://quest.library.illinois.edu/marcmaker/maps/. [35] http://quest.library.illinois.edu/marcmaker/microfilms/. [36] http://quest.library.illinois.edu/marcmaker/scores/. [37] http://quest.library.illinois.edu/marcmaker/serials/. [38] http://quest.library.illinois.edu/marcmaker/theses/. [39] bibframe is only added to two monograph modules for now. [40] oclc developer network. authority cluster resource. https://www.oclc.org/developer/api/oclc-apis/viaf/authority-cluster.en.html. [41] dnb was selected as an alternative name label source because (1) it provides linked data service; and (2) it is a national library for a non-native english speaking countries which may compensate for lcnaf. 
[42] https://github.com/dkudeki/metadata-maker/blob/monoviaf/lcsh/lcshsearch.js. [43] suominen, o., inkinen, j., virolainen, t., fürneisen, m., kinoshita, b. p., veldhoen, s., sjöberg, m., zumstein, p., neatherway, r., & lehtinen, m. (2022). annif (version 0.60.0-dev) [computer software]. https://doi.org/10.5281/zenodo.2578948; https://api.annif.org/v1/ui/. [44] annif github repository. https://github.com/natlibfi/annif. [45] ivyplus platform for open data. https://pod.stanford.edu/. [46] share-vde (virtual discovery environment). https://www.svde.org/. [47] hahn, j. (2022, june 20). cataloger acceptance and use of semiautomated subject recommendations for web scale linked data systems. 87th ifla world library and information congress (wlic) / 2022 in dublin, ireland. https://repository.ifla.org/handle/123456789/1955. [48] khan ,h. (2020, march 10). annif use and explanation. linked data for production: pathway to implementation. https://wiki.lyrasis.org/display/ld4p2/annif+use+and+explanation. [49] when accessed http://lcsh.annif.info/ in october 2022, annif lcsh api project updated its vocabulary sources: “ivyplus-tfidf” was changed to “penn-fasttext-en” (penn (lcsh english) conference papers and proceedings), “upenn-omikuji-bonsai-en-gen” (upenn (lcsh english) all genres), and “upenn-omikuji-bonsai-spa-gen” (upenn (lcsh spanish) all genres). [50] virtual international authority file. https://www.oclc.org/en/viaf.html. [51] viaf authority record for de reyghère, greta. retrieved on september 11, 2022, from http://viaf.org/viaf/69118441. [52] viaf authority record in json for de reyghère, greta. retrieved on september 11, 2022, from https://viaf.org/viaf/69118441/viaf.json. [53] dnb authority record for de reyghère, greta. retrieved on september 11, 2022, from https://hub.culturegraph.org/entityfacts/134496175, and https://d-nb.info/gnd/134496175. [54] viaf authority record for han, shin-kap. retrieved on september 11, 2022, from http://viaf.org/viaf/198153409742041581752. [55] oclc authority cluster resource. https://www.oclc.org/developer/api/oclc-apis/viaf/authority-cluster.en.html. retrieved on october 5, 2022. [56] viaf authority record in json-ld for john paul ii, pope. retrieved on september 20, 2022, from https://viaf.org/viaf/35605/viaf.jsonld. [57] we chose sinopia over other bibframe editors because it is created for the community and has pcc templates that have been tested out by many catalogers. we also understand that the purpose of the bibframe editor and metadata maker are different. [58] the international federation of library associations and institutions. functional requirements for bibliographic records (frbr). https://www.loc.gov/marc/bibliographic/bdapndxc.html. [59] there are some unofficial statements that folio and ex libris have been working on bibframe data import. but as of october 6, 2022, there has not been a bibframe data import function released by them. [60] library of congress. bibframe and the pcc. https://www.loc.gov/aba/pcc/bibframe/bibframe-and-pcc.html. [61] fortier, a., pretty, h., & scott, d. (2022): assessing the readiness for and knowledge of bibframe in canadian libraries, cataloging & classification quarterly. https://doi.org/10.1080/01639374.2022.2119456. 
the code4lib journal – a fast and full-text search engine for educational lecture archives, issue 55, 2023-1-20

a fast and full-text search engine for educational lecture archives

e-lecturing and online learning have become more common and convenient than offline teaching and classroom learning in the academic community since the covid-19 pandemic. universities and research institutions are recording the lecture videos delivered by faculty members and archiving them internally, and most lecture videos are hosted on popular video-sharing platforms through private channels. students access the published lecture videos independent of time and location, but searching large video repositories is difficult for students because search is restricted to metadata. we present the design and development of an open-source application for building an educational lecture archive with fast, full-text search within the video content.

by arun f. adrakatti and k.r. mulla

introduction

e-lecturing has become increasingly popular over the past decade, and there has been an exponential increase in the amount of lecture video data posted on the internet. many universities and research institutions are recording their lectures and publishing them online for students to access independently of time and location. in addition, numerous massive open online courses (moocs) are popular across the globe for their ability to provide online lectures in a wide variety of fields; the availability of online courses and their ease of access have made them a popular learning tool. lecture videos and moocs are hosted on cloud platforms to make them available to registered users, and most of these resources are for the benefit of the public. the majority of lecture videos are available on popular video-sharing platforms such as youtube, vimeo, and dailymotion, since a shortage of storage servers and internet bandwidth makes it difficult for academic and research institutions to maintain video repositories themselves.

a great number of lecture videos uploaded to online platforms are annotated with only a few keywords, which results in search engines returning incomplete results. only a limited number of keywords are available for finding lectures, and searching relies primarily on their occurrence as tags. videos are retrieved solely on the basis of their metadata, such as title, author information, and annotations, or by user navigation from generic to specific topics, so users can retrieve materials based only on these limited options. irrelevant and random annotations, navigation, and manual annotations made without considering the video content are the major stumbling blocks to retrieving lecture videos.

existing approaches and proposed models

the rapid growth of video data has made efficient video indexing and retrieval one of the most essential concerns in multimedia management.[1] bolettieri et al.
(2007) proposed a system based on milos, a general-purpose multimedia content management system developed to aid in the design and implementation of digital library applications.[2] the goal was to show how digital material, such as video documents or powerpoint presentations, may be reused by utilizing existing technologies for automatic metadata extraction, including ocr, speech recognition, cut detection, and mpeg-7 visual content-based video retrieval (cbvr). yang et al. (2011) provided a method for automated lecture video indexing based on video ocr technology: they built a new video segmenter for automated slide video structure analysis and implemented a new algorithm for slide structure analysis and extraction based on the geometrical information of identified text lines.[3] an approach for automated video indexing and video search in huge lecture video archives has also been developed.[4] by using video optical character recognition (ocr) technology on key-frames and automatic speech recognition (asr) on lecture audio files, automatic video segmentation and key-frame identification provide a visual guideline for video content navigation as well as textual metadata. for keyword extraction, both video- and segment-level keywords are recovered for content-based video browsing and search, using ocr and asr transcripts as well as identified slide text. in short, these authors developed speech-to-text extraction from the videos together with text extracted by ocr from the slides used while delivering the lectures. most of these proposed ideas and concepts have been implemented in commercial platforms that academic and research institutions are often not in a position to afford.

motivation for the project

in the indian scenario, the vast majority of lecture video repositories and moocs are hosted on the youtube platform, and the web links are embedded in content management systems (cms) and e-learning applications. typically, these applications only search the metadata and human annotations associated with specific videos. some applications lack search engines altogether; the swayam (study webs of active-learning for young aspiring minds) platform is the best example of this, as the user needs to browse the videos by subject on that platform. commercial video lecture applications focus only on recording and organizing video lectures by subject and topic; the retrieval of videos from repositories is the least of their concerns, and the applications neglect user needs. users often do not find the desired information in lecture video repositories and invest a lot of time in browsing and listening to videos to find it. the popular open source institutional repository systems are limited to maintaining document and image files, and retrieval is restricted to metadata and human annotations. these limitations and restrictions led to the design and development of an educational lecture archive with more focus on retrieving videos based on their content.

development of a content-based video retrieval system

the capture of e-lectures has become more popular and the amount of lecture video data on the web is growing rapidly. universities and research institutions are recording their lectures and publishing them online for students to access independent of time and location. on the other hand, users find it difficult to search for desired information in these video repositories.
to overcome this issue, we designed and developed an application that organizes educational lecture videos and takes the practical approach of searching videos based on the entire speech contained in each video. the application is named cbmir (content-based multimedia information retrieval). the scope of cbmir covers only educational lecture video repositories maintained and hosted by universities, r&d institutions, and nonprofit organizations in india, and it is limited to speech extraction and automatic text indexing of lecture videos available in the english language.

figure 1. overview of the cbmir application.

technical requirements

linux-based os – ubuntu
django – python-based web framework
python – programming language
anaconda – python distribution
whoosh – a fast, full-text search engine library written in python

the cbmir application has three major modules: the administrator, information processing & retrieval, and user platform. figure 2 shows the detailed workflow, from an administrator uploading a video into the repository to a user retrieving the desired videos.

flow chart

figure 2. flow chart of content-based multimedia information retrieval.

modules of the cbmir application

administrator module

the administrator module allows the admin to upload external videos using hosted weblinks, and to upload video files from a personal device or external storage device. figure 3 shows the dashboard of the administrator module. the authorized admin has access to the application to manage and upload data. the admin can view/edit/delete uploaded videos using the manage content library option, and can add new users and assign limited admin roles to distribute the workload of uploading content when building large educational repositories. the admin is also given the privilege to add/edit/delete the transcript text uploaded into the database.

figure 3. dashboard of the administrator module.

information processing and retrieval module

the following are the step-by-step processes for the information processing and storage of videos within the cbmir application; a minimal code sketch of these steps appears below. figure 4 shows the status bar of the work process after uploading an external video link into the cbmir application.

uploading videos – download the video from an external source: the video is downloaded to the application server from an external server, using the youtube api to access youtube data and the google api for login credentials. download the video from a device: the video is uploaded from a personal device or external storage device using a python library.
video to audio – the video file is converted to audio using the ffmpeg python library.
audio to text – the audio file is converted to a text file with speech recognition, using the pocketsphinx python library with an acoustic model.
text segmentation – the whole text is broken into multiple segments based on the timings.
automatic indexing and searching – the text file is auto-indexed using the whoosh library.

figure 4. the status bar of processing of information and storage.

user module

the user module acts as a search engine, where the page contains a search box. a dashboard below the search box displays the total number of videos and the subject- or topic-wise collections. the user can search for desired information using keyword terms, and search results are displayed based on word occurrence.
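to make the workflow concrete, here is a minimal python sketch of the video-to-audio, audio-to-text, segmentation, indexing, and searching steps. it is not the cbmir source code (which has not yet been released); the youtube download step is omitted, and the segment length, directory names, and schema fields are illustrative assumptions. it assumes ffmpeg is on the system path and that the speechrecognition (with pocketsphinx) and whoosh packages are installed.

```python
import os
import subprocess

import speech_recognition as sr
from whoosh import index
from whoosh.fields import ID, NUMERIC, TEXT, Schema
from whoosh.qparser import QueryParser

SEGMENT_SECONDS = 30  # illustrative segment size for time-based text segmentation


def video_to_audio(video_path: str, wav_path: str) -> None:
    """video to audio: extract a 16 kHz mono wav track with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )


def audio_to_segments(wav_path: str):
    """audio to text: transcribe fixed-length chunks with pocketsphinx so each
    piece of text keeps the timestamp (in seconds) at which it starts."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        total = int(source.DURATION)
        for start in range(0, total, SEGMENT_SECONDS):
            chunk = recognizer.record(source, duration=SEGMENT_SECONDS)
            try:
                yield start, recognizer.recognize_sphinx(chunk)
            except sr.UnknownValueError:
                continue  # skip chunks with no recognizable speech


def build_index(index_dir: str = "indexdir"):
    """automatic indexing: each timed segment becomes one whoosh document."""
    os.makedirs(index_dir, exist_ok=True)
    schema = Schema(
        video_id=ID(stored=True),
        start=NUMERIC(stored=True),
        content=TEXT(stored=True),
    )
    return index.create_in(index_dir, schema)


def ingest(ix, video_id: str, video_path: str) -> None:
    wav_path = video_id + ".wav"
    video_to_audio(video_path, wav_path)
    writer = ix.writer()
    for start, text in audio_to_segments(wav_path):
        writer.add_document(video_id=video_id, start=start, content=text)
    writer.commit()


def search(ix, term: str):
    """searching: return (video_id, start_time) pairs whose segment mentions the term."""
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(term)
        return [(hit["video_id"], hit["start"]) for hit in searcher.search(query)]


if __name__ == "__main__":
    ix = build_index()
    ingest(ix, "lecture01", "lecture01.mp4")
    print(search(ix, "speech"))
```

storing the start time of each transcribed segment alongside its text is what allows the player to jump straight to the timestamp that matches the user's search term, as described below.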
the text files produced from the audio are auto-indexed by the whoosh library into a fast, searchable format. the keyword term is searched in the database and a list of matching videos is displayed; a video starts playing on a single click.

figure 5. display of search results for the keyword "speech".

the whole text is broken into multiple segments based on the speech timing and auto-indexed into the database. with the advantage of segmentation, the video starts playing at the specific keyword or phrase searched by the user.

figure 6. display of search results based on time segmentation.

unique features of the application

open source application: the cbmir application is designed and developed on open source web applications and databases. the source code will be shared on a popular source code distribution platform for further development by the community.

fast and full-text search engine: searching through a full-text database or text document is referred to as full-text search. a search engine embedded within cbmir analyses all the words contained within a document and compares them to the search criteria specified by the user. full-text search over the converted text ensures fast retrieval of results even for large numbers of documents, and the search engine returns accurate and precise results across all fields.

domain-based lecture video repository: the cbmir application allows subject-based lecture video archiving for academic institutions. such a repository provides consolidated subject-specific lectures and helps users spend less time searching popular video-sharing platforms for lectures.

no video advertisements on the application: videos in the cbmir application do not play advertisements, whereas popular video-sharing platforms typically include advertisements at the beginning and in the middle of their videos.

lightweight, less storage space, and unlimited video upload: the cbmir application is lightweight; it extracts the speech from a video's audio, converts it into text, and stores the text files in the database, so it usually requires less storage space. there is no limit on uploading videos into the application.

text segmentation for retrieval of video content: the application divides the text converted from audio into sets of words along with video timestamps, and the video is played from the specific timestamp matching the user's search term.

conclusion

due to covid-19, e-lecturing became a part of the academic community at all levels of education. academic institutions find it difficult to archive lecture videos on internal servers and instead host them on popular video-sharing platforms for future use. users find it difficult to locate the desired video information in these larger video repositories, since the search function is restricted to the metadata, human annotations, and tags of a particular video. to overcome this problem, the cbmir application has been developed using open source technology to build educational video repositories organized by subject, focusing on fast and full-text search of the video content. the current cbmir application is limited to converting speech to text in the english language only.
further development of the application includes converting speech to text for indian regional languages, translating the converted transcripts into indian regional languages, adding voice search options, text extraction using ocr technology, finding objects in the videos, creating a dashboard in the user module, and more. the source code of the cbmir application is being submitted to the authors' university in order to fulfill requirements for a degree and will be distributed on github under a creative commons cc by-nc license after completion of the degree.

references

[1] saoudi, e. m., & jai-andaloussi, s. (2021). a distributed content-based video retrieval system for large datasets. journal of big data, 8(1). https://doi.org/10.1186/s40537-021-00479-x
[2] bolettieri, p., falchi, f., gennaro, c., & rabitti, f. (2007). automatic metadata extraction and indexing for reusing e-learning multimedia objects. proceedings of the acm international multimedia conference and exhibition. https://doi.org/10.1145/1290067.1290072
[3] yang, h., siebert, m., lühne, p., sack, h., & meinel, c. (2011). lecture video indexing and analysis using video ocr technology. proceedings – 7th international conference on signal image technology and internet-based systems, sitis 2011. https://doi.org/10.1109/sitis.2011.20
[4] yang, h., & meinel, c. (2014). content based lecture video retrieval using speech and video text information. ieee transactions on learning technologies, 7(2). https://doi.org/10.1109/tlt.2014.2307305

about the authors

arun f. adrakatti (arun@iiserbpr.ac.in) is assistant librarian at the indian institute of science education and research (iiser) in berhampur, odisha, india. k.r. mulla (krmulla@vtu.ac.in) is a librarian at visvesvaraya technological university in belagavi, karnataka, india.

the code4lib journal – drying our library's libguides-based webpage by introducing vue.js, issue 55, 2023-1-20

drying our library's libguides-based webpage by introducing vue.js

at the kingsborough community college library, we recently decided to bring the library's website more in line with dry principles (don't repeat yourself). we felt this could improve the site by producing more concise and maintainable code: dryer code would be easier to read, understand, and edit. we adopted the vue.js framework in order to replace repetitive, hand-coded dropdown menus with programmatically generated markup. using vue allowed us to greatly simplify the html documents, while also improving maintainability.

by mark e. eaton

keeping it dry

a common goal among programmers is to write code that is dry, in other words, code where you don't repeat yourself. this is usually motivated by the insight that computers can often effectively automate repetitive tasks, making it unnecessary to repeat yourself in code. taking advantage of the efficiencies of automation is widely regarded as a best practice among programmers. however, html, when written by hand, is unfortunately not well suited to dry practices.
html is particularly declarative: all elements of the page are explicitly laid out by the programmer, so as to fully describe its structure. the problem is that hand-written webpages are often not very dry; even pages of relatively modest complexity can quickly grow into very long html documents. this can be problematic for a few reasons:

it can become difficult to conceptualize the structure of a whole page when it stretches out over hundreds of lines.
even relatively trivial aspects of coding, such as indentation, can become difficult with the deeply nested html structures of a large page.
it is easy to introduce syntax errors or formatting problems into long html documents, because typos can be easily overlooked. this is especially problematic in cases where there is no built-in linting or validation.[1]

at our college

these challenges were familiar to us at kingsborough community college, a college of the city university of new york. our homepage, built on libguides cms, ran to over 500 lines, not including the or