mv: 'input-file.zip' and './input-file.zip' are the same file
Creating study carrel named subject-robinHood-freebo
Initializing database
Unzipping
Archive: input-file.zip
inflating: ./tmp/input/xml2htm.xsl
inflating: ./tmp/input/metadata.csv
inflating: ./tmp/input/A07897.xml
caution: excluded filename not matched: *MACOSX*
=== DIRECTORIES: ./tmp/input
=== DIRECTORY:
=== metadata file: ./tmp/input/metadata.csv
=== found metadata file
=== updating bibliographic database
Building study carrel named subject-robinHood-freebo
May 24, 2021 8:29:02 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
May 24, 2021 8:29:02 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless you've excluded the TesseractOCRParser from the default parser. Tesseract may dramatically slow down content extraction (TIKA-2359). As of Tika 1.15 (and prior versions), Tesseract is automatically called. In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
May 24, 2021 8:29:02 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.
INFO Starting Apache Tika 1.24.1 server
INFO Setting the server's publish address to be http://localhost:9998/
INFO Logging initialized @3076ms to org.eclipse.jetty.util.log.Slf4jLog
INFO jetty-9.4.27.v20200227; built: 2020-02-27T18:37:21.340Z; git: a304fd9f351f337e7c0e2a7c28878dd536149c6c; jvm 1.8.0_281-b09
INFO Started ServerConnector@3e74829{HTTP/1.1, (http/1.1)}{localhost:9998}
INFO Started @3168ms
WARN Empty contextPath
INFO Started o.e.j.s.h.ContextHandler@62010f5c{/,null,AVAILABLE}
INFO Started Apache Tika server at http://localhost:9998/
INFO rmeta/text (autodetecting type)
FILE: cache/A07897.xml OUTPUT: txt/A07897.txt
=== file2bib.sh ===
INFO Detecting media type for Filename: b'A07897.xml'
INFO rmeta/text (autodetecting type)
A07897 txt/../pos/A07897.pos
A07897 txt/../ent/A07897.ent
A07897 txt/../wrd/A07897.wrd
=== file2bib.sh ===
id: A07897
author: Henry, Chettle, d. 1607?.
title: The Death of Robert, Earl of Huntingdon
date: 1601
pages:
extension: .xml
txt: ./txt/A07897.txt
cache: ./cache/A07897.xml
Content-Encoding ISO-8859-1
Content-Type text/plain; charset=ISO-8859-1
X-Parsed-By ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser']
X-TIKA:content_handler ToTextContentHandler
X-TIKA:embedded_depth 0
X-TIKA:parse_time_millis 19
resourceName b'A07897.xml'
Done mapping.
Reducing subject-robinHood-freebo
=== reduce.pl bib ===
id = A07897
author = Henry, Chettle, d. 1607?.
title = The Death of Robert, Earl of Huntingdon
date = 1601
pages =
extension = .xml
mime = text/plain
words = 286217
sentences = 91352
flesch = 101
summary = Textual changes and metadata enrichments aim at making the text more computationally tractable, easier to read, and suitable for network-based collaborative curation by amateur and professional end users from many walks of life. Otherwise called Robin Hood of merrie Sherwodde: with the lamentable tragedie of chaste Matilda, his faire maid Marian, poysoned at Dunmowe by King Iohn.
Otherwise called Robin Hood of merrie Sherwodde: with the lamentable tragedie of chaste Matilda, his faire maid Marian, poysoned at Dunmowe by King Iohn. Acted by the Right Honourable, the Earle of Notingham, Lord high Admirall of England, his seruants. Acted by the Right Honourable, the Earle of Notingham, Lord high Admirall of England, his seruants. A notation like "6-b-2890" means "look for EEBO page image 6 of that text, word 289 on the right side of the double-page image." That reference is followed by the corrupt reading. head , we shall haue hornes good slore .
cache = ./cache/A07897.xml
txt = ./txt/A07897.txt
Building ./etc/reader.txt
A07897 A07897
number of items: 1
sum of words: 286,217
average size in words: 286,217
average readability score: 101
nouns: xml; pc; l; p; pos="n1; pos="vvi; >; pos="n2; cs; pos="n1-nn; w; pos="po; q; sic; speaker; cc; stage; surface; av; g; x; pos="pns; r; im; unit="sentence; type="unclear; pos="vvg; rendition="#hi">.sonnevponhauemeebruse&hubertilesteroxfordkeepedoncasterfrierfairehaueiohnmeefeare.doegoehauerobin,queenewarmanknowe
proper nouns: id="a07897; w; pos="acp; pos="d; pos="j; pos="vvb; pos="po; xml; pos="pns; sp; pos="cc; unit="sentence"/; pos="n; lemma="be; lemma="and; lemma="the; pos="vvz; lemma="i; speaker; pos="pn; pos="pno; pos="vmb; pos="vvn; lemma="you; pos="av; pos="crq; lemma="a; lemma="my; lemma="he; lemma="will; lemma="of; lemma="have; rendition="#hi">,?.irobinmatildahubert,yeevsvponsonnequeeneneuermeeloueleaueladiehauegiuefairedoebruse&head , we shall haue hornes good slore .
==== make-pages.sh questions
==== make-pages.sh search
==== make-pages.sh topic modeling corpus
Zipping study carrel