id: tedunderwood-com-8283
author:
title: A half-decent OCR normalizer for English texts after 1700. | The Stone and the Shell
date:
pages:
extension: .html
mime: text/html
words: 2099
sentences: 162
flesch: 75
summary: Basically, I'm sharing the code I use to correct OCR in my own research. The figure shows the percentage of tokens in the HathiTrust corpus that are recognized as words before (red) and after (black) correction by my script; each data point is the average percentage of tokens recognized as words in a given year. E.g., it's a problem that "today" is sometimes written "to day" and sometimes "to-day," and it's a problem that eighteenth-century verbs get "condens'd." A script designed to correct OCR might leave these variants unaltered, but in order to make meaningful diachronic comparisons, I have to produce a corpus where variations of spelling and word division are normalized. (Illinois has signed an agreement with HathiTrust, which gives me access to public-domain works for research purposes.)
url: https://tedunderwood.com/2013/12/10/a-half-decent-ocr-normalizer-for-english-texts-after-1700/
cache: ./cache/tedunderwood-com-8283.html
txt: ./txt/tedunderwood-com-8283.txt
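
note: A minimal Python sketch, not the author's actual normalizer, of the two kinds of variation the summary mentions (word division such as "to day"/"to-day", and contracted verb endings such as "condens'd") plus the recognized-word percentage plotted in the figure. The rule tables, the percent_recognized helper, and the lexicon argument are hypothetical illustrations only.

    """Toy rule-based normalization sketch; not the normalizer described in the post."""

    # Hypothetical rules for rejoining split words ("to day" -> "today").
    FUSE_RULES = {
        ("to", "day"): "today",
        ("to", "morrow"): "tomorrow",
    }

    # Hypothetical rules for hyphenated variants ("to-day" -> "today").
    HYPHEN_RULES = {
        "to-day": "today",
        "to-morrow": "tomorrow",
    }

    def expand_contraction(token):
        """Expand contracted verb endings, e.g. condens'd -> condensed."""
        if token.endswith("'d") and len(token) > 2:
            return token[:-2] + "ed"
        return token

    def normalize(tokens):
        """Normalize word division and spelling variants in a list of tokens."""
        out, i = [], 0
        while i < len(tokens):
            low = tokens[i].lower()
            nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
            if low in HYPHEN_RULES:
                out.append(HYPHEN_RULES[low])
                i += 1
            elif nxt is not None and (low, nxt) in FUSE_RULES:
                out.append(FUSE_RULES[(low, nxt)])
                i += 2
            else:
                out.append(expand_contraction(tokens[i]))
                i += 1
        return out

    def percent_recognized(tokens, lexicon):
        """Percentage of tokens found in a reference lexicon (the per-year metric in the figure)."""
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t.lower() in lexicon)
        return 100.0 * hits / len(tokens)

    if __name__ == "__main__":
        print(normalize("the vapour was condens'd to day".split()))
        # -> ['the', 'vapour', 'was', 'condensed', 'today']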