A Prototype for Authorship Attribution Studies

Patrick Juola* juola@mathcs.duq.edu
John Sofko sofko936@hotmail.com
Patrick Brennan brennan998@comcast.net
Duquesne University, Pittsburgh, PA 15282, UNITED STATES OF AMERICA

* Corresponding author

Abstract

Despite a century of research, statistical and computational methods for authorship attribution are neither reliable, well regarded, widely used, nor well understood. This paper presents a survey of the current state of the art as well as a framework for uniform and unified development of a tool to apply that state of the art, despite the wide variety of methods and techniques used. The usefulness of the framework is confirmed by the development of a tool, built on that framework, that can be applied to authorship analysis by researchers without a computing specialization. Using this tool, it may be possible both to expand the pool of available researchers and to enhance the quality of the overall solutions (for example, by incorporating improved algorithms as discovered through empirical analysis [Juola, 2004a]).

1 Introduction

The task of computationally inferring the author of a document based on its internal statistics – sometimes called "stylometrics," "authorship attribution," or (for the completists) "non-traditional authorship attribution" – is an active and vibrant research area, but one at present largely without use. For example, unearthing the author of the anonymously written Primary Colors (Joe Klein) became a substantial issue in 1996. In 2004, "anonymous" published Imperial Hubris, a followup to his (her?) earlier work Through Our Enemies' Eyes. Who wrote these books?[1] Did the same person actually write both books? Does the(?) author actually have the expertise claimed on the dust cover ("a senior U.S. intelligence official with nearly two decades of experience")? And why haven't our computers already given us the answer?

[1] According to news report consensus, as first revealed by Jason Vest in the July 2 edition of the Boston Phoenix, the author is Michael Scheuer, a senior CIA officer. But how seriously should we take this consensus?

Determining the author of a particular piece of text has been a methodological issue for centuries. Questions of authorship can be of interest not only to humanities scholars, but in a much more practical sense to politicians, journalists, and lawyers, as in the examples above. In recent years, the development of improved statistical techniques [Holmes, 1994] in conjunction with the wider availability of computer-accessible corpora [Nerbonne, 2004] has made the automatic inference of authorship (variously called "authorship attribution" or, more generally, "stylometry") at least a theoretical possibility, and research in this area has expanded tremendously. From a practical standpoint, however, the technology is dogged by many issues — epistemological, technological, and political — that limit and in some cases prevent its wide acceptance. Part of this lack of use can be attributed to simple unfamiliarity on the part of the relevant communities, combined with a perceived history of inaccuracy (see, for example, the discussion of the cusum technique [Farringdon, 1996] in [Holmes, 1998]). Since 1996, however, the popularity of corpus linguistics as a field of study and the vast increase in the amount of data available on the Web have made it practical to use much larger sets of data for inference.
During the same period, new and increasingly sophisticated techniques have improved the quality (and accuracy) of the judgments the computers make.

This paper summarizes some recent findings and experiments and presents a framework for development and analysis to address these issues. In particular, we discuss two major usability concerns, accuracy and user-friendliness. In broad terms, these concerns can only be addressed by expanding the number of clients (users) for authorship attribution technology. We then present a theoretical framework for describing authorship attribution that makes the development and improvement of genuine off-the-shelf attribution solutions easier and more practical.

2 Background

With a history stretching back to 1887 [Mendenhall, 1887], and 10,700 hits on Google[2], it is apparent that statistical/quantitative authorship attribution is an active and vibrant research area. After nearly 120 years of research, it is surprising that it has not been accepted by the relevant scholars: "Stylometrics is a field whose results await acceptance by the world of literary study by and large."[3] This can be attributed at least partially to a limited view of the range of applicability, to a history of inaccuracy, and to the mathematical complexity (and corresponding difficulty of use) of the techniques deployed.

[2] Phrasal search for "authorship attribution," June 2, 2005.
[3] Anonymous, personal communication to Patrick Juola, 2004.

For example, and taking a broad view of "stylometry" to include the inference of group characteristics of a speaker, the story from Judges 12:5–6 describes how tribal identity can be inferred from the pronunciation of a specific word (to be elicited). Specifically,

    The Gileadites captured the fords of the Jordan leading to Ephraim, and whenever a survivor of Ephraim said, "Let me cross over," the men of Gilead asked him, "Are you an Ephraimite?" If he replied, "No," they said, "All right, say 'Shibboleth.'" If he said, "Sibboleth," because he could not pronounce the word correctly, they seized him and killed him at the fords of the Jordan. Forty-two thousand Ephraimites were killed at that time.

A more modern version of such shibboleths could involve specific lexical or phonological items; a person who writes of a "Chesterfield" as a piece of furniture is presumptively Canadian, and an older Canadian at that [Easson, 2002]. [Wellman, 1936, p. 114] describes how an individual spelling error (an idiosyncratic spelling, "toutch") was elicited and used in court to validate a document for evidence.

At the same time, such tests cannot be relied upon. Idiosyncratic spelling or not, the word "touch" is rather rare (86 tokens in the million-word Brown corpus [Kučera and Francis, 1967]), and although one may be able to elicit it in a writing produced on demand, it is less likely that one will be able to find it independently in two different samples. People are also not consistent in their language, and may (mis)spell words differently at different times; often the tests must be able to handle distributions instead of mere presence/absence judgments. Most worryingly, the tests themselves may be inaccurate (see especially the discussion of CUSUM [Farringdon, 1996] in [Holmes, 1998]), rendering any technical judgment questionable, especially if the test involves subtle statistical properties such as "vocabulary size" or "distribution of function words," concepts that may not be immediately transparent to the lay mind.
Questions of accuracy are of particular importance in wider applications such as law. The relevance of a document (say, an anonymously libelous letter) to a court may depend not only upon who wrote it, but upon whether or not that authorship can be demonstrated. Absent eyewitnesses or confessions, only experts, defined by specialized knowledge, training, experience, or education, can offer "opinions" about the quality and interpretation of evidence. U.S. law, in particular, greatly restricts the admissibility of scientific evidence via a series of epistemological tests.[4] The Frye test states that scientific evidence is admissible only if "generally accepted" by the relevant scholarly community, explicitly defining science as a consensus endeavor. Under Frye, (widespread) ignorance of or unfamiliarity with the techniques of authorship attribution would be sufficient by itself to prevent use in court. The Daubert test is slightly more epistemologically sophisticated, and establishes several more objective tests, including but not limited to empirical validation of the science and techniques used, the existence of an established body of practices, known standards of accuracy (including so-called type I and type II error rates), a pattern of use in non-judicial contexts, and a history of peer review and publication describing the underlying science.

[4] Frye vs. United States, 1923; Daubert vs. Merrell Dow, 1993.

At present, authorship attribution cannot meet these criteria. Aside from the question of general acceptance (the quote presented in the first paragraph of this section, by itself, shows that stylometrics could not pass the Frye test), the lack of standard practices and known error rates eliminates stylometry from Daubert consideration as well.

3 Recent developments

To meet these challenges, we present some new methodological and practical developments in the field of authorship attribution. In June 2004, ALLC/ACH hosted an "Ad-hoc Authorship Attribution Competition" [Juola, 2004a] as a partial response to these concerns. Specifically, by providing a standardized test corpus for authorship attribution, not only could the mere ability of statistical methods to determine authors be demonstrated, but methods could further be distinguished between the merely "successful" and the "very successful." (From a forensic standpoint, this would validate the science while simultaneously establishing standards of practice and creating information about error rates.) Contest materials included thirteen problems, in a variety of lengths, styles, genres, and languages, mostly gathered from the Web but including some materials gathered specifically for this purpose. Two dozen research groups participated by downloading the (anonymized) materials and returning their attributions to be graded and evaluated against the known correct answers. The specific problems presented included the following:

• Problem A (English) Fixed-topic essays written by thirteen Duquesne students during fall 2003.

• Problem B (English) Free-topic essays written by thirteen Duquesne students during fall 2003.

• Problem C (English) Novels by 19th century American authors (Cooper, Crane, Hawthorne, Irving, Twain, and 'none-of-the-above'), truncated to 100,000 characters.

• Problem D (English) First act of plays by Elizabethan/Jacobean playwrights (Jonson, Marlowe, Shakespeare, and 'none-of-the-above').
• Problem E (English) Plays in their entirety by Elizabethan/Jacobean playwrights (Jonson, Marlowe, Shakespeare, and 'none-of-the-above').

• Problem F ([Middle] English) Letters, specifically extracts from the Paston letters (by Margaret Paston, John Paston II, and John Paston III, and 'none-of-the-above' [Agnes Paston]).

• Problem G (English) Novels by Edgar Rice Burroughs, divided into "early" (pre-1914) novels and "late" (post-1920) novels.

• Problem H (English) Transcripts of unrestricted speech gathered during committee meetings, taken from the Corpus of Spoken Professional American-English.

• Problem I (French) Novels by Hugo and Dumas (père).

• Problem J (French) Training set identical to the previous problem; the testing set is one play by each author, thus testing the ability to deal with cross-genre data.

• Problem K (Serbian-Slavonic) Short excerpts from The Lives of Kings and Archbishops, attributed to Archbishop Danilo and two unnamed authors (A and B). Data was originally received from Alexsandar Kostic.

• Problem L (Latin) Elegiac poems from classical Latin authors (Catullus, Ovid, Propertius, and Tibullus).

• Problem M (Dutch) Fixed-topic essays written by Dutch college students, received from Hans van Halteren.

The contest (and its results) were surprising at many levels; some researchers initially refused to participate given the admittedly difficult tasks included among the corpora. For example, Problem F consisted of a set of letters extracted from the Paston letters. Aside from the very real issue of applying methods designed and tested for the most part on modern English to documents in Middle English, the size of these documents (very few letters, today or in centuries past, exceed 1000 words) makes statistical inference difficult. Similarly, problem A was a realistic exercise in the analysis of student essays (gathered in a freshman writing class during the fall of 2003) – as is typical, no essay exceeded 1200 words. From a standpoint of literary analysis, these may be regarded as unreasonably short samples, but as a realistic test of forensic attribution, and as a legitimately difficult probe of the sensitivity of techniques, they are legitimate.

Results from this competition were heartening. ("Unbelievable," in the words of one contest participant.) Despite the data set limitations, the highest-scoring participant [Koppel and Schler, 2004] achieved an average success rate of approximately 71%. (Juola's solutions, in the interests of fairness, averaged 65% correct.) In particular, Schler's methods achieved 53.85% accuracy on problem A and 100.00% accuracy on problem F, both acknowledged to be difficult and considered by many to be unsolvably so.

More generally, all participants scored significantly above chance. Perhaps as should be expected, performance on the English problems tended to be higher than on the other languages. Perhaps more surprisingly, the availability of large documents was not as important to accuracy as the availability of a large number of smaller documents, perhaps because the latter give a more representative sample of the range of an author's writing. In particular, the average performance of a method on the English samples (problems A–H) correlated significantly (r = 0.594, p < 0.05) with that method's performance on the non-English samples. The correlation between large-sample problems (problems with over 50,000 words per sample) and small-sample problems was still good, although no longer significant (r = 0.3141).
This suggests that the problem of authorship attribution is at least somewhat a language- and data-independent problem, and one for which we may expect to find wide-ranging technical solutions in the general case, instead of (as, for example, in machine translation) having to tailor our solutions with detailed knowledge of the problems, texts, and languages at hand. In particular, we offer the following challenge to all researchers in the process of developing a new authorship attribution algorithm: if you cannot get 90% correct on the Paston letters (problem F), then your algorithm is not competitively accurate. Every well-performing algorithm studied had no difficulty achieving this standard. Statements from researchers that their methods will not work with only a handful of letters as training data should be regarded with appropriate suspicion.

Finally, methods based on simple lexical statistics tended to perform substantially worse than methods based on N-grams or similar measures of syntax in conjunction with lexical statistics. We continue to examine the detailed results in an effort to identify other characteristics of good solutions. Unfortunately, another apparent result is that the high-performing algorithms appear to be mathematically and statistically (although not necessarily linguistically) sophisticated. The good methods have names that may appear fearsome to the uninitiated: linear discriminant analysis [Baayen et al., 2002, van Halteren et al., 2005], orthographic cross-entropy [Juola and Baayen, 2003, Juola and Baayen, 2005], common byte N-grams [Keselj and Cercone, 2004], SVM with a linear kernel function [Koppel and Schler, 2004]. These techniques can be difficult to implement, or even to understand or to use, for a casual, non-technical scholar. At the same time, the sheer number of techniques proposed (and therefore the number of possibilities available to confuse) has exploded, which also limits the pool of available users. We can no longer expect a casual professor of literature — let alone a journalist, lawyer, judge, or interested layman — to apply these new methods to a problem of interest without technical assistance.

4 New technologies

The variation in these techniques can make authorship attribution appear to be an unorganized mess, but it has been claimed that under an appropriate theoretical framework [Juola, 2004b] many of these techniques can be unified, combined, and deployed. Using this framework, it is possible — indeed, we hope to demonstrate it as the basis for incremental improvement — to develop "commercial off-the-shelf" (COTS) software to perform much of the technical analytic work.

The initial observation is that, broadly speaking, all known human languages can be described as an unbounded sequence of events chosen from a finite space of possibilities. For example, the IPA phonetic alphabet [Ladefoged, 1993] describes an inventory of approximately 100 different phonemes; a typewriter shows approximately 100 different Latin-1 letters; a large dictionary will present an English vocabulary of 50–100,000 different words. An (English) utterance is "simply" a sequence of phonemes (or words).

The proposed framework postulates a three-phase division of the authorship attribution task, each phase of which can be performed independently, rather in the manner of a Unix or Linux pipeline, where the output of one phase is immediately made available as the input of the following one.
These phases are:

• Canonicization — No two physical realizations of events will ever be exactly identical. We choose to treat similar realizations as identical in order to restrict the event space to a finite set.

• Determination of the event set — The input stream is partitioned into individual non-overlapping "events." At the same time, uninformative events can be eliminated from the event stream.

• Statistical inference — The remaining events can be subjected to a variety of inferential statistics, ranging from simple analysis of event distributions through complex pattern-based analysis. The results of this inference determine the conclusions (and the confidence) of the final report.

As an example of how this procedure works, we consider a method for identifying the language in which a document is written. The statistical distribution of letters in English text is well known (see any decent cryptography handbook, including [Stinson, 2002]). We first canonicize the document by identifying each letter (an italic e, a boldface e, or a capital E should be treated identically) and producing a transcription. This canonicization process would also implicitly involve other transformations, for example, partitioning a PDF image into text regions to be analyzed as opposed to illustrations and margins to be ignored. A much more sophisticated canonicization process, following [Rudman, 2003], could regularize spelling, eliminate extraneous material such as chapter headings and page numbers, and even "de-edit" the invisible hand of the editor or redactor, to approximate as closely as possible the state of the original manuscript as it left the pen or typewriter of the author. The output of this canonicization process would then be a sequence of linguistic elements. We then identify each letter as a separate event, eliminating all non-letter characters such as numbers or punctuation. A more sophisticated application might demand instead that letters be grouped into morphemes, syllables, words, and so forth. Finally, by compiling an event histogram and comparing it with the known distribution, we can determine a probability that the document was written in English. A similar process would treat each word as a separate event (eliminating words not found in a standard lexicon) and compare event histograms with a standardized set such as the Brown histogram [Kučera and Francis, 1967].

Note that the difference between an analysis based on letter histograms and one based on word histograms lies purely in the second, event set determination, phase; the statistics of histogram generation and analysis are identical and can be performed by the same code. The question of the comparative accuracy of these methods can be judged empirically.

The Burrows methods [Burrows, 1989, Burrows, 2003] for authorship attribution can be described in similar terms. After the document is canonicized, it is partitioned into word-events. Most of these words (all except a chosen few function words) are eliminated. The remaining word-events are collected in a histogram and compared statistically via principal components analysis (PCA) to similar histograms collected from anchor documents. (The difference between the 1989 and 2003 methods is simply in the nature of the statistics performed.) Even Wellman's "toutch" method can be so described: after canonicization, the event set of words is compiled and reduced to a single count, the number of words spelled "toutch." If this count is non-zero, the document's author is determined.
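To make the three-phase decomposition concrete, the following is a minimal Java sketch of the letter-histogram language test described above. It is purely illustrative: the class and method names, and the abbreviated reference frequencies, are placeholders of our own, not JGAAP code or a published table.

import java.util.Locale;

// A minimal, illustrative sketch of the three-phase pipeline described above.
public class PipelineSketch {

    // Phase 1: canonicization -- collapse case and strip everything but letters.
    static String canonicize(String raw) {
        return raw.toLowerCase(Locale.ENGLISH).replaceAll("[^a-z]", "");
    }

    // Phase 2: event-set determination -- here, each letter is one event.
    static double[] letterHistogram(String canonical) {
        double[] counts = new double[26];
        for (char c : canonical.toCharArray()) {
            counts[c - 'a']++;
        }
        for (int i = 0; i < 26; i++) {
            counts[i] /= Math.max(1, canonical.length());   // relative frequencies
        }
        return counts;
    }

    // Phase 3: statistical inference -- compare the observed histogram with a
    // reference distribution; a smaller distance means "more English-like".
    static double distance(double[] observed, double[] reference) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double d = observed[i] - reference[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Hypothetical reference frequencies for a few letters of English text;
        // a real test would use a full published table from a cryptography handbook.
        double[] english = new double[26];
        english['e' - 'a'] = 0.127; english['t' - 'a'] = 0.091; english['a' - 'a'] = 0.082;

        String document = "The quick brown fox jumps over the lazy dog.";
        double[] histogram = letterHistogram(canonicize(document));
        System.out.printf("Distance from English reference: %.3f%n",
                          distance(histogram, english));
    }
}

Swapping letterHistogram for a word-level event extractor, while leaving distance untouched, is exactly the kind of phase-two substitution the framework is intended to make cheap.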
This framework also allows researchers both to focus on the important differences between methods and to mix and match techniques to achieve the best practical results. For example, [Juola and Baayen, 2005] describes two techniques based on cross-entropy that differ only in their event models (words vs. letters). Presumably, the technique would also generalize to other event models (function words, morphemes, parts of speech), and similarly other inference techniques would work across a variety of event models. It is to be hoped that, from this separation, researchers can identify the best inference techniques and the best event models in order to assemble a sufficiently powerful and accurate system.

5 Demonstration

The usefulness of this framework can be shown in a newly developed user-level authorship attribution tool. This tool coordinates and combines (at this writing) several different technical approaches to authorship attribution [Burrows, 1989, Juola, 1997, Kukushkina et al., 2000, Juola, 2003b, Keselj and Cercone, 2004]. Written in Java, this program layers a simple GUI atop the three-phase approach defined above. Users are able to select a set of sample documents (with labels for known authors) and a set of testing documents by unknown authors.

The three-phase framework described above fits well into the now-standard modular software design paradigm using Java's object-oriented framework. Each of the individual phases is handled by a separate class/module that can be easily extended to reflect new research developments.

The original JGAAP[5] prototype was developed in July 2004. It served as a proof of concept for automating authorship attribution technologies. Unfortunately, the prototype was not developed with extensibility in mind. The architecture used was not clearly defined and the application was not easily modified.

[5] Java Graphical Authorship Attribution Program; the authors invite suggestions for a better name for future versions.

These design issues were addressed in the second (current) version of JGAAP. Nearly all of the original source code was refactored to conform to the new design framework. The new JGAAP framework is designed from a strongly object-oriented perspective. The core functionality of JGAAP is distilled into seven basic operations:

• Core Classes
• Document Input
• Creating Events
• Document Preprocessing
• Document Scoring
• Displaying Results
• Graphical User Interface

The directory structure of the application reflects these operations, making the source code easy to follow and understand.

Core Classes: As the name implies, the Core Classes provide the basic framework of the application. By themselves, they provide no application functionality. They are necessary, however, when implementing Java Interfaces to extend functionality.

Document Input: The document input module provides methods for importing documents into JGAAP. Currently, JGAAP provides input from local files only, although support for accepting documents by remote file transfer or from the Web is being added.

Creating Events: The events module converts the input documents into events prior to scoring. These events specify the means by which the documents are presented to the scoring method. Currently, JGAAP provides two types of events: Letters or Words.

Document Preprocessing: The document preprocessing module provides methods for modifying the documents prior to scoring, as detailed above. Currently, we have made available the following preprocessing options: Removing End Punctuation, Removing HTML Tags, Removing Non-Letters, Removing Numerals (and replacing them with a tag), Removing Spaces, and Conversion of Documents to Lower Case.
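As an illustration of how such preprocessing options can be written as small, interchangeable modules, the following sketch shows two of them. The Preprocessor interface and the class names here are hypothetical stand-ins of our own, not JGAAP's actual classes.

// Illustrative only: a hypothetical Preprocessor abstraction and two of the
// preprocessing options listed above, written as interchangeable modules.
interface Preprocessor {
    String apply(String document);
}

class RemoveNonLetters implements Preprocessor {
    public String apply(String document) {
        // Keep letters and whitespace; drop digits, punctuation, and other symbols.
        return document.replaceAll("[^\\p{L}\\s]", "");
    }
}

class ToLowerCase implements Preprocessor {
    public String apply(String document) {
        return document.toLowerCase(java.util.Locale.ROOT);
    }
}

class PreprocessDemo {
    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago -- never mind how long precisely...";
        for (Preprocessor p : new Preprocessor[] { new RemoveNonLetters(), new ToLowerCase() }) {
            text = p.apply(text);   // options compose in sequence, as in the GUI
        }
        System.out.println(text);
    }
}

Because each option is an independent object with a common signature, new options can be added, or chained in a different order, without touching the rest of the pipeline.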
Document Scoring: The document scoring module contains methods for document comparison. These methods apply authorship attribution techniques to compare the input documents and provide a quantitative score for each comparison.

Displaying Results: This module contains implementations used to display scoring results to the end user. The scoring methods currently output a matrix that contains the result of comparing each unknown document with all documents of known authorship. Code within this module may reformat this information into a visual representation of the matrix. Currently, JGAAP provides output of the matrix to the console, to a file, or via a message box.

Graphical User Interface: This module contains the methods responsible for creating the user interface of JGAAP.

The user is able to select from menus of event-selection and preprocessing options and of technical inference mechanisms. Specifically, we designed a multi-menu, panel-based GUI that resembles familiar Microsoft software, to facilitate ease of use. The menus are clearly marked and set up so that the flow of work is fairly linear and maps closely to the phase structure described above. The user selects the documents to be analyzed, the preprocessing options, and the methods of analysis; this information is then sent to the (computational) "backend," which returns the results to the GUI to be displayed.

There are still a number of substantial issues to address in further versions of JGAAP, including improvement of existing features and the development of new ones.

First, we are unsatisfied with the saving/loading method currently implemented in JGAAP. While it is functional, it relies on absolute path names, so it is not as flexible as we would like, and in particular it is restricted to local files. We would like to add support for dynamic, path-based "manifest" files in folders of documents: the user would only have to point JGAAP at the folder, and the manifest would take care of the rest. We also hope to incorporate state-based processing, in which the program maintains a persistent list that is loaded whenever a new session starts. This list would record all documents previously input into the program, with the program saving local copies of the texts. When users wish to analyze documents, they could simply select them from this permanent list instead of loading the documents anew every time. Although this might require redesigning the GUI almost from the ground up, it could drastically improve the workflow for users who check the same documents over and over; we would welcome the community's opinion on whether the feature is worth that effort.

We also hope to gather opinions from the community on how they would like to see the data graphically interpreted and displayed. Because the tool is being developed for the community as a whole, its feedback on how the data is presented is important.

Finally, we wish to add a wizard mode and in-context help files to assist new users of the JGAAP program. As more features are added, the growing complexity of the program will warrant helping the user as much as possible, especially if the goal is to make the program suitable for the general user.

Parties interested in seeing or using this program, and especially in helping with the necessary feedback, should contact the corresponding author.
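Before turning to the design internals, the following hypothetical sketch summarizes how a complete run fits together: each document is preprocessed, reduced to events, and every unknown document is scored against every known-author document to fill the comparison matrix described above. All names and the toy scoring function are placeholders of our own, not JGAAP's actual API.

import java.util.*;
import java.util.function.*;

// Hypothetical outline of a full JGAAP-style run, with plain strings and
// java.util.function types standing in for the real module interfaces.
class RunSketch {

    // Score every unknown document against every known-author document,
    // producing the comparison matrix described above.
    static double[][] compareAll(List<String> unknowns, List<String> knowns,
                                 UnaryOperator<String> preprocess,
                                 Function<String, Map<String, Integer>> extractEvents,
                                 ToDoubleBiFunction<Map<String, Integer>, Map<String, Integer>> score) {
        double[][] matrix = new double[unknowns.size()][knowns.size()];
        for (int i = 0; i < unknowns.size(); i++) {
            Map<String, Integer> u = extractEvents.apply(preprocess.apply(unknowns.get(i)));
            for (int j = 0; j < knowns.size(); j++) {
                Map<String, Integer> k = extractEvents.apply(preprocess.apply(knowns.get(j)));
                matrix[i][j] = score.applyAsDouble(u, k);   // one cell of the result matrix
            }
        }
        return matrix;   // handed to a display module (console, file, or message box)
    }

    // A toy word-event extractor, for demonstration only.
    static Map<String, Integer> wordEvents(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // A toy "score": fraction of the unknown document's word types shared with the known one.
        ToDoubleBiFunction<Map<String, Integer>, Map<String, Integer>> overlap = (a, b) -> {
            int shared = 0;
            for (String w : a.keySet()) if (b.containsKey(w)) shared++;
            return shared / (double) Math.max(1, a.size());
        };
        double[][] m = compareAll(List.of("call me ishmael"),
                                  List.of("call me maybe", "ishmael was here"),
                                  s -> s, RunSketch::wordEvents, overlap);
        System.out.println(Arrays.deepToString(m));
    }
}

The matrix returned here is exactly the structure the Displaying Results module renders to the console, a file, or a message box.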
6 Design Issues

The framework outlined above relies heavily on the Java concept of an Interface. A Java Interface provides a powerful tool that can be used to create highly extensible application frameworks. Conceptually speaking, an interface is a defined set of functions (formally, "methods") that a piece of code can "implement" using any algorithm desired. This allows other pieces of code to use differing implementations of the same interface with no changes, permitting easy updates as new techniques are developed and implemented.

Within the Interfaces directory of the application, there exist five defined interfaces: Display, Event, Input, Preprocess, and Score. These interfaces specify required methods that must exist in classes that intend to implement the respective interfaces. The classes within the Core Classes directory contain methods that accept interfaces as parameters.

For example, assume that a future developer wants to create a new way to display score output. According to the Display interface, the new class may implement Display if and only if it contains a public void display() method. The core class Display contains a public void display(DisplayInterface display) method. This method accepts an object of type DisplayInterface as a parameter and calls that object's public void display() method. Conversely, any code with a public void display() method can be called through the Display interface, so a technically sophisticated user who wants to see dendrograms as output need only write a single function, one that takes the matrix results from document scoring and computes (and displays) an appropriate dendrogram. This function can be added on the fly to the JGAAP program and can further be re-used by others, irrespective of the different choices they may have made about the documents, the event model, or the statistics.

Similarly, preprocessing can be handled by separate instantiations and subclasses. Even data input and output can be modularized and separated. As written, the program only reads files from a local disk, but a relatively easy modification (in progress) would allow files to be read from the network (for instance, Web pages from a site such as Project Gutenberg or literature.org).
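The plug-in pattern described above can be sketched as follows. The method names mirror those in the text (display() and display(DisplayInterface)), but the sketch is deliberately simplified and the dendrogram class is a hypothetical contribution, not part of JGAAP.

// A simplified sketch of the plug-in mechanism described above; real JGAAP
// classes carry more state, but the pattern is the same.
interface DisplayInterface {
    void display();   // any class providing this method can be plugged in
}

// A contributed output module: would render a dendrogram from the score
// matrix; here it only prints a placeholder line.
class DendrogramDisplay implements DisplayInterface {
    private final double[][] scores;
    DendrogramDisplay(double[][] scores) { this.scores = scores; }
    public void display() {
        System.out.println("(dendrogram computed from a "
                + scores.length + "x" + scores[0].length + " score matrix)");
    }
}

// The core class accepts any DisplayInterface and never needs to change when
// new output formats are contributed.
class Display {
    public void display(DisplayInterface display) {
        display.display();
    }
}

class DisplayDemo {
    public static void main(String[] args) {
        double[][] matrix = { { 0.12, 0.87 }, { 0.90, 0.08 } };
        new Display().display(new DendrogramDisplay(matrix));
    }
}

Console, file, or message-box output would simply be further DisplayInterface implementations, with no change to the calling code.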
7 Discussion and Future Work

From initial impressions, this tool is usable and fulfills part of the need of non-technical researchers interested in authorship attribution. On the other hand, this tool is clearly a "research-quality" prototype, and additional work will be needed to implement a wide variety of methods, to determine and implement additional features, and to establish a sufficiently user-friendly interface. Even questions such as the preferred method of output — dendrograms? MDS subspace projections? Fixed attribution assignments as in the present system? — are in theory open to discussion and revision. It is hoped that the input of research and user groups such as the present meeting will help guide this development.

Most importantly, the availability of this tool (which we hope will spur additional research by the interested but computationally unsophisticated) should also spur discussion of the role to be played by commercial off-the-shelf (COTS) attribution software. As discussed in depth by [Rudman, 2003], authorship attribution is a very nuanced process when properly done. Ideally, as Rudman's Law puts it, the closest text to the holograph should be found and used. The editor's pen, the typist's fingers, and the printer's press can all introduce errors – and when a document exists only in physical or image form, the errors introduced by an OCR process [Juola, 2003b] can entirely invalidate the results. Only if all of the analytic and control texts are valid can the results be trusted. This includes not only issues of authenticity, but also of representativeness – if an author's style changes over time [Juola, 2003a, Juola, 2006], a work from outside the period of study will be unrepresentative and may poison the analytic well. Similarly, texts with extensive quotation may be more representative of the quoted sources than of the official author. Texts from the Internet in particular may well be regarded with suspicion due to the poor quality control of Internet publishing in general.

Only once a suitable test suite has been developed can the computational analysis truly proceed, but even here there are possible pitfalls. The analyst should also be aware of some of the issues introduced by the computational tool. For example, JGAAP uses a fairly simple (and naive) definition of a "word" — a maximal non-blank string of characters. This means that some items may be treated as multiple words ("New" "York" "City") while others are treated as a single word ("non-blank"). An analysis based on part-of-speech types [Juola and Baayen, 2005] will depend upon the accuracy of the POS tagger as well as on its tag set. Such subtle distinctions will almost certainly have an effect in some analyses and be entirely irrelevant in others. The computer, of course, is blissfully ignorant of such nuances and will happily analyze the most appalling garbage imaginable. A researcher who accepts such garbage as accurate — Garbage In, Gospel Out — may be said to deserve the consequences. But the client of a lawyer wrongly convicted on such weak evidence deserves better.

Have we, then, made a Faustian bargain in creating such a "plug and play" authorship attribution system? We hope not. The benefits of making such a tool widely available to reasoned and cautious researchers should outweigh the harm caused by misuse in the hands of the injudicious. It is, however, appropriate to consider what sort of safeguards might be created, and to what extent the program itself may be able to incorporate and automatically enforce them.

From a broader perspective, this program provides a uniform framework under which competing theories of authorship attribution can both be compared and combined (to their mutual benefit, we hope). It also forms the basis of a simple, user-friendly tool that allows users without special training to apply authorship attribution technologies and to take advantage of new developments and methods as they become available.
From a standpoint of practical epistemology, the existence of this tool should provide a starting point for improving the quality of authorship attribution as a forensic examination – by allowing widespread use of the technology while at the same time providing an easy method for testing and evaluating different approaches, to determine the necessary empirical validation and limitations.

References

[Baayen et al., 2002] Baayen, R. H., van Halteren, H., Neijt, A., and Tweedie, F. (2002). An experiment in authorship attribution. In Proceedings of JADT 2002, pages 29–37, St. Malo. Université de Rennes.

[Burrows, 2003] Burrows, J. (2003). Questions of authorship: Attribution and beyond. Computers and the Humanities, 37(1):5–32.

[Burrows, 1989] Burrows, J. F. (1989). 'An ocean where each kind...': Statistical analysis and some major determinants of literary style. Computers and the Humanities, 23(4-5):309–21.

[Easson, 2002] Easson, G. (2002). The linguistic implications of shibboleths. In Annual Meeting of the Canadian Linguistics Association, Toronto, Canada.

[Farringdon, 1996] Farringdon, J. M. (1996). Analyzing for Authorship: A Guide to the Cusum Technique. University of Wales Press, Cardiff.

[Holmes, 1994] Holmes, D. I. (1994). Authorship attribution. Computers and the Humanities, 28(2):87–106.

[Holmes, 1998] Holmes, D. I. (1998). The evolution of stylometry in humanities computing. Literary and Linguistic Computing, 13(3):111–7.

[Juola, 1997] Juola, P. (1997). What can we do with small corpora? Document categorization via cross-entropy. In Proceedings of an Interdisciplinary Workshop on Similarity and Categorization, Edinburgh, UK. Department of Artificial Intelligence, University of Edinburgh.

[Juola, 2003a] Juola, P. (2003a). Becoming Jack London. In Proceedings of QUALICO-2003, Athens, GA.

[Juola, 2003b] Juola, P. (2003b). The time course of language change. Computers and the Humanities, 37(1):77–96.

[Juola, 2004a] Juola, P. (2004a). Ad-hoc authorship attribution competition. In Proc. 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH 2004), Göteborg, Sweden.

[Juola, 2004b] Juola, P. (2004b). On composership attribution. In Proc. 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH 2004), Göteborg, Sweden.

[Juola, 2006] Juola, P. (2006). Becoming Jack London. Journal of Quantitative Linguistics.

[Juola and Baayen, 2003] Juola, P. and Baayen, H. (2003). A controlled-corpus experiment in authorship attribution by cross-entropy. In Proceedings of ACH/ALLC-2003, Athens, GA.

[Juola and Baayen, 2005] Juola, P. and Baayen, H. (2005). A controlled-corpus experiment in authorship attribution by cross-entropy. Literary and Linguistic Computing, 20:59–67.

[Keselj and Cercone, 2004] Keselj, V. and Cercone, N. (2004). CNG method with weighted voting. In Juola, P., editor, Ad-hoc Authorship Attribution Contest. ACH/ALLC 2004.

[Koppel and Schler, 2004] Koppel, M. and Schler, J. (2004). Ad-hoc authorship attribution competition approach outline. In Juola, P., editor, Ad-hoc Authorship Attribution Contest. ACH/ALLC 2004.

[Kukushkina et al., 2000] Kukushkina, O. V., Polikarpov, A. A., and Khmelev, D. V. (2000). Using literal and grammatical statistics for authorship attribution. Problemy Peredachi Informatsii, 37(2):96–198. Translated in "Problems of Information Transmission," pp. 172–184.
[Kučera and Francis, 1967] Kučera, H. and Francis, W. N. (1967). Computational Analysis of Present-day American English. Brown University Press, Providence.

[Ladefoged, 1993] Ladefoged, P. (1993). A Course in Phonetics. Harcourt Brace Jovanovich, Inc., Fort Worth, 3rd edition.

[Mendenhall, 1887] Mendenhall, T. C. (1887). The characteristic curves of composition. Science, IX:237–49.

[Nerbonne, 2004] Nerbonne, J. (2004). The data deluge. In Proc. 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH 2004), Göteborg, Sweden. To appear in Literary and Linguistic Computing.

[Rudman, 2003] Rudman, J. (2003). On determining a valid text for non-traditional authorship attribution studies: Editing, unediting, and de-editing. In Proc. 2003 Joint International Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC 2003), Athens, GA.

[Stinson, 2002] Stinson, D. R. (2002). Cryptography: Theory and Practice. Chapman & Hall/CRC, Boca Raton, 2nd edition.

[van Halteren et al., 2005] van Halteren, H., Baayen, R. H., Tweedie, F., Haverkort, M., and Neijt, A. (2005). New machine learning methods demonstrate the existence of a human stylome. Journal of Quantitative Linguistics, 12(1):65–77.

[Wellman, 1936] Wellman, F. L. (1936). The Art of Cross-Examination. MacMillan, New York, 4th edition.