Burns, Philip R. MorphAdorner v2.0. Northwestern University. 2013. <http://morphadorner.northwestern.edu/>.
The upcoming public release, in January 2015, of around twenty-five thousand texts from the EEBO-TCP corpus promises to transform the way scholars of early English print conduct research. However, while the greater access that the release will provide is valuable in itself, the scale and nature of the corpus pose unique conceptual and technical challenges that need to be addressed before its revolutionary promise can be realized. In recent years computational and quantitative approaches to literary analysis, often associated with the rubrics of “distant reading” popularized by Franco Moretti or “literary informatics” put forward by Martin Mueller, have provided glimpses of what the contours of this emerging terrain of scholarship might look like. Such computational approaches complement the conventional “close reading” of texts: by making possible an algorithmically assisted bird’s-eye view of the corpus, they throw into relief aspects of style, genre, and literary history. But while such a move towards scale and computational analysis is difficult in itself, the problems are exacerbated in the case of a corpus like EEBO-TCP that faithfully captures the inconsistencies and idiosyncrasies of early modern orthography and syntax.
MorphAdorner, a tool for the morphological tagging of word forms in large corpora, developed by Philip R. Burns at Northwestern University, is designed to accommodate the orthographic and syntactic variation in historical corpora in ways that allow scholars to analyze these texts computationally without compromising their faithfulness to the originals. Of course, it is the relative accuracy of EEBO-TCP, given its size and scale, that makes it so valuable as a scholarly resource even as it makes computational approaches difficult. And while many errors and lacunae remain, such difficulties are only to be expected with a body of print so vast and complex, and they are already being addressed to some extent through projects at various institutions. The MorphAdorner project is perhaps the most conspicuous and important of these. It envisions itself as part of a longer-term effort to create a vast corpus of early modern print, a “Book of English,” and lays out the ambitious goal of eventually becoming the foundation for “a large, growing, collaboratively curated, and public domain corpus of written English since its earliest modern form.”
The fundamental intervention of the project lies in the way it tries to balance the multiple demands made of historical corpora such as EEBO-TCP: preserving and setting the standard for accuracy and scholarly editorial practice, while also making such vast archives tractable for large-scale computational analysis. Most modern computational approaches to language have been built on the assumption of the stability of orthographic “types”: in other words, each orthographic form is treated as a unique word-form. Of course, this is an assumption that collapses in the case of early modern spelling conventions which, while slowly moving towards standardization over the period covered by EEBO-TCP, are tolerant of a wide range of orthographic variation. Experienced human readers can navigate these variant forms with ease, but the variants make it impossible to apply most of the tools that are widely used for the computational analysis of texts, from simple word searches across corpora to more complex techniques such as “topic modeling.”
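The problem with treating each spelling as its own “type” can be illustrated with a minimal sketch. The token list and the small variant table below are invented for illustration; they are not MorphAdorner's data or method, only a toy demonstration of why a naive count or search misses variant spellings.

```python
from collections import Counter

# Illustrative tokens only: a mix of variant and modern spellings.
tokens = ["tyme", "time", "herte", "heart", "tyme", "certayne"]

# Hypothetical hand-made table of variant -> standard spelling.
# A real system would draw on far larger resources than this toy dictionary.
STANDARD = {"tyme": "time", "herte": "heart", "certayne": "certain"}


def count_types(words, table=None):
    """Count word types, optionally mapping each word through a spelling table."""
    table = table or {}
    return Counter(table.get(word, word) for word in words)


raw_counts = count_types(tokens)
regularized_counts = count_types(tokens, STANDARD)

# A naive search for "time" sees only the exact spelling...
print(raw_counts["time"])
# ...while the regularized count also gathers the variant "tyme".
print(regularized_counts["time"])
# The number of distinct types shrinks once variants are merged.
print(len(raw_counts), len(regularized_counts))
```

In this toy case the variant spellings split one word across several types, which is the effect that undermines word searches and frequency-based methods on unregularized early modern text.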
In its current version, MorphAdorner consists of a suite of computational-linguistics and natural language processing tools that use a variety of techniques to “adorn” texts with various types of linguistic information (the project prefers this term over “tag” or “annotate” because of its associations with medieval monastic practices of textual illustration). While it can perform other tasks, the main attraction of MorphAdorner for most scholars wanting to work with EEBO-TCP or similar historical corpora is its ability to generate part-of-speech tags, lemmas, and, most importantly, standardized spellings for words. Such “adornments” are primarily added to the text in the form of XML tags around each word, thus preserving the original version of the text. As an example, here is the TEI-encoded EEBO-TCP version of two lines from Chaunce of the Dolorous Lover, printed in 1520 by Wynkyn de Worde:
<L>Upon a certayne tyme as it befell</L>
<L>I was all pensyfe and thoughtfull in my herte</L>
On processing this file, MorphAdorner produces the following adorned version from the above lines:
<L>
<w lem="upon" pos="p-acp" reg="Upon" spe="Upon" xml:id="A01907-00550">Upon</w> <c> </c>
<w lem="a" pos="dt" reg="a" spe="a" xml:id="A01907-00560">a</w> <c> </c>
<w lem="certain" pos="j" reg="certain" spe="certayne" xml:id="A01907-00570">certayne</w> <c> </c>
<w lem="time" pos="n1" reg="time" spe="tyme" xml:id="A01907-00580">tyme</w> <c> </c>
<w lem="as" pos="c-acp" reg="as" spe="as" xml:id="A01907-00590">as</w> <c> </c>
<w lem="it" pos="pn31" reg="it" spe="it" xml:id="A01907-00600">it</w> <c> </c>
<w lem="befall" pos="vvd" reg="befell" spe="befell" xml:id="A01907-00610">befell</w>
</L>
<L>
<c> </c>
<w lem="i" pos="pns11" reg="I" spe="I" xml:id="A01907-00620">I</w> <c> </c>
<w lem="be" pos="vbds" reg="was" spe="was" xml:id="A01907-00630">was</w> <c> </c>
<w lem="all" pos="d" reg="all" spe="all" xml:id="A01907-00640">all</w> <c> </c>
<w lem="pensive" pos="j" reg="pensive" spe="pensyfe" xml:id="A01907-00650">pensyfe</w> <c> </c>
<w lem="and" pos="cc" reg="and" spe="and" xml:id="A01907-00660">and</w> <c> </c>
<w lem="thoughtful" pos="j" reg="thoughtful" spe="thoughtfull" xml:id="A01907-00670">thoughtfull</w> <c> </c>
<w lem="in" pos="p-acp" reg="in" spe="in" xml:id="A01907-00680">in</w> <c> </c>
<w lem="my" pos="po11" reg="my" spe="my" xml:id="A01907-00690">my</w> <c> </c>
<w lem="heart" pos="n1" reg="heart" spe="herte" xml:id="A01907-00700">herte</w>
</L>
Clearly, the MorphAdorner output is not meant for human reading in the way the original EEBO-TCP text, even with its limited markup indicating lines, can be read. What the program generates is a transitional text meant for computational access, but it is also a text sensitive to the needs of humanities scholars. It acknowledges that the standardization of spelling is not merely an instrumental goal for literary scholars and historians, a task that needs to be completed before they can get to the computational analysis of the corpus. Rather, as in the adorned output, multiple aspects or “states” of the text coexist and can be activated in different contexts—be it scholarly close reading that pays attention to the original spelling forms of words, or computational analysis that uses the information generated by MorphAdorner. By default, MorphAdorner generates lemmas (the dictionary head-word form), the part-of-speech (based on the NUPOS tag-set designed by Martin Mueller), the regularized or standard-spelling version of the word, the spelling which combines words that are fragmented in the original text, and, finally, a unique ID for every word. The goal is to produce a text that is fundamentally “scholarly” in an age where that description increasingly needs to accommodate various types of computational access as necessary complements to traditional forms of textuality.
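To make the idea of coexisting “states” concrete, the following is a minimal sketch of how a scholar might read adorned output with Python's standard library. It is not part of MorphAdorner; the function name `read_tokens` is hypothetical, and the sample assumes the attribute values are quoted with straight quotation marks, as well-formed XML requires. The tokens are taken from the first adorned line quoted above.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# First four tokens of the adorned line quoted in the review.
ADORNED = """<L>
<w lem="upon" pos="p-acp" reg="Upon" spe="Upon" xml:id="A01907-00550">Upon</w><c> </c>
<w lem="a" pos="dt" reg="a" spe="a" xml:id="A01907-00560">a</w><c> </c>
<w lem="certain" pos="j" reg="certain" spe="certayne" xml:id="A01907-00570">certayne</w><c> </c>
<w lem="time" pos="n1" reg="time" spe="tyme" xml:id="A01907-00580">tyme</w>
</L>"""


def read_tokens(xml_text):
    """Return one dict per <w> element (hypothetical helper, not a MorphAdorner API)."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": w.get(XML_ID),
            "original": w.text,          # spelling as printed
            "regularized": w.get("reg"),  # standardized spelling
            "lemma": w.get("lem"),
            "pos": w.get("pos"),
        }
        for w in root.iter("w")
    ]


tokens = read_tokens(ADORNED)

# The same tokens yield either "state" of the text on demand.
original_line = " ".join(t["original"] for t in tokens)
regularized_line = " ".join(t["regularized"] for t in tokens)

print(original_line)     # Upon a certayne tyme
print(regularized_line)  # Upon a certain time
```

Because the original spelling remains the element's text while the regularized form lives in an attribute, neither view has to be sacrificed for the other.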
This aspect of MorphAdorner is likely to be overlooked if one takes a merely technical view of the program. Building a relatively comprehensive suite of natural language processing and computational linguistics tools is an immensely involved task, one that intervenes in a field under constant research in the computer sciences, where the “cutting edge” of technology is constantly being redefined and driven by the research and funding that large-scale commercial interest brings. Various advanced toolkits for language processing tasks exist and are constantly maintained and updated. Why, then, would the humanities want to build their own set of tools instead of borrowing the state of the art from other disciplines? To answer this question one must look at the ways in which MorphAdorner is aware of the particular needs of humanities scholars, drawing on a combination of sophisticated computational techniques like machine learning and accumulated humanistic expertise, in the form of databases of orthographic forms, suffixes, etc., to produce texts that strive to retain their historical accuracy while providing computational access. In computational terms the two major tasks that MorphAdorner performs, spelling standardization and part-of-speech tagging, are interdependent, and require an awareness of historical orthographic and syntactic usage that most modern natural language processing programs simply cannot implement. As an example, consider the disambiguation of two instances of the word “souldiers” from the 1590 edition of Sidney’s The Countess of Pembroke’s Arcadia. Much of early modern usage omits the apostrophe that many contemporary natural language parsers would look for as a key indicator of a possessive. MorphAdorner, however, can draw on its internal databases of early modern words to correctly disambiguate different uses of the word. The phrase “by a souldiers hande written” is parsed as:
<w lem="by" pos="p-acp" reg="by" spe="by" xml:id="A12229-2107680">by</w> <c> </c>
<w lem="a" pos="dt" reg="a" spe="a" xml:id="A12229-2107690">a</w> <c> </c>
<w lem="soldier" pos="ng1" reg="soldiers" spe="souldiers" xml:id="A12229-2107700">souldiers</w> <c> </c>
<w lem="hand" pos="n1" reg="hand" spe="hande" xml:id="A12229-2107710">hande</w> <c> </c>
<w lem="write" pos="vvn" reg="written" spe="written" xml:id="A12229-2107720">written</w>
The part-of-speech tag “ng1” assigned to “souldiers” correctly identifies it as a possessive singular. In another phrase with the same spelling, “the fury of the entring souldiers,” however, the word is classified as a plural noun and tagged with “n2”:
<w lem="the" pos="dt" reg="the" spe="the" xml:id="A12229-2140940">the</w> <c> </c>
<w lem="fury" pos="n1" reg="fury" spe="fury" xml:id="A12229-2140950">fury</w> <c> </c>
<w lem="of" pos="pp-f" reg="of" spe="of" xml:id="A12229-2140960">of</w> <c> </c>
<w lem="the" pos="dt" reg="the" spe="the" xml:id="A12229-2140970">the</w> <c> </c>
<w lem="enter" pos="vvg" reg="entering" spe="entring" xml:id="A12229-2140980">entring</w> <c> </c>
<w lem="soldier" pos="n2" reg="soldiers" spe="souldiers" xml:id="A12229-2140990">souldiers</w>
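Once the adornments are in place, the two uses of “souldiers” can be told apart mechanically. The sketch below is an illustration in Python, not MorphAdorner's own code: the helper `tags_for_lemma` and the wrapping `<seg>` element are hypothetical, and the tokens are a selection from the two Arcadia phrases quoted above, written with straight quotation marks.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree expands the predefined "xml:" prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Selected tokens from both phrases, wrapped in a hypothetical <seg> root
# so that the fragment parses as a single XML document.
SAMPLE = """<seg>
<w lem="a" pos="dt" reg="a" spe="a" xml:id="A12229-2107690">a</w>
<w lem="soldier" pos="ng1" reg="soldiers" spe="souldiers" xml:id="A12229-2107700">souldiers</w>
<w lem="hand" pos="n1" reg="hand" spe="hande" xml:id="A12229-2107710">hande</w>
<w lem="enter" pos="vvg" reg="entering" spe="entring" xml:id="A12229-2140980">entring</w>
<w lem="soldier" pos="n2" reg="soldiers" spe="souldiers" xml:id="A12229-2140990">souldiers</w>
</seg>"""


def tags_for_lemma(xml_text, lemma):
    """Group the word IDs of every occurrence of a lemma by its part-of-speech tag."""
    by_pos = defaultdict(list)
    for w in ET.fromstring(xml_text).iter("w"):
        if w.get("lem") == lemma:
            by_pos[w.get("pos")].append(w.get(XML_ID))
    return dict(by_pos)


result = tags_for_lemma(SAMPLE, "soldier")
for pos, ids in result.items():
    print(pos, ids)
```

A search by lemma finds both occurrences regardless of spelling, and the part-of-speech tag then separates the possessive singular from the plural, a distinction that the identical surface form “souldiers” cannot provide on its own.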
One might note in passing that the accuracy of both the spelling standardization and POS-tagging has vastly improved in the current release with the availability of larger databases and “training data” although, as with any natural language processing task of this complexity, not every variation of early modern English can be reliably standardized. The goal, rather, is to move as close to a computationally tractable corpus of early English texts as possible, and no doubt future versions will improve on the already high standards set by the current release.
But this level of involvement and complexity comes at a cost for the literary scholar. As one might guess from the output the program produces, MorphAdorner is not meant to be a simple tool or pre-built suite that enables a given workflow in the way that its closely aligned projects, MONK and WordHoard, are. It is rather meant for users who already have some facility with computational work, or who have access to technical support through libraries, digital humanities centers, etc. It is meant to be used primarily from the command line and, at an even lower level for the experienced programmer, exposes a well-documented Java API that can be used to build customized programs based on MorphAdorner’s functionality. The design of the program consistently opts for customizability over ease of use, as evinced by the list of command line parameters and a properties file that allows one to set various options, from output format to a variety of underlying databases and dictionaries. However, this customizability should be welcomed by the scholarly community rather than looked upon as a hindrance. It is rare for tools in the humanities to implement this level of complexity and customizability, but in this case it is deployed in the service of a task that is itself immensely technically challenging and that, if the scholarly community takes up the challenge, can truly transform the way we look at the first few centuries of English print.
The carefully considered design decisions behind MorphAdorner are also apparent in the way it eschews graphical interfaces and interactive environments, which can be costly and time-consuming to build and maintain, not to mention restrictive for many complex computational tasks. Instead it provides a variety of pre-set configurations, packaged as shell scripts, that address most of the common use cases and corpora and that can be used by scholars with minimal experience of the command line. But perhaps more importantly from the perspective of getting scholars to try it out, the new release separates out its demonstration module into its own package called MorphAdorner Server (also usable on the project website), which allows individuals or, more likely, university libraries and digital humanities centers to build their own web interfaces to the program. Scholars can then use this interface to process single texts or short passages. One can only hope that, having once tried it through such an interface, more people will be willing to invest the moderate amount of time needed to learn how to use the main suite of programs, because they represent one of the most outstanding achievements of humanities computing. MorphAdorner is likely to play a significant part in our scholarly lives as more of the early printed record becomes available, and as we look beyond the problems of access to those posed by analysis at such scales.
Humanities Digital Workshop
Interdisciplinary Project in the Humanities
Washington University in St. Louis
Martin Mueller, “Digital Shakespeare: Or, Towards a Literary Informatics,” Shakespeare 4, no. 3 (2008): 300–317; Franco Moretti, “Conjectures on World Literature,” New Left Review 1, Jan-Feb (2000): 56; Franco Moretti, Graphs, Maps, Trees: Abstract Models for Literary History (Verso, 2007); Franco Moretti, Distant Reading (London; New York: Verso, 2013).
While I will focus on EEBO-TCP here, MorphAdorner is designed to handle a wide variety of historical corpora and comes with pre-built profiles for eighteenth- and nineteenth-century usage. It includes direct support for several widely used scholarly corpora such as “Eighteenth Century Collections Online,” http://gdc.gale.com/products/eighteenth-century-collections-online/, “Documenting the American South,” http://docsouth.unc.edu/, and the “Wright fiction archive,” http://webapp1.dlib.indiana.edu/TEIgeneral/welcome.do?brand=wright.