The figure of my study is round, and there is no more open wall than what is taken up by my table and my chair, so that the remaining parts of the circle present me a view of all my books at once, ranged upon five rows of shelves round about me. It has three noble and free prospects, and is sixteen paces in diameter… . ’Tis there that I am in my kingdom … (Montaigne)
In the next fifty years the entirety of our inherited archive of cultural works will have to be re-edited within a network of digital storage, access, and dissemination. This system, which is already under development, is transnational and transcultural. (Jerome McGann, TLS November 20, 2009)
When I did my PhD nobody had a PC on their desktop. Now everybody has one. And that makes a definitive difference to how we work with data. Many economists think in their daily work that progress is frustratingly slow. But if you take a longer view, there has been huge progress. (Translated from an interview in the Frankfurter Allgemeine Zeitung, August 27, 2014, with Alvin Roth, winner of the 2012 Nobel Prize in Economics)
At some point in 2015, 25,000 transcriptions of Early English Books Online (EEBO) texts will move into the public domain. In 2020, they will be joined by another 40,000 titles, and it will be possible to say then that anybody anywhere and anytime has free access to at least one digital surrogate of just about every book published before 1700 in the English-speaking world. Montaigne’s vision of all “my” books at hand and in view gives way to everything from the early modern print record at hand on a mobile device or “table of memory,” to use Hamlet’s words. One could reasonably opt for Montaigne’s library rather than the cellphone. But this is a consequential change for the documentary infrastructure of early modern studies, and it will change the ways in which early modernists go about their work.
The Text Creation Partnership (TCP), which has been responsible for this massive project, has been an unusual and very successful collaboration between a commercial partner (ProQuest) and an academic consortium led by the Universities of Michigan and Oxford. Subscribing institutions provided the funds for the transcriptions, which were done by commercial vendors working from digital scans of microfilm page images. Early modernists should thank Mark Sandler, then at the University of Michigan, for his foresight in negotiating an agreement that the texts from each phase of the project should move into the public domain after five years. Phase 1 ended in 2010. Phase 2 will end in 2015. Hence the release dates of 2015 and 2020.
The English Short Title Catalogue, which includes the bibliographical work of Pollard, Redgrave, and Wing, includes about half a million records from before 1800. About 130,000 records, a little more than a quarter, are from before 1700. There are no easy answers to questions like “What is a book?” or “What is a duplicate of a book?” But in a rough and ready way one can say EEBO-TCP adds up to a complete deduplicated digital library of the first 230 years of English print culture. Think of it as a cultural genome of early modern English. There will of course be holes, but if they matter, they can be filled in later.
I know colleagues who take a rather dim view of this project. They say that these digital texts are not scholarly editions or even proper diplomatic transcriptions, and that they are riddled with errors. People like to judge a barrel by its worst apples. There are certainly bad apples in the TCP archive, although for the most part they have less to do with the quality of transcription than with the quality of the digital image and the microfilm image or printed page that it mirrors at three removes. Moreover, errors cluster heavily so that the bad apples tend to look very bad indeed. This is both good and bad. It is bad because pages with lots of errors disproportionately color perceptions of the project as a whole. It is good because errors that cluster are easier to find and fix.
Whatever scholars may think of the quality of the texts and whatever they say in public, the EEBO-TCP texts will quickly establish themselves as the most common and often the only access to early modern books. They will circulate as “epubs” of one kind or another on mobile devices. If a text is half-way readable, convenience and speed will trump quality every time. The text of choice is the text you can get to at 2:00 AM in your pajamas.
This may be regrettable, but it is so. Scholars or librarians who worry about the quality of EEBO-TCP texts should join a movement to make them better through collaborative curation or “crowdsourcing.” The large majority of textual problems in the EEBO-TCP are not of the philologically exquisite kind (was it an “Indian” or “Iudean” who threw away that pearl?). They belong to the world of manifest error where the correct answer is almost never in doubt and typically not difficult. Undergraduates who like old books can do good work in that field and have done so. But so can high school students in AP classes and retired lawyers or school teachers. And the results of human curatorial labor can become inputs for machine learning techniques that are surprisingly good, although the results need to be reviewed by humans.
Judith Siefring and Eric Meyer in their excellent study “Sustaining the EEBO-TCP Corpus in Transition” say more than once that in user surveys “transcription accuracy” always ranks high on the list of concerns. They also say that when asked whether they reported errors, 20% of the users said “yes” and 55% said “no.” But 73% said that they would report errors if there were an appropriate mechanism and only 6% said they wouldn’t. There is a big difference between what people say and what they do. But given a user-friendly environment for collaboration, it may be that a third or more of EEBO-TCP users could be recruited into a five-year campaign for a rough clean-up of the corpus so that most of its texts would be good enough for most scholarly purposes. It is a social rather than technical challenge to get to a point where early modernists think of the TCP as something that they own and need to take care of themselves. They had better do it because nobody else will do their work for them. Greg Crane has argued that “Digital editing lowers barriers to entry and requires a more democratized and participatory intellectual culture.” In the context of the much more specialized community of Greek papyrologists, Joshua Sosin has successfully called for “increased vesting of data-control in the user community.” If the clean-up of texts is not done by scholarly data communities themselves it will not be done at all. And it is likely to be most successful if it is done along the model of “Adopt a highway,” where scholarly neighborhoods agree to get rid of litter in texts of special interest to them. The engineer John Kittle helped improve the Google map for his hometown of Decatur, Georgia, and was reported in the New York Times as saying:
Seeing an error on a map is the kind of thing that gnaws at me. By being able to fix it, I feel like the world is a better place in a very small but measurable way.
Compare this with the printer’s plea in the errata section of Harding’s Sicily and Naples, a mid-seventeenth century play:
Reader. Before thou proceed’st farther, mend with thy pen these few escapes of the presse: The delight & pleasure I dare promise thee to finde in the whole, will largely make amends for thy paines in correcting some two or three syllables.
Fixing manifest errors is an important task, and much of it needs to be done before other and more interesting things can be done. But it is not enough to think of the digital surrogate as a new and in some ways more flexible vehicle for serving up readable stuff. You have to think of a digital corpus as a Janus-faced thing: human readable and machine actionable. When human readers “process” texts they take advantage of powerful and mostly tacit pattern recognition capabilities by which they resolve ambiguities and supply contextual information. From a computer’s perspective, a written text is a woefully underspecified set of instructions, but given the slow speed at which even fast readers move through a text, it has to be that way. The great Romantic philologist August Boeckh (ten years older than Keats) at the age of 23 defined philology as “die Erkenntnis des Erkannten” (the further knowing of the already known) and as a task of “infinite approximation.” The machine can be a powerful assistant in the task of such further knowing and approximation. Given appropriate data and processing instructions, it can do in seconds or minutes and much more accurately what it would take a human months or years to do. But the machine has no tacit knowledge and requires degrees of “explicitation” and verbosity that make the fully machine-actionable text a deeply repugnant thing to look at. On the other hand, it is not meant to be read. It is intended as a data structure from which different phenomena are abstracted for further analysis. The details can be technical, but the principle is simple and very much like the Biblical concordances of medieval monks: you transform a verbal structure into an inventory of its parts, isolate, compare, and combine the parts in different ways in the hope that from these literal exercises of “dis-solution” (“ana-lysis”) will spring a fuller understanding of the whole and its parts.
Goethe’s devil describes the procedure very well when he says to the freshman student:
Wer will was Lebendigs erkennen und beschreiben,
Sucht erst den Geist heraus zu treiben,
Dann hat er die Teile in seiner Hand,
Fehlt, leider! nur das geistige Band.
If you want to describe a living thing
First drive out its spirit,
Then you have the parts in your hand,
But alas the spiritual bond will be missing.
Mock it as you will, but in the right hands it remains the most powerful tool for advancing knowledge and understanding.
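The concordance principle described above, turning a verbal structure into an inventory of its parts, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `build_concordance` is hypothetical, the tokenizer is a crude regular expression, and the two sample lines are taken from the translation quoted above.

```python
import re
from collections import defaultdict


def build_concordance(lines):
    """Map each lower-cased word form to the (line number, line) pairs where it occurs."""
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # A deliberately crude tokenizer: runs of letters and apostrophes.
        # Each word is recorded once per line, even if it is repeated there.
        for word in set(re.findall(r"[a-z']+", line.lower())):
            index[word].append((number, line))
    return dict(index)


sample = [
    "Then you have the parts in your hand,",
    "But alas the spiritual bond will be missing.",
]
concordance = build_concordance(sample)

# Every occurrence of a word can now be isolated and compared in context.
for number, line in concordance["the"]:
    print(number, line)
```

The whole has been dissolved into parts, but each part keeps a pointer back to the line it came from, which is what makes recombination possible.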
Siefring and Meyer in their EEBO-TCP report state that the TEI-XML encoding of the TCP texts is not valued highly by users. This is unsurprising. Awareness of XML encoding and its “affordances” is generally low among humanities scholars for the simple reason that there are no tools yet that let ordinary users explore the query potential of encoded texts. In the most common interfaces for TCP texts, the TEI encoding is either ignored or buried so deep that it is lost for all practical purposes. In principle, however, TEI encoding is very valuable, and the TCP archive is by far the largest collection of consistently encoded TEI texts in the world.
You can think of TEI encoding as cataloguing of the discursive parts of a book. It is particularly helpful in genres like drama, where a system of rigid and genre-wide notation is part of the genre itself. In TEI encoding, acts, scenes, speeches, stage directions and other discursive parts are, as it were, containerized in “elements” bounded by the (in)famous angle brackets familiar from HTML. Such encoding increases the query potential of a corpus: it becomes much easier to look for words spoken in lines of verse, quotations, stuff in “paratext,” etc.
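A small sketch may make this query potential concrete. It uses Python's standard `xml.etree.ElementTree` module on an invented fragment of drama. The element names (`sp`, `speaker`, `l`, `p`, `stage`) are standard TEI, but the sample text, the `who` values, and the two function names are made up for illustration, and the TEI namespace that real files declare is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# An invented scene, not taken from any TCP text.
SAMPLE = """
<div type="scene" n="1">
  <stage>Enter A and B.</stage>
  <sp who="A">
    <speaker>A.</speaker>
    <l>A first made-up line of verse</l>
    <l>A second made-up line of verse</l>
  </sp>
  <sp who="B">
    <speaker>B.</speaker>
    <p>A made-up reply in prose.</p>
  </sp>
</div>
"""


def verse_lines_by_speaker(xml_text):
    """Collect the text of <l> (verse line) elements, grouped by the speech's 'who' attribute."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for speech in root.iter("sp"):
        lines = ["".join(l.itertext()).strip() for l in speech.iter("l")]
        if lines:
            result.setdefault(speech.get("who"), []).extend(lines)
    return result


def stage_directions(xml_text):
    """Return the text of every <stage> element in document order."""
    root = ET.fromstring(xml_text.strip())
    return ["".join(s.itertext()).strip() for s in root.iter("stage")]


print(verse_lines_by_speaker(SAMPLE))
print(stage_directions(SAMPLE))
```

Because verse, prose, and stage directions sit in different containers, a question like "which speakers speak verse?" becomes a lookup rather than a reading task.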
A more radical form of cataloguing the parts of a book is the practice of linguistic annotation, where you associate every word token with standardized lexical and grammatical information that can be automatically generated with error rates that make little or no difference to most quantitatively-based inquiries. In the early days of digital library catalogues it was a difficult task to manage a million MARC records. Today, it is not a particularly challenging technical task to create something like a word-level MARC record for each of the two billion or so words in the 65,000 TCP texts. Such a record “inherits” from the bibliographical catalogue all the data that apply to the text as a whole. From the catalogue of TEI encoding it inherits the properties associated with the particular discursive element in which it occurs. A “divide and conquer” approach of this kind lets you tell many Tales of Three Catalogues. Or you could say that in an annotated corpus of this kind the collapse of the distinction between Book and Lexicon becomes a source of many fruitful inquiries.
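The idea of a word-level record that inherits from the other two catalogues can be sketched as a pair of Python data classes. Everything here is an assumption made for illustration: the class and field names are hypothetical, the bibliographical data are invented, and the lemma and part-of-speech values are supplied by hand rather than produced by a real annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookRecord:
    """Bibliographical catalogue: data that apply to the text as a whole."""
    title: str
    author: str
    year: int


@dataclass(frozen=True)
class WordRecord:
    """One word token with its linguistic, structural, and bibliographical context."""
    spelling: str     # the form as transcribed
    lemma: str        # standardized dictionary form
    pos: str          # part of speech
    element: str      # enclosing TEI element, e.g. "l" for a verse line
    book: BookRecord  # inherited from the bibliographical catalogue


def annotate(tokens, element, book):
    """Attach the enclosing element and book to (spelling, lemma, pos) triples."""
    return [WordRecord(s, lemma, pos, element, book) for s, lemma, pos in tokens]


# Invented example data.
book = BookRecord(title="An Example Play", author="Anonymous", year=1650)
records = annotate([("loue", "love", "verb"), ("vs", "us", "pronoun")], "l", book)
records += annotate([("Exit", "exit", "verb")], "stage", book)

# A query across all three catalogues: verbs in verse lines printed before 1700.
hits = [r.lemma for r in records
        if r.pos == "verb" and r.element == "l" and r.book.year < 1700]
print(hits)
```

The single filter at the end draws on the lexicon, the TEI encoding, and the bibliographical record at once, which is the sense in which Book and Lexicon collapse into one searchable structure.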
The TCP archive is probably the last large-scale manual transcription and encoding project of historic texts. Optical Character Recognition (OCR) has made enormous progress in the last decade. There have been reports of good enough results from OCR of seventeenth-century texts in black letter type. The Early Modern OCR Project at Texas A&M has eighteenth-century texts as its major focus, but it is also a goal to produce OCR texts of untranscribed EEBO texts. It is a clear advantage of OCR that the process records the spatial co-ordinates of each character on the printed page. OCR will always create a model of the “white space XML” of the original, which makes the alignment of page image and transcription much easier. We can look forward to a time when every printed text from the world of the hand-operated letter press will be available as a digital surrogate with a query potential that in many cases exceeds that of the original. There may come a moment when a digital transcription is just another data field in every item of the English Short Title Catalogue. That would be a lot of progress towards what I have called a Book of English and defined as:
- a large, growing, collaboratively curated and public domain corpus
- of printed English since its earliest modern form
- with full bibliographical detail
- and light but consistent structural and linguistic annotation
Readers who mourn the loss of the tactile intimacy of “real” books need not worry that such a digital Book of English will replace the originals. On the contrary, the wide availability of digital surrogates of early modern texts will stimulate interest in the originals.
Gregory Crane, “Give us editors! Re-inventing the edition and re-thinking the humanities,” Online Humanities Scholarship: The Shape of Things to Come, March 26-28, 2010. <http://shapeofthings.org/papers/GCrane.doc>.
New York Times, 16 Nov. 2009.
After writing this sentence I came across the announcement of COW (Corpora from the Web), a project at the Free University of Berlin to create “linguistically processed gigatoken web corpora” in Dutch, English, French, German, and Swedish (http://hpsg.fu-berlin.de/cow/). These corpora are between one and ten billion words. For copyright reasons, the corpora are “sentence shuffles,” a concession that an early modern English corpus would not have to make.
Martin Mueller, “Towards a Book of English: A linguistically annotated corpus of the EEBO-TCP texts,” in Bodleian Libraries, University of Oxford, “Revolutionizing Early Modern Studies”? The Early English Books Online Text Creation Partnership in 2012: EEBO-TCP 2012. <http://ora.ox.ac.uk/objects/uuid%3Aa2b1af11-5278-4e5a-8aa0-ee9c16550c44>.