
T-PEN: Transcribing the Text, Keeping the Image
by Craig A. Berry

Firey, Abigail, and Ginther, Jim, Principal Investigators, T-PEN: Transcription for Paleographical and Editorial Notation.


T-PEN: Transcription for Paleographical and Editorial Notation is a software package and also a collaborative project for transcribing manuscripts, developed by the Center for Digital Theology at Saint Louis University with funding from the Mellon Foundation and NEH. Both the package and the project are impressive, but it is the combination of capable software and institutional collaborative prowess that sets T-PEN apart as a significant step forward in curating digital representations of pre-modern texts.

At its most basic, T-PEN provides a suite of tools for producing a digital transcription from digital page images. The distinguishing characteristic of the transcription environment is that the page image is never left behind in producing a transcription of it; rather, each line of the transcription is linked to the region on the image whence it came. In digital media parlance, the image is a finely addressable “canvas” upon which shapes may be drawn, and in the case of a manuscript image, those shapes are the coordinates of the columns and lines of written material. Mapping these coordinates is accomplished via a combination of automated and manual processes. T-PEN parses the image and makes a (reasonably good) first cut at identifying columns and lines, and then the user fine-tunes the size and location of each line and column on each page before proceeding with the transcription proper.
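The idea of a transcription whose every line remembers its place on the image can be sketched in a few lines of code. The following is a hypothetical illustration only, not T-PEN's actual data model: each transcribed line carries the pixel rectangle on the page image from which it was read, and a column's extent can be inferred as the union of its lines' rectangles.

```python
# Hypothetical sketch (not T-PEN's internal format): a transcribed line
# paired with the rectangle of image pixels it was read from.
from dataclasses import dataclass

@dataclass
class LineZone:
    x: int      # left edge of the line's bounding box, in image pixels
    y: int      # top edge
    w: int      # width
    h: int      # height
    text: str   # the transcription of just this line

def column_bounds(lines):
    """Infer a column's bounding box as the union of its lines' boxes."""
    left = min(l.x for l in lines)
    top = min(l.y for l in lines)
    right = max(l.x + l.w for l in lines)
    bottom = max(l.y + l.h for l in lines)
    return (left, top, right - left, bottom - top)

# Two invented lines standing in for a column of a manuscript page:
page = [
    LineZone(120, 200, 900, 42, "Lo I the man, whose Muse whilome did maske,"),
    LineZone(120, 244, 900, 42, "As time her taught, in lowly Shepheards weeds,"),
]
```

Because annotations, notes, and metadata can be attached to the same coordinates, everything produced during transcription stays anchored to the spot on the image it describes.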

Once this map of the page image is complete, T-PEN affords a one-line-at-a-time editing box immediately above the image of the line and surrounded by a rich paleographer’s toolbox, including: buttons for inserting special characters or XML tags into the transcription; dictionaries of Latin, Old English, Middle English, and French; a comparison tool for viewing manuscript pages side-by-side with the aid of a virtual magnifying glass; a look-up tool for frequently-used abbreviations; a history tool for viewing a log of edits made to the transcription; a note-taking facility for adding annotations to the transcription; and a line break tool for the addition of line encoding to a transcription produced without the aid of T-PEN but uploaded into it. All of the content-producing tools (but obviously not the reference tools) create materials linked to the same line coordinates identified in the initial step.

So what advantages does such a transcription have over a transcription produced the old-fashioned way, i.e., faintly scrawled in a notebook with a number 4 pencil under the harsh gaze of a rare book librarian? A T-PEN transcription makes possible, at least potentially, a range of different digital presentations that include both the transcription and the digital page image, such as interlinear, facing page, or superimposed displays. Annotations and metadata produced as part of the transcription process are also aligned to the same page coordinates as the transcription itself. Transcribing a manuscript thus involves supplementing it while keeping it visible rather than replacing it with a surrogate at one or more removes from the original artifact, the only exception being the photographic images of a book’s pages standing in for the physical book. While not all of the possibilities are immediately or automatically available, T-PEN, by keeping the image close at hand while attaching modern scholarly interpretation to it, ably demonstrates the capacity of the digital medium to enhance without occluding the historical object of study.

The technical accomplishments of T-PEN are significant, but its institutional and collaborative achievements are no less impressive. Myriad overlapping questions of ownership arise when large numbers of manuscript images are collected from widely dispersed institutions under varying terms of release, and transcriptions of those images are then produced by a diverse group of people. Clearly the T-PEN team has given a great deal of careful thought to managing the rights and responsibilities of all the different contributors, as a reading of the User Agreement on the About page makes clear.[1] Suffice it to say that owners of images and producers of transcriptions have considerable control over how their intellectual property is used while still having opportunities to collaborate to their mutual benefit.

The scope of the collaborative effort is evident in the sheer number of places and manuscripts involved. There are well over 4,000 digitized manuscripts available, representing almost 70 libraries in nearly as many cities. Some collections are restricted to members of particular projects or institutions, but many are open access. Unfortunately, the only way to locate a manuscript to transcribe is to know the city or library in which it is located and the manuscript number by which it is known. There are no facilities for searching by genre, century, or language, much less title and author. Hovering over the name of a city produces a tiny map of that city inset into a larger map of the region of the world in which it is located. It is difficult to imagine how a one-inch-wide map of the Palo Alto area inset into a large map of North America provides any pertinent guidance about the manuscript images available from the Stanford library, and this feature seems to be a rare instance in T-PEN where trendiness trumps common sense; the resources invested in geo-location of the libraries from which manuscript images have been drawn would have been better spent on more mundane and traditional search facilities.

Getting started with T-PEN is not especially difficult but will require some patience and persistence. For example, the tool for adjusting line and column identifications on a page image will take some minutes or perhaps half an hour of flailing about before the concepts and the hand movements click and begin to feel natural. The welcome page links to a twelve-minute video demonstrating the use of the tool, and there are quite a few more video demonstrations available on YouTube.[2] Viewing one or more of these videos is essential for new users of T-PEN as textual documentation is almost entirely absent, a curious choice for a tool intended primarily for people whose professional lives are devoted to old books.[3]

While anyone can request an account, the specific tools in the current toolbox and the necessity of choosing potential work items by library and manuscript number make it clear that T-PEN has been developed by and for professional medievalist paleographers. This is a commendable and largely realized goal, and the investigators were wise to focus their efforts on a constituency they understand and of which they are themselves members. But T-PEN also has considerable potential for use outside its original target audience. For example, as early modernists are well aware, manuscripts did not in any way cease to exist with the advent of the printing press, nor are the problems involved in transcribing early printed books entirely different from those involved in transcribing manuscripts. It is easy to imagine relatively slight alterations in the already highly-configurable toolbox that would make T-PEN quite an attractive tool for working with primary materials some centuries later than its current focus.

The T-PEN software is open source and released under an Educational Community License.[4] This means that the source code is publicly available and that any persons or institutions wanting to run their own copies of the software may do so with a few (very liberal) restrictions outlined in the license. This reviewer had a local instance up and running after a few hours of tinkering, an exercise that served primarily to illustrate how barren the software is without the dazzling library of images already available on the T-PEN site.[5] A more plausible scenario for taking advantage of the open source license would be contributing bug fixes and enhancements in collaboration with the T-PEN developers.

An interesting side benefit of having access to the source code is the ability to count it and draw conclusions about the project from the quantity of code it contains. T-PEN has some 40,000 lines of code excluding third-party libraries, and a common estimating technique in the software industry allows us to say that such a quantity of code represents about 115 months of programmer effort, or three people working non-stop for just over three years.[6] This estimate is quite rough and should be taken with a large grain of salt, but it serves to illustrate the commitment of programming resources required to produce a software environment of T-PEN’s scope and complexity. This is no small project, and by any estimate T-PEN represents a major investment of resources over a significant period.
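The arithmetic behind that estimate, using the "organic" COCOMO formula cited in note 6 (effort in person-months = 2.4 × KLOC^1.05, where KLOC is thousands of lines of code), can be reproduced in a couple of lines:

```python
# The "organic" variant of the basic COCOMO model cited in note 6:
# person-months = 2.4 * (KLOC ** 1.05), where KLOC is thousands of
# lines of code. T-PEN's roughly 40,000 lines give KLOC = 40.
def cocomo_organic_months(kloc: float) -> float:
    return 2.4 * kloc ** 1.05

effort = cocomo_organic_months(40)    # about 115.4 person-months
years_for_three = effort / 3 / 12     # about 3.2 years for three programmers
```

As the model's own literature stresses, such figures are order-of-magnitude gauges rather than measurements, which is why the estimate deserves its grain of salt.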

It is more difficult to estimate the present and future prospects based on past accomplishments, especially since it appears that major development and the grants that funded it wound down in 2012. T-PEN rests on a sound technical foundation of proven and ubiquitous technologies that should be viable for years to come, but one of the technologies it depends on is the web browser, of which there are several competing implementations that are updated almost weekly. It will take a sustained effort just to keep the software functioning properly, much less to expand or enhance it. Large software projects tend to follow a trajectory best described, if readers will forgive a short parable, as a reverse Commedia. First there is the Paradiso of the development phase, when programmer joy is palpable and the gracious benevolence of those providing funding is everywhere evident. Next follows the Purgatorio of public release, wherein many of the sins of the development phase are purged in the fires of contact with actual users, though the arduous (if somewhat less happy) efforts of the team often result in successful completion. Finally there is the Inferno of long-term maintenance and support, characterized by the eternal raging of unquenched desires for new features amidst a growing backlog of small requests and support tasks rapidly expanding toward infinity, all endured while that most supreme grace of adequate funding has become a distant and bitter memory and the original programmers have moved on to other projects.

Of course this anti-salvation history can be reversed at any time by application of sufficient resources, but as far as supporting digital humanities goes, granting agencies are about as likely to provide long-term funding for maintenance as they are to make trips to the underworld. If the necessary resources are to be marshaled, it will be done by academic entities such as the Center for Digital Theology at Saint Louis University, and such entities will need permanent staff to accomplish their mission. T-PEN is in no way unique in this regard, but as a large and brilliantly successful digital humanities project funded initially by grants, its transition from grant-funded development to long-term support under the wing of an institution will be an example to watch for the future of digital humanities more generally.  During the first few weeks of 2015, there were about a dozen check-ins to the T-PEN source repository, so there is reason to hope that future is a bright one.


Craig A. Berry
Chicago, Illinois




[3] An out-of-date draft of a user manual for an older version of T-PEN can be found at <http://web.stanford.edu/group/dmstech/cgi-bin/drupal/sites/default/files/T-PEN_Draft_User_Manual.pdf>.

[5] In principle, one can load one’s own images, but as of this writing that feature is not working.

[6] 115.4 = 2.4 × 40^1.05, using the “organic” variant of the standard COCOMO model, described, for example, at <http://www.mhhe.com/engcs/compsci/pressman/information/olc/COCOMO.html>, and originally developed in Barry W. Boehm, Software Engineering Economics (Upper Saddle River, NJ: Prentice-Hall, 1981).



Cite as:

Craig A. Berry, "T-PEN: Transcribing the Text, Keeping the Image," Spenser Review 44.3.66 (Winter 2015). Accessed April 23rd, 2024.