Providing large scale text corpora for research

The Stanford RegLab and the Stanford Literary Lab have both been processing and analyzing large text corpora for many years. Both recently received a large set of OCR content from Stanford Libraries, thanks to work that DLSS has undertaken to retrieve the digital files of more than 3 million items from the Stanford Libraries catalog that were scanned by Google.

Making that trove of materials available for researchers includes considerations of digital discovery, delivery and preservation. The Stanford Digital Repository (SDR) was upgraded in the process to support the massive accessioning effort. And a critical new piece of infrastructure — a pipeline to automate retrieval from Google, then accessioning and deposit into the SDR — was built to make this possible.

And it is all going smoothly. As of today, over 500,000 of those 3 million items have been deposited. A Data Use Agreement is available for researchers who want to dig into these materials.

Daniel Ho, Professor of Law, Stanford HAI Associate Director, and PI for the RegLab, requested items from the catalog in English (or unspecified language), published after 1965, and within the Library of Congress Classification (LCC) class K for law and legislation. In our Stanford Libraries catalog, 265,032 items match those parameters, but Google chose to scan only a subset of them: 70,446 items, or about 27 percent. The RegLab plans to use this corpus to fine-tune a language model on legal text.
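The three request parameters above can be read as a simple filter over catalog records. The following is only an illustrative sketch, not the query that was actually run against the catalog: the record layout (a dict with `language`, `year`, and `lcc` keys) and the language codes `eng` and `und` are assumptions made for the example.

```python
def matches_request(record):
    """Check one catalog record against the RegLab request parameters.

    `record` is a hypothetical dict with "language", "year", and "lcc"
    keys; the real catalog metadata may be shaped differently.
    """
    # English or unspecified language (codes are assumed for illustration).
    language_ok = record.get("language") in {"eng", "und", None}

    # Published after 1965.
    year = record.get("year")
    year_ok = year is not None and year > 1965

    # LCC class K (law), including subclasses such as KF or KJV.
    lcc = (record.get("lcc") or "").strip().upper()
    class_ok = lcc.startswith("K")

    return language_ok and year_ok and class_ok


records = [
    {"language": "eng", "year": 1987, "lcc": "KF4550 .A2"},
    {"language": "fre", "year": 1990, "lcc": "KJV444"},
    {"language": "eng", "year": 1950, "lcc": "K230"},
    {"language": None, "year": 2001, "lcc": "PS3545"},
]
selected = [r for r in records if matches_request(r)]
```

In this made-up sample only the first record passes all three tests.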

Coincidentally, about the same number of items resulted from the Literary Lab's request. PI Mark Algee-Hewitt, Assistant Professor of Digital Humanities, was interested in novels and literary criticism from the Stanford Libraries catalog in any language. Each of the 70,942 items is delivered in a single folder named by the item's druid (a unique identifier that functions much like a DOI), with two files in each folder: one METS .xml file (METS is a standard for encoding descriptive, administrative, and structural metadata) and one zip archive containing the OCR .txt files.
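For readers who want to picture working with this delivery format, here is a minimal sketch of reading one item folder. It assumes the folder holds exactly one `.xml` file and one `.zip` file, that the archive contains UTF-8 `.txt` files, and that sorting their names gives reading order; the actual file names and encoding are not specified here, and `read_item` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def read_item(item_dir):
    """Return (druid, mets_path, ocr_text) for one delivered item folder."""
    item_dir = Path(item_dir)
    druid = item_dir.name  # the folder is named by the item's druid

    # Assumed layout: exactly one METS .xml file and one .zip archive.
    mets_files = sorted(item_dir.glob("*.xml"))
    zip_files = sorted(item_dir.glob("*.zip"))
    if len(mets_files) != 1 or len(zip_files) != 1:
        raise ValueError(f"{druid}: expected one .xml and one .zip file")

    # Concatenate the OCR text files in name order (assumed reading order).
    pages = []
    with zipfile.ZipFile(zip_files[0]) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".txt"):
                raw = archive.read(name)
                pages.append(raw.decode("utf-8", errors="replace"))

    return druid, mets_files[0], "\n".join(pages)
```

The METS file is returned as a path rather than parsed, since which metadata fields matter depends on the research question.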

Both the RegLab's and the LitLab's uses of the library materials are non-consumptive. That is to say, they use data extracted from the texts (both teams chose OCR'ed text files only, not the scanned pages) to build computational models. Those models cannot be used to reconstruct the original texts.

What do we know about the corpus?

I have referred to the digitized texts selected for RegLab and LitLab as collections or corpora, but both of those terms are misleading. The selection criteria, based mostly on LCC as described above, are only broadly circumscribed. Corpus construction, which includes refining the selection so that it coheres in ways that are relevant to the research needs, will happen within the research lab.

Algee-Hewitt explained, in a recent meeting of the Critical Data Practices in Humanities Research workshop sponsored by the Stanford Humanities Center, that attention to how corpora are assembled is a growing concern in the digital humanities. The questions concern the different kinds of selection bias in a corpus. Which authors are included and which are not? What makes a corpus representative? And of what?

The work of corpus construction requires first being able to see what is in the group of items. Pruning and trimming follow, along with creating sets and subsets. This work is made easier by access to the library metadata for the items. The metadata makes it possible to distinguish time period, genre, and other characteristics of the works.

But we can imagine that another essential part of library work (selection, curation, and collection development) could come to play an important role in corpus construction, just as it did in selecting all of those 3 million books that were scanned by Google. The image above provides a useful starting point. It shows the distribution of items that were digitized in comparison to the distribution of those that fell under the parameters of the request. What can we learn about those that were digitized by looking at those that were not?