About a month ago we came to the end of our time on the NHPRC STOP AIDS Project processing grant. This post will discuss how we processed the born-digital files we imaged from floppy disks, CDs, and zip disks. During the course of the grant we received approximately one terabyte of data from the donor that we forensically imaged, but did not have time to process as it was outside the scope of the grant. Previous posts in this series discussed the pre-imaging and imaging processes, and can be found here and here.
For our purposes at the Stanford University Libraries Department of Special Collections and University Archives, processing born-digital material involves two steps: filtering out restricted material (including Social Security numbers, credit card numbers, and other sensitive information) and assigning batches of files to record series that correspond with the paper collection. Due to time constraints, we were only able to filter for restricted information. I will talk a little bit about how we would have assigned files to record groups at the end of this post.
Better Filtering through Regular Expressions
We used AccessData Forensic Toolkit (FTK) to filter the 29,423 unique files we imaged for restricted material. FTK comes equipped with two search features that allow us to filter large numbers of records efficiently: the Live Search and the Index Search. The Live Search allows users to input terms or patterns, and then conducts a search of the files for those terms or patterns. The patterns that users can search for are called “regular expressions.” A regular expression is a compact notation for describing a pattern of characters, such as a fixed arrangement of digits and separators. A computer program can then “match” portions of text to a regular expression, and those matches show up as search results. For example, Social Security numbers follow a particular pattern: three digits, a separator (usually a hyphen or a space), two digits, a separator, and four digits. We tell FTK to scan the files for that particular pattern, and FTK returns the files that contain numbers that match it. It’s then up to a human to sort through those returned files to determine which actually contain Social Security numbers and which are false positives. For STOP AIDS, we ran two Live Searches: one to identify Social Security numbers and one to identify credit card numbers. Since there is currently no way to redact information within an individual file, we used FTK’s flagging feature to mark the files containing Social Security numbers and credit card numbers so that they would not be exported for researchers to view.
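To make the idea of pattern matching concrete, here is a minimal Python sketch of the kind of search described above. The two expressions are my own assumptions for illustration, not the expressions that ship with FTK, and the sample text is invented. Note that the second "Social Security number" it finds is really a part number, which is exactly the kind of false positive a human reviewer has to weed out.

```python
import re

# Assumed illustrative patterns; these are NOT FTK's built-in expressions.
# SSN-like: three digits, a hyphen or space, two digits, a hyphen or space, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")
# Card-like: four groups of four digits, optionally separated by hyphens or spaces.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")


def find_candidates(text):
    """Return every substring that looks like an SSN or a card number."""
    return {
        "ssn": SSN_PATTERN.findall(text),
        "card": CARD_PATTERN.findall(text),
    }


# Invented sample text for demonstration only.
sample = "Payroll: 123-45-6789. Part no. 555-12-0000. Card 4111 1111 1111 1111."
hits = find_candidates(sample)
print(hits["ssn"])   # candidates still need human review
print(hits["card"])
```

A pattern can only say that a string has the right shape. It cannot say what the number means, which is why the review step stays manual.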
The Index Search function was useful in further searches for potentially sensitive information. In the case of STOP AIDS, because participation in certain groups or workshops might imply not only sexuality but also serostatus (HIV positive or negative), we wanted to restrict all personal information about workshop participants. We also wanted to restrict sensitive information about STOP AIDS Project employees as well as information about donors. In order to find the files that contained this information, we came up with a list of terms that might indicate the presence of sensitive information (such as payroll, workshop, donor, and pledge) and searched the index of terms that FTK had built when the images were first loaded into the program. As with the Live Search, a human had to sort through all the files that the searches returned in order to separate actual restricted material from false positives.
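The difference between searching an index and scanning every file can be sketched in a few lines of Python. This is only a conceptual illustration: FTK's actual index is far more sophisticated, and the function names (`build_index`, `search_index`) and the sample files here are hypothetical.

```python
import re
from collections import defaultdict


def build_index(files):
    """Map each lowercased word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index


def search_index(index, terms):
    """Return {file name: sorted list of matched terms} as a review queue."""
    results = defaultdict(list)
    for term in terms:
        # .get() looks the term up without adding new keys to the index.
        for name in index.get(term.lower(), ()):
            results[name].append(term)
    return {name: sorted(matched) for name, matched in results.items()}


# Invented sample files for demonstration only.
files = {
    "memo1.txt": "Payroll schedule for staff, March",
    "flyer.txt": "Join our workshop on community outreach",
    "notes.txt": "Meeting agenda and room booking",
    "gifts.txt": "Donor pledge totals; donor thank-you letters",
}
index = build_index(files)
review_queue = search_index(index, ["payroll", "workshop", "donor", "pledge"])
print(review_queue)
```

The index is built once, so each later term lookup is fast. The flyer in this example shows why review is still needed: it mentions a workshop but contains no participant information.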
Bookmarks and Record Groups
If we had had the time, we would have used FTK’s bookmark function in conjunction with the Index Search to arrange the files into folders corresponding to the series structure of the textual portion of the collection. Bookmarks are different from tags because they are hierarchical; one can apply the “Series 1” bookmark to a large number of files, and the “Subseries 1” bookmark to a subset of files within that group. To perform this arrangement, our workflow would have been to come up with a list of terms that would indicate the presence of material associated with a particular series in a file, and then use the Index Search function to call up, sort through, and bookmark those files.
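The hierarchical relationship between a series bookmark and a subseries bookmark can be modeled roughly as follows. This is a hypothetical sketch of the concept, not a description of how FTK stores bookmarks; the `BookmarkTree` class and the file names are invented for illustration.

```python
from collections import defaultdict


class BookmarkTree:
    """Hypothetical model of hierarchical bookmarks keyed by path tuples."""

    def __init__(self):
        self._files = defaultdict(set)

    def add(self, path, file_names):
        """Bookmark files at `path`, e.g. ("Series 1", "Subseries 1").

        Files are also recorded at every ancestor level, so anything in a
        subseries is automatically part of its parent series.
        """
        for depth in range(1, len(path) + 1):
            self._files[path[:depth]].update(file_names)

    def files(self, path):
        """Return the sorted file names bookmarked at or below `path`."""
        return sorted(self._files.get(tuple(path), ()))


tree = BookmarkTree()
tree.add(("Series 1",), ["a.doc", "c.doc"])
tree.add(("Series 1", "Subseries 1"), ["b.doc"])

print(tree.files(("Series 1",)))
print(tree.files(("Series 1", "Subseries 1")))
```

A flat tag cannot express that "Subseries 1" sits inside "Series 1", which is what makes the hierarchical version a better fit for an archival series structure.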
The STOP AIDS Project has been an important milestone in processing born-digital material at Stanford University Libraries. We have successfully imaged all the born-digital material in the collection and in the process have tested and refined workflows that were developed during the AIMS project. We’ve also begun to establish a baseline for estimating time and effort to process future born-digital collections. The processing of the STOP AIDS Project papers has ended, but there is a world of digital material waiting to be imaged and processed. We have only just begun.