Hopkins Marine Station Student Papers
From 1963 to 2011, Hopkins Marine Station offered Biology courses 175H or 176H. Students in these courses developed and conducted research projects in the area around the station, and each project culminated in a final paper. Copies of these papers were deposited in the station’s library, and we now have over 750 undergraduate research papers in our collection. These student research reports contain observations of environmental conditions, species, and populations recorded over a span of nearly 60 years, and they provide a valuable corpus for historical ecology research.
Exploring computational methods for extracting biodiversity data from student papers
Computational methods for text analysis are evolving rapidly, and initial testing on a subset of the student papers corpus shows significant potential. Almost without exception, the student reports were typed, which supports effective optical character recognition (OCR) on digital surrogates. Plain-text versions of the reports can then be analyzed with existing Natural Language Processing (NLP) tools. For example, spaCy is a Python library for NLP that “excels at large-scale information extraction tasks.” Our partners in Stanford’s Center for Interdisciplinary Digital Research used spaCy to automate the identification of genus and species names in student reports using a process called named entity recognition (NER). This process compared the complete World Register of Marine Species (a list of nearly 600,000 species) to the text of student papers to identify named species by date and location (e.g., the anemone Anthopleura elegantissima at Hopkins Marine Station, June 1959).
An observation of an organism at a given place on a given date is called a “species occurrence.” Species occurrence data form the foundation for biodiversity research. The importance of species occurrence records is evident in the Global Biodiversity Information Facility (GBIF), an international research network that currently holds nearly 1.4 billion occurrence records in an open database. While the size of the GBIF database is remarkable, a closer look at the data reveals a serious limitation: when GBIF species occurrence records are viewed over time, the number of observations drops sharply the further back in time you look. Sources of observational data that can fill these gaps in the record are indispensable. This is one area where the student research papers in our collections are poised to make an important contribution, if we can find a way to extract the relevant data from the physical corpus.
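To make the idea of turning report text into species occurrence records more concrete, here is a minimal sketch in plain Python. It is not the spaCy pipeline our partners built, and it is not true named entity recognition: it only looks up candidate genus and species pairs in a reference list. The `Occurrence` class, the two-name `KNOWN_SPECIES` set (a stand-in for a full register of names), the `extract_occurrences` function, and the sample sentence are all invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """One species occurrence: an organism at a place on a date."""
    scientific_name: str
    location: str
    date: str


# Hypothetical stand-in for a reference checklist. A real run would load
# a full register of names from a file instead of this two-name set.
KNOWN_SPECIES = {
    "anthopleura elegantissima",
    "pisaster ochraceus",
}

# Candidate binomial: a capitalized genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")


def extract_occurrences(text, location, date):
    """Return one Occurrence per distinct known species name found in text."""
    found = []
    seen = set()
    for match in BINOMIAL.finditer(text):
        name = f"{match.group(1)} {match.group(2)}"
        # Keep only candidates that appear in the reference list.
        if name.lower() in KNOWN_SPECIES and name not in seen:
            seen.add(name)
            found.append(Occurrence(name, location, date))
    return found


# Invented sample sentence, standing in for OCR text from one report.
sample = (
    "Dense aggregations of Anthopleura elegantissima were observed "
    "in the mid-intertidal zone, often near Pisaster ochraceus."
)
records = extract_occurrences(sample, "Hopkins Marine Station", "1959-06")
for record in records:
    print(record)
```

In this sketch the location and date are supplied per report rather than read from the text, which is a simplifying assumption; in practice they would need to come from the report's metadata or from further text analysis.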
If you are interested in working with the Hopkins Marine Station student papers, please contact Amanda!