2. GROBID PDF-to-TEI XML Workflow

GROBID (GeneRation Of BIbliographic Data) is a machine learning library that parses raw documents such as PDFs and produces structured TEI XML files containing the full text and associated metadata.

2.1. Running GROBID as a Docker Container

The easiest way to run GROBID is with Docker, using the lightweight CRF-only image. The command below maps the service port 8070 and the admin port 8071 to the host.

docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:0.6.1
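
Before sending documents, it can help to confirm that the container is answering. The following is a minimal sketch, assuming the service is reachable on localhost port 8070 as mapped above and exposes GROBID's `/api/isalive` health-check endpoint. The helper name `grobid_is_alive` is hypothetical and is not part of GROBID or its Python client.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the GROBID service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False
```

Calling `grobid_is_alive()` with no arguments checks the default local address; pass a different `base_url` if the container is mapped to another host or port.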

2.2. Setup for Running

%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pathlib
import sys
import datetime
# Assumes that the grobid Python Client (https://github.com/kermitt2/grobid_client_python) has 
# been cloned in the same directory as SPOC
import grobid_client as grobid
hopkins_pdfs = pathlib.Path("/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_pdf")
hopkins_tei = pathlib.Path("/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_tei")
client = grobid.grobid_client(config_path="../config/grobid.json")
GROBID server is up and running
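def pdf2tei(pdf_path: pathlib.Path, tei_path: pathlib.Path):
    """Convert every PDF under pdf_path to TEI XML files written to tei_path."""
    start = datetime.datetime.utcnow()
    print(f"{start} conversion of PDFs to TEI XML")
    # Use the function arguments rather than the module-level paths;
    # n sets how many documents are processed concurrently
    client.process("processFulltextDocument", pdf_path.as_posix(), output=tei_path.as_posix(), n=3)
    end = datetime.datetime.utcnow()
    # Elapsed time is reported in minutes
    print(f"Finished at {end} total time {(end-start).seconds / 60.}")
pdf2tei(hopkins_pdfs, hopkins_tei)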
2021-03-02 18:54:39.406047 conversion of PDFs to TEI XML
Finished at 2021-03-02 19:16:35.501890 total time 21.933333333333334
# Collect every PDF that has no corresponding TEI XML output
missing = []
for row in hopkins_pdfs.iterdir():
    xml_tei = hopkins_tei/f"{row.stem}.tei.xml"
    if not xml_tei.exists():
        missing.append(row)
PosixPath('/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_pdf/fhl_2011_Broell_19896.pdf')
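
The check in the loop above can also be wrapped in a reusable function. This is a minimal sketch, assuming each result is written as `<PDF stem>.tei.xml`, which is the naming the loop relies on. The function name `find_missing_tei` and the restriction to regular files with a `.pdf` extension are illustrative assumptions and not part of the original notebook.

```python
import pathlib
from typing import List


def find_missing_tei(pdf_dir: pathlib.Path, tei_dir: pathlib.Path) -> List[pathlib.Path]:
    """Return the PDFs in pdf_dir that have no matching .tei.xml file in tei_dir."""
    missing = []
    for pdf in sorted(pdf_dir.iterdir()):
        # Skip anything that is not a regular file with a .pdf extension
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        if not (tei_dir / f"{pdf.stem}.tei.xml").exists():
            missing.append(pdf)
    return missing
```

With the paths defined earlier, `find_missing_tei(hopkins_pdfs, hopkins_tei)` would return the PDFs that still need attention, sorted by name.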
missing_filenames = [r.name for r in missing]
['bml_Covello 2011_Summer Seq Three.pdf',
 'fhl_2011_van’t Hul_19871.pdf',