# GROBID PDF-to-TEI XML Workflow
[GROBID](https://grobid.readthedocs.io/en/latest/), short for *GeneRation Of BIbliographic Data*, is a machine learning
library that parses raw documents such as PDFs into structured TEI XML files containing the full text and
related metadata.

## Running GROBID as a Docker Container
The easiest way to run GROBID is with Docker, using the lightweight CRF-only image.

`docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:0.6.1`
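
Once the container is running, it can be useful to confirm that the service answers before starting a long batch. The sketch below polls GROBID's `/api/isalive` endpoint with the standard library only. The helper names `isalive_url` and `grobid_is_alive` are hypothetical, and the host and port are assumptions taken from the `-p 8070:8070` mapping in the command above.

```python
import urllib.error
import urllib.request


def isalive_url(host: str = "localhost", port: int = 8070) -> str:
    """Build the health-check URL (host and port are assumed defaults)."""
    return f"http://{host}:{port}/api/isalive"


def grobid_is_alive(host: str = "localhost", port: int = 8070, timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(isalive_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running"
        return False
```

Calling `grobid_is_alive()` before the conversion step gives an early, explicit failure instead of a batch that silently produces no output.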

## Setup for Running 

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pathlib
import sys
import datetime
# Assumes the GROBID Python client (https://github.com/kermitt2/grobid_client_python) has
# been cloned into the same directory as SPOC
sys.path.append("../../grobid_client_python/")
import grobid_client as grobid

In [3]:
hopkins_pdfs = pathlib.Path("/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_pdf")
hopkins_tei = pathlib.Path("/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_tei")

In [4]:
client = grobid.grobid_client(config_path="../config/grobid.json")

GROBID server is up and running


In [5]:
def pdf2tei(pdf_path: pathlib.Path, tei_path: pathlib.Path):
    """Convert every PDF under pdf_path to TEI XML files written to tei_path."""
    start = datetime.datetime.utcnow()
    print(f"{start} conversion of PDFs to TEI XML")
    # Use the function arguments, not the module-level paths; n sets the number of concurrent requests
    client.process("processFulltextDocument", pdf_path.as_posix(), output=tei_path.as_posix(), n=3)
    end = datetime.datetime.utcnow()
    print(f"Finished at {end} total time {(end-start).seconds / 60.}")

In [6]:
pdf2tei(hopkins_pdfs, hopkins_tei)

2021-03-02 18:54:39.406047 conversion of PDFs to TEI XML
Finished at 2021-03-02 19:16:35.501890 total time 21.933333333333334
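
The total printed above is in minutes and is derived from `timedelta.seconds`, which ignores the microsecond component (and any whole days, for runs longer than 24 hours). As a small worked check, the sketch below recomputes the elapsed time from the two timestamps in the output and compares it with `total_seconds()`; the variable names are illustrative only.

```python
import datetime

# Timestamps copied from the output of the conversion run
start = datetime.datetime(2021, 3, 2, 18, 54, 39, 406047)
end = datetime.datetime(2021, 3, 2, 19, 16, 35, 501890)
elapsed = end - start

# .seconds holds only the whole-second part of the delta (within one day)
minutes_truncated = elapsed.seconds / 60
# total_seconds() includes days and microseconds
minutes_exact = elapsed.total_seconds() / 60

print(f"{minutes_truncated:.4f} vs {minutes_exact:.4f} minutes")
```

For a run of about 22 minutes the difference is negligible, but `total_seconds()` is the safer choice if a batch could run for more than a day.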


In [7]:
1724-1262

462

In [7]:
# Collect every file in the PDF directory that has no matching TEI XML output
missing = []
for row in hopkins_pdfs.iterdir():
    xml_tei = hopkins_tei/f"{row.stem}.tei.xml"
    if not xml_tei.exists():
        missing.append(row)
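
The same check can be wrapped in a reusable function. The sketch below is one possible variant, not the notebook's own code: `find_missing_tei` is a hypothetical helper, and skipping entries without a `.pdf` suffix is an added assumption so that stray files (such as `.DS_Store`) are not counted as failed conversions. The temporary directories exist only to make the example self-contained.

```python
import pathlib
import tempfile


def find_missing_tei(pdf_dir: pathlib.Path, tei_dir: pathlib.Path) -> list:
    """Return the PDFs in pdf_dir that have no <stem>.tei.xml file in tei_dir."""
    missing = []
    for pdf in sorted(pdf_dir.iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue  # assumption: only .pdf files are conversion candidates
        if not (tei_dir / f"{pdf.stem}.tei.xml").exists():
            missing.append(pdf)
    return missing


# Small demonstration on throwaway directories
with tempfile.TemporaryDirectory() as tmp:
    pdf_dir = pathlib.Path(tmp) / "pdf"
    tei_dir = pathlib.Path(tmp) / "tei"
    pdf_dir.mkdir()
    tei_dir.mkdir()
    for name in ("a.pdf", "b.pdf", ".DS_Store"):
        (pdf_dir / name).touch()
    (tei_dir / "a.tei.xml").touch()
    demo_missing = [p.name for p in find_missing_tei(pdf_dir, tei_dir)]
    print(demo_missing)
```

In the notebook, the equivalent call would be `find_missing_tei(hopkins_pdfs, hopkins_tei)`.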

In [8]:
len(missing)

88

In [9]:
missing[0]

PosixPath('/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_pdf/fhl_2011_Broell_19896.pdf')

In [11]:
missing_filenames = [r.name for r in missing]

In [13]:
sorted(missing_filenames)

['bml_Covello 2011_Summer Seq Three.pdf',
 'carl_1986_SpringIndex.pdf',
 'fhl_2011_Blackstone_19866.pdf',
 'fhl_2011_Bockmon_19854.pdf',
 'fhl_2011_Boeck_19878.pdf',
 'fhl_2011_Bourdillon_19873.pdf',
 'fhl_2011_Brezicha_19888.pdf',
 'fhl_2011_Broell_19896.pdf',
 'fhl_2011_Challener_19855.pdf',
 'fhl_2011_Cox_19895.pdf',
 'fhl_2011_Enzor_19864.pdf',
 'fhl_2011_Ewings_19889.pdf',
 'fhl_2011_Follis_19879.pdf',
 'fhl_2011_Gilmore_19862.pdf',
 'fhl_2011_Gordon_19881.pdf',
 'fhl_2011_Guenther_19863.pdf',
 'fhl_2011_Ho_19837.pdf',
 'fhl_2011_Hoang_19874.pdf',
 'fhl_2011_Johnson_19836.pdf',
 'fhl_2011_Johnson_19872.pdf',
 'fhl_2011_Kane_19884.pdf',
 'fhl_2011_Kapsenburg_19859.pdf',
 'fhl_2011_Kim_19880.pdf',
 'fhl_2011_Krauszer_19843.pdf',
 'fhl_2011_Lee_19883.pdf',
 'fhl_2011_Little_19825.pdf',
 'fhl_2011_Lucas_19894.pdf',
 'fhl_2011_Magley_19824.pdf',
 'fhl_2011_Meyer_19867.pdf',
 'fhl_2011_Møller_19891.pdf',
 'fhl_2011_Navratil_19826.pdf',
 'fhl_2011_Navratil_19868.pdf',
 'fhl_2011_Newcomb_