2. GROBID PDF-to-TEI XML Workflow

GROBID (GeneRation Of BIbliographic Data) is a machine learning library that parses raw documents such as PDFs and produces structured TEI XML files containing the full text and its associated metadata.
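To give a feel for what "structured TEI XML" means in practice, the following is a minimal sketch that reads a title and section headings out of a TEI document with the Python standard library. The sample document is hand-written and heavily trimmed for illustration (real GROBID output is far larger), and the helper name `summarize_tei` is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, trimmed sample for illustration only; not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">A hypothetical paper title</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
      <div><head>Methods</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def summarize_tei(xml_text: str) -> dict:
    """Return the main title, section headings, and paragraph count of a TEI document."""
    root = ET.fromstring(xml_text)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    heads = root.findall(".//tei:body//tei:head", TEI_NS)
    paragraphs = root.findall(".//tei:body//tei:p", TEI_NS)
    return {
        "title": title.text if title is not None else None,
        "section_heads": [head.text for head in heads],
        "paragraph_count": len(paragraphs),
    }


print(summarize_tei(SAMPLE_TEI))
```

The same function could be pointed at a converted file by passing it the file's text, for example `summarize_tei(path.read_text(encoding="utf-8"))`.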

2.1. Running GROBID as a Docker Container

The easiest way to run GROBID is with Docker, using the lightweight CRF-only image.

docker run -t --rm --init -p 8070:8070 -p 8071:8071 lfoppiano/grobid:0.6.1
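Once the container is started, it can be useful to confirm that the service answers on port 8070 before sending any documents. The sketch below polls GROBID's `/api/isalive` endpoint using only the standard library. The function name `is_grobid_alive` and the `opener` parameter (there only so the check can be exercised without a live server) are assumptions made for this example.

```python
import urllib.error
import urllib.request


def is_grobid_alive(base_url="http://localhost:8070", timeout=5.0,
                    opener=urllib.request.urlopen):
    """Return True if GET <base_url>/api/isalive answers with 'true'."""
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8").strip().lower()
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat as "not up".
        return False
    return body == "true"


if __name__ == "__main__":
    print("GROBID reachable:", is_grobid_alive())
```

The base URL assumes the port mapping from the `docker run` command above; adjust it if the container is published on a different host or port.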

2.2. Setup for Running

%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pathlib
import sys
import datetime
# Assumes that the grobid Python Client (https://github.com/kermitt2/grobid_client_python) has 
# been cloned in the same directory as SPOC
sys.path.append("../../grobid_client_python/") 
import grobid_client as grobid
hopkins_pdfs = pathlib.Path("/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_pdf")
hopkins_tei = pathlib.Path("/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_tei")
client = grobid.grobid_client(config_path="../config/grobid.json")
GROBID server is up and running
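The client above reads its connection settings from `../config/grobid.json`, whose contents are not shown here. As a rough sketch only, the snippet below writes a configuration of the kind the GROBID Python client's sample `config.json` used around the time of this notebook. The key names (`grobid_server`, `grobid_port`, `batch_size`, `sleep_time`) and their values are assumptions and differ between client versions, so check them against the `config.json` shipped with your clone of the client. The helper `write_grobid_config` is hypothetical.

```python
import json
import pathlib


def write_grobid_config(path, server="localhost", port="8070"):
    """Write a client configuration file and return the dict that was written."""
    # Assumed keys; verify against the config.json of the client version in use.
    config = {
        "grobid_server": server,
        "grobid_port": port,
        "batch_size": 1000,
        "sleep_time": 5,
    }
    path = pathlib.Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

A call such as `write_grobid_config("../config/grobid.json")` would produce a file at the location the notebook expects.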
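def pdf2tei(pdf_path: pathlib.Path, tei_path: pathlib.Path):
    """Convert the PDFs under pdf_path to TEI XML files written to tei_path."""
    start = datetime.datetime.utcnow()
    print(f"{start} conversion of PDFs to TEI XML")
    # Use the function arguments rather than the module-level paths; n sets the client's concurrency
    client.process("processFulltextDocument", pdf_path.as_posix(), output=tei_path.as_posix(), n=3)
    end = datetime.datetime.utcnow()
    # Elapsed time is reported in minutes
    print(f"Finished at {end} total time {(end-start).seconds / 60.}")
pdf2tei(hopkins_pdfs, hopkins_tei)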
2021-03-02 18:54:39.406047 conversion of PDFs to TEI XML
Finished at 2021-03-02 19:16:35.501890 total time 21.933333333333334
1724-1262
462
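# Collect the PDFs that have no corresponding TEI XML output
missing = []
for row in hopkins_pdfs.iterdir():
    # Expected output name: <PDF stem>.tei.xml
    xml_tei = hopkins_tei/f"{row.stem}.tei.xml"
    if not xml_tei.exists():
        missing.append(row)
len(missing)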
88
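The check above walks every entry in the PDF directory, so hidden files or non-PDF files in a synced folder would also be counted as missing. A slightly more defensive variant, sketched here as a reusable function, skips anything that is not a `.pdf` and returns the results in sorted order. The name `find_missing_tei` is hypothetical, and the `<stem>.tei.xml` naming is taken from the loop above.

```python
import pathlib


def find_missing_tei(pdf_dir: pathlib.Path, tei_dir: pathlib.Path) -> list:
    """Return the PDFs in pdf_dir that have no <stem>.tei.xml file in tei_dir."""
    missing = []
    for pdf in sorted(pdf_dir.iterdir()):
        # Skip hidden files (e.g. .DS_Store) and anything that is not a PDF.
        if pdf.name.startswith(".") or pdf.suffix.lower() != ".pdf":
            continue
        if not (tei_dir / f"{pdf.stem}.tei.xml").exists():
            missing.append(pdf)
    return missing
```

With the paths defined earlier this would be called as `find_missing_tei(hopkins_pdfs, hopkins_tei)`; its count may differ from the one above if the PDF folder contains non-PDF entries.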
missing[0]
PosixPath('/Users/jpnelson/Google Drive/Shared drives/SUL AI 2020-2021/Project - Species Occurrences/papers_pdf/fhl_2011_Broell_19896.pdf')
missing_filenames = [r.name for r in missing]
sorted(missing_filenames)
['bml_Covello 2011_Summer Seq Three.pdf',
 'carl_1986_SpringIndex.pdf',
 'fhl_2011_Blackstone_19866.pdf',
 'fhl_2011_Bockmon_19854.pdf',
 'fhl_2011_Boeck_19878.pdf',
 'fhl_2011_Bourdillon_19873.pdf',
 'fhl_2011_Brezicha_19888.pdf',
 'fhl_2011_Broell_19896.pdf',
 'fhl_2011_Challener_19855.pdf',
 'fhl_2011_Cox_19895.pdf',
 'fhl_2011_Enzor_19864.pdf',
 'fhl_2011_Ewings_19889.pdf',
 'fhl_2011_Follis_19879.pdf',
 'fhl_2011_Gilmore_19862.pdf',
 'fhl_2011_Gordon_19881.pdf',
 'fhl_2011_Guenther_19863.pdf',
 'fhl_2011_Ho_19837.pdf',
 'fhl_2011_Hoang_19874.pdf',
 'fhl_2011_Johnson_19836.pdf',
 'fhl_2011_Johnson_19872.pdf',
 'fhl_2011_Kane_19884.pdf',
 'fhl_2011_Kapsenburg_19859.pdf',
 'fhl_2011_Kim_19880.pdf',
 'fhl_2011_Krauszer_19843.pdf',
 'fhl_2011_Lee_19883.pdf',
 'fhl_2011_Little_19825.pdf',
 'fhl_2011_Lucas_19894.pdf',
 'fhl_2011_Magley_19824.pdf',
 'fhl_2011_Meyer_19867.pdf',
 'fhl_2011_Møller_19891.pdf',
 'fhl_2011_Navratil_19826.pdf',
 'fhl_2011_Navratil_19868.pdf',
 'fhl_2011_Newcomb_19860.pdf',
 'fhl_2011_Olmstead_19845.pdf',
 'fhl_2011_Paxton_19890.pdf',
 'fhl_2011_Phillips_19858.pdf',
 'fhl_2011_Pietsch_19856.pdf',
 'fhl_2011_Rickards_19861.pdf',
 'fhl_2011_Shaffer_19877.pdf',
 'fhl_2011_Singer_19887.pdf',
 'fhl_2011_Smith_19886.pdf',
 'fhl_2011_Stelter_19835.pdf',
 'fhl_2011_Suzumura_19844.pdf',
 'fhl_2011_Taylor_19892.pdf',
 'fhl_2011_Thomas_19869.pdf',
 'fhl_2011_Ulmke_19875.pdf',
 'fhl_2011_Vancil_19876.pdf',
 'fhl_2011_Vaughn_19857.pdf',
 'fhl_2011_Walls_19870.pdf',
 'fhl_2011_Wilkins_19865.pdf',
 'fhl_2011_Witt_19885.pdf',
 'fhl_2011_Witt_25966.pdf',
 'fhl_2011_van’t Hul_19871.pdf',
 'fhl_2012_Albrecht_19808.pdf',
 'fhl_2012_Bruders_19793.pdf',
 'fhl_2012_Churches_19794.pdf',
 'fhl_2012_Conery_19809.pdf',
 'fhl_2012_Davies_19801.pdf',
 'fhl_2012_Dunnell_19802.pdf',
 'fhl_2012_Fodor_19795.pdf',
 'fhl_2012_Ge_19803.pdf',
 'fhl_2012_Girardo_19796.pdf',
 'fhl_2012_Jacobsen-Watts_19811.pdf',
 'fhl_2012_Kareiva_19797.pdf',
 'fhl_2012_Kreis_19812.pdf',
 'fhl_2012_Kulesza_19813.pdf',
 'fhl_2012_Marks_19804.pdf',
 'fhl_2012_Oxborrow_19814.pdf',
 'fhl_2012_Sanford_19798.pdf',
 'fhl_2012_Schreck_19805.pdf',
 'fhl_2012_Stull_19815.pdf',
 'fhl_2012_Swore_19799.pdf',
 'fhl_2012_Townsend_19806.pdf',
 'fhl_2012_Twomey_19807.pdf',
 'fhl_2012_Voon_19816.pdf',
 'fhl_2012_Wang_19817.pdf',
 'osu_20200612152001862.pdf',
 'osu_20200612152044760.pdf',
 'osu_20200612152132332.pdf',
 'osu_20200612152240497.pdf',
 'osu_20200612152633039.pdf',
 'osu_20200612152822066.pdf',
 'osu_20200612153112738.pdf',
 'osu_20200612153337143.pdf',
 'osu_20200612153509246.pdf',
 'osu_20200612153616194.pdf',
 'osu_20200612153848020.pdf',
 'osu_20200612154034430.pdf']
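To see whether the failures cluster in particular batches, the file names can be grouped by their leading prefix. This is a minimal sketch that assumes the text before the first underscore identifies the batch a file came from, which is an interpretation of the naming pattern rather than something stated in the notebook. The helper `count_by_prefix` is hypothetical, and the sample uses a handful of names from the list above.

```python
from collections import Counter


def count_by_prefix(filenames):
    """Count file names by the text before their first underscore."""
    return Counter(name.split("_", 1)[0] for name in filenames)


# A few names taken from the list of missing files.
sample = [
    "bml_Covello 2011_Summer Seq Three.pdf",
    "carl_1986_SpringIndex.pdf",
    "fhl_2011_Blackstone_19866.pdf",
    "fhl_2012_Albrecht_19808.pdf",
    "osu_20200612152001862.pdf",
]
counts = count_by_prefix(sample)
print(counts.most_common())
```

Applied to the full list with `count_by_prefix(missing_filenames)`, the 88 names break down as 74 with the `fhl` prefix, 12 with `osu`, and one each with `bml` and `carl`.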