Transfer Earthworks indexing process to Earthworks codebase and VMs

Status: proposed
Decider(s):
- Access team
Date(s):
- Proposed: 2025-08-14

Context and Problem Statement

This proposal moves Earthworks-related code and resources out of the searchworks_traject_indexer stack and into the Earthworks codebase itself.

The resulting implementation would:

- Move the traject config for indexing geo records from searchworks_traject_indexer to Earthworks
- Halt and remove any Earthworks indexing jobs running on searchworks_traject_indexer prod and stage
- Re-implement some indexing jobs as ActiveJob jobs to be run on the Earthworks worker boxes

Decision Drivers

- The codebase called searchworks_traject_indexer also handles Earthworks indexing, which its name does not suggest
- Earthworks already does some of its indexing (see “secondary indexing” below) on its own, separate from searchworks_traject_indexer
  - Consolidating indexing in one place would help, because secondary indexing is currently manual and easy to forget
- Earthworks’s stack is already provisioned with worker boxes designed to handle indexing tasks so that the web servers don’t slow down
- Earthworks’s indexer is much simpler than Searchworks’s (nothing from FOLIO, already uses Cocina instead of MODS, does not need to be particularly fast or parallel)
- Changes made in searchworks_traject_indexer to help Earthworks users sometimes break FOLIO indexing

Considered Options

There is a choice to be made about the “primary indexing” process (records being released from SDR to be indexed in Earthworks):

1. Run a background process that continuously monitors purl_fetcher’s queue, streaming in new records as they are released and indexing them.
2. Periodically invoke a job that asks purl_fetcher for any items updated within the last <time period> and reindexes them.

Option 1 above is more or less the current implementation in searchworks_traject_indexer. It’s assumed to be fast, but might be overkill for Earthworks.
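
As a rough illustration of the streaming shape (not the actual searchworks_traject_indexer code), the sketch below uses Ruby's built-in Queue as a stand-in for purl_fetcher's queue. The class name StreamingIndexer, the :stop sentinel, the indexer lambda, and the identifiers are all hypothetical.

```ruby
# Streaming-style worker: blocks on a queue and indexes each record as it arrives.
# Ruby's in-process Queue is only a stand-in for the real release queue.
class StreamingIndexer
  def initialize(queue:, indexer:)
    @queue = queue
    @indexer = indexer
  end

  # Runs until a :stop sentinel is received; returns the number of records indexed.
  def run
    count = 0
    loop do
      message = @queue.pop # blocks until something is available
      break if message == :stop

      @indexer.call(message)
      count += 1
    end
    count
  end
end

# Usage with made-up identifiers and an indexer that just collects what it sees.
queue = Queue.new
indexed = []
worker = Thread.new do
  StreamingIndexer.new(queue: queue, indexer: ->(record) { indexed << record }).run
end

queue << "aa111bb2222"
queue << "cc333dd4444"
queue << :stop
count = worker.value # waits for the worker thread and returns run's result
```

The point of the sketch is the operational cost of this option: a long-lived process has to stay up and be supervised, which is part of why it may be more than Earthworks needs.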

Option 2 above is more or less the current implementation for DataWorks’s indexer (dataworks-etl). It may be simpler to implement.
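
A minimal sketch of the polling shape, assuming a source that can return items updated since a given time. The names PeriodicReindex, ReleasedItem, source, and indexer are hypothetical, as are the identifiers; the real purl_fetcher API and the dataworks-etl implementation are not shown here. In the Earthworks app this logic would presumably live in an ActiveJob subclass and be invoked on a schedule.

```ruby
# Hypothetical record shape; a real purl_fetcher response would differ.
ReleasedItem = Struct.new(:druid, :updated_at, keyword_init: true)

# Polling-style indexer: each run asks a source for items changed inside a
# lookback window and hands each one to an indexer.
class PeriodicReindex
  def initialize(source:, indexer:, lookback_seconds: 3600)
    @source = source     # callable: cutoff Time -> array of items
    @indexer = indexer   # callable: item -> anything
    @lookback_seconds = lookback_seconds
  end

  # Returns the identifiers that were reindexed during this run.
  def perform(now: Time.now.utc)
    cutoff = now - @lookback_seconds
    changed = @source.call(cutoff)
    changed.each { |item| @indexer.call(item) }
    changed.map(&:druid)
  end
end

# Usage with an in-memory stand-in for the upstream service.
items = [
  ReleasedItem.new(druid: "aa111bb2222", updated_at: Time.utc(2025, 8, 14, 11, 30)),
  ReleasedItem.new(druid: "cc333dd4444", updated_at: Time.utc(2025, 8, 14, 9, 0))
]
source = ->(cutoff) { items.select { |item| item.updated_at >= cutoff } }
reindexed = []
job = PeriodicReindex.new(source: source, indexer: ->(item) { reindexed << item.druid })
result = job.perform(now: Time.utc(2025, 8, 14, 12, 0))
```

One design consideration with this shape: if a run is skipped or fails, items updated before the next run's cutoff are missed unless the lookback window overlaps the schedule interval or the time of the last successful run is persisted.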

Note that “secondary indexing” (records coming from external sources, like OpenGeoMetadata), which is currently a manual process, would likely be implemented as a scheduled ActiveJob process (similar to Option 2 above) as part of this proposal.

Positive Consequences

- Earthworks-related code becomes centralized in the Earthworks repo instead of spread across repositories
- Indexing jobs for Earthworks are easier for developers who don’t often touch Earthworks to understand, debug, and invoke
- It’s easier to add or modify indexing jobs for Earthworks without potentially affecting Searchworks indexing

Negative Consequences

- New code and dependencies, possibly including traject, will be added to the Earthworks repo

Risks

- The existing indexer VMs are powerful, so indexing could slow down once it no longer runs on those machines or becomes less parallel
- Monitoring (e.g. Honeybadger) will need to be carefully set up to ensure the new jobs are running as designed