Extract Technical Metadata on a Per-File Basis

Status: drafted
Decider(s):
- Andrew Berger
- Vivian Wong
- Infrastructure Team
  - Justin Coyne
  - Mike Giarlo
  - Peter Mangiafico
  - Jeremy Nelson
  - Justin Littman
  - Naomi Dushay
  - John Martin
  - Aaron Collier
Date(s):
- drafted: 2019-10-29
- …

Context and Problem Statement

Currently, we extract technical metadata per object, running one extraction job per file, serially. For objects with many files, this takes a problematically long time, blocks other objects from accessioning, and complicates restarts, which must begin again and process the entire object.

NOTE: Needs discussion: Fedora 3 does not support concurrent writes to the same datastream, so we can either split out filesets as first-class objects in the F3 data model or use temporary caching to generate a consolidated techMD datastream.
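
As a rough illustration of the temporary-caching alternative, the sketch below has each per-file job write its result to its own cache entry, with a separate step that assembles the consolidated record only once every expected file is present. The cache layout, the JSON format, and the names cache_file_result and consolidate are all assumptions made for illustration, not part of any existing service. It also assumes flat filenames with no path separators.

```python
import json
from pathlib import Path
from typing import Optional


def cache_file_result(cache_dir: Path, object_id: str, result: dict) -> Path:
    """Persist one file's metadata as its own cache entry (no shared writes)."""
    object_dir = cache_dir / object_id
    object_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical layout: <cache_dir>/<object_id>/<filename>.json
    entry = object_dir / f"{result['filename']}.json"
    entry.write_text(json.dumps(result))
    return entry


def consolidate(cache_dir: Path, object_id: str, expected: set) -> Optional[list]:
    """Return all cached results once every expected file is present, else None."""
    object_dir = cache_dir / object_id
    entries = sorted(object_dir.glob("*.json")) if object_dir.is_dir() else []
    results = [json.loads(entry.read_text()) for entry in entries]
    if {r["filename"] for r in results} != expected:
        return None  # some files are still pending; the caller retries later
    return results  # the caller would then perform the single datastream write
```

Under these assumptions only the consolidation step ever touches the techMD datastream, so there is exactly one writer per object. A restarted job could also skip any file whose cache entry already exists, though that check is not shown here.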

Decision Drivers

Blocker for Google Books project
Slows down accessioning process

Considered Options

Do nothing
Extract metadata on a per-file basis rather than on a per-object basis to benefit from parallelism
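
To make the second option concrete, here is a minimal sketch of per-file extraction fanned out over a worker pool. The function extract_file_metadata is a hypothetical stand-in that records only size and a SHA-256 digest; the real extractor, its output format, and the choice of a thread pool over a job queue are all assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_file_metadata(path: Path) -> dict:
    """Hypothetical stand-in for a real extractor: size and SHA-256 only."""
    data = path.read_bytes()
    return {
        "filename": path.name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def extract_object_metadata(paths: list, max_workers: int = 4) -> list:
    """Run one extraction job per file; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even when jobs finish out of order.
        return list(pool.map(extract_file_metadata, paths))
```

Because each file is an independent unit of work, a failure affects only that file's job rather than the whole object, which is the property that would make restarts cheaper.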

Decision Outcome

TBD!

Positive Consequences

[e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
…

Negative Consequences

[e.g., compromising quality attribute, follow-up decisions required, …]
…

Pros and Cons of the Options

Do nothing

[example

description

pointer to more information

…]

Good, because [argument a]
Good, because [argument b]
Bad, because [argument c]
…

Extract metadata on a per-file basis

[example

description

pointer to more information

…]

Good, because [argument a]
Good, because [argument b]
Bad, because [argument c]
…