Extract Technical Metadata on a Per-File Basis

  • Status: drafted
  • Decider(s):
    • Andrew Berger
    • Vivian Wong
    • Infrastructure Team
      • Justin Coyne
      • Mike Giarlo
      • Peter Mangiafico
      • Jeremy Nelson
      • Justin Littman
      • Naomi Dushay
      • John Martin
      • Aaron Collier
  • Date(s):
    • drafted: 2019-10-29

Context and Problem Statement

Currently, we extract technical metadata per-object and run one extraction job serially per-file. This takes a problematically long time for objects with many files; blocks other objects from accessioning; and complicates restarts which must begin again and process the entire object.

NOTE: Needs discussion: Fedora 3 does not support concurrent writes on the same datastream so we can either split out filesets as a first-class objects in the F3 data model or use temporary caching to generate a consolidated techMD datastream.

Decision Drivers

  • Blocker for Google Books project
  • Slows down accessioning process

Considered Options

  • Do nothing
  • Extract metadata on a per-file basis rather than on a per-object basis to benefit from parallelism

Decision Outcome

TBD!

Positive Consequences

  • [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]

Negative Consequences

  • [e.g., compromising quality attribute, follow-up decisions required, …]

Pros and Cons of the Options

[option 1]

[example description pointer to more information …]
  • Good, because [argument a]
  • Good, because [argument b]
  • Bad, because [argument c]

[option 2]

[example description pointer to more information …]
  • Good, because [argument a]
  • Good, because [argument b]
  • Bad, because [argument c]

[option 3]

[example description pointer to more information …]
  • Good, because [argument a]
  • Good, because [argument b]
  • Bad, because [argument c]