@cmharlow
Last active February 3, 2017 19:40

No longer being updated, work has migrated here: https://confluence.cornell.edu/display/dmg/Embedded+Metadata+Profile+for+Digitized+Audio+Binaries

#Embedded Metadata Profile for Digitization Binaries

Scope of this document: Embedded metadata in the headers of the digital assets coming from the Digitization Unit and going to CULAR / delivery system workflows. This document does not cover the technical practices for creating these preservation (and, eventually, delivery/access) files; that work is orthogonal to this profile. See https://confluence.cornell.edu/x/jAbhDg for more on that work.

##General Notes:

  • When a field is "empty", leave it empty (no spaces, no text) but use the delimiter (; or ()) to indicate the empty field.
  • Needs to be open to working/being comprehensible outside of Cornell.
  • Should map back to managing unit/institution and collection.

e.g.:

filename;digital collection ();Cornell U Library;123456;;BatchID;notes
  • Follow consistent naming practices, i.e., use Cornell U Library whenever and wherever the agent is Cornell University Libraries for the embedded metadata.
  • Obligation indicates what fields are always required (and also what is repeatable - currently, no fields should be repeatable). This doesn't indicate strong recommendations, which should be taken into account during the overall workflow discussions.
  • namespaces / standards / specifications in play:

##WAV Audio Binary File (Asset)

###Header Syntax:

filename;digital collection (digital collection ID);digitizing agent;analog resource ID;digital resource ID;digital collection batch MARC ID;digitization notes
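
As a rough illustration of the syntax above, the following sketch assembles a header string from its parts. The function name `build_header` and its parameter names are hypothetical, and the handling of a fully empty collection (emitting just `()`) is an assumption based on the "no spaces, no text" rule in the General Notes, not something this profile states outright.

```python
def build_header(filename, digitizing_agent, analog_resource_id, batch_marc_id,
                 digital_collection="", digital_collection_id="",
                 digital_resource_id="", digitization_notes=""):
    """Join the seven header fields with ';', keeping empty fields as empty slots."""
    # Assumption: an empty collection name yields "()" rather than " ()".
    if digital_collection:
        collection = f"{digital_collection} ({digital_collection_id})"
    else:
        collection = f"({digital_collection_id})"

    parts = [
        filename,
        collection,
        digitizing_agent,
        str(analog_resource_id),
        str(digital_resource_id),
        batch_marc_id,
        digitization_notes,
    ]
    # A ';' inside a value would make the header ambiguous to split again.
    for part in parts:
        if ";" in part:
            raise ValueError(f"field value may not contain ';': {part!r}")
    return ";".join(parts)


header = build_header(
    "filename", "Cornell U Library", 123456, "BatchID",
    digital_collection="digital collection",
    digitization_notes="notes",
)
print(header)
```

With these inputs the result matches the example given in the General Notes, including the empty `()` and the empty slot for the digital resource ID.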

###Embedded Metadata Profile:

| field | external NS mapping | expected value | obligation | notes |
|---|---|---|---|---|
| filename | FileName // ebucore:filename | filename.extension | {1,1} | variable consisting of the original filename and the file extension, separated by a period |
| digital collection | dcterms:isMemberOf | "Cornell Department of Music recordings" / static text | {0,1} | Collection names should be pulled from the originating spreadsheet WHICH SHOULD BE NORMALIZED AND CONSISTENT FROM THE START. |
| digital collection ID | ??? | collBibID / integer, 1-9 digits | {0,1} | Voyager BibID for the digital collection's MARC bibliographic record |
| digitizing agent | marcrel:digitizer | "Cornell U Library" | {1,1} | Standardize the name - I think this is put in currently for sake of space. |
| analog resource ID | dcterms:source | BibID / integer variable, 1-9 digits | {1,1} | Voyager BibID for the MARC bibliographic record of the originating analog resource that was digitized & is represented by this file. |
| digital resource ID | ebucore:identifier | eBibID / integer, 1-9 digits | {0,1} | Voyager BibID for the MARC bibliographic record of the newly created digital resource. |
| digital collection batch MARC ID | ??? | 899 $a value / static text | {1,1} | The 899 field (for digital collections, the batch identifier assigned by MARC Batch processing when adding/updating digital collections and digital assets MARC records). |
| digitization notes | dcterms:description | text / variable text up to XXX characters | {0,1} | Any digitization notes. As more specific note types emerge, discuss making those specific fields. |
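
One way to check an existing header against the obligations and expected values in this profile is sketched below. The names `parse_header` and `validate` are hypothetical, and the rules encoded here (seven `;`-separated slots, BibIDs of 1-9 digits, a period in the filename) are only my reading of the profile, so treat this as a starting point rather than a settled validator.

```python
import re

BIB_ID = re.compile(r"[0-9]{1,9}")
# "collection name (collection ID)", where either part may be empty.
COLLECTION = re.compile(r"^(?P<name>.*?) ?\((?P<id>[^()]*)\)$")

REQUIRED = ("filename", "digitizing agent", "analog resource ID",
            "digital collection batch MARC ID")
BIB_ID_FIELDS = ("digital collection ID", "analog resource ID",
                 "digital resource ID")


def parse_header(header):
    """Split a header string into a dict keyed by the profile's field names."""
    parts = header.split(";")
    if len(parts) != 7:
        raise ValueError(f"expected 7 ';'-separated fields, got {len(parts)}")
    match = COLLECTION.match(parts[1])
    if match is None:
        raise ValueError(f"collection slot lacks '(...)': {parts[1]!r}")
    return {
        "filename": parts[0],
        "digital collection": match["name"],
        "digital collection ID": match["id"],
        "digitizing agent": parts[2],
        "analog resource ID": parts[3],
        "digital resource ID": parts[4],
        "digital collection batch MARC ID": parts[5],
        "digitization notes": parts[6],
    }


def validate(fields):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for name in REQUIRED:
        if not fields[name]:
            problems.append(f"{name}: required ({{1,1}}) but empty")
    for name in BIB_ID_FIELDS:
        value = fields[name]
        if value and not BIB_ID.fullmatch(value):
            problems.append(f"{name}: expected an integer of 1-9 digits, got {value!r}")
    if fields["filename"] and "." not in fields["filename"]:
        problems.append("filename: expected filename.extension")
    return problems


fields = parse_header(
    "filename.wav;digital collection ();Cornell U Library;123456;;BatchID;notes"
)
print(validate(fields))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a batch of files at once.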

Beyond the above header:

| field | external NS mapping | expected value | obligation | notes |
|---|---|---|---|---|
| Description | BWF:Description // dcterms:description? | variable string, seems to have some construction order in test data. | {1,1} | Not sure if this will be replaced by the above profile or is serving a different purpose. Values in test data include Filename;Collection Name;Title;Creator:. These are not unique. |
| Originator (Agent & Tool) | BWF:Originator // marcrel or prov? | Cornell University Library, WaveLab v.v, WaveLabel v.v.v | {1,1} | Presume this is automatically generated. Do we want the agent name here to match what we use above? |
| | BWF:OriginatorReference | | | Identifier for running WaveLab session?? |
| File Creation Date | BWF:OriginationDate // dcterms:creation | YYYY-MM-DD | {1,1} | Presume this is also automatically generated. |
| Extent (time) | BWF:OriginationTime // dcterms:extent | HH:MM:SS | {1,1} | Presume this is also automatically generated. Is this the file length after any cleanup or editing? |
| ?? | BWF:TimeReference (translated) | datetime.time instance | {1,1} | Presume this is also automatically generated. |
| ?? | BWF:TimeReference | integer | {1,1} | Presume this is also automatically generated. |
| Bext Spec Version | BWF:BextVersion | version number as single integer | {1,1} | Presume this is also automatically generated. |
| | BWF:UMID | | {0,1} | |
| | ??:LoudnessValue | | {0,1} | |
| | ??:LoudnessRange | | {0,1} | |
| | ??:MaxTruePeakLevel | | {0,1} | |
| | ??:MaxMomentaryLoudness | | {0,1} | |
| | ??:MaxShortTermLoudness | | {0,1} | |
| | BWF:CodingHistory | | {0,1} | |
| | RIFF:IARL | | | |
| | RIFF:IART | | | |
| | RIFF:ICMS | | | |
| | RIFF:ICMT | | | |
| | RIFF:ICOP | | | |
| | RIFF:ICRD | | | |
| | RIFF:IENG | | | |
| | RIFF:IGNR | | | |
| | RIFF:IKEY | | | |
| | RIFF:IMED | | | |
| | RIFF:INAM | | | |
| | RIFF:IPRD | | | |
| | RIFF:ISBJ | | | |
| | RIFF:ISFT | | | |
| | RIFF:ISRC | | | |
| | RIFF:ISRF | | | |
| | RIFF:ITCH | | | |
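
To review what is actually stored in these BWF fields, a small reader for the `bext` chunk can help. The sketch below assumes the fixed 602-byte layout of EBU Tech 3285 version 2 as I understand it (in that layout the loudness values sit in the same chunk, after the UMID); `find_chunk` and `parse_bext` are hypothetical helper names, and the sample rate passed in is an assumption the caller has to supply from the `fmt ` chunk.

```python
import struct

# Description, Originator, OriginatorReference, OriginationDate, OriginationTime,
# TimeReference low/high, Version, UMID, five loudness values, reserved bytes.
BEXT_FIXED = struct.Struct("<256s32s32s10s8sIIH64s5h180s")  # 602 bytes


def find_chunk(riff_bytes, chunk_id):
    """Return the payload of the first chunk with this ID in a RIFF/WAVE file."""
    if riff_bytes[:4] != b"RIFF" or riff_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(riff_bytes):
        cid, size = struct.unpack_from("<4sI", riff_bytes, pos)
        if cid == chunk_id:
            return riff_bytes[pos + 8: pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    return None


def _text(raw):
    """Decode a null-padded ASCII field."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace").strip()


def parse_bext(payload, sample_rate=None):
    """Decode a bext payload into a dict keyed by the BWF field names used above."""
    if len(payload) < BEXT_FIXED.size:
        raise ValueError("bext chunk shorter than 602 bytes")
    (description, originator, originator_ref, date, time_, low, high, version,
     umid, loud_val, loud_range, max_tp, max_mom, max_st,
     _reserved) = BEXT_FIXED.unpack_from(payload)
    time_reference = (high << 32) | low  # sample count since midnight
    info = {
        "Description": _text(description),
        "Originator": _text(originator),
        "OriginatorReference": _text(originator_ref),
        "OriginationDate": _text(date),
        "OriginationTime": _text(time_),
        "TimeReference": time_reference,
        "BextVersion": version,
        "UMID": umid.hex(),
        # Raw 16-bit integers; only meaningful when BextVersion >= 2.
        "Loudness": {
            "LoudnessValue": loud_val,
            "LoudnessRange": loud_range,
            "MaxTruePeakLevel": max_tp,
            "MaxMomentaryLoudness": max_mom,
            "MaxShortTermLoudness": max_st,
        },
        "CodingHistory": _text(payload[BEXT_FIXED.size:]),
    }
    if sample_rate:
        # The "translated" TimeReference: samples divided by the sample rate.
        info["TimeReferenceSeconds"] = time_reference / sample_rate
    return info
```

Typical use would be `parse_bext(find_chunk(open(path, "rb").read(), b"bext"), sample_rate=48000)`, after checking that `find_chunk` did not return `None`. The RIFF:I* entries listed above live in a separate `LIST`/`INFO` chunk and are not covered by this sketch.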

##MP3 Audio Binary File (Asset)

Nota bene: I believe these are derived from the WAV files, which means they don't necessarily have the same level of embedded metadata as the above WAV files. Need to confirm this, then discuss whether or not we want this embedded metadata transferred across file formats.

###MP3 Tags Syntax:

*** mp3 info #not captured below, since technical specification that is probably tied to digitization tool
MPEG1/layer III
Bitrate: 128KBps
Frequency: 48KHz
*** optional frames / tags
TYER (Year): YYYY
TIT2 (Title/songname/content description): Filename_As_Identifier_Without_Extension Work Title Taken from Originating Spreadsheet
TCON (Content type): Genre Term
COMM (Comments): (ID3v1 Comment)[XXX]: Filename_As_Identifier_Without_Extension;
TOFN (Original filename): Filename_As_Identifier_Without_Extension
TSOA ():  frame
PRIV (Private frame):  (unimplemented)
TALB (Album/Movie/Show title): Digital Collection Title

###Embedded (MP3 Tags) Metadata Profile:

The following profile presumes that the first ten bytes of the file follow the ID3v2.3.0 specification. The focus is on metadata derived from other systems and added to the file, not necessarily the technical specifications specific to AV, though these should be captured, added, updated, and standardized (as needed) as well. The idea is to treat the WAV as the file of record, where the majority of the technical metadata needed for back-tracking is captured. The MP3 should be able to confirm what collection it belongs to and share an identifier with the WAV files generated.
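
The ten-byte presumption can be checked before reading any frames. The sketch below decodes an ID3v2 tag header (identifier, version, flags, and the four-byte synchsafe size); the function name `parse_id3v2_header` and the returned dictionary keys are hypothetical.

```python
def parse_id3v2_header(first_ten):
    """Decode the 10-byte ID3v2 tag header found at the start of an MP3 file."""
    if len(first_ten) < 10 or first_ten[:3] != b"ID3":
        raise ValueError("no ID3v2 tag header at the start of the data")
    major, revision, flags = first_ten[3], first_ten[4], first_ten[5]

    # The size is "synchsafe": four bytes with the top bit of each always zero,
    # so only 7 bits per byte carry data.
    size_bytes = first_ten[6:10]
    if any(b & 0x80 for b in size_bytes):
        raise ValueError("tag size is not a valid synchsafe integer")
    size = 0
    for b in size_bytes:
        size = (size << 7) | b

    return {
        "version": f"2.{major}.{revision}",
        "unsynchronisation": bool(flags & 0x80),
        "extended_header": bool(flags & 0x40),
        "experimental": bool(flags & 0x20),
        "tag_size": size,  # bytes of tag data following the 10-byte header
    }


header = parse_id3v2_header(b"ID3\x03\x00\x00\x00\x00\x02\x01")
print(header["version"], header["tag_size"])
```

A file that fits this profile would report version `2.3.0`; anything else is a signal that the frame IDs in the table below (for example TYER) may not apply as written.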

| field | external NS mapping | expected value | obligation | notes |
|---|---|---|---|---|
| filename | ID3v23file identifier // ebucore:filename | filename | {1,1} | variable consisting of the original filename. Doesn't include file extension. |
| original filename | ID3v2#TOFN Original filename // ebucore:filename | filename | {0,1} | variable consisting of the original filename, if for some reason it cannot fit in the header of the MP3 file (above). |
| year of creation | ID3v2#TYER Year // dcterms:created | four digits | {0,1} | I believe this is the year the file was created, not the year the work was created. Is always kept as an empty field in the CU Lecture test files reviewed. |
| content type (genre-esque) | ID3v2#TCON Content type // ebucore:genre | Controlled Vocabulary Value | {0,1} | should be a genre-like term for the type of content captured. Most often used term is 'Spoken Word'. Not sure where this value comes from/what vocabulary is used. Should confirm this. |
| title of work | ID3v2#TIT2 Title/songname/content description // dcterms:title | string, variable length | {0,1} | title of the work captured in digital manifestation. Where is the title taken from currently? The originating spreadsheet? Also seems this field is currently used to capture the filename (see above) as an identifier preceding the title. Do we want to keep this practice? |
| digitization notes | ID3v2#COMM // dcterms:description | text / variable text up to XXX characters | {0,1} | Any digitization notes. As more specific note types emerge, discuss making those specific fields. Right now, these capture the filename when it isn't captured in the title (above). Should choose one or the other (or a different filename field) for this. |
| ??? album sort order ??? | ID3v2#TSOA | text | {0,1} | The ID3 spec says this is for "Album sort order". However, this seems to 1. not be useful for these assets 2. field only has a value of 'frame' in the CU Lectures sample data given. Uncertain what this is meant to do and if we should keep it. |
| ??? private frame ??? | ID3v2#PRIV or ID3v2#sec4.28 Private frame | text | {0,1} | Do we need to keep this? Doesn't seem to be used. Access and permissions would be managed in other, external metadata and file system management, no? |
| digital collection | ID3v2#TALB Album/Movie/Show title // dcterms:isMemberOf | "Cornell Department of Music recordings" / static text | {0,1} | Collection names should be pulled from the originating spreadsheet WHICH SHOULD BE NORMALIZED AND CONSISTENT FROM THE START. |
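
Since the MP3 is meant to share an identifier and a collection with its WAV, a cross-check along the following lines could be run once both files carry their metadata. This is only a sketch: `mp3_matches_wav` is a hypothetical name, the MP3 tags are assumed to have already been read into a plain dict keyed by frame ID, and using TOFN (rather than TIT2 or COMM) as the identifier field is an assumption, because this profile has not yet settled where the filename should live.

```python
from pathlib import PurePath


def wav_identity(wav_header):
    """Pull the filename stem and collection name out of a WAV header string."""
    parts = wav_header.split(";")
    stem = PurePath(parts[0]).stem                   # filename without extension
    collection = parts[1].rsplit("(", 1)[0].strip()  # drop "(collection ID)"
    return stem, collection


def mp3_matches_wav(wav_header, mp3_tags):
    """Return the MP3 frame IDs whose values disagree with the WAV header."""
    stem, collection = wav_identity(wav_header)
    mismatches = []
    if mp3_tags.get("TOFN", "") != stem:
        mismatches.append("TOFN")
    # Only compare collections when the WAV header actually names one.
    if collection and mp3_tags.get("TALB", "") != collection:
        mismatches.append("TALB")
    return mismatches


wav_header = ("CUL_0001.wav;Cornell Department of Music recordings (1234567);"
              "Cornell U Library;123456;;BatchID;")
tags = {"TOFN": "CUL_0001", "TALB": "Cornell Department of Music recordings"}
print(mp3_matches_wav(wav_header, tags))
```

The sample filename, BibIDs, and batch ID above are made up for illustration.
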
@dd388

dd388 commented Jan 3, 2017

@cmharlow
Author

cmharlow commented Jan 9, 2017

@dd388 sure, will do

@jb221467

Note from Dianne: Figure out how this fits into the workflow for vended material
