No longer being updated, work has migrated here: https://confluence.cornell.edu/display/dmg/Embedded+Metadata+Profile+for+Digitized+Audio+Binaries
#Embedded Metadata Profile for Digitization Binaries
Scope of this document: Embedded metadata in the headers of the digital assets coming from the Digitization Unit and going to CULAR / delivery system workflows. It does not cover the technical practices for creating these preservation (and eventually, delivery/access) files, which are a separate concern. See https://confluence.cornell.edu/x/jAbhDg for more on that work.
##General Notes:
- When a field is "empty", leave it empty (no spaces, no text) but keep the delimiter (; or ()) to indicate the empty field, e.g. `filename;digital collection ();Cornell U Library;123456;;BatchID;notes`
- Needs to be open to working/being comprehensible outside of Cornell.
- Should map back to managing unit/institution and collection.
- Follow consistent naming practices, e.g., use `Cornell U Library` whenever and wherever the agent is Cornell University Libraries for the embedded metadata.
- Obligation indicates which fields are always required (and also which are repeatable; currently, no fields should be repeatable). It does not capture strong recommendations, which should be taken into account during the overall workflow discussions.
- Namespaces / standards / specifications in play:
  - dcterms: http://purl.org/dc/terms/ Used for generic concept mapping between various metadata outputs. These fields are not actually embedded/tagged in the embedded metadata.
  - ebucore: www.ebu.ch/metadata/schemas/EBUCore/ebucore.xsd Used for generic concept mapping between various metadata outputs. These fields are not actually embedded/tagged in the embedded metadata.
  - RIFF Info Tags (WAV):
  - BWF (bext chunk, WAV):
  - ID3 Tags (MP3): http://id3.org/id3v2.3.0
##WAV Audio Binary File (Asset)
###Header Syntax:
filename;digital collection (digital collection ID);digitizing agent;analog resource ID;digital resource ID;digital collection batch MARC ID;digitization notes
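As a minimal sketch of the syntax above (not the Digitization Unit's actual tooling), the header string could be assembled and split apart in Python as follows. The function names `build_header` and `parse_header` and the dictionary keys are hypothetical. The sketch assumes that field values never contain `;` or parentheses, that there is no whitespace around the `;` delimiters, and that an empty collection name collapses to a bare `()`.

```python
import re

# Field order taken from the Header Syntax line; the key names are illustrative.
FIELDS = [
    "filename",
    "digital_collection",   # carries the collection ID in parentheses
    "digitizing_agent",
    "analog_resource_id",
    "digital_resource_id",
    "batch_marc_id",
    "digitization_notes",
]

# "Collection name (ID)" where either part may be empty.
_COLLECTION_RE = re.compile(r"^(?P<name>.*?)\s*\((?P<id>[^()]*)\)$")


def build_header(filename, collection="", collection_id="",
                 agent="Cornell U Library", analog_id="", digital_id="",
                 batch_id="", notes=""):
    """Join the seven fields with ';', keeping delimiters for empty fields."""
    # Assumption: with no collection name, only the empty "()" remains.
    collection_part = f"{collection} ({collection_id})".lstrip()
    return ";".join([filename, collection_part, agent, str(analog_id),
                     str(digital_id), batch_id, notes])


def parse_header(header):
    """Split a header string back into a dict of field values."""
    parts = header.split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    match = _COLLECTION_RE.match(record["digital_collection"])
    if match is None:
        raise ValueError("collection field must end with '(...)'")
    record["digital_collection"] = match.group("name")
    record["digital_collection_id"] = match.group("id")
    return record
```

Called with the values from the General Notes example, `build_header("filename", "digital collection", "", "Cornell U Library", 123456, "", "BatchID", "notes")` reproduces that example string, and `parse_header` returns empty strings for the two fields that were left empty.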
###Embedded Metadata Profile:
field | external NS mapping | expected value | obligation | notes |
---|---|---|---|---|
filename | FileName // ebucore:filename | filename.extension | {1,1} | variable consisting of the original filename and the file extension, separated by a period |
digital collection | dcterms:isMemberOf | "Cornell Department of Music recordings" / static text | {0,1} | Collection names should be pulled from the originating spreadsheet WHICH SHOULD BE NORMALIZED AND CONSISTENT FROM THE START. |
digital collection ID | ??? | collBibID / integer, 1-9 digits | {0,1} | Voyager BibID for the digital collection's MARC bibliographic record |
digitizing agent | marcrel:digitizer | "Cornell U Library" | {1,1} | Standardize the name - I think this is put in currently for sake of space. |
analog resource ID | dcterms:source | BibID / integer variable, 1-9 digits | {1,1} | Voyager BibID for the MARC bibliographic record of the original analog resource that was digitized & is represented by this file. |
digital resource ID | ebucore:identifier | eBibID / integer, 1-9 digits | {0,1} | Voyager BibID for the MARC bibliographic record of the newly created digital resource. |
digital collection batch MARC ID | ??? | 899 $a value / static text | {1,1} | The 899 field (for digital collections, the batch identifier assigned by MARC Batch processing when adding/updating MARC records for digital collections and digital assets). |
digitization notes | dcterms:description | text / variable text up to XXX characters | {0,1} | Any digitization notes. As more specific note types emerge, discuss making those specific fields. |
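The obligation and expected-value columns above lend themselves to a mechanical check. The following is a rough sketch only: the `PROFILE` mapping, the key names, and the `validate` function are hypothetical, the regular expressions are one reading of "integer, 1-9 digits" and "filename.extension", and the notes length limit is left unchecked because the profile still lists it as XXX.

```python
import re

BIB_ID = r"\d{1,9}"  # "integer, 1-9 digits"

# field -> (required?, regex the value must fully match, or None for free text)
PROFILE = {
    "filename": (True, r".+\.[^.]+"),             # {1,1} filename.extension
    "digital_collection": (False, None),          # {0,1}
    "digital_collection_id": (False, BIB_ID),     # {0,1}
    "digitizing_agent": (True, r"Cornell U Library"),  # {1,1}
    "analog_resource_id": (True, BIB_ID),         # {1,1}
    "digital_resource_id": (False, BIB_ID),       # {0,1}
    "batch_marc_id": (True, None),                # {1,1}
    "digitization_notes": (False, None),          # {0,1}, limit still "XXX"
}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, (required, pattern) in PROFILE.items():
        value = record.get(field, "")
        if value == "":
            if required:
                problems.append(f"{field}: required ({{1,1}}) but empty")
            continue
        if pattern and not re.fullmatch(pattern, value):
            problems.append(f"{field}: unexpected value {value!r}")
    return problems
```

Returning a list rather than raising on the first failure is a deliberate choice here, so that a batch report can show every problem in a file's header at once.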
field | external NS mapping | expected value | obligation | notes |
---|---|---|---|---|
Description | BWF:Description // dcterms:description? | variable string, seems to have some construction order in test data. | {1,1} | Not sure if this will be replaced by the above profile or is serving a different purpose. Values in test data include Filename;Collection Name;Title;Creator:. These are not unique. |
Originator (Agent & Tool) | BWF:Originator // marcrel or prov? | Cornell University Library, WaveLab v.v, WaveLabel v.v.v | {1,1} | Presume this is automatically generated. Do we want the agent name here to match what we use above? |
?? | BWF:OriginatorReference | | | Identifier for running WaveLab session?? |
File Creation Date | BWF:OriginationDate // dcterms:created | YYYY-MM-DD | {1,1} | Presume this is also automatically generated. |
Extent (time) | BWF:OriginationTime // dcterms:extent | HH:MM:SS | {1,1} | Presume this is also automatically generated. Is this the file length after any cleanup or editing? |
?? | BWF:TimeReference (translated) | datetime.time instance | {1,1} | Presume this is also automatically generated. |
?? | BWF:TimeReference | integer | {1,1} | Presume this is also automatically generated. |
Bext Spec Version | BWF:BextVersion | version number as single integer | {1,1} | Presume this is also automatically generated. |
?? | BWF:UMID | | {0,1} | |
?? | ??:LoudnessValue | | {0,1} | |
?? | ??:LoudnessRange | | {0,1} | |
?? | ??:MaxTruePeakLevel | | {0,1} | |
?? | ??:MaxMomentaryLoudness | | {0,1} | |
?? | ??:MaxShortTermLoudness | | {0,1} | |
?? | BWF:CodingHistory | | {0,1} | |
?? | RIFF:IARL | | | |
?? | RIFF:IART | | | |
?? | RIFF:ICMS | | | |
?? | RIFF:ICMT | | | |
?? | RIFF:ICOP | | | |
?? | RIFF:ICRD | | | |
?? | RIFF:IENG | | | |
?? | RIFF:IGNR | | | |
?? | RIFF:IKEY | | | |
?? | RIFF:IMED | | | |
?? | RIFF:INAM | | | |
?? | RIFF:IPRD | | | |
?? | RIFF:ISBJ | | | |
?? | RIFF:ISFT | | | |
?? | RIFF:ISRC | | | |
?? | RIFF:ISRF | | | |
?? | RIFF:ITCH | | | |
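To inspect what a tool such as WaveLab actually wrote into the BWF fields listed above, the bext chunk can be read with the Python standard library alone. This is a sketch under stated assumptions, not part of the profile: the byte layout (602 fixed bytes followed by the coding history) is taken from my reading of EBU Tech 3285 rather than from this document, the function names are hypothetical, the loudness values are returned as the raw stored integers, and the sample rate needed for the "translated" TimeReference has to be supplied by the caller (it lives in the `fmt ` chunk, which is not parsed here).

```python
import datetime
import struct

# Fixed-size part of the bext payload, little-endian (assumed from EBU Tech 3285).
BEXT_FORMAT = "<256s32s32s10s8sIIH64s5h180s"
BEXT_SIZE = struct.calcsize(BEXT_FORMAT)  # 602 bytes


def _text(raw):
    """Decode a NUL-padded ASCII field."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def find_chunk(wav_bytes, chunk_id):
    """Return the payload of the first top-level chunk with this 4-byte id, or None."""
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(wav_bytes):
        cid, size = struct.unpack_from("<4sI", wav_bytes, pos)
        if cid == chunk_id:
            return wav_bytes[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    return None


def parse_bext(data):
    """Parse a bext chunk payload into a dict keyed by the names used in the table."""
    if len(data) < BEXT_SIZE:
        raise ValueError(f"bext payload too short: {len(data)} bytes")
    (description, originator, originator_ref, date, time_, time_low, time_high,
     version, umid, loud_value, loud_range, max_true_peak, max_momentary,
     max_short_term, _reserved) = struct.unpack(BEXT_FORMAT, data[:BEXT_SIZE])
    return {
        "Description": _text(description),
        "Originator": _text(originator),
        "OriginatorReference": _text(originator_ref),
        "OriginationDate": _text(date),
        "OriginationTime": _text(time_),
        "TimeReference": (time_high << 32) | time_low,  # sample count
        "BextVersion": version,
        "UMID": umid.hex(),
        "LoudnessValue": loud_value,            # raw stored integers
        "LoudnessRange": loud_range,
        "MaxTruePeakLevel": max_true_peak,
        "MaxMomentaryLoudness": max_momentary,
        "MaxShortTermLoudness": max_short_term,
        "CodingHistory": _text(data[BEXT_SIZE:]),
    }


def time_reference_to_time(samples, sample_rate):
    """Translate a TimeReference sample count into a datetime.time."""
    offset = datetime.timedelta(seconds=samples / sample_rate)
    return (datetime.datetime.min + offset).time()
```

A typical use would be `parse_bext(find_chunk(open(path, "rb").read(), b"bext"))`, where `path` is any WAV file from a test batch. Reading the whole file into memory keeps the sketch short but would not suit very large preservation masters.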
##MP3 Audio Binary File (Asset)
Nota bene: I believe these are derived from the WAV files, which means they don't necessarily have the same level of embedded metadata as the above WAV files. Need to confirm this, then discuss whether or not we want this embedded metadata transferred across file formats.
###MP3 Tags Syntax:
*** mp3 info #not captured below, since technical specification that is probably tied to digitization tool
MPEG1/layer III
Bitrate: 128KBps
Frequency: 48KHz
*** optional frames / tags
TYER (Year): YYYY
TIT2 (Title/songname/content description): Filename_As_Identifier_Without_Extension Work Title Taken from Originating Spreadsheet
TCON (Content type): Genre Term
COMM (Comments): (ID3v1 Comment)[XXX]: Filename_As_Identifier_Without_Extension;
TOFN (Original filename): Filename_As_Identifier_Without_Extension
TSOA (): frame
PRIV (Private frame): (unimplemented)
TALB (Album/Movie/Show title): Digital Collection Title
###Embedded (MP3 Tags) Metadata Profile:
The following profile presumes the first ten bytes of the file follow the ID3v2.3.0 specification. The focus is on metadata derived from other systems and added to the file, not on the technical specifications specific to AV, though those should also be captured, added, updated, and standardized as needed. The idea is to treat the WAV as the file of record, where the majority of the technical metadata needed for back-tracking is captured. The MP3 should be able to confirm which collection it belongs to and should share an identifier with the WAV files generated.
field | external NS mapping | expected value | obligation | notes |
---|---|---|---|---|
filename | ID3v2.3 file identifier // ebucore:filename | filename | {1,1} | variable consisting of the original filename. Doesn't include file extension. |
original filename | ID3v2#TOFN Original filename // ebucore:filename | filename | {0,1} | variable consisting of the original filename, if for some reason it cannot fit in the header of the MP3 file (above). |
year of creation | ID3v2#TYER Year // dcterms:created | four digits | {0,1} | I believe this is the year the file was created, not the year the work was created. Is always kept as an empty field in the CU Lecture test files reviewed. |
content type (genre-esque) | ID3v2#TCON Content type // ebucore:genre | Controlled Vocabulary Value | {0,1} | should be a genre-like term for the type of content captured. Most often used term is 'Spoken Word'. Not sure where this value comes from/what vocabulary is used. Should confirm this. |
title of work | ID3v2#TIT2 Title/songname/content description // dcterms:title | string, variable length | {0,1} | title of the work captured in digital manifestation. Where is the title taken from currently? The originating spreadsheet? It also seems this field is currently used to capture the filename (see above) as an identifier preceding the title. Do we want to keep this practice? |
digitization notes | ID3v2#COMM // dcterms:description | text / variable text up to XXX characters | {0,1} | Any digitization notes. As more specific note types emerge, discuss making those specific fields. Right now, these capture the filename when it isn't captured in the title (above). Should choose one or the other (or a different filename field) for this. |
??? album sort order ??? | ID3v2#TSOA | text | {0,1} | The ID3 spec says this is for "Album sort order". However, this 1. does not seem useful for these assets, and 2. only has a value of 'frame' in the CU Lectures sample data given. Uncertain what this is meant to do and whether we should keep it. |
??? private frame ??? | ID3v2#PRIV or ID3v2#sec4.28 Private frame | text | {0,1} | Do we need to keep this? Doesn't seem to be used. Access and permissions would be managed in other, external metadata and file system management, no? |
digital collection | ID3v2#TALB Album/Movie/Show title // dcterms:isMemberOf | "Cornell Department of Music recordings" / static text | {0,1} | Collection names should be pulled from the originating spreadsheet WHICH SHOULD BE NORMALIZED AND CONSISTENT FROM THE START. |
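Since the profile presumes the first ten bytes follow ID3v2.3.0, a quick check of those bytes can confirm that presumption for a given MP3 before any frames are read. The sketch below is illustrative only: the function name and returned keys are hypothetical, and the header layout (the `ID3` marker, two version bytes, one flags byte, and a four-byte synchsafe size) comes from my reading of the ID3v2.3.0 specification linked in the General Notes.

```python
def parse_id3v2_header(first_ten):
    """Parse the 10-byte ID3v2 header; return None if no ID3v2 tag is present."""
    if len(first_ten) < 10 or first_ten[:3] != b"ID3":
        return None
    major, revision, flags = first_ten[3], first_ten[4], first_ten[5]
    size_bytes = first_ten[6:10]
    # The size is "synchsafe": only the low 7 bits of each byte are used.
    if any(b & 0x80 for b in size_bytes):
        raise ValueError("size bytes must have their high bit clear")
    size = 0
    for b in size_bytes:
        size = (size << 7) | b
    return {
        "version": f"2.{major}.{revision}",
        "flags": flags,
        "tag_size": size,  # bytes of tag data after the 10-byte header
    }


def is_id3v23(first_ten):
    """True only when the header declares ID3v2.3.x, as the profile presumes."""
    header = parse_id3v2_header(first_ten)
    return header is not None and header["version"].startswith("2.3.")
```

In a batch check this would be fed `open(path, "rb").read(10)` for each MP3; files where `is_id3v23` is false would need a closer look before the frame-level profile above is applied.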