av_embed_metadata_profile.md

No longer being updated; work has migrated here: https://confluence.cornell.edu/display/dmg/Embedded+Metadata+Profile+for+Digitized+Audio+Binaries

#Embedded Metadata Profile for Digitization Binaries

Scope of this document: Embedded metadata in the headers of the digital assets coming from the Digitization Unit and going to CULAR / delivery system workflows. This document does not cover the technical practices for creating these preservation (and, eventually, delivery/access) files; that work is separate. See https://confluence.cornell.edu/x/jAbhDg for more on it.

##General Notes:

- When a field is "empty", leave it empty (no spaces, no text) but keep the delimiter (; or ()) to mark the empty field.
- Needs to work and be comprehensible outside of Cornell.
- Should map back to managing unit/institution and collection.

For example:

filename;digital collection ();Cornell U Library;123456;;BatchID;notes

Follow consistent naming practices, i.e., use "Cornell U Library" whenever and wherever the agent is Cornell University Libraries in the embedded metadata.

Obligation indicates which fields are always required and which are repeatable (currently, no fields should be repeatable). It doesn't capture strong recommendations, which should be taken into account during the overall workflow discussions.

Namespaces / standards / specifications in play:
- dcterms: http://purl.org/dc/terms/ Used for generic concept mapping between various metadata outputs. These fields are not actually embedded/tagged in the embedded metadata.
- ebucore: www.ebu.ch/metadata/schemas/EBUCore/ebucore.xsd Used for generic concept mapping between various metadata outputs. These fields are not actually embedded/tagged in the embedded metadata.
- RIFF Info Tags (WAV):
- BWF (bext chunk, WAV):
- ID3 Tags (MP3): http://id3.org/id3v2.3.0

##WAV Audio Binary File (Asset)

###Header Syntax:

filename;digital collection (digital collection ID);digitizing agent;analog resource ID;digital resource ID;digital collection batch MARC ID;digitization notes
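
The syntax above can be sketched as a small serializer and parser. This is only an illustration, not the Digitization Unit's tooling: the `Header`, `build_header`, and `parse_header` names are hypothetical, and two behaviours are assumptions of this sketch, namely that an empty collection name is written as a bare `()` and that a `;` inside the final notes field is tolerated.

```python
import re
from typing import NamedTuple


class Header(NamedTuple):
    """Hypothetical container for the seven ';'-delimited header fields."""
    filename: str
    collection: str
    collection_id: str
    digitizing_agent: str
    analog_resource_id: str
    digital_resource_id: str
    batch_marc_id: str
    notes: str


# "collection name (collection ID)"; both the name and the ID may be empty.
_COLLECTION = re.compile(r"^(?P<name>.*?)\s*\((?P<id>\d*)\)$")


def build_header(h: Header) -> str:
    """Serialize the fields, keeping every delimiter even for empty values."""
    if h.collection:
        collection = f"{h.collection} ({h.collection_id})"
    else:
        collection = f"({h.collection_id})"  # assumption: no leading space
    return ";".join([h.filename, collection, h.digitizing_agent,
                     h.analog_resource_id, h.digital_resource_id,
                     h.batch_marc_id, h.notes])


def parse_header(text: str) -> Header:
    """Split a header string back into its fields."""
    # Assumption: only the last field (notes) may itself contain ';'.
    parts = text.split(";", 6)
    if len(parts) != 7:
        raise ValueError(f"expected 7 ';'-delimited fields, got {len(parts)}")
    match = _COLLECTION.match(parts[1])
    if match is None:
        raise ValueError(f"malformed collection field: {parts[1]!r}")
    return Header(parts[0], match["name"], match["id"], *parts[2:])
```

Parsing the example string from the General Notes and serializing it again should give back the same string, with the empty collection ID and empty digital resource ID preserved as empty fields.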

###Embedded Metadata Profile:

| field | external NS mapping | expected value | obligation | notes |
| --- | --- | --- | --- | --- |
| filename | FileName // ebucore:filename | filename.extension | {1,1} | Variable consisting of the original filename and the file extension, separated by a period. |
| digital collection | dcterms:isMemberOf | "Cornell Department of Music recordings" / static text | {0,1} | Collection names should be pulled from the originating spreadsheet WHICH SHOULD BE NORMALIZED AND CONSISTENT FROM THE START. |
| digital collection ID | ??? | collBibID / integer, 1-9 digits | {0,1} | Voyager BibID for the digital collection's MARC bibliographic record. |
| digitizing agent | marcrel:digitizer | "Cornell U Library" | {1,1} | Standardize the name. I think the abbreviated form is currently used for the sake of space. |
| analog resource ID | dcterms:source | BibID / integer variable, 1-9 digits | {1,1} | Voyager BibID for the MARC bibliographic record of the original analog resource that was digitized and is represented by this file. |
| digital resource ID | ebucore:identifier | eBibID / integer, 1-9 digits | {0,1} | Voyager BibID for the MARC bibliographic record of the newly created digital resource. |
| digital collection batch MARC ID | ??? | 899 $a value / static text | {1,1} | The 899 field: for digital collections, the batch identifier assigned by MARC Batch processing when adding/updating MARC records for digital collections and digital assets. |
| digitization notes | dcterms:description | text / variable text up to XXX characters | {0,1} | Any digitization notes. As more specific note types emerge, discuss making those specific fields. |
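
The obligation and expected-value columns above could be checked mechanically along these lines. This is a rough sketch under stated assumptions: the `RULES` table and `validate` function are hypothetical names, `{1,1}` is read as "must be non-empty", and the notes length limit is left unchecked because the profile has not yet fixed it (XXX).

```python
import re

# Voyager BibIDs are described in the profile as integers of 1-9 digits.
BIB_ID = re.compile(r"[0-9]{1,9}")

# field name -> (pattern for the expected value or None, required?)
RULES = {
    "filename": (re.compile(r".+\.[^.]+"), True),           # name.extension
    "digital collection": (None, False),
    "digital collection ID": (BIB_ID, False),
    "digitizing agent": (re.compile(r"Cornell U Library"), True),
    "analog resource ID": (BIB_ID, True),
    "digital resource ID": (BIB_ID, False),
    "digital collection batch MARC ID": (None, True),
    "digitization notes": (None, False),                     # length limit not yet set
}


def validate(fields: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, (pattern, required) in RULES.items():
        value = fields.get(name, "")
        if not value:
            if required:
                problems.append(f"{name}: required ({{1,1}}) but empty")
            continue
        if pattern is not None and not pattern.fullmatch(value):
            problems.append(f"{name}: unexpected value {value!r}")
    return problems
```

Returning a list of problems rather than raising on the first one is a design choice for batch review: a whole spreadsheet's worth of headers can be checked and reported in one pass.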

Beyond the above header:

| field | external NS mapping | expected value | obligation | notes |
| --- | --- | --- | --- | --- |
| Description | BWF:Description // dcterms:description? | variable string, seems to have some construction order in test data | {1,1} | Not sure if this will be replaced by the above profile or is serving a different purpose. Values in test data include Filename;Collection Name;Title;Creator:. These are not unique. |
| Originator (Agent & Tool) | BWF:Originator // marcrel or prov? | Cornell University Library, WaveLab v.v, WaveLabel v.v.v | {1,1} | Presume this is automatically generated. Do we want the agent name here to match what we use above? |
| BWF:OriginatorReference | Identifier for running WaveLab session?? | | | |
| File Creation Date | BWF:OriginationDate // dcterms:created | YYYY-MM-DD | {1,1} | Presume this is also automatically generated. |
| Extent (time) | BWF:OriginationTime // dcterms:extent | HH:MM:SS | {1,1} | Presume this is also automatically generated. Is this the file length after any cleanup or editing? |
| ?? | BWF:TimeReference (translated) | datetime.time instance | {1,1} | Presume this is also automatically generated. |
| ?? | BWF:TimeReference | integer | {1,1} | Presume this is also automatically generated. |
| Bext Spec Version | BWF:BextVersion | version number as single integer | {1,1} | Presume this is also automatically generated. |
| BWF:UMID | | | {0,1} | |
| ??:LoudnessValue | | | {0,1} | |
| ??:LoudnessRange | | | {0,1} | |
| ??:MaxTruePeakLevel | | | {0,1} | |
| ??:MaxMomentaryLoudness | | | {0,1} | |
| ??:MaxShortTermLoudness | | | {0,1} | |
| BWF:CodingHistory | | | {0,1} | |
| RIFF:IARL | | | | |
| RIFF:IART | | | | |
| RIFF:ICMS | | | | |
| RIFF:ICMT | | | | |
| RIFF:ICOP | | | | |
| RIFF:ICRD | | | | |
| RIFF:IENG | | | | |
| RIFF:IGNR | | | | |
| RIFF:IKEY | | | | |
| RIFF:IMED | | | | |
| RIFF:INAM | | | | |
| RIFF:IPRD | | | | |
| RIFF:ISBJ | | | | |
| RIFF:ISFT | | | | |
| RIFF:ISRC | | | | |
| RIFF:ISRF | | | | |
| RIFF:ITCH | | | | |
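
To inspect what a tool has actually written into the BWF fields listed above, the bext chunk can be read directly. The following is a minimal sketch that assumes the fixed 602-byte layout of the bext chunk as defined in EBU Tech 3285 (version 2, which adds the loudness fields); the function names are hypothetical, loudness values are returned as the raw stored integers, and the translated TimeReference is returned as a `timedelta` rather than the `datetime.time` mentioned in the table.

```python
import struct
from datetime import timedelta
from typing import Optional

# Fixed part of the bext chunk, little-endian, 602 bytes in total:
# Description, Originator, OriginatorReference, OriginationDate,
# OriginationTime, TimeReference low/high, Version, UMID,
# five loudness values, reserved bytes. CodingHistory follows.
BEXT_FIXED = struct.Struct("<256s32s32s10s8sIIH64s5h180s")


def _text(raw: bytes) -> str:
    """Decode a NUL-padded ASCII field."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


def find_chunk(data: bytes, chunk_id: bytes) -> Optional[bytes]:
    """Return the payload of the first matching RIFF chunk, or None."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        cid, size = struct.unpack_from("<4sI", data, pos)
        if cid == chunk_id:
            return data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to even length
    return None


def parse_bext(payload: bytes) -> dict:
    """Unpack a bext chunk payload into a dict keyed by BWF field name."""
    if len(payload) < BEXT_FIXED.size:
        raise ValueError("bext payload shorter than 602 bytes")
    (desc, originator, orig_ref, odate, otime, tr_low, tr_high, version,
     umid, lv, lr, mtp, mml, msl, _reserved) = BEXT_FIXED.unpack_from(payload)
    return {
        "Description": _text(desc),
        "Originator": _text(originator),
        "OriginatorReference": _text(orig_ref),
        "OriginationDate": _text(odate),
        "OriginationTime": _text(otime),
        "TimeReference": (tr_high << 32) | tr_low,  # sample count
        "BextVersion": version,
        "UMID": umid.hex(),
        "LoudnessValue": lv,
        "LoudnessRange": lr,
        "MaxTruePeakLevel": mtp,
        "MaxMomentaryLoudness": mml,
        "MaxShortTermLoudness": msl,
        "CodingHistory": _text(payload[BEXT_FIXED.size:]),
    }


def time_reference_to_offset(samples: int, sample_rate: int) -> timedelta:
    """Translate the TimeReference sample count into elapsed time."""
    return timedelta(seconds=samples / sample_rate)
```

The sample rate needed for the translation is not stored in the bext chunk; it would have to come from the WAV `fmt ` chunk or from external technical metadata.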

##MP3 Audio Binary File (Asset)

Nota bene: I believe these are derived from the WAV files, which means they don't necessarily have the same level of embedded metadata as the above WAV files. Need to confirm this, then discuss whether or not we want this embedded metadata transferred across file formats.

###MP3 Tags Syntax:

*** mp3 info #not captured below, since technical specification that is probably tied to digitization tool
MPEG1/layer III
Bitrate: 128KBps
Frequency: 48KHz
*** optional frames / tags
TYER (Year): YYYY
TIT2 (Title/songname/content description): Filename_As_Identifier_Without_Extension Work Title Taken from Originating Spreadsheet
TCON (Content type): Genre Term
COMM (Comments): (ID3v1 Comment)[XXX]: Filename_As_Identifier_Without_Extension;
TOFN (Original filename): Filename_As_Identifier_Without_Extension
TSOA ():  frame
PRIV (Private frame):  (unimplemented)
TALB (Album/Movie/Show title): Digital Collection Title

###Embedded (MP3 Tags) Metadata Profile:

The following profile presumes the first ten bytes follow the ID3v2.3.0 specification. The focus is on metadata derived from other systems and added to the file, not on the AV-specific technical specifications, though these should also be captured, added, updated, and standardized as needed. The idea is to treat the WAV as the file of record, where the majority of the technical metadata needed for back-tracking is captured. The MP3 should be able to confirm what collection it belongs to and share an identifier with the corresponding WAV file.
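
The ten-byte presumption can be verified before trusting any frames. Below is a minimal sketch assuming the header layout given in the ID3v2.3.0 specification (the identifier "ID3", two version bytes, one flags byte, and a four-byte synchsafe size); `ID3Header` and `parse_id3v2_header` are hypothetical names.

```python
from typing import NamedTuple


class ID3Header(NamedTuple):
    major: int      # 3 for ID3v2.3.0
    revision: int   # 0 for ID3v2.3.0
    flags: int
    tag_size: int   # size of the tag, excluding the ten header bytes


def parse_id3v2_header(first_ten: bytes) -> ID3Header:
    """Decode the ten-byte ID3v2 header found at the start of an MP3 file."""
    if len(first_ten) < 10 or first_ten[:3] != b"ID3":
        raise ValueError("no ID3v2 header found")
    major, revision, flags = first_ten[3], first_ten[4], first_ten[5]
    size_bytes = first_ten[6:10]
    # Synchsafe integer: only the low 7 bits of each byte carry data.
    if any(b & 0x80 for b in size_bytes):
        raise ValueError("tag size is not a synchsafe integer")
    size = 0
    for b in size_bytes:
        size = (size << 7) | b
    return ID3Header(major, revision, flags, size)


def is_id3v2_3_0(header: ID3Header) -> bool:
    """True when the version bytes match what this profile presumes."""
    return (header.major, header.revision) == (3, 0)
```

In practice the input would be the first ten bytes read from the MP3 file in binary mode; files that fail `is_id3v2_3_0` fall outside what this profile presumes and would need separate handling.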

| field | external NS mapping | expected value | obligation | notes |
| --- | --- | --- | --- | --- |
| filename | ID3v23file identifier // ebucore:filename | filename | {1,1} | Variable consisting of the original filename. Doesn't include the file extension. |
| original filename | ID3v2#TOFN Original filename // ebucore:filename | filename | {0,1} | Variable consisting of the original filename, if for some reason it cannot fit in the header of the MP3 file (above). |
| year of creation | ID3v2#TYER Year // dcterms:created | four digits | {0,1} | I believe this is the year the file was created, not the year the work was created. Always left empty in the CU Lecture test files reviewed. |
| content type (genre-esque) | ID3v2#TCON Content type // ebucore:genre | Controlled Vocabulary Value | {0,1} | Should be a genre-like term for the type of content captured. The most often used term is 'Spoken Word'. Not sure where this value comes from or what vocabulary is used. Should confirm this. |
| title of work | ID3v2#TIT2 Title/songname/content description // dcterms:title | string, variable length | {0,1} | Title of the work captured in the digital manifestation. Where is the title taken from currently? The originating spreadsheet? It also seems this field is currently used to capture the filename (see above) as an identifier preceding the title. Do we want to keep this practice? |
| digitization notes | ID3v2#COMM // dcterms:description | text / variable text up to XXX characters | {0,1} | Any digitization notes. As more specific note types emerge, discuss making those specific fields. Right now, these capture the filename when it isn't captured in the title (above). Should choose one or the other (or a different filename field) for this. |
| ??? album sort order ??? | ID3v2#TSOA | text | {0,1} | The ID3 spec says this is for "Album sort order". However, this (1) doesn't seem useful for these assets and (2) only has a value of 'frame' in the CU Lectures sample data given. Uncertain what this is meant to do and whether we should keep it. |
| ??? private frame ??? | ID3v2#PRIV or ID3v2#sec4.28 Private frame | text | {0,1} | Do we need to keep this? Doesn't seem to be used. Access and permissions would be managed in other, external metadata and file system management, no? |
| digital collection | ID3v2#TALB Album/Movie/Show title // dcterms:isMemberOf | "Cornell Department of Music recordings" / static text | {0,1} | Collection names should be pulled from the originating spreadsheet WHICH SHOULD BE NORMALIZED AND CONSISTENT FROM THE START. |
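
The goal that the MP3 confirms its collection and shares an identifier with the WAV could be spot-checked as sketched below. This is purely illustrative: it assumes both sets of metadata have already been extracted into plain dictionaries by some other tool, that the shared identifier lives in TOFN (the profile has not settled which frame carries it), and that `mp3_matches_wav` is a hypothetical helper.

```python
from pathlib import PurePath


def mp3_matches_wav(wav_header: dict[str, str],
                    mp3_frames: dict[str, str]) -> list[str]:
    """Compare extracted MP3 frames against the WAV header of record."""
    problems = []

    # The MP3 identifier is the WAV filename without its extension.
    stem = PurePath(wav_header["filename"]).stem
    if mp3_frames.get("TOFN", "") != stem:
        problems.append(f"TOFN does not match WAV identifier {stem!r}")

    # The collection is optional in the WAV header; compare only when set.
    collection = wav_header.get("digital collection", "")
    if collection and mp3_frames.get("TALB", "") != collection:
        problems.append(f"TALB does not match collection {collection!r}")

    return problems
```

An empty result means the MP3 can be traced back to its WAV and collection using only its own tags.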
