Jason Clingerman clingerman

# -*- coding: utf-8 -*-
### NOTES: The following data is hard-coded:
### * "M268_" in source file names. This prefix is needed to parse the roll number; it must be updated for other publications or the script will not be able to open the files.
### * Source XML files need to be in a subdirectory titled "metadata" and have file names of the form "M268_ROLL_metadata.xml", where "ROLL" is a four-digit number with leading zeroes.
### * The following fields are all hard-coded based on M268's data: level of description (file unit), general records type, data control group, use restriction, access restriction, online resource note, variant control number, physical occurrence, copy status, reference unit, location, media occurrence, general media type, object type, object designator, thumbnail file name.
### * All file paths must be in the form "https://opaexport-conv.s3.amazonaws.com/" + supplied path.
### * All online resources must be in the form "http://www.fold3.com/image/" + footnote ID.
### * The objects file is set to be
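### A minimal sketch of the naming conventions listed in the notes above, for illustration only.
### The helper names (metadata_path, file_url, online_resource_url) and the sample values in the
### usage lines are hypothetical and are not part of the original script; the prefix, base URLs,
### and file-name pattern are taken from the notes.

```python
import os

PUBLICATION = 'M268'  # hard-coded publication prefix, per the notes
S3_BASE = 'https://opaexport-conv.s3.amazonaws.com/'
FOLD3_BASE = 'http://www.fold3.com/image/'


def metadata_path(roll, publication=PUBLICATION):
    """Build metadata/<PUBLICATION>_<ROLL>_metadata.xml, with ROLL zero-padded to four digits."""
    file_name = '%s_%s_metadata.xml' % (publication, str(roll).zfill(4))
    return os.path.join('metadata', file_name)


def file_url(supplied_path):
    """Prefix a supplied path with the S3 base URL."""
    return S3_BASE + supplied_path


def online_resource_url(footnote_id):
    """Prefix a footnote ID with the Fold3 image base URL."""
    return FOLD3_BASE + str(footnote_id)


# Usage with made-up sample values:
print(metadata_path(2))
print(file_url('some/dir/image.jpg'))
print(online_resource_url(12345))
```

### Keeping the prefix and base URLs in named constants is one way to make the hard-coded values
### easier to update for another publication.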
# -*- coding: utf-8 -*-
import csv, xml, re, time, os, datetime
import xml.etree.ElementTree as ET
x = 0
while x < 426:
    roll = 2 + x
    file = 'M269_' + str(roll).zfill(4)
    ## This part takes the partner XML and reformats it to more usable XML (i.e., going from attributes to elements - http://www.ibm.com/developerworks/library/x-eleatt/). The reformatted XML is saved as a new document with "_(reformatted)" appended to the name, so that the original file is not altered.