Dominic Byrd-McDevitt DominicBM
{
'error': {
'code': 'modification-failed',
'info': 'Malformed input: Surname, Given Name; Family Number; Age; Birth Year; Place of Birth Zimonowitz, Annie; 96; 6; 1914; New York Zimonowitz, Milton; 96; 4; 1916; New York Zimonowitz, Jack; 96; 2; 1918; New York Fernstein, James; 97; 31; 1889; New York Fernstein, Pauline; 97; 24; 1896; Russia Fernstein, George; 97; 6; 1914; New York Fernstein, Leo; 97; 3; 1917; New York Morowitz, Morris; 98; 66; 1854; Russia Morowitz, Stella; 98; 64; 1856; Russia Morowitz, Rose; 98; 16; 1904; New York Morowitz, Jacob; 98; 27; 1893; New York Corni, Gabriel; 99; 29; 1891; Turkey Corni, Sarah; 99; 25; 1895; Turkey Corni, Stella; 99; 5; 1915; New York Corni, Simon; 99; 3; 1917; New York Corni, Celia; 99; 0; 1920; New York Corni, Morris; 99; 24; 1896; Turkey Corni, Rachael; 99; 17; 1903; Turkey Fliegal, Joseph; 100; 33; 1887; Austria Fliegal, Sadie; 100; 27; 1893; Austria Fliegal, Sidney; 100; 7; 1913; New York Fliegal, Max; 100; 4; 1916; New York Fliegal, Abraham I; 100; 0; 19
# This script requires Wand for Python. Install using the documentation at http://docs.wand-py.org/en/0.4.1/index.html before running.
import sys, os, datetime
from wand.image import Image
filenames = os.listdir(os.getcwd())  # renamed from "list" to avoid shadowing the built-in
tuples = []
for filename in filenames:
# -*- coding: utf-8 -*-
### Script courtesy of Dominic-MP. Thanks Dominic!
### NOTES: The following data is hard-coded:
### * "M251_" in source file names--this is necessary to parse the roll number. This needs to be updated for other publications or it will not be able to open the files.
### * Source XML files need to be in subdirectory titled "metadata" and have file names "M268_ROLL_metadata.xml", where "ROLL" is a four-digit number with leading zeroes
### * Following fields are all hard-coded based on M268's data: Level of description (file unit), general records type, data control group, use restriction, access restriction, online resource note, variant control number, physical occurrence, copy status, reference unit, location, media occurrence, general media type, object type, object designator, thumbnail file name.
### * All file paths must be in the form "https://opaexport-conv.s3.amazonaws.com/" + supplied path.
### * All online resources must be in the form "http://www.fold3.com/image/"
DominicBM / combine.py
Last active November 20, 2023 04:59 — forked from clingerman/combine.py
Combine multiple XML files into one (Python 2.7)
import os, re
file = 'm384-import-11.xml'
# Rolls 0201 through 0220, i.e. 'M384_0201_output.xml' ... 'M384_0220_output.xml'
filenames = ['M384_%04d_output.xml' % roll for roll in range(201, 221)]
counter = 2
outputfile = file
for fname in filenames:
    in_size = (os.stat(fname).st_size / 1000000)  # input size in megabytes (integer division under Python 2.7)
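The preview above is cut off before the files are actually merged. As a rough illustration of the idea in the gist description (combining several XML files into one), here is a minimal sketch. It is not the gist's own code: the helper name `combine_xml`, the `combined` wrapper element, and the use of Python 3 with `xml.etree.ElementTree` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def combine_xml(paths, output_path, root_tag="combined"):
    """Hypothetical helper: copy the children of each input root under one new root."""
    combined_root = ET.Element(root_tag)  # assumed wrapper element, not from the gist
    for path in paths:
        source_root = ET.parse(path).getroot()
        # Move every top-level child of this file into the combined document.
        combined_root.extend(list(source_root))
    ET.ElementTree(combined_root).write(
        output_path, encoding="utf-8", xml_declaration=True
    )
    # Return the number of top-level records written, handy for a sanity check.
    return len(combined_root)
```

A call such as `combine_xml(filenames, outputfile)` would then write one file containing the records from every roll. This sketch loads each input fully into memory, so very large rolls might call for a streaming approach instead.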
# -*- coding: utf-8 -*-
import csv, xml, re, time, os, datetime
import xml.etree.ElementTree as ET
x = 0
while x < 426:
    roll = 2 + x
    file = 'M269_' + str(roll).zfill(4)
    ## This part takes the partner XML and reformats it into more usable XML (i.e., converting attributes to elements; see http://www.ibm.com/developerworks/library/x-eleatt/). The reformatted XML is saved as a new document with "_(reformatted)" appended to the name, so that the original file is not altered.
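The comment above describes converting attributes to elements, but the preview ends before the code that does it. A minimal sketch of that kind of conversion, assuming Python 3 and `xml.etree.ElementTree`, might look like the following. The function name `attributes_to_elements` and the sample record are hypothetical and are not taken from the partner XML.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(element):
    """Hypothetical helper: turn every attribute into a child element of the same name."""
    # Recurse into the original children first, before any new children are added.
    for child in list(element):
        attributes_to_elements(child)
    # Insert one new child per attribute, in sorted order, ahead of existing children.
    for position, (name, value) in enumerate(sorted(element.attrib.items())):
        new_child = ET.Element(name)
        new_child.text = value
        element.insert(position, new_child)
    element.attrib.clear()
    return element


# Made-up sample input, used only to show the shape of the transformation.
sample = ET.fromstring('<record id="42" roll="0002"><page number="1" /></record>')
reformatted = ET.tostring(attributes_to_elements(sample), encoding="unicode")
print(reformatted)
# <record><id>42</id><roll>0002</roll><page><number>1</number></page></record>
```

Writing the result to a new file name, rather than over the input, keeps the original partner file unaltered as the comment requires.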
# -*- coding: utf-8 -*-
### NOTES: The following data is hard-coded:
### * "M268_" in source file names--this is necessary to parse the roll number. This needs to be updated for other publications or it will not be able to open the files.
### * Source XML files need to be in subdirectory titled "metadata" and have file names "M268_ROLL_metadata.xml", where "ROLL" is a four-digit number with leading zeroes
### * Following fields are all hard-coded based on M268's data: Level of description (file unit), general records type, data control group, use restriction, access restriction, online resource note, variant control number, physical occurrence, copy status, reference unit, location, media occurrence, general media type, object type, object designator, thumbnail file name.
### * All file paths must be in the form "https://opaexport-conv.s3.amazonaws.com/" + supplied path.
### * All online resources must be in the form "http://www.fold3.com/image/" + footnote ID.
### * The objects file is set to be
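The naming conventions listed in the notes above can be sketched as a few small helpers. This is only an illustration under stated assumptions: the function names, the `publication` parameter, and the sample arguments are hypothetical, while the two URL prefixes and the `M268_ROLL_metadata.xml` pattern come from the notes.

```python
# Prefixes quoted from the notes above.
S3_PREFIX = "https://opaexport-conv.s3.amazonaws.com/"
FOLD3_PREFIX = "http://www.fold3.com/image/"


def metadata_filename(roll, publication="M268"):
    """Hypothetical helper: path of a roll's source XML in the "metadata" subdirectory."""
    # ROLL is a four-digit number with leading zeroes, e.g. 2 -> "0002".
    return "metadata/{}_{}_metadata.xml".format(publication, str(roll).zfill(4))


def file_url(supplied_path):
    """Hypothetical helper: file path in the form S3 prefix + supplied path."""
    return S3_PREFIX + supplied_path


def online_resource_url(footnote_id):
    """Hypothetical helper: online resource in the form Fold3 prefix + footnote ID."""
    return FOLD3_PREFIX + str(footnote_id)
```

Keeping the publication number in a single parameter is one way to avoid the hard-coding problem the notes warn about when the script is reused for another publication.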
import requests, json, csv, urllib, argparse
## This allows the user to pass the initial Wikipedia category as an argument, such as '--c "History of the United States"'.
parser = argparse.ArgumentParser()
parser.add_argument('--c', dest='cat', metavar='CAT',
action='store')
args = parser.parse_args()
## The script will create two CSVs: one with the articles and page views, and another that is a running list of subcategories, so that the script can keep working down the list and take each new category in turn. Here, the names of the CSVs are generated from the initial category given by the user, and a set is created, starting with that category, to ensure duplicates are not added.
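The preview stops before the code that the comment above describes. A minimal sketch of the two ideas in it (deriving both CSV names from the starting category, and using a set to keep duplicate categories out of the running list) could look like this. The file name suffixes and both function names are assumptions, and no Wikipedia API call is made here.

```python
def csv_names(category):
    """Hypothetical helper: derive both CSV file names from the starting category."""
    stem = category.replace(" ", "_")
    # The "_articles" and "_subcategories" suffixes are assumed, not from the gist.
    return stem + "_articles.csv", stem + "_subcategories.csv"


def add_new_categories(seen, queue, candidates):
    """Hypothetical helper: queue only categories that have not been seen before."""
    added = []
    for name in candidates:
        if name not in seen:
            seen.add(name)      # remember it so it is never queued twice
            queue.append(name)  # running list of subcategories still to visit
            added.append(name)
    return added


start = "History of the United States"
articles_csv, subcategories_csv = csv_names(start)
seen = {start}
queue = [start]
newly_added = add_new_categories(
    seen, queue, ["Colonial history", "Civil War", "Colonial history", start]
)
print(newly_added)
# ['Colonial history', 'Civil War']
```

The set gives constant-time membership checks, while the list preserves the order in which subcategories were discovered.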
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<title>Upcoming Events</title>
<meta http-equiv="Content-Type" content="text/html;" />
<meta http-equiv="Content-Language" content="en-US" />
<link rel="icon" href="http://archives.gov/favicon.ico" type="image/x-icon" />
<link rel="shortcut icon" href="http://archives.gov/favicon.ico" />
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests, json, csv, argparse
parser = argparse.ArgumentParser()
parser.add_argument('--series_NAID', dest='series_NAID', metavar='SERIES_NAID',
action='store')
parser.add_argument('--file_units', dest='file_units', metavar='FILE_UNITS',
action='store')
18503259 07542
18475522 28003
18471472 05386
17412775 14289
17408517 27799
17408508 27773
17408507 27772
17408488 27714
17408487 27714
17408401 27426