@edsu
Created January 11, 2011 19:12
Check Simon's report of dupes for things that haven't been deleted
#!/usr/bin/env python
"""
Uses Simon's report of duplicates in the id.loc.gov LCSH data, and
looks to see which are actually duplicates that return 200 OK.
"""

import json
import urllib

# Simon's report: one "url | label" pair per line.
dupes = urllib.urlopen('http://www.ibiblio.org/fred2.0/dupes.txt')
seen = {}

for line in dupes:
    line = line.strip()
    row = line.split(" | ")
    # Skip anything that is not a "url | label" row.
    if len(row) != 2 or not row[0].startswith('http'):
        continue
    url, label = row
    url = url.strip()
    label = label.strip()
    print "fetching %s" % url
    r = urllib.urlopen(url)
    # Only count URLs that still resolve; deleted ones are ignored.
    if r.code == 200:
        if label in seen:
            seen[label].append(url)
        else:
            seen[label] = [url]

# Keep only labels that still have more than one live URL.
dupes = {}
for label, urls in seen.items():
    if len(urls) > 1:
        dupes[label] = urls

open("dupes.json", "w").write(json.dumps(dupes, indent=2))
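
The script above targets Python 2, where `urllib.urlopen` and the `print` statement exist. As a rough sketch of the same grouping logic on Python 3 (not the author's code), the version below separates parsing from the HTTP check so the logic can be tried without network access. The names `parse_report`, `returns_200` and `live_duplicates`, and the injectable `is_live` argument, are assumptions introduced here for illustration. One behavioural difference to be aware of: Python 3's `urlopen` raises an error for responses such as 404 instead of returning them, so the sketch treats any `URLError` as "not live".

```python
from collections import defaultdict
from urllib.error import URLError
from urllib.request import urlopen


def parse_report(lines):
    """Yield (url, label) pairs from 'url | label' report lines."""
    for line in lines:
        row = line.strip().split(" | ")
        # Skip anything that is not exactly 'url | label'.
        if len(row) != 2 or not row[0].startswith("http"):
            continue
        yield row[0].strip(), row[1].strip()


def returns_200(url):
    """Return True if fetching the URL answers 200 OK (hypothetical helper)."""
    try:
        with urlopen(url) as response:
            return response.status == 200
    except URLError:
        # HTTPError (404, 410, ...) is a subclass of URLError.
        return False


def live_duplicates(lines, is_live=returns_200):
    """Map each label to its live URLs, keeping labels with more than one."""
    seen = defaultdict(list)
    for url, label in parse_report(lines):
        if is_live(url):
            seen[label].append(url)
    return {label: urls for label, urls in seen.items() if len(urls) > 1}


# Made-up sample rows and a fake liveness check, so nothing is fetched.
sample = [
    "http://example.org/a | Cats\n",
    "http://example.org/b | Cats\n",
    "http://example.org/c | Dogs\n",
]
print(live_duplicates(sample, is_live=lambda url: True))
```

To reproduce the original behaviour, the lines of the report would be read (and decoded from bytes) from the `dupes.txt` URL, passed to `live_duplicates` with the default `is_live`, and the result written out with `json.dump(..., indent=2)`.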