@rcarmo
Created September 16, 2011 12:20
High-fidelity Twitter Archive (includes metadata and avatars)
#!/usr/bin/python
'''Dump tweets older than the cutoff to archive.json.gz, then delete them (Python 2).'''
import json, urllib, base64, gzip, datetime, time
import tweepy, session  # session: local wrapper that does the OAuth

# Attributes copied from each tweepy status and user object
attrs = {
    'status': ['author', 'contributors', 'coordinates', 'created_at', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'place', 'retweet_count', 'retweeted', 'source', 'source_url', 'text', 'truncated', 'user'],
    'user': ['default_profile', 'default_profile_image', 'description', 'id', 'id_str', 'location', 'name', 'profile_image_url', 'protected', 'screen_name', 'time_zone', 'url', 'utc_offset', 'verified'],
}

images = {}  # avatar cache: image URL -> data URI

def getimage(uri):
    '''Return the image at uri as a base64 data: URI, fetching it only once.'''
    # we'll use urlretrieve because it has implicit caching, but still...
    if uri not in images:
        (filename, headers) = urllib.urlretrieve(uri)
        with open(filename, 'rb') as f:
            data = base64.b64encode(f.read())
        images[uri] = ("data:%s;base64,%s" % (headers['Content-Type'], data)).strip()
    return images[uri]

# retweet list, author, user,
def main():
    # Only tweets older than four days are archived and deleted
    cutoff = datetime.datetime.now() + datetime.timedelta(days=-4)
    api = session.doAuth()
    out = gzip.open('archive.json.gz', 'ab')  # append mode: each run adds to the archive
    for status in tweepy.Cursor(api.user_timeline, include_rts=True).items(3000):
        item = dict((x, getattr(status, x)) for x in attrs['status'])
        for u in ['author', 'user']:
            item[u] = dict((x, getattr(item[u], x)) for x in attrs['user'])
            # Replace the avatar URL with the image itself, as a data URI
            item[u]['profile_image_url'] = getimage(item[u]['profile_image_url'])
        if item['created_at'] < cutoff:
            print(item['created_at'])
            # One JSON object per line; default=str covers datetimes and other non-JSON values
            out.write(json.dumps(item, default=str) + '\n')
            out.flush()
            try:
                api.destroy_status(item['id'])
                time.sleep(10)  # pause between deletions
            except Exception as e:
                print(e)
    out.close()

if __name__ == "__main__":
    main()

rcarmo commented Sep 16, 2011

Okay, so, some notes:

  • storing the avatar images inline takes up very little extra space, since gzip compresses the repeated copies well
  • gzip files are appendable (not many people know that)
  • this will archive (and delete) your retweets as well
  • the session module is simply a little wrapper that does the OAuth (and has my tokens inline).
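
A minimal sketch of the appendable-gzip point above, written for Python 3 rather than the Python 2 of the script. The helper names (`append_records`, `read_records`), the temporary path, and the sample records are all made up for illustration; only the `'ab'` mode and the `archive.json.gz` name come from the script. Each append adds a new gzip member to the file, and reading it back walks through all the members in order.

```python
import gzip
import json
import os
import tempfile

def append_records(path, records):
    """Append records as JSON lines; each call adds a new gzip member."""
    with gzip.open(path, "ab") as out:
        for record in records:
            out.write((json.dumps(record) + "\n").encode("utf-8"))

def read_records(path):
    """Read every record back; the gzip module reads all members in sequence."""
    with gzip.open(path, "rb") as src:
        return [json.loads(line) for line in src]

# Hypothetical sample data, written in two separate "runs".
path = os.path.join(tempfile.mkdtemp(), "archive.json.gz")
append_records(path, [{"id": 1, "text": "first run"}])
append_records(path, [{"id": 2, "text": "second run"}])

records = read_records(path)
print([r["id"] for r in records])
```

This is what lets the script reopen the same archive on every run without decompressing and rewriting what is already there.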
