{
"metadata": {
"name": "process_tweets"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "process_tweets.py"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Edited version of Eni's graph provenance code which imports tweet data from JSON files and compiles essential information into a tweets and users dictionary. Writes the tweet dict to a new JSON." | |
},
{
"cell_type": "markdown",
"metadata": {},
"source": "6/20/13"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import os, json\nfrom datetime import datetime",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "get files:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "files = [f for f in os.listdir('data/') if f.endswith('.json')]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": "print \"Total number of files: \", len(files)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Total number of files: 1\n" | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "we are testing this with the hastings json file." | |
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We define the store_data function to store the user's information in the userDict, and to store in the tweetsDict information about whether the tweet is a retweet, its author, the text, the datetime, the retweet count, the urls in the tweet, and the file from which the tweet came:" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "def store_data(t, rtstatus, filename):\n\trt = rtstatus\n\tid = t['id_str']\n\tuser = t['user']\n\tusersDict[user['id_str']] = (user['screen_name'], user['description'])\n\ttext = t['text']\n\tdate = t['created_at']\n\tcounts = t['retweet_count']\n\turls = t['entities']['urls']\n\tf = filename\n\ttweetsDict[id] = (rt, user, text, date, counts, urls, f)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 11 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Define empty dictionaries for the tweets and users. Define an empty list for tweets that cannot be stored:" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "tweetsDict = {} # store ALL tweet IDs, together with the timestamp of the tweet\nusersDict = {} # store the \"user\" field of a tweet\nnotFound = [] # store dictionary that are not tweets (in case the collection contains non-tweet data)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 12 | |
}, | |
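{
"cell_type": "markdown",
"metadata": {},
"source": "For illustration only (a sketch, not part of the original run): calling store_data on a minimal, made-up tweet dict shows the shape of the entries it creates. The field names are the ones the function reads; the values are invented, and the sample entries are deleted again so they do not affect the real counts below."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# hypothetical, made-up tweet dict containing only the fields store_data reads\nsample_tweet = {\n    'id_str': '111',\n    'user': {'id_str': '222', 'screen_name': 'example_user', 'description': 'example bio'},\n    'text': 'example tweet text',\n    'created_at': 'Thu Jun 20 12:00:00 +0000 2013',\n    'retweet_count': 0,\n    'entities': {'urls': []}\n}\nstore_data(sample_tweet, False, 'example.json')\nprint tweetsDict['111']  # (rt flag, user dict, text, date, retweet count, urls, source file)\nprint usersDict['222']   # (screen_name, description)\n# remove the sample entries so they do not end up in the real output\ndel tweetsDict['111'], usersDict['222']",
"language": "python",
"metadata": {},
"outputs": []
},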
{
"cell_type": "markdown",
"metadata": {},
"source": "iterate through the files list and their tweets, calling storeData if the tweet is an original, and also if it is a retweet. If the tweet is a retweet, it stores the retweet as well as the original tweet (if the original tweet is original):" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "for f in files:\n tweets = json.load(open('data/' + f))\n for t in tweets:\n try:\n\t\t\tif 'retweeted_status' in t:\n\t\t\t\tstore_data(t, True, f) #stores retweets in addition to their original tweets\n\t\t\t\torigTweet = t['retweeted_status']\n\t\t\t\tif origTweet['id_str'] not in tweetsDict:\n\t\t\t\t\tstore_data(origTweet, False, f)\n\t\t\telse:\n\t\t\t\tstore_data(t, False, f)\n except:\n notFound.append(t)", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 13 | |
}, | |
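{
"cell_type": "markdown",
"metadata": {},
"source": "As an optional sanity check (a sketch, not executed in the original run), the boolean rt flag stored in the first position of each tuple lets us count how many of the stored tweets are retweets versus originals:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# count retweets vs. originals using the rt flag (the first element of each stored tuple)\nnum_retweets = sum(1 for v in tweetsDict.values() if v[0])\nprint \"retweets:  \", num_retweets\nprint \"originals: \", len(tweetsDict) - num_retweets",
"language": "python",
"metadata": {},
"outputs": []
},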
{
"cell_type": "markdown",
"metadata": {},
"source": "print the length of the dictionaries to verify that data was stored:" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "print len(usersDict)\nprint len(tweetsDict)\nprint len(notFound)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1220\n1364\n0\n"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This means that it found 1220 users, 1364 different tweets, and encountered zero non-tweets."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Lastly, we define an output filename and dump the tweetsDict to a JSON:" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "filename = \"processedhastings.json\"\njson.dump(tweetsDict, open(filename, 'w'))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Then the users dict:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "userfilename = \"hastingsusers.json\"\njson.dump(usersDict, open(userfilename, 'w'))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
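{
"cell_type": "markdown",
"metadata": {},
"source": "As a final sanity check (again a sketch, not executed in the original run), the processed file can be loaded back to confirm it round-trips. The keys are tweet id strings, and the values come back as lists because json.dump serializes the tuples as JSON arrays:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# reload the processed tweets file and inspect one entry\nprocessed = json.load(open(filename))\nprint len(processed)\nsample_id = processed.keys()[0]\nprint sample_id, processed[sample_id]",
"language": "python",
"metadata": {},
"outputs": []
}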
],
"metadata": {}
}
]
}