@askmeegs
Last active December 18, 2015 18:19
{
"metadata": {
"name": "process_tweets"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "process_tweets.py"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "An edited version of Eni's graph provenance code. It imports tweet data from JSON files, compiles the essential information into a tweets dictionary and a users dictionary, and writes each dictionary to a new JSON file."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "6/20/13"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import os, json\nfrom datetime import datetime",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Get the list of JSON files in the data/ directory:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "files = [f for f in os.listdir('data/') if f.endswith('.json')]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": "print(\"Total number of files: %d\" % len(files))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Total number of files: 1\n"
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We are testing this with the Hastings JSON file."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We define the store_data function to store the user's screen name and description in usersDict, and to store in tweetsDict whether the tweet is a retweet, its author, the text, the creation time, the retweet count, the URLs in the tweet, and the file from which the tweet came:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def store_data(t, rtstatus, filename):\n    tweet_id = t['id_str']  # avoid shadowing the built-in id()\n    user = t['user']\n    # Store the author's screen name and description, keyed by user ID.\n    usersDict[user['id_str']] = (user['screen_name'], user['description'])\n    # Store the tweet's essential fields, keyed by tweet ID.\n    tweetsDict[tweet_id] = (rtstatus, user, t['text'], t['created_at'],\n                            t['retweet_count'], t['entities']['urls'], filename)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Define empty dictionaries for the tweets and users. Define an empty list for tweets that cannot be stored:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "tweetsDict = {} # tweet ID -> (is_retweet, user, text, created_at, retweet_count, urls, source file)\nusersDict = {} # user ID -> (screen_name, description)\nnotFound = [] # entries that are not tweets (in case the collection contains non-tweet data)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Iterate through the files and their tweets, calling store_data on every tweet. If the tweet is a retweet, store the retweet itself and also the original tweet it points to, unless that original is already in tweetsDict:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "for f in files:\n    with open(os.path.join('data', f)) as infile:\n        tweets = json.load(infile)\n    for t in tweets:\n        try:\n            if 'retweeted_status' in t:\n                store_data(t, True, f)  # store the retweet itself\n                origTweet = t['retweeted_status']\n                # Also store the original tweet, unless it is already recorded.\n                if origTweet['id_str'] not in tweetsDict:\n                    store_data(origTweet, False, f)\n            else:\n                store_data(t, False, f)\n        except (KeyError, TypeError):\n            # Entries without the expected tweet fields are set aside.\n            notFound.append(t)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print the sizes of the two dictionaries and of the notFound list to verify that data was stored:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print(len(usersDict))\nprint(len(tweetsDict))\nprint(len(notFound))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1220\n1364\n0\n"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This means the run found 1220 users and 1364 distinct tweets, and encountered no non-tweet entries."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, we define an output filename and dump tweetsDict to a JSON file:"
},
{
"cell_type": "code",
"collapsed": false,
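"input": "filename = \"processedhastings.json\"\n# Use a context manager so the file is flushed and closed after writing.\nwith open(filename, 'w') as outfile:\n    json.dump(tweetsDict, outfile)",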
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Then we do the same for usersDict:"
},
{
"cell_type": "code",
"collapsed": false,
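"input": "userfilename = \"hastingsusers.json\"\n# Use a context manager so the file is flushed and closed after writing.\nwith open(userfilename, 'w') as outfile:\n    json.dump(usersDict, outfile)",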
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
}
],
"metadata": {}
}
]
}