askmeegs · December 18, 2015 18:19
diff --git a/process_tweets.json b/process_tweets.json
 {
 "metadata": {
  "name": "process_tweets"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "process_tweets.py"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Edited version of Eni's graph provenance code which imports tweet data from JSON files and compiles essential information into a tweets and users dictionary. Writes the tweet dict to a new JSON."
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "6/20/13"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "import os, json\nfrom datetime import datetime",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "get files:"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "files = [f for f in os.listdir('data/') if f.endswith('.json')]",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print \"Total number of files: \", len(files)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Total number of files:  1\n"
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "we are testing this with the hastings json file."
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "We define the store_data function to store the user's information in the userDict, and to store in the tweetsDict information about whether the tweet is a retweet, its author, the text, the datetime, the retweet count, the urls in the tweet, and the file from which the tweet came:"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "def store_data(t, rtstatus, filename):\n\trt = rtstatus\n\tid = t['id_str']\n\tuser = t['user']\n\tusersDict[user['id_str']] = (user['screen_name'], user['description'])\n\ttext = t['text']\n\tdate = t['created_at']\n\tcounts = t['retweet_count']\n\turls = t['entities']['urls']\n\tf = filename\n\ttweetsDict[id] = (rt, user, text, date, counts, urls, f)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Define empty dictionaries for the tweets and users. Define an empty list for tweets that cannot be stored:"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "tweetsDict = {}  # store ALL tweet IDs, together with the timestamp of the tweet\nusersDict = {}   # store the \"user\" field of a tweet\nnotFound = []    # store dictionary that are not tweets (in case the collection contains non-tweet data)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "iterate through the files list and their tweets, calling storeData if the tweet is an original, and also if it is a retweet. If the tweet is a retweet, it stores the retweet as well as the original tweet (if the original tweet is original):"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "for f in files:\n    tweets = json.load(open('data/' + f))\n    for t in tweets:\n        try:\n\t\t\tif 'retweeted_status' in t:\n\t\t\t\tstore_data(t, True, f) #stores retweets in addition to their original tweets\n\t\t\t\torigTweet = t['retweeted_status']\n\t\t\t\tif origTweet['id_str'] not in tweetsDict:\n\t\t\t\t\tstore_data(origTweet, False, f)\n\t\t\telse:\n\t\t\t\tstore_data(t, False, f)\n        except:\n            notFound.append(t)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "print the length of the dictionaries to verify that data was stored:"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print len(usersDict)\nprint len(tweetsDict)\nprint len(notFound)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "1220\n1364\n0\n"
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "This means that it found 1220 users, 1364 different tweets, and encountered zero non-tweets."
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Lastly, we define an output filename and dump the tweetsDict to a JSON:"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "filename = \"processedhastings.json\"\njson.dump(tweetsDict, open(filename, 'w'))",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Then the users dict:"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "userfilename = \"hastingsusers.json\"\njson.dump(usersDict, open(userfilename, 'w'))",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    }
   ],
   "metadata": {}
  }
 ]
 }
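For anyone who wants to try the retweet handling without the data/ directory, here is a minimal Python 3 sketch of the same logic applied to an in-memory list. It is not the notebook's code: `process`, `make_tweet`, the explicit `tweets`/`users` arguments, and the sample tweets are all assumptions made for illustration. Only the field names (`id_str`, `user`, `retweeted_status`, and so on) come from the cells above.

```python
import json


def store_data(tweet, is_retweet, filename, tweets, users):
    """Record one tweet and its author in the given dictionaries."""
    user = tweet['user']
    users[user['id_str']] = (user['screen_name'], user['description'])
    tweets[tweet['id_str']] = (
        is_retweet, user, tweet['text'], tweet['created_at'],
        tweet['retweet_count'], tweet['entities']['urls'], filename,
    )


def process(tweet_list, filename):
    """Hypothetical wrapper around the notebook's loop; returns the three collections."""
    tweets, users, not_found = {}, {}, []
    for t in tweet_list:
        try:
            if 'retweeted_status' in t:
                store_data(t, True, filename, tweets, users)
                orig = t['retweeted_status']
                # Store the embedded original only if it has not been seen yet
                if orig['id_str'] not in tweets:
                    store_data(orig, False, filename, tweets, users)
            else:
                store_data(t, False, filename, tweets, users)
        except (KeyError, TypeError):
            # Entry is missing the expected tweet fields
            not_found.append(t)
    return tweets, users, not_found


def make_tweet(tid, uid, text, retweeted=None):
    """Build a made-up tweet with only the fields that store_data reads."""
    t = {
        'id_str': tid,
        'user': {'id_str': uid, 'screen_name': 'user' + uid, 'description': ''},
        'text': text,
        'created_at': 'Thu Jun 20 12:00:00 +0000 2013',
        'retweet_count': 0,
        'entities': {'urls': []},
    }
    if retweeted is not None:
        t['retweeted_status'] = retweeted
    return t


original = make_tweet('1', '10', 'hello')
sample = [
    make_tweet('2', '20', 'RT hello', retweeted=original),  # a retweet
    original,                                               # the original itself
    {'delete': {}},                                         # a non-tweet entry
]
tweets, users, not_found = process(sample, 'sample.json')
print(len(users), len(tweets), len(not_found))  # prints: 2 2 1
```

The retweet and its embedded original are stored under separate IDs, so seeing the original again later in the list does not add a new entry. Passing the dictionaries as arguments, rather than relying on module-level globals as the notebook does, is a design choice of this sketch that makes it easy to run on small samples.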