The Python scripts attached here take care of the following tedious work, so you can quickly get started with real work on the corpus:
- Respect the Twitter API rate limits and throttle API hits.
- Don't hit the API for already expanded tweet IDs, so you can resume tweet expansion after stopping midway.
- Parse the API response and dump it into the correct column in the sqlite3 database.
- Gracefully handle exceptions while acquiring tweets from the API.
- Wrap version 1.1 of the Twitter API.
- Start from a specified tweet ID, assuming the input file is sorted in increasing order of tweet ID.
Note that the python-twitter module has not been updated to match v1.1 of the Twitter API yet. For now, I've monkey-patched the module to update just the methods required for our purpose.
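The throttling, resume and exception-handling behaviour listed above can be sketched roughly as follows. This is only an illustration and not the code from the attached scripts: the `tweets` table, its two columns, the `fetch` callable and the `min_interval` value are all assumptions made for the example.

```python
import sqlite3
import time


def already_expanded(conn, tweet_id):
    """Return True if tweet_id is already stored in the (hypothetical) tweets table."""
    row = conn.execute("SELECT 1 FROM tweets WHERE id = ?", (tweet_id,)).fetchone()
    return row is not None


def expand_all(conn, tweet_ids, fetch, min_interval=5.0, sleep=time.sleep):
    """Fetch every tweet not yet stored, pausing min_interval seconds after each API hit.

    `fetch` is a placeholder for whatever function calls the Twitter API and
    returns the tweet text; it is not part of any real library.
    Returns the number of API hits made.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")
    hits = 0
    for tweet_id in tweet_ids:
        if already_expanded(conn, tweet_id):
            continue  # resume support: no API hit for IDs we already have
        try:
            text = fetch(tweet_id)
        except Exception as exc:  # e.g. a deleted or protected tweet
            print("skipping %s: %s" % (tweet_id, exc))
            text = None  # store NULL so a resumed run does not retry this ID
        conn.execute("INSERT INTO tweets (id, text) VALUES (?, ?)", (tweet_id, text))
        conn.commit()  # commit per tweet so stopping midway loses nothing
        hits += 1
        sleep(min_interval)  # throttle to stay under the rate limit
    return hits
```

Committing after every tweet is slow but keeps the example simple; batching commits would be a reasonable refinement.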
The Edinburgh Twitter FSD corpus contains around 52 million tweets collected between July and mid-September, 2011, labelled with their corresponding events. You can find the corpus here in the post titled FSD corpus of tweets.
Citation:
Saša Petrović, Miles Osborne, and Victor Lavrenko. 2012. Using paraphrases for improving first story detection in news and Twitter. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT '12). Association for Computational Linguistics, Stroudsburg, PA, USA, 338-346.
There are 2 files in the corpus ZIP: tweet_ids and relevance_judgements.
The tweet_ids file contains all 52 million tweets, one per line, in the form <tweetid: Long> <username: String>. A snippet:
...
86383662232375296 Nil_Volentibus
86383662215598080 Bette_Fernandes
86383666405707776 SiiennaRose
86383666397315073 __Kaylenn
...
The relevance_judgements file contains 3034 labelled tweets in the form <tweetid: Long> <eventindex: Int, in range [1, 27]>, where each event index corresponds to one of 27 different events. The remaining tweets, which make up the vast majority of the corpus and do not appear in this file, are not labelled with any event (i.e., they are noise, random chatter, or false rumors). A snippet:
...
94801731006894080 1
94801873634213888 1
94802733437169664 1
94803010257039361 1
...
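As a quick illustration of the two line formats described above, here is a minimal parsing sketch. The function names are made up for this example and are not taken from the attached scripts.

```python
from collections import Counter


def parse_tweet_id_line(line):
    """Split a '<tweetid> <username>' line into (int, str)."""
    tweet_id, username = line.split()
    return int(tweet_id), username


def parse_judgement_line(line):
    """Split a '<tweetid> <eventindex>' line into (int, int), checking the 1..27 range."""
    tweet_id, event_index = line.split()
    event_index = int(event_index)
    if not 1 <= event_index <= 27:
        raise ValueError("event index out of range: %d" % event_index)
    return int(tweet_id), event_index


def tweets_per_event(lines):
    """Count how many labelled tweets belong to each event index."""
    return Counter(parse_judgement_line(line)[1] for line in lines if line.strip())
```

Calling `tweets_per_event(open("relevance_judgements"))` would then give a rough picture of how the labelled tweets are spread over the events.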
- Join the tweets and relevance judgements into a single file:
sort -k1 relevance_judgments > sjudgements
sort -k1 tweet_ids > stweets
join -1 1 -2 1 -a 1 -a 2 -e '0' -o '0,1.2,2.2' stweets sjudgements > tweets_and_judgements
Note that this will place a 0 in the column for event index where there is no event corresponding to a tweet.
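If `sort` and `join` are not available, roughly the same result can be produced in Python. The sketch below is an assumption-laden equivalent, not part of the attached scripts: it holds only the small judgements file in memory, emits lines in the order of the tweets file rather than in sorted order, and appends any judgement-only tweet IDs at the end with `0` in the username column (mirroring `-e '0'`).

```python
def join_tweets_and_judgements(tweet_lines, judgement_lines):
    """Yield '<tweetid> <username> <eventindex>' lines, using 0 for missing fields."""
    # The judgements file is small, so it fits comfortably in a dict.
    events = {}
    for line in judgement_lines:
        if line.strip():
            tweet_id, event_index = line.split()
            events[tweet_id] = event_index

    matched = set()  # judgement IDs that also appear in the tweets file
    for line in tweet_lines:
        if not line.strip():
            continue
        tweet_id, username = line.split()
        if tweet_id in events:
            matched.add(tweet_id)
        yield "%s %s %s" % (tweet_id, username, events.get(tweet_id, "0"))

    # Judgements with no matching tweet line (the `-a 2` case of join).
    for tweet_id, event_index in events.items():
        if tweet_id not in matched:
            yield "%s 0 %s" % (tweet_id, event_index)
```

Because the output order follows the input, the result may still need sorting if a later step expects increasing tweet IDs.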
- Register a Twitter application to receive your authorization keys.
You will need your consumer key, consumer secret, access token and access token secret.
- Ensure you have downloaded the 2 Python scripts into the working directory.
- Run the Python script:
python tweets-to-sqlite.py \
    --dbfile fsd.db \
    --tweetsfile tweets_and_judgements \
    --consumerkey <...> \
    --consumersecret <...> \
    --accesstoken <...> \
    --accesstokensecret <...>
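While the script is running (or after stopping it midway), progress can be checked by counting rows in the database file. Since the table layout is not described here, this sketch makes no assumption about it and simply lists whatever tables it finds via SQLite's built-in `sqlite_master` catalogue; the helper name `table_row_counts` is made up for the example.

```python
import sqlite3


def table_row_counts(dbfile):
    """Return {table_name: row_count} for every table in the SQLite file."""
    conn = sqlite3.connect(dbfile)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        counts = {}
        for name in names:
            # Quote the identifier, since table names cannot be bound as parameters.
            quoted = '"%s"' % name.replace('"', '""')
            counts[name] = conn.execute("SELECT COUNT(*) FROM %s" % quoted).fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    print(table_row_counts("fsd.db"))
```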