LING4100-How-I-got-the-data.md

Getting Twitter Data

I used the twitter gem to grab stream data from Twitter. Before writing any code, I needed to create a developer account and a project on Twitter to get the API keys I would need. Once that was done, I could install the gem, start an irb session, and initialize the client.

client = Twitter::Streaming::Client.new do |config|
  config.consumer_key        = "YOUR_CONSUMER_KEY"
  config.consumer_secret     = "YOUR_CONSUMER_SECRET"
  config.access_token        = "YOUR_ACCESS_TOKEN"
  config.access_token_secret = "YOUR_ACCESS_SECRET"
end

Once the client was initialized, getting data from the stream was pretty straightforward. I sampled the stream and kept only English tweets that had at least one hashtag.

tweets = []

client.sample do |object|
  tweets << object if object.is_a?(Twitter::Tweet) && object.lang == 'en' && object.hashtags?
end

I let the stream run for about 40 minutes and collected 3,728 example tweets. Then I cleaned the tweets and put them into a format that is easy to work with.

tweet_data = []

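tweets.each do |tweet|
  data = {}
  # Strip anything up to the first colon-and-space on a line (such as a leading
  # "RT @user:" prefix), remove the hashtags themselves, replace newlines with
  # spaces, and trim trailing whitespace.
  data["tweet_text"] = tweet.text.gsub(/^(.*?)\:+\s/, "").gsub(/(\#+[a-z,A-Z]*)/, "").gsub("\n", " ").rstrip
  # Keep the hashtag strings separately as the expected output.
  data["hashtags"] = tweet.hashtags.map(&:text)
  tweet_data << data
end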

I chose to remove the hashtags from the tweet text itself, as well as any newlines. I removed the hashtags because I didn't want the model to train on inputs that included the expected output.
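To make the cleaning step concrete, here is a small sketch that applies the same three substitutions to a made-up retweet. The sample text and the `clean_tweet_text` helper name are assumptions for illustration; they are not part of the collected data or the original session.

```ruby
# Hypothetical helper that wraps the substitutions used in the cleaning loop.
def clean_tweet_text(text)
  # Drop everything up to the first ": " on a line, such as a leading "RT @user: ".
  text = text.gsub(/^(.*?)\:+\s/, "")
  # Drop each "#" together with the letters (and commas) that follow it.
  text = text.gsub(/(\#+[a-z,A-Z]*)/, "")
  # Replace newlines with spaces, then trim trailing whitespace.
  text.gsub("\n", " ").rstrip
end

# Made-up example tweet, not taken from the real data set.
sample = "RT @someone: Loving the weather today #sunny\n#happy"
cleaned = clean_tweet_text(sample)
puts cleaned
```

Note that the hashtag pattern only covers letters (and, because of the comma inside the character class, commas), so a hashtag containing digits or underscores would only be partly removed.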

I then split the data into training and testing sets, using 80% for training and 20% for testing. Then I split the training data again into training and validation sets, at 70% training and 30% validation.
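The two-stage split could be sketched roughly as follows. The `split_data` helper, the fixed seed, and the shuffling step are assumptions for illustration (the write-up does not say whether the data was shuffled first), and the stand-in records take the place of the real `tweet_data` array.

```ruby
# Hypothetical helper: shuffle the records, then cut the list at the given fraction.
def split_data(data, fraction, seed: 42)
  shuffled = data.shuffle(random: Random.new(seed))
  cut = (shuffled.size * fraction).round
  [shuffled.take(cut), shuffled.drop(cut)]
end

# Stand-in records; in the write-up this would be the tweet_data array.
records = (1..100).map do |i|
  { "tweet_text" => "tweet #{i}", "hashtags" => ["tag#{i}"] }
end

# First 80% training / 20% testing, then 70% training / 30% validation.
train_full, test = split_data(records, 0.8)
train, validation = split_data(train_full, 0.7)

puts "train=#{train.size} validation=#{validation.size} test=#{test.size}"
```

With this scheme the final training set ends up being 56% of the full data (70% of 80%), the validation set 24%, and the test set 20%.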

Then I used the following method to write each set into its respective input and output files.

def write_to_files(data, input_file_name, output_file_name)
  data.each do |tweet|
    # Write one line per hashtag, so the input and output files stay line-aligned.
    tweet["hashtags"].each do |hashtag|
      File.open(output_file_name, 'a') { |f| f << "<" + hashtag + ">\n" }
      File.open(input_file_name, 'a') { |f| f << "<" + tweet["tweet_text"].gsub("\n", "") + ">\n" }
    end
  end
end

I decided to write one hashtag per line, so the training input has duplicate tweet inputs that each map to a different hashtag output. If I were to do this again, I would probably have each tweet input map to a list of hashtags.
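As a rough sketch of that alternative, the writer below emits one line per tweet and joins all of its hashtags on the matching output line. The `write_grouped` name, the space-separated hashtag format, and the use of `StringIO` in place of real files are all assumptions for illustration, not something I actually ran for this project.

```ruby
require 'stringio'

# Hypothetical alternative to write_to_files: one input line per tweet,
# with all of that tweet's hashtags joined by spaces on the output line.
def write_grouped(data, input_io, output_io)
  data.each do |tweet|
    # Skip tweets that have no hashtags, since they have no expected output.
    next if tweet["hashtags"].empty?
    input_io << "<" + tweet["tweet_text"] + ">\n"
    output_io << "<" + tweet["hashtags"].join(" ") + ">\n"
  end
end

# Made-up example record in the same shape as the cleaned tweet data.
example = [
  { "tweet_text" => "Loving the weather today", "hashtags" => ["sunny", "happy"] }
]

inputs = StringIO.new
outputs = StringIO.new
write_grouped(example, inputs, outputs)

puts inputs.string
puts outputs.string
```

With the one-hashtag-per-line approach, this example tweet would appear twice in the input file, once for each hashtag. Here it appears once, paired with both hashtags.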

CjMoore/LING4100-How-I-got-the-data.md
