I used the twitter gem to pull data from Twitter's streaming API. Before writing any code, I had to create a developer account and a project on Twitter to get the API keys I would need. Once that was done, I installed the gem, started an irb session, and initialized the client.
client = Twitter::Streaming::Client.new do |config|
  config.consumer_key = "YOUR_CONSUMER_KEY"
  config.consumer_secret = "YOUR_CONSUMER_SECRET"
  config.access_token = "YOUR_ACCESS_TOKEN"
  config.access_token_secret = "YOUR_ACCESS_SECRET"
end
Once the client was initialized it was pretty straightforward to get data from the stream.
tweets = []
# Keep only English tweets that contain at least one hashtag.
client.sample do |object|
  tweets << object if object.is_a?(Twitter::Tweet) && object.lang == 'en' && object.hashtags?
end
I let the stream run for about 40 minutes and collected 3,728 example tweets. Then I cleaned the tweets and put them into a format that was easy to work with.
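tweet_data = []
tweets.each do |tweet|
  data = {}
  # Drop a leading prefix ending in ": " (such as "RT @user: "), then hashtags, then newlines.
  # \w also covers digits and underscores, and leaves commas in the text alone.
  data["tweet_text"] = tweet.text.gsub(/^(.*?)\:+\s/, "").gsub(/#+\w*/, "").gsub("\n", " ").rstrip
  data["hashtags"] = tweet.hashtags.map(&:text)
  tweet_data << data
end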
I chose to remove the hashtags from the tweet text itself, along with any newlines. I removed the hashtags because I didn't want the model to train on inputs that included the expected output.
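To make the cleaning step concrete, here is a minimal sketch that wraps a similar gsub chain in a helper and runs it on a made-up tweet. The helper name `clean_tweet_text` and the sample text are my own for illustration, and the hashtag pattern assumes hashtags consist of letters, digits, and underscores.

```ruby
# Hypothetical helper (not part of the twitter gem) that applies the cleaning steps in order.
def clean_tweet_text(text)
  text
    .gsub(/^(.*?)\:+\s/, "") # drop a leading prefix ending in ": ", such as "RT @user: "
    .gsub(/#+\w*/, "")       # drop hashtags (assumes letters, digits, underscores)
    .gsub("\n", " ")         # flatten newlines into spaces
    .rstrip                  # trim the trailing whitespace left behind
end

# Made-up sample tweet, purely for illustration.
sample = "RT @someone: Loving the weather today\n#sunny #beach"
puts clean_tweet_text(sample)
# prints "Loving the weather today"
```

Note that the first pattern removes everything up to the first colon followed by whitespace on a line, so a tweet like "Reminder: bring snacks" would lose its first word as well.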
I then split the data into training and testing sets, 80% training and 20% testing. Then I split the training data again into training and validation sets, 70% training and 30% validation. Overall that works out to 56% training, 24% validation, and 20% testing.
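The split itself can be sketched roughly as follows. This is one possible way to do it, not necessarily what I ran at the time: the `split_data` helper, the fixed seed, and the placeholder data are all assumptions made for the example.

```ruby
# Hypothetical helper: shuffle the data, then cut it at the given fraction.
# The seed only makes the shuffle repeatable; its value is arbitrary.
def split_data(data, fraction, seed: 42)
  shuffled = data.shuffle(random: Random.new(seed))
  cut = (shuffled.size * fraction).round
  [shuffled.take(cut), shuffled.drop(cut)]
end

# Placeholder data in the same shape as tweet_data.
tweet_data = (1..100).map { |i| { "tweet_text" => "tweet #{i}", "hashtags" => ["tag#{i}"] } }

# 80% training / 20% testing, then 70% / 30% within the training portion.
train_and_validation, testing = split_data(tweet_data, 0.8)
training, validation = split_data(train_and_validation, 0.7)

puts [training.size, validation.size, testing.size].inspect
# prints [56, 24, 20]
```

Shuffling before cutting keeps the sets from reflecting the order in which the tweets arrived from the stream.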
Then I used the following method to write them into their respective files.
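def write_to_files(data, input_file_name, output_file_name)
  # Open each file once in append mode instead of reopening it for every line.
  File.open(input_file_name, 'a') do |input_file|
    File.open(output_file_name, 'a') do |output_file|
      data.each do |tweet|
        # One line per hashtag, so a tweet with several hashtags is repeated in the input file.
        tweet["hashtags"].each do |hashtag|
          output_file << "<" + hashtag + ">\n"
          input_file << "<" + tweet["tweet_text"].gsub("\n", "") + ">\n"
        end
      end
    end
  end
end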
I decided to write one hashtag per line, so the training input contained duplicate tweets, each mapped to a different hashtag. If I were to do this again, I would probably have each tweet map to a list of hashtags.
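As a rough sketch of that alternative, each tweet could be written once, with all of its hashtags on the matching output line. The method name `write_grouped_to_files`, the space-separated output format, and the sample data are assumptions for illustration; I have not trained a model on this format.

```ruby
require "tmpdir"

# Hypothetical variant of write_to_files: one line per tweet, with all of
# its hashtags space-separated on the corresponding output line.
def write_grouped_to_files(data, input_file_name, output_file_name)
  File.open(input_file_name, "a") do |input_file|
    File.open(output_file_name, "a") do |output_file|
      data.each do |tweet|
        next if tweet["hashtags"].empty? # nothing to predict for this tweet
        input_file << "<" + tweet["tweet_text"] + ">\n"
        output_file << "<" + tweet["hashtags"].join(" ") + ">\n"
      end
    end
  end
end

# Made-up sample data in the same shape as tweet_data.
sample_data = [
  { "tweet_text" => "Loving the weather today", "hashtags" => ["sunny", "beach"] },
  { "tweet_text" => "New post is up", "hashtags" => ["ruby"] }
]

Dir.mktmpdir do |dir|
  input_path = File.join(dir, "input.txt")
  output_path = File.join(dir, "output.txt")
  write_grouped_to_files(sample_data, input_path, output_path)
  puts File.read(input_path)
  puts File.read(output_path)
end
```

With this layout the input file has no duplicate lines, and line N of the input always pairs with line N of the output.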