@frutik
Last active August 22, 2020 20:56
https://stackoverflow.com/questions/787496/what-is-the-best-way-to-compute-trending-topics-or-tags/826509#826509
http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
JUN
6
Finding Trending Topics and Trending Sentiments From Live Twitter Stream Using Storm
Finding Trending Topics and Sentiment Trends from Live Twitter Stream
Rohan Sunil Karwa
Abstract
The popularity of social media, and the importance individuals attach to it, has increased significantly in the last few years. As more and more people join social networks such as Twitter and Facebook, the information flowing through these networks can give us a good understanding of what is happening in our locality, state, nation, or even the world. The conceptual motive behind the project is to quantify the information that flows through Twitter by finding trending topics from the live Twitter stream; the underlying technical motive is to build a scalable system and to face and solve the challenges encountered during its construction. The project therefore aims at building a system that finds trending topics from the live Twitter stream. For each trending topic, the system also shows a sentiment graph of how positive and negative sentiments evolve while the topic is trending. The system uses Storm to handle and process the Twitter data stream in a distributed fashion. It is deployed on an Amazon EC2 server and is freely available to the public.
Keywords: Trending Topics, Sentiment, Twitter, Storm
1. Introduction
The popularity of social media, and the importance individuals attach to it, has increased significantly in the last few years. Recently, Facebook announced that 1.11 billion people use its website each month, while 665 million people are active each day [1]. The story is similar for another popular social website, Twitter. Twitter has 554 million active registered users per month, 190 million unique site visitors every month, and 58 million tweets posted every day [2]. These figures clearly show the importance of social media in our day-to-day life. Social media provides a platform through which people can easily communicate their thoughts, ideas, and beliefs, and share them with other people (called Friends on Facebook, or Followers on Twitter). Advances in technology and in scalable distributed systems allow the information shared by users to reach other people in a mere fraction of a second. I personally find this amazing. A social network can be thought of as one giant network, much like the internet, with people as nodes that constantly feed information into the network via messages/tweets. Thus, the information flowing through social media can give us a good understanding of what is happening in our locality, state, nation, or even the world.
As part of this project, I studied a way of quantifying the information that flows through a social network site by finding Trending Topics from the live Twitter stream. The notion of “Trending” is a bit ambiguous and needs a more precise definition, which is provided in subsequent sections of the report. For now, we can assume that “Trending Topics” are the dominant concepts currently being talked about in the social network (Twitter). “Trending Topics” are time dependent: as time passes, new concepts may start to dominate the conversation in the social network. Here, “talking about” a concept means mentioning it in tweets. Although Twitter already lets users see the current “Trending Topics”, in this project I have implemented a similar system, with the technical motive of understanding how such systems can be built and of experiencing the technical challenges involved in building one.
When people tweet, the message is never random; it is associated with some topic/concept. As humans, whenever we write or tweet something worth sharing, we generally attach some emotion to it. The words chosen for a tweet give a good idea of the user's sentiment about the concept/topic. So, if we do basic sentiment analysis on the tweets associated with a topic, we can get a fair idea of whether the topic is being talked about positively or negatively on Twitter. Hence, apart from finding trending topics, the developed system also shows how the sentiment associated with each current trending topic evolves. This broadly helps us classify a current trending topic into one of three groups: positively talked about, negatively talked about, or controversial, with a mix of positive and negative opinions. Twitter does not provide this analysis, and it would help users get richer information about the trending topics.
2. Technical Motivation
The Introduction section mentions the high-level goals of the project. This section describes the technical motivation for the project. The technical goals are:
· Learning about data streams as a field: In recent years, data streams have received much attention because of the large amount of data being generated and the need to process and analyze it in real time. Examples of such applications include financial applications (stock monitoring), network monitoring (packet monitoring), security, telecommunications data management, web applications, manufacturing, sensor networks, and others [3]. Traditional data analysis approaches are difficult to apply to data streams, primarily for the following reasons:
o The data in a stream is unbounded, so the algorithm must be independent of the data size.
o Because of the size of the data, it is generally not possible to persist the incoming data; hence any algorithm operating on a data stream should avoid reading or processing the data more than once.
o Most data stream problems expect a real-time answer, so all computation must be done on the fly.
o Traditional algorithms are developed to give exact results, whereas for data streams approximations are generally accepted.
Finding trending topics from the live Twitter stream also falls into this class of problems. The project provides an opportunity to learn about this field and to develop a robust approach with the characteristics mentioned above.
· Twitter API: These days, most studies in the field of social networks use data from popular social network sites such as Twitter and Facebook. The project offers an opportunity to learn about the Twitter Streaming API [4] and how to develop an application that consumes the data retrieved from the Streaming API at run time.
· Storm: Storm [5] is an open source, real-time distributed system developed for processing unbounded streams of data with reliable processing guarantees. Storm was originally developed at BackType, a company that was acquired by Twitter, and Twitter later released it as an open source project [6]. One of the main learning goals of the project is learning about Storm. More details about Storm can be found in the Background section of this report.
· Sentiment Analysis: One reason for doing sentiment analysis on the trending topics was to learn how this analysis can be done on real-time data. The project provides an opportunity to survey different ways of doing sentiment analysis and to learn about popular sentiment score dictionaries.
· Building an end-to-end system: As mentioned in the Introduction section, Twitter itself provides information about the topics that are currently trending. One of the aims of the project is to develop a robust end-to-end system that handles and processes the data from the live Twitter stream, summarizes and ranks the trending topics and sentiment results, and shows them to the end user in a UI that can be interpreted intuitively.
· Server deployment on Amazon EC2 [7]: One of my personal goals for the last half year was to learn about EC2 and deploy a system on it that runs 24x7. This project gave me an opportunity to accomplish this long-pending goal.
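To make the data stream constraints listed above (one pass over the data, memory independent of stream size, approximate answers) more concrete, here is a minimal Python sketch of the Misra-Gries frequent-items algorithm. It is only an illustration of that class of algorithm and is not the approach used in this project; the function name `misra_gries` and the parameter `k` are chosen for this example.

```python
def misra_gries(stream, k):
    """One-pass approximate frequent items using at most k - 1 counters.

    Any item that occurs more than len(stream) / k times is guaranteed
    to survive; the stored counts are underestimates of the true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the
            # ones that reach zero. Memory never grows beyond k - 1.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    words = ["storm"] * 5 + ["a", "b", "c"] + ["storm"] * 2
    print(misra_gries(words, k=2))
```

Each item is looked at exactly once and the memory used depends only on `k`, not on the length of the stream, at the price of approximate counts.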
3. Background
Before going into the details of the built system, this section provides the required background on the topics/concepts used in subsequent sections of the report. Readers who are familiar with Storm and Z-Score can safely skip this section.
3.1 Storm
Storm [5] is a distributed real-time computation system. Storm can be thought of as a system like Hadoop, but the main difference is that Hadoop processes data in batches from the Hadoop Distributed File System (HDFS), whereas Storm processes unbounded streams of data in real time. Storm is fast: a benchmark showed Storm processing one million tuples per second per node [5].
A Storm cluster is similar to a Hadoop cluster. On a Hadoop cluster we run “Map-Reduce jobs”, whereas on a Storm cluster we run “Topologies”. The key difference between “Topologies” and “Jobs” is that Jobs are expected to complete after some finite amount of time, while Topologies process data forever [8].
A Storm cluster has two kinds of nodes: the master node and worker nodes. The master node runs the “Nimbus” daemon, which is responsible for distributing code across the cluster machines, assigning tasks to machines, and monitoring for machine/task failures. Each worker node runs a daemon called the “Supervisor”, which listens to “Nimbus” for its task assignments and starts and stops processes on the worker node as necessary. Each worker is responsible for executing some part of the topology. A typical Storm topology runs across many worker nodes.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services [9]. Storm uses ZooKeeper for all coordination between Nimbus and the Supervisors. All state is kept in ZooKeeper, so Nimbus and the Supervisors are stateless. Thus, if a worker node or the Nimbus node crashes, it can be restarted and resume as if nothing had happened, which makes a Storm cluster remarkably stable. Figure 1 below shows the high-level interactions:
Figure 1: Storm: Interaction between Nimbus, Zookeeper and Supervisor [8]
In Storm, a “Topology” is a graph of computation. Each node in the topology contains processing logic, and the links between nodes indicate how data should flow between them. A topology is created and then submitted to the Storm cluster for processing. Another core abstraction in Storm is the “Stream”, an unbounded sequence of tuples. Storm provides primitives for transforming one stream into another reliably and in a distributed fashion.
The most important primitives for stream transformations are “spouts” and “bolts”. A “spout” is a source of a stream. For example, a “spout” can wrap the Twitter Streaming API, taking incoming tweets and emitting them as a “stream” in the Storm “topology”. A “bolt” consumes any number of input “streams”, does some processing, and possibly emits new streams. Using “spouts” and “bolts”, complex topologies can be designed in Storm. Figure 2 below shows an example:
Figure 2: Some Storm Topology [8]
Once the topology is created, a different number of worker nodes can be assigned to each type of spout/bolt. All worker nodes assigned to a specific bolt/spout execute the same bolt/spout logic in parallel.
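The spout/bolt model described above can be mimicked in plain Python with generators. This is only a single-process sketch of the data flow, not Storm's actual API (which is Java-based); the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, and a finite list stands in for what would be an unbounded stream in Storm.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream; emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: consumes word tuples and keeps a running count per word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


def run_topology(sentences):
    # Wire the graph of computation: spout -> split bolt -> count bolt.
    return count_bolt(split_bolt(sentence_spout(sentences)))


if __name__ == "__main__":
    print(run_topology(["Storm is fast", "Storm is distributed"]))
```

In a real topology, each of these stages would run in parallel on several worker nodes, and the stream would never end.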
Another important concept in Storm is the shuffling of data between two bolts/spouts. Shuffling determines how data is passed from the worker nodes of one spout/bolt to the worker nodes of another. For example, consider a simple topology with one spout and one bolt. The spout emits words, and the bolt does a word count (we need to know the number of occurrences of each word in the data stream). Since multiple worker nodes execute the same bolt logic, each worker node of the bolt is responsible for tracking the occurrences of some set of words. We therefore expect the worker nodes of the spout to always send a given word to the same worker node of the bolt. Storm supports two important types of shuffling: random shuffling and field shuffling (called shuffle grouping and fields grouping in Storm's API). In random shuffling, the data is sent to the worker nodes at random. In field shuffling, data with the same value of a specified field always goes to the same worker (similar to how data with the same key goes to the same reducer in Hadoop). In the word count example above, field shuffling on the word can be used to make sure that the same worker node processes every occurrence of a specific word.
In this project, a Storm topology is created in order to find trending topics from the live Twitter stream. Subsequent sections provide more information about it.
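The difference between the two types of shuffling can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not Storm's internal implementation: the function names are hypothetical, and `zlib.crc32` is used only as a convenient stable hash.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng=random):
    """Random shuffling: pick a bolt task regardless of tuple contents."""
    return rng.randrange(num_tasks)


def fields_grouping(word, num_tasks):
    """Field shuffling: a stable hash of the field picks the bolt task,
    so the same word is always routed to the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def route_words(words, num_tasks):
    """Simulate field shuffling: return the word counts held by each task."""
    tasks = [dict() for _ in range(num_tasks)]
    for word in words:
        task = tasks[fields_grouping(word, num_tasks)]
        task[word] = task.get(word, 0) + 1
    return tasks


if __name__ == "__main__":
    for index, counts in enumerate(route_words(["a", "b", "a", "c", "a"], 3)):
        print(index, counts)
```

With field shuffling, the full count for a word lives on exactly one task. With random shuffling, the counts for one word would be spread over several tasks and would have to be merged later.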
3.2 Z-Score
In statistics, the Z-Score, also called the “Standard Score”, indicates by how many standard deviations an observation is above or below the mean [10]. The Z-Score is a dimensionless quantity and can be easily computed for any raw score (x) using the formula below:
Figure 3: Formula for calculating Z-score: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation
The Z-Score can be used to model time-series data. Given a set of time-series points for an observation, the mean and standard deviation can be easily calculated. When we receive a new time-series point, we can use the computed mean and standard deviation to find the Z-score of the new point. A high positive Z-score indicates that the new point is significantly above expectation, while a high negative Z-score indicates that it is significantly below expectation.
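As a small worked example of this idea, the Python sketch below scores a new time-series point against its history. It assumes the population standard deviation and, as a choice made only for this sketch, returns 0.0 when the history has no variation; the function name `z_score` is hypothetical.

```python
from statistics import mean, pstdev


def z_score(history, new_value):
    """Z-score of new_value relative to the points seen so far."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation (assumption)
    if sigma == 0:
        # Flat history: the Z-score is undefined, so report no signal.
        return 0.0
    return (new_value - mu) / sigma


if __name__ == "__main__":
    mentions_per_interval = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, std dev 2
    print(z_score(mentions_per_interval, 9))  # two std devs above the mean
    print(z_score(mentions_per_interval, 1))  # two std devs below the mean
```

A topic whose latest mention count has a high Z-score against its own history is being mentioned far more than usual.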
In this project, Z-score is used for ranking the trending topics.