Ladies and gentlemen,
Do you know what my prayer was the moment I learned my talk was selected? That I would not be allocated a slot right after lunch. Yet here we are.
You must be wondering why a dude from India has come all the way to Singapore to give a talk on Redis at a Python conference. Well, I believe you'll have the answers to those questions by the time I am done with my talk. This is intended for a beginner-level audience, so if you have already implemented Redis in your stack, you might be a little disappointed.
There are times when, in your Django web application, you need a specific piece of data to be saved. Let me give you an example. Let us say you are gathering all the tweets for the Football World Cup. You hit the Twitter API and tweets are pouring in by the second. How do you keep a counter? Of course: put a Python variable in the loop and keep incrementing.
tweets = fetch_tweets(hashtag="#WorldCup2014")  # Use the Twython library
count = 0
for tweet in tweets:
    entities = process_tweet(tweet)
    count += 1
The only problem is that if another process or view wants to display the count, it won't be able to access it.
Which means you need persistence. If you're using Postgres, or any other SQL database for that matter, you could have a field to keep the count, or maybe do a count(*) on your Tweet model each time you want the total number of tweets.
# Assume you have defined a model Tweet
count = Tweet.objects.count()
The count(*) option is going to make your SQL query execute quite slowly once you have about 20,000 rows or so.
# Assume you have defined a model Stat to store the count, with a field tweet_count
from django.db.models import F
Stat.objects.filter(hashtag="#WorldCup2014").update(tweet_count=F('tweet_count') + 1)
The next option is to increment the count within the Postgres field. Done naively (read the value into Python, add one, write it back), this has an immense potential to lead you into race conditions and thereby screw up your count; the F() expression above pushes the increment into SQL to avoid that, but you are still paying a database write for every single tweet.
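To make the race concrete, here is a sketch of that naive version (same Stat model as above), which you should avoid:
# Naive read-modify-write: two workers can interleave and lose an increment.
stat = Stat.objects.get(hashtag="#WorldCup2014")  # workers A and B both read, say, 41
stat.tweet_count = stat.tweet_count + 1           # both compute 42 in Python
stat.save()                                       # both write 42; one increment is lost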
So a fast, reliable and persistent solution is to use Redis. Believe it or not, you can use it as an actual database because of its persistence. All you need to do is get the redis server up and running on your machine and use the redis-py library to increment a "key" by one each time a new tweet comes in. You don't even need to "initialize" the key: the increment command creates the key if it is not already present and then increments it. Really neat. Hence, Redis is a persistent, key-value-based NoSQL data store.
import redis  # We are using the redis-py library
r = redis.StrictRedis()
tweets = fetch_tweets(hashtag="#WorldCup2014")
for tweet in tweets:
    entities = process_tweet(tweet)
    r.incr("tweets_count")  # creates the key at 0 if absent, then adds 1
count = r.get("tweets_count")
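Since I keep saying "persistent": out of the box, Redis snapshots its dataset to disk (RDB), and it can also log every write to an append-only file (AOF). As a minimal sketch, assuming a default local server (you would normally configure this in redis.conf rather than from Python):
import redis
r = redis.StrictRedis()
r.config_set("appendonly", "yes")  # AOF: log every write command to disk
r.bgsave()                         # RDB: take a snapshot in the background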
Now, persistence is not the only thing that makes Redis useful. Suppose you don't stop at counting tweets: you count the pictures, videos and other links from within them. And you are doing the same with Facebook as well. Now you have two sources and their corresponding fields. Intuitively, a dictionary comes to mind. One dictionary would be named "Twitter" and the other "Facebook", and each of them will have fields like "statuses", "photos", "links", and so on.
Guess what? Redis has a dictionary data type and lets you do exactly this. The range of built-in data types it provides is fantastic; for this reason, people tend to call it a data structure server.
import redis
r = redis.StrictRedis()
tweets = fetch_tweets(hashtag="#WorldCup2014")
for tweet in tweets:
    entities = process_tweet(tweet)
    r.hincrby("Twitter", "tweets_count", amount=1)
    if "photo" in entities:
        r.hincrby("Twitter", "photo_count", amount=1)
    if "video" in entities:
        r.hincrby("Twitter", "video_count", amount=1)
    if "link" in entities:
        r.hincrby("Twitter", "link_count", amount=1)
twitter_photos_count = r.hget("Twitter", "photo_count")
...
posts = fetch_fb_posts(hashtag="#WorldCup2014")
for post in posts:
    entities = process_post(post)
    r.hincrby("Facebook", "posts_count", amount=1)
    if "photo" in entities:
        r.hincrby("Facebook", "photo_count", amount=1)
    if "video" in entities:
        r.hincrby("Facebook", "video_count", amount=1)
    if "link" in entities:
        r.hincrby("Facebook", "link_count", amount=1)
fb_photos_count = r.hget("Facebook", "photo_count")
...
It supports five data types: strings, lists, sets, sorted sets and hashes (the dictionaries we just used).
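We have used strings and hashes already, and sorted sets and lists are coming up, so let me squeeze in a quick sketch of the plain set type. Counting the unique users tweeting about the hashtag is a hypothetical but natural fit (the screen_name attribute is an assumption in the spirit of the earlier examples):
import redis
r = redis.StrictRedis()
for tweet in fetch_tweets(hashtag="#WorldCup2014"):
    r.sadd("unique_tweeters", tweet.user.screen_name)  # duplicates are simply ignored
unique_tweeters_count = r.scard("unique_tweeters")     # cardinality of the set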
So, one, the persistence, and two, the data types. These two are what make Redis special.
Oh and incidentally, I am Haris Ibrahim K. V. and I am from the southernmost state of India, called Kerala. I work as a Computer Science Engineer at a small company called Eventifier. I've been a Python developer for only the past 7 months and hence have relatively little programming experience. Although I have organized conferences and workshops myself as part of my earlier job, this is my first ever talk at one. So there might be a few rusty edges. Do bear with me. Also, as a hobby and passion, I love writing.
Alright, enough with the narcissism. Let's get back to business.
Redis stores its data in a big in-memory dictionary where the keys can only be strings, but the values can be any of the five data types we mentioned earlier. Each of these data structures has its own internal implementation, which we will come to later. Let us go through a few more use cases where you can use Redis.
LEADER BOARD (using sorted sets)
Let's talk about leaderboards. What I am trying to do here is give you examples that cover all five data structures Redis provides, so that you will know what to use where and why. So, the leaderboard. I am sure you are familiar with the concept, but for those among you who are not, it is a place where the top 10 of something is shown. Top 10 or 20, it does not matter: a list of entities sorted by rank.
An example should clarify this right away. Let's go back to the football world cup example. The tweets are pouring in. Boy, reminds me of the monsoon back home. Anyway, you want to show the most retweeted tweets in descending order of their retweet count. This will give you an idea of what is trending for that particular hashtag. Now, what do you do? This is where the "sorted set" data type comes into the picture. As the name suggests, it is a set, but sorted.
What is this sorted based on? Ah, yes. When you hear "sorted set", the picture that should come to mind is a key whose value is a list of (score, member) tuples; I use "tuples" in a loose sense. Once you have that picture in mind, this is how the structure looks:
key: (score member) (score member)
All you need to do is define a key called "trending_tweets" and then use the "zadd" Redis command with the number of retweets as the score and the "tweet text + username" or something like that as the member.
import redis
r = redis.StrictRedis()
tweets = fetch_tweets(hashtag="#WorldCup2014")  # Use the Twython library
for tweet in tweets:
    entities = process_tweet(tweet)
    # redis-py >= 3.0 takes a mapping instead: r.zadd("trending_tweets", {tweet.text: tweet.retweet_count})
    r.zadd("trending_tweets", tweet.retweet_count, tweet.text)
trending_tweets = r.zrevrange("trending_tweets", 0, -1)  # highest retweet count first
You could also store the tweet ids as the members and just query your SQL database to fetch the tweets with those particular ids. This works much better, since a sorted set is still a set and it is expensive to maintain uniqueness on members that are huge chunks of text.
import redis
r = redis.StrictRedis()
tweets = fetch_tweets(hashtag="#WorldCup2014")  # Use the Twython library
for tweet in tweets:
    entities = process_tweet(tweet)
    t = Tweet.objects.create(tweet=tweet)
    r.zadd("trending_tweets", tweet.retweet_count, t.id)
trending_tweets = r.zrevrange("trending_tweets", 0, -1)
popular_tweet_list = []
for tweet_id in trending_tweets:
    # redis-py returns bytes; cast back to an integer id
    popular_tweet_list.append(Tweet.objects.get(id=int(tweet_id)))
To retrieve the top 10 in that order, use the "zrevrange" command and specify the indices. That should get you going.
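For instance, a one-liner for the top 10, with withscores=True thrown in so you can display the retweet counts too:
top_ten = r.zrevrange("trending_tweets", 0, 9, withscores=True)  # [(member, score), ...], highest first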
CACHING (using list)
This introduces a new data type as well as a useful feature.
Redis allows you to set an "expire" on keys. You specify the key name and the number of seconds in which the key should expire. You might have already guessed it: yes, you can implement a caching mechanism with this. The timeout remains valid as long as you only "alter" the key using operations such as increment, add, etc. However, if you set the key once more or delete it, the deal is off. No timeout for you.
The way to implement this is to first know what value you want cached. Save that value into Redis under a key, call expire(key, seconds), and you're done. What goes hand in hand with this is the TTL command, short for Time To Live. As you might guess, it gives you the time left before a key expires. It returns -2 if the key does not exist (for example, because it has already expired) and -1 if the key exists but no expire has been set on it. Pretty handy.
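Here is the whole dance in one quick sketch, including the "alter versus set" subtlety from a moment ago (the commented values assume a fresh key):
import redis
r = redis.StrictRedis()
r.set("greeting", "hello")
r.expire("greeting", 120)  # expire in 2 minutes
r.ttl("greeting")          # ~120: seconds left to live
r.append("greeting", "!")  # altering the value keeps the timeout
r.ttl("greeting")          # still counting down
r.set("greeting", "hi")    # setting the key afresh removes the timeout
r.ttl("greeting")          # -1: key exists, but no expire is set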
Let's go back to the Football world cup tweets example once again. Suppose you want to showcase the photos that got retweeted the most, refreshed every 5 minutes or so. You might do something like fetching the popular tweets, getting the corresponding photo urls, pushing them into a list and setting an expiry on that list's key.
import redis
r = redis.StrictRedis()
tweets = fetch_tweets(hashtag="#WorldCup2014")  # Use the Twython library
for tweet in tweets:
    entities = process_tweet(tweet)
    t = Tweet.objects.create(tweet=tweet)
    r.zadd("trending_tweets", tweet.retweet_count, t.id)
trending_tweets = r.zrevrange("trending_tweets", 0, -1)
popular_tweet_list = []
for tweet_id in trending_tweets:
    popular_tweet_list.append(Tweet.objects.get(id=int(tweet_id)))
if r.ttl("trending_photos") in [-1, -2]:  # cache missing, expired, or stuck without a timeout
    r.delete("trending_photos")  # start clean so we never append to a stale list
    for tweet in popular_tweet_list:
        r.rpush("trending_photos", tweet.media_url)
    trending_photos = r.lrange("trending_photos", 0, -1)
    r.expire("trending_photos", 120)  # Expire in 2 minutes
else:
    trending_photos = r.lrange("trending_photos", 0, -1)
Redis lists are actually double-ended: you can push to either the left or the right, and you can pop from either side as well.
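A tiny illustration of that double-endedness (assuming the same local connection r as before):
r.rpush("queue", "a")  # queue: a
r.rpush("queue", "b")  # queue: a b
r.lpush("queue", "z")  # queue: z a b
r.lpop("queue")        # returns z; queue: a b
r.rpop("queue")        # returns b; queue: a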
CREDITS
The first person I would like to thank is someone who deserves to be up on this stage giving this talk much more than I do. However, he usually prefers to be behind the scenes, getting things done and motivating people to do things. He is my colleague and the CTO of the company I work for, Mr Nazim Zeeshan, and there he is.
The second would be Sripathi. There is a company called HasGeek back in India that organizes technology conferences and workshops. They recently organized a Redis miniconf, where Sripathi gave a talk on Redis memory optimization. What I am going to present next is inspired by that talk.
Last but not least, the PyCon Singapore team, who organized everything and made this a reality. Kudos to them!
INTERNAL DATA TYPES
This is something that I picked up from what Sripathi explained. I confess I'm not an expert on this, but I thought it would spark a few minds if presented. Internally, Redis stores everything we talked about just now using 6 different encodings, picking a compact one when the data is small and switching to a heavier one as it grows.
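If you want to poke at this yourself, the OBJECT ENCODING command tells you how a key is currently stored; a small sketch (the exact encoding names vary by Redis version, so treat the comments as examples rather than guarantees):
import redis
r = redis.StrictRedis()
r.rpush("small_list", 1, 2, 3)
r.object("encoding", "small_list")  # e.g. 'ziplist' on older Redis, 'quicklist'/'listpack' on newer
r.set("number", 42)
r.object("encoding", "number")      # 'int': small integers get a compact encoding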
Refer to the slides and video for this part.