Transcript of a live-coded Python + Elasticsearch talk about text analytics

Text analytics engine!

Hey everyone! I'm @soaxelbrooke, and I'm here to show you how to build a basic text analytics engine with Elasticsearch.

Getting the data

Let's get the data first! These are Amazon product reviews, available at the URL in the command below.

$ curl http://times.cs.uiuc.edu/~wang296/Data/LARA/Amazon/AmazonReviews.zip -o reviews.zip
$ mkdir data
$ unzip -d data reviews.zip

There are actually quite a few reviews here, so we'll just use laptops, which have been extracted to data/laptops. Let's peek at what is in there...

$ cat data/laptops/B00005NBIT.json | jq .
{
  "Reviews": [
    {
      "Title": "poor",
      "Author": "mary may",
      "ReviewID": "RY9V1118JJGZ8",
      "Overall": "1.0",
      "Content": "Computer has never worked. Screen is black only. Price was cheap so I guess you get what you pay for!",
      "Date": "December 17, 2013"
    },
...
}

Looks like we have well-formatted JSON with six fields: title, author, review ID, overall score, content, and date. We won't need to do much work before getting this into Elasticsearch - but we will need to parse the score and the date into proper types.

Installing Elasticsearch

Now that we've got our data, let's install Elasticsearch.

$ brew install elasticsearch24
$ elasticsearch
$ curl localhost:9200 # check that it's running

Et voilà! We have Elasticsearch 2.4 up and running.

Now, let's install the Python libraries we'll need:

$ pip install elasticsearch tqdm certifi python-dateutil
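
If you want to sanity-check the connection from Python before going further, here's a quick sketch (assuming Elasticsearch is on the default localhost:9200; the file name is just illustrative):

''' check_es.py '''
from elasticsearch import Elasticsearch

client = Elasticsearch()  # defaults to localhost:9200

# info() hits the cluster root endpoint, same as `curl localhost:9200`
print(client.info())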

Loading Data

Now, let's get to loading some data!

''' loader.py '''
from tqdm import tqdm
import fileinput
import json
from dateutil import parser
from elasticsearch import Elasticsearch


def parse_review(data: dict) -> dict:
    ''' Parse the score into a float and the date into a datetime '''
    data['Overall'] = float(data['Overall'])
    data['Date'] = parser.parse(data['Date'])
    return data


def make_review_iter():
    ''' Make an iterator of reviews using fileinput '''
    for line in fileinput.input():
        reviews = json.loads(line).get('Reviews', [])
        for review in reviews:
            try:
                yield parse_review(review)
            except Exception:
                # Malformed score or date - log it and keep going
                print("Failed to parse review! {}".format(review))


client = Elasticsearch()

for review in tqdm(make_review_iter()):
    client.index(index='reviews', doc_type='reviews', body=review)
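
One note before we run it: indexing one document per request is fine for a demo, but slow at scale. If you have a lot more data, the library's bulk helper is much faster - a minimal sketch, reusing make_review_iter from above:

from elasticsearch import helpers

# Wrap each review in a bulk action targeting the same index and doc type
actions = ({'_index': 'reviews', '_type': 'reviews', '_source': review}
           for review in make_review_iter())
helpers.bulk(client, actions)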

Then, let's run this script for data from the laptops directory:

$ python loader.py data/laptops/*

This will load all the reviews in data/laptops/ into Elasticsearch. Try this to see some of the documents as they're loaded:

$ curl localhost:9200/reviews/_search

Poking the Data

Go get Sense! It's a Chrome plugin that helps you query Elasticsearch. Once you've got it, let's start querying! The following simply fetches some documents from our index:

GET /reviews/_search

As we can see, the data we put in is still there. We can also do interesting things, like aggregate over the data. Here, we look at the average score of reviews:

POST /reviews/_search
{
    "size": 3,
    "aggs": {
        "average_score": {
            "avg": {
                "field": "Overall"
            }
        }
    }
}

We can also aggregate over queried documents - here we will look at the average score for documents with "great" in them:

POST /reviews/_search
{
    "size": 3,
    "query": {
        "match": {"Content": "great"}
    }, 
    "aggs": {
        "average_score": {
            "avg": {
                "field": "Overall"
            }
        }
    }
}

Try changing "great" to "terrible" and see how the average score changes!

Another very interesting thing we can do in Elasticsearch is a "significant terms" aggregation - essentially a scoring of terms that are unusually frequent in the documents matched by your query, relative to the index as a whole (similar in spirit to TF-IDF). It helps us see which words are distinctive to the reviews our query finds:

POST /reviews/_search
{
    "size": 0,
    "query": {
        "match": {"Content": "terrible"}
    }, 
    "aggs": {
        "average_score": {
            "avg": {
                "field": "Overall"
            }
        },
        "sig_terms": {
            "significant_terms": {
                "field": "Content"
            },
            "aggs": {
                "average_score": {
                    "avg": {
                        "field": "Overall"
                    }
                }
            }
        }
    }
}
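
The same query can be run from Python, which is exactly what we'll do in the Flask app below. A minimal sketch using the client (same body as above, assuming the index we loaded earlier):

''' sig_terms.py '''
from elasticsearch import Elasticsearch

client = Elasticsearch()

body = {
    "size": 0,
    "query": {"match": {"Content": "terrible"}},
    "aggs": {
        "average_score": {"avg": {"field": "Overall"}},
        "sig_terms": {"significant_terms": {"field": "Content"}}
    }
}

result = client.search(index='reviews', body=body)

# Each bucket is a term that's unusually common in "terrible" reviews
for bucket in result['aggregations']['sig_terms']['buckets']:
    print(bucket['key'], bucket['doc_count'])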

Presenting the Data

So, let's actually make this usable to the outside world via a tiny Flask app! First, let's install Flask...

$ pip install flask

Now let's write a tiny little querying API for summarizing reviews that match a given query!

''' server.py '''
from elasticsearch import Elasticsearch
from flask import Flask, jsonify

app = Flask(__name__)
client = Elasticsearch()


@app.route("/")
def hello():
    return "Hello Elasticsearch!"


@app.route("/search/<search>")
def search(search):
    query = {
        "size": 3,
        "query": {
            "match": {"Content": search}
        }, 
        "aggs": {
            "average_score": {
                "avg": {
                    "field": "Overall"
                }
            },
            "sig_terms": {
                "significant_terms": {
                    "field": "Content"
                },
                "aggs": {
                    "average_score": {
                        "avg": {
                            "field": "Overall"
                        }
                    }
                }
            }
        }
    }
    return jsonify(client.search(index='reviews', body=query))

if __name__ == "__main__":
    app.run()

Running this with python server.py will start the Flask server, and let us start poking it from the browser:

$ curl localhost:5000/search/wow%20this%20is%20terrible # so what if my browser is curl

Awesome! We can try different queries and see how they stack up with each other.

This is only the faintest glimpse into the search and summary powers of Elasticsearch! At Apptentive, we're using Elasticsearch, along with a lot of other awesome tech, to turn app feedback for the largest companies into understandable, truly quantifiable information. The Python libraries for Elasticsearch are second in quality only to the Java libraries - fair enough, since Elasticsearch is written in Java. Paired with powerful inference tools like machine learning, I think Elasticsearch is going to be the querying and summary builder that helps humanity learn a lot about the world.

I'd love to hear about the projects you're working on, so please reach out! You can @ me at @soaxelbrooke!

Extra credit: installing and using NLTK vader sentiment
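
A minimal sketch of how this might look (VADER ships with NLTK; its lexicon is a one-time download, and the file name is just illustrative):

$ pip install nltk

''' sentiment.py '''
import nltk
nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos plus a compound score in [-1, 1]
print(analyzer.polarity_scores(
    "Computer has never worked. Screen is black only."))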
