PageRank Your Data With Mortar And Pig

One of the most exciting things about Apache Pig is the level of control it gives you over data pipelines. In a pigscript, you can split or recombine parts of your data at any point, allowing you to avoid redundant computation or store intermediate results for your job. However, there is a limit to what you can do with a pigscript alone: Pig by itself has no way to specify control flow, such as loops or conditional operations.

To use Pig as part of a control flow, you can use a Jython "control script" (Jython is simply a Java-based implementation of the Python programming language) which calls and configures pigscripts using an API called "Embedded Pig". The control script can dynamically pass parameters to pigscripts based on algorithm or business logic, and can repeatedly call a pigscript in a loop, using the output of the previous iteration as the input to the next. This makes it possible to implement a wide range of iterative algorithms which would not be feasible in a pigscript alone.
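To make this concrete, here is a minimal sketch of such a control script. The pigscript path, parameter names, input location, and iteration count below are hypothetical, not taken from the project:

    from org.apache.pig.scripting import Pig

    # Compile the pigscript once; bind fresh parameters on each iteration.
    script = Pig.compileFromFile("../pigscripts/example.pig")

    input_path = "s3n://my-bucket/data/seed"  # hypothetical input location
    for i in range(3):
        output_path = "s3n://my-bucket/data/iteration_%d" % i
        bound = script.bind({"INPUT_PATH": input_path, "OUTPUT_PATH": output_path})
        stats = bound.runSingle()  # blocks until the Pig job completes
        if not stats.isSuccessful():
            raise Exception("Pig job failed on iteration %d" % i)
        input_path = output_path  # this iteration's output feeds the next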

To show you how to implement an iterative algorithm, we've made a reusable project using the Mortar Development Framework to calculate Pageranks for the nodes of any graph. This post will have four parts: first, we'll show how the project works by running it on the Twitter social graph; second, we'll change a few configuration settings to run the code on a graph of US Patent citations; third, we'll show how to configure the project to run Pagerank on your own data; and lastly, we'll discuss things to consider when using Pagerank in a production setting.

To follow along with the examples we'll be showing, you'll first need to complete the following steps:

  1. Sign up for a Mortar account

  2. Install the Mortar Development Framework

  3. Clone the mortar-pagerank repository to your computer and register it as a project with Mortar:

     git clone git@github.com:mortardata/mortar-pagerank.git
     cd mortar-pagerank
     mortar register mortar-pagerank
    

Pagerank #1: Who's the most influential on Twitter?

The first example runs Pagerank on the Twitter social graph: there is an edge from node A to node B if user A follows user B. For simplicity, we'll refer to nodes as users, but there is nothing Twitter-specific about the code other than the input data.

Here's how the controlscript controlscripts/twitter-pagerank.py works. Each of the following links is to a Github Gist which has the relevant section of the code. If you just want to see the output of who's influential on Twitter, you can skip down to the end of this section.

Setting Parameters

The first part of the controlscript just sets configuration parameters. There are two paths to input data: EDGES_INPUT, which represents the graph as a collection of directed edges between user-ids, and NODE_NAMES_INPUT, which is an index of user-ids to usernames (we use this to produce human-readable output at the end). The parameters DAMPING_FACTOR and CONVERGENCE_THRESHOLD configure the Pagerank algorithm.
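The exact values live in the controlscript; as a sketch, the settings look something like this (the paths shown here are hypothetical):

    # Hypothetical paths -- see controlscripts/twitter-pagerank.py for the real ones.
    EDGES_INPUT           = "s3n://my-bucket/twitter/edges"       # (from, to, weight) records
    NODE_NAMES_INPUT      = "s3n://my-bucket/twitter/user_names"  # (user-id, username) records
    DAMPING_FACTOR        = 0.85    # the standard choice from the original Pagerank paper
    CONVERGENCE_THRESHOLD = 0.0001  # stop when aggregate rank change falls below this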

A Preprocessing Step

The second part defines a preprocessing step. First, we compile the PREPROCESS_SCRIPT (pigscripts/pagerank_preprocess.pig). This script calculates initial pagerank values for each node in the graph. Then we bind parameters to configure the script--in this case, specifying where input should come from and where output should go to. Finally, we run the script using Embedded Pig. This operation returns a results object, which we use to access output data.
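In Embedded Pig terms, the step looks roughly like this. The bind-parameter names and the "num_nodes" alias below are illustrative, not necessarily the project's exact names:

    from org.apache.pig.scripting import Pig

    preprocess = Pig.compileFromFile(PREPROCESS_SCRIPT)
    bound = preprocess.bind({
        "INPUT_PATH": EDGES_INPUT,
        "PAGERANKS_OUTPUT_PATH": "out/pageranks_iteration_0",
        "NUM_NODES_OUTPUT_PATH": "out/num_nodes"
    })
    stats = bound.runSingle()  # a PigStats results object

    # Pull a value out of the job's output through the results object.
    num_nodes = long(str(stats.result("num_nodes").iterator().next().get(0)))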

The Pagerank Loop

The third part is the loop where we calculate new pageranks based on the old pageranks. We do the same process of compiling a pigscript, binding parameters to it, and running it. Note how we set the input and output paths based on which iteration of the loop we're in.

After many iterations, the pageranks converge and change very little from one iteration to the next. The iteration pigscript calculates a metric "aggregate_rank_change" which describes this behavior (see the project README for more detail on this). When this metric falls below the convergence threshold, we break out of the loop.
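Continuing the sketch above, the loop might look like this. MAX_NUM_ITERATIONS and the path scheme are hypothetical; "aggregate_rank_change" is the alias the iteration pigscript stores:

    from org.apache.pig.scripting import Pig

    iteration = Pig.compileFromFile(ITERATION_SCRIPT)
    for i in range(1, MAX_NUM_ITERATIONS + 1):
        stats = iteration.bind({
            "INPUT_PATH":  "out/pageranks_iteration_%d" % (i - 1),
            "OUTPUT_PATH": "out/pageranks_iteration_%d" % i,
            "DAMPING_FACTOR": DAMPING_FACTOR,
            "NUM_NODES": num_nodes
        }).runSingle()

        # Read the convergence metric calculated by the iteration pigscript.
        change = float(str(stats.result("aggregate_rank_change").iterator().next().get(0)))
        if change < CONVERGENCE_THRESHOLD:
            break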

A Choice Of Two Postprocessing Steps

The last part is a postprocessing step. This code demonstrates how you can use controlscript logic to make a script more reusable. There are two postprocessing scripts that can be used: POSTPROCESS_SCRIPT_WITH_NAME_JOIN and POSTPROCESS_SCRIPT_WITHOUT_NAME_JOIN. The former joins the iteration output, which is a list of user-ids and their pageranks, to the index of user-ids to human-readable usernames. The latter leaves the user-ids untouched. Both then order the ids/names by pagerank and store the data to s3.

If you want to verify that the algorithm worked as expected, you'll want to do the join to usernames. However, if you want to use these pageranks in a production setting, it is far more efficient to leave them in terms of the integer user-ids. So we have a configuration parameter POSTPROCESS_JOIN_WITH_NODE_NAMES: if set to True, we call the pigscript that does the join; if set to False, we call the pigscript that does not do the join.
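The controlscript logic for this choice amounts to a simple conditional. A sketch, continuing from above (parameter names are illustrative):

    from org.apache.pig.scripting import Pig

    script_path = (POSTPROCESS_SCRIPT_WITH_NAME_JOIN
                   if POSTPROCESS_JOIN_WITH_NODE_NAMES
                   else POSTPROCESS_SCRIPT_WITHOUT_NAME_JOIN)
    params = {"PAGERANKS_INPUT_PATH": last_iteration_output,
              "OUTPUT_PATH": OUTPUT_PATH}
    if POSTPROCESS_JOIN_WITH_NODE_NAMES:
        params["NODE_NAMES_INPUT_PATH"] = NODE_NAMES_INPUT
    Pig.compileFromFile(script_path).bind(params).runSingle()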


To run the Twitter example on a 5-node cluster, run the command:

mortar run twitter-pagerank -s 5

It will finish in a little under 2 hours. Here's a selection of the output if you don't want to wait. Note that the data was collected in 2010, which is why Justin Bieber is conspicuously missing.

[Image: Twitter Pagerank Output]

Pagerank #2: Which organizations produce the most influential patents?

How much code needs to be changed to get the controlscripts/twitter-pagerank script to use a different dataset? Just 7 lines of configuration settings. Here's a diff between controlscripts/patents-pagerank, which runs on a graph of US patent citations (2007-2012), and controlscripts/twitter-pagerank to prove it.

Of course, this assumes you already have a suitable dataset in the form of a directed graph. Generating a usable graph from the patent citation data required some manipulation of the data. Patents, like academic papers, are required to cite previous patents which have related innovations. Because these citations are time-constrained--you can't cite patents from the future--the citation graph is acyclic and not suitable for Pagerank.

To solve this problem, we used a "graph reduction". Instead of nodes being patents, we made the nodes the organizations to which patents are assigned (such as IBM or Microsoft). An edge from organization A to organization B has a weight equal to the # of patents assigned to A which cite patents assigned to B. This reduction makes the graph denser and permits cycles, making it suitable for Pagerank. The script pigscripts/generate_patent_citation_graph.pig implements this graph reduction in Pig.
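The project does the reduction in Pig; the idea itself is simple enough to sketch in a few lines of Python (the function and its inputs are hypothetical):

    from collections import defaultdict

    def reduce_graph(citations, assignee_of):
        # citations:   iterable of (citing_patent, cited_patent) pairs
        # assignee_of: dict mapping patent id -> organization name
        # Returns {(org_a, org_b): weight}, where weight is the # of patents
        # assigned to org_a that cite patents assigned to org_b.
        weights = defaultdict(int)
        for citing, cited in citations:
            a, b = assignee_of.get(citing), assignee_of.get(cited)
            if a and b and a != b:  # skip unassigned patents and org self-citations
                weights[(a, b)] += 1
        return weights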


To run the patents example on a 2-node cluster, run the command:

mortar run patents-pagerank -s 2

It will finish in a little under an hour. Here's how Pagerank's ranking of organizations compares to simply ranking them by # of citations.

[Image: Patents Pagerank Output]

You can see that Boehringer Ingelheim is 2nd by # of citations, but not even in the top 15 by Pagerank. Looking at the underlying graph data, we found that this was because almost all of Boehringer Ingelheim's citations came from a single company, Cypress Pharmaceutical, and no patents in the dataset's timeframe (2007-2012) cited Cypress Pharmaceutical.

This shows how Pagerank can help avoid a "fanboy effect": where one or more non-influential nodes link unusually frequently to a node, making the latter seem more influential than it ought to be. In the context of web links, this "fanboy effect" would be equivalent to link spam/farming for SEO.

Pagerank #3: Now with your data

You can use Pagerank on your own data with just three steps.

  1. Generate a directed graph from your data. There is a template script pigscripts/generate_my_graph.pig which has two examples of generating graphs to help you get started. Your graph must have the schema "from, to, weight", where "from" and "to" are node-ids (either numbers or strings) and "weight" is a number describing the strength of the link between them. It must be tab-delimited. If your graph is undirected, you need to make two records for each link A-B: one for A->B, and another for B->A (see the sketch after this list). If your graph is unweighted, simply set each weight value to 1.0.
  2. Set configuration values in controlscripts/my-pagerank.py. These consist of input and output paths, as well as parameters to the Pagerank algorithm. See the comments in the file for explanations of each.
  3. Lastly, if your node-ids are not already human-readable, you can make an index of node-ids to node names as we did for the Twitter example. Simply make an index dataset where records have the schema "node_id, node_name" using a tab delimiter; then in the controlscript set POSTPROCESS_JOIN_WITH_NODE_NAMES to True and NODE_NAMES_INPUT to the s3 location of the index dataset.
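
For the undirected case mentioned in step 1, a quick sketch of doubling each edge (the filenames are hypothetical):

    # Emit both directions for each undirected edge, tab-delimited.
    with open("undirected_edges.tsv") as src, open("graph.tsv", "w") as dst:
        for line in src:
            a, b, weight = line.rstrip("\n").split("\t")
            dst.write("%s\t%s\t%s\n" % (a, b, weight))  # A -> B
            dst.write("%s\t%s\t%s\n" % (b, a, weight))  # B -> A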

Once you have your data ready and your controlscript configured, you can run Pagerank on your own data with the command:

mortar run my-pagerank -s N

where N is the number of compute nodes appropriate to the size of your data (try 3 nodes per GB of uncompressed input data). The algorithm should take about 16-20 iterations to converge.

If you have a small dataset, or a subset of a large one which can fit in memory, you can run Pagerank on your local machine with the command:

mortar local:run my-pagerank

For small data, this can be faster than running your script in the cloud, as it avoids the overhead of Hadoop.

Using Pagerank in Production

Now that you can run Pagerank on your own data, here are a few things to consider if you want to turn your demo scripts into something usable and ready for production.

Know the scale and velocity of your data

This simple implementation of Pagerank works best for use cases where 1) the data is smaller than 100 GB uncompressed and 2) the pageranks only need to be calculated infrequently (not more than once a week or so). For reference, the Twitter example runs on a subset of the full Twitter graph which is 2.25 GB. We stop the iteration a bit early for the sake of example. If you wanted truly accurate output, it would take 16-20 iterations, which would use about 20 node-hours (~$14) of compute time.

If your data is larger than 100 GB, it will probably require a cluster larger than the 80-node maximum currently supported by Mortar. One way to get around this is to perform a "graph reduction" like we did for the patent dataset. First, aggregate groups of related nodes together into "supernodes" (for example, all the webpages in a website). Then aggregate edges between supernodes (e.g. all links from website A to website B become a single edge with a weight equal to the # of links). Finally, calculate pageranks on this reduced graph.

If your Pageranks need to be updated frequently, a paper by A. Langville and C. Meyer describes an algorithm to efficiently update an old set of Pageranks with new data.

Consider implementing "topic-specific pagerank"

Sometimes you might want to calculate how important a node is to a given topic, rather than to the entire graph that it's part of. For example, a webpage could have a "NewsRank", a "SportsRank", a "FashionRank", etc. You can calculate these relative pageranks by finding "representative subsets" of the graph for each topic--for example, NYTimes, CNN, etc. pages for "NewsRank"--and modifying the Pagerank algorithm to be biased towards these topic subsets. A paper by T. Haveliwala shows how this modification works.
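Sketching the modification in rough form: where standard Pagerank "teleports" the random surfer to a uniformly random node, the topic-specific version teleports only to nodes in the topic set S (the notation here is ours, not Haveliwala's, and assumes an unweighted graph):

    \mathrm{PR}_S(p) = (1-d)\, v_S(p) + d \sum_{q \to p} \frac{\mathrm{PR}_S(q)}{\mathrm{outdeg}(q)},
    \qquad
    v_S(p) = \begin{cases} 1/|S| & \text{if } p \in S \\ 0 & \text{otherwise} \end{cases}

Setting v_S back to the uniform vector over all nodes recovers standard Pagerank.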

Learn math

Having implemented the Pagerank iterative algorithm, it's natural to want to implement iterative algorithms for all sorts of problems, such as recommending products or classifying text or events. Unfortunately, Pagerank is pretty much the only one of these algorithms for which the math is very simple. Anything machine learning related will require some knowledge of probability and of linear algebra (combining the two, one gets a mathematical object called a Markov Chain), so we encourage you to read up a bit on these topics. For a primer on the mathematics underlying Pagerank (spoiler: it involves Markov Chains), see here.
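As a teaser, one common formulation: the pageranks are the stationary distribution of a Markov Chain over the graph's nodes. With P the row-stochastic transition matrix of the link graph, d the damping factor, and N the number of nodes, each iteration of the loop above is one step of the power iteration

    \pi^{(k+1)} = \pi^{(k)} \left( d\,P + \frac{1-d}{N}\,\mathbf{1}\mathbf{1}^{\top} \right)

and the converged pageranks are the fixed point \pi = \pi G of this update.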
