Skip to content

Instantly share code, notes, and snippets.

@cmk
Created September 21, 2016 17:34
Show Gist options
  • Select an option

  • Save cmk/5ecbf69e20d60e9d8452d0e68e3199d7 to your computer and use it in GitHub Desktop.

Select an option

Save cmk/5ecbf69e20d60e9d8452d0e68e3199d7 to your computer and use it in GitHub Desktop.

Practical PageRank

Introduction

This tutorial will introduce you to GraphX and guide you through the process of creating the graph representation of a dataset in Spark. You will also learn about relevance scoring in ElasticSearch by tuning the scoring algorithm to boost attribute weights.

Part 0

You will need the source code accompanying this tutorial in order to complete part 3. The source code contains a number of helper functions found in DataFrameUtils, as well as the pipeline used to create the Wikipedia datasets. Look through WikipediaPageRank to gain an understanding of GraphX pipelines.

Part 1: Introducing GraphX

GraphX is a graph abstraction built on top of Spark RDDs that allows graph-parallel computation. Relationships in data are represented as directed edges in a property graph and information about the data itself is stored in the vertices of the Graph. Because the underlying implementation of edges and vertices are specialized RDDs, graphs share many properties with RDDs (e.g. immutability and fault-tolerance). Like an RDD or DataFrame, a change to a Graph produces a new Graph. Optimizations are made by Spark to reduce network traffic and computational overhead when creating a new Graphfrom an existing Graph.

Part 1(a): Vertices and Edges

The Graph[VD, ED] is described by its set of vertices, VertexRDD[VD], and its set of edges, EdgeRDD[ED]. Vertices are accessed using the .vertices method and edges are accessed using the .edges method. A Graph can be creating by passing in an RDD[(VertexId, VD)] and RDD[Edge[ED]] into the Graph constructor. Edge[ED] is a triplet of type (VertexId, VertexId, ED), where the first VertexId is the source vertex, the second is the destination, and ED is the edge attribute. VertexId is an alias for type Long, and ED can be any serializable type.

Task 1(a) Represent the tables below using GraphX. Use this example in the GraphX documentation to guide you. First create an RDD[(VertexId, VD)] and then create an RDD[Edge[ED]] using the Edge case class. Paste your source code below.

ID Name Position Class
1 Chris Instructor Scala
1 Chris Instructor Models
2 Jason Instructor Methods
3 Prianna Instructor Methods
3 Prianna Instructor Spark
4 Amitoj Instructor Models
5 Dami Instructor Models
6 Peter Instructor Scala
7 John Resident Scala
7 John Resident Methods
7 John Resident Models
7 John Resident Spark

Part 1(b): Operations on Graphs

Graph supports the basic operations seen in RDDs such as filter, map, and reduceByKey through a number of graph-specific operations that can be found in Graph and GraphOps. These include methods that return information about the graph (degrees of vertices, |V|, |E|, etc) or transform the graph (joins, mapVertices, etc) in some way. The mapFoo methods transform the properties of the graph without modifying its underlying structure and are thus structure preserving. Graph indices are retained when these methods are called and various optimizations are made to reduce overhead. There are also structure modifying operations such as reverse, subgraph and groupEdges. Note that groupEdges will not repartition the edges when it is called, as it assumes that identical edges are colocated on the same partition.

Resources

Spark GraphX Programming Guide

Part 2: Relevance Scoring in ElasticSearch

The weight of a term in an ElasticSearch document is determined by three factors: its term frequency, its inverse document frequency, its field length norm, and a term-boost factor. The first two factors are ones you should already be familiar with and the third is simple the inverse square root of the number of terms in the field. The intuition behind this is that, if a term appears in a short field, it's more likely that the content of that field is about the term itself. Think of it like finding a 'big fish in a small pond'. We'll discuss the term-boost factor in the section below.

Weights of multiple terms are combined into a vector and are compared against the vector representation of documents with respect to term weights.

Part 2(a): Scoring Functions

Take a look at the default scoring function used by ElasticSearch. You'll see that it's the product of the query normalization factor, the coordination factor, and the sum of the term weights as described above. The query normalization factor normalizes the score using the sum of squared term inverse document frequencies. The coordination factor scales the score by the fraction of query terms that appear in the document.

####Query-Time Boosting

Clauses within a query can be boosted to return results that are more relevant with respect to one field versus another. Boosts to a clause within a query are pushed down to individual terms and are used to compute the term weight. A query clause is boosted by setting the boost field to an integer value - this value represents the relative weight of that clause. See the documentation for a usage example.

Task 2(a) Modify your search for Wikipedia pages about 'David Cameron' by boosting the title attribute and paste your query below. See the usage example above to guide you. Compare your boosted results with the unboosted results: are there any differences? What is the first page returned by the boosted query? What other attributes could you boost to get more relevant results? What could you do to avoid pages corresponding to redirects (hint: boolean operations)?

Task 2(b) Write a query that returns the David Cameron page as its first result, using boosting.

Part 3: PageRank with GraphX and ElasticSearch

GraphX supports a number of graph algorithms, including PageRank. Static PageRank runs for a fixed number of iterations, where as dynamic PageRank runs until a stationary distribution is reached. We've run PageRank on the Wikipedia dataset using GraphX and added the PageRanked data as a second index in ElasticSearch, wikipedia_ranked. You'll use both to complete the remainder of this tutorial.

Task 3(a) Re-run the queries from tutorial 5 using the PageRanked wikipedia_ranked index. Do you notice any difference in the results that are returned? How about the queries you wrote in part 2?

Task 3(b) The PageRanked Wikipedia dataset can be found at s3://ds12-methods/labs/lab-3/data/wikipedia/ranked, and an inverse PageRanked dataset can be found at s3://ds12-methods/labs/lab-3/data/wikipedia/inverse_edges_ranks. Load both of these in addition to the dictionary found at s3://ds12-methods/labs/lab-3/data/wikipedia/linkdictionary. Compare the PageRanked dataset with the inverse PageRanked dataset. What do you observe?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment