This tutorial will introduce you to GraphX and guide you through the process of creating the graph representation of a dataset in Spark. You will also learn about relevance scoring in ElasticSearch by tuning the scoring algorithm to boost attribute weights.
You will need the source code accompanying this tutorial in order to complete part 3. The source code contains a number of helper functions found in DataFrameUtils, as well as the pipeline used to create the Wikipedia datasets. Look through WikipediaPageRank to gain an understanding of GraphX pipelines.
GraphX is a graph abstraction built on top of Spark RDDs that enables graph-parallel computation. Relationships in the data are represented as directed edges in a property graph, and information about the data itself is stored in the vertices of the Graph. Because the underlying implementations of edges and vertices are specialized RDDs, graphs share many properties with RDDs (e.g. immutability and fault tolerance). Like an RDD or DataFrame, a Graph is immutable: a change to a Graph produces a new Graph. Spark makes optimizations to reduce network traffic and computational overhead when creating a new Graph from an existing Graph.
The Graph[VD, ED] is described by its set of vertices, VertexRDD[VD], and its set of edges, EdgeRDD[ED]. Vertices are accessed using the .vertices method and edges are accessed using the .edges method. A Graph can be created by passing an RDD[(VertexId, VD)] and an RDD[Edge[ED]] into the Graph constructor. Edge[ED] is a triple of type (VertexId, VertexId, ED), where the first VertexId is the source vertex, the second is the destination, and ED is the edge attribute. VertexId is an alias for Long, and ED can be any serializable type.
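The construction described above can be sketched as follows. This is a minimal, illustrative example: the vertex names, edge label, and local SparkContext setup are assumptions for demonstration, not the Task 1 dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// A minimal sketch of building a Graph from vertex and edge RDDs.
object GraphConstructionSketch {
  def build(sc: SparkContext): Graph[String, String] = {
    // RDD[(VertexId, VD)] -- VertexId is a Long; VD here is a String name
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob")))
    // RDD[Edge[ED]] -- Edge(srcId, dstId, attr); ED here is a String label
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "knows")))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
    val graph = build(sc)
    println(s"|V| = ${graph.numVertices}, |E| = ${graph.numEdges}")
    sc.stop()
  }
}
```

The same pattern scales to the Task 1 tables: people become vertices and (person, class) pairs become edges.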
Task 1(a) Represent the tables below using GraphX. Use this example in the GraphX documentation to guide you. First create an RDD[(VertexId, VD)] and then create an RDD[Edge[ED]] using the Edge case class. Paste your source code below.
| ID | Name | Position | Class |
|---|---|---|---|
| 1 | Chris | Instructor | Scala |
| 1 | Chris | Instructor | Models |
| 2 | Jason | Instructor | Methods |
| 3 | Prianna | Instructor | Methods |
| 3 | Prianna | Instructor | Spark |
| 4 | Amitoj | Instructor | Models |
| 5 | Dami | Instructor | Models |
| 6 | Peter | Instructor | Scala |
| 7 | John | Resident | Scala |
| 7 | John | Resident | Methods |
| 7 | John | Resident | Models |
| 7 | John | Resident | Spark |
Graph supports the basic operations seen on RDDs, such as filter, map, and reduceByKey, through a number of graph-specific operations found in Graph and GraphOps. These include methods that return information about the graph (degrees of vertices, |V|, |E|, etc.) or transform the graph (joins, mapVertices, etc.) in some way. The map* methods (mapVertices, mapEdges, mapTriplets) transform the properties of the graph without modifying its underlying structure and are thus structure preserving. Graph indices are retained when these methods are called, and various optimizations are made to reduce overhead. There are also structure-modifying operations such as reverse, subgraph, and groupEdges. Note that groupEdges will not repartition the edges when it is called, as it assumes that identical edges are colocated on the same partition.
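The distinction between structure-preserving and structure-modifying operators can be sketched as follows. The method names are the real GraphX API; the Graph[String, String] attribute types and the "knows" edge label are assumptions carried over from an illustrative example.

```scala
import org.apache.spark.graphx.Graph

// Sketch of structure-preserving vs. structure-modifying GraphX operators.
object GraphOpsSketch {
  // Structure preserving: only vertex attributes change, so GraphX can
  // reuse the structural indices of g.
  def upperCaseNames(g: Graph[String, String]): Graph[String, String] =
    g.mapVertices((_, name) => name.toUpperCase)

  // Structure modifying: subgraph keeps only the edges (and vertices)
  // that satisfy the given predicates.
  def dropLabel(g: Graph[String, String], label: String): Graph[String, String] =
    g.subgraph(epred = t => t.attr != label)

  // Structure modifying: reverse flips the direction of every edge.
  def reversed(g: Graph[String, String]): Graph[String, String] =
    g.reverse
}
```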
Resources
Spark GraphX Programming Guide
The weight of a term in an ElasticSearch document is determined by three factors: its term frequency, its inverse document frequency, and its field-length norm, combined with a term-boost factor. The first two factors are ones you should already be familiar with, and the third is simply the inverse square root of the number of terms in the field. The intuition is that if a term appears in a short field, the content of that field is more likely to be about the term itself. Think of it as finding a 'big fish in a small pond'. We'll discuss the term-boost factor in the section below.
The weights of multiple query terms are combined into a query vector, which is compared against each document's vector of term weights to produce a relevance score.
Take a look at the default scoring function used by ElasticSearch. You'll see that it's the product of the query normalization factor, the coordination factor, and the sum of the term weights as described above. The query normalization factor normalizes the score using the sum of squared term inverse document frequencies. The coordination factor scales the score by the fraction of query terms that appear in the document.
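The pieces combine roughly as follows. This is a plain-Scala sketch of the practical scoring function described above; in reality the tf, idf, and norm values come from the index, so the numbers used here are illustrative assumptions, and for simplicity the query normalization is computed over the matched terms only.

```scala
// Sketch of the practical scoring function:
//   score(q, d) = queryNorm(q) * coord(q, d)
//                 * sum over matched terms t of
//                   tf(t, d) * idf(t)^2 * boost(t) * fieldNorm(t, d)
object PracticalScoreSketch {
  case class TermStats(tf: Double, idf: Double, boost: Double, fieldNorm: Double)

  // Weight of a single term in a document.
  def termWeight(t: TermStats): Double =
    t.tf * t.idf * t.idf * t.boost * t.fieldNorm

  def score(matchedTerms: Seq[TermStats], totalQueryTerms: Int): Double = {
    // Query normalization: inverse square root of the sum of squared idfs.
    val queryNorm = 1.0 / math.sqrt(matchedTerms.map(t => t.idf * t.idf).sum)
    // Coordination: fraction of query terms that appear in the document.
    val coord = matchedTerms.size.toDouble / totalQueryTerms
    queryNorm * coord * matchedTerms.map(termWeight).sum
  }
}
```

Raising a term's boost scales its weight linearly, which is why boosting a clause (next section) pushes documents matching that clause up the ranking.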
#### Query-Time Boosting
Clauses within a query can be boosted to return results that are more relevant with respect to one field versus another. Boosts to a clause within a query are pushed down to individual terms and are used to compute the term weight. A query clause is boosted by setting the boost field to a numeric value (it need not be an integer) - this value represents the relative weight of that clause. See the documentation for a usage example.
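A boosted query might look like the following sketch. The field names title and text are assumptions about the Wikipedia index mapping; adjust them to match the actual mapping.

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "David Cameron", "boost": 2 } } },
        { "match": { "text": "David Cameron" } }
      ]
    }
  }
}
```

Here documents matching in the title clause score roughly twice as high as equivalent matches in the body text.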
Task 2(a) Modify your search for Wikipedia pages about 'David Cameron' by boosting the title attribute and paste your query below. See the usage example above to guide you. Compare your boosted results with the unboosted results: are there any differences? What is the first page returned by the boosted query? What other attributes could you boost to get more relevant results? What could you do to avoid pages corresponding to redirects (hint: boolean operations)?
Task 2(b) Write a query that returns the David Cameron page as its first result, using boosting.
GraphX supports a number of graph algorithms, including PageRank. Static PageRank runs for a fixed number of iterations, whereas dynamic PageRank runs until the ranks converge (i.e. until each vertex's rank changes by less than a given tolerance between iterations). We've run PageRank on the Wikipedia dataset using GraphX and added the PageRanked data as a second index in ElasticSearch, wikipedia_ranked. You'll use both to complete the remainder of this tutorial.
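The two PageRank variants can be sketched as follows. The iteration count and tolerance values are illustrative choices, not the ones used to build wikipedia_ranked.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexId}

// Sketch of GraphX's two PageRank entry points on any Graph[VD, ED].
object PageRankSketch {
  // Static PageRank: a fixed number of iterations.
  def static[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[Double, Double] =
    g.staticPageRank(numIter = 10)

  // Dynamic PageRank: iterate until every rank changes by less than tol.
  def dynamic[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[Double, Double] =
    g.pageRank(tol = 0.0001)

  // The k highest-ranked vertices from a PageRank run.
  def topRanked(ranks: Graph[Double, Double], k: Int): Array[(VertexId, Double)] =
    ranks.vertices.top(k)(Ordering.by(_._2))
}
```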
Task 3(a) Re-run the queries from tutorial 5 using the PageRanked wikipedia_ranked index. Do you notice any difference in the results that are returned? How about the queries you wrote in part 2?
Task 3(b) The PageRanked Wikipedia dataset can be found at s3://ds12-methods/labs/lab-3/data/wikipedia/ranked, and an inverse PageRanked dataset can be found at s3://ds12-methods/labs/lab-3/data/wikipedia/inverse_edges_ranks. Load both of these in addition to the dictionary found at s3://ds12-methods/labs/lab-3/data/wikipedia/linkdictionary. Compare the PageRanked dataset with the inverse PageRanked dataset. What do you observe?