Graphs encode the relationship between entities and are essential to data mining and machine learning.
Example: predicting user behavior
- start with a base classifier that estimates an individual user's political bias
- then take that estimate and apply it across a large user base
- the vast majority of users may not post frequently, but you have the follower structure (the graph), so you can use a conditional random field to propagate estimates across it
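A minimal sketch of the idea: propagate bias scores from frequent posters to quiet users by iteratively averaging over who they follow. This is a simplified stand-in for the conditional random field mentioned above; all names and the toy graph are illustrative.

```python
def propagate_labels(followers, seed_scores, iterations=10):
    """followers: dict mapping user -> set of users they follow.
    seed_scores: known bias scores in [-1, 1] for users who post enough to classify."""
    scores = dict(seed_scores)
    for _ in range(iterations):
        updated = dict(scores)
        for user, followed in followers.items():
            if user in seed_scores:
                continue  # keep observed labels fixed
            neighbor_scores = [scores[v] for v in followed if v in scores]
            if neighbor_scores:
                # quiet users inherit the average bias of who they follow
                updated[user] = sum(neighbor_scores) / len(neighbor_scores)
        scores = updated
    return scores

# alice and bob rarely post; carol and dave have known scores
graph = {"alice": {"carol"}, "bob": {"carol", "dave"}, "carol": set(), "dave": set()}
seeds = {"carol": 1.0, "dave": -1.0}
result = propagate_labels(graph, seeds)
print(result)  # alice inherits carol's score; bob averages carol and dave
```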
"triangle counting"
- count triangles passing through each vertex
- measures "cohesiveness" of a local community
- in the example given, real estate agents had the highest triangle counts
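Per-vertex triangle counting can be sketched directly: for each vertex, check which pairs of its neighbors are themselves connected. This is the naive O(V·d²) version, not the fast approximation from the paper linked below.

```python
from itertools import combinations

def triangles_per_vertex(adj):
    """adj: dict vertex -> set of neighbors (undirected graph).
    Returns the number of triangles passing through each vertex."""
    counts = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:       # edge a-b closes the triangle v-a-b
                counts[v] += 1
    return counts

# Square with one diagonal: two triangles sharing the diagonal b-d.
g = {
    "a": {"b", "d"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"a", "b", "c"},
}
counts = triangles_per_vertex(g)
print(counts)  # b and d sit on both triangles, a and c on one each
```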
How should we program these? Everyone reaches for Hadoop
- it has a very good parallelizing primitive, but it's row-based, not graph-based
- how do you do a dependency graph?
- how do you iterate on a graph with partitioning?
Pregel - Google's graph parallel abstraction
- a vertex program runs on each vertex
- that program is constrained to interact only with its neighbors (vertices that share an edge with it)
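The vertex-program model can be sketched as a superstep loop: each round, every vertex reads the messages sent to it in the previous round and sends new messages along its out-edges. The classic toy example is propagating the maximum value through the graph. This is illustrative pseudocode in plain Python, not Pregel's actual API.

```python
def pregel_max(values, edges, max_supersteps=10):
    """values: dict vertex -> initial value; edges: dict vertex -> out-neighbors.
    Runs supersteps until no vertex changes, i.e. all vertices vote to halt."""
    values = dict(values)
    inbox = {v: [] for v in values}
    for step in range(max_supersteps):
        outbox = {v: [] for v in values}
        changed = False
        for v in values:
            # vertex program: take the max of my value and incoming messages
            best = max([values[v]] + inbox[v])
            if best > values[v]:
                values[v] = best
                changed = True
            for u in edges.get(v, []):
                outbox[u].append(values[v])   # message along each out-edge
        inbox = outbox
        if step > 0 and not changed:
            break  # quiescence: every vertex halts
    return values

result = pregel_max({"a": 3, "b": 6, "c": 1}, {"a": ["b"], "b": ["c"], "c": ["a"]})
print(result)  # the maximum (6) spreads around the cycle to every vertex
```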
Example: noise reduction in image files
- Most people use adaptive belief propagation
- the problem with noisy images is that the boundaries are hard
- "This is easy to reason about with grid-like graphs, but in the real world graphs are governed by power laws and high-degree vertices"
- hard to partition high-degree vertices
- GraphLab splits vertices rather than edges (a vertex cut instead of an edge cut)
- they have some (presumably proprietary) heuristics for choosing how to split vertices
- unlike Pregel, which uses message-passing, GraphLab uses a shared-state model
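The vertex-cut idea can be sketched as follows: assign *edges* to machines rather than vertices, so a high-degree vertex appears as mirrors on every machine that holds one of its edges, while low-degree vertices stay on a single machine. This toy version places edges round-robin; GraphLab's real heuristics are smarter.

```python
from collections import defaultdict

def vertex_cut(edges, num_machines):
    """Assign each edge to a machine; track which machines mirror each vertex."""
    placement = defaultdict(list)          # machine -> edges stored there
    mirrors = defaultdict(set)             # vertex -> machines holding a mirror
    for i, (u, v) in enumerate(edges):
        m = i % num_machines               # round-robin edge placement (toy heuristic)
        placement[m].append((u, v))
        mirrors[u].add(m)
        mirrors[v].add(m)
    return placement, mirrors

# One high-degree "hub" vertex with eight edges, spread over four machines.
edges = [("hub", leaf) for leaf in "abcdefgh"]
placement, mirrors = vertex_cut(edges, num_machines=4)
print(len(mirrors["hub"]))                  # the hub is mirrored on all 4 machines
print([len(mirrors[leaf]) for leaf in "abcdefgh"])  # each leaf lives on exactly 1
```

The payoff is that no single machine has to hold all of a celebrity vertex's edges; the cost is keeping the mirrors' shared state in sync.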
PageRank example
- Alex Smola's technique can process 150 million tokens per second
- the Yahoo Altavista web graph: 1B links processed per second with 30 lines of user code
- GraphLab: ~100 tokens per second... but the code is 30 lines, written in 4 human-hours
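PageRank is the kind of iterative neighbor aggregation these throughput numbers refer to, and it genuinely fits in a few dozen lines. A plain-Python power-iteration sketch on a toy graph (the 0.85 damping factor is the conventional choice; this is not GraphLab's implementation):

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: dict vertex -> list of out-neighbors. Returns rank per vertex."""
    vertices = list(out_links)
    rank = {v: 1.0 / len(vertices) for v in vertices}  # uniform start
    for _ in range(iterations):
        contrib = {v: 0.0 for v in vertices}
        for v, outs in out_links.items():
            if outs:
                share = rank[v] / len(outs)    # split my rank among out-links
                for u in outs:
                    contrib[u] += share
        rank = {v: (1 - damping) / len(vertices) + damping * contrib[v]
                for v in vertices}
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(g)
print(ranks)  # c is linked by both a and b, so it ranks highest
```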
Python libraries
- they've built a series of Python notebooks you can apply to real data
- their Python tools ship as a packaged set of binaries, so you don't need to compile the C++
- you can do local prototyping but also deploy the same setup to your EC2 cluster
- toolkits customized for different use cases
- Large-scale graph computing at Google
- GraphLab
- "The GraphLab project consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. In addition, we are actively developing new interfaces to allow users to leverage the GraphLab API from other languages and technologies."
- Alex Smola's page on Google Research
- Fast Counting of Triangles in Large Real Networks - paper