Graphs encode the relationship between entities and are essential to data mining and machine learning.
Example: predicting user behavior
- start with a base classifier that estimates an individual user's political bias
- then take that estimate and apply it across a large user base
- the vast majority of users may not post frequently, but you have the follower structure (the graph), so you can use a conditional random field to propagate estimates across it
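A minimal sketch of the idea: propagate bias scores from frequent posters to quiet users by iteratively averaging over who they follow. This is a simplified stand-in for the conditional random field mentioned above; all names and the toy graph are illustrative.

```python
def propagate_labels(followers, seed_scores, iterations=10):
    """followers: dict mapping user -> set of users they follow.
    seed_scores: known bias scores in [-1, 1] for users who post enough to classify."""
    scores = dict(seed_scores)
    for _ in range(iterations):
        updated = dict(scores)
        for user, followed in followers.items():
            if user in seed_scores:
                continue  # keep observed labels fixed
            neighbor_scores = [scores[v] for v in followed if v in scores]
            if neighbor_scores:
                # quiet users inherit the average bias of who they follow
                updated[user] = sum(neighbor_scores) / len(neighbor_scores)
        scores = updated
    return scores

# alice and bob rarely post; carol and dave have known scores
graph = {"alice": {"carol"}, "bob": {"carol", "dave"}, "carol": set(), "dave": set()}
seeds = {"carol": 1.0, "dave": -1.0}
result = propagate_labels(graph, seeds)
print(result)  # alice inherits carol's score; bob averages carol and dave
```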
"triangle counting"
- count triangles passing through each vertex
- measures "cohesiveness" of a local community
- in the example given, real estate agents had the highest triangle counts
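Per-vertex triangle counting can be sketched directly: for each vertex, check which pairs of its neighbors are themselves connected. This is the naive O(V·d²) version, not the fast approximation from the paper linked below.

```python
from itertools import combinations

def triangles_per_vertex(adj):
    """adj: dict vertex -> set of neighbors (undirected graph).
    Returns the number of triangles passing through each vertex."""
    counts = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:       # edge a-b closes the triangle v-a-b
                counts[v] += 1
    return counts

# Square with one diagonal: two triangles sharing the diagonal b-d.
g = {
    "a": {"b", "d"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"a", "b", "c"},
}
counts = triangles_per_vertex(g)
print(counts)  # b and d sit on both triangles, a and c on one each
```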
How should we program these? Everyone reaches for Hadoop
- it has a very good parallelizing primitive, but it's row-based, not graph-based
- how do you do a dependency graph?
- how do you iterate on a graph with partitioning?
Pregel - Google's graph parallel abstraction
- a vertex program runs on each vertex
- that program is constrained to interact only with its neighbors (vertices that share an edge with it)
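The vertex-program model can be sketched as a superstep loop: each round, every vertex reads the messages sent to it in the previous round and sends new messages along its out-edges. The classic toy example is propagating the maximum value through the graph. This is illustrative pseudocode in plain Python, not Pregel's actual API.

```python
def pregel_max(values, edges, max_supersteps=10):
    """values: dict vertex -> initial value; edges: dict vertex -> out-neighbors.
    Runs supersteps until no vertex changes, i.e. all vertices vote to halt."""
    values = dict(values)
    inbox = {v: [] for v in values}
    for step in range(max_supersteps):
        outbox = {v: [] for v in values}
        changed = False
        for v in values:
            # vertex program: take the max of my value and incoming messages
            best = max([values[v]] + inbox[v])
            if best > values[v]:
                values[v] = best
                changed = True
            for u in edges.get(v, []):
                outbox[u].append(values[v])   # message along each out-edge
        inbox = outbox
        if step > 0 and not changed:
            break  # quiescence: every vertex halts
    return values

result = pregel_max({"a": 3, "b": 6, "c": 1}, {"a": ["b"], "b": ["c"], "c": ["a"]})
print(result)  # the maximum (6) spreads around the cycle to every vertex
```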
Example: noise reduction in image files
- Most people use adaptive belief propagation
- the problem with noisy images is that the boundaries are hard
- "This is easy to reason about with grid-like graphs, but in the real world graphs are governed by power laws and high-degree vertices"
- hard to partition high-degree vertices
- GraphLab splits vertices rather than edges (a vertex cut instead of an edge cut)
- they have some (presumably proprietary) heuristics for choosing how to split vertices
- unlike Pregel, which uses message-passing, GraphLab uses a shared-state model
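The vertex-cut idea can be sketched as follows: assign *edges* to machines rather than vertices, so a high-degree vertex appears as mirrors on every machine that holds one of its edges, while low-degree vertices stay on a single machine. This toy version places edges round-robin; GraphLab's real heuristics are smarter.

```python
from collections import defaultdict

def vertex_cut(edges, num_machines):
    """Assign each edge to a machine; track which machines mirror each vertex."""
    placement = defaultdict(list)          # machine -> edges stored there
    mirrors = defaultdict(set)             # vertex -> machines holding a mirror
    for i, (u, v) in enumerate(edges):
        m = i % num_machines               # round-robin edge placement (toy heuristic)
        placement[m].append((u, v))
        mirrors[u].add(m)
        mirrors[v].add(m)
    return placement, mirrors

# One high-degree "hub" vertex with eight edges, spread over four machines.
edges = [("hub", leaf) for leaf in "abcdefgh"]
placement, mirrors = vertex_cut(edges, num_machines=4)
print(len(mirrors["hub"]))                  # the hub is mirrored on all 4 machines
print([len(mirrors[leaf]) for leaf in "abcdefgh"])  # each leaf lives on exactly 1
```

The payoff is that no single machine has to hold all of a celebrity vertex's edges; the cost is keeping the mirrors' shared state in sync.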
PageRank example
- Alex Smola's technique can process 150 million tokens per second
- the Yahoo Altavista web graph: 1B links processed per second with 30 lines of user code
- GraphLab: ~100 tokens per second... but the code is 30 lines, written in 4 human-hours
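PageRank is the kind of iterative neighbor aggregation these throughput numbers refer to, and it genuinely fits in a few dozen lines. A plain-Python power-iteration sketch on a toy graph (the 0.85 damping factor is the conventional choice; this is not GraphLab's implementation):

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: dict vertex -> list of out-neighbors. Returns rank per vertex."""
    vertices = list(out_links)
    rank = {v: 1.0 / len(vertices) for v in vertices}  # uniform start
    for _ in range(iterations):
        contrib = {v: 0.0 for v in vertices}
        for v, outs in out_links.items():
            if outs:
                share = rank[v] / len(outs)    # split my rank among out-links
                for u in outs:
                    contrib[u] += share
        rank = {v: (1 - damping) / len(vertices) + damping * contrib[v]
                for v in vertices}
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(g)
print(ranks)  # c is linked by both a and b, so it ranks highest
```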
Python libraries
- they've built a series of Python notebooks you can apply to real data
- their Python tools ship as a packaged set of binaries, so you don't need to compile the C++
- you can do local prototyping but also deploy the same setup to your EC2 cluster
- toolkits customized for different use cases
- Large-scale graph computing at Google
- GraphLab
- "The GraphLab project consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. In addition, we are actively developing new interfaces to allow users to leverage the GraphLab API from other languages and technologies."
- Alex Smola's page on Google Research
- Fast Counting of Triangles in Large Real Networks - paper