Created
August 23, 2014 21:26
-
-
Save sebastien-bratieres/0295aaa9a4d4acab4a0d to your computer and use it in GitHub Desktop.
IPython notebook to illustrate an algorithm to compute PageRanks.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"metadata": { | |
"name": "", | |
"signature": "sha256:5d172e02eeffb5de9592061ec75a01f2a8d71652e030e5c6a4e934474635c2f2" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This page demonstrates the use of a short Python implementation of the PageRank algorithm on the link structure contained in the graph on the [PageRank Wikipedia](http://en.wikipedia.org/wiki/PageRank) page:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from IPython.display import Image\n", | |
"Image(url='http://upload.wikimedia.org/wikipedia/commons/f/fb/PageRanks-Example.svg')" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"<img src=\"http://upload.wikimedia.org/wikipedia/commons/f/fb/PageRanks-Example.svg\"/>" | |
], | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 1, | |
"text": [ | |
"<IPython.core.display.Image at 0x7fba5dfb0208>" | |
] | |
} | |
], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"First, we will encode the links present on this graph as a count matrix `M_counts`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"n_pages = 11 # numbering pages A through K as 0 to 10\n", | |
"M_counts = np.zeros((n_pages, n_pages)) # will hold the number of link counts (assumed 0 or 1)\n", | |
"# columns = starting page, row = destination page, ie M_ij = whether or not there is a link from j to i\n", | |
"\n", | |
"M_counts[:,0] = 1 # page 0 (A in the graphic) is a sink because it has no outgoing links at all; \n", | |
"# however, M cannot contain an all-zero column, so do as if A was linking to all other pages (ie put 1's everywhere)\n", | |
"M_counts[2,1] = 1 # B->C\n", | |
"M_counts[1,2] = 1 # C->B\n", | |
"M_counts[0,3] = 1 # D->A\n", | |
"M_counts[1,3] = 1 # D->B\n", | |
"M_counts[1,4] = 1 # E->B\n", | |
"M_counts[3,4] = 1 # E->D\n", | |
"M_counts[5,4] = 1 # E->F\n", | |
"M_counts[1,5] = 1 # F->B\n", | |
"M_counts[4,5] = 1 # F->E\n", | |
"M_counts[1,6] = 1 # G,H,I->B,E\n", | |
"M_counts[4,6] = 1\n", | |
"M_counts[1,7] = 1\n", | |
"M_counts[4,7] = 1\n", | |
"M_counts[1,8] = 1\n", | |
"M_counts[4,8] = 1\n", | |
"M_counts[4,9] = 1 # J,K->E\n", | |
"M_counts[4,10] = 1\n", | |
"print(M_counts)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"[[ 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0.]\n", | |
" [ 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]\n", | |
" [ 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", | |
" [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]\n" | |
] | |
} | |
], | |
"prompt_number": 2 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we can make an adjacency matrix `M` out of `M_counts`, by dividing each column by its sum, ie we are making sure columns sum to 1 :" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"M = np.empty((n_pages, n_pages))\n", | |
"for j in range(n_pages):\n", | |
" M[:,j] = M_counts[:,j] / M_counts[:,j].sum()\n", | |
"np.set_printoptions(precision=3)\n", | |
"print(M)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"[[ 0.091 0. 0. 0.5 0. 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 1. 0.5 0.333 0.5 0.5 0.5 0.5 0. 0. ]\n", | |
" [ 0.091 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0.333 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0. 0.5 0.5 0.5 0.5 1. 1. ]\n", | |
" [ 0.091 0. 0. 0. 0.333 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]\n", | |
" [ 0.091 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]\n" | |
] | |
} | |
], | |
"prompt_number": 3 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let us check that all the conditions on M are fulfilled." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import numpy\n", | |
"def check_M(M):\n", | |
" \"\"\"\n", | |
" check that M has the right format to be used by pagerank function\n", | |
" \"\"\"\n", | |
" n_pages = M.shape[0] # n_pages is the number of rows of M\n", | |
" np.testing.assert_equal(M.shape[0], M.shape[1], err_msg = 'M should be square')\n", | |
" np.testing.assert_array_almost_equal(M.sum(axis=0), np.ones((n_pages)), \n", | |
" err_msg = 'assert each column sums to one (M is assumed column-stochastic)')\n", | |
" for j in range(n_pages):\n", | |
" M_column = M[:,j]\n", | |
" n_nonzero = np.count_nonzero(M[:,j])\n", | |
" np.testing.assert_array_almost_equal(M_column[M_column.nonzero()], np.ones((n_nonzero)) / n_nonzero,\n", | |
" err_msg = 'in column %g, all non-zero entries should be equal (and equal to 1 divided by their number)' % j)\n", | |
"\n", | |
"check_M(M) # will produce error if M does not have the right format" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"And we are now ready to apply the `pagerank` function, which will iteratively apply page transitions to an randomly initialized distribution over the pages, until convergence." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import numpy as np\n", | |
"def pagerank(M, d=0.85, square_error=1e-6):\n", | |
" \"\"\"\n", | |
" M : the adjacency matrix of the pages. It is assumed to be column-stochastic (ie column sum to 1); all links have equal weight.\n", | |
" A page with no outgoing links (sink) is represented as a page with outgoing links to each other page (ie restart page).\n", | |
" d: damping factor\n", | |
" square_error : the algorithm iterates until the difference between two successive PageRank vectors v is less than this (in squared norm)\n", | |
" returns the PageRanks of all pages\n", | |
" \"\"\"\n", | |
" n_pages = M.shape[0] # n_pages is the number of rows of M\n", | |
" v = np.random.rand(n_pages) # initialize to random vector\n", | |
" v = v / v.sum() # make v sum to 1\n", | |
" last_v = np.ones((n_pages)) # will contain the previous v\n", | |
" M_hat = d * M + (1-d)/n_pages * np.ones((n_pages, n_pages)) # equation (***) in Wikipedia page\n", | |
" while np.square(v - last_v).sum() > square_error:\n", | |
" last_v = v\n", | |
" v = M_hat.dot(v) # at each iteration, progress one timestep\n", | |
" return v\n", | |
" " | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"pagerank(M)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 6, | |
"text": [ | |
"array([ 0.033, 0.385, 0.343, 0.039, 0.081, 0.039, 0.016, 0.016,\n", | |
" 0.016, 0.016, 0.016])" | |
] | |
} | |
], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"These are the numbers (within the allowed error) displayed on the graph (the numbers on the graph are rounded exact values)." | |
] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment