Created
January 21, 2012 21:23
(Nokogiri) Tests the idea that the first link on each Wikipedia article will eventually lead to Philosophy
#!/usr/bin/env ruby
# wiki-scraper.rb by Jordan Scales
# http://jordanscales.com
# http://programthis.net
#
# Tests the idea that the first link on each Wikipedia article
# will eventually lead to Philosophy
#
# Usage:
#   ruby wiki-scraper.rb daft punk

require 'nokogiri'
require 'open-uri'
require 'cgi'

# HTTPS avoids the http -> https redirect, which open-uri refuses to follow
ROOT_URL = 'https://en.wikipedia.org'

def search_url(query)
  "#{ROOT_URL}/w/index.php?search=#{CGI.escape(query)}"
end

def title_from_url(url)
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
  doc = Nokogiri::HTML(URI.open(url))
  doc.css('h1#firstHeading').first.content
end

def title_from_query(query)
  title_from_url search_url(query)
end

# Returns the URL of the first article link outside parentheses,
# or nil if no such link is found
def first_link(url)
  doc = Nokogiri::HTML(URI.open(url))
  depth = 0 # current parenthesis nesting depth

  # cycle through each paragraph
  doc.css('div.mw-content-ltr > p').each do |para|
    # in each paragraph, go through each node
    para.children.each do |node|
      if node.text?
        # count parentheses in text nodes only, so a "(" inside an href is ignored
        depth += node.content.count('(') - node.content.count(')')
      elsif node.name == 'a' && depth <= 0
        # outside any parentheses: return the first link to another wiki page
        href = node['href'].to_s
        return ROOT_URL + href if href.start_with?('/wiki/')
      end
    end
  end
  nil
end

def first_link_from_query(query)
  first_link search_url(query)
end

start = ARGV.join(' ')
url = search_url start
title = title_from_url url
puts "1: #{title}"

count = 2
while title != 'Philosophy'
  url = first_link url
  abort 'No further article link found.' if url.nil?
  title = title_from_url url
  puts "#{count}: #{title}: #{url}"
  count += 1
end
Somebody on reddit posted exactly what I was thinking about when I saw this:
"If you really wanted to try all the articles, here are a couple of thoughts:
- cache every article that eventually leads to Philosophy, then if a given page links to an article in the cache, it too eventually leads to Philosophy
- Wikipedia used to (probably still does) make available an archive of all the articles for download. Then, you wouldn't have to deal with network latency, and whatnot.
Sounds like a fun project!"
That would be pretty awesome. Obviously, Wikipedia is changing all the time, but you could at least get a decent snapshot of how many steps it takes to get to Philosophy from any English article. You'd need to keep track of the current path to detect infinite loops. Alternatively, we could just write something to insert a Philosophy link as the first link in every Wikipedia article and call it a day...
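Here's a rough sketch of how the caching and loop-checking ideas could fit together. It's not part of the original script: `FIRST_LINK` is a made-up in-memory table standing in for real first-link data (which would come from `first_link` or from a downloaded dump), and `steps_to_philosophy` is a hypothetical helper name.

```ruby
require 'set'

# Hypothetical first-link table; real data would come from Wikipedia itself.
FIRST_LINK = {
  'Daft Punk' => 'Electronic music',
  'Electronic music' => 'Music',
  'Music' => 'Art',
  'Art' => 'Philosophy',
  'Loop A' => 'Loop B',
  'Loop B' => 'Loop A'
}.freeze

# Returns the number of hops from title to Philosophy, or nil if the chain
# hits a dead end or a loop. Results for every article on the path are
# stored in cache so later lookups can stop early.
def steps_to_philosophy(title, cache, links = FIRST_LINK)
  path = []
  seen = Set.new
  current = title

  # Walk until we hit Philosophy, a cached article, a dead end, or a repeat.
  until current == 'Philosophy' || cache.key?(current) ||
        current.nil? || seen.include?(current)
    seen << current
    path << current
    current = links[current]
  end

  # 0 at Philosophy, the cached count on a cache hit, nil for a loop or dead end.
  base = current == 'Philosophy' ? 0 : cache[current]

  # Every article on the path shares the same outcome, so cache them all.
  path.reverse.each_with_index do |t, i|
    cache[t] = base.nil? ? nil : base + i + 1
  end

  cache.fetch(title) { base }
end
```

Calling it with one shared hash (`cache = {}`) across many starting articles means each chain only gets walked until it touches something already seen, and a loop like `Loop A` / `Loop B` comes back as `nil` instead of spinning forever.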
My friend showed me this because I'm going to attempt to write something that randomly selects a link from a Wikipedia article. The only hard part is that I only want links to other articles: no references, disambiguation pages, etc. (specifically, only articles in the language of the current page). I did a prototype in VB.NET a couple of years ago (the framework used where I worked at the time). It worked pretty well and was a fun way to interact with Wikipedia.
Anyway, I just started learning Ruby and whatnot, so this looks like a good place to start. Thanks for posting this.
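For the "only links to other articles" part, one possible starting point is to filter on the `href` values alone. This is only a sketch under a few assumptions: same-language article links are relative paths starting with `/wiki/`, the prefix list in `NON_ARTICLE_PREFIXES` is partial, and the `_(disambiguation)` suffix check will miss disambiguation pages that lack it. The names `article_href?` and `random_article_href` are made up for illustration.

```ruby
# Namespace prefixes that mark non-article pages (an assumed, partial list).
NON_ARTICLE_PREFIXES = %w[
  File: Help: Special: Talk: Category: Wikipedia: Template: Portal:
].freeze

# True if href looks like a link to a regular article on the same wiki.
def article_href?(href)
  # Reference anchors ("#cite_note-1") and absolute URLs (other languages,
  # external sites) do not start with /wiki/.
  return false unless href.start_with?('/wiki/')

  page = href.sub('/wiki/', '')
  return false if page.include?('#')                   # section anchors
  return false if page.end_with?('_(disambiguation)')  # heuristic only

  NON_ARTICLE_PREFIXES.none? { |prefix| page.start_with?(prefix) }
end

# Picks one article link at random, or returns nil if none qualify.
def random_article_href(hrefs, rng = Random.new)
  hrefs.select { |h| article_href?(h) }.sample(random: rng)
end
```

With Nokogiri, the input list could come from something like `doc.css('a').map { |a| a['href'] }.compact`, and the result can be appended to `ROOT_URL` the same way the script above does.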