-
-
Save jdan/1654064 to your computer and use it in GitHub Desktop.
#!/usr/bin/env ruby | |
# wiki-scraper.rb by Jordan Scales | |
# http://jordanscales.com | |
# http://programthis.net | |
# | |
# Tests the idea that the first link on each wikipedia article | |
# will eventually lead to philosophy | |
# | |
# Usage: | |
# ruby wiki-scraper.rb daft punk | |
require 'nokogiri' | |
require 'open-uri' | |
require 'cgi' | |
ROOT_URL = 'http://en.wikipedia.org' | |
def search_url(query) | |
"http://en.wikipedia.org/w/index.php?search=#{CGI.escape(query)}" | |
end | |
def title_from_url(url) | |
doc = Nokogiri::HTML(open(url)) | |
doc.css('h1#firstHeading').first.content | |
end | |
def title_from_query(query) | |
title_from_url search_url(query) | |
end | |
def first_link(url) | |
doc = Nokogiri::HTML(open(url)) | |
parenth = 0 | |
# cycle through each paragraph | |
doc.css('div.mw-content-ltr > p').each do |p| | |
# in each paragraph, go through each node | |
p.children.each do |c| | |
# if we've found two parentheses, return the next link you see | |
if parenth == 0 or (parenth > 1 and (parenth % 2 == 0)) | |
if c.name == 'a' | |
return ROOT_URL + c.attributes["href"].value | |
end | |
end | |
# incremement the number of parentheses we've seen | |
if /\(/ === c.to_s | |
parenth += 1 | |
elsif /\)/ === c.to_s | |
parenth += 1 | |
end | |
end | |
end | |
end | |
def first_link_from_query(query) | |
first_link search_url(query) | |
end | |
start = ARGV.join(' ') | |
url = search_url start | |
title = title_from_url url | |
puts "1: #{title}" | |
count = 2 | |
while title != 'Philosophy' | |
url = first_link url | |
title = title_from_url url | |
puts "#{count}: #{title}: #{url}" | |
count += 1 | |
end |
I've found infinite loop:
C:\Users\jarmo\Desktop>ruby wiki-scraper.rb michael jackson
1: Michael Jackson
2: MTV: http://en.wikipedia.org/wiki/MTV
3: Media in New York City: http://en.wikipedia.org/wiki/Media_of_New_York_City
4: Time Warner: http://en.wikipedia.org/wiki/Time_Warner
5: Turner Broadcasting System: http://en.wikipedia.org/wiki/Turner_Broadcasting_System
6: Time Warner: http://en.wikipedia.org/wiki/Time_Warner
7: Turner Broadcasting System: http://en.wikipedia.org/wiki/Turner_Broadcasting_System
....
E.g. your test fails :)
Nice find! Hitler also fails, but that is a bug with my actual program, since it chooses [German] in "[German] for blahblah" which I don't want it to do. There are some kinks, but I didn't want to spend more than about an hour working with this. I can't think of an elegant way to handle such cases.
Somebody on reddit posted exactly what I was thinking about when I saw this:
"If you really wanted to try all the articles, here are a couple of thoughts: - cache every article that eventually leads to Philosophy, then if a given page links to an article in the cache, it too eventually leads to Philosophy - Wikipedia used to (probably still does) make available an archive of all the articles for download. Then, you wouldn't have to deal with network latency, and whatnot.
Sounds like a fun project!"
That would be pretty awesome. Obviously, Wikipedia is changing all the time, but you could at least get a decent snapshot of how long it takes to get to Philosophy from any English article. You'd need to check the current path to check for infinite loops. Alternately, we could just write something to insert a Philosophy link as the first link in every Wikipedia article and call it a day...
My friend showed me this because I'm going to attempt to write something that randomly selects a link from a Wikipedia article. The only hard part is I only want links to other articles, no references, disambiguation pages, etc. (specifically only articles in the language of the current page). I did a prototype in VB .Net a couple of years ago (the framework used where I worked at the time). It worked pretty well and was a fun way to interact with Wikipedia.
Anyway, I just started learning ruby and whatnot, so this looks like a good place to start. Thanks for posting this.
It doesn't work if the first search doesn't return exact result, for example: