@hibariya
Created November 14, 2010 11:42
#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
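require 'open-uri'

STDOUT.sync = true  # flush the progress bar as it is drawn
words = Hash.new(0) # word => occurrence count

# Fetch one page of Google results (up to 100 hits) for READMEs on github.com.
result_page = URI.parse('http://www.google.co.jp/search?num=100&hl=en&q=README+site:github.com').read

# Collect the quoted URLs that end in a README-like file name.
readme_links = result_page.scan(/['"](https?:\/\/[0-9a-z\-.]+:?[^'"<>:]+readme[.0-9a-z]*)['"]/i).flatten.uniq
len = readme_links.length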
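readme_links.each_with_index do |uri, i|
  # Draw a 100-column progress bar; "\r" rewinds to the start of the line.
  cur = (i.to_f / len) * 100
  print ['|', '#' * cur.to_i, '-' * (100 - cur.to_i), "|(#{cur.round(2)}%)\r"].join

  # Fetch the raw file rather than the HTML blob view; skip links that fail.
  readme_raw = begin
    URI.parse(uri.sub(/\/blob\//, '/raw/')).read
  rescue StandardError
    next
  end

  # Extract lowercase ASCII tokens. The compound alternatives (foo_bar, 3d-model)
  # come first; otherwise the plain [a-z]+ branch would always win at a letter.
  extracts = readme_raw.scan(/[a-z]+[0-9\-_]+[a-z0-9]*|[0-9]+[a-z\-_]+[a-z0-9]*|[a-z]+/)
  extracts.each do |word|
    next if word.length > 20 # ignore overly long tokens
    words[word] = words.fetch(word, 0) + 1
  end
end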

# Print "word<TAB>count", most frequent first.
puts words.sort_by(&:last).reverse.map { |word, count| "#{word}\t#{count}" }.join("\n")