@yuki24
Created June 3, 2015 22:38
A generator that creates a list of all available titles in Simple English Wiktionary
require 'rest-client'
require 'json'
require 'yaml'
require 'cgi'

# The API caps aplimit at 500 for regular (non-bot) clients, so request exactly that many.
page_size = 500
base_url  = "http://simple.wiktionary.org/w/api.php?action=query&aplimit=#{page_size}&list=allpages&format=json"
filename  = "simple_english_titles.yml"
apfrom    = nil
num       = 0
titles    = []

loop do
  # Resume from the last title seen. Titles may contain spaces or "&", so escape them.
  url = apfrom ? "#{base_url}&apfrom=#{CGI.escape(apfrom)}" : base_url
  puts "processing page #{num}: #{url}"

  pages = JSON.parse(RestClient.get(url).body)["query"]["allpages"]
  titles.concat(pages.map { |page| page["title"] })
  apfrom = pages.last["title"] unless pages.empty?
  num += 1

  # A batch smaller than the page size means the end of the list was reached.
  break unless pages.size == page_size
end

# Each batch starts with the last title of the previous one, so drop the duplicates.
titles.uniq!
puts "Number of titles: #{titles.size}"
File.write(filename, titles.to_yaml)
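
The script resumes each request from the last title of the previous batch, which is why the list has to be de-duplicated at the end. One possible alternative, sketched here rather than taken from the gist, is to follow the `continue` object that the MediaWiki API includes in each response until it is no longer present. The helper name `collect_titles` and the injectable `fetch` callable are assumptions made for illustration; `fetch` only needs to take a params Hash and return the response body as a JSON string.

```ruby
require 'json'

# Hypothetical helper (not part of the original gist): collects page titles by
# passing the API's "continue" tokens back on each request.
# `fetch` is any callable that accepts a params Hash and returns a JSON string,
# so the HTTP client can be swapped out or stubbed.
def collect_titles(fetch)
  params = {
    "action"  => "query",
    "list"    => "allpages",
    "aplimit" => "500",
    "format"  => "json"
  }
  titles = []
  continuation = {}

  loop do
    data = JSON.parse(fetch.call(params.merge(continuation)))
    titles.concat(data["query"]["allpages"].map { |page| page["title"] })

    # The API omits "continue" once the last batch has been returned.
    continuation = data["continue"]
    break if continuation.nil?
  end

  titles
end

# Possible wiring with rest-client (assumes the gem is installed and required):
#   fetch = ->(params) {
#     RestClient.get("https://simple.wiktionary.org/w/api.php", params: params).body
#   }
#   titles = collect_titles(fetch)
```

Because each request starts exactly where the previous one stopped, this variant should not need `uniq`, and keeping the HTTP call behind `fetch` lets the paging logic be exercised without network access.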