@khacanh
Created December 12, 2014 17:39
crawlers
Personality types in Spanish
Genres of movies in Spanish
Finance in Spanish
English proverbs in Spanish
Spanish: Either or, Neither Nor
Spanish Weather Phrases
Common Spanish Phrases
require 'nokogiri'
require 'open-uri'

# Lesson pages to crawl, keyed by category name.
urls = {
  #'Greetings' => 'http://www.rocketlanguages.com/spanish/learn/greetings-in-spanish/',
  #'Describing people' => 'http://www.rocketlanguages.com/spanish/learn/describing-people-in-spanish/',
  #'Romantic' => 'http://www.rocketlanguages.com/spanish/learn/romance-in-spanish/',
  #'Basic' => 'http://www.rocketlanguages.com/spanish/learn/basic-spanish-phrases/',
  'On the phone' => 'http://www.rocketlanguages.com/spanish/learn/phone-in-spanish/',
  'Job interview' => 'http://www.rocketlanguages.com/spanish/learn/job-interviews-in-spanish/',
  'Workplace' => 'http://www.rocketlanguages.com/spanish/learn/workplace-spanish/',
  'Sports' => 'http://www.rocketlanguages.com/spanish/learn/sports-in-spanish/',
}

#f = File.open 'crawler.log', 'a+'
urls.each do |category, url|
  #f.puts "\"#{category}\""
  # URI.open comes from open-uri; Kernel#open stopped handling URLs in Ruby 3.0.
  html_doc = Nokogiri::HTML(URI.open(url))
  # English phrases (WS_1) and their Spanish translations (WS_5), in page order.
  en = html_doc.xpath("//div[@class='media']//div[@class='WS_1']").map { |node| node.text.strip }
  es = html_doc.xpath("//div[@class='media']//div[@class='WS_5']").map { |node| node.text.strip }
  # Audio link for each phrase box.
  links = html_doc.xpath("//div[@class='boxcontainer phrasebox bluebackground']//div[@class='media']//a").map { |a| a['href'] }
  links.each_with_index do |link, i|
    # Name each MP3 after its English phrase: non-alphanumerics removed, lowercased.
    name = en[i].gsub(/[^a-zA-Z0-9]/, '').downcase
    # The argument-list form of system keeps the scraped link away from the shell.
    system('wget', link, '-O', "app/src/main/res/raw/#{name}.mp3")
  end
  en.each_with_index do |e, i|
    #f.puts "{\"#{e}\", \"#{es[i]}\"},"
  end
end
#f.close