Last active
September 27, 2017 07:41
crawler preparation for NEWS analysis
require 'nokogiri'
require 'open-uri'

# Fetch and parse the udn.com ranking page.
# URI.open is used because Kernel#open no longer accepts URLs on Ruby 3.0+.
ranking = Nokogiri::HTML(URI.open('https://udn.com/rank/pv/2/0/1'))

# Collect the link of every article listed on the ranking page.
news = ranking.css('dt h2 a').map { |link| link['href'] }

# Download each article and concatenate the text of all its <p> elements.
string = ''
news.each do |n|
  article = Nokogiri::HTML(URI.open(n))
  article.css('p').each do |paragraph|
    string += paragraph.content
  end
end

# Save the combined article text for later analysis.
File.open('article.txt', 'w') { |file| file.write(string) }