@yswallow
Created June 1, 2015 15:30
Algorithm for finding the XPath of an HTML page's main content (part 2)
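require 'nokogiri'

# Returns the XPath of the element that most likely holds the main content.
# Accepts an HTML string or an already parsed Nokogiri document
# (a parsed document is modified in place).
def mobilizer(page)
  page = Nokogiri::HTML(page) if page.is_a?(String)
  # Drop tags whose text is rarely part of the main content.
  remove_tags = ['script', 'style', 'select', 'a']
  remove_tags.each do |tag|
    page.xpath('//' + tag).each { |n| n.remove }
  end
  identifier(page, '/html/body')
end

# Descends from recent_path into the child that holds more than half of the
# significant text, and returns the path at which no single child dominates.
def identifier(page, recent_path)
  sum = 0
  paths = []
  page.xpath(recent_path + '/node()').each do |node|
    size = node.text.size
    # Only children with more than 100 characters count as significant.
    sum += size > 100 ? size : 0
    paths << node.path
  end
  # Without this guard the threshold would be 0 and the search would
  # descend into the first non-empty child, however small it is.
  return recent_path if sum.zero?
  least_importance = sum * 0.5r
  paths.each do |path|
    return identifier(page, path) if page.xpath(path).text.size > least_importance
  end
  recent_path
end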
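
The descent rule used by identifier can be tried without Nokogiri on a small hand-built tree. The following is only a sketch under stated assumptions: the Node struct, the dominant_path helper, and the sample tree are hypothetical names made up for illustration, and the sketch stops as soon as no child exceeds the 100-character threshold.

```ruby
# Hypothetical toy tree: each node has a name, its own text, and children.
Node = Struct.new(:name, :own_text, :children) do
  # Total text length of this node and everything below it.
  def text_size
    own_text.size + children.sum(&:text_size)
  end
end

# Same idea as identifier: follow the child that holds more than half of the
# significant text (children longer than threshold characters).
def dominant_path(node, path = node.name, threshold = 100)
  significant = node.children.map(&:text_size).select { |s| s > threshold }.sum
  # Assumption of this sketch: stop when no child is significant.
  return path if significant.zero?
  heavy = node.children.find { |c| c.text_size > significant / 2.0 }
  heavy ? dominant_path(heavy, "#{path}/#{heavy.name}", threshold) : path
end

article = Node.new('article', '', [
  Node.new('p[1]', 'a' * 150, []),
  Node.new('p[2]', 'b' * 150, [])
])
main = Node.new('main', '', [article])
nav  = Node.new('nav', 'x' * 40, [])
body = Node.new('body', '', [nav, main])

puts dominant_path(body) # prints body/main/article
```

The search stops at article because its two paragraphs each hold exactly half of the text, so neither exceeds the 50% threshold. That is the point where the content is spread across siblings rather than nested inside a single child.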