Created
February 13, 2012 14:03
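A script that extracts prefixes like 【相変わらず女の子からも人気の定番アイテム!】 ("A staple item that's as popular as ever with girls too!") from the body text of a web page.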
```ruby
#!/usr/bin/env ruby
# coding: utf-8
require 'open-uri'
require 'nokogiri'

# Usage: ruby extract_prefix.rb http://shop.menz-style.com/
#        ruby extract_prefix.rb http://hayabusa.2ch.net/news4vip/subback.html

# Returns the unique 【...】 bracketed strings found in the text of the page at url.
def get_prefix_list(url)
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
  doc = Nokogiri::HTML.parse(URI.open(url).read)
  # Non-greedy match, so each hit ends at the nearest closing bracket.
  doc.text.scan(/【.+?】/u).uniq
end

puts get_prefix_list(ARGV[0])
```
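To see what the pattern picks up without fetching a page, the same scan can be tried on a plain string. This is only a minimal sketch: `extract_prefixes` and the sample text are hypothetical and not part of the original script, and it skips the Nokogiri step entirely.

```ruby
# Hypothetical helper (not in the original script): applies the same
# 【...】 pattern to an in-memory string instead of a fetched page.
def extract_prefixes(text)
  # .+? is non-greedy, so each match stops at the nearest closing bracket.
  # Without the /m flag, "." does not match a newline, so a bracket pair
  # that spans two lines is not picked up.
  text.scan(/【.+?】/).uniq
end

# Made-up sample text with one duplicated prefix.
sample = "【NEW】shirt 【SALE】jacket 【NEW】shoes"
p extract_prefixes(sample)
# prints ["【NEW】", "【SALE】"]
```

Duplicates are dropped by `uniq`, which keeps the first occurrence, so the result follows the order in which the prefixes first appear in the text.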