Created
April 21, 2014 22:41
A script that parses HTML
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Fetch the page, sending a browser-like User-Agent string
my $url = "http://www.yahoo.co.jp";
my $user_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)";
my $ua = LWP::UserAgent->new('agent' => $user_agent);
my $res = $ua->get($url);
my $content;
if($res->is_success){
    $content = $res->content;
    # print $content;
}else{
    die($res->status_line);
}

# Parse the page with HTML::TreeBuilder
my $tree = HTML::TreeBuilder->new;
$tree->parse($content);
$tree->eof;    # signal end of input so the tree is finalized

# Walk the DOM and pull out only the topics section,
# i.e. the <div id='topicsfb'><ul><li>... part
my $topics = $tree->look_down('id', 'topicsfb')
    or die "element with id 'topicsfb' not found";
my @items = $topics->find('li');
print $_->as_text."\n" for @items;
The output looks like this:
$ perl htmlParser.pl
船差し押さえ 日中関係に懸念動画
新藤総務相が靖国参拝NEW
大手電力 6月も料金値上げ写真
韓国 修学旅行を全面禁止動画
PTAは罰ゲーム? 気が沈む選出写真
休養の高橋大輔 恋人ほしい写真NEW
熊切の事務所 破局報道を否定写真
観月「婚前旅行」から帰国写真
最近の話題 記事一覧
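The look_down / find chain can also be tried without hitting the network. Below is a minimal sketch, assuming a hand-written HTML snippet in place of the fetched page. The `$html` string and the `extract_topics` helper are hypothetical names introduced for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'HTML';
<html><body>
<div id="topicsfb"><ul>
<li>First headline</li>
<li>Second headline</li>
</ul></div>
<div id="other"><ul><li>Unrelated</li></ul></div>
</body></html>
HTML

# Hypothetical helper: return the text of every <li> under the element
# with the given id, or an empty list if no such element exists.
sub extract_topics {
    my ($source, $id) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($source);
    $tree->eof;                                   # finish parsing
    my $root  = $tree->look_down('id', $id);      # first match or undef
    my @texts = $root ? map { $_->as_text } $root->find('li') : ();
    $tree->delete;                                # free the tree explicitly
    return @texts;
}

print "$_\n" for extract_topics($html, 'topicsfb');
```

Returning an empty list when the id is missing avoids calling find on an undefined value, which is what would happen in a chained call if the page layout changed.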
This really drove home how impressive Perl is.
[Reference]
Simple! A Perl script that fetches and parses HTML in just 13 lines of code
http://dqn.sakusakutto.jp/2010/06/perlhtml.html