Created
April 21, 2014 23:02
Parse HTML and try things like pulling out the a tags
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = "http://www.yahoo.co.jp";
my $user_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)";
my $ua = LWP::UserAgent->new('agent' => $user_agent);
my $res = $ua->get($url);
my $content;
if ($res->is_success) {
    $content = $res->content;
    # print $content;
} else {
    die($res->status_line);
}
# Parse with HTML::TreeBuilder
my $tree = HTML::TreeBuilder->new;
$tree->parse($content);
$tree->eof;    # signal end of input so the tree is finalized
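#
# Walk the DOM and extract only the topics section.
#
# look_down returns undef when nothing matches, so check before chaining.
my $topics = $tree->look_down('id', 'topicsfb')
    or die "no element with id 'topicsfb' found\n";

# Extract the <div id='topicsfb'><ul><li>... elements
my @items = $topics->find('li');
print $_->as_text."\n" for @items;

# Extract the <div id='topicsfb'><ul><li><a href>... elements
my @links = $topics->find('a');
print $_->attr('href')."\n" for @links;

$tree->delete;    # free the tree explicitly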
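
The extraction step can also be tried without network access by parsing a fixed HTML string. The following is a minimal sketch, not part of the original script: the sample markup and the helper name `links_in` are hypothetical, and only the `topicsfb` id is taken from the code above. It also handles the case where no element has the requested id, in which `look_down` returns undef.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'HTML';
<html><body>
<div id="topicsfb"><ul>
<li><a href="/news/1">First topic</a></li>
<li><a href="/news/2">Second topic</a></li>
</ul></div>
</body></html>
HTML

# Return [text, href] pairs for every <a> inside the element with the given id.
# links_in is an illustrative helper, not an HTML::TreeBuilder method.
sub links_in {
    my ($source, $id) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($source);
    my $block = $tree->look_down(id => $id);    # undef if no such element
    my @pairs = $block
        ? map { [ $_->as_text, $_->attr('href') ] } $block->find('a')
        : ();
    $tree->delete;    # free the tree; the pairs are plain strings by now
    return @pairs;
}

print "$_->[0]\t$_->[1]\n" for links_in($html, 'topicsfb');
```

Copying the text and href into plain strings before calling `delete` means the caller never holds references into a destroyed tree.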
Output
$ perl htmlPaser_getLink.pl
船差し押さえ 日中関係に懸念動画
新藤総務相が靖国参拝NEW
大手電力 6月も料金値上げ写真
韓国 修学旅行を全面禁止動画
PTAは罰ゲーム? 気が沈む選出写真
休養の高橋大輔 恋人ほしい写真NEW
熊切の事務所 破局報道を否定写真
観月「婚前旅行」から帰国写真
最近の話題 記事一覧
f/topics/top/1/*-http://dailynews.yahoo.co.jp/fc/domestic/japan_china_relations/?id=6114379
f/topics/top/2/*-http://dailynews.yahoo.co.jp/fc/domestic/yasukuni/?id=6114385
f/topics/top/3/*-http://dailynews.yahoo.co.jp/fc/economy/electricity_prices/?id=6114378
f/topics/top/4/*-http://dailynews.yahoo.co.jp/fc/world/korea_south/?id=6114380
f/topics/top/5/*-http://dailynews.yahoo.co.jp/fc/domestic/education/?id=6114372
f/topics/top/6/*-http://dailynews.yahoo.co.jp/fc/sports/takahashi_daisuke/?id=6114384
f/topics/top/7/*-http://dailynews.yahoo.co.jp/fc/entertainment/kabuki/?id=6114381
f/topics/top/8/*-http://dailynews.yahoo.co.jp/fc/entertainment/romance/?id=6114382
f/topics/top/11/*-http://news.yahoo.co.jp/list/?d=20140422&mc=f&mp=f
r/ttl
f/topics/top/9/*-http://dailynews.yahoo.co.jp/photograph/pickup/?1398086976
f/topics/top/10/*-http://dailynews.yahoo.co.jp/photograph/pickup/?1398086976
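
Most of the href values printed above are relative paths with what looks like a destination URL embedded after a `*-` marker. Assuming that pattern holds (it is inferred only from this output, not from any documented format), the destination could be pulled out with a small regex. The helper name `redirect_target` is hypothetical.

```perl
use strict;
use warnings;

# Assumption: an href has the form "<prefix>*-<destination URL>".
# Returns the destination URL, or undef when no "*-http(s)://" part is present.
sub redirect_target {
    my ($href) = @_;
    return $href =~ m{\*-(https?://.+)\z}s ? $1 : undef;
}

# Two values copied from the output: one with an embedded URL, one without.
my @hrefs = (
    'f/topics/top/1/*-http://dailynews.yahoo.co.jp/fc/domestic/japan_china_relations/?id=6114379',
    'r/ttl',
);

for my $href (@hrefs) {
    my $target = redirect_target($href);
    print defined $target ? "$target\n" : "(no embedded URL) $href\n";
}
```

Entries such as `r/ttl` carry no embedded URL, so the sketch reports them rather than guessing a destination.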
[Reference]
Parsing with HTML::TreeBuilder (getting links)
http://www.geekpage.jp/programming/perl-network/html-treebuilder-2.php