@iorionda
Created May 9, 2013 05:09
# -*- coding: utf-8 -*-
require 'active_support/core_ext/string/strip'
html = <<-EOS.strip_heredoc
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<meta content='width=device-width; initial-scale=1.0; maximum-scale=1.0;' name='viewport'>
<title>Shift-JIS HTML Mail</title>
<style type="text/css">
<!--
h1 { color: red; }
-->
</style>
</head>
<body>
&nbsp;<h1 id="hoge">Sample&nbsp;Mail</h1>&nbsp;
<a href="http://www.nexway.co.jp" target="_blank"><img src="https://www.google.co.jp/images/srpr/logo4w.png" border=0></a>
<ul>
<li><a href="http://www.google.com/">Google</a></li>
<li><a      href="http://www.yahoo.co.jp/">Yahoo!</a></li>
<li><a>no href</a></li>
<li><a href>href value: none</a></li>
<li><a href="">href value: empty string</a></li>
<li><a href="https">href value: protocol only</a></li>
<li><a href="http:">href value: protocol up to the colon</a></li>
<li><a href="https://">href value: protocol up to the slashes</a></li>
<li><a href="http://x">href value: invalid URL</a></li>
<li><a href="http://localhost:3000/">href value: with port</a></li>
<li><a href="https://maps.google.co.jp/maps?q=%E6%9D%B1%E4%BA%AC%E9%83%BD%E6%B8%AF%E5%8C%BA%E8%99%8E%E3%83%8E%E9%96%80%EF%BC%94%E4%B8%81%E7%9B%AE%EF%BC%93%E2%88%92%EF%BC%91%EF%BC%93+%E7%A5%9E%E8%B0%B7%E7%94%BA%E3%82%BB%E3%83%B3%E3%83%88%E3%83%A9%E3%83%AB%E3%83%97%E3%83%AC%E3%82%A4%E3%82%B9&hl=ja&ie=UTF8&sll=36.5626,136.362305&sspn=47.525256,87.626953&oq=%E7%A5%9E%E8%B0%B7%E7%94%BA%E3%82%BB%E3%83%B3%E3%83%88%E3%83%A9%E3%83%AB&brcurrent=3,0x60188b90ef579bd9:0x7aea7cb12f141dfb,0&hnear=%E6%9D%B1%E4%BA%AC%E9%83%BD%E6%B8%AF%E5%8C%BA%E8%99%8E%E3%83%8E%E9%96%80%EF%BC%94%E4%B8%81%E7%9B%AE%EF%BC%93%E2%88%92%EF%BC%91%EF%BC%93+%E7%A5%9E%E8%B0%B7%E7%94%BA%E3%82%BB%E3%83%B3%E3%83%88%E3%83%A9%E3%83%AB%E3%83%97%E3%83%AC%E3%82%A4%E3%82%B9&t=m&z=16">href value: with parameters</a></li>
<li><a href="https://goo.gl/maps/DAWJC">href value: shortened URL</a></li>
<li><a href="https://www.google.co.jp/#safe=off&hl=ja&sclient=psy-ab&q=%C2%A9&oq=%C2%A9&gs_l=hp.3..0l8.2809.7540.0.10616.13.11.1.0.0.1.90.724.11.11.0...0.0...1c.1j4.12.psy-ab.mBF2wCSTvTM&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.46226182,d.dGI&fp=f29db67d20b9dfd1&biw=1378&bih=783">href value: contains #</a></li>
<li><a href="https://google.com">&quot; &amp; &lt; &gt; &nbsp; &copy;</a></li>
<li><a href="https://google.com">"&<> ©</a></li>
</ul>
</body>
</html>
EOS
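# HTML::Tokenizer and HTML::Node belong to the html-scanner library that
# shipped with Rails at the time; running the script via `rails r` loads them.
tokenizer = HTML::Tokenizer.new(html)
results = []
index = 0 # counts opening <a> tags that carry an href attribute
while (token = tokenizer.next)
  node = HTML::Node.parse(nil, 0, 0, token, false)
  # "href" is an attribute, not a tag name, so only the tag name "a" is checked.
  index += 1 if node.is_a?(HTML::Tag) && node.name == 'a' && node.attributes && node.attributes['href']
  # Every node is emitted unchanged, so the joined output mirrors the input.
  results << node.to_s
end
puts results.join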

iorionda commented May 9, 2013

bundle exec rails r html_scanner_sample.rb
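
The script needs a Rails environment because `HTML::Tokenizer` is not part of plain Ruby. As a rough illustration of the same tokenize-and-inspect loop without Rails, here is a minimal sketch using only the standard library's `StringScanner`. The helper names `tokenize_html` and `anchor_with_href?` are hypothetical and not part of html-scanner, and the regular expressions are a simplification: they assume no `>` appears inside attribute values and do not try to match html-scanner's behavior exactly.

```ruby
require 'strscan'

# Hypothetical helper (not html-scanner): split HTML into tag and text tokens.
def tokenize_html(html)
  scanner = StringScanner.new(html)
  tokens = []
  until scanner.eos?
    # A tag runs from "<" to the next ">"; text runs up to the next "<".
    # A stray "<" with no closing ">" is consumed as a single character.
    tokens << (scanner.scan(/<[^>]*>/) || scanner.scan(/[^<]+/) || scanner.getch)
  end
  tokens
end

# Hypothetical helper: true for an opening <a> tag that has an href attribute,
# including a bare `href` with no value.
def anchor_with_href?(token)
  token.match?(/\A<a\b[^>]*\shref\b/i)
end

sample = '<li><a href="http://www.google.com/">Google</a></li><li><a>no href</a></li>'
tokens = tokenize_html(sample)
href_count = tokens.count { |t| anchor_with_href?(t) }

puts tokens.join  # joining the tokens reproduces the input
puts href_count
```

Because every character is consumed by exactly one token, joining the tokens gives back the original string, which is the same round-trip property the script relies on when it prints `results.join`.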
