Created May 14, 2012 02:46
Regex to extract links from HTML ( http://rubular.com/r/ESweX4uBlb )
require "test/unit"

class HtmlLinkTagsRegex < Test::Unit::TestCase
  # Capture group 1 is the href value, group 2 is the link body.
  # /i ignores case (<A HREF=...>), /m lets "." match newlines so tags
  # and link bodies that wrap across lines are still matched.
  def regex
    /<a.+?href=["']([^"']+)["'].*?>(.+?)<\/a>/im
  end

  def test_extract_links
    html = <<-html
<p>Do you know <A title="Bacon Ipsum" href="http://baconipsum.com/"
target="_blank">Bacon Ipsum</A> - from <a HREF='http://pedrobachiega.com' >pedrobachiega.com</a></p>
<p><a title="Bacon Ipsum" href="http://baconipsum.com/" target="_blank"><img src="http://baconipsum.com/wp-content/uploads/2011/06/bacon-ipsum-banner1.jpg" /></a></p>
<p>Hamburger beef bresaola pig tongue, pork chop sirloin tail pork belly shankle short loin pork. Pork loin ball tip pork meatloaf strip steak. <a href="http://wiki.answers.com/Q/Is_bacon_pork_or_beef">Bacon pork</a> loin pastrami, sirloin biltong ham hock spare ribs ground round hamburger shoulder tail pork chop. Speck pork belly bresaola t-bone. Swine prosciutto short ribs, tail pastrami leberkas shankle.</p>
<p><a href='https://en.wikipedia.org/wiki/Spare_ribs' target="_blank">Spare
ribs</a> kielbasa shank, frankfurter meatball tenderloin short loin salami beef ribs. Pastrami strip steak pork chop short ribs hamburger, speck chicken biltong tri-tip jerky meatloaf venison spare ribs pork loin corned beef. Tri-tip bresaola cow tail ball tip, filet mignon ham sirloin short loin beef ribs meatball. Ball tip pork belly beef ribs, flank turducken bacon ham shank jowl cow short ribs venison shoulder bresaola chicken. Spare ribs strip steak shankle kielbasa tri-tip. Ham hock jowl pancetta, turducken biltong prosciutto venison ball tip pork chop filet mignon fatback spare ribs corned beef pork loin.</p>
<p>Chicken ham drumstick, <a href="http://www.foodnetwork.com/recipes/emeril-live/boudin-sausage-recipe/index.html"
target="_blank">boudin sausage</a> shankle fatback jerky prosciutto short ribs ground round andouille chuck shoulder sirloin. Filet mignon andouille shankle pork loin, fatback short loin brisket. Turkey pork loin turducken, ball tip frankfurter shoulder brisket rump sirloin meatball sausage. Brisket meatball meatloaf andouille, spare ribs salami jowl pig drumstick corned beef speck ham hock tri-tip. Ground round shankle ham prosciutto, strip steak ball tip venison shank.</p>
    html

    # scan returns one [href, body] pair per matched link.
    links = html.scan(regex)
    links.each_with_index do |link, i|
      puts "#{i} - #{link}"
    end

    # The sample HTML above contains six anchor tags.
    assert_equal 6, links.size
  end
end
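
Outside of a test case, the same pattern can be wrapped in a small helper. The following is only a sketch: `LINK_REGEX`, `extract_links`, and the sample markup are illustrative names and data that are not part of the original gist. It assumes well-formed anchors with quoted `href` values, and it collapses whitespace so link bodies that wrap across lines come back as a single line.

```ruby
# Same pattern as in the test above: group 1 is the href value,
# group 2 is the link body.
LINK_REGEX = /<a.+?href=["']([^"']+)["'].*?>(.+?)<\/a>/im

# Hypothetical helper (not in the original gist): returns one hash per link.
def extract_links(html)
  html.scan(LINK_REGEX).map do |url, body|
    # Collapse newlines and runs of whitespace inside the link body.
    { url: url, text: body.gsub(/\s+/, " ").strip }
  end
end

# Illustrative input with a wrapped link body and an upper-case HREF.
sample = <<-HTML
<p><a href="http://baconipsum.com/" target="_blank">Bacon
Ipsum</a> and <a HREF='http://pedrobachiega.com' >pedrobachiega.com</a></p>
HTML

extract_links(sample).each do |link|
  puts "#{link[:url]} -> #{link[:text]}"
end
```

Returning hashes rather than the raw two-element arrays from `scan` keeps call sites readable. Like the original regex, this is not a general HTML parser: unquoted attributes or nested anchors are not handled.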