@Snarp
Created April 21, 2018 20:44
Scrapes an HTML string for image URLs using Oga + regex.
require 'oga'

# Scrapes an HTML string for image URLs using Oga + regex.
class HtmlImageUrlScraper
  EXTNAMES = ['.jpeg','.jpg','.gif','.png','.bmp','.svg','.tif','.tiff','.ai','.apng','.bpg','.cgm','.dxf','.eps','.flif','.hdp','.hdr','.heic','.heif','.ico','.iff','.jp2','.jpx','.jxr','.lbm','.pbm','.pgm','.pnm','.ppm','.wdp','.webp']

  attr_accessor :extnames, :regexes

  # @param extnames [Array<String>] file extensions (with leading dot) treated as images
  # @param regexes [Array<Regexp>, nil] custom patterns; built from extnames when nil
  def initialize(extnames: EXTNAMES, regexes: nil)
    @extnames = extnames
    # Default patterns capture any quoted string ending in a known image extension.
    extnames_str = extnames.map { |n| Regexp.escape(n.sub('.', '')) }.join('|')
    @regexes = regexes || [
      /"([^"]+\.(?:#{extnames_str}))"/i,
      /'([^']+\.(?:#{extnames_str}))'/i
    ]
  end

  # @param string [String] raw HTML
  # @return [Array<String>]
  def scrape(string)
    # dup first so the caller's string is not mutated and frozen strings are accepted
    html = string.dup.force_encoding('UTF-8')
    doc = Oga.parse_html(html)

    # ignores extnames; assumes any img src value is an image URL
    img_srcs = doc.css('img').map { |img| img.get('src') }.compact

    # relies on extnames; note that File.extname does not strip query strings,
    # so an href like "photo.jpg?size=large" is not matched here
    a_hrefs = doc.css('a').map { |a| a.get('href') }.select do |href|
      href && @extnames.include?(File.extname(href).downcase)
    end

    # catches quoted image paths outside img/a tags (inline styles, scripts, data attributes)
    via_regex = @regexes.map { |regex| html.scan(regex) }.flatten

    (img_srcs + a_hrefs + via_regex).uniq
  end
end
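
The regex pass is the least obvious part of the class, so here is a minimal standalone sketch of it that runs without Oga. The three-entry extension list and the sample HTML are assumptions made up for illustration; the two patterns mirror the scraper's defaults, which capture any single- or double-quoted string ending in a listed extension.

```ruby
# Hypothetical, trimmed-down extension list (the class ships a much longer one).
extnames = ['.jpg', '.png', '.gif']

# Strip the leading dot and join with "|" to form a regex alternation.
alternation = extnames.map { |ext| Regexp.escape(ext.sub('.', '')) }.join('|')

# One pattern per quote style; /i makes the extension match case-insensitive.
regexes = [
  /"([^"]+\.(?:#{alternation}))"/i,
  /'([^']+\.(?:#{alternation}))'/i
]

# Made-up sample input: neither image path sits in an <img> or <a> tag.
html = <<~HTML
  <div data-bg='/img/bg.PNG'></div>
  <script>var hero = "https://example.com/hero.jpg";</script>
  <a href="/about.html">About</a>
HTML

# String#scan returns one array per match (one element per capture group),
# so flatten collapses the results into a flat list of paths.
found = regexes.map { |regex| html.scan(regex) }.flatten.uniq

puts found
```

Because the patterns look only at quoted strings, they pick up paths in scripts and data attributes that the `img` and `a` passes would miss. They can also pick up quoted strings that merely end in an image extension, so the results may need filtering.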