Last active
August 29, 2015 14:06
Read through an HTML file of anchor tags, parse all of the hyperlinks, and download them
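#!/usr/bin/ruby
#
# License: Public Domain

# Number of links processed so far.
line_number = 0
# Optional second argument; "--dry-run" prints the links instead of downloading them.
dry_run = ARGV[1]
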
# Show usage and quit when no input file is given.
if ARGV[0].nil? || ARGV[0].empty?
  puts "Usage: ruby urlFetcher.rb FILE_PATH [OPTION]"
  puts ""
  puts "The file given to this script must contain <a> tags with 'href' attributes;"
  puts "those links are what will be parsed and downloaded."
  puts ""
  puts "Options"
  puts "  ---"
  puts "  --dry-run    Print all of the files that would be downloaded"
  puts "               without actually downloading them"
  puts ""
  exit
end

# Read the whole file and normalize "\r\n" and bare "\r" line endings to "\n".
text = File.read(ARGV[0])
text.gsub!(/\r\n?/, "\n")

text.each_line do |line|
  # Take the first double-quoted href value on the line, if any.
  current_link = line[/href="(.*?)"/, 1].to_s
  # Drop the query string (everything from the first "?").
  current_link = current_link.sub(/\?.*/, '')
  next if current_link.empty?

  if dry_run == "--dry-run"
    puts current_link
  else
    # Pass the URL as its own argument so the shell never interprets it.
    system("wget", current_link)
  end

  line_number += 1
end

action = dry_run == "--dry-run" ? "would be downloaded" : "downloaded"
puts "#{line_number} files #{action}"
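
The link-parsing step above can also be pulled out into a small function, which makes it easier to try on sample input without running `wget`. The following is only a sketch: `extract_links` is a hypothetical helper that is not part of the original script, and it assumes every `href` value is wrapped in double quotes. Unlike the loop above, which keeps only the first link on each line, it uses `String#scan` to collect every match.

```ruby
# Hypothetical helper (not in the original script): return every
# double-quoted href value found in an HTML string, with any query
# string removed and empty values skipped.
def extract_links(html)
  html.scan(/href="(.*?)"/)
      .flatten
      .map { |link| link.sub(/\?.*/, '') }
      .reject(&:empty?)
end

# Made-up sample input for illustration only.
sample = <<~HTML
  <a href="http://example.com/a.zip?token=123">A</a> <a href="http://example.com/b.zip">B</a>
  <p>No link here</p>
HTML

puts extract_links(sample)
```

Because the function returns a plain array, a dry run amounts to printing it, and a real run could hand each element to `system("wget", link)`.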