@katzueno
Created May 26, 2017 08:29
require "kconv" # provides String#toutf8
# Site URL
#SITE_URL = "https://imperavi.com/kube/"
SITE_URL = "http://example.com/"
# Log file
LOG_FILE = "log.txt"
# File extensions to skip while crawling (passed to wget --reject)
RJCT_EXT = "ico,ICO,gif,GIF,jpg,JPG,jpeg,JPEG,png,PNG,js,JS,css,CSS,pdf,PDF,xml,XML,txt,TXT,xls,XLS,doc,DOC,ppt,PPT,xlsx,XLSX,docx,DOCX,pptx,PPTX,wmv,WMV,zip,ZIP,swf,SWF,svg,SVG,mp3,MP3,mp4,MP4"
# Start a fresh log with the site URL on the first line
File.open(LOG_FILE, "w") do |f|
  f.puts(SITE_URL)
end
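puts "crawling start"
# Spider the whole site without saving pages. wget logs to stderr, so merge it
# into stdout; tee keeps a copy of the raw log in temp.txt.
crawl_cmd = "wget --recursive --level inf --spider --wait=5 --random-wait --reject=#{RJCT_EXT} --no-verbose --no-parent --directory-prefix=temp --no-directories #{SITE_URL} 2>&1 | tee temp.txt"
IO.popen(crawl_cmd) do |result|
  result.each_line do |line|
    # Skip log lines that do not report a URL.
    next unless line.toutf8 =~ %r{.*(https?://[^ ]*) [^ ].*}
    url = $1
    File.open(LOG_FILE, "a") { |f| f.puts(url) }
    puts url
    # Request the URL again and capture its response headers. Passing an argument
    # list bypasses the shell, so "&" or "?" in the URL cannot break the command.
    output = IO.popen(["wget", "--spider", "--server-response", url, err: [:child, :out]], &:read)
    File.write("temp2.txt", output) # keep a copy of the raw response in temp2.txt
    output.each_line do |header|
      # Test both patterns against the same line so that no header is skipped.
      next unless header =~ %r{.*(Cache-Control: max-age=.*)} || header =~ %r{.*(X-Cache:.*)}
      value = $1
      File.open(LOG_FILE, "a") { |f| f.puts(value) }
      puts value
    end
  end
end
puts "crawling end"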