80legs_Robots_Parser
# 80legs Bulk Robots Checker
# 80legs.com 2014
# Usage: ruby robots.rb <path_to_url_list>
# Example: $> ruby robots.rb /Users/nick/Documents/url_list_1.txt
#
# Installation/Requirements:
# gem install rest-client
#
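require 'json'
require 'uri'
require 'rest-client'

class RobotsParser
  MASHAPE_API_KEY = "put_key_here"

  # Build the robots.txt URL for the site hosting +url+.
  # Returns "" when the URL cannot be parsed or has no host.
  def robots_file_for(url)
    parsed_url = URI.parse(url.strip)
    "#{parsed_url.scheme}://#{parsed_url.host.downcase}/robots.txt"
  rescue StandardError
    ""
  end

  # Ask the robots.txt parsing API about +url_to_parse+.
  # Returns a hash of agent name => {allow:, disallow:}; empty on any failure.
  def parse_robots_file(url_to_parse)
    agents = {}
    begin
      # URI.encode was removed in Ruby 3.0; encode the query value explicitly.
      response = RestClient.get("https://robotstxt.p.mashape.com/site/robots/?url=#{URI.encode_www_form_component(url_to_parse)}", :"X-Mashape-Authorization" => MASHAPE_API_KEY)
      array = JSON.parse(response.body)
      array["agents"].each do |a|
        agents[a["name"]] = {allow: a["allow"], disallow: a["disallow"]}
      end
    rescue StandardError
      # Network, HTTP, or JSON errors: return whatever was collected so far.
    end
    agents
  end
end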
app = RobotsParser.new
file = ARGV[0]
# File.foreach closes the file when done; each line carries a trailing newline.
File.foreach(file) do |line|
  url = line.strip
  next if url.empty?
  begin
    robots_file = app.robots_file_for(url)
    agents = app.parse_robots_file(robots_file)
    puts "Site: #{URI.parse(url).host} ->"
    puts "\t#{agents["*"]}"
    puts "\t#{agents["008"]}"
  rescue StandardError
    puts "InvalidURL"
  end
end
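
The URL-to-robots.txt mapping is the one step of the script that can be checked offline, without a Mashape key. The following is a minimal sketch of that step on its own. The helper name `robots_url_for` and the sample inputs are assumptions made for illustration and are not part of the original gist; unlike `robots_file_for`, this version returns `nil` rather than an empty string for unusable input.

```ruby
require 'uri'

# Hypothetical standalone helper (not part of the gist): maps a page URL to
# the robots.txt URL of its host, or returns nil when the input is unusable.
def robots_url_for(url)
  parsed = URI.parse(url.to_s.strip)          # strip the newline left by line-based reads
  return nil unless parsed.scheme && parsed.host
  "#{parsed.scheme}://#{parsed.host.downcase}/robots.txt"
rescue URI::InvalidURIError
  nil
end

# Illustrative inputs only: one well-formed URL with a trailing newline,
# and one string that is not a URL at all.
["http://Example.com/some/page?x=1\n", "not a url"].each do |line|
  puts "#{line.strip} -> #{robots_url_for(line).inspect}"
end
```

Returning `nil` for bad input would let the calling loop skip the API request entirely for lines that are not URLs, instead of sending an empty string to the service.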