@dearing
Created April 27, 2012 02:48
WXR export attachment scraper (Content-Type hard-coded to image/jpeg); checks for existing CDN objects and publishes any that are missing
#!/usr/bin/ruby
require 'net/http'
require 'cloudfiles'
require 'colorize'
require 'uri'

i = 0
skipto = ARGV[0].to_i   # line index to resume from (0 = start at the top)
snooze = ARGV[1].to_f   # seconds to sleep between lines

cf = CloudFiles::Connection.new(:username => "username", :api_key => "api-key", :snet => true)
container = cf.container('my-container')

Net::HTTP.start("example.tld") do |http|
  # Read file for URLs to test; download and publish any that are not already extant
  work = File.readlines('work.txt')
  work.each do |line|
    i = i + 1
    # quick and dirty 'skipto' index for session restarts
    next if i < skipto
    sleep(snooze)
    URI.extract(line, "http").each do |url|
      url =~ /(http:\/\/example.tld\/)/
      name = $'  # post-match: the path after the host becomes the object name
      print "\t[#{i}/#{work.length}] check: #{url}"
      # if extant, move on; else publish to CDN
      if !container.objects.include?(name)
        print "\rFAIL\n".red
        print "\t#{name} => pushing to CDN...".magenta
        resp = http.get(url)
        object = container.create_object name, false
        object.write resp.body, {'Content-Type' => 'image/jpeg'}
        print "done\n".green
      else
        print "\rPASS\n".green
      end
    end
  end
end # Net::HTTP
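
For reference, work.txt is expected to hold lines containing http URLs; the script pulls them out with URI.extract and keys each CDN object by the path that follows http://example.tld/. A typical invocation, assuming the file is saved as scraper.rb (that filename is just for illustration), would be:

    # resume at line 120 of work.txt, sleeping a quarter second between lines
    ruby scraper.rb 120 0.25

The first argument is the line of work.txt to resume from after an interrupted session; the second is the number of seconds to sleep between lines so the origin server isn't hammered.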
@dearing (Author) commented Apr 27, 2012

Ug, -- "A vote for me is a vote for a consistent, globally enforced tab size"
