lazypower/3874439
Created October 11, 2012 18:18
Sanitizer script
# job.rb (loaded by the sanitizer script via require './job.rb')
require 'mongoid'

# `Job` model bound to the Mongo database.
# Helper data model for job documents, whose properties have no standard
# definition across the aggregated data sources.
class Job
  include Mongoid::Document

  field :company,       :type => String, :default => ""
  field :location,      :type => String, :default => ""
  field :industries,    :type => Array,  :default => []
  field :job_type,      :type => Array,  :default => []
  field :job_copy,      :type => String, :default => ""
  field :job_title,     :type => String, :default => ""
  field :original_link, :type => String, :default => ""

  # TODO: Define a reference document for has_one/has_many, and update the
  # scraper to store the job creation date.

  # Looks up the processed link matching this job's original URL and returns
  # its `at` value. Processedlink is defined elsewhere and is not part of this file.
  # Returns nil when no matching link exists instead of raising NoMethodError.
  def created
    link = Processedlink.where(:url => original_link).first
    link && link.at
  end
end
require 'rubygems'
require 'mongoid'
require './job.rb'

# Not a Rails app, so tell Mongoid explicitly which config file and environment to load.
Mongoid.load!("./config/mongoid.yml", :development)

class Sanitizer
  # Options for String#encode: replace invalid byte sequences and characters
  # with no ASCII equivalent by an empty string, i.e. drop them.
  # This is a constant because an instance variable assigned in the class body
  # is not visible inside instance methods such as #start.
  ENCODING_OPTIONS = {
    :invalid => :replace,
    :undef   => :replace,
    :replace => ''
  }

  def start
    # Count of records that saved successfully.
    saved_counter = 0
    # Lazily loaded criteria over the whole collection.
    jobs = Job.all
    puts "Processing #{jobs.count} records"
    # Iterate through each document as job.
    jobs.each do |job|
      # job_copy contains characters we do not want. Convert from UTF-8 back to
      # ASCII and silently drop everything covered by ENCODING_OPTIONS.
      job.job_copy = job.job_copy.encode(Encoding.find('ASCII'), **ENCODING_OPTIONS)
      # If the record saved, count it and print its id.
      if job.save
        saved_counter += 1
        puts job.id
      end
    end
    # Finished.
    puts "Processed #{saved_counter} records"
  end
end

# Driver
s = Sanitizer.new
s.start
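
The encoding step in the sanitizer can be tried without a Mongo database. The following is a minimal sketch in plain Ruby, assuming the input strings are UTF-8; `strip_non_ascii` is a hypothetical helper name introduced only for this illustration and is not part of the gist.

```ruby
# Same options the sanitizer uses: invalid byte sequences and characters
# that have no ASCII equivalent are replaced with an empty string.
ENCODE_OPTS = {
  :invalid => :replace,
  :undef   => :replace,
  :replace => ''
}

# Hypothetical helper (not in the original gist): transcode a string to
# ASCII, dropping everything that cannot be represented.
def strip_non_ascii(text)
  text.encode(Encoding.find('ASCII'), **ENCODE_OPTS)
end

p strip_non_ascii("café")        # "caf"    (é has no ASCII equivalent)
p strip_non_ascii("abc\xFFdef")  # "abcdef" (\xFF is not valid UTF-8)
```

Note that this approach discards characters rather than transliterating them, so accented letters and typographic quotes disappear from the text entirely.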