@lazypower
Created October 11, 2012 18:18
Sanitizer script
require 'mongoid'

# `Job` model bound to the Mongo database.
# Helper data model that declares the properties of a job document. The
# aggregated data has no standard definition across sources, so every field
# carries a default.
class Job
  include Mongoid::Document

  field :company, :type => String, :default => ""
  field :location, :type => String, :default => ""
  field :industries, :type => Array, :default => []
  field :job_type, :type => Array, :default => []
  field :job_copy, :type => String, :default => ""
  field :job_title, :type => String, :default => ""
  field :original_link, :type => String, :default => ""

  # TODO: Define reference document for has_one/has_many - and update the scraper with Job Creation Date.

  # Look up the Processedlink whose url matches this job's original_link and
  # return its `at` value, or nil when no matching link exists.
  def created
    link = Processedlink.where(:url => original_link).first
    link && link.at
  end
end
require 'rubygems'
require 'mongoid'
require './job.rb'
#not a rails app, tell mongo where and how to do stuff
Mongoid.load!("./config/mongoid.yml", :development)
class Sanitizer
  # Options for String#encode: replace invalid byte sequences and characters
  # with no ASCII equivalent by an empty string, i.e. drop them silently.
  # Kept in a constant so that instance methods can see it.
  ENCODING_OPTIONS = {
    :invalid => :replace,
    :undef => :replace,
    :replace => ''
  }

  def start
    # accumulator for successfully saved records
    saved_counter = 0
    # lazy load the collection
    jobs = Job.all
    # report how many records will be processed
    puts "Processing #{jobs.count} Records"
    # iterate through each document as job
    jobs.each do |job|
      # job_copy may contain characters that cannot be represented in ASCII.
      # Convert from UTF-8 back to ASCII and silently drop whatever cannot be
      # converted, as configured in ENCODING_OPTIONS.
      job.job_copy = job.job_copy.encode('ASCII', **ENCODING_OPTIONS)
      # if the record saved, count it and print its id
      if job.save
        saved_counter += 1
        puts job.id
      end
    end
    # we finished
    puts "processed #{saved_counter} records"
  end
end
#driver
s = Sanitizer.new
s.start
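
The transcoding step can be tried on its own, without a Mongo connection. The snippet below is a minimal sketch, assuming the same three `String#encode` options the sanitizer uses; `strip_non_ascii` and the sample string are hypothetical and not part of the gist.

```ruby
# Hypothetical helper, not part of the gist: converts text to ASCII and
# drops every character that has no ASCII equivalent instead of raising.
def strip_non_ascii(text)
  text.encode('ASCII', :invalid => :replace, :undef => :replace, :replace => '')
end

# Made-up sample containing an en dash and two accented letters.
sample = "Senior Engineer \u2013 Z\u00fcrich caf\u00e9"
puts strip_non_ascii(sample)
```

Accented letters are removed outright, not transliterated, so "Zürich" comes back as "Zrich". If that loss matters for `job_copy`, a transliteration step before the encode call would be the place to handle it.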