@tlehman
Created July 27, 2012 18:44
Tag spam comments
# tag_spam_comments_in looks at a database and tags the records of
# the 'comments' table that are most likely to be spam.
# NOTE: This is meant to be run on the same system as the database
#
# Dependencies:
# gems: [sequel, spellchecker, mysql2]
# apps: [aspell, mysql]
# NOTE: aspell must have the 'en' dictionary installed
require 'rubygems'
require 'sequel'
require 'spellchecker'
def tag_spam_comments_in(dbinfo)
  db = Sequel.connect("mysql2://#{dbinfo[:user]}:#{dbinfo[:pass]}@" +
                      "localhost/#{dbinfo[:dbname]}")
  # deliberately loose URL pattern: any body containing something URL-like is tagged
  url_regex = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/
  db[:comments].to_a.each do |comment|
    if comment[:body] =~ url_regex
      comment[:spam] = 1
    elsif not spelledgood? comment[:body]
      comment[:spam] = 1
    elsif ads? comment
      comment[:spam] = 1
    else
      comment[:spam] = 0
    end
    # write the tag back to this comment's row
    db[:comments].filter(:id => comment[:id]).update(:spam => comment[:spam])
  end
end
def spelledgood?(sentence)
  tolerance = 0.2
  # tally correctly and incorrectly spelled words, then compare the
  # misspelled fraction against the tolerance
  spelled = {true => 0.0, false => 0.0}
  checked = Spellchecker.check(sentence).map { |s| s[:correct] }
  checked.each do |good|
    spelled[good] += 1.0
  end
  # if no words were checked this is 0.0/0.0 (NaN), and NaN < tolerance is
  # false, so such a body counts as badly spelled
  spelled[false] / checked.length.to_f < tolerance
end
def ads?(comment)
  adwords = %w{propecia accutane stilnox vigrx ambien valium ativan klonopin tramadol modafinil paxil phentermine cialis adipex lexapro levitra xanax lipitor pharm payday handbag enlarge viagra pills cheap free vimax suppressant chanel vuitton prostitut}
  # case-insensitive match on any of the ad keywords
  adregex = Regexp.new("(#{adwords.join("|")})", Regexp::IGNORECASE)
  # check every user-supplied field, not just the body
  [:body, :email, :website, :name].any? { |field| comment[field] =~ adregex }
end
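
The misspelling-ratio rule in spelledgood? can be tried out without aspell or a database. The following is only a sketch under stated assumptions: KNOWN_WORDS is a tiny hypothetical stand-in for the aspell 'en' dictionary, and misspelled_ratio and spelled_well? are illustrative names that are not part of the script above. Unlike the original, this version returns 0.0 for an empty body rather than dividing by zero.

```ruby
require 'set'

# Hypothetical stand-in for the aspell 'en' dictionary, so the sketch
# runs with no external dependencies.
KNOWN_WORDS = Set.new(%w[this is a great post thanks for sharing])

# Fraction of words in text that are not in the dictionary.
# An empty (or nil) text yields 0.0 instead of 0.0/0.0.
def misspelled_ratio(text, dictionary = KNOWN_WORDS)
  words = text.to_s.downcase.scan(/[a-z']+/)
  return 0.0 if words.empty?
  words.count { |word| !dictionary.include?(word) } / words.length.to_f
end

# Same decision rule as spelledgood?: fewer than 20% misspelled words.
def spelled_well?(text, tolerance = 0.2)
  misspelled_ratio(text) < tolerance
end

puts spelled_well?("This is a great post")   # all five words are known
puts spelled_well?("thiss iz a grate post")  # three of five words are unknown
```

With this word list the first call prints true and the second prints false, since 3 of 5 unknown words gives a ratio of 0.6, well above the 0.2 tolerance.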