Created
September 15, 2011 19:21
-
-
Save isomorphisms/1220207 to your computer and use it in GitHub Desktop.
How to identify spam on tumblr
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
init Score; | |
##### most telling features ####### | |
Score += const * 10th_decile( vector_of_likes$Date ); | |
Score += const * another_functional_of( vector_of_likes$Date ); | |
Score += const * number of likes; #interaction term with % reblogs / number of written posts ... if the written posts start being hacked from somewhere else there are ways to identify that too | |
Score += const * avg_speed_of like_submission; | |
Score += const * is_there_an_ad_on_the_blog; | |
###### | |
##### most telling feature overall: hashtag squatting ##### | |
@hashtags = qw( tags_on most_recently_liked_post ); | |
init squattercount; | |
foreach (@hashtags) { | |
squattercount = number_of_likes_with the_same_hashtag; | |
} | |
Score += const * squattercount ; | |
#do this for a few randomly spread Likes | |
#the squattercount is the best screening mechanism, i.e. if you're figuring out which accounts to run this regression on | |
###### names and photos are telling ######### | |
run SVM on photo; | |
run SVM on name; | |
if (the kernelised photo is north of the hyperplane) { Score += const; } | |
if (the kernelised name is north of the hyperplane) { Score += const; } | |
###### | |
###### pure laziness play ########## | |
if (TumblrMessage == default "I love here :)") { Score += const; } | |
#interaction terms with default message + every other term | |
#interaction terms with default title + every other term | |
Score += concave_fn( percent of posts that are reblogs ) * const; | |
#actually the fn. should be nearly linear up to 80% and then starkly nonlinear monotonic, just so long as the highest %'s are differentiated | |
########## put it all together ####### | |
Do a median-polish regression on the logistic of {spam, not_spam} to get decent values for the constants. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment