@afomi
Created October 3, 2024 16:31

To identify spam programmatically across a large dataset of responses spanning multiple surveys, you can take a multi-pronged approach that combines heuristic rules, statistical analysis, and machine learning. Here’s a strategy you can implement:

1. Heuristic-Based Rules for Identifying Spam

Start by defining clear rules that can flag suspicious responses. These rules should be tuned to the nature of your surveys but might include the following (a code sketch implementing several of them appears after the list):

  • Repeated Entries from the Same IP: Responses coming from the same IP address multiple times in a short period might indicate spam.
  • Unrealistic Completion Times: Responses completed far faster or slower than the average could be flagged; an abnormally fast completion time suggests the respondent didn’t actually read the questions.
  • Uniform or Repetitive Answers: If responses to multiple questions are identical (e.g., all answers are “1” or "Yes"), it could indicate spam.
  • Long Strings of Random Characters: Use regular expressions to detect gibberish in text fields.
  • Incomplete Responses: Responses that fail to provide key information (e.g., missing many fields) could be flagged as suspicious.
  • Patterned Answers: Look for repeated patterns in responses, such as alternating answer patterns (e.g., "Yes, No, Yes, No").
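
Here is a minimal pandas sketch of several of these rules. The column names (`ip`, `duration_seconds`, `free_text`) and the thresholds are illustrative assumptions; adapt them to your own schema and tune them against your data:

```python
import pandas as pd

def flag_heuristics(df: pd.DataFrame, answer_cols: list[str]) -> pd.DataFrame:
    """Return one boolean column per heuristic; True means 'suspicious'."""
    flags = pd.DataFrame(index=df.index)

    # Repeated entries from the same IP address.
    flags["dup_ip"] = df.duplicated(subset="ip", keep=False)

    # Unrealistically fast completions: well under the median duration.
    median = df["duration_seconds"].median()
    flags["too_fast"] = df["duration_seconds"] < 0.25 * median

    # Uniform answers: every closed-ended answer is identical.
    flags["uniform"] = df[answer_cols].nunique(axis=1) == 1

    # Gibberish: long consonant runs in free text suggest keyboard mashing.
    flags["gibberish"] = (
        df["free_text"]
        .fillna("")
        .str.contains(r"[bcdfghjklmnpqrstvwxz]{6,}", case=False, regex=True)
    )

    # Incomplete responses: more than half of the answer fields are missing.
    flags["incomplete"] = df[answer_cols].isna().mean(axis=1) > 0.5

    return flags
```

A response that trips several flags at once is a much stronger spam candidate than one that trips a single flag, so it can help to sum the flags per row and rank by that count.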

2. Statistical Methods

Use statistical approaches to find outliers in the dataset that might indicate anomalous (and possibly spammy) responses. A sketch of the outlier checks follows the list.

  • Outlier Detection: Use techniques like Z-score or Interquartile Range (IQR) to identify survey responses with values far from the norm. For example, if most people rate a service between 3 and 5, but some consistently rate it 1, they could be spammers or bots.
  • Response Length Analysis: Analyze the lengths of text responses. Very short answers for open-ended questions (e.g., single words) may indicate spam. Similarly, responses much longer than average could be auto-generated content.
  • Distribution Analysis: Check if there are large spikes in responses from certain sources or times. For example, if a large number of responses suddenly come from a specific region or a set of suspicious IPs, they might be generated by bots.
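
A minimal sketch of the Z-score and IQR checks on a numeric rating column. The |z| > 3 and 1.5 × IQR cutoffs are the usual conventions, not hard rules:

```python
import pandas as pd

def outlier_flags(ratings: pd.Series) -> pd.DataFrame:
    """Flag ratings that are outliers by Z-score or by the IQR rule."""
    z = (ratings - ratings.mean()) / ratings.std()

    q1, q3 = ratings.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return pd.DataFrame({
        "z_outlier": z.abs() > 3,
        "iqr_outlier": (ratings < lo) | (ratings > hi),
    })
```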

3. Machine Learning-Based Detection

Once you’ve identified key patterns using heuristics, you can build machine learning models to detect potential spam automatically. This approach scales better with large datasets; a classifier sketch follows the list.

  • Supervised Learning (Spam/Not Spam Classifier): If you have a labeled dataset of known spam responses, you can train a classifier (e.g., Logistic Regression, Random Forest, or an ensemble method) to detect spam. Use features such as:
    • Time taken to complete the survey.
    • IP address uniqueness.
    • Answer patterns and length of open responses.
    • Number of consecutive identical answers (e.g., all "1"s).
    • Text-based features (e.g., presence of specific words or gibberish).
  • Unsupervised Learning (Clustering/Anomaly Detection): If you don’t have labeled data, clustering techniques like K-Means or DBSCAN can be used to identify groups of anomalous responses that deviate from the majority.
  • Natural Language Processing (NLP): Use NLP techniques to analyze text responses. Features like sentiment, syntactic structure, or even language detection could help identify low-quality, generated, or inappropriate responses. Pre-trained models like BERT could be used to find nonsensical or contextually inappropriate text.
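
Here is a sketch of the supervised route using scikit-learn’s RandomForestClassifier. The feature columns and the `is_spam` label column are hypothetical; the labels are assumed to come from manual review (see step 7):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical per-response feature columns; adapt to your own schema.
FEATURES = ["duration_seconds", "ip_repeat_count", "mean_answer_length",
            "n_identical_answers", "gibberish_score"]

def train_spam_classifier(df: pd.DataFrame) -> RandomForestClassifier:
    """Train a spam/not-spam classifier from manually labeled responses."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["is_spam"],
        test_size=0.2, stratify=df["is_spam"], random_state=42)

    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42)
    clf.fit(X_train, y_train)

    # Report precision/recall on the held-out split (see step 8).
    print(classification_report(y_test, clf.predict(X_test)))
    return clf
```

If labels aren’t available yet, the same feature matrix can be fed to an unsupervised detector instead (e.g., scikit-learn’s IsolationForest, or DBSCAN as mentioned above), and the flagged anomalies can seed the manual-labeling effort.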

4. Behavioral-Based Analysis

  • IP Geolocation Check: Responses originating from suspicious or unexpected geolocations (e.g., countries not related to the survey) could be flagged.
  • Time-of-Day Patterns: Analyze the times when surveys were completed. A high volume of responses at odd hours (e.g., 3 a.m.), especially in bursts, may indicate spam; see the burst-detection sketch after this list.
  • Device Fingerprinting: Analyze browser and device data (if available). If you detect many responses from the same device or browser fingerprint, it could indicate spam activity.
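
As one example, bursts can be detected by counting submissions per IP per hour. This sketch assumes hypothetical `ip` and `submitted_at` (datetime) columns, and the 20-per-hour threshold is arbitrary:

```python
import pandas as pd

def burst_flags(df: pd.DataFrame, max_per_hour: int = 20) -> pd.Series:
    """Flag responses that arrive in unusually dense bursts from one IP."""
    hour = df["submitted_at"].dt.floor("h")
    per_ip_hour = df.groupby([df["ip"], hour])["ip"].transform("size")
    return per_ip_hour > max_per_hour
```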

5. Use CAPTCHA or reCAPTCHA on Entry

In future survey deployments, implement CAPTCHA or reCAPTCHA to deter automated bots. While this doesn't help with historical data, it can prevent a significant amount of spam going forward.

6. Evaluate Consistency Across Responses

  • Cross-Survey Comparison: Check whether similar responses across multiple surveys (from the same respondent) show signs of inconsistency or copy-pasting. Identical responses across different survey topics can be a sign of automated spam; a duplicate-detection sketch follows this list.
  • Survey-Specific Filters: Some surveys may have specific questions with expected consistent answers (e.g., demographic info). Inconsistent responses to these questions could help identify spam.
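
A sketch of the cross-survey check, assuming hypothetical `respondent_id` and `free_text` columns where each row is one response to one survey:

```python
import pandas as pd

def cross_survey_duplicates(df: pd.DataFrame) -> pd.Series:
    """Flag rows where a respondent reused identical free text."""
    normalized = df["free_text"].fillna("").str.lower().str.strip()
    dupes = df.assign(norm=normalized).duplicated(
        subset=["respondent_id", "norm"], keep=False)
    # Ignore empty answers, which are trivially identical.
    return dupes & (normalized != "")
```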

7. Human Review and Feedback Loop

  • Once you’ve implemented some of these automated checks, have a small team manually review flagged responses to tune and improve your model and heuristics over time. This could also provide labeled data to train a more accurate ML model.

8. Set Up an Evaluation Framework

  • Precision and Recall Metrics: When deploying any automated system, track how many false positives (legitimate responses flagged as spam) and false negatives (spam missed by the system) it produces. A good balance between precision and recall is important; a small helper for computing these metrics appears below.
  • Feedback Loop: Continuously improve the model based on feedback from manual review, refining rules, and adding new features or training data.
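
A small helper, assuming predictions and manually reviewed ground-truth labels as 0/1 arrays:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def spam_detector_metrics(y_true, y_pred) -> dict:
    """Precision: fraction of flagged responses that are truly spam.
    Recall: fraction of actual spam that was caught."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```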

Would you like assistance with implementing any specific part of this process (e.g., setting up a machine learning model, writing rules for spam detection, or automating some of these checks)?
