Asked Gemini AI. It answers regardless of how stupid it can be.

Normalizing User-Agent (UA) strings is a smart move. Raw UA strings are messy, inconsistent, and easily spoofed, but they provide a useful extra "fingerprint" layer when combined with IP addresses and session IDs.

For a high-traffic site like a video host, storing raw strings leads to database bloat and slow queries. Here is how to normalize them effectively for bot detection and rate limiting.


1. The Multi-Layered Normalization Approach

Don't just store one version of the UA. I recommend breaking it down into three distinct database columns to balance storage efficiency with query flexibility.

A. The "Clean" Component (Parsing)

Use a library (like uap-core or ua-parser-js) to extract the core metadata; a parsing sketch follows the list below. Instead of the whole string, store:

  • Browser Family: (e.g., Chrome, Firefox, python-requests)
  • OS: (e.g., Windows 10, Android 13)
  • Device Type: (e.g., Mobile, Desktop, Bot)
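A minimal sketch in Python, assuming the uap-python package (the Python implementation of uap-core) and its classic user_agent_parser.Parse API. Note that uap-core reports a device family rather than a ready-made Mobile/Desktop/Bot flag, so that last mapping still needs your own logic:

```python
# pip install ua-parser  (Python port of uap-core)
from ua_parser import user_agent_parser

def extract_ua_metadata(ua_string: str) -> dict:
    """Parse a raw UA string into the three fields worth storing."""
    parsed = user_agent_parser.Parse(ua_string)
    return {
        "browser_family": parsed["user_agent"]["family"],  # e.g. "Chrome", "Firefox"
        "os_family": parsed["os"]["family"],               # e.g. "Windows", "Android"
        "device_family": parsed["device"]["family"],       # e.g. "iPhone"; "Other" for most desktops
    }
```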

B. The Normalized Hash (For Fast Lookups)

To catch aggressive users, you need to group them.

  1. Strip Versioning: Remove minor/patch versions (e.g., change Chrome/114.0.5735.198 to Chrome/114).
  2. Sort Tokens: Some scrapers reorder tokens to bypass filters. Sorting them alphabetically ensures they always result in the same string.
  3. Hash it: Apply a SHA-256 hash (or MD5 if you only need grouping, never security) to the cleaned string; a sketch follows this list.
  • Benefit: Querying an indexed hash column is significantly faster than a LIKE %...% text search.
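A minimal sketch of that strip-sort-hash pipeline using only the standard library; the version-stripping regex and whitespace tokenization are simplifying assumptions, not a complete UA grammar:

```python
import hashlib
import re

def normalize_ua(ua: str) -> str:
    """Lower-case, strip minor/patch versions, and sort tokens."""
    ua = ua.lower().strip()
    # "chrome/114.0.5735.198" -> "chrome/114": keep only the major version
    ua = re.sub(r"/(\d+)[\d.]*", r"/\1", ua)
    # Sort tokens so reordered UAs normalize to the same string
    return " ".join(sorted(ua.split()))

def ua_hash(ua: str) -> str:
    """SHA-256 hex digest of the normalized string, for the indexed column."""
    return hashlib.sha256(normalize_ua(ua).encode("utf-8")).hexdigest()
```

Any two requests whose UAs differ only in patch version or token order now collide to the same hash, which is exactly what the grouping query needs.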

C. The Categorization Flag

Mark the UA based on its nature (a simple classifier sketch follows the list):

  • Known Bot: (Googlebot, Bingbot)
  • Library/Script: (cURL, Python, Go-http-client) — These are usually your "aggressive" uploaders.
  • Browser: (Standard user traffic)
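A classifier along these lines; the pattern lists are illustrative assumptions and should be grown from your own observed traffic:

```python
import re

# Illustrative patterns only -- extend these from traffic you actually see
KNOWN_BOTS = re.compile(r"googlebot|bingbot|duckduckbot", re.IGNORECASE)
LIBRARIES = re.compile(r"curl|wget|python-requests|python-urllib|go-http-client|okhttp", re.IGNORECASE)

def categorize_ua(ua: str) -> str:
    """Return 'known_bot', 'library', or 'browser' for a raw UA string."""
    if KNOWN_BOTS.search(ua):
        return "known_bot"
    if LIBRARIES.search(ua):
        return "library"
    return "browser"
```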

2. Database Schema Example

Column           Data Type                 Purpose
ua_hash          BINARY(32) or CHAR(64)    Indexed. Groups "identical" devices. (Sized for SHA-256; use BINARY(16)/CHAR(32) if you hash with MD5.)
browser_family   VARCHAR(50)               To see if attacks are coming from a specific browser.
is_automated     BOOLEAN                   Flags UA strings that identify as scripts/libraries.
last_seen        TIMESTAMP                 For sliding-window rate limiting.

3. Detecting Aggressive Behavior

Since your goal is to stop aggressive uploaders, look for these "Red Flag" patterns in your normalized data:

  1. Generic Library UAs: Legitimate users rarely upload videos via python-requests/2.28.1. A high frequency of these means someone is likely automating your upload endpoint.
  2. Version Mismatch: A UA claiming to be Windows 10 but reporting an ancient browser build (e.g., Chrome 50) is often a headless browser or a bot script.
  3. One UA Across Many IPs: If 1,000 different IPs all send the exact same complex, high-version UA string, it is likely a coordinated botnet. (A heuristic sketch follows this list.)
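A sketch of those three checks as one function; the version cutoff and IP-count threshold are arbitrary placeholders to tune against your own data, and the library set is illustrative:

```python
LIBRARY_FAMILIES = {"python-requests", "curl", "go-http-client"}  # illustrative

def red_flags(browser_family: str, browser_major: str, distinct_ips: int) -> list:
    """Collect red-flag labels for one normalized UA observation."""
    flags = []
    # 1. Generic library UA hitting an upload endpoint
    if browser_family.lower() in LIBRARY_FAMILIES:
        flags.append("library_ua")
    # 2. Version mismatch: implausibly old browser build (cutoff is a placeholder)
    if browser_major.isdigit() and int(browser_major) < 90:
        flags.append("ancient_browser_version")
    # 3. One UA hash spread across an unusually large IP pool (threshold is a placeholder)
    if distinct_ips >= 1000:
        flags.append("possible_botnet")
    return flags
```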

4. Implementation Logic

When a request comes in:

  1. Lower-case the entire string.
  2. Remove noise: Strip out specific timestamps or unique identifiers some apps inject into the UA.
  3. Hash the result.
  4. Check the Cache: Use Redis to count hits for IP + UA_Hash over a 60-second window. If it exceeds your threshold (e.g., 10 uploads/min), trigger a CAPTCHA or return a 429 Too Many Requests (see the sketch below).
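A sketch of that counter using redis-py; INCR plus EXPIRE gives a fixed 60-second window, a common and cheaper approximation of a true sliding window:

```python
# pip install redis
import redis

r = redis.Redis()   # assumes a local Redis instance
WINDOW_SECONDS = 60
MAX_UPLOADS = 10    # the 10 uploads/min threshold from above

def allow_upload(ip: str, ua_hash: str) -> bool:
    """True if this (IP, UA hash) pair is still under its per-minute budget."""
    key = f"rl:{ip}:{ua_hash}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, WINDOW_SECONDS)  # start the window on the first hit
    return count <= MAX_UPLOADS
```

When it returns False, respond with 429 Too Many Requests or escalate to a CAPTCHA.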

Note: Remember that User-Agents can be spoofed. Never rely on them for security, only for traffic shaping and heuristics.

Would you like me to provide a Python or Node.js code snippet for a normalization function that strips versions and hashes the string?
