Normalizing User-Agent (UA) strings is a smart move. Raw UA strings are messy, inconsistent, and can be easily spoofed, but they provide a vital "fingerprint" layer when combined with IP and Session IDs.
For a high-traffic site like a video host, storing raw strings leads to database bloat and slow queries. Here is how to normalize them effectively for bot detection and rate limiting.
Don't just store one version of the UA. I recommend breaking it down into three distinct database columns to balance storage efficiency with query flexibility.
Use a library (like uap-core or ua-parser-js) to extract the core metadata. Instead of the whole string, store:
- Browser Family: (e.g., Chrome, Firefox, python-requests)
- OS: (e.g., Windows 10, Android 13)
- Device Type: (e.g., Mobile, Desktop, Bot)
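As a sketch, here is what that extraction could look like in Python using the `user_agents` package (a popular wrapper around the uap-core regexes; the package choice and field mapping are assumptions about your stack):

```python
# pip install user-agents
from user_agents import parse

def extract_ua_metadata(ua_string: str) -> dict:
    """Reduce a raw UA string to the three columns worth storing."""
    ua = parse(ua_string)
    if ua.is_bot:
        device_type = "Bot"
    elif ua.is_mobile or ua.is_tablet:
        device_type = "Mobile"
    else:
        device_type = "Desktop"
    return {
        "browser_family": ua.browser.family,                     # e.g., "Chrome"
        "os": f"{ua.os.family} {ua.os.version_string}".strip(),  # e.g., "Windows 10"
        "device_type": device_type,
    }
```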
To catch aggressive users, you need to group near-identical UA strings together.
- Strip Versioning: Remove minor/patch versions (e.g., change `Chrome/114.0.5735.198` to `Chrome/114`).
- Sort Tokens: Some scrapers reorder tokens to bypass filters. Sorting them alphabetically ensures equivalent UAs always produce the same string.
- Hash It: Apply a `SHA-256` or `MD5` hash to the cleaned string.
- Benefit: Querying an indexed hash column is significantly faster than a `LIKE '%...%'` text search.
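A minimal Python sketch of that normalize-and-hash step (the version-stripping regex and whitespace tokenization are illustrative simplifications, not a hardened parser):

```python
import hashlib
import re

def normalize_and_hash(ua_string: str) -> str:
    """Collapse a raw UA string into a stable, indexable fingerprint."""
    ua = ua_string.lower().strip()
    # Strip minor/patch versions: "chrome/114.0.5735.198" -> "chrome/114"
    ua = re.sub(r"/(\d+)[\d.]*", r"/\1", ua)
    # Sort tokens so reordered UAs collapse to the same string
    cleaned = " ".join(sorted(ua.split()))
    # Note: a SHA-256 hex digest is 64 chars; MD5 (16 bytes / 32 hex chars)
    # matches the BINARY(16)/CHAR(32) column sizes suggested below.
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
```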
Classify each UA based on its nature:
- Known Bot: (Googlebot, Bingbot)
- Library/Script: (cURL, Python, Go-http-client) — These are usually your "aggressive" uploaders.
- Browser: (Standard user traffic)
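A sketch of that classification with deliberately tiny pattern lists (in production you would swap in a maintained bot database, such as the uap-core definitions):

```python
import re

# Illustrative patterns only -- far from exhaustive
KNOWN_BOTS = re.compile(r"googlebot|bingbot|duckduckbot", re.I)
LIBRARIES = re.compile(r"curl|python-requests|go-http-client|wget|httpie", re.I)

def classify_ua(ua_string: str) -> str:
    if KNOWN_BOTS.search(ua_string):
        return "known_bot"
    if LIBRARIES.search(ua_string):
        return "library"  # usually your "aggressive" uploaders
    return "browser"
```

Store the result (or a derived boolean like `is_automated = classify_ua(ua) != "browser"`) alongside the hash. A suggested schema: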
| Column | Data Type | Purpose |
|---|---|---|
| `ua_hash` | `BINARY(16)` or `CHAR(32)` | Indexed. Used for grouping "identical" devices. |
| `browser_family` | `VARCHAR(50)` | To see if attacks are coming from a specific browser. |
| `is_automated` | `BOOLEAN` | Flags UA strings that identify as scripts/libraries. |
| `last_seen` | `TIMESTAMP` | For sliding-window rate limiting. |
Since your goal is to stop aggressive uploaders, look for these "Red Flag" patterns in your normalized data (two of them are sketched in code after the list):
- Generic Library UAs: Legitimate users rarely upload videos via `python-requests/2.28.1`. If you see a high frequency of these, they are likely automating your upload endpoint.
- Version Mismatch: A UA claiming to be `Windows 10` but running `Chrome 50` (a very old version) is often a headless browser or a bot script.
- High Entropy / Low Volume: If you see 1,000 different IPs, each at low request volume, all using the exact same complex, high-version UA string, it's likely a coordinated botnet.
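Two of these checks as hedged Python sketches; the thresholds and the in-memory store are placeholders (in production, back the IP sets with Redis and a TTL):

```python
from collections import defaultdict

# ua_hash -> set of client IPs seen recently (in-memory stand-in)
ips_per_ua = defaultdict(set)

def version_mismatch(browser_family: str, browser_major: int, os_family: str) -> bool:
    """Flag implausibly old browsers on a modern OS (the cutoff is illustrative)."""
    return (os_family.startswith("Windows 10")
            and browser_family == "Chrome"
            and browser_major < 60)

def suspicious_uniformity(ua_hash: str, ip: str, threshold: int = 1000) -> bool:
    """True when one exact UA fingerprint spans an implausible number of IPs."""
    ips_per_ua[ua_hash].add(ip)
    return len(ips_per_ua[ua_hash]) >= threshold
```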
When a request comes in:
- Lower-case the entire string.
- Remove noise: Strip out specific timestamps or unique identifiers some apps inject into the UA.
- Hash the result.
- Check the Cache: Use Redis to count hits for `IP + ua_hash` over a 60-second window. If the count exceeds your threshold (e.g., 10 uploads/min), trigger a CAPTCHA or return `429 Too Many Requests` (sketch below).
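A sketch of that counter with `redis-py`, implemented as a fixed 60-second window (a common approximation of a sliding window; the key prefix and limits are assumptions):

```python
# pip install redis
import redis

r = redis.Redis()  # assumes a local Redis instance

def allow_upload(ip: str, ua_hash: str, limit: int = 10, window_s: int = 60) -> bool:
    """Count uploads per IP + UA fingerprint; False means CAPTCHA or 429."""
    key = f"ratelimit:{ip}:{ua_hash}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)  # start the window on the first hit
    return count <= limit
```

On a `False` return, your middleware would short-circuit to the CAPTCHA challenge or `429` response before the upload handler runs.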
Note: Remember that User-Agents can be spoofed. Never rely on them for security, only for traffic shaping and heuristics.
Would you like a Node.js version of the normalization and rate-limiting sketches above?