@usametov
Last active April 14, 2025 04:00
stealth scraping

This is an example of stealth scraping with Etaoin in Clojure, simulating browser behavior while incorporating insights from the provided scrape.do blog posts. The blog posts focus on rotating proxies and avoiding detection; here those concepts are applied in a Clojure/Etaoin context, with an emphasis on browser simulation and anti-detection techniques.

Key Insights from the Blog Posts

  1. Cheap Rotating Proxies:

    • Rotating proxies are essential to avoid IP bans by distributing requests across multiple IP addresses, mimicking different users.
    • Residential proxies are recommended for stealth as they appear more legitimate than datacenter proxies.
    • Affordable proxy services can integrate with scraping tools to rotate IPs automatically.
  2. Web Scraping Detection:

    • Websites detect bots via headers (e.g., User-Agent), IP patterns, request frequency, and JavaScript fingerprinting (e.g., navigator.webdriver).
    • Stealth techniques include:
      • Randomizing User-Agent strings.
      • Using headless browsers to render JavaScript.
      • Mimicking human behavior (random delays, clicks, scrolls).
      • Managing cookies and sessions to maintain consistent browser state.
      • Avoiding honeypot traps (hidden links or fields).
  3. Python Web Scraping:

    • Emphasizes libraries like Requests and BeautifulSoup for static pages, but highlights the need for browser automation (e.g., Selenium, Playwright) for dynamic content.
    • Proxies and User-Agent rotation are critical for scaling without blocks.
  4. Web Scraping with Playwright:

    • Playwright excels at scraping JavaScript-heavy sites by automating browsers.
    • Stealth features include:
      • Masking automation indicators (e.g., navigator.webdriver).
      • Proxy rotation for IP diversity.
      • Random delays to avoid predictable request patterns.
      • Handling CAPTCHAs or anti-bot systems via plugins or APIs.
    • Playwright’s ability to simulate clicks, scrolls, and waits mimics human interaction.

Translating to Etaoin for Stealth Scraping

Etaoin is a Clojure library for browser automation, controlling headless browsers like Chrome or Firefox via WebDriver. It’s ideal for stealth scraping because it:

  • Runs real browsers, rendering JavaScript and handling dynamic content.
  • Supports user interactions (clicks, typing, scrolling) to mimic humans.
  • Allows customization of the User-Agent and proxy through driver options and Chrome flags.
  • Manages cookies and sessions naturally, like a browser.

To achieve stealth scraping with Etaoin, inspired by the blog posts, we’ll:

  1. Use rotating proxies to avoid IP detection.
  2. Randomize User-Agent strings to mimic different browsers/devices.
  3. Simulate human-like behavior (delays, clicks, scrolls).
  4. Execute JavaScript to mask automation (e.g., spoof navigator.webdriver).
  5. Handle dynamic content by waiting for elements to load.
  6. Manage cookies/sessions for consistent state.
  7. Extract data using CSS selectors or XPath, similar to Playwright’s approach.

Step-by-Step Implementation with Etaoin

Below is a comprehensive example of stealth scraping with Etaoin, targeting a hypothetical dynamic site (e.g., https://example.com), incorporating the blog posts’ techniques. I’ll assume you’re scraping a JavaScript-heavy page that requires interaction, like clicking a button to load content.

Prerequisites

  • Clojure Setup: Ensure you have Clojure and Leiningen/Boot installed.
  • Etaoin: Add to your project.clj:
    [etaoin "1.0.40"]
  • WebDriver: Install ChromeDriver or GeckoDriver matching your browser version.
  • Proxy Service: Use a rotating proxy provider (e.g., Scrape.do, Oxylabs) with an API or proxy list. The blog post on cheap proxies suggests services like Smartproxy or IPRoyal for affordability.

Code Example

;; Only etaoin.api is used below; the unused requires (including core.async,
;; which is not listed as a dependency) are dropped so the namespace loads.
(ns stealth-scraper.core
  (:require [etaoin.api :as e]))

;; Utility to pick random User-Agent
(def user-agents
  ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
   "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
   "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"])

(defn random-user-agent []
  (rand-nth user-agents))

;; Proxy configuration (replace with your proxy service details)
(def proxy-list
  [{:host "proxy1.example.com" :port 8080 :username "user" :password "pass"}
   {:host "proxy2.example.com" :port 8080 :username "user" :password "pass"}])

(defn random-proxy []
  (rand-nth proxy-list))

;; Random delay to mimic human behavior
(defn random-delay []
  (Thread/sleep (+ 1000 (rand-int 2000)))) ; 1-3 seconds

;; Stealth JavaScript to mask automation
(def stealth-js
  "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});")

(defn scrape-page [url]
  (let [prx (random-proxy)
        user-agent (random-user-agent)
        ;; --proxy-server takes only host:port. Chrome does not read credentials
        ;; from this flag, so :username/:password are unused here; authenticated
        ;; proxies need IP allowlisting or a local forwarding proxy.
        driver (e/chrome {:args [(str "--proxy-server=http://" (:host prx) ":" (:port prx))
                                 "--no-sandbox"
                                 "--disable-dev-shm-usage"
                                 (str "--user-agent=" user-agent)
                                 "--disable-blink-features=AutomationControlled"]
                          :headless true})]
    (try
      ;; Navigate to target page
      (e/go driver url)
      ;; Mask navigator.webdriver (runs after the page has started loading)
      (e/js-execute driver stealth-js)
      ;; Simulate human behavior
      (random-delay)
      ;; Scroll to the bottom to trigger lazily loaded content
      (e/js-execute driver "window.scrollTo(0, document.body.scrollHeight);")
      (random-delay)
      ;; Return the raw HTML of the rendered page
      (e/get-source driver)
      (finally
        ;; Always shut the browser down, even if a step above throws
        (e/quit driver)))))
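
The flag-building part of scrape-page can be pulled out into pure functions, which makes it possible to check the options without launching a browser. This is a minimal sketch, not part of the original gist: the namespace stealth-scraper.options and the names chrome-args and chrome-options are assumptions introduced for illustration, and the REPL usage in the comment form assumes Etaoin is on the classpath.

```clojure
(ns stealth-scraper.options)

(defn chrome-args
  "Builds the Chrome command-line flags used by scrape-page.
  prx is a map with :host and :port; user-agent is a string."
  [prx user-agent]
  [(str "--proxy-server=http://" (:host prx) ":" (:port prx))
   "--no-sandbox"
   "--disable-dev-shm-usage"
   (str "--user-agent=" user-agent)
   "--disable-blink-features=AutomationControlled"])

(defn chrome-options
  "Returns an option map in the shape scrape-page passes to e/chrome."
  [prx user-agent]
  {:args (chrome-args prx user-agent)
   :headless true})

(comment
  ;; REPL usage (hypothetical proxy host and User-Agent string)
  (require '[etaoin.api :as e])
  (def driver
    (e/chrome (chrome-options {:host "proxy1.example.com" :port 8080}
                              "ExampleAgent/1.0")))
  (e/quit driver))
```

Keeping the option map separate from the driver call also makes it easy to log which proxy and User-Agent a given session used.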

Explanation of the Code

  1. Rotating Proxies (Inspired by Cheap Rotating Proxies):

    • A proxy-list is defined with multiple proxy servers (replace with real ones from a provider like Scrape.do).
    • random-proxy selects a proxy per session, mimicking IP diversity.
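    • Chrome is pointed at the proxy through the --proxy-server flag passed in :args. That flag takes only a host and port, so authenticated proxies usually need IP allowlisting or a local forwarding proxy.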
  2. Avoiding Detection (Inspired by Web Scraping Detection):

    • User-Agent Randomization: random-user-agent picks from a list of realistic User-Agents, updated to recent browser versions (e.g., Chrome 120).
    • Stealth JavaScript: stealth-js redefines navigator.webdriver to hide the automation flag, similar to Playwright’s stealth plugin. Because it runs after e/go returns, it only affects checks made after page load.
    • Headless Mode Tweaks: --disable-blink-features=AutomationControlled reduces Chrome’s automation footprint.
    • Random Delays: random-delay introduces 1-3 second pauses between actions, mimicking human timing.
  3. Browser Simulation (Inspired by Python Web Scraping and Web Scraping with Playwright):

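    • Dynamic Content: Etaoin navigates to the page and scrolls to the bottom, which triggers lazily loaded, JavaScript-rendered content. Clicking a “Load More” button is not part of the example; e/click can be added for pages that need it.
    • Waiting for Elements: The example relies on random delays. For AJAX or SPA pages, e/wait-visible on a known selector is a more reliable way to ensure content has loaded before extraction.
    • Interactions: Scrolling and pausing simulate user actions, reducing bot-like patterns.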
  4. Data Extraction:

    • scrape-page returns the rendered page source from e/get-source. Titles and links can then be extracted with CSS selectors (e.g., div.content > article), similar to Playwright’s selector-based approach.
    • The extracted data can be collected into a Clojure vector of maps and saved as EDN (or JSON/CSV).
  5. Session Management:

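    • Etaoin automatically handles cookies like a browser, maintaining session state across interactions within one driver session.
    • Each scrape-page call starts a fresh driver, so cookies do not carry over when the proxy changes; reuse a single driver (and proxy) where a site requires a persistent session.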
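
The extraction step described under Data Extraction could be sketched as follows. The helpers absolute-link and article-record, the namespace stealth-scraper.extract, and the div.content > article, h2, and a selectors are all assumptions for a hypothetical page layout, not something the target site or the original example is known to provide. The Etaoin calls are kept in a comment form so the pure helpers load without a browser.

```clojure
(ns stealth-scraper.extract
  (:require [clojure.string :as str])
  (:import (java.net URI)))

(defn absolute-link
  "Resolves href against base-url; returns nil when href is nil."
  [^String base-url ^String href]
  (when href
    (str (.resolve (URI. base-url) href))))

(defn article-record
  "Shapes one extracted article as a map with a trimmed title and absolute link."
  [base-url title href]
  {:title (some-> title str/trim)
   :link  (absolute-link base-url href)})

(comment
  ;; REPL usage: `driver` is a running Etaoin driver already on the target page.
  (require '[etaoin.api :as e])
  ;; Wait for the (assumed) article container before querying it
  (e/wait-visible driver {:css "div.content > article"})
  (vec
   (for [article (e/query-all driver {:css "div.content > article"})]
     (article-record
      (e/get-url driver)
      (e/get-element-text-el driver (e/child driver article {:css "h2"}))
      (e/get-element-attr-el driver (e/child driver article {:css "a"}) "href")))))
```

Querying through the live driver rather than parsing the string from e/get-source keeps the selectors working against the rendered DOM.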

Additional Stealth Tips

  • Proxy Quality: Use residential proxies (as suggested in Cheap Rotating Proxies) for better anonymity. Scrape.do’s API or Oxylabs’ residential proxies integrate well with Etaoin via HTTP proxy settings.
  • Rate Limiting: Add (random-delay) between page requests in multi-page scraping to avoid overwhelming servers.
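  • Honeypot Avoidance: Check that a link is actually visible before clicking. (e/displayed? driver q) asks the browser whether the element is shown, which also covers elements hidden by stylesheets; (e/get-element-attr driver q "style") only sees inline styles such as display: none.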
  • CAPTCHA Handling: Etaoin doesn’t solve CAPTCHAs natively. For sites with CAPTCHAs, integrate a service like 2Captcha or Scrape.do’s API, which handles CAPTCHA bypassing (as noted in Web Scraping with Playwright).
  • Browser Fingerprinting: Inject more JavaScript to spoof properties like screen.width or WebGL if targeting sites with advanced fingerprinting (e.g., Cloudflare). Example:
    (e/js-execute driver "Object.defineProperty(screen, 'width', {get: () => 1920});")
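
For the honeypot tip, a minimal sketch of an inline-style check is shown here. The function hidden-style? and the namespace stealth-scraper.honeypot are assumptions introduced for illustration. The check only looks at the style attribute string, so it misses elements hidden through a stylesheet; the comment form combines it with the browser's own visibility check and assumes a running Etaoin driver.

```clojure
(ns stealth-scraper.honeypot)

(defn hidden-style?
  "True when an inline style string hides the element through
  display:none or visibility:hidden. Returns false for nil."
  [style]
  (boolean
   (when style
     (re-find #"(?i)(display\s*:\s*none|visibility\s*:\s*hidden)" style))))

(comment
  ;; REPL usage: keep only links the browser reports as displayed
  ;; and whose inline style does not hide them.
  (require '[etaoin.api :as e])
  (->> (e/query-all driver {:tag :a})
       (filter #(e/displayed-el? driver %))
       (remove #(hidden-style? (e/get-element-attr-el driver % "style")))))
```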

Limitations and Workarounds

  • CAPTCHAs/Anti-Bot Systems: For advanced protections (e.g., Cloudflare, DataDome), Scrape.do’s API (highlighted in Web Scraping Detection) can simplify bypassing. Alternatively, combine Etaoin with a proxy service’s anti-bot features.
  • Performance: Headless browsers are slower than HTTP clients. For static pages, consider clj-http with hickory to complement Etaoin.
  • Proxy Management: Etaoin doesn’t auto-rotate proxies mid-session. Because scrape-page starts a new driver with a randomly chosen proxy and quits it when done, calling it once per URL rotates the proxy for each page:
    (defn scrape-pages [urls]
      (mapv (fn [url]
              (let [html (scrape-page url)]
                (random-delay) ; pause between pages
                html))
            urls))
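
Random selection can pick the same proxy several times in a row. If each proxy should be used in turn instead, a round-robin assignment is one option. This is a sketch under assumptions: assign-proxies and the namespace stealth-scraper.rotation are hypothetical names, and scrape-with-proxy in the comment form stands for a variant of scrape-page that accepts the proxy as an argument, which the original example does not define.

```clojure
(ns stealth-scraper.rotation)

(defn assign-proxies
  "Pairs each URL with a proxy, cycling through proxies in order.
  Returns an empty vector when proxies is empty."
  [proxies urls]
  (mapv (fn [url prx] {:url url :proxy prx})
        urls
        (cycle proxies)))

(comment
  ;; REPL usage with a hypothetical scrape-with-proxy function
  (doseq [{:keys [url proxy]} (assign-proxies proxy-list
                                              ["https://example.com/a"
                                               "https://example.com/b"])]
    (scrape-with-proxy url proxy)))
```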

Example Output

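scrape-page as written returns the page’s raw HTML as a string. With an extraction step added for https://example.com, assuming it has articles with <h2> titles and <a> links, the extracted data might be:

[{:title "Article 1" :link "https://example.com/article1"}
 {:title "Article 2" :link "https://example.com/article2"}]

This vector can then be saved to a file such as scraped-data.edn.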
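
Writing such a vector to EDN and reading it back needs only the standard library. The names save-edn! and load-edn and the namespace stealth-scraper.storage are assumptions for this sketch; the original example does not include a save step.

```clojure
(ns stealth-scraper.storage
  (:require [clojure.edn :as edn]))

(defn save-edn!
  "Writes data to path as a printed EDN form."
  [path data]
  (spit path (pr-str data)))

(defn load-edn
  "Reads a single EDN form back from path."
  [path]
  (edn/read-string (slurp path)))

(comment
  (save-edn! "scraped-data.edn"
             [{:title "Article 1" :link "https://example.com/article1"}])
  (load-edn "scraped-data.edn"))
```

clojure.edn/read-string is used rather than clojure.core/read-string because it does not evaluate code embedded in the file.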

Running the Code

  1. Replace proxy-list with real proxy details from your provider.
  2. Pass your target site’s URL to scrape-page.
  3. Ensure ChromeDriver is in your PATH.
  4. Call it from your REPL (the example defines no -main, so lein run would need one added):
    (stealth-scraper.core/scrape-page "https://example.com")

Notes

  • Dynamic Sites: Like Playwright, Etaoin drives a real browser, so it handles JavaScript-heavy sites and is suitable for SPAs or AJAX-driven pages (per Web Scraping with Playwright).
  • Ethical Scraping: Respect robots.txt and terms of service, as emphasized in the blog posts. Use throttling to avoid server strain.
  • Specific Sites: If you have a target site, provide its URL, and I can tailor the selectors or interactions.
  • Dependencies: The code assumes a basic Clojure setup. Install Chrome and ChromeDriver if not already present.

This approach combines Etaoin’s browser automation with stealth techniques from the blog posts, ensuring you can scrape dynamically while minimizing detection risks. If you need help with a specific site or proxy setup, let me know!
