
Scraping Without Getting Blocked: Proxies, Stealth & CAPTCHAs

Learn how to bypass anti-bot systems in 2026 with proxy rotation, TLS fingerprinting, browser stealth, and CAPTCHA solving. Real code, zero hand-waving.

Pierre

18 May 2026 • 18 min read

Your Scrapers Are Getting Blocked Because You're Fighting the Wrong Battle

You added random delays. You rotated User-Agent strings. You even paid for proxies. And your scraper still gets blocked after 50 requests. The problem is not your code. The problem is that you are fighting a 2026 battle with 2020 weapons: anti-bot systems have moved on, and your toolkit has not.

Modern bot detection does not just check your IP address. It fingerprints your TLS handshake, analyzes your HTTP/2 settings, measures your mouse movements, and scores your browser's canvas rendering. A single mismatched signal across any of these layers, and your session is toast.

Thesis: Bypassing anti-bot systems in 2026 requires a layered approach across network, protocol, browser, and behavioral dimensions. This guide covers every layer with real code, real detection rates, and real strategies that work against Cloudflare, DataDome, Akamai, and PerimeterX.

Why Sites Block Scrapers (And Why It's Getting Harder)
The Five Layers of Bot Detection
TLS Fingerprinting: The Silent Killer
Proxy Types: Residential, Datacenter, Mobile, and ISP
Proxy Rotation Strategies That Actually Work
Browser Fingerprinting: Canvas, WebGL, and Beyond
Stealth Headers and HTTP Consistency
CAPTCHA Solving: AI, Humans, and Hybrid Approaches
Rate Limiting: The Art of Not Being Greedy
How to Detect That You're Being Detected
Putting It All Together: A Complete Stealth Stack
Tools and Resources
Key Takeaways
FAQ

Why Sites Block Scrapers (And Why It's Getting Harder)

Websites block scrapers for three reasons: server load, data protection, and competitive advantage. An aggressive scraper can generate 100x more traffic than a human visitor. E-commerce sites lose pricing power when competitors scrape their catalogs in real time. And content platforms lose ad revenue when their data appears elsewhere.

The anti-bot industry has exploded in response. Cloudflare protects over 20% of all websites. DataDome claims 99.9% detection accuracy with a false positive rate under 0.01%. Akamai Bot Manager, PerimeterX (now HUMAN), and Imperva Incapsula protect most of the Fortune 500. Anti-bot adoption has grown sharply in recent years.

Here is the critical shift: detection has moved from static rules to machine learning. Cloudflare now trains per-customer ML models that learn normal traffic patterns. DataDome's behavioral analysis catches scrapers that pass every fingerprint test. Spoofing a single signal is no longer enough. You need consistency across every layer.

The Five Layers of Bot Detection

Anti-bot systems operate across five distinct layers. Each layer generates signals that feed into a risk score. Fail one layer and you might get a CAPTCHA. Fail two and you get blocked. Understanding these layers is the foundation of any stealth strategy.

Layer 1: Network. IP reputation, ASN classification, geolocation consistency. Datacenter IPs are flagged instantly on protected sites.

Layer 2: Protocol. TLS fingerprint (JA3/JA4), HTTP/2 settings, cipher suite ordering. This is where most Python scrapers fail silently.

Layer 3: HTTP. Header ordering, header values, cookie handling. Sending headers in the wrong order is a dead giveaway.

Layer 4: Browser. Canvas fingerprint, WebGL renderer, installed fonts, screen resolution, timezone. Headless browsers leak signals that real browsers do not.

Layer 5: Behavioral. Mouse movements, scroll patterns, click timing, navigation flow. ML models compare your session behavior against statistical baselines of real users.
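The "fail one layer, get a CAPTCHA; fail two, get blocked" rule of thumb can be sketched as a toy decision function. This is a deliberate simplification: real systems combine weighted signals into a continuous score, and the function name and return labels here are illustrative assumptions, not any vendor's logic.

```python
LAYERS = ("network", "protocol", "http", "browser", "behavioral")

def decide(failed_layers: set[str]) -> str:
    """Toy model: map the number of failed detection layers to an action."""
    unknown = failed_layers - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    if not failed_layers:
        return "allow"
    if len(failed_layers) == 1:
        return "captcha"  # one bad signal: challenge the session
    return "block"        # two or more: deny outright

print(decide({"protocol"}))             # a mismatched TLS fingerprint alone
print(decide({"network", "protocol"}))  # datacenter IP plus mismatched TLS
```

The point of the model is that you cannot trade layers off against each other: a perfect browser fingerprint does not buy back a bad TLS handshake.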

The rest of this guide addresses each layer with specific techniques and code. For a broader introduction to scraping fundamentals, see our complete guide to web scraping in 2026.

TLS Fingerprinting: The Silent Killer

TLS fingerprinting is the most underestimated detection vector. When your client initiates an HTTPS connection, it sends a Client Hello packet containing supported cipher suites, TLS extensions, elliptic curves, and their ordering. This creates a unique fingerprint.

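JA3 (developed at Salesforce) concatenates these values and hashes them with MD5 into a single 32-character fingerprint. JA4 (from FoxIO, created by John Althouse, who also co-authored JA3) sorts cipher suites and extensions before hashing. That counters Chrome's TLS extension permutation feature, which shuffles extension order on each connection and would otherwise produce a different JA3 hash every time. In 2026, most anti-bot systems use JA4 or proprietary variants.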
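To make the hashing concrete, here is a minimal sketch of how a JA3 hash is built and why shuffled extensions break it. The Client Hello values are made up for illustration, and normalized_hash only demonstrates the sorting idea behind JA4; it is not the real JA4 format.

```python
import hashlib

def ja3_string(version, ciphers, extensions, curves, point_formats):
    """JA3 input: five comma-separated fields, values within a field joined by '-'."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)

def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """JA3 is the MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode()).hexdigest()

def normalized_hash(version, ciphers, extensions, curves, point_formats):
    """Sorting idea only (NOT the real JA4 format): order no longer matters."""
    return ja3_hash(version, ciphers, sorted(extensions), curves, point_formats)

# Made-up Client Hello values, not a real browser's handshake
hello = dict(version=771, ciphers=[4865, 4866, 4867],
             extensions=[0, 23, 65281, 10, 11], curves=[29, 23, 24],
             point_formats=[0])
# Same client, extensions shuffled as Chrome does on each connection
shuffled = dict(hello, extensions=[10, 0, 11, 65281, 23])

print(ja3_hash(**hello) == ja3_hash(**shuffled))                # hashes differ
print(normalized_hash(**hello) == normalized_hash(**shuffled))  # hashes match
```

A fingerprint that depends on extension order is useless against a client that randomizes that order, which is why detection vendors moved to order-independent schemes.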

The problem: Python's requests library produces a completely different TLS fingerprint than Chrome. Even if your User-Agent header claims a current Chrome version, the server sees your TLS handshake and knows you are lying. This mismatch is the number one reason scrapers get blocked on Cloudflare-protected sites.

Fixing TLS fingerprints with curl_cffi

curl_cffi is a Python binding for curl-impersonate that replicates real browser TLS fingerprints. It supports Chrome, Safari, and Firefox impersonation out of the box.

# pip install curl_cffi
from curl_cffi import requests

# Impersonate Chrome's TLS fingerprint
response = requests.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome"
)

fingerprint = response.json()
print(f"JA3 Hash: {fingerprint.get('ja3_hash')}")
print(f"Protocol: {fingerprint.get('tls_version')}")
print(f"HTTP Version: {fingerprint.get('http_version')}")

# curl_cffi with proxy rotation and browser impersonation
from curl_cffi import requests
import random

BROWSERS = ["chrome", "safari", "safari_ios"]

def stealth_request(url, proxy=None):
    """Send a request with randomized browser impersonation."""
    browser = random.choice(BROWSERS)
    proxies = {"https": proxy} if proxy else None

    response = requests.get(
        url,
        impersonate=browser,
        proxies=proxies,
        timeout=30
    )
    return response

This single change, switching from requests to curl_cffi, fixes TLS fingerprinting, HTTP/2 settings, and header ordering simultaneously. It is the highest-impact change you can make to a Python scraper.

Proxy Types: Residential, Datacenter, Mobile, and ISP

Not all proxies are created equal. The type of proxy you use determines your baseline trust score before any other detection layer kicks in.

Datacenter proxies

Cost: $0.50-2/GB. Speed: Fast (1-5ms latency). Detection risk: High.

Datacenter IPs come from cloud providers like AWS, GCP, and Hetzner. Their IP ranges are publicly known and flagged by every major anti-bot system. Use them only for unprotected sites or internal tools. On Cloudflare-protected sites, expect block rates above 80%.

Residential proxies

Cost: $3-15/GB. Speed: Medium (50-200ms latency). Detection risk: Low.

Residential proxies route through real ISP-assigned IPs from home internet connections. They are nearly indistinguishable from real users at the network layer. Premium residential proxy providers offer large IP pools with country, city, and ISP-level targeting, and achieve high success rates on most protected sites.

Mobile proxies

Cost: $10-40/GB. Speed: Variable (100-500ms). Detection risk: Very low.

Mobile proxies use 4G/5G connections from real carriers. Because carriers use CGNAT (Carrier-Grade NAT), hundreds of real users share the same IP. Blocking a mobile IP risks blocking legitimate customers. This makes mobile proxies the gold standard for heavily protected targets.

ISP proxies

Cost: $2-8/GB. Speed: Fast (10-30ms). Detection risk: Low-medium.

ISP proxies are datacenter-hosted but registered under residential ISP ASNs. They combine the speed of datacenter proxies with the trust score of residential IPs. A solid middle ground for price monitoring and other high-volume use cases.

Quick comparison

Type	Cost/GB	Latency	Trust Score	Best For
Datacenter	$0.50-2	1-5ms	Low	Unprotected sites, APIs
Residential	$3-15	50-200ms	High	E-commerce, search engines
Mobile	$10-40	100-500ms	Very High	Heavily protected targets
ISP	$2-8	10-30ms	Medium-High	High-volume monitoring
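Since proxies are billed per GB, it helps to turn the table into a budget before choosing a type. The sketch below is plain arithmetic over the price ranges listed above; the function name and the 85 KB average page size are illustrative assumptions.

```python
# Price ranges in USD per GB, copied from the comparison table above
PRICE_PER_GB = {
    "datacenter": (0.50, 2.00),
    "residential": (3.00, 15.00),
    "mobile": (10.00, 40.00),
    "isp": (2.00, 8.00),
}

def estimate_cost(pages: int, avg_page_kb: float, proxy_type: str) -> tuple[float, float]:
    """Return the (low, high) bandwidth cost in USD for a scraping job."""
    gigabytes = pages * avg_page_kb / 1_000_000  # decimal KB -> GB
    low, high = PRICE_PER_GB[proxy_type]
    return round(gigabytes * low, 2), round(gigabytes * high, 2)

for proxy_type in PRICE_PER_GB:
    low, high = estimate_cost(1_000_000, 85, proxy_type)
    print(f"{proxy_type:<12} ${low:,.2f} - ${high:,.2f}")
```

One million pages at 85 KB each is 85 GB, which works out to $255-1,275 on residential proxies versus $42.50-170 on datacenter IPs. Retries and blocked responses also consume bandwidth, so treat these figures as a floor.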

Proxy Rotation Strategies That Actually Work

Having good proxies is step one. Rotating them intelligently is step two. A bad rotation strategy burns through expensive IPs faster than necessary.

Per-request rotation

Every request uses a different IP. Maximum anonymity, zero session state. Best for scraping search results, product listings, and other stateless pages. Most residential proxy providers support this natively via a gateway endpoint.

# Per-request rotation with a residential proxy gateway
from curl_cffi import requests

PROXY_GATEWAY = "http://user:pass@gate.provider.com:7777"

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

for url in urls:
    # The gateway assigns a different exit IP to each request automatically
    resp = requests.get(
        url,
        impersonate="chrome",
        proxies={"https": PROXY_GATEWAY},
        timeout=30
    )
    # The exit IP does not appear in the target's response headers. To verify
    # rotation, request an IP-echo endpoint through the same gateway.
    print(f"{url}: {resp.status_code}")

Sticky sessions

The same IP persists for a set duration (5-60 minutes). Required for multi-step flows: login, add to cart, checkout. Most providers implement this via session IDs in the proxy URL.

# Sticky session: same IP (and same cookie jar) for the entire flow
import uuid
from curl_cffi import requests

session_id = uuid.uuid4().hex[:8]
# The "user-session-<id>" username format is provider-specific; check your provider's docs
STICKY_PROXY = f"http://user-session-{session_id}:pass@gate.provider.com:7777"

# A Session keeps cookies between requests, so the login carries over to the dashboard
session = requests.Session(impersonate="chrome", proxies={"https": STICKY_PROXY})

# Both requests exit through the same IP
login = session.post(
    "https://example.com/login",
    data={"user": "test", "pass": "test"},
    timeout=30,
)

dashboard = session.get("https://example.com/dashboard", timeout=30)

Geo-targeted rotation

Match your proxy location to the target site's audience. Scraping Amazon.fr from a US IP triggers suspicion. Scraping it from a French residential IP does not. Most providers support country, city, and even ASN-level targeting.
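As a rough sketch, geo-targeting can be automated by deriving the country from the target's domain. The TLD mapping is deliberately tiny, and the user-country-xx username syntax is a hypothetical placeholder: every provider uses its own format, so check your provider's documentation.

```python
from urllib.parse import urlsplit

# Small illustrative mapping from top-level domain to proxy country code
TLD_TO_COUNTRY = {"fr": "fr", "de": "de", "uk": "gb", "jp": "jp"}

def country_for(url: str, default: str = "us") -> str:
    """Guess the audience country from the URL's top-level domain."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_TO_COUNTRY.get(tld, default)

def geo_proxy(url: str, user: str = "user", password: str = "pass",
              gateway: str = "gate.provider.com:7777") -> str:
    """Build a proxy URL targeting the country of the given site.

    HYPOTHETICAL username syntax ("user-country-fr"); providers differ.
    """
    return f"http://{user}-country-{country_for(url)}:{password}@{gateway}"

print(geo_proxy("https://www.amazon.fr/dp/B000000000"))
```

A TLD lookup is only a first guess. Sites on .com with regional subdomains or path prefixes need an explicit per-target setting.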

Smart rotation with failure detection

The most effective strategy adapts in real time. Track success rates per IP and retire IPs that trigger blocks. This prevents wasting requests on burned proxies.

from collections import defaultdict
import random

class SmartProxyRotator:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.failures = defaultdict(int)
        self.max_failures = 3

    def get_proxy(self) -> str:
        """Return a proxy that hasn't been burned."""
        healthy = [
            p for p in self.proxies
            if self.failures[p] < self.max_failures
        ]
        if not healthy:
            # Reset all failures and try again
            self.failures.clear()
            healthy = self.proxies
        return random.choice(healthy)

    def report_failure(self, proxy: str):
        self.failures[proxy] += 1

    def report_success(self, proxy: str):
        self.failures[proxy] = max(0, self.failures[proxy] - 1)
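A failure counter is the simplest signal. If you want actual success rates per IP, a sliding window works better, because old results age out on their own. The following is a minimal sketch of that variant; the class name, window size, and thresholds are assumptions to tune for your targets.

```python
from collections import deque

class ProxyHealth:
    """Track a sliding-window success rate per proxy."""

    def __init__(self, window: int = 20, min_rate: float = 0.5, min_samples: int = 5):
        self.window = window            # how many recent results to keep
        self.min_rate = min_rate        # below this rate, the proxy is burned
        self.min_samples = min_samples  # do not judge a proxy on too little data
        self.history: dict[str, deque] = {}

    def record(self, proxy: str, ok: bool) -> None:
        self.history.setdefault(proxy, deque(maxlen=self.window)).append(ok)

    def success_rate(self, proxy: str) -> float:
        results = self.history.get(proxy)
        return sum(results) / len(results) if results else 1.0

    def is_healthy(self, proxy: str) -> bool:
        if len(self.history.get(proxy, ())) < self.min_samples:
            return True  # not enough evidence yet
        return self.success_rate(proxy) >= self.min_rate

health = ProxyHealth()
for _ in range(5):
    health.record("http://10.0.0.1:8000", ok=False)
print(health.is_healthy("http://10.0.0.1:8000"))  # burned after five straight failures
```

Because the deque has a fixed length, a proxy that recovers is rehabilitated automatically once enough successful requests push the failures out of the window.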

Browser Fingerprinting: Canvas, WebGL, and Beyond

If you use a headless browser (Playwright, Puppeteer, Selenium), you face a different detection surface. Anti-bot scripts run JavaScript that probes dozens of browser properties to build a unique fingerprint.

Key fingerprinting vectors

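navigator.webdriver: Chrome sets navigator.webdriver = true whenever it is driven by automation, headless or not, unless the property is patched or the AutomationControlled Blink feature is disabled. This is the most basic detection signal, and many scrapers still fail here.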

Canvas fingerprint: The browser renders text and shapes on an invisible canvas element. Differences in GPU, drivers, and font rendering create a unique hash. Headless browsers produce consistent, "too perfect" canvas outputs.

WebGL renderer: Anti-bot scripts query WEBGL_debug_renderer_info to get your GPU model. A headless Chrome instance on a server reports "SwiftShader" (software renderer) instead of a real GPU. Instant detection.

Fonts and plugins: Real browsers have a set of installed fonts and plugins that vary by OS. A Linux server running headless Chrome with zero installed fonts looks nothing like a Windows desktop with 200+ fonts.
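Before pointing a browser at a protected site, you can audit it against these vectors yourself. The sketch below assumes you have already collected the values from the page (for example via page.evaluate) into a dict. The key names are hypothetical, and the checks cover only the signals listed above.

```python
def audit_fingerprint(fp: dict) -> list[str]:
    """Return a list of obvious automation tells in a collected fingerprint."""
    issues = []
    if fp.get("webdriver"):
        issues.append("navigator.webdriver is true")
    if "swiftshader" in fp.get("webgl_renderer", "").lower():
        issues.append("WebGL reports a software renderer (SwiftShader)")
    if not fp.get("fonts"):
        issues.append("no installed fonts detected")
    ua, platform = fp.get("user_agent", ""), fp.get("platform", "")
    if "Windows" in ua and not platform.startswith("Win"):
        issues.append("User-Agent says Windows but navigator.platform does not")
    return issues

# Hypothetical values from headless Chrome on a bare Linux server
server = {
    "webdriver": True,
    "webgl_renderer": "Google SwiftShader",
    "fonts": [],
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0",
    "platform": "Linux x86_64",
}
for issue in audit_fingerprint(server):
    print("FAIL:", issue)
```

An empty list does not mean you are undetectable. It only means you are not failing the checks that cost a detector nothing to run.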

Stealth with Playwright

// npm install playwright playwright-extra puppeteer-extra-plugin-stealth
const { chromium } = require('playwright-extra');
const stealth = require('puppeteer-extra-plugin-stealth')();
chromium.use(stealth);

(async () => {
  const browser = await chromium.launch({
    headless: true,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--window-size=1920,1080',
    ]
  });

  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
  });

  const page = await context.newPage();

  // Override the WebGL vendor/renderer strings to look like a real GPU.
  // 37445 = UNMASKED_VENDOR_WEBGL, 37446 = UNMASKED_RENDERER_WEBGL.
  // Patch both WebGL1 and WebGL2 so the two contexts report the same GPU.
  await page.addInitScript(() => {
    for (const ctx of [WebGLRenderingContext, WebGL2RenderingContext]) {
      const getParameter = ctx.prototype.getParameter;
      ctx.prototype.getParameter = function (parameter) {
        if (parameter === 37445) return 'Google Inc. (NVIDIA)';
        if (parameter === 37446) return 'ANGLE (NVIDIA, NVIDIA GeForce RTX 3060)';
        return getParameter.call(this, parameter);
      };
    }
  });

  await page.goto('https://example.com');
  const content = await page.content();
  console.log(`Page length: ${content.length}`);

  await browser.close();
})();

Stealth with Python (undetected-chromedriver)

# pip install undetected-chromedriver
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--window-size=1920,1080")

driver = uc.Chrome(options=options, version_main=131)
driver.get("https://nowsecure.nl")

# Check if we pass the bot detection test
print(driver.title)
driver.quit()

For advanced scraping scenarios, AI-powered automation tools can dynamically adapt fingerprint parameters based on detection responses.

Stealth Headers and HTTP Consistency

Headers are the low-hanging fruit of bot detection. Get them wrong and nothing else matters. Get them right and you pass the first gate for free.

Common header mistakes

Missing headers: Real browsers send 15-20 headers per request. A bare requests.get() sends 4. The absence of sec-ch-ua, sec-fetch-dest, and accept-language screams automation.

Wrong ordering: Chrome sends headers in a specific order. Python's requests library sends its own defaults (User-Agent, Accept-Encoding, Accept, Connection) in an order no browser uses. Anti-bot systems check header order as a fingerprint signal.

Inconsistent values: Claiming to be Chrome on Windows via User-Agent but sending sec-ch-ua-platform: "Linux" creates a detectable mismatch.

A proper header set for Chrome impersonation

CHROME_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br, zstd",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "max-age=0",
    # Branded Chrome lists "Google Chrome" alongside "Chromium"; keep versions in sync with the User-Agent
    "sec-ch-ua": '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
}

Header randomization

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.2 Safari/605.1.15",
]

LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.9", "en-US,en;q=0.9,fr;q=0.8"]

def get_random_headers() -> dict:
    ua = random.choice(USER_AGENTS)
    lang = random.choice(LANGUAGES)

    # Match sec-ch-ua to the selected User-Agent.
    # Only Chromium-based browsers send it; Firefox and Safari do not.
    if "Chrome" in ua:
        sec_ch = '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"'
    else:
        sec_ch = None

    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": lang,
        "user-agent": ua,
        "upgrade-insecure-requests": "1",
    }

    if sec_ch:
        headers["sec-ch-ua"] = sec_ch
        headers["sec-ch-ua-mobile"] = "?0"
        headers["sec-ch-ua-platform"] = '"Windows"' if "Windows" in ua else '"macOS"'

    return headers

The key principle: every header must be internally consistent. A Windows User-Agent needs Windows sec-ch-ua-platform. A Firefox User-Agent should not include sec-ch-ua headers at all. Inconsistency is the signal, not any single header value.
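That principle is easy to enforce mechanically. Here is a minimal sketch of a header linter that catches the mismatches described above before a request goes out. The function names are illustrative, and the checks cover only User-Agent versus client-hint consistency.

```python
def expected_platform(ua: str):
    """Map a User-Agent string to the sec-ch-ua-platform value it implies."""
    if "Windows NT" in ua:
        return "Windows"
    if "Macintosh" in ua:
        return "macOS"
    if "Linux" in ua:
        return "Linux"
    return None

def lint_headers(headers: dict) -> list[str]:
    """Return a list of internal inconsistencies in a header set."""
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "")
    problems = []
    has_hints = any(k.startswith("sec-ch-ua") for k in h)
    is_chromium = "Chrome/" in ua
    if has_hints and not is_chromium:
        problems.append("client hints sent by a non-Chromium User-Agent")
    if is_chromium and not has_hints:
        problems.append("Chromium User-Agent without sec-ch-ua client hints")
    claimed = h.get("sec-ch-ua-platform", "").strip('"')
    expected = expected_platform(ua)
    if claimed and expected and claimed != expected:
        problems.append(f"platform mismatch: UA implies {expected}, header says {claimed}")
    return problems

bad = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "sec-ch-ua": '"Chromium";v="131"',
}
print(lint_headers(bad))
```

Run the linter in your test suite against every header set your scraper can generate, so that a new User-Agent added to the pool cannot silently introduce a mismatch.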

CAPTCHA Solving: AI, Humans, and Hybrid Approaches

CAPTCHAs are the last line of defense. When fingerprinting and behavioral analysis flag your session, you get a challenge. In 2026, the three major CAPTCHA systems are reCAPTCHA v2/v3, hCaptcha, and Cloudflare Turnstile.

Prevention vs. solving

The cheapest CAPTCHA is the one you never trigger. Most CAPTCHAs appear because something upstream (bad proxy, mismatched fingerprint, suspicious rate) flagged your session. Fix the root cause first. Solving should be your fallback, not your strategy.

AI-powered solvers

Services like CapSolver use machine learning to solve challenges in 1-3 seconds at roughly $0.60-1.45 per 1,000 solves, depending on the CAPTCHA type. They handle reCAPTCHA v2, hCaptcha, and Turnstile well. Limitations: reCAPTCHA v3 (invisible scoring) can detect non-human solving patterns, and complex visual puzzles like FunCaptcha have lower success rates.

Human-powered solvers

2Captcha and Anti-Captcha route challenges to human workers (2Captcha also mixes in automated solving, which is why it is listed as a hybrid elsewhere in this guide). Solve time is 15-30 seconds at $1-3 per 1,000. Slower and more expensive, but higher accuracy on complex challenges. Best for login flows and checkout pages where reliability matters more than speed.

Hybrid approach

DeathByCaptcha runs AI for fast, easy CAPTCHAs and escalates to humans for difficult ones. This balances cost, speed, and accuracy. A practical strategy: use AI solvers as default and fall back to human solving when AI confidence is low.
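The fallback strategy can be sketched as a chain of solvers tried in order of cost. The solver callables below are stand-ins, not real API clients, and "low confidence" is modeled simply as a solver returning None or raising. Both are assumptions made for illustration.

```python
from typing import Callable, Optional

# A solver takes (site_key, page_url) and returns a token, or None on failure
Solver = Callable[[str, str], Optional[str]]

def solve_with_fallback(site_key: str, page_url: str,
                        solvers: list[tuple[str, Solver]]) -> tuple[str, str]:
    """Try each solver in order; return (solver_name, token) from the first success."""
    for name, solver in solvers:
        try:
            token = solver(site_key, page_url)
        except Exception:
            token = None  # treat errors like a failed solve and move on
        if token:
            return name, token
    raise RuntimeError("all solvers failed")

# Stand-in solvers for demonstration only
def fake_ai_solver(site_key: str, page_url: str) -> Optional[str]:
    return None  # pretend the AI solver gave up on this challenge

def fake_human_solver(site_key: str, page_url: str) -> Optional[str]:
    return "token-123"

name, token = solve_with_fallback(
    "site-key", "https://example.com/login",
    [("ai", fake_ai_solver), ("human", fake_human_solver)],
)
print(name, token)
```

Ordering the list from cheapest to most expensive keeps the average cost close to the AI price while preserving the reliability of the human tier.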

Integration example

import time
import requests as http_requests  # standard requests for API calls

CAPSOLVER_API_KEY = "your_api_key"

def solve_recaptcha_v2(site_key: str, page_url: str) -> str:
    """Solve reCAPTCHA v2 using CapSolver API."""
    task_payload = {
        "clientKey": CAPSOLVER_API_KEY,
        "task": {
            "type": "ReCaptchaV2TaskProxyLess",
            "websiteURL": page_url,
            "websiteKey": site_key,
        }
    }

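    # Create task
    resp = http_requests.post(
        "https://api.capsolver.com/createTask",
        json=task_payload,
        timeout=30,
    )
    created = resp.json()
    if created.get("errorId"):
        raise RuntimeError(f"createTask failed: {created.get('errorDescription')}")
    task_id = created["taskId"]

    # Poll for result (60 attempts, 2 seconds apart)
    for _ in range(60):
        result = http_requests.post(
            "https://api.capsolver.com/getTaskResult",
            json={"clientKey": CAPSOLVER_API_KEY, "taskId": task_id},
            timeout=30,
        ).json()

        if result.get("errorId"):
            raise RuntimeError(f"getTaskResult failed: {result.get('errorDescription')}")
        if result.get("status") == "ready":
            return result["solution"]["gRecaptchaResponse"]

        time.sleep(2)

    raise TimeoutError("CAPTCHA solve timed out")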

Cost comparison

Service	Type	reCAPTCHA v2 (/1k)	Turnstile (/1k)	Avg. Solve Time
CapSolver	AI	$0.60	$1.45	1-3s
2Captcha	Hybrid	$1.00	$2.99	15-30s
Anti-Captcha	Human	$1.00	$2.00	20-40s
DeathByCaptcha	Hybrid	$1.39	$2.89	5-15s

Skip the CAPTCHA headache entirely. Trawl handles proxy rotation, fingerprint management, and CAPTCHA solving automatically. Focus on your data pipeline, not the plumbing. Try it free.

Rate Limiting: The Art of Not Being Greedy

Rate limiting is the simplest detection layer and the easiest to avoid. Yet most scrapers get this wrong by either going too fast or using predictable patterns.

Baseline recommendations

Start with one request every 2-5 seconds per IP. This mimics normal browsing behavior. With proxy rotation across 100+ IPs, your aggregate throughput can be high while each individual IP stays well under detection thresholds.
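The arithmetic behind that claim is worth making explicit. This is a minimal sketch; the function names are illustrative, and it assumes every IP in the pool is used evenly.

```python
def aggregate_rps(num_ips: int, seconds_per_request: float) -> float:
    """Total requests per second when each IP sends one request per interval."""
    return num_ips / seconds_per_request

def hours_to_finish(total_pages: int, rps: float) -> float:
    """Wall-clock hours needed to fetch total_pages at a given aggregate rate."""
    return total_pages / rps / 3600

slow = aggregate_rps(100, 5.0)  # every IP waits 5 seconds between requests
fast = aggregate_rps(100, 2.0)  # every IP waits 2 seconds between requests
print(f"{slow:.0f}-{fast:.0f} requests/second across the pool")
print(f"1M pages takes about {hours_to_finish(1_000_000, slow):.1f} hours at the slow rate")
```

With 100 IPs, the 2-5 second guideline yields 20-50 requests per second in aggregate, so the per-IP delay is rarely the bottleneck once the pool is large enough.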

Random jitter

Fixed delays (exactly 3.0 seconds between requests) are detectable. Real humans do not browse with metronome precision. Add random jitter to break the pattern.

import random
import time

def human_delay(base: float = 2.0, jitter: float = 1.5):
    """Sleep for a randomized duration that mimics human browsing."""
    delay = base + random.uniform(0, jitter)
    # Occasionally add a longer "reading" pause
    if random.random() < 0.1:
        delay += random.uniform(3.0, 8.0)
    time.sleep(delay)

Exponential backoff with jitter

When you hit a 429 (Too Many Requests) or receive a CAPTCHA, back off exponentially. But add jitter, because pure exponential backoff is itself a recognizable pattern to sophisticated systems.

import random
import time
from curl_cffi import requests

def backoff_request(url: str, max_retries: int = 5, **kwargs):
    """Request with exponential backoff and jitter."""
    resp = None
    for attempt in range(max_retries):
        resp = requests.get(url, **kwargs)

        if resp.status_code != 429:
            # Success or a non-rate-limit error: let the caller inspect it
            return resp

        # Prefer the server's Retry-After hint when it is given in seconds
        retry_after = resp.headers.get("Retry-After")
        try:
            wait = float(retry_after)
        except (TypeError, ValueError):
            # Header missing or in HTTP-date form: fall back to backoff + jitter
            wait = (2 ** attempt) + random.uniform(0, 2)

        print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)

    return resp

Respect robots.txt (selectively)

Check robots.txt for Crawl-delay directives. While not legally binding in most jurisdictions, following them reduces your chance of being flagged. Some anti-bot systems may also treat whether a client ever requested robots.txt as a signal of legitimacy.
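Python's standard library can read Crawl-delay for you. The sketch below parses an inline robots.txt so that it runs without network access; the file contents and the "my-scraper" agent name are made up for illustration. In a real scraper you would call set_url() and read() on the parser instead.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (illustrative, not from a real site)
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl-delay in seconds for our agent, or None if the site sets none
delay = parser.crawl_delay("my-scraper")

# Path-level permissions for the same agent
allowed = parser.can_fetch("my-scraper", "https://example.com/products")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/report")

print(f"crawl delay: {delay}s, products allowed: {allowed}, private allowed: {blocked}")
```

Feeding the returned delay into your per-IP base delay gives you a floor that the site itself has published.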

How to Detect That You're Being Detected

The worst kind of block is the one you do not notice. Many sites serve soft blocks: degraded content, modified prices, or different product listings instead of a clear 403. Here is how to catch them.

Monitor these signals

Response size anomalies. A product page that usually returns 85KB suddenly returns 12KB. That 12KB is likely a CAPTCHA page or an error page, not the data you wanted.

Status code tracking. Track the ratio of 200, 403, 429, and 503 responses over time. A spike in non-200 responses means your stealth is failing.

Content validation. Parse a known element from each response. If the product title or price field is missing, you are getting served a different page.

Cookie tracking. Some systems set tracking cookies before blocking. A sudden Set-Cookie with a long, opaque value on every response often indicates a fingerprinting or challenge cookie.

def validate_response(resp, expected_min_size: int = 10000) -> bool:
    """Check if a response contains real content or a block page."""
    # Size check
    if len(resp.content) < expected_min_size:
        return False

    # Status check
    if resp.status_code != 200:
        return False

    # Content check: look for common block page indicators
    block_signals = [
        "captcha", "blocked", "access denied",
        "rate limit", "please verify", "challenge-platform",
        "cf-browser-verification", "ddos-guard",
    ]
    body_lower = resp.text[:5000].lower()
    if any(signal in body_lower for signal in block_signals):
        return False

    return True
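A fixed minimum size misses the 85KB-to-12KB case whenever the threshold is set too low. A rolling baseline adapts per target. The following is a minimal sketch of that idea; the class name, window, and 50% ratio are assumptions to tune.

```python
from collections import deque
from statistics import median

class SizeMonitor:
    """Flag responses much smaller than the recent median size."""

    def __init__(self, window: int = 50, ratio: float = 0.5, min_samples: int = 5):
        self.sizes = deque(maxlen=window)  # recent sizes of normal responses
        self.ratio = ratio                 # anomaly if below ratio * median
        self.min_samples = min_samples     # warm-up before judging

    def check(self, size: int) -> bool:
        """Return True if the size looks normal, False if it looks like a block page."""
        if len(self.sizes) >= self.min_samples and size < self.ratio * median(self.sizes):
            return False  # keep anomalies out of the baseline
        self.sizes.append(size)
        return True

monitor = SizeMonitor()
for size in [85_000, 84_200, 86_100, 85_500, 84_900]:
    monitor.check(size)
print(monitor.check(12_000))  # far below the ~85KB baseline
```

Keeping anomalous sizes out of the baseline matters: otherwise a long run of block pages would drag the median down until the block pages looked normal.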

A/B testing your stealth

Compare your scraped data against a manual browser session. Open the same URL in a real Chrome browser and save the response. If the scraped version is different (different prices, missing products, fewer results), you are being fingerprinted and served modified content.
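The comparison can be partly automated. The sketch below assumes you saved the browser's HTML to disk and have the scraped HTML as a string. The price regex is a rough illustration that you would replace with selectors for your target's actual markup.

```python
import re

# Rough pattern for prices like $19.99 or €24,50 (illustrative only)
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")

def extract_prices(html: str) -> list[str]:
    return PRICE_RE.findall(html)

def compare_pages(scraped_html: str, browser_html: str) -> dict:
    """Compare what the scraper received with what a real browser saw."""
    scraped = extract_prices(scraped_html)
    browser = extract_prices(browser_html)
    return {
        "same_prices": scraped == browser,
        "missing": [p for p in browser if p not in scraped],
        "size_ratio": len(scraped_html) / max(1, len(browser_html)),
    }

browser_html = "<li>Widget $19.99</li><li>Gadget $24.50</li>"
scraped_html = "<li>Widget $21.99</li><li>Gadget $24.50</li>"
print(compare_pages(scraped_html, browser_html))
```

Here the pages are the same size and both return 200, yet one price differs. That is exactly the kind of soft block a status-code check never sees.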

Putting It All Together: A Complete Stealth Stack

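Here is a scraping setup that ties together the HTTP-client side of the stack: TLS impersonation, headers, proxies, rate limiting, and block detection. It does not cover the browser and behavioral layers, which need the headless-browser techniques described earlier. Treat it as a starting pattern to harden for production, not a finished system.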

"""
Complete stealth scraping stack.
Addresses: TLS, headers, proxies, rate limiting, and detection monitoring.
"""
from curl_cffi import requests
from collections import defaultdict
import random
import time

class StealthScraper:
    def __init__(self, proxy_gateway: str, requests_per_minute: int = 20):
        self.proxy_gateway = proxy_gateway
        self.delay = 60.0 / requests_per_minute
        self.browsers = ["chrome", "safari", "safari_ios"]
        self.stats = defaultdict(int)

    def _get_headers(self) -> dict:
        return {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "accept-language": random.choice([
                "en-US,en;q=0.9",
                "en-GB,en;q=0.9",
                "en-US,en;q=0.9,es;q=0.8",
            ]),
            "cache-control": "no-cache",
            "upgrade-insecure-requests": "1",
        }

    def _human_delay(self):
        jitter = random.uniform(-0.5, 1.5)
        pause = self.delay + jitter
        if random.random() < 0.08:
            pause += random.uniform(3.0, 10.0)  # simulate reading
        time.sleep(max(0.5, pause))

    def _validate(self, resp) -> bool:
        if resp.status_code != 200:
            return False
        if len(resp.content) < 5000:
            return False
        block_words = ["captcha", "blocked", "access denied", "challenge"]
        text = resp.text[:3000].lower()
        return not any(w in text for w in block_words)

    def scrape(self, url: str, max_retries: int = 3) -> dict:
        for attempt in range(max_retries):
            self._human_delay()

            try:
                resp = requests.get(
                    url,
                    headers=self._get_headers(),
                    impersonate=random.choice(self.browsers),
                    proxies={"https": self.proxy_gateway},
                    timeout=30,
                )

                if self._validate(resp):
                    self.stats["success"] += 1
                    return {"status": "ok", "html": resp.text, "code": resp.status_code}

                self.stats["blocked"] += 1
                wait = (2 ** attempt) + random.uniform(0, 2)
                time.sleep(wait)

            except Exception:
                # Network or proxy error: count it and back off before retrying
                self.stats["error"] += 1
                time.sleep(2 ** attempt)

        return {"status": "failed", "html": None, "code": None}

    def report(self):
        total = sum(self.stats.values())
        if total == 0:
            return "No requests made."
        success_rate = (self.stats["success"] / total) * 100
        return (
            f"Total: {total} | "
            f"Success: {self.stats['success']} ({success_rate:.1f}%) | "
            f"Blocked: {self.stats['blocked']} | "
            f"Errors: {self.stats['error']}"
        )


# Usage
scraper = StealthScraper(
    proxy_gateway="http://user:pass@gate.provider.com:7777",
    requests_per_minute=15
)

urls = ["https://example.com/product/1", "https://example.com/product/2"]
results = [scraper.scrape(url) for url in urls]
print(scraper.report())

This is the DIY approach. If you would rather skip the infrastructure work, Trawl provides all of this out of the box: proxy rotation, TLS impersonation, CAPTCHA handling, and smart retries.

Tools and Resources

HTTP clients with TLS impersonation

curl_cffi: Python binding for curl-impersonate. Supports Chrome, Safari, Firefox fingerprints.
got-scraping: Node.js HTTP client with automatic header generation and fingerprint rotation.

Stealth browsers

Nodriver: Undetected Chrome automation for Python (successor to undetected-chromedriver).
Camoufox: Stealthy Firefox fork with built-in anti-fingerprinting.
Playwright + stealth plugin: Cross-browser automation with stealth patches.

Proxy providers

Premium residential proxy networks: large IP pools, premium pricing, strong success rates on protected sites.
Mid-tier residential providers: solid geo-targeting at competitive pricing, a good fit for mid-scale operations.
Budget residential providers: smaller pools and lower cost, suited to smaller projects.
ISP and datacenter proxies: faster and cheaper, best for less aggressively protected targets.

CAPTCHA solvers

CapSolver: AI-first, fastest solve times, cheapest for reCAPTCHA.
2Captcha: Human + AI hybrid, reliable for complex challenges.
Anti-Captcha: Human-powered, consistent accuracy.

Testing tools

BrowserLeaks TLS: Check your TLS fingerprint.
Sannysoft Bot Test: Comprehensive bot detection test suite.
CreepJS: Advanced browser fingerprint analysis.
NowSecure: Quick pass/fail bot detection test.

For a broader comparison of scraping tools and APIs, see our guide to the best web scraping APIs and tools in 2026.

Key Takeaways

TLS fingerprinting is the most overlooked detection vector. Switching from requests to curl_cffi fixes TLS, HTTP/2, and header ordering in one change.
Residential proxies are table stakes for protected sites. Datacenter IPs fail on Cloudflare, DataDome, and Akamai. Budget for residential or mobile proxies.
Consistency across layers matters more than perfection in any single layer. A Windows User-Agent with macOS headers and a Linux TLS fingerprint triggers detection even if each value individually looks valid.
Prevention beats solving. Fixing upstream fingerprint and rate issues eliminates most CAPTCHAs before they appear. CAPTCHA solving is a fallback, not a strategy.
Smart rotation adapts in real time. Track success rates per proxy, retire burned IPs, and adjust delays based on server responses.
Monitor for soft blocks. Response size drops, missing data fields, and content differences between scraped and manual sessions indicate silent fingerprinting.
Managed scraping APIs exist for a reason. Building and maintaining a stealth stack costs engineering time. Tools like Trawl handle the infrastructure so you can focus on data extraction.

Fighting anti-bot systems yourself is a full-time job. Trawl runs the proxy rotation, stealth and retries as managed orchestration so you focus on the data.

Frequently Asked Questions

Why is my scraper getting blocked even with proxies?

Proxies only mask your IP. Modern anti-bot systems also analyze TLS fingerprints, HTTP/2 settings, browser fingerprints, and behavioral patterns. If your TLS handshake says "Python requests" while your User-Agent claims a current Chrome version, you get flagged instantly. You need to match fingerprints across all layers, not just the network layer.

What is the difference between residential and datacenter proxies?

Datacenter proxies come from cloud providers (AWS, GCP) and are fast but easily detected because their IP ranges are publicly known. Residential proxies use real ISP-assigned IPs from home internet connections, making them nearly indistinguishable from real users. Residential proxies cost 5-10x more but have significantly higher success rates on protected sites.

How many requests per second can I send without getting blocked?

There is no universal number. Start with one request every 2-5 seconds per IP, then adjust based on responses. Watch for 429 status codes and increasing CAPTCHA frequency. Add random jitter (plus or minus 0.5-2 seconds) to avoid predictable patterns. With proxy rotation across many IPs, your aggregate throughput can be much higher.

Is it legal to bypass CAPTCHAs for web scraping?

Legality depends on jurisdiction, the website's terms of service, and what you do with the data. In the US, the hiQ v. LinkedIn ruling supports scraping public data, but bypassing technical access controls may raise issues under the CFAA. In the EU, GDPR applies to personal data regardless of collection method. Always consult a lawyer for your specific use case.

What is TLS fingerprinting and why does it matter?

TLS fingerprinting analyzes the cipher suites, extensions, and their ordering in your TLS Client Hello packet. Each HTTP client produces a unique fingerprint (JA3/JA4 hash). Python's requests library has a completely different fingerprint than Chrome, so even with a Chrome User-Agent header, the server knows you are not a real browser. Tools like curl_cffi can impersonate browser TLS fingerprints to solve this.

Should I use headless browsers or HTTP clients for scraping?

It depends on the target. HTTP clients (requests, curl_cffi) are 10-50x faster and cheaper for sites with light protection. Headless browsers (Playwright, Puppeteer) are necessary for JavaScript-heavy sites and those with advanced fingerprinting. Start with HTTP clients and escalate to browsers only when needed.

How do I handle Cloudflare Turnstile challenges?

Cloudflare Turnstile runs invisible JavaScript challenges that score browser behavior. Your options: use a real browser with stealth plugins, use CAPTCHA solving services like CapSolver ($0.60-1.45 per 1,000 solves), or use managed scraping APIs that handle Turnstile automatically. Prevention (not triggering the challenge in the first place) is always cheaper than solving.

What is the best proxy type for scraping e-commerce sites?

Residential proxies with geo-targeting are the best choice for e-commerce. Sites like Amazon and Shopify use aggressive bot detection. Residential IPs from the same country as the target yield the highest success rates. Mobile proxies are even better but cost 2-3x more. ISP proxies offer a middle ground with good speed and decent trust scores.

How do I detect that my scraper is being detected?

Monitor for soft blocks (CAPTCHAs, redirects, altered content), HTTP 403/429 status codes, response size anomalies (blocked pages are often much smaller), increasing response times, and cookie-based tracking. Compare scraped content against what a real browser sees. A sudden drop in data quality often indicates silent fingerprinting.

Can AI help bypass anti-bot detection?

AI plays roles on both sides. Anti-bot systems use ML to analyze behavioral patterns like mouse movements and click timing. On the scraping side, AI solves CAPTCHAs in 1-3 seconds, generates human-like browsing patterns, and adapts strategies dynamically. AI-powered scraping tools are increasingly handling stealth automatically. Learn more in our guide to AI-powered web scraping.

Disclaimer: This article is published for educational and informational purposes. The techniques described here can be used both ethically and unethically. Always respect website terms of service, applicable laws (CFAA, GDPR, CCPA), and robots.txt directives. The author and Trawl are not responsible for any misuse of the information presented. When in doubt, consult a legal professional before scraping.

Written by Pierre | May 2026