On top of that, about one in five websites sits behind a large reverse proxy and security layer that centralizes bot detection. If your scraper behaves like an outlier in timing, headers, or IP reputation, it will be sorted into the wrong bucket quickly.
Weight and request count drive cost and failure
The median mobile page today weighs around 2 MB and makes roughly 70 network requests. Even if you only need HTML or a couple of JSON endpoints, these figures matter because:
- Each retry multiplies bandwidth and compute burn
- Each blocked attempt adds time to completion and backlog
- Each escalation in mitigation (CAPTCHA, JS challenges) pushes you toward heavier tooling
If you run 500,000 page fetches per day and 10 percent of them require one retry, you are moving an extra 100 GB of data per day at the 2 MB median. That is before counting the overhead of TLS handshakes, redirects, and assets your browser automation may load unintentionally. Small inefficiencies quickly compound into real cost.
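A quick sanity check of that arithmetic, as a minimal sketch; the constants mirror the figures above, and the names are my own, so swap in your fleet's actual numbers:

```python
# Back-of-the-envelope retry cost using the figures from the text.
DAILY_FETCHES = 500_000   # page fetches per day
MEDIAN_PAGE_MB = 2.0      # median mobile page weight, in MB
RETRY_RATE = 0.10         # share of fetches that need one retry

extra_transfer_gb = DAILY_FETCHES * RETRY_RATE * MEDIAN_PAGE_MB / 1000  # decimal GB
print(f"Extra transfer per day: {extra_transfer_gb:.0f} GB")  # -> 100 GB
```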
Mobile patterns matter more than most teams expect
Mobile now accounts for roughly 59 percent of global page views. Sites tune performance budgets, feature flags, and anti-abuse rules with that traffic mix in mind. That has two practical implications for scraping:
- IP reputation is shaped by mobile usage. Carrier-grade NAT places large populations behind shared addresses, so blocklists are calibrated to avoid heavy collateral damage. Requests that look like common handheld sessions often face fewer hard blocks than obvious datacenter bursts.
- Timing, viewport, and protocol quirks from handheld clients shape baselines. TLS fingerprint, HTTP/2 prioritization, and connection reuse patterns that resemble modern mobile browsers help you blend into the majority class.
A focused way to exploit these effects is an IP pool that presents as handheld traffic. A single well-placed mobile proxy endpoint can shift block rates noticeably if the rest of your stack aligns with it.
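As a rough illustration of what "the rest of your stack aligns with it" means in practice, here is a sketch that routes a requests session through a mobile gateway while keeping the client headers coherent for the whole session. The endpoint, credentials, and User-Agent string are placeholders, not real values:

```python
import requests

# Hypothetical mobile proxy endpoint; substitute your provider's gateway.
MOBILE_PROXY = "http://user:pass@mobile-gateway.example.com:8000"

session = requests.Session()
session.proxies = {"http": MOBILE_PROXY, "https": MOBILE_PROXY}
session.headers.update({
    # Keep User-Agent, Accept, and Accept-Language consistent across the session.
    "User-Agent": "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/json;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

resp = session.get("https://example.com/", timeout=15)
print(resp.status_code, len(resp.content))
```

The point is consistency: a handheld-looking exit IP paired with desktop headers or erratic timing is still an outlier.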
JavaScript realities you cannot ignore
Client-side execution is the default, not the edge case. Around three quarters of websites still ship jQuery, and a large share add one or more modern frameworks on top. That mix yields DOM mutations, deferred API calls, and dynamic navigation that server-only fetches will miss. Plan for controlled headless execution where needed, but keep the cost contained with strict blocking rules:
- Block third-party analytics and ad hosts you do not need
- Whitelist target domains and required CDNs only
- Abort requests for images, fonts, and media by type
These guardrails let you execute just enough JavaScript to reveal data while keeping per-page bytes near the HTML and JSON you actually need.
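A minimal sketch of those guardrails, assuming Playwright's sync API; the allowlist and target URL are placeholders you would replace with the domains your targets actually require:

```python
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

ALLOWED_HOSTS = {"example.com", "cdn.example.com"}  # placeholder allowlist
BLOCKED_TYPES = {"image", "font", "media"}          # heavy assets we never need

def gate(route):
    request = route.request
    host = urlparse(request.url).hostname or ""
    # Abort anything off the allowlist (third-party analytics, ads) or any heavy asset type.
    if host not in ALLOWED_HOSTS or request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", gate)                  # intercept every request before it leaves
    page.goto("https://example.com/listing")  # placeholder URL
    html = page.content()
    browser.close()
```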
An operational target that pays for itself
Set explicit rate, error, and latency budgets. A practical starting point for many programs is:
- 4xx plus 5xx under 5 percent on steady-state crawls
- Median time to first byte under 1.2 seconds from your worker to origin
- Request cadence below 1 request per second per IP and per path unless you have consent
Why these numbers matter is simple math. If you shave error rates from 10 percent to 4 percent on 500,000 daily pages at 2 MB, you avoid 60 GB of retransfers per day and the compute tied to them. That reduction usually outweighs the cost of higher quality IPs, better TLS fingerprinting, and smarter backoff.
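One way to make those budgets operational rather than aspirational is to track them per crawl unit. The sketch below mirrors the thresholds above; the counter names and pacing helper are illustrative, not a standard:

```python
import time
from collections import Counter

ERROR_BUDGET = 0.05    # 4xx + 5xx under 5 percent
MIN_INTERVAL_S = 1.0   # stay below 1 request per second per IP and per path

status_counts = Counter()
last_sent = {}         # (ip, path) -> timestamp of the last request

def record(status_code: int) -> None:
    status_counts["total"] += 1
    if status_code >= 400:
        status_counts["errors"] += 1

def error_rate() -> float:
    total = status_counts["total"]
    return status_counts["errors"] / total if total else 0.0

def pace(ip: str, path: str) -> None:
    """Sleep just long enough to keep this (ip, path) pair under the cadence cap."""
    key = (ip, path)
    wait = MIN_INTERVAL_S - (time.monotonic() - last_sent.get(key, 0.0))
    if wait > 0:
        time.sleep(wait)
    last_sent[key] = time.monotonic()

# After each crawl unit, check the budget before scaling up.
if error_rate() > ERROR_BUDGET:
    print("Error budget exceeded; slow down or rotate IPs before adding load.")
```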
Signals that reduce blocks without tricks
Focus on boring, measurable signals that align with typical traffic:
- Use HTTP/2 with realistic concurrency and prioritize HTML, then JSON, then everything else
- Keep User-Agent, Accept, Accept-Language, and viewport coherent across a session
- Hold short-lived session cookies and reuse connections within a crawl unit
- Respect cache headers and ETags to cut duplicate fetches
- Implement exponential backoff and jitter on 429 and soft 5xx responses
None of these rely on secrets. They simply avoid the outlier patterns that defense systems flag.
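For the backoff item in particular, a hedged sketch using requests; the retry cap and base delay are illustrative defaults, not tuned values:

```python
import random
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_backoff(session: requests.Session, url: str,
                       max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(max_attempts):
        resp = session.get(url, timeout=15)
        if resp.status_code not in RETRYABLE:
            return resp
        # Honor Retry-After when the server provides it; otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        time.sleep(delay + random.uniform(0, delay))  # jitter so workers do not retry in lockstep
    return resp
```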
A compact checklist for teams
- Measure bytes, requests, and status codes per page as first-class metrics (see the sketch after this list)
- Segment success rates by IP type, ASN, and geography to spot reputation effects
- Classify pages by render type and run headless browsers only where doing so measurably improves coverage
- Cap parallelism per origin based on observed 95th percentile latency
- Log full request and response metadata for a statistically valid sample, not everything
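One way to make the per-page metrics and the latency-based parallelism cap concrete; the field names and the cap formula are my own illustration, under the assumption that roughly one in-flight request per second per origin is the target:

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class PageMetrics:
    url: str
    status: int
    bytes_transferred: int
    request_count: int
    latency_s: float

@dataclass
class OriginStats:
    latencies: list = field(default_factory=list)

    def record(self, m: PageMetrics) -> None:
        self.latencies.append(m.latency_s)

    def p95_latency(self) -> float:
        # Need a reasonable sample before trusting the tail estimate.
        if len(self.latencies) < 20:
            return 0.0
        return statistics.quantiles(self.latencies, n=20)[-1]

    def parallelism_cap(self, target_rps: float = 1.0) -> int:
        # Little's-law-style cap: concurrency ~= target throughput x p95 latency, floor of 1.
        p95 = self.p95_latency()
        return max(1, int(target_rps * p95)) if p95 else 1
```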
Bottom line
When you anchor decisions in the simple counts that define the web, the path forward is obvious. Match the majority class in traffic shape, minimize bytes you do not need, and prove gains with error and bandwidth deltas. Do that, and blocks fall, costs drop, and your pipeline stops fighting the tide.