On top of that, about one in five websites sits behind a large reverse proxy and security layer that centralizes bot detection. If your scraper behaves like an outlier in timing, headers, or IP reputation, it will be sorted into the wrong bucket quickly.
Weight and request count drive cost and failure
The median mobile page today weighs around 2 MB and makes roughly 70 network requests. Even if you only need HTML or a couple of JSON endpoints, these figures matter because:
- Each retry multiplies bandwidth and compute burn
- Each blocked attempt adds time to completion and backlog
- Each escalation in mitigation (CAPTCHA, JS challenges) pushes you toward heavier tooling
If you run 500,000 page fetches per day and 10 percent of them require one retry, you are moving an extra 100 GB of data per day at the 2 MB median. That is before counting the overhead of TLS handshakes, redirects, and assets your browser automation may load unintentionally. Small inefficiencies quickly compound into real cost.
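A quick sanity check of that arithmetic, as a minimal sketch; the constants mirror the figures above, and the names are my own, so swap in your fleet's actual numbers:

```python
# Back-of-the-envelope retry cost using the figures from the text.
DAILY_FETCHES = 500_000   # page fetches per day
MEDIAN_PAGE_MB = 2.0      # median mobile page weight, in MB
RETRY_RATE = 0.10         # share of fetches that need one retry

extra_transfer_gb = DAILY_FETCHES * RETRY_RATE * MEDIAN_PAGE_MB / 1000  # decimal GB
print(f"Extra transfer per day: {extra_transfer_gb:.0f} GB")  # -> 100 GB
```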
Mobile patterns matter more than most teams expect
Mobile now accounts for roughly 59 percent of global page views. Sites tune performance budgets, feature flags, and anti-abuse rules with that traffic mix in mind. That has two practical implications for scraping:
- IP reputation is shaped by mobile usage. Carrier-grade NAT places large populations behind shared addresses, so blocklists are calibrated to avoid heavy collateral damage. Requests that look like common handheld sessions often face fewer hard blocks than obvious datacenter bursts.
- Timing, viewport, and protocol quirks from handheld clients shape baselines. TLS fingerprint, HTTP/2 prioritization, and connection reuse patterns that resemble modern mobile browsers help you blend into the majority class.
A focused way to exploit these effects is an IP pool that presents as handheld traffic. A single well-placed mobile proxy endpoint can shift block rates noticeably if the rest of your stack aligns with it.
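As a rough illustration of what "the rest of your stack aligns with it" means in practice, here is a sketch that routes a requests session through a mobile gateway while keeping the client headers coherent for the whole session. The endpoint, credentials, and User-Agent string are placeholders, not real values:

```python
import requests

# Hypothetical mobile proxy endpoint; substitute your provider's gateway.
MOBILE_PROXY = "http://user:pass@mobile-gateway.example.com:8000"

session = requests.Session()
session.proxies = {"http": MOBILE_PROXY, "https": MOBILE_PROXY}
session.headers.update({
    # Keep User-Agent, Accept, and Accept-Language consistent across the session.
    "User-Agent": "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/json;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
})

resp = session.get("https://example.com/", timeout=15)
print(resp.status_code, len(resp.content))
```

The point is consistency: a handheld-looking exit IP paired with desktop headers or erratic timing is still an outlier.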
JavaScript realities you cannot ignore
Client-side execution is the default, not the edge case. Around three quarters of websites still ship jQuery, and a large share add one or more modern frameworks on top. That mix yields DOM mutations, deferred API calls, and dynamic navigation that server-only fetches will miss. Plan for controlled headless execution where needed, but keep the cost contained with strict blocking rules:
- Block third-party analytics and ad hosts you do not need
- Whitelist target domains and required CDNs only
- Abort requests for images, fonts, and media by type
These guardrails let you execute just enough JavaScript to reveal data while keeping per-page bytes near the HTML and JSON you actually need.
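A minimal sketch of those guardrails, assuming Playwright's sync API; the allowlist and target URL are placeholders you would replace with the domains your targets actually require:

```python
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

ALLOWED_HOSTS = {"example.com", "cdn.example.com"}  # placeholder allowlist
BLOCKED_TYPES = {"image", "font", "media"}          # heavy assets we never need

def gate(route):
    request = route.request
    host = urlparse(request.url).hostname or ""
    # Abort anything off the allowlist (third-party analytics, ads) or any heavy asset type.
    if host not in ALLOWED_HOSTS or request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", gate)                  # intercept every request before it leaves
    page.goto("https://example.com/listing")  # placeholder URL
    html = page.content()
    browser.close()
```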
An operational target that pays for itself
Set explicit rate, error, and latency budgets. A practical starting point for many programs is:
- 4xx plus 5xx under 5 percent on steady-state crawls
- Median time to first byte under 1.2 seconds from your worker to origin
- Request cadence below 1 request per second per IP and per path unless you have consent
Why these numbers matter is simple math. If you shave error rates from 10 percent to 4 percent on 500,000 daily pages at 2 MB, you avoid 60 GB of retransfers per day and the compute tied to them. That reduction usually outweighs the cost of higher quality IPs, better TLS fingerprinting, and smarter backoff.
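One way to make those budgets operational rather than aspirational is to track them per crawl unit. The sketch below mirrors the thresholds above; the counter names and pacing helper are illustrative, not a standard:

```python
import time
from collections import Counter

ERROR_BUDGET = 0.05    # 4xx + 5xx under 5 percent
MIN_INTERVAL_S = 1.0   # stay below 1 request per second per IP and per path

status_counts = Counter()
last_sent = {}         # (ip, path) -> timestamp of the last request

def record(status_code: int) -> None:
    status_counts["total"] += 1
    if status_code >= 400:
        status_counts["errors"] += 1

def error_rate() -> float:
    total = status_counts["total"]
    return status_counts["errors"] / total if total else 0.0

def pace(ip: str, path: str) -> None:
    """Sleep just long enough to keep this (ip, path) pair under the cadence cap."""
    key = (ip, path)
    wait = MIN_INTERVAL_S - (time.monotonic() - last_sent.get(key, 0.0))
    if wait > 0:
        time.sleep(wait)
    last_sent[key] = time.monotonic()

# After each crawl unit, check the budget before scaling up.
if error_rate() > ERROR_BUDGET:
    print("Error budget exceeded; slow down or rotate IPs before adding load.")
```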
Signals that reduce blocks without tricks
Focus on boring, measurable signals that align with typical traffic:
- Use HTTP/2 with realistic concurrency and prioritize HTML, then JSON, then everything else
- Keep User-Agent, Accept, Accept-Language, and viewport coherent across a session
- Hold short-lived session cookies and reuse connections within a crawl unit
- Respect cache headers and ETags to cut duplicate fetches
- Implement exponential backoff and jitter on 429 and soft 5xx responses
None of these rely on secrets. They simply avoid the outlier patterns that defense systems flag.
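For the backoff item in particular, a hedged sketch using requests; the retry cap and base delay are illustrative defaults, not tuned values:

```python
import random
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_backoff(session: requests.Session, url: str,
                       max_attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(max_attempts):
        resp = session.get(url, timeout=15)
        if resp.status_code not in RETRYABLE:
            return resp
        # Honor Retry-After when the server provides it; otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        time.sleep(delay + random.uniform(0, delay))  # jitter so workers do not retry in lockstep
    return resp
```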
A compact checklist for teams
- Measure bytes, requests, and status codes per page as first-class metrics (see the sketch after this list)
- Segment success rates by IP type, ASN, and geography to spot reputation effects
- Classify pages by render type and run headless browsers only where doing so measurably improves coverage
- Cap parallelism per origin based on observed 95th percentile latency
- Log full request and response metadata for a statistically valid sample, not everything
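One way to make the per-page metrics and the latency-based parallelism cap concrete; the field names and the cap formula are my own illustration, under the assumption that roughly one in-flight request per second per origin is the target:

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class PageMetrics:
    url: str
    status: int
    bytes_transferred: int
    request_count: int
    latency_s: float

@dataclass
class OriginStats:
    latencies: list = field(default_factory=list)

    def record(self, m: PageMetrics) -> None:
        self.latencies.append(m.latency_s)

    def p95_latency(self) -> float:
        # Need a reasonable sample before trusting the tail estimate.
        if len(self.latencies) < 20:
            return 0.0
        return statistics.quantiles(self.latencies, n=20)[-1]

    def parallelism_cap(self, target_rps: float = 1.0) -> int:
        # Little's-law-style cap: concurrency ~= target throughput x p95 latency, floor of 1.
        p95 = self.p95_latency()
        return max(1, int(target_rps * p95)) if p95 else 1
```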
Bottom line
When you anchor decisions in the simple counts that define the web, the path forward is obvious. Match the majority class in traffic shape, minimize bytes you do not need, and prove gains with error and bandwidth deltas. Do that, and blocks fall, costs drop, and your pipeline stops fighting the tide.