Stopping distributed scraping with country-level CIDR, on stock nginx

June 12, 2026

On one website, I noticed that traffic from one particular country’s IPs had grown noticeably. On closer inspection, it was not the usual “one server hammering you” pattern but scraping spread across a great many IPs, each at a modest pace.

This is a record of how I characterized the traffic, chose a blocking method by “blast radius” (how far the damage spreads if something breaks), and rolled it out without taking the running servers down. I hope it gives a useful lead to anyone dealing with similar distributed access.

What was happening

Aggregating the access logs, the increased traffic had these characteristics:

Almost 100% GET (no writes such as login attempts or form posts)
Requests concentrated on article-content paths (almost no suspicious paths like vulnerability scans)
User-Agents rotating across multiple browser versions unnaturally evenly
Sources mainly consumer ISPs (not data centers)
Spread across more than 15,000 IPs, with each IP kept to about a dozen requests

So this was not an attack but a content-harvesting bot designed to evade rate limits. It appeared to be collecting articles systematically, routed through what looked like a residential or mobile proxy network.
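The kind of aggregation described above can be sketched roughly as follows. This is only an illustration, assuming nginx’s default “combined” log format; the function name `summarize` and the sample lines are made up for the example and are not the tooling actually used.

```python
import re
from collections import Counter
from statistics import median

# Assumption: nginx's default "combined" log format. Adjust for a custom log_format.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def summarize(lines):
    """Return per-source statistics that help characterize the traffic."""
    methods, per_ip, agents = Counter(), Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        methods[m["method"]] += 1
        per_ip[m["ip"]] += 1
        agents[m["ua"]] += 1
    total = sum(methods.values())
    return {
        "requests": total,
        "get_share": methods["GET"] / total if total else 0.0,
        "unique_ips": len(per_ip),
        "median_per_ip": median(per_ip.values()) if per_ip else 0,
        "max_per_ip": max(per_ip.values(), default=0),
        "distinct_user_agents": len(agents),
    }

# Hypothetical sample lines using documentation-only addresses.
sample = [
    '203.0.113.5 - - [12/Jun/2026:10:00:00 +0000] "GET /articles/1 HTTP/1.1" 200 512 "-" "UA-A"',
    '203.0.113.5 - - [12/Jun/2026:10:05:00 +0000] "GET /articles/2 HTTP/1.1" 200 640 "-" "UA-A"',
    '203.0.113.9 - - [12/Jun/2026:10:06:00 +0000] "GET /articles/3 HTTP/1.1" 200 700 "-" "UA-B"',
    '192.0.2.44 - - [12/Jun/2026:10:07:00 +0000] "POST /login HTTP/1.1" 302 0 "-" "UA-C"',
]
stats = summarize(sample)
print(stats)
```

A very high `get_share`, a large `unique_ips`, and a low `max_per_ip` together are the signature described in the list above.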

The key point is that each IP stays modest. The usual approach of looking at per-source behavior, that is, detecting and individually blocking IPs that hit hard within a short window, barely works against this opponent.

Aspect	Typical high-volume access	This distributed access
Source IPs	Concentrated in a few	Spread across 15,000+
Per-IP frequency	High, stands out	Low, hard to tell from real users
Blockable per IP?	Yes: easy to detect by threshold	No: slips under the threshold; tightening it catches real users

If fighting on per-IP behavior is a losing battle, the natural answer is to block in bulk at the level of the source country or network. Fortunately, this bot was concentrated in one country’s ISPs.

Why block by IP range (CIDR)

When you hear “block by country,” you might brace yourself: “do I have to list 15,000 IPs?” In practice there is no need. You handle it with CIDR (IP address ranges).

CIDR is written like 203.0.113.0/24, where one entry represents not a single IP but a whole network range. The ranges can be very wide: the smaller the number after the slash (the prefix length), the larger the range.

CIDR notation	IPs covered by one line (approx.)
`/24`	~256
`/16`	~65,000
`/12`	~1.05 million
`/10`	~4.19 million

So a single /10 line covers about 4.19 million IPs.

Why can one line cover so much? In binary, the leading bits up to the prefix length form the network part (the bits compared for a match), and the remaining bits form the host part (any value matches). With /24, the lower 8 bits are free, so one line covers 2^8 = 256 addresses at once.
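As a quick check of the arithmetic, here is a small sketch using Python’s standard `ipaddress` module. The helper name `covered` is only for illustration, and the addresses are from the documentation ranges.

```python
import ipaddress

def covered(prefix_len: int) -> int:
    """Number of IPv4 addresses matched by one CIDR entry of this prefix length."""
    return 2 ** (32 - prefix_len)  # host part = 32 minus the prefix length, in bits

for p in (24, 16, 12, 10):
    print(f"/{p}: {covered(p):,} addresses")
# /24: 256 addresses
# /16: 65,536 addresses
# /12: 1,048,576 addresses
# /10: 4,194,304 addresses

net = ipaddress.ip_network("203.0.113.0/24")
# Same first 24 bits, different host bits: matches.
print(ipaddress.ip_address("203.0.113.77") in net)  # True
# Differs inside the first 24 bits: does not match.
print(ipaddress.ip_address("203.0.114.1") in net)   # False
```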

The address space allocated to a given country, expressed as an aggregated CIDR list, comes to roughly 5,500 lines. Those 5,500 lines cover more than 340 million IPs. And crucially:

It includes not just the IPs you observe now, but IPs you haven’t observed yet, from the start
New IPs the proxy network starts using tomorrow are covered ahead of time, as long as they’re in the same country’s allocation

Blocking individual IPs one by one is whack-a-mole against 340 million. A country-level CIDR list lets you block by area, ahead of time, in about 5,500 lines. That was the essential advantage.
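To illustrate what “aggregated” means here, the following sketch merges adjacent and overlapping ranges with the standard library’s `ipaddress.collapse_addresses` and counts the addresses covered. The input ranges are made-up documentation addresses, not the real country list, and `aggregate` / `total_addresses` are hypothetical helper names.

```python
import ipaddress

def aggregate(cidrs):
    """Merge adjacent or overlapping IPv4 ranges into the fewest equivalent CIDR entries."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return list(ipaddress.collapse_addresses(nets))

def total_addresses(nets):
    """Total number of addresses covered by a list of non-overlapping networks."""
    return sum(n.num_addresses for n in nets)

raw = [
    "198.51.100.0/25",    # lower half of a /24
    "198.51.100.128/25",  # upper half of the same /24
    "203.0.113.0/24",
    "203.0.113.64/26",    # already contained in the /24 above
]
merged = aggregate(raw)
print([str(n) for n in merged])   # ['198.51.100.0/24', '203.0.113.0/24']
print(total_addresses(merged))    # 512

# An address never listed on its own is still covered ahead of time.
unseen = ipaddress.ip_address("198.51.100.200")
print(any(unseen in n for n in merged))  # True
```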

Choosing the approach by “blast radius”

There are several ways to do country-level CIDR blocking. The premise here was a setup with no load balancer or CDN in front: each server exposes 80/443 directly. In other words, if the front-facing nginx goes down, that box stops serving the site entirely.

Under that premise, the selection criterion is less about performance and more about “how far the collateral damage spreads if it breaks (blast radius).” I lined up three candidates.

Approach	Mechanism	Damage range if it breaks
Host firewall	Drop target IP ranges in the kernel	The whole host. A misconfiguration risks locking out your own SSH. Easy to get container forwarding wrong, too
nginx + external GeoIP database	A dedicated module determines the country	Inside the nginx container. But the official image lacks the module, so you end up owning a self-built image
nginx’s built-in geo module + CIDR list	A standard feature judges IP ranges	Inside the nginx container. Even if the data file breaks, pre-start validation catches it and nginx itself stays unharmed

Placing the three on “effect × blast radius” makes the right choice clear.

Method-selection chart: the x-axis is damage range (large to small), the y-axis is effect (low to high); nginx geo+CIDR sits in the high-effect, low-impact quadrant

I went with the third: nginx’s standard geo module + a CIDR list. The deciding points were:

The geo module is a standard feature that nginx builds in by default. You can use the official image as-is and just add a config file.
The core of the block is just a text CIDR list. Even if it breaks, the validation described below catches it before it goes live. Even in the worst case, the running nginx is unaffected.
A block of the same kind (rejecting certain User-Agents) was already running, so I could add this in the same shape, with minimal work. Rolling back is just reverting the config.

The external GeoIP database approach is conceptually sound, but it means giving up the official image and maintaining a self-built image for the front-facing nginx. In a setup where the front going down takes the whole box offline, enlarging the failure surface I own myself was not worth it. I judged that migrating later, once multi-country support or higher accuracy becomes necessary, would be soon enough. The host firewall is the most efficient option, but it also carries the highest risk of a lockout accident, so I kept it in reserve as a last resort for volumetric attacks.

The config is very plain. Load the CIDR list in the http context,

# A lookup table that just returns 0/1
geo $blocked {
    default 0;
    include /etc/nginx/blocked_cidr.conf;   # lines like "203.0.113.0/24 1;"
}

and in the server context, close the connection immediately on a match (444 is a non-standard, nginx-specific code that closes the connection without sending a response).

if ($blocked) {
    return 444;
}

The list itself is just lines of the target country’s CIDRs with 1; appended.

203.0.113.0/24 1;
198.51.100.0/22 1;
...

Because nginx stores geo ranges internally in a radix tree, the lookup cost stays essentially negligible even with thousands of entries.
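Producing the include file from a list of CIDRs is mechanical. Below is a minimal sketch, assuming the input is one CIDR per line with optional blank lines and `#` comments. The function name `render_geo_lines` is hypothetical, and where the raw list comes from is left out.

```python
import ipaddress

def render_geo_lines(cidrs):
    """Turn CIDR strings into 'NETWORK 1;' lines for the geo include file."""
    lines = []
    for raw in cidrs:
        text = raw.strip()
        if not text or text.startswith("#"):
            continue  # ignore blanks and comments in the input
        # strict=True raises ValueError on malformed input or stray host bits,
        # so a corrupted download fails here instead of reaching nginx.
        net = ipaddress.ip_network(text, strict=True)
        lines.append(f"{net} 1;")
    return "\n".join(lines) + "\n"

print(render_geo_lines(["203.0.113.0/24", "", "# comment", "198.51.100.0/22"]), end="")
# 203.0.113.0/24 1;
# 198.51.100.0/22 1;
```

Failing loudly on a malformed entry is a deliberate choice in this sketch: it is an extra check at generation time, on top of the configuration validation that runs before the list goes live.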

Running the list’s auto-update unattended and safely

CIDR allocations shift gradually, so I want to update the list periodically. At the same time, I don’t want to spend human effort on every update, and I absolutely want to avoid restarting nginx with a broken list and taking a box down. So I auto-update via a scheduled job on each server, with several safety nets.

Auto-update flow: generate → validate with the full config → on failure do nothing / on success overwrite keeping the same inode and restart → record metrics

There are three key points.

1. Validate with the “real config” before going live. I test not the new list alone but the full configuration, combined with the actual config files and certificates, by running nginx -t in a throwaway container. If the test does not pass, the production file is never touched. Even if the download or generation fails, the running nginx is unharmed.

2. Overwrite “by rewriting the same file” (preserving the inode). When you mount config into a container file by file, replacing the file via rename can leave the container still looking at the old entity (inode), a subtle trap that is easy to fall into. So instead of swapping in a new file, I overwrite the contents of the existing file (keeping the same inode) and then restart.

3. Avoid simultaneous restarts. Without a load balancer, multiple boxes restarting at once momentarily reduces the capacity available to accept traffic. I added a random wait to each server’s scheduled job to stagger the restart timing.
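The difference in point 2 between an in-place overwrite and a rename-style replacement can be seen directly in the inode number. The sketch below assumes a POSIX filesystem. The function names are hypothetical, and the file lives in a temporary directory rather than a real mount.

```python
import os
import tempfile

def overwrite_in_place(path, data: str):
    """Rewrite the contents of an existing file without replacing its inode."""
    # "r+" fails if the file is missing, so a new inode is never created silently.
    with open(path, "r+", encoding="utf-8") as f:
        f.seek(0)
        f.write(data)
        f.truncate()          # drop leftover bytes if the new content is shorter
        f.flush()
        os.fsync(f.fileno())

def replace_by_rename(path, data: str):
    """The common update pattern: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(data)
    os.replace(tmp, path)     # the path now points at a different inode

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "blocked_cidr.conf")
with open(target, "w", encoding="utf-8") as f:
    f.write("198.51.100.0/22 1;\n203.0.113.0/24 1;\n")

before = os.stat(target).st_ino
overwrite_in_place(target, "203.0.113.0/24 1;\n")
same_after_overwrite = os.stat(target).st_ino == before

replace_by_rename(target, "203.0.113.0/24 1;\n")
same_after_rename = os.stat(target).st_ino == before

print(same_after_overwrite, same_after_rename)  # True False
```

The trade-off is that an in-place overwrite is not atomic, unlike a rename. That is acceptable in this flow because the candidate has already been validated and nginx only reads the file when it is restarted afterwards.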

Update success / failure is emitted as metrics, so failures are noticeable.
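Points 1 and 3, together with the status reporting, can be sketched as a small gate function. Everything here is an assumption for illustration: `validate_cmd` stands in for whatever command runs nginx -t against the full configuration, `apply_update` and `restart` are caller-supplied callbacks, and the five-minute jitter ceiling is arbitrary. The demo uses the Python interpreter itself as a stand-in validator.

```python
import random
import subprocess
import sys
import time

def passes_validation(validate_cmd, timeout_s=120):
    """Run the validation command; only exit code 0 counts as a pass."""
    try:
        result = subprocess.run(validate_cmd, capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # "could not run the check" is treated as a failed check
    return result.returncode == 0

def run_update(validate_cmd, apply_update, restart, max_jitter_s=300.0, sleep=time.sleep):
    """Wait a random delay, validate, and only on success apply the list and restart."""
    sleep(random.uniform(0, max_jitter_s))   # stagger the job across servers
    if not passes_validation(validate_cmd):
        return "validation_failed"           # production file stays untouched
    apply_update()
    restart()
    return "updated"

# Stand-in validators: a command that exits 0 and one that exits 1.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
bad_cmd = [sys.executable, "-c", "raise SystemExit(1)"]

events = []
no_wait = lambda seconds: None  # skip the real delay in this demo
status_ok = run_update(ok_cmd, lambda: events.append("apply"),
                       lambda: events.append("restart"), sleep=no_wait)
status_bad = run_update(bad_cmd, lambda: events.append("apply"),
                        lambda: events.append("restart"), sleep=no_wait)
print(status_ok, status_bad, events)  # updated validation_failed ['apply', 'restart']
```

The returned status string is the natural thing to emit as a metric, so a failed run shows up as a signal rather than passing silently.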

Rolling out to a downtime-free fleet, with a canary

Because there’s no load balancer, “try it on just one production box first” can’t be done with a normal deploy (the same config file goes to every box). So I changed how the config is shipped.

First, ship to all boxes with the blocking logic commented out. The lookup table, the list, and the auto-update mechanism are in place, but blocking isn’t active (a safe, inert state).
Enable blocking on just one box, by hand, and restart it (the canary).
On that one box’s real traffic, measure before and after.
If there are no problems, remove the comment in the config and roll out to all boxes.

The canary’s measurements were clear-cut.

Aspect	Before	After
Target-country IPs	Thousands passing through	Almost all blocked
Leakage	—	Very slight (only the HTTP→HTTPS redirect that runs first; no content is served)
False blocks of non-targets (legitimate)	—	Zero
Error rate	Normal	No change

Over a full day of observation, there were no restart loops or similar problems, and the box stayed stable. Being able to confirm “it’s working” and “it’s not catching legitimate users” with numbers rather than guesses, before widening to all boxes, was reassuring.
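One way to get numbers like those in the table is to classify each logged request by whether the client IP falls in the blocked ranges and whether it was dropped with 444. This is a rough sketch, not the measurement actually used: `canary_report` is a hypothetical name, and the records are invented (client IP, status) pairs.

```python
import ipaddress

def canary_report(records, blocked_cidrs):
    """Count blocked, leaked, falsely blocked, and normally served requests."""
    nets = [ipaddress.ip_network(c) for c in blocked_cidrs]
    report = {"target_blocked": 0, "target_leaked": 0, "false_blocks": 0, "other_served": 0}
    for ip_text, status in records:
        ip = ipaddress.ip_address(ip_text)
        in_target = any(ip in net for net in nets)  # linear scan; fine for a one-off report
        dropped = status == 444
        if in_target:
            key = "target_blocked" if dropped else "target_leaked"
        else:
            key = "false_blocks" if dropped else "other_served"
        report[key] += 1
    return report

records = [
    ("203.0.113.9", 444),   # target range, dropped
    ("203.0.113.10", 301),  # target range, answered by a redirect before the block
    ("192.0.2.1", 200),     # outside the target, served normally
    ("192.0.2.2", 444),     # outside the target but dropped: a false block
]
report = canary_report(records, ["203.0.113.0/24"])
print(report)
# {'target_blocked': 1, 'target_leaked': 1, 'false_blocks': 1, 'other_served': 1}
```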

Trade-offs and limits

To be honest, country-level blocking requires accepting some compromise.

Legitimate users in the target country are blocked too. A business judgment based on your user base is needed. Here, legitimate access from the target country was minor, so I accepted it.
Search-engine crawlers from the target country are rejected too. Balance that against the value of indexing; if needed, you can exempt specific User-Agents.
And fundamentally, it’s powerless if the proxy network switches to another country’s IPs. Country-level blocking is a symptomatic treatment that “works because they’re concentrated there now,” so you need to keep watching for signs of migration — whether inflow from another country is rising.

Even so, being able to clearly lower the immediate load — with minimal work, on the official image, in a form that doesn’t ripple to the core even if it breaks — was a cost-effective move.

Summary

Low-rate access spread across many IPs is hard to block by per-source behavior detection. Blocking by country / network unit is the natural move.
Handled as CIDR (IP ranges), country-level blocking can cover a vast address space ahead of time in just a few thousand lines.
On a running, LB-less setup, choose the implementation by blast radius. Here, nginx’s official, standard geo module was the minimal setup that doesn’t ripple to the core even if it breaks.
Make auto-update safe even unattended with full validation before going live and an inode-preserving overwrite. Roll out in stages, inert distribution → one-box canary → all boxes, watching the numbers.

It’s not flashy, but I’d be glad if it helps as an example of building “in a way that doesn’t break,” one piece at a time, for anyone who meets the same situation somewhere.