We didn't set out to build a bot detection system. It started because we needed to understand what was happening across our backend routes.
We had Django applications in production with growing traffic, and alongside real users came requests that didn't make sense: scans on paths that don't exist, sessions switching user-agents mid-flow, request spikes from the same IP at intervals that were too perfect. None of it caused errors. It all passed silently.
What we were missing was a way to look at each session's behavior across routes and say: "this doesn't look like a real user."
Why rate limiting wasn't enough
Rate limiting was the first thing we tried. It works for the obvious case — the IP sending 200 requests per minute. But the cases that worried us were different.
A bot making 25 requests per minute at regular intervals, with no session cookie, hitting authenticated endpoints with a Chrome 110 user-agent (released in early 2023) won't trip any reasonable rate limiter. It looks like normal traffic if you only look at volume.
The problem is that rate limiting makes decisions based on a number. We needed to make decisions based on behavior — what the session is doing, how it's doing it, and whether that pattern makes sense for a human.
Risk analysis based on route behavior
The core idea is simple: every request that hits the backend carries signals about who (or what) is on the other side. No single signal proves anything, but the combination tells a story.
A session that hits /api/auth/login 40 times in 3 minutes, at exactly 1.2-second intervals, without ever having accessed the login page on the frontend, is behaving in a way no real user behaves.
Instead of creating rules like "if X then block," we built a risk scoring system: each signal contributes points, and blocking happens when the accumulated total exceeds a threshold. This avoids false positives from isolated signals while still catching bots that would be invisible to any individual rule.
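The additive model can be sketched in a few lines. The function name and the 100-point cap are illustrative, not the package's actual API; the thresholds match the defaults described later in the post.

```python
# Minimal sketch of additive risk scoring (hypothetical names,
# not the package's actual API).
BLOCK_THRESHOLD = 80
CHALLENGE_THRESHOLD = 50

def assess(signals):
    """signals: list of (reason, points) pairs. Returns (score, decision)."""
    score = min(sum(points for _, points in signals), 100)  # cap at 100
    if score >= BLOCK_THRESHOLD:
        return score, "block"
    if score >= CHALLENGE_THRESHOLD:
        return score, "challenge"
    return score, "allow"

# One weak signal alone passes through:
print(assess([("auth_without_session", 25)]))   # (25, 'allow')
# The combination of three weak signals crosses the block threshold:
print(assess([("auth_without_session", 25),
              ("robotic_timing", 30),
              ("ua_rotation", 35)]))            # (90, 'block')
```

The key property is that no single entry in the list can block on its own, while the sum of several can.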
How it works in practice
One constraint was non-negotiable: this analysis couldn't add noticeable latency. Backend routes need to respond fast — risk analysis has to be nearly invisible in the response time.
The middleware does everything within the request cycle, in two steps:

- Before the view: it queries Redis to check if the IP is already blocked. If the key `rg:blocked:{ip}` exists, it returns 429 immediately — no analyzers run, the view is never touched. It's a single Redis lookup that takes less than 1ms.
- After the view: the request went through, the view executed, and now the middleware records what happened in the sliding history (path, method, status code, duration, user-agent). With the updated history, the analyzers run and calculate the score. If it exceeds the block threshold (default: 80), the IP is marked in Redis with a 1-hour TTL and subsequent requests are cut off before reaching the view.
There's an important detail here: analyzers evaluate the history of previous requests, not the current one. This means the first request from an unknown IP always goes through. Blocking happens when the pattern accumulates — which in practice takes just a few seconds for an aggressive bot.
Beyond blocking, there's an intermediate challenge threshold (default: 50). When the score falls between 50 and 80, the request isn't blocked, but the middleware sets `request.risk.challenged = True`. The view can use this to decide what to do — require a CAPTCHA, limit functionality, or just log it.
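The two-step flow can be sketched without Django, using a plain dict as a stand-in for Redis (function names here are hypothetical; the real middleware operates on Django request/response objects):

```python
import time

BLOCK_THRESHOLD = 80
BLOCK_TTL = 3600  # the 1-hour TTL described above

cache = {}  # in-memory stand-in for Redis: key -> expiry timestamp

def is_blocked(ip):
    """Step 1, before the view: a single key lookup, nothing else runs."""
    expiry = cache.get(f"rg:blocked:{ip}")
    return expiry is not None and expiry > time.time()

def record_and_score(ip, run_analyzers):
    """Step 2, after the view: score the history and maybe mark the IP."""
    score = run_analyzers(ip)
    if score >= BLOCK_THRESHOLD:
        cache[f"rg:blocked:{ip}"] = time.time() + BLOCK_TTL
    return score

def handle_request(ip, view, run_analyzers):
    if is_blocked(ip):
        return 429          # short-circuit: the view is never touched
    status = view()         # the view executes normally
    record_and_score(ip, run_analyzers)
    return status
```

Note the ordering consequence described in the text: the request that pushes the score over the threshold still gets its response; only subsequent requests are cut off.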
Structured logs go to Grafana via Loki, where we monitor scores, blocks, and signals in real time.
The analyzers
Each analyzer looks at a different aspect of route behavior. There are six in total — five that run in the middleware on every HTTP request and one that runs on login.
RateAnalyzer — calculates requests per minute in the 5-minute sliding window. Three tiers: above 30 RPM adds +15, above 60 adds +30, above 120 adds +50. If all requests land on the same millisecond (zero span), it's +50 straight away.
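A sketch of that tiering, treating the tiers as mutually exclusive (an assumption of this sketch) over the 5-minute window described above:

```python
def rate_score(timestamps, now, window=300):
    """Requests per minute over a sliding window (sketch; tiers from the text,
    highest-tier-wins handling is an assumption)."""
    recent = [t for t in timestamps if now - t <= window]
    if len(recent) >= 2 and max(recent) == min(recent):
        return 50  # zero span: every request landed on the same instant
    rpm = len(recent) / (window / 60.0)
    if rpm > 120:
        return 50
    if rpm > 60:
        return 30
    if rpm > 30:
        return 15
    return 0
```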
UserAgentAnalyzer — missing user-agent gives +30. Known automation tools (curl, scrapy, python-requests, wget, go-http-client) give +40. Chrome version below 120 gives +20. It extracts the version via regex, so it doesn't depend on an external list.
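Roughly, assuming the checks short-circuit in this order (an assumption of the sketch):

```python
import re

AUTOMATION_TOOLS = ("curl", "scrapy", "python-requests", "wget", "go-http-client")
MIN_CHROME_VERSION = 120  # configurable, per the text

def ua_score(user_agent):
    """Sketch of user-agent scoring with the point values from the text."""
    if not user_agent:
        return 30                      # missing user-agent
    ua = user_agent.lower()
    if any(tool in ua for tool in AUTOMATION_TOOLS):
        return 40                      # known automation tool
    match = re.search(r"chrome/(\d+)", ua)
    if match and int(match.group(1)) < MIN_CHROME_VERSION:
        return 20                      # outdated Chrome major version
    return 0
```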
SessionAnalyzer — tracks sessions per IP using Sets in Redis. More than 10 distinct sessions from the same IP in 5 minutes gives +30. More than 3 different user-agents within the same session gives +35. Access to routes with /api/ or /admin/ prefix without an authenticated session gives +25.
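With plain Python sets standing in for the Redis Sets (module-level dicts here; the real implementation keeps these in Redis with expiry):

```python
# In-memory stand-ins for the Redis Sets described above.
sessions_per_ip = {}   # ip -> set of session ids
uas_per_session = {}   # session id -> set of user-agents

def session_score(ip, session_id, user_agent, path, authenticated):
    """Sketch of the three session signals with the point values from the text."""
    score = 0
    sessions_per_ip.setdefault(ip, set()).add(session_id)
    if len(sessions_per_ip[ip]) > 10:
        score += 30    # excessive distinct sessions from one IP
    uas_per_session.setdefault(session_id, set()).add(user_agent)
    if len(uas_per_session[session_id]) > 3:
        score += 35    # user-agent rotation within one session
    if (path.startswith("/api/") or path.startswith("/admin/")) and not authenticated:
        score += 25    # protected prefix without an authenticated session
    return score
```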
PatternAnalyzer — maintains a list of scan paths: /.env, /wp-admin, /phpmyadmin, /.git, /.aws, /config.php. Any hit gives +60 — it's the highest-scoring single signal, because accessing /.env on a Django application has no innocent explanation. It also monitors error rate (over 50% 4xx status gives +30) and path diversity (over 40 distinct paths in the window gives +25).
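The three checks fit in a few lines; `history` here is assumed to be the window's (path, status) pairs:

```python
SCAN_PATHS = {"/.env", "/wp-admin", "/phpmyadmin", "/.git", "/.aws", "/config.php"}

def pattern_score(history):
    """history: (path, status_code) pairs from the sliding window (sketch)."""
    score = 0
    if any(path in SCAN_PATHS for path, _ in history):
        score += 60   # scan path: the single highest signal
    client_errors = sum(1 for _, status in history if 400 <= status < 500)
    if history and client_errors / len(history) > 0.5:
        score += 30   # over 50% 4xx responses
    if len({path for path, _ in history}) > 40:
        score += 25   # excessive path diversity
    return score
```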
TimingAnalyzer — calculates the coefficient of variation of intervals between requests. Requires at least 5 requests to activate. If the CV is below 0.05 (nearly identical intervals), it adds +30. It's the most subtle analyzer — it catches bots that control volume but don't randomize timing.
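The coefficient of variation is just the standard deviation of the inter-request intervals divided by their mean. A sketch:

```python
import statistics

def timing_score(timestamps):
    """Coefficient of variation (stdev/mean) of intervals between requests."""
    if len(timestamps) < 5:
        return 0                       # needs at least 5 requests to activate
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(intervals)
    if mean <= 0:
        return 0                       # zero span is the rate analyzer's job
    cv = statistics.stdev(intervals) / mean
    return 30 if cv < 0.05 else 0      # nearly identical intervals
```

A human clicking around produces a CV well above 0.05; a loop with `sleep(1.2)` produces one near zero.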
EmailAnalyzer — this one doesn't run in the middleware. It's triggered by Django's user_logged_in and user_login_failed signals. It analyzes the email used at login: disposable domain (mailinator, guerrillamail, tempmail) gives +40, long hexadecimal suffix gives +30, high digit ratio gives +25, Shannon entropy above 3.5 gives +30. Useful for detecting automated signups.
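The entropy check is standard Shannon entropy over the local part. In this sketch, the hex-suffix length (8) and the digit-ratio cutoff (0.4) are assumptions; the point values and the 3.5-bit entropy threshold come from the text:

```python
import math
import re

DISPOSABLE = ("mailinator", "guerrillamail", "tempmail")

def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def email_score(email):
    """Sketch of login-email scoring."""
    local, _, domain = email.partition("@")
    score = 0
    if any(d in domain for d in DISPOSABLE):
        score += 40                               # disposable domain
    if re.search(r"[0-9a-f]{8,}$", local):
        score += 30                               # long hexadecimal suffix
    digits = sum(c.isdigit() for c in local)
    if local and digits / len(local) > 0.4:
        score += 25                               # high digit ratio
    if shannon_entropy(local) > 3.5:
        score += 30                               # generated-looking local part
    return score
```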
Signup flow: friction proportional to risk
One of the uses we like best is in the signup flow. The idea is simple: low-risk users go straight through, high-risk users need to prove they're real before using the account.
In practice, when someone submits the signup form, we run the EmailAnalyzer against the provided email. If the score is low — corporate domain, normal entropy, no pattern of automatic generation — the account is created active and the user goes straight to onboarding with no extra friction.
If the score is high — disposable domain, local part that looks like a generated hash, high digit ratio — the account is created but stays inactive. The user receives a verification email and can only access the account after confirming. Depending on the context, we add a CAPTCHA to the signup form itself when the middleware has already detected that the session has a challenge score (above 50) before the form is even submitted.
The point is that the decision isn't binary. It's not "everyone confirms email" (which adds unnecessary friction for legitimate users) or "nobody confirms" (which leaves the door open for mass signups). It's friction proportional to risk.
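The branching above can be made concrete with a hypothetical helper (the cutoff mirrors the challenge threshold, and the returned flags are assumptions of this sketch):

```python
def signup_friction(email_score, session_challenged):
    """Map accumulated risk to signup friction (hypothetical helper)."""
    if session_challenged:
        # Middleware flagged the session before the form was even submitted.
        return {"active": False, "captcha": True, "email_verification": True}
    if email_score >= 50:
        # Disposable domain / generated-looking local part: verify first.
        return {"active": False, "captcha": False, "email_verification": True}
    # Low risk: account created active, straight to onboarding.
    return {"active": True, "captcha": False, "email_verification": False}
```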
In practice this killed most of the automated signups we had. Mass signup bots typically use disposable domains, high-entropy emails, and create dozens of accounts from the same IP in minutes. The combo of EmailAnalyzer with SessionAnalyzer catches this in the first few attempts — and the `audit_emails` management command lets you retroactively scan signups that got through before the system went live.
A real case: score accumulating to block
To make this more concrete, here's a case we caught in production. The values are simplified, but the pattern is real.
On a Tuesday morning, an IP starts hitting the authentication API of one of our applications. The first requests don't stand out — low volume, recent Chrome user-agent.
But the pattern reveals itself:
| Moment | What happened | Analyzer | Accumulated score |
|---|---|---|---|
| #1–#5 | POST /api/auth/login every 1.3s, no authenticated session | SessionAnalyzer (auth without session: +25) + TimingAnalyzer (CV < 0.05: +30) | 55 |
| #6 | Same IP, user-agent switches from Chrome to Firefox | SessionAnalyzer (UA rotation: +35; the total is capped at 100) | 90 |
| — | Score 90 exceeds the block threshold (80). IP marked in Redis with a 1-hour TTL. | — | blocked |
| #7+ | All subsequent requests | Middleware returns 429 straight from Redis | — |
From first request to block: 8 seconds and 6 requests. From #7 onward, the middleware hits Redis, finds `rg:blocked:{ip}`, and returns 429 Too Many Requests without running any analyzer and without the request reaching the view.
What stands out is that between #5 and #6, the score was already at 55 — above the challenge threshold (50) but below the block threshold (80). If the view had used `request.risk.challenged`, it could have required a CAPTCHA at that point. When the user-agent changed at #6, the SessionAnalyzer added +35 for UA rotation and the score hit 90, crossing the block threshold.
No single signal would have caused a block. Five login attempts without a session? That's +25 — far from blocking. Regular intervals? +30 — still not enough. But the combination of all three, in 8 seconds, leaves no doubt.
False positives and the non-obvious cases
The scoring model works well for bots, but it brings an inevitable concern: what about blocking someone who shouldn't be blocked?
We mapped the scenarios that generate the most false positives and how we handle each:
Corporate VPN and NAT — 50 real users behind the same IP. The SessionAnalyzer sees dozens of new sessions from the same IP and adds +30 (`excessive_sessions`). In projects where this is common, we raise the block threshold or write a custom analyzer that weighs by authenticated session — if sessions have valid tokens, the IP's score increases more slowly. The package doesn't solve this on its own because it depends on how each application manages authentication.
Health checks and monitoring — a client running requests against `/health/` every 30 seconds with perfect intervals. The TimingAnalyzer scored it high. The solution comes built into the package: `IGNORE_PATHS` includes `/health/`, `/metrics/`, and `/__debug__/` by default. Requests to these paths don't go through analyzers and don't record history. If the health check hits a different route, just add it to the list.
Users with old browsers — Chrome below version 120 gives +20 in UserAgentAnalyzer. Doesn't block on its own (threshold is 80), but it adds up. In government agencies where IT controls the installed version, this is common. The `min_chrome_version` is configurable — you can adjust it to match each project's user profile.
Legitimate bots — Googlebot, Bingbot, SEO tools. The UserAgentAnalyzer flags them as bots (+40). The package doesn't include a built-in IP whitelist, so we handle this in a layer above — via a signal handler that listens to `risk_assessed` and zeros the score if the IP matches Google's ranges via reverse DNS. Anyone spoofing Googlebot doesn't pass.
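That layer can be as small as a forward-confirmed reverse DNS check. This is a sketch of the verification itself, not the package's signal handler; the accepted hostname suffixes follow Google's published guidance:

```python
import socket

def is_real_googlebot(ip):
    """Forward-confirmed reverse DNS: the IP's PTR record must be a Google
    hostname, and that hostname must resolve back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the claimed hostname must resolve back to the IP.
        resolved = {info[4][0] for info in socket.getaddrinfo(host, None)}
        return ip in resolved
    except OSError:
        return False                                   # lookup failed: reject
```

A bot that merely copies Googlebot's user-agent string fails the reverse lookup, which is why spoofing doesn't pass.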
The calibration in the early days is the most important part. We raise the threshold to an unreachable value (like 200) and turn on `LOG_ALL_SCORES` to log every score, including zero. We run it like that for one or two weeks just observing the real traffic profile. When we actually enable blocking, we start with the default of 80 and adjust based on false positive volume.
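An observe-only configuration could look roughly like this. The setting names and the `RISK_GUARDIAN` namespace are assumptions based on the options mentioned in this post; check the package README for the real ones:

```python
# settings.py — hypothetical observe-only calibration phase.
RISK_GUARDIAN = {
    "BLOCK_THRESHOLD": 200,       # unreachable on purpose: nothing gets blocked
    "CHALLENGE_THRESHOLD": 200,   # same for challenges
    "LOG_ALL_SCORES": True,       # log every score, including zero
    "IGNORE_PATHS": ["/health/", "/metrics/", "/__debug__/"],
}
```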
The open source package
When we realized this analysis structure was generic enough to work in any Django project, we extracted it into a separate package.
django-risk-guardian is on GitHub under the MIT license. Add the middleware, configure the cache backend for Redis, and all six analyzers run with functional defaults.
Beyond the middleware, the package includes a few pieces that make integration easier:
- Signals — `ip_blocked`, `risk_assessed`, and `challenge_required`. You can plug in any custom logic (Slack notification, dynamic whitelist, external logging) without touching the middleware.
- Decorators — `@require_risk_below(threshold=50)` and `@require_no_challenge` to protect specific views. They return 429 if the score doesn't meet the criteria.
- Management command — `python manage.py audit_emails` scans the user base and runs the EmailAnalyzer retroactively. Useful for identifying automated signups that already got through.
- `request.risk` — every request gets a `RiskAssessment` object with `score`, `reasons`, `blocked`, `challenged`, and access to the `history`. The view can make granular decisions based on this.
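To make the decorator's contract concrete, here is a framework-free sketch of what `@require_risk_below` does conceptually. The real decorator returns a Django `HttpResponse`; here 429 is a plain int, and the `RiskAssessment` class is a simplified stand-in:

```python
from functools import wraps

class RiskAssessment:
    """Simplified stand-in for the object the middleware attaches as request.risk."""
    def __init__(self, score, reasons=(), blocked=False, challenged=False):
        self.score = score
        self.reasons = list(reasons)
        self.blocked = blocked
        self.challenged = challenged

def require_risk_below(threshold):
    """Sketch: reject the request before the view runs if the score is too high."""
    def decorator(view):
        @wraps(view)
        def wrapper(request, *args, **kwargs):
            if request.risk.score >= threshold:
                return 429   # stand-in for HttpResponse(status=429)
            return view(request, *args, **kwargs)
        return wrapper
    return decorator
```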
What's not in the package are the specific thresholds we use in each project and the custom analyzers we write for specific business domains. Each application has a different usage profile, and the limits need to reflect that. It's the same model as fail2ban: public engine, private operational configuration.
👉 github.com/mupisystems/django-risk-guardian
What changes day to day
Before, we knew there was suspicious traffic but had no way to quantify it or react in real time. Now, in Grafana, we can see which IPs are accumulating score, by which signals, and at what point they were blocked.
Nginx keeps doing its job at the edge. Django now knows not just what each request asked for, but how much that access pattern resembles legitimate behavior. And when it doesn't, it blocks before reaching the view.
It's not a solution that fixes everything. But it closes a gap that existed between generic rate limiting and "hoping nobody abuses it."
Want to implement behavioral risk analysis on your Django backend? Get in touch.