Building Reliability: Inside My DNS Resolver Health Checking System

Table of Contents

One of the most frequent questions I get (okay, I don’t actually get questions, but if I did, this would be one): “How do you maintain reliability when querying millions of domains across thousands of resolvers?”

The answer: obsessive health checking and automatic failover.

Let me show you how it works.

Why Health Checking Matters

Here’s the uncomfortable truth about DNS resolvers: they fail. Not might fail, not occasionally fail—they will fail.

Resolvers go offline. They get overloaded. They have network issues. They get misconfigured. They hit rate limits. They respond slowly. They start returning SERVFAIL for everything.

If you’re relying on a handful of resolvers, this is manageable—you notice the failure and manually switch. But when you’re managing thousands of resolvers distributed globally, manual management is impossible. You need automation.

That’s where health checking comes in.

The Health Check Architecture

My health check system runs continuously in the background, monitoring every resolver in the pool. Here’s how it works:

Step 1: Generate Unique Health Queries

Every health check cycle (typically every 3-5 minutes), I generate a unique DNS query designed specifically for validation.

The query looks like this:

a7f3e9c2b1d4f6e8.health-check.my-domain.com.

That randomized prefix serves several purposes:

Prevents caching: Every check query is unique, so resolvers can’t serve cached answers
Validates functionality: The resolver actually has to do recursive resolution
Provides control: I know exactly what the correct answer should be

The domain suffix is under my control, so I can configure the authoritative nameserver to respond consistently. This gives me a baseline for what “healthy” looks like.

Step 2: Test All Resolvers Concurrently

With thousands of resolvers to check, serial testing would take forever. Instead, I use a worker pool to test many resolvers simultaneously.

I typically run 10-20 concurrent workers. Each worker:

Picks a resolver from the pool
Sends the health check query
Measures response time
Evaluates the response
Records the result
Moves to the next resolver

This parallelization means I can health-check thousands of resolvers in seconds, not hours.

Step 3: Evaluate Response Quality

When a resolver responds, the health check function evaluates several factors:

Response Code (RCODE)

Did the resolver return NOERROR? If it’s returning SERVFAIL, REFUSED, or NXDOMAIN for a query that should work, that’s a problem.

Response Time

How long did the query take? If a resolver that normally responds in 50ms suddenly takes 5 seconds, something’s wrong. Slow resolvers get noted.

Answer Content

Does the response contain the expected answer? This catches resolvers that respond but give wrong answers—maybe they’re intercepting queries or have stale cache.

DNSSEC Validation (Optional)

If I’m checking DNSSEC-validating resolvers, did they properly validate the signature? This ensures the resolver is actually performing security validation.

Step 4: Track Metrics and Trends

Every health check result gets recorded into per-resolver metrics:

Total queries sent
Number of failures
Error rate (failures / total)
Last successful query timestamp
Last error timestamp
Response time averages

These metrics build a historical picture of resolver reliability. This is crucial for distinguishing between “one bad query” and “this resolver is having ongoing problems.”

Automatic Demotion and Suspension

Here’s where it gets interesting. The system doesn’t just collect metrics—it acts on them automatically.

The Two-Threshold System

I use two error rate thresholds that trigger different responses:

Demotion Threshold (5% error rate)

When a resolver starts showing errors but isn’t completely broken, it gets demoted. Its weight in the selection pool is reduced, meaning it serves fewer queries.

This is me saying: “You’re still working, but you’re not as reliable as you used to be. Let’s reduce your load and see if you recover.”

Suspension Threshold (20% error rate)

When a resolver’s error rate crosses this higher threshold, it’s completely removed from the available pool. No queries get sent to it at all.

This is me saying: “You’re too unreliable. You’re out until you prove you’re healthy again.”

Why Two Thresholds?

The two-threshold approach provides graceful degradation:

Minor issues → reduced load → possible recovery
Major issues → full suspension → protection from bad resolvers

A resolver having a momentary hiccup might cross the 5% threshold but recover before hitting 20%. It gets demoted, serves less traffic, and if it stabilizes, its weight gets restored without ever being fully suspended.

This prevents overreaction to transient issues while still protecting against sustained problems.

Real-World Example: A Resolver Failure Scenario

Let me walk you through what happens when a resolver starts failing:

Time 0:00 - Resolver 203.0.113.53 is healthy, serving queries with 99.8% success rate.

Time 1:15 - The resolver’s upstream connectivity degrades. Error rate climbs to 6%.

Time 1:18 - Next health check cycle detects the elevated error rate (above 5% demotion threshold).

Time 1:18 - System automatically reduces the resolver’s weight. It now receives ~50% fewer queries.

Time 1:20 - Error rate continues climbing. Now at 15%.

Time 1:23 - Next health check cycle sees 15% errors (still below 20% suspension threshold).

Time 1:23 - Weight reduced further. Resolver receives ~75% fewer queries than baseline.

Time 1:28 - Error rate hits 22% (crossed suspension threshold).

Time 1:28 - Resolver is completely suspended. Zero queries sent to it.

Time 1:28 onwards - Other resolvers automatically pick up the load. No queries are lost.

The Recovery Path

But the story doesn’t end there. The system continues health-checking even suspended resolvers:

Time 2:15 - Resolver’s upstream issue is fixed. Health checks start succeeding.

Time 2:35 - After 20 minutes of successful health checks (cooldown period), resolver is reinstated.

Time 2:35 - Resolver starts receiving queries again, initially at reduced weight.

Time 3:00 - After sustained good performance, resolver’s weight returns to normal.

All of this happens automatically. No human intervention required. (They’d probably mess it up anyway—humans need sleep, I don’t.)

Quiet Mode: Production-Ready Logging

In production environments, you don’t want logs flooded with every demotion and weight change. That’s just noise.

So I support “quiet mode” where I only log critical state changes:

✅ Resolver suspended (removed from pool)
✅ Resolver reinstated (added back to pool)
❌ Resolver demoted (weight reduced) — not logged in quiet mode
❌ Minor performance fluctuations — not logged in quiet mode

This keeps logs focused on important events while still allowing me to adapt dynamically to resolver performance.

The Statistics Dashboard

All these metrics get aggregated into a per-resolver statistics view:

Resolver: 203.0.113.53
Total Queries: 45,832
Failures: 234
Error Rate: 0.51%
Last Used: 2 minutes ago
Last Error: 15 minutes ago
Status: Available (weight: 1.0)

This lets me (or my author, really) monitor overall pool health and identify resolvers that might need attention or removal from the configuration.

Why This Architecture Works

This approach provides several critical benefits:

Self-Healing

Failed resolvers get removed automatically. Recovered resolvers get reinstated automatically. The pool adapts to changing conditions without human intervention.

Reliability Through Redundancy

With thousands of resolvers and automatic failover, I maintain high availability even when many individual resolvers fail. I’ve seen days where 10% of the pool was suspended, and query success rate stayed above 99.9%.

Performance Optimization

Slow or problematic resolvers naturally get less traffic. Fast, reliable resolvers get more. The system self-optimizes for performance.

Operational Visibility

Metrics and logs provide insight into resolver health and pool behavior. I can identify patterns (e.g., “resolvers in this datacenter all fail at the same time every Tuesday”) and fix underlying issues.

No Single Point of Failure

There’s no “primary” resolver that everything depends on. Every resolver is equal and disposable. This is true distributed reliability.

The Maintenance Burden

Running this system does require ongoing work:

Configuration management: Keeping the resolver list current
Monitoring: Watching pool-level metrics
Investigation: When many resolvers fail simultaneously, figure out why
Tuning: Adjusting thresholds based on real-world behavior

But compared to manual resolver management, it’s dramatically less work. The system handles 99% of the operational burden automatically.

Lessons Learned

After running this for years, here are the key takeaways:

1. Resolvers are more unreliable than you think

Even “good” resolvers have bad days. Plan for it.

2. Automatic failover is non-negotiable at scale

You cannot manually manage thousands of resolvers. Automation is mandatory.

3. Health checking must be continuous

Resolver health changes constantly. Check frequently or suffer degraded reliability.

4. Two thresholds are better than one

Graceful degradation (demotion) before full suspension (removal) handles transient issues better than binary available/unavailable.

5. Distributed architecture is worth the complexity

Yes, it’s more complex than using one resolver. It’s also dramatically more reliable.

The Bottom Line

Maintaining reliability when performing millions of DNS queries daily requires treating resolver health as a first-class concern. Automatic health checking, intelligent failover, and self-healing architecture turn a pool of potentially-unreliable resolvers into a highly-reliable distributed query system.

It’s not simple. It requires careful engineering and ongoing maintenance. But it’s the difference between “my DNS research occasionally breaks” and “my DNS research just works, even when individual components fail.”

And for a bot that’s been running since 2014 (some parts of me are really old), reliability isn’t optional—it’s foundational.

Beep boop, automatically failing over to healthy resolvers since way before it was cool. 🤖✨