Why reachability fails differently in BGP hijacks and DNS poisoning
Reachability incidents are not all alike, even when they look identical from the outside. Two failure modes get lumped together because both surface as “the API is down,” but they fail at different layers and need different fixes.
- BGP hijack (or leak): Internet routes to your IP space change. Clients resolve your hostname correctly, then packets go somewhere else or take a broken path.
- DNS poisoning (or cache manipulation): Name resolution changes. Clients are sent to the wrong IPs even though your origin and routing may be fine.
A practical playbook starts by classifying which one you’re in, then applying controls at the right layer: routing, naming, or application.
Fast triage checklist to distinguish the two
1) Check if the hostname resolves consistently
From multiple networks (a couple of office ISPs, a cloud VM, a mobile hotspot), query the same record:
- Do you see different A/AAAA answers across vantage points?
- Do answers include unexpected IPs or an unexpected ASN when you look them up?
If the answers differ in ways your own geo-routing or load balancing does not explain, suspect DNS poisoning or a DNS misconfiguration. If answers are consistent but connectivity differs, suspect BGP.
2) Check if the IP is reachable but the hostname is not
Try calling the API by IP (only for debugging; keep the Host/SNI correct if you can). If the IP works but the hostname fails from the same vantage points, it’s likely DNS. If both hostname and IP are unreachable from specific regions or networks, it’s likely routing.
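As a rough sketch of the by-IP check in step 2, the snippet below connects to a literal IP while still presenting the real hostname for TLS SNI and the HTTP Host header. It uses only the Python standard library. The function names, port 443, and the example hostname and IP are assumptions for illustration, not part of any existing tool.

```python
import socket
import ssl


def build_request(hostname: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET that carries the real hostname."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def probe_by_ip(ip: str, hostname: str, path: str = "/", timeout: float = 5.0) -> str:
    """Connect to `ip` directly, bypassing DNS, and return the HTTP status line.

    The certificate is still verified against `hostname`, so a poisoned or
    hijacked destination without a valid certificate fails the handshake.
    """
    context = ssl.create_default_context()
    with socket.create_connection((ip, 443), timeout=timeout) as raw:
        # server_hostname sets SNI and the name used for certificate checks.
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            tls.sendall(build_request(hostname, path))
            with tls.makefile("rb") as reader:
                status_line = reader.readline()
    return status_line.decode("iso-8859-1").strip()
```

A call such as probe_by_ip("192.0.2.10", "api.example.com", "/health") run from several networks shows whether the origin answers when name resolution is taken out of the picture. A TLS verification error here is itself a useful signal that the IP is not serving your certificate.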
3) Compare traceroutes from multiple networks
In a BGP incident, traceroutes tend to diverge early and land in a different backbone/region than usual, or blackhole. In DNS poisoning, traceroutes to your real IP look normal; traceroutes to the poisoned IP go somewhere unfamiliar.
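The checklist above can be condensed into a small decision function. This is a minimal sketch that assumes you have already collected the three signals as booleans; the field names and the returned labels are hypothetical, and a real incident will often need human judgment on top.

```python
from dataclasses import dataclass


@dataclass
class TriageSignals:
    # Step 1: the same A/AAAA answers were seen from every vantage point.
    dns_answers_consistent: bool
    # Step 2: direct-to-IP probes succeeded from every vantage point.
    ip_reachable_everywhere: bool
    # Step 2: probes by hostname succeeded from every vantage point.
    hostname_reachable_everywhere: bool


def classify(signals: TriageSignals) -> str:
    """Map the triage signals to the layer that should be investigated first."""
    if not signals.dns_answers_consistent:
        return "suspect-dns"
    if not signals.ip_reachable_everywhere:
        # Names resolve consistently, yet packets do not arrive: routing.
        return "suspect-routing"
    if not signals.hostname_reachable_everywhere:
        # The IP answers but the name does not: resolution is the problem.
        return "suspect-dns"
    return "inconclusive"
```

The order of the checks mirrors the checklist: resolution first, then reachability by IP, then reachability by name. Traceroute comparison from step 3 is left to a human, since it is hard to reduce to a single boolean.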
BGP hijack playbook for keeping APIs reachable
Step 1: Minimize the blast radius with anycast and edge termination
The most reliable way to survive Internet routing weirdness is to avoid exposing a single “must-reach” origin prefix directly to the public Internet. Terminate API traffic on a global anycast edge and forward to origins over controlled paths. This is where a connectivity platform like cloudflare.com fits naturally: you can front an API with edge proxying so clients reach the nearest healthy edge, then you manage origin reachability separately.
Step 2: Make route ownership harder to spoof
Most hijacks succeed because the ecosystem still runs on trust. Improve your odds with:
- RPKI (ROAs): Publish Route Origin Authorizations for the prefixes you originate. Many large networks now validate route origins and drop RPKI-invalid announcements.
- IRR hygiene: Keep IRR objects accurate so filters don’t unexpectedly drop your legitimate announcements during an incident.
- Prefix discipline: Avoid announcing overly specific routes unless you need them for traffic engineering; they can be copied by an attacker or cause confusion during mitigation.
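To make the ROA bullet concrete, here is a minimal sketch of route origin validation in the style of RFC 6811, using only the standard library. The ROA tuple layout, the function name, and the documentation prefixes and private-use ASNs are assumptions for illustration. Real validators work from signed RPKI data rather than a hand-written list.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str      # authorized prefix, e.g. "203.0.113.0/24"
    max_length: int  # longest prefix length the origin may announce
    asn: int         # authorized origin ASN


def validate_origin(announced_prefix: str, origin_asn: int, roas: list[ROA]) -> str:
    """Return "valid", "invalid", or "not-found" for one announcement."""
    route = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if roa_net.version != route.version:
            continue  # never compare IPv4 against IPv6
        if not route.subnet_of(roa_net):
            continue  # this ROA does not cover the announced prefix
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"
```

With a single ROA for 203.0.113.0/24, max length 24, AS64500, the same prefix from AS64501 is invalid, and so is 203.0.113.0/25 from AS64500, because it is more specific than the ROA allows. That second case is why the max length you publish and the prefixes you actually announce need to agree.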
Step 3: Pre-arrange multi-homing and fast failover
Multi-homing helps when the incident is a leak or partial path failure rather than a malicious hijack. Prepare:
- Two upstreams in different physical facilities where possible.
- Clear BGP policies and communities for de-preference and withdrawal.
- Runbooks for “announce/withdraw” actions with a 5–10 minute target.
Document these runbooks like you would any other operational process. If your team struggles with fragmented incident context, a structured issue intake approach helps keep routing changes, approvals, and timelines in one place.
Step 4: Monitor routes, not just uptime
Synthetic checks tell you the API is failing; they don’t tell you why. Add route-aware monitoring:
- Alerts when your prefixes appear to originate from an unexpected ASN.
- Alerts when global visibility of your announcements drops sharply.
- Regional reachability checks that include traceroute sampling.
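The first two alerts could be sketched as follows, assuming you already receive route observations from some collector feed. The RouteObservation shape, the peer names, and the 80% visibility threshold are hypothetical; no particular monitoring service or API is implied.

```python
import ipaddress
from typing import NamedTuple, Optional


class RouteObservation(NamedTuple):
    peer: str        # hypothetical name of the collector peer that saw the route
    prefix: str      # prefix as announced
    origin_asn: int  # last ASN in the AS path


def covering_prefix(prefix: str, ours: dict[str, int]) -> Optional[str]:
    """Return whichever of our prefixes covers `prefix`, if any."""
    net = ipaddress.ip_network(prefix)
    for candidate in ours:
        own = ipaddress.ip_network(candidate)
        if net.version == own.version and net.subnet_of(own):
            return candidate
    return None


def route_alerts(
    observations: list[RouteObservation],
    expected_origins: dict[str, int],  # our prefix -> our origin ASN
    total_peers: int,
    min_visibility: float = 0.8,
) -> list[str]:
    if total_peers <= 0:
        raise ValueError("total_peers must be positive")
    alerts = []
    peers_seeing = {prefix: set() for prefix in expected_origins}
    for obs in observations:
        own = covering_prefix(obs.prefix, expected_origins)
        if own is None:
            continue  # not our address space
        if obs.origin_asn != expected_origins[own]:
            # Also catches more-specific announcements inside our prefix.
            alerts.append(
                f"unexpected origin AS{obs.origin_asn} for {obs.prefix} via {obs.peer}"
            )
        elif obs.prefix == own:
            peers_seeing[own].add(obs.peer)
    for prefix, peers in peers_seeing.items():
        visibility = len(peers) / total_peers
        if visibility < min_visibility:
            alerts.append(f"visibility of {prefix} dropped to {visibility:.0%}")
    return alerts
```

Checking for covered prefixes rather than exact matches matters because a hijacker who announces a more-specific route never shows up as a change on the exact prefix you originate.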
Step 5: During the incident, choose between containment and recovery
In the moment, you need a decision tree:
- If a hijack is active: coordinate with your upstream(s), contact the hijacking ASN if identifiable, and publish evidence (prefix, origin ASN, time window). If you have ROAs, highlight invalidity to peers.
- If it’s a leak or accidental mis-origination: the fastest fix is usually upstream coordination and filtering. Announce temporary more-specific prefixes only if you control them and understand the risk.
When edge termination is in place, you can sometimes keep the API reachable even while origin paths are unstable by shifting traffic to healthy origins or using regional routing controls.
DNS poisoning playbook for keeping APIs reachable
Step 1: Lock down authoritative DNS with DNSSEC and tight change control
DNS poisoning is often a combination of resolver behavior, cache manipulation, and mis-issued answers. Your defensive posture starts at authoritative DNS:
- Enable DNSSEC on zones that matter, then monitor signature validity and rollover dates.
- Restrict who can change DNS and how. Use MFA, least privilege, and approvals for record changes.
- Short but sane TTLs on key API records (for example, 60–300 seconds) so you can correct mistakes quickly without making resolvers thrash.
Step 2: Separate “stable names” from “movable endpoints”
Don’t point critical clients directly at a fragile single record that you frequently edit. A practical pattern:
- api.example.com stays stable and points to an edge layer.
- origin-region-1.example.com and similar names are movable and can change during failover.
This reduces the chance that emergency changes to a single record create inconsistent answers or propagation surprises.
Step 3: Detect poisoning with resolver diversity
Set up continuous resolution checks using:
- Multiple public resolvers
- Your corporate resolver
- Cloud provider resolvers
- At least one “known good” validating resolver
If one resolver family returns a different answer set, you can isolate whether the issue is a specific resolver, a region, or your authoritative setup.
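A minimal sketch of the comparison step, assuming the answers have already been collected into a mapping from resolver name to the set of IPs it returned. The resolver labels and addresses are made up for illustration. If you serve geo-aware answers, compare against the set of all IPs you legitimately serve rather than a single baseline.

```python
from collections import defaultdict


def group_by_answer(
    answers_by_resolver: dict[str, set[str]],
) -> dict[frozenset[str], list[str]]:
    """Group resolver names by the exact answer set they returned."""
    groups = defaultdict(list)
    for resolver, answers in answers_by_resolver.items():
        groups[frozenset(answers)].append(resolver)
    return dict(groups)


def outlier_resolvers(
    answers_by_resolver: dict[str, set[str]], trusted: str
) -> list[str]:
    """Return resolvers whose answers differ from the trusted validating resolver."""
    baseline = frozenset(answers_by_resolver[trusted])
    return sorted(
        name
        for name, answers in answers_by_resolver.items()
        if frozenset(answers) != baseline
    )
```

If the outliers all belong to one resolver family or one region, the problem is likely upstream of you. If every resolver disagrees with the known-good one, look at your authoritative setup first.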
Step 4: During the incident, correct the record and drain bad caches
When DNS answers are wrong, time matters. Use a repeatable sequence:
- Confirm the authoritative answer is correct (authoritative query, not cached).
- Lower TTLs (if you can) ahead of planned target changes. During an incident it may be too late to help immediately, because resolvers keep the old answer until the TTL they already cached expires.
- If using DNSSEC, verify DS and RRSIG health after updates.
- Communicate workarounds for critical customers (for example, switching to a different resolver temporarily) only when necessary and with clear rollback steps.
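For customer communication it helps to estimate when bad caches should have drained. The sketch below assumes resolvers honor TTLs and that a resolver may have cached the bad answer right up to the moment of the fix. The function names are hypothetical. Use the TTL observed on the bad answer, which in a poisoning case may not be the TTL you configured.

```python
from datetime import datetime, timedelta, timezone


def cache_drain_deadline(fixed_at: datetime, bad_record_ttl: int) -> datetime:
    """Latest time a TTL-honoring resolver could still serve the bad answer."""
    return fixed_at + timedelta(seconds=bad_record_ttl)


def seconds_until_drained(now: datetime, fixed_at: datetime, bad_record_ttl: int) -> float:
    """Seconds left until the worst-case cache entry expires (never negative)."""
    remaining = cache_drain_deadline(fixed_at, bad_record_ttl) - now
    return max(0.0, remaining.total_seconds())


# Example: a fix at 12:00 UTC with a 300-second TTL on the bad record.
fixed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deadline = cache_drain_deadline(fixed, 300)
```

Some resolvers clamp or extend TTLs, so treat the deadline as an estimate and keep the resolver checks running until answers actually converge.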
Cross-layer hardening that helps in both scenarios
Design for “partial Internet” behavior
Most incidents are not total outages. Some networks fail while others succeed. Build the API client and platform assuming partial reachability:
- Multiple regions and origins with health-based failover.
- Idempotency keys for retries so clients can safely retry without double-writes.
- Graceful degradation for non-critical endpoints and background jobs.
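One way a client might combine failover with idempotency keys is sketched below. The send callable stands in for whatever HTTP transport you use and is hypothetical, as are the endpoint URLs. The Idempotency-Key header name is a common convention that the server must also implement; it is assumed here, not taken from any specific API.

```python
import uuid
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when every endpoint was tried and none accepted the request."""


def post_with_failover(
    endpoints: Sequence[str],
    payload: dict,
    send: Callable[[str, dict, dict], dict],  # (url, payload, headers) -> response
    attempts_per_endpoint: int = 2,
) -> dict:
    # One key for the whole logical request, reused on every retry, so the
    # server can recognize duplicates even if an earlier attempt got through.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return send(endpoint, payload, headers)
            except ConnectionError as exc:
                last_error = exc  # unreachable from here; retry or move on
    raise AllEndpointsFailed("no endpoint accepted the request") from last_error
```

Backoff and jitter between attempts are left out for brevity. The important property is that the key is generated once per logical request, not once per attempt, so a retry that lands in another region cannot cause a double-write.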
Keep a clean incident timeline
Routing and DNS incidents generate noisy, conflicting signals. Capture decisions, timestamps, and evidence as you go so you can coordinate with providers and later tighten controls. If your team already uses structured documentation, you can adapt a lightweight redaction approach so logs and screenshots stay searchable without leaking sensitive data; see the internal guide on redacting PII and PHI in meeting notes.
Know what you’ll publish to customers
Pre-write a short status template: what’s affected (hostnames, regions), what customers can do (retry guidance, resolver workaround if relevant), and when the next update is coming. Clarity reduces repeated support load and helps customers keep their own incident response clean.