Re: QWest is having some pretty nice DNS issues right now
From: Steve Gibbard <scg () gibbard org>
Date: Fri, 6 Jan 2006 18:54:52 -0800 (PST)
On Fri, 6 Jan 2006, william(at)elan.net wrote:
> On Fri, 6 Jan 2006, Wil Schultz wrote:
>> Apparently they have lost two authoritative servers. ETA is unknown.
>
> You forgot to mention that they only have two authoritative servers for
> most of their domains...
I didn't look at this while it was happening, and haven't talked to
anybody else about it, so I don't know if this was a systems or routing
issue. But, in the spirit of trying to learn lessons from incomplete
information:
Qwest.net and Qwest.com have two authoritative name server addresses
listed, dca-ans-01.inet.qwest.net and svl-ans-01.inet.qwest.net. As the
names imply, traceroutes to these two servers appear to go to somewhere in
the DC area and somewhere in proximity to Sunnyvale, California. It
appears they're really just two servers or single-location load-balanced
clusters, and not an anycast cloud with two addresses. It may be that two
simultaneous server failures would take out the whole thing, or they may
be in less visible load balancing configurations. Even if it's two
individual servers, that's the standard n+1 redundancy that's generally
considered sufficient for most things.
There is a fair amount of geographic diversity between the two sites,
which is a good thing.
The two servers have the IP addresses 220.127.116.11 and 18.104.22.168.
These both appear in global BGP tables as part of 22.214.171.124/14, so any
outage affecting that single route (flapping, getting withdrawn, getting
announced from somewhere without working connectivity to the two name
servers, etc.) would take out both of them.
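To make the shared-route concern concrete, here is a minimal Python sketch that groups name server addresses by the most specific announced prefix covering each one. The function names (covering_route, group_by_route) are made up for illustration, and the prefixes and addresses come from the RFC 5737 documentation ranges, not from Qwest's actual announcements. A real check would take its prefix list from a BGP table dump or a looking glass rather than a hard-coded list.

```python
import ipaddress


def covering_route(address, announced_prefixes):
    """Return the most specific announced prefix containing address, or None."""
    ip = ipaddress.ip_address(address)
    matches = [
        net
        for net in map(ipaddress.ip_network, announced_prefixes)
        if net.version == ip.version and ip in net
    ]
    # Longest-prefix match: the route a router would actually prefer.
    return max(matches, key=lambda net: net.prefixlen, default=None)


def group_by_route(nameservers, announced_prefixes):
    """Map each covering route to the name server addresses that depend on it."""
    groups = {}
    for address in nameservers:
        route = covering_route(address, announced_prefixes)
        groups.setdefault(route, []).append(address)
    return groups


# Hypothetical data (documentation address ranges, not real announcements).
announced = ["198.51.100.0/24", "203.0.113.0/24"]

# Both servers inside one announced prefix: a single route carries them both.
same_route = group_by_route(["198.51.100.10", "198.51.100.200"], announced)

# Servers in separately announced prefixes: losing one route leaves the other.
split_routes = group_by_route(["198.51.100.10", "203.0.113.10"], announced)

for label, groups in (("same route", same_route), ("split routes", split_routes)):
    print(label, {str(route): servers for route, servers in groups.items()})
```

If the result has only one key, every name server shares fate with that single route, which is the situation described above. Addresses not covered by any listed prefix end up under the key None.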
So from my uninformed vantage point, it looks like they started doing this
more or less right -- two servers or clusters of servers in two different
facilities, a few thousand miles apart on different power grids and not
subject to the same natural disasters. In other words, they did the hard
part. What they didn't do is put them in different BGP routes, which for
a network with as much IP space as Qwest has would seem fairly easy.
While it's tempting to make fun of Qwest here, variations on this theme --
working hard on one area of design while ignoring another that's also
critical -- are really common. It's something we all need to be careful
about.
Or, not having seen what happened here, the problem could have been
something completely different, perhaps even having nothing to do with
routing or network topology. In that case, my general point would remain
the same, but this would be a bad example to use.