nanog mailing list archives

Re: FYI Netflix is down


From: Hal Murray <hmurray () megapathdsl net>
Date: Mon, 02 Jul 2012 21:24:29 -0700


George Herbert <george.herbert () gmail com> said:

I worked for a Sun clone vendor (Axil) for a while and took some of our
systems and storage to Comdex one year in the 90s.  We had a RAID unit
(Mylex controller) we had just introduced.  Beforehand, I made REALLY REALLY
SURE that the pull-the-disk and pull-the-redundant-power tricks worked.  And
showed them to people with the "Please keep in mind that this voids the
warranty, but here we *rip* go...".  All of the other server vendors were
giving me dirty looks for that one. Apparently I sold a few systems that
way. 

:)  Nice.  Thanks.

Many years ago, I worked for one of DEC's research groups.  We built a 
network using FDDI 4B/5B link technology based on AMD TAXI chips.  (They were 
state of the art back then.)  The switches were 3U(?) boxes with 12 ports.  
It took a rack of 6 or 8 of them in the phone closet to cover a floor.  
Workstations had 2 cables plugged into different switches.  In theory, we 
covered any single point of failure.

My office was near the phone closet.  I got to watch my boss give demos to 
visiting VIPs.  He was pretty good at it.  In the middle of explaining 
things, he would grab a power cord and yank it.  Blinka-blinka-blinka and the 
remaining switches would reconfigure and go back to work.  (It took under a 
second.)

It was interesting to watch the VIPs.  Most of them got it: the network 
really could recover quickly. The interesting ones had a telco background.  
They were really surprised.  The concept of disrupting live traffic for 
something as insignificant as a demo was off scale in their culture.

It was just a research lab.  We were used to eating our own dog food.

----------

"Greg D. Moore" <mooregr () greenms com> said:

If folks have not read it, I would suggest reading Normal Accidents by
Charles Perrow.

+1

The "it can't happen" is almost guaranteed to happen. ;-)  And when it
does, it'll often interact in ways we can't predict or sometimes even
understand.

My memory of that sort of event is roughly...  (see above for context)

The hardware broke and turned a vanilla packet into a super-long packet.  My 
FPGA code was supposed to catch that case and do something sane.  It was 
never tested and didn't work.  It poured crap all over memory.  Needless to 
say, things went downhill from there.

Easy to spot in hindsight.  None of us thought that was an interesting case 
while we were testing.
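
For illustration only, here is a minimal Python sketch of the kind of guard 
described above: check a frame's length before copying it into memory, and 
exercise the oversized case on purpose.  This is not the original FPGA logic, 
which is not shown here.  The names store_frame and MAX_FRAME_BYTES, and the 
size limit itself, are assumptions made up for the example.

```python
# Hypothetical limit chosen for the example; not taken from the original design.
MAX_FRAME_BYTES = 4500


def store_frame(buffer: bytearray, offset: int, frame: bytes) -> int:
    """Copy frame into buffer at offset, or drop it if it does not fit.

    Returns the number of bytes written (0 when the frame is dropped).
    """
    # A frame longer than the limit is treated as corrupt and dropped.
    if len(frame) > MAX_FRAME_BYTES:
        return 0
    # Never write outside the buffer, whatever the length claims.
    if offset < 0 or offset + len(frame) > len(buffer):
        return 0
    buffer[offset:offset + len(frame)] = frame
    return len(frame)


# Exercise the "can't happen" case deliberately.
memory = bytearray(8192)
written_ok = store_frame(memory, 0, b"\x01" * 100)
snapshot = bytes(memory)
written_long = store_frame(memory, 0, b"\xff" * 10000)
```

The point is less the check itself than the last three lines: the super-long 
frame is fed in on purpose, so the untested path gets run at least once.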


-- 
These are my opinions.  I hate spam.




