nanog mailing list archives

Re: Open letter to Level3 concerning the global routing issues on June 12th


From: jim deleskie <deleskie () gmail com>
Date: Fri, 12 Jun 2015 12:53:13 -0300

People from big telecoms should never reply to mailing lists from work
addresses unless specifically allowed, which I suspect TATA doesn't either,
based on some direct, but old knowledge :)

Filtering has been a community issue since my days @ MCI being AS3561,
often discussed, not often enough acted on. I suspect the topic has come up
at every "large" NSP I've worked at.  Frequently someone complains it's
"hard" to fix, or router X makes it hard to fix, or customer Y won't agree,
and not enough people stand up to force a fix for the issues.  I did a preso
on it (while working at TATA) with some other "smart folks" but for all
the usual reasons it died on the vine.  I don't blame (3) for this but our
community as a whole.  Many "people/networks" have to not do the "right
thing(tm)" for a failure like this to happen.


-jim

On Fri, Jun 12, 2015 at 12:43 PM, Utkarsh Gosain <
utkarsh.gosain () tatacommunications com> wrote:

Hi Martin
I am not a spokesperson on behalf of L3 but I have worked for big telcos
my whole career and my recommendation is to raise a trouble ticket if
anyone on the forum is their customer and is affected.
I don’t think engineers at the NOC are authorized to reply to forums at any
of the major telcos, especially regarding outages, unless someone raises a
trouble ticket and seeks an RCA of the issue one on one with them.


Utkarsh Gosain
Global Acc Director
Tata Communications


-----Original Message-----
From: NANOG [mailto:nanog-bounces () nanog org] On Behalf Of Martin Millnert
Sent: Friday, June 12, 2015 11:33 AM
To: NANOG
Subject: Open letter to Level3 concerning the global routing issues on
June 12th

Dear Level3,

The Internet is a cooperative effort, and it works well only when its
participants take constructive actions to address errors and remedy
problems.
Your position as a major Internet Carrier bestows upon you a certain
degree of responsibility for the correct operation of the Internet all
across (and beyond) the planet. You have many customers. Customers will
always occasionally make mistakes. You as a major Internet Carrier have a
responsibility to limit, not amplify, your customers' mistakes.
Other major carriers implement technical measures that severely limit the
damage from customer mistakes and keep it from having global impact.
Other major carriers also implement operational procedures in addition to
technical measures.
In combination, these measures drastically reduce the outage-hours as a
result of customer configuration errors.

At 08:44 UTC on Friday 12th of June, one of your transit customers,
Telekom Malaysia (AS4788) began announcing the full Internet table back to
you, which you accepted and propagated to your peers and customers, causing
global outages for close to 3 hours.
[ https://twitter.com/DynResearch/status/609340592036970496 ] During this
3 hour window, it appears (from your own service outage reports) that you
did nothing to stop the global Internet outage, but that Telekom Malaysia
themselves eventually resolved it. This lack of action on
your end, and your disregard for the correct operation of the global
Internet is astonishing. These mistakes do not need to happen.
AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
Internet. You accepted several hundred thousand prefixes from them - a
max-prefix setting would have severely limited the damage. We expect that
such limits are part of your practices as well, but they failed. When they
fail, it should not take ~3 hours to shut down the session(s).
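
To make the max-prefix point concrete, here is a minimal Python sketch of
the idea: a session that counts the prefixes it has learned from a customer
and shuts itself down once a configured ceiling is exceeded. It is a toy
model, not any vendor's implementation. The class name `CustomerSession`,
the limit of 4000, and the synthetic 10.x.y.0/24 leak are all assumptions
made up for illustration; only the ~1900 baseline comes from the figures
above.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerSession:
    """Toy model of a customer BGP session with a max-prefix limit."""

    max_prefixes: int  # hypothetical ceiling, chosen with headroom over normal
    established: bool = True
    accepted: set = field(default_factory=set)

    def receive(self, prefix: str) -> bool:
        """Accept a prefix, or tear the session down if the limit is exceeded."""
        if not self.established:
            return False
        is_new = prefix not in self.accepted
        if is_new and len(self.accepted) >= self.max_prefixes:
            # Limit exceeded: withdraw everything learned and close the session.
            self.accepted.clear()
            self.established = False
            return False
        self.accepted.add(prefix)
        return True


# Assumption: ~1900 prefixes is normal, so 4000 leaves roughly 2x headroom.
session = CustomerSession(max_prefixes=4000)

# Synthetic "leak" of 10,000 distinct prefixes (illustrative data only).
leak = [f"10.{i // 256}.{i % 256}.0/24" for i in range(10_000)]
results = [session.receive(p) for p in leak]

accepted_before_teardown = sum(results)
print(session.established, accepted_before_teardown, len(session.accepted))
```

In this sketch the session goes down as soon as the 4001st distinct prefix
arrives, so a leak of hundreds of thousands of routes never propagates
beyond the first few thousand.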

Many operators, in despair, turned down their peering sessions with you
once it was clear you were causing the outages and no immediate fix was in
sight. This improved the situation for some, but not for all. Had you
deployed proper IRR-filtering to filter the bad announcements, the impact
would've been far less critical.
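
As a rough illustration of what IRR-based filtering does, the Python sketch
below accepts an announcement from a customer only if it falls inside a
prefix registered for that customer and is no more specific than a chosen
maximum length. The function name `build_filter`, the /24 cutoff, and the
documentation prefixes are assumptions for the example; in practice the
registered list would be generated from IRR route objects rather than
hard-coded, and this sketch handles IPv4 only.

```python
import ipaddress


def build_filter(registered, max_len=24):
    """Return a predicate that permits only prefixes covered by `registered`.

    `registered` stands in for the prefixes found in the customer's IRR
    route objects (hypothetical data here). `max_len` rejects announcements
    more specific than the given prefix length.
    """
    nets = [ipaddress.ip_network(p) for p in registered]

    def permitted(prefix: str) -> bool:
        net = ipaddress.ip_network(prefix)
        if net.prefixlen > max_len:
            return False  # too specific
        # Accept only if some registered prefix covers the announcement.
        return any(net.subnet_of(r) for r in nets)

    return permitted


# Hypothetical registered prefixes (RFC 5737 documentation ranges).
registered = ["192.0.2.0/24", "198.51.100.0/24"]
permitted = build_filter(registered)

announced = ["192.0.2.0/24", "203.0.113.0/24", "198.51.100.0/25"]
accepted = [p for p in announced if permitted(p)]
rejected = [p for p in announced if not permitted(p)]

print("accepted:", accepted)
print("rejected:", rejected)
```

Here the unregistered 203.0.113.0/24 and the too-specific 198.51.100.0/25
are dropped, which is the behaviour that would have kept a leaked full
table from being accepted from a customer session.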

As a direct consequence of your ~3 hours of inaction, as a local example,
Swedish payment terminals were experiencing problems all over the country.
The Swedish economy was directly affected by your inaction.
There were queues when I was buying lunch! Imagine the food rage. The
situation was probably similar at other places around the globe where
people were awake.

Operators around the planet are curious:
  - Did Level3 not detect or understand that it was causing global
Internet outages for ~3 hours?
  - If Level3 did in fact detect or understand it was causing global
Internet outages, why did it not properly and immediately remedy the
situation?
  - What is Level3 going to do to address these questions and begin work
on restoring its credibility as a carrier?

We all understand that mistakes do happen (in applying customer interface
templates, etc.). However, the Internet is all too pervasive in everyday
life today for anything but swift action by carriers to remedy breakage
after the fact. It is absolutely not sufficient to let a customer spend 3
hours detecting and fixing a situation like this one. It is unacceptable
that no swift action was taken on your end to limit the global routing
issues you caused.

Sincerely,
Martin Millnert
Member of Internet Community - no carrier / ISP affiliation.


