nanog mailing list archives

Re: Global Akamai Outage


From: Lukas Tribus <lukas () ltri eu>
Date: Mon, 26 Jul 2021 14:20:39 +0200

Hello,


On Mon, 26 Jul 2021 at 11:40, Mark Tinka <mark@tinka.africa> wrote:
> I can count, on my hands, the number of RPKI-related outages that we
> have experienced, and all of them have turned out to be a
> misunderstanding of how ROAs work, either by customers or some other
> network on the Internet. The good news is that all of those cases were
> resolved within a few hours of notifying the affected party.

That's good, but the understanding of operational issues in RPKI
systems in the wild is underwhelming; we are bound to repeat the
mistakes we made with DNS all over again.

Yes, a complete failure of an RTR server should, in theory, have no
big negative effect on a network. But a failure of RPKI validation
behind a still-running, separate RTR server can leave outdated VRPs on
the routers, just as RTR server bugs can, which is why monitoring not
only availability, but also whether the served data is actually fresh,
is *very* necessary.
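To make that concrete, here is a minimal sketch of such a freshness
check on the validator side (not production monitoring; the path, the
thresholds and the "roas" JSON layout are assumptions for illustration):

#!/usr/bin/env python3
"""Toy freshness check for a validator's exported VRP file.

A minimal sketch, not a replacement for real monitoring: it only
checks that the export (e.g. octorpki's output.json) was rewritten
recently and still contains a plausible number of VRPs. The path,
thresholds and the "roas" JSON layout are assumptions.
"""
import json
import os
import sys
import time

EXPORT = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/rpki/output.json"
MAX_AGE = 3600       # seconds since the last rewrite before we alarm
MIN_VRPS = 100000    # far fewer VRPs than expected usually means trouble

try:
    age = time.time() - os.stat(EXPORT).st_mtime
    with open(EXPORT) as f:
        vrps = json.load(f).get("roas", [])
except (OSError, ValueError) as e:
    print(f"CRITICAL: cannot read {EXPORT}: {e}")
    sys.exit(2)

if age > MAX_AGE:
    print(f"CRITICAL: {EXPORT} is {age:.0f}s old (validation stalled?)")
    sys.exit(2)
if len(vrps) < MIN_VRPS:
    print(f"WARNING: only {len(vrps)} VRPs in {EXPORT}")
    sys.exit(1)
print(f"OK: {len(vrps)} VRPs, export rewritten {age:.0f}s ago")
sys.exit(0)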


Here are some examples (both of operators' points of view and actual failure scenarios):


https://mailman.nanog.org/pipermail/nanog/2020-August/208982.html

> we are at fault for not deploying the validation service in a redundant
> setup and for failing at monitoring the service. But we did so because
> we thought it not to be too important, because a failed validation
> service should simply lead to no validation, not a crashed router.

In this case an RTR client bug crashed the router. But the larger
point is that it is not clear to everyone that setting up RPKI
validators and RTR servers is a serious endeavor, and that monitoring
them is not optional.



https://github.com/cloudflare/gortr/issues/82

> we noticed that one of the ROAs was wrong. When I pulled output.json
> from octorpki (/output.json), it had the correct value. However when
> I ran rtrdump, it had a different ASN value for the prefix. Restarting
> the gortr process did fix it. Sending SIGHUP did not.
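
This kind of divergence is easy to catch mechanically: dump the RTR
server's view (e.g. with rtrdump) next to the validator's output.json
and compare the (prefix, maxLength, ASN) tuples. A rough sketch,
assuming both dumps use the common "roas" JSON layout:

#!/usr/bin/env python3
"""Compare a validator export against an RTR server dump.

A rough sketch: both inputs are assumed to be JSON files with a
"roas" list of {"prefix", "maxLength", "asn"} objects. Any
asymmetric difference means the RTR server is not serving what
the validator currently produces.
"""
import json
import sys

def norm_asn(asn):
    # accept both "AS13335" and 13335
    return int(str(asn).upper().removeprefix("AS"))

def load_vrps(path):
    with open(path) as f:
        data = json.load(f)
    return {(r["prefix"], r["maxLength"], norm_asn(r["asn"])) for r in data["roas"]}

validator = load_vrps(sys.argv[1])   # e.g. output.json written by the validator
rtr = load_vrps(sys.argv[2])         # e.g. a JSON dump taken from the RTR server

for vrp in sorted(validator - rtr):
    print("missing from RTR:", vrp)
for vrp in sorted(rtr - validator):
    print("stale on RTR:    ", vrp)

sys.exit(1 if validator != rtr else 0)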



https://github.com/RIPE-NCC/rpki-validator-3/issues/264

> yesterday we saw an unexpected ROA propagation delay.
>
> After updating a ROA in the RIPE lirportal, NTT, Telia and Cogent
> saw the update within an hour, but a specific rpki validator
> 3.1-2020.08.06.14.39 in a third party network did not converge
> for more than 4 hours.


I wrote a naive Nagios script to check for stalled serials on an RTR server:
https://github.com/lukastribus/rtrcheck

and talked about it in this blog post (shameless plug):
https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-and-how-to-monitor-them/
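
For illustration, the core of such a check can be quite small: speak
just enough of the RTR protocol (RFC 8210) to learn the current serial,
remember it between runs, and alarm when it stops moving for too long.
The sketch below is not the rtrcheck script itself; the host, port,
state file and threshold are made-up illustration values:

#!/usr/bin/env python3
"""Toy check for a stalled RTR serial: send an RTR Reset Query,
read PDUs until End of Data, extract the serial, and compare it
with the serial seen on the previous run.
"""
import socket
import struct
import sys
import time

HOST, PORT = "127.0.0.1", 8282      # common gortr/stayrtr RTR port
STATE = "/tmp/rtr_serial.state"     # where we remember the last serial
MAX_STALL = 3600                    # seconds the serial may stay unchanged

PDU_END_OF_DATA = 7
PDU_ERROR_REPORT = 10

def read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("RTR server closed the connection")
        buf += chunk
    return buf

def current_serial():
    with socket.create_connection((HOST, PORT), timeout=10) as s:
        # Reset Query: version 1, type 2, reserved zero, total length 8
        s.sendall(struct.pack("!BBHI", 1, 2, 0, 8))
        while True:
            hdr = read_exact(s, 8)
            version, ptype, session, length = struct.unpack("!BBHI", hdr)
            body = read_exact(s, length - 8)
            if ptype == PDU_ERROR_REPORT:
                raise RuntimeError("RTR error report from server")
            if ptype == PDU_END_OF_DATA:
                # serial number is the first 4 bytes of the body
                return struct.unpack("!I", body[:4])[0]

serial = current_serial()
now = time.time()

try:
    with open(STATE) as f:
        last_serial, last_change = map(float, f.read().split())
except (OSError, ValueError):
    last_serial, last_change = -1, now

if serial != last_serial:
    last_change = now
with open(STATE, "w") as f:
    f.write(f"{serial} {last_change}")

stalled = now - last_change
if stalled > MAX_STALL:
    print(f"CRITICAL: RTR serial {serial} unchanged for {stalled:.0f}s")
    sys.exit(2)
print(f"OK: RTR serial {serial}, last change {stalled:.0f}s ago")
sys.exit(0)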

This is on the validation/network side. On the CA side, similar issues apply.

I believe it will still take a few high-profile outages caused by
insufficient reliability in RPKI stacks before people start taking
this seriously.


Some specific failure scenarios are currently being addressed, but
this doesn't make monitoring optional:

rpki-client 7.1 emits a new per-VRP attribute, "expires", which makes it
possible for RTR servers to stop considering outdated VRPs:
https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925

stayrtr (a gortr fork) will consider this attribute in the future:
https://github.com/bgp/stayrtr/issues/3
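
To illustrate what honoring that attribute means on the consuming side,
here is a small sketch that drops VRPs whose "expires" timestamp is
already in the past (the per-VRP "expires" field in epoch seconds is
what rpki-client >= 7.1 emits in its JSON export; the file path and the
surrounding JSON layout are assumptions):

#!/usr/bin/env python3
"""Sketch of an expires-aware consumer: drop VRPs whose "expires"
timestamp is already in the past instead of serving them forever.
"""
import json
import sys
import time

def fresh_vrps(path, now=None):
    now = now if now is not None else time.time()
    with open(path) as f:
        roas = json.load(f)["roas"]
    kept, dropped = [], 0
    for vrp in roas:
        # VRPs without an expires value are kept; expired ones are dropped
        if "expires" in vrp and vrp["expires"] < now:
            dropped += 1
            continue
        kept.append(vrp)
    return kept, dropped

if __name__ == "__main__":
    kept, dropped = fresh_vrps(sys.argv[1])
    print(f"{len(kept)} usable VRPs, {dropped} expired and dropped")

The idea is the same inside an RTR server: once the backing data stops
being refreshed, the VRPs eventually age out instead of being served
forever.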



cheers,
lukas

