nanog mailing list archives
Re: Global Akamai Outage
From: Lukas Tribus <lukas () ltri eu>
Date: Mon, 26 Jul 2021 14:20:39 +0200
Hello,

On Mon, 26 Jul 2021 at 11:40, Mark Tinka <mark@tinka.africa> wrote:
I can count, on my hands, the number of RPKI-related outages that we have experienced, and all of them have turned out to be a misunderstanding of how ROA's work, either by customers or some other network on the Internet. The good news is that all of those cases were resolved within a few hours of notifying the affected party.
That's good, but the understanding of operational issues in RPKI systems in the wild is underwhelming; we are bound to repeat the mistakes we made with DNS all over again.

Yes, a complete failure of an RTR server theoretically does not have big negative effects on a network. But a failure of RPKI validation with a separate RTR server can lead to outdated VRPs on the routers, just as RTR server bugs will, which is why monitoring not only for availability but also for whether the data is actually up to date is *very* necessary.

Here are some examples (both of operators' points of view and of actual failure scenarios):

https://mailman.nanog.org/pipermail/nanog/2020-August/208982.html
we are at fault for not deploying the validation service in a redundant setup and for failing at monitoring the service. But we did so because we thought it not to be too important, because a failed validation service should simply lead to no validation, not a crashed router.
In this case an RTR client bug crashed the router. But the point is that it is not widely understood that setting up RPKI validators and RTR servers is a serious endeavor and that monitoring them is not optional.

https://github.com/cloudflare/gortr/issues/82
we noticed that one the ROAs was wrong. When I pulled output.json from octorpki (/output.json), it had the correct value. However when I ran rtrdump, it had different ASN value for the prefix. Restarting gortr process did fix it. Sending SIGHUP did not.
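A mismatch like the one in that report (validator output correct, RTR content different) can be caught by diffing the two VRP sets. What follows is only a minimal sketch under stated assumptions: both sides have already been exported to JSON documents with a top-level "roas" list whose entries carry "prefix", "maxLength" and "asn" keys, and the ASN may appear either as a number or as an "AS"-prefixed string. The function names are hypothetical and not part of gortr, rtrdump or octorpki.

```python
from typing import List, Set, Tuple

# One VRP reduced to a comparable tuple: (prefix, maxLength, asn)
Vrp = Tuple[str, int, str]


def normalize_asn(asn) -> str:
    """Render 13335, "13335" and "AS13335" identically as "AS13335"."""
    text = str(asn).upper()
    return text if text.startswith("AS") else "AS" + text


def vrp_set(document: dict) -> Set[Vrp]:
    """Turn a {"roas": [...]} document (assumed layout) into a set of tuples."""
    return {
        (entry["prefix"], int(entry["maxLength"]), normalize_asn(entry["asn"]))
        for entry in document.get("roas", [])
    }


def diff_vrps(validator_doc: dict, rtr_doc: dict) -> Tuple[List[Vrp], List[Vrp]]:
    """Return (missing_from_rtr, unexpected_in_rtr) as sorted lists."""
    validator, rtr = vrp_set(validator_doc), vrp_set(rtr_doc)
    return sorted(validator - rtr), sorted(rtr - validator)


if __name__ == "__main__":
    # Made-up sample data using documentation prefixes and ASNs.
    validator_doc = {
        "roas": [{"prefix": "192.0.2.0/24", "maxLength": 24, "asn": "AS64496"}]
    }
    rtr_doc = {
        "roas": [{"prefix": "192.0.2.0/24", "maxLength": 24, "asn": 64497}]
    }
    missing, unexpected = diff_vrps(validator_doc, rtr_doc)
    print("missing from RTR:", missing)
    print("unexpected in RTR:", unexpected)
```

A short-lived difference is expected while the RTR server picks up a fresh validator run, so a check like this would presumably alert only when the same difference persists across several polls.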
https://github.com/RIPE-NCC/rpki-validator-3/issues/264
yesterday we saw a unexpected ROA propagation delay. After updating a ROA in the RIPE lirportal, NTT, Telia and Cogent saw the update within an hour, but a specific rpki validator 3.1-2020.08.06.14.39 in a third party network did not converge for more than 4 hours.
I wrote a naive nagios script to check for stalled serials on an RTR server:

https://github.com/lukastribus/rtrcheck

and talked about it in this blog post (shameless plug):

https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-and-how-to-monitor-them/

This is on the validation/network side. On the CA side, similar issues apply.

I believe we still lack a few high-profile outages caused by insufficient reliability in the RPKI stacks for people to start taking it seriously.

Some specific failure scenarios are currently being addressed, but this doesn't make monitoring optional:

rpki-client 7.1 emits a new per-VRP attribute, expires, which makes it possible for RTR servers to stop considering outdated VRPs:
https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925

stayrtr (a gortr fork) will consider this attribute in the future:
https://github.com/bgp/stayrtr/issues/3

cheers,
lukas
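The two ideas above, watching for a stalled serial and discarding VRPs past their expiry, can be sketched roughly as follows. This is an illustration under assumptions, not a reproduction of rtrcheck, rpki-client or stayrtr: the serial is assumed to have been fetched from the RTR server by some other means, "expires" is assumed to be a Unix timestamp in seconds on each VRP entry, and SerialWatch and drop_expired are hypothetical names.

```python
import time
from typing import Dict, List, Optional


class SerialWatch:
    """Remembers when the observed RTR serial last changed."""

    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self.last_serial: Optional[int] = None
        self.last_change: Optional[float] = None

    def observe(self, serial: int, now: Optional[float] = None) -> bool:
        """Record one poll; return True if the serial counts as stalled."""
        now = time.time() if now is None else now
        # Compare with != rather than >, since serial numbers may wrap around.
        if serial != self.last_serial:
            self.last_serial = serial
            self.last_change = now
            return False
        return (now - self.last_change) > self.max_age_seconds


def drop_expired(vrps: List[Dict], now: Optional[float] = None) -> List[Dict]:
    """Keep only VRPs whose 'expires' timestamp is still in the future.

    Entries without an 'expires' key are kept (an assumption of this sketch).
    """
    now = time.time() if now is None else now
    return [v for v in vrps if v.get("expires", float("inf")) > now]


if __name__ == "__main__":
    watch = SerialWatch(max_age_seconds=3600)
    for poll_time, serial in [(0, 100), (1800, 100), (3601, 100), (4000, 101)]:
        state = "STALLED" if watch.observe(serial, now=poll_time) else "ok"
        print(poll_time, serial, state)
```

The threshold is a judgment call: the global RPKI data set changes often, so a serial that stays put for hours is suspicious, but a value that is too tight would alert during ordinary quiet periods.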
