nanog mailing list archives

Re: Cisco ASR9902 SNMP polling ... is interesting


From: Saku Ytti via NANOG <nanog () lists nanog org>
Date: Fri, 8 Aug 2025 10:34:05 +0300

On Thu, 7 Aug 2025 at 15:08, Marc Binderberger via NANOG
<nanog () lists nanog org> wrote:

> Then why making these assumptions? Especially with XR - not your mom & dad IT
> box but for ISPs or IT departments - you could provide the mechanism and
> either "do nothing as default" or "block everything as default". And then
> provide documentation and service$$$ to the customers

Because while Cisco can't dimension the box well, operators do an even
worse job at it.

On cXR we had issues where LPTS would occasionally admit too much BGP.
After LPTS admits BGP traffic, it is hashed to one of 8 XIPC worker
processes before being handed over to BGP. Because we had a busy
device, XIPC didn't get the CPU cycles it needed to service the
LPTS-admitted packets, so XIPC dropped packets. This meant that a
couple of times a month some router lost 1/8th of its BGP speakers,
and Cisco explicitly refused to fix it. They literally said maybe it
works better in eXR (it does).
The funny thing is that this CPU demand was created by BGP itself:
because XIPC didn't have CPU priority over BGP, packets were dropped,
sessions flapped, and the flaps made BGP demand even more CPU. If XIPC
had had priority over BGP, the symptoms would have been lessened. I
pointed this out to Cisco and they agreed, but said they had
previously explored process priorities in cXR and only ended up with
more unstable devices (it was unmanageably complex for people to
understand what the priorities should be). All this while pitching
that an RTOS is mandatory for a carrier-grade NOS, when behind the
scenes nothing of said RTOS was used; it's just flat priority all
around.
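
To make the 1/8th figure concrete, here is a minimal sketch of what
hashing sessions onto 8 workers implies when one worker is starved of
CPU. The hash function, the peer addresses and the names worker_for
and peers_lost are assumptions for illustration only; this is not how
XIPC actually hashes.

```python
import hashlib
from collections import Counter

NUM_WORKERS = 8  # from the post: admitted BGP is hashed to one of 8 workers


def worker_for(peer_ip: str, num_workers: int = NUM_WORKERS) -> int:
    """Map a peer to a worker. Hypothetical hash, not the real XIPC one."""
    digest = hashlib.sha256(peer_ip.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_workers


def peers_lost(peers, starved_worker: int):
    """Peers whose packets land on the worker that is not getting CPU."""
    return [p for p in peers if worker_for(p) == starved_worker]


# 200 made-up peers from a documentation prefix.
peers = [f"192.0.2.{i}" for i in range(1, 201)]
load = Counter(worker_for(p) for p in peers)
lost = peers_lost(peers, starved_worker=3)

print(f"peers per worker: {dict(sorted(load.items()))}")
print(f"lost {len(lost)} of {len(peers)} peers ({len(lost) / len(peers):.1%})")
```

With a reasonably uniform hash, each worker carries roughly an eighth
of the peers, so one starved worker takes out roughly an eighth of the
sessions while the other seven keep working.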


Additionally, LPTS is exclusively an NPU-level policer: if port1
congests some policer, port2 suffers too, because there is no
more-specific fallback policer at the IFD or IFL level. So what can
you do if port1 has an L2 loop and is spewing ARP at you, killing
port2? You can't use MQC to limit it to 10pps and you can't ACL it, as
LPTS bypasses MQC and ACL, so your only option is to shut down port1.
You cannot ensure a priori that one port won't take out other ports.
There was excessive punt flow trap, which could be used with success
in this scenario, but that feature was retired, I guess because
someone at Cisco who knew why it was needed had left, and the
remaining people didn't understand its use case and didn't want to
carry the complexity.
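
The difference between a single NPU-wide policer and one with a
more-specific policer in front of it can be sketched as below. This is
a toy model under stated assumptions: one fixed packet budget per
interval instead of a real token bucket, and made-up rates and port
names. The per-port layer is hypothetical and is not something LPTS
offers.

```python
class Policer:
    """Fixed-budget policer for one interval: admits up to `rate` packets."""

    def __init__(self, rate: int):
        self.rate = rate
        self.used = 0

    def admit(self) -> bool:
        if self.used < self.rate:
            self.used += 1
            return True
        return False


def run(packets, npu_rate, per_port_rate=None):
    """Return the admitted packet count per port for one interval.

    packets: port names in arrival order.
    per_port_rate: None models a single NPU-wide policer; an int adds a
    hypothetical per-port policer in front of the NPU-wide one.
    """
    npu = Policer(npu_rate)
    per_port = {}
    admitted = {}
    for port in packets:
        admitted.setdefault(port, 0)
        if per_port_rate is not None:
            port_policer = per_port.setdefault(port, Policer(per_port_rate))
            if not port_policer.admit():
                continue  # dropped at the port, never touches the NPU budget
        if npu.admit():
            admitted[port] += 1
    return admitted


# port1 has a loop and sends 10,000 ARPs; port2 sends 10 legitimate ones.
traffic = ["port1"] * 5000 + ["port2"] * 10 + ["port1"] * 5000

shared = run(traffic, npu_rate=1000)
layered = run(traffic, npu_rate=1000, per_port_rate=100)

print(shared)   # {'port1': 1000, 'port2': 0}
print(layered)  # {'port1': 100, 'port2': 10}
```

In the shared case port1 burns the whole NPU budget before port2's
packets arrive. With the per-port layer, port1 is capped before it
reaches the shared budget and port2 gets through untouched.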


All of these are actually solvable: you can deliver a NOS where port1
on the same NPU won't take down port2, out of the box, without
configuration. But it requires a deep understanding of what the
platform can do, how it can do it, and how the actual customer network
works. This person doesn't exist.
Cisco or Nokia cannot even be configured like this by an operator.
Juniper can be, but it's way too complicated for operators to do.

So if you have a casual understanding of how these devices work, you
can bring down any core device, no matter how it's protected, with a
trivially sized DoS from a single VPC. The only reason the Internet
works is that there isn't motivation to break it, not that it is well
protected. Which is fine, because the same is true for personal
safety, and the focus should be on mitigating motivation rather than
on absolute safety.

Of course this thread isn't about protecting devices in bad weather,
it is about trying to make devices work in fair weather, which is a
much more reasonable ask.

-- 
  ++ytti
_______________________________________________
NANOG mailing list 
https://lists.nanog.org/archives/list/nanog () lists nanog org/message/V56CX5TXE7MSA2NQR6WFFZQWSWEDQCB5/

