nanog mailing list archives

RE: Cisco ASR9902 SNMP polling ... is interesting


From: Drew Weaver via NANOG <nanog () lists nanog org>
Date: Fri, 8 Aug 2025 13:58:26 +0000

I'm not sure I have the minerals tbh.

-Drew


-----Original Message-----
From: Saku Ytti <saku () ytti fi> 
Sent: Friday, August 8, 2025 9:55 AM
To: North American Network Operators Group <nanog () lists nanog org>
Cc: LJ Wobker (lwobker) <lwobker () cisco com>; Marc Binderberger <marc+lists () sniff es>;
Drew Weaver <drew.weaver () thenap com>
Subject: Re: Cisco ASR9902 SNMP polling ... is interesting

I would chase this further with Cisco, if you have the cycles.

It often pays dividends in the future to have a proper understanding of the anatomy of the issue, so it's not purely
for curiosity's sake.


On Fri, 8 Aug 2025 at 16:51, Drew Weaver via NANOG <nanog () lists nanog org> wrote:

One other note I'd like to make on this just for future reference:

The default for SNMP in LPTS on this platform is 300 (I'm assuming 
that is 300pps)

We aren't sending 300pps of SNMP traffic at this device so nothing should have been policed by it.

There might be an issue with how it's counting or it's duplicating packets.

Anyway setting it to 500 made everything work properly.

(We aren't sending 500pps of SNMP at the machine either).
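As a general illustration only, a packets-per-second policer can drop traffic whose average rate is well below the configured limit if the limit is enforced over short windows and the traffic arrives in bursts. The sketch below assumes a fixed 100 ms accounting window and a poller that sends 40 packets back to back once per second. Both numbers and the function name `count_drops` are hypothetical; nothing in this thread establishes how LPTS actually accounts for packets, and this is not offered as the cause of the drops described above.

```python
def count_drops(arrivals_us, rate_pps, window_us=100_000):
    """Count packets dropped by a fixed-window policer.

    arrivals_us: packet arrival times in microseconds.
    rate_pps:    configured limit in packets per second.
    window_us:   accounting window (100 ms here, an assumption).
    """
    # Packets allowed in each window, e.g. 300 pps over 100 ms -> 30 packets.
    per_window = rate_pps * window_us // 1_000_000
    seen = {}
    dropped = 0
    for t in arrivals_us:
        window = t // window_us
        seen[window] = seen.get(window, 0) + 1
        if seen[window] > per_window:
            dropped += 1
    return dropped


# Hypothetical poller: 40 packets, 100 us apart, sent once per second.
# That is 40 pps on average, far below either limit tested here.
burst = [100 * i for i in range(40)]

for rate in (300, 500):
    print(f"limit {rate} pps: {count_drops(burst, rate)} of {len(burst)} dropped")
```

In this model the burst overruns the per-window allowance at 300 but fits at 500, even though the average rate never changes.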

Thanks,
-Drew


-----Original Message-----
From: Drew Weaver via NANOG <nanog () lists nanog org>
Sent: Friday, August 8, 2025 9:32 AM
To: 'North American Network Operators Group' <nanog () lists nanog org>
Cc: 'LJ Wobker (lwobker)' <lwobker () cisco com>; 'Marc Binderberger' 
<marc+lists () sniff es>; Drew Weaver <drew.weaver () thenap com>
Subject: RE: Cisco ASR9902 SNMP polling ... is interesting

I'm just replying here to let you know that this was "solved".

lpts pifib hardware police
 flow snmp rate 2000
!

I want to point out that if you set it to its max configuration value (4294967295) it ignores it entirely even 
though IOS XR seems to know that its maximum for this hardware is 50000.

It couldn't be bothered to simply set it to 50000 if you set it to the configured maximum of 4294967295.
It couldn't be bothered to simply say: "Hey we know the max for this platform is 50000 so we set it to 50000 but you
probably shouldn't be using 50000 for this value anyway".
It could be bothered to do absolutely nothing and silently reject the command, which made me laugh for about 5 minutes
this morning.
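The behavior wished for above (clamp to the platform maximum and say so, rather than silently ignoring the value) can be sketched as follows. This is a minimal illustration, not IOS XR code: the function name `effective_rate` is hypothetical, and the only values taken from the thread are the CLI maximum of 4294967295 and the platform maximum of 50000.

```python
import warnings

PLATFORM_MAX_PPS = 50000      # hardware maximum reported in this thread
CLI_MAX_PPS = 4294967295      # largest value the CLI accepts (2**32 - 1)


def effective_rate(requested_pps, platform_max=PLATFORM_MAX_PPS):
    """Return the rate to program, clamping to the platform maximum.

    Values outside the CLI range are rejected loudly; values above the
    platform maximum are clamped and the operator is told about it.
    """
    if not 0 <= requested_pps <= CLI_MAX_PPS:
        raise ValueError(f"rate {requested_pps} is outside the configurable range")
    if requested_pps > platform_max:
        warnings.warn(
            f"requested {requested_pps} pps exceeds the platform maximum; "
            f"using {platform_max} pps instead"
        )
        return platform_max
    return requested_pps


print(effective_rate(2000))
print(effective_rate(CLI_MAX_PPS))
```

Either outcome (clamp with a warning, or reject with an error) tells the operator what was actually programmed, which is the point being made.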

So thanks for that, Cisco, and more sincerely, thank you to everyone who took any time to try and assist me with this.

I still would have preferred to just tell it what IP addresses to expect SNMP traffic to come from and use that 
instead of a PPS policer but hey it's 2025 and preferences are luxuries.

-Drew


-----Original Message-----
From: Saku Ytti via NANOG <nanog () lists nanog org>
Sent: Friday, August 8, 2025 3:34 AM
To: North American Network Operators Group <nanog () lists nanog org>
Cc: LJ Wobker (lwobker) <lwobker () cisco com>; Marc Binderberger 
<marc+lists () sniff es>; Saku Ytti <saku () ytti fi>
Subject: Re: Cisco ASR9902 SNMP polling ... is interesting

On Thu, 7 Aug 2025 at 15:08, Marc Binderberger via NANOG <nanog () lists nanog org> wrote:

Then why make these assumptions? Especially with XR - not your mom 
& dad IT box but for ISPs or IT departments - you could provide the 
mechanism and either "do nothing as default" or "block everything as 
default". And then provide documentation and service$$$ to the 
customers.

Because while Cisco can't dimension the box well, operators do an even worse job at it.

On cXR we had issues where occasionally LPTS would admit too much BGP. After LPTS admits BGP traffic, it is hashed to
one of 8 XIPC worker processes before it is handed over to BGP. Because we had a busy device, XIPC didn't get the CPU
cycles it needed to service the LPTS-admitted packets, causing XIPC to drop packets. This meant that a couple of times
a month some router lost 1/8th of its BGP speakers, and Cisco explicitly refused to fix it. They literally said maybe
it works better in eXR (it does).
The funny thing is, this CPU demand was created by BGP: because XIPC didn't have priority for CPU over BGP, the drops
caused flaps, and the flaps made BGP demand even more CPU. If XIPC had had priority over BGP, the symptoms would have
been lessened. I pointed this out to Cisco; they agreed, but said they had previously explored process priorities in
cXR and ended up with even more unstable devices (it was unmanageably complex for people to understand what the
priorities should be).
All this while pitching that an RTOS is mandatory for a carrier-grade NOS, while behind the scenes nothing of said
RTOS was used; it's just flat priority all around.
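To illustrate why one starved worker takes out roughly an eighth of the peers, here is a small sketch of hashing BGP peers onto 8 workers. The hash function (CRC32 of the peer address), the peer addresses, and the function names are all assumptions made for illustration; the thread does not say how the real hashing is done.

```python
import zlib

NUM_WORKERS = 8  # the message above describes 8 XIPC worker processes


def worker_for(peer_ip, num_workers=NUM_WORKERS):
    """Map a peer to a worker with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(peer_ip.encode()) % num_workers


def peers_affected(peers, starved_worker):
    """Return the peers whose packets land on the starved worker."""
    return [p for p in peers if worker_for(p) == starved_worker]


# Hypothetical peer list: 800 distinct addresses.
peers = [f"10.0.{i // 256}.{i % 256}" for i in range(800)]

lost = peers_affected(peers, starved_worker=3)
print(f"{len(lost)} of {len(peers)} peers sit behind the starved worker")
```

Because each peer always maps to the same worker, a worker that cannot keep up hurts the same subset of sessions every time, while the remaining peers are unaffected.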


Additionally, LPTS is exclusively an NPU-level policer: if port1 congests some policer, port2 suffers too, and there
is no more-specific fallback policer at the IFD or IFL level. So what can you do if port1 has an L2 loop and is
spewing ARP at you, killing port2? You can't use MQC to limit it to 10pps and you can't ACL it, as LPTS bypasses MQC
and ACLs, so your only option is to shut down port1. You cannot ensure a priori that one port won't take out other
ports.
There was an excessive flow trap feature, which could be used with success in this scenario, but it was retired,
I guess because someone at Cisco who knew why it was needed had left, and the remaining people didn't understand its
use case and didn't want to carry the complexity.
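The difference between a single shared policer and a policer with a more-specific per-port stage can be sketched with a toy model. Everything here is hypothetical (the limits, the arrival pattern, the function names), and the per-port stage is exactly what the message above says the platform lacks; this is not a description of how LPTS is implemented.

```python
def shared_policer(arrivals, limit):
    """Admit the first `limit` packets in the interval, regardless of port."""
    admitted = {}
    total = 0
    for port in arrivals:
        if total < limit:
            admitted[port] = admitted.get(port, 0) + 1
            total += 1
    return admitted


def per_port_then_shared(arrivals, limit, per_port_limit):
    """Cap each port first, then apply the shared limit to what survives."""
    seen = {}
    survivors = []
    for port in arrivals:
        seen[port] = seen.get(port, 0) + 1
        if seen[port] <= per_port_limit:
            survivors.append(port)
    return shared_policer(survivors, limit)


# One interval of arrivals: port1 is looping and floods ARP, while port2
# sends 10 legitimate packets spread evenly through the flood.
arrivals = (["port1"] * 999 + ["port2"]) * 10

print("shared only:     ", shared_policer(arrivals, limit=1000))
print("per-port + shared:", per_port_then_shared(arrivals, limit=1000, per_port_limit=500))
```

With only the shared policer, port1 consumes almost the whole budget and port2 gets 1 of its 10 packets through; with a per-port cap in front, port1 is held to its own allowance and all 10 of port2's packets are admitted.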


All of these are actually solvable: you can deliver a NOS where port1 in the same NPU won't take down port2,
out of the box, without configuration. But it requires a deep understanding of what the platform can do, how it can
do it, and how the actual customer network works.
This person doesn't exist.
Neither Cisco nor Nokia can even be configured like this by an operator. Juniper can be, but it's way too complicated
for operators to do.

So if you have a casual understanding of how these devices work, you can bring down any core device, no matter how
it's protected, with a trivially sized DoS from a single VPC. The only reason the Internet works is that there isn't
motivation to break it, not that it is well protected. Which is fine, because the same is true for personal safety,
and the focus should be on mitigating the motivation rather than on absolute safety.

Of course this thread isn't about protecting devices in bad weather, it is about trying to make devices work in fair 
weather, which is a much more reasonable ask.

--
  ++ytti
_______________________________________________
NANOG mailing list 
https://lists.nanog.org/archives/list/nanog () lists nanog org/message/IY25ESMOLAFDXMCU5AOW2KJA5Q22C4FH/

