nanog mailing list archives

RE: Cisco ASR9902 SNMP polling ... is interesting


From: Marc Binderberger via NANOG <nanog () lists nanog org>
Date: Thu, 7 Aug 2025 14:06:31 +0200


If only the customers would recognize how awesome this is!


No, I'm not bashing Cisco (worked for them, including XR & NXOS development, 
good people, had a good time), but there is an attitude that Cisco helped 
develop, and by now it is everywhere.

(on a lighter note, reminds me of this: 
https://e-fun.blogspot.com/2006/06/dilbert-at-cisco-ii.html)


Back to LJ's email: Just a few quotes.

But there are A LOT of assumptions that have to be made here [...]

[...] very likely not great assumptions for a system with a MUCH 
smaller/simpler config.

Then why make these assumptions? Especially with XR - not your mom & dad IT 
box but for ISPs or IT departments - you could provide the mechanism and 
either "do nothing as default" or "block everything as default". And then 
provide documentation and service$$$ to the customers.

Sure, this was discussed back then. And your reply is showing a big problem 
(IMHO):

 * "we have a good idea for default rate limitations" You did not say this 
but I assume this sums it up, as someone decided for default values. Problem: 
no you don't have a good idea. Often enough not even the customer has a good 
understanding. What you probably know from customer interactions is how to 
measure-and-adjust and _reach_ a good configuration. With the customer.

 * the elephant in the room is that if you are a small(er) ISP ... let's be 
honest, if you are not one of the big guys, then the focus of vendors is 
small(er) too. I get it to some extent but I have seen this too often from 
engineers too.


Than to say "machine a.b.c.d has totally unfettered access to my RP CPU and 
can melt me if he likes"

Ah, XR. A "real" OS (QNX back then, Linux now?). And "real" engineers do 
multi-threading. If they will ever learn how to do it right ... .

Frankly, there is a much bigger problem when you need rate limiting to keep 
your RP CPU from melting, i.e. from being unable to keep the box & network 
stable. SNMP may be toast if you flood it, the CPU may run at very high 
utilization - the rest should keep going. The whole complexity of XR and the 
hardware can have only this justification (in reality: engineers get carried 
away with the new toy).

There is another twist to it: if your router behaves "bad" then it was the 
customer's fault to not limit aggressively enough? And if you do rate-limit 
"tight", then establishing the baseline is a full project - is this 
realistic?  And the customer is left alone.
(remember, "taking the blame" is part of the vendor's job ;-)


Don't get me wrong, I am sure you worked hard for the best outcome. I like 
that you protect "your little router" :-) We should simply stop sugar coating the 
situation. Not sure any "pressure" of brutally-honest talk has an impact on 
Cisco's router business - it's commodity now, except the 400G interfaces and 
such - but at least it acknowledges the real problems colleagues like Drew 
have.


Taking a deep breath. My god, what have we done. With technology, with the 
Internet. 

The simple (and often older) stuff works, everything on top is too often 
complex, fragile and it is not always obvious what it achieves or if the 
effort is worth the outcome. Just look at the SPF/DKIM discussion we recently 
had on NANOG. Just look at XR and this LPTS discussion. And if you ever 
work(ed) on the code, it is quite shocking too.

We may have more/better tools, routers and software have improved but getting 
a basic engineering job done seems as hard, if not harder, compared to the 
old days. Back then we did not know (and had to learn), now we (well, the 
organizations, vendors) do not care anymore.

I better distract myself with a cat video on YT :-)

Marc








On Wed, 6 Aug 2025 19:51:21 +0000, LJ Wobker (lwobker) via NANOG wrote:
Some more background might be useful here... 

"Back In The Day" ... IOS XR was designed at least in some part to 
"automatically" protect the control plane from misconfiguration or 
malicious activity.  The LPTS architecture is built around a whole bunch of 
what I will call "automated policers" -- depending on the platform you 
might have dozens or hundreds of them.  The flow very roughly is:

- identify traffic types (BGP, BGP from a known peer, SNMP, ARP, and god 
only knows how many other things)
- check those incoming packets against a policer
- drop packets that exceed that policer
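
A minimal sketch of that three-step flow, assuming a plain token-bucket policer per traffic type. This is not the XR implementation: the names (`Policer`, `classify`, `punt`), the traffic types, and every rate and burst value are invented for illustration.

```python
class Policer:
    """Token bucket: sustained rate in packets/s, burst size in packets."""

    def __init__(self, rate_pps, burst):
        self.rate_pps = rate_pps
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last_ts = 0.0
        self.dropped = 0

    def allow(self, now):
        # Refill for the elapsed time, capped at the burst size.
        refill = (now - self.last_ts) * self.rate_pps
        self.tokens = min(self.burst, self.tokens + refill)
        self.last_ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False


def classify(pkt, known_bgp_peers):
    """Step 1: map a punted packet to a traffic type (hypothetical types)."""
    proto, dport = pkt.get("proto"), pkt.get("dport")
    if proto == "tcp" and dport == 179:
        return "bgp-known" if pkt.get("src") in known_bgp_peers else "bgp-default"
    if proto == "udp" and dport == 161:
        return "snmp"
    if proto == "arp":
        return "arp"
    return "default"


def make_policers():
    # Rates and bursts are made up for the example, not platform defaults.
    return {
        "bgp-known": Policer(2000, 200),
        "bgp-default": Policer(100, 20),
        "snmp": Policer(100, 20),
        "arp": Policer(500, 50),
        "default": Policer(50, 10),
    }


def punt(pkt, now, policers, known_bgp_peers):
    """Steps 2 and 3: check the packet against its policer; False means drop."""
    return policers[classify(pkt, known_bgp_peers)].allow(now)
```

With these made-up numbers, 50 SNMP requests arriving at the same instant would see 20 accepted and 30 dropped, and nothing in the "configuration" (here, the caller's code) says so.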

The whole idea here is that we don't want lots of packets from some "not 
totally trusted" thing to melt the box.  But there are A LOT of assumptions 
that have to be made here... and any assumptions made to protect the box 
when it has 2,000 BGP peers and 10,000 interfaces (which the asr9k can 
actually do) are very likely not great assumptions for a system with a MUCH 
smaller/simpler config.

I was around back in the very early 2000's when we discussed, specifically, 
whether or not we should try to find a way to put the LPTS policer values 
into the configuration.   There's no perfect answer here.  One of the 
fundamental choices in XR (which is not *always* followed but pretty close) 
is to not put things in the config that are default values.  This prevents 
the config from being a bazillion lines long.  Another fundamental choice 
is that we only put things in the configuration that the user has actually 
configured... which sort of seems obvious but definitely isn't always.

This gets to your complaint, which is at the very least partially 
legitimate:  the system is doing things (policing) that on other platforms 
have to be explicitly configured.  But on XR systems, these LPTS (i.e. 
control plane policers) are IMPLICIT, and therefore they're a lot less 
visible than you might see on other platforms.  

Not that it matters, but 20+ years ago we spent quite a few heated meetings 
kicking around how to handle this, and balance the need for visibility, 
configurability, and simplicity. No answer can optimize for all three, so 
what we have today is more or less what we landed on.  My apologies that 
it's not super obvious, but we did our best to balance those conflicting 
goals.  

If you google for queries like "asr 9000 lpts policers" or "configure lpts 
policer rates" you should find at least a few config guides, maybe a decent 
doc or two on xrdocs.io, and god willing even a ciscolive presentation or 
two (hell, one of them might even be mine) that talks about this.  Again, 
I'll apologize as my experience with the box was from roughly 2007-2013 so 
I can't quote you chapter and verse here -- but I do KNOW some of that 
capability exists. 

I do know that there are ways to configure the policer rates for specific 
protocols... I can't swear on my life that SNMP is one of those that is 
configurable, hopefully the answer is "yes" -- at least in theory if we 
know how fast the polling station wants to ask, we can open the policer to 
that number.   This is often trial and error as the exact numbers and unit 
conversions aren't obvious unless you want to put a damn packet sniffer on 
the thing.   
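
As a rough starting point for that trial and error, here is a back-of-the-envelope sketch of the request rate a poller generates. It assumes the policer is expressed in packets per second and that each SNMP request is one packet; the function name, parameters, and headroom factor are hypothetical, and the real units on a given platform still need to be checked.

```python
import math


def required_pps(oids_per_cycle, oids_per_request, send_window_s, headroom=1.5):
    """Estimate the request rate (packets/s) a polling station generates.

    oids_per_cycle   -- OIDs fetched from this router in one polling cycle
    oids_per_request -- OIDs carried per request (e.g. varbinds per GET)
    send_window_s    -- seconds over which the poller actually sends them,
                        which is often far shorter than the polling interval
    headroom         -- safety multiplier; 1.5 is an arbitrary choice
    """
    requests = math.ceil(oids_per_cycle / oids_per_request)
    return math.ceil(requests / send_window_s * headroom)


# Same workload, two sending patterns (numbers are illustrative only):
spread_out = required_pps(12000, 10, send_window_s=30)
compressed = required_pps(12000, 10, send_window_s=2, headroom=1.0)
```

The point of the two calls is that the send window, not the polling interval, drives the number: the same 1,200 requests need a very different policer rate when the poller fires them in 2 seconds instead of 30.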

For what it's worth...  even if it's possible (and I'm not sure it is?)  I 
would advise against pure-whitelisting any host or netblock in LPTS.  If 
you completely turned off the LPTS policers and you either accidentally (or 
someone else maliciously) got into that machine and did something 
{Accidental, Stupid, Devious} -- you risk melting the box.  It's MUCH safer 
to figure out a combination of:
 - slowing down the requests from the external thing
 - knowingly opening up the policers to a different / faster rate

Than to say "machine a.b.c.d has totally unfettered access to my RP CPU and 
can melt me if he likes"

I would push more towards "find me a way to open this policer up so I can 
choose how to balance my own risks like a grownup".

On the TAC side, if it's true that you've had a case open forever and 
haven't been able to get to SOMEONE who knows enough about IOS XR to get 
you at least remotely close to "you need to twiddle with the LPTS policers 
to get the behavior that you want" -- then something is pretty badly 
broken.  If you could unicast me a case number and whatever other specific 
info might help, I will do a little digging on the back end.  These routers 
are pretty fuckin' complex and troubleshooting them definitely isn't easy 
(again, sorry - we tried to err on the side of "be overly careful") ... but 
we also can't have the support org be some total black hole that can't get 
you reasonably quickly to someone who sorta knows how the thing works.


--lj

-----Original Message-----
From: Drew Weaver <drew.weaver () thenap com> 
Sent: Wednesday, August 6, 2025 1:28 PM
To: 'North American Network Operators Group' <nanog () lists nanog org>
Cc: LJ Wobker (lwobker) <lwobker () cisco com>
Subject: RE: Cisco ASR9902 SNMP polling ... is interesting

Hi there,

It has since been identified that the reason that the traffic is being 
dropped is the SNMP policer in LPTS seems to just be discarding the traffic.

I didn't configure it to do this.
This doesn't show up in the running configuration. TAC still hasn't figured 
that out yet, and 4 weeks later they still haven't provided me with a way to 
simply whitelist traffic from a single /32 in LPTS.

So yes, I will admit that I am somewhat ignorant on what you guys call CoPP 
in this platform but I don't think me being ignorant about it is as big a 
problem as TAC being fully unaware that it exists at all.

Still waiting for TAC to tell me how to whitelist a single /32 in the 
policer.

In 9 more weeks I'll let you know what the result ends up being.

Thanks though for stopping by.
-Drew



-----Original Message-----
From: LJ Wobker (lwobker) via NANOG <nanog () lists nanog org>
Sent: Tuesday, August 5, 2025 11:46 AM
To: North American Network Operators Group <nanog () lists nanog org>
Cc: LJ Wobker (lwobker) <lwobker () cisco com>
Subject: RE: Cisco ASR9902 SNMP polling ... is interesting

Wow, what a food fight this became.  At risk of wading into the middle 
school cafeteria and wearing ketchup, I'll attempt to possibly return to 
some semblance of a technical discussion.  For background, I was the first 
TME here at Cisco who worked on the ASR9k program back in the mid-2000s - 
so my memory might be a bit rusty but at least to some degree I can present 
myself as a knowledgeable source.  I also worked in TAC back in the day so 
I have some familiarity with their processes.  ;-)

In all IOS XR systems, there's an architecture designed to make sure that 
control plane traffic coming from the very high speed interfaces doesn't 
overwhelm the processing capacity of the system.  The whole thing is 
relatively complex and the exact implementation differs from system to 
system in the fine details, but the idea is that you want to funnel down 
traffic headed for the RP or linecard CPU so that by the time it gets there 
you're as confident as you can be that the traffic is legitimate and in the 
right place.

No one uses the same terms for anything, so some terminology...  We (cisco) 
broadly call the infrastructure "LPTS":  Local Packet Transport Service.  
The act of identifying that a packet needs to go up to the control plane we 
call "punting".  Every modern system from every vendor has SOME form or 
fashion for this, otherwise it's trivial to melt the system with traffic 
pointed at the control CPU.  But no one uses the same words.

Drew - I'm sorry you don't like the way my router works.  This hurts my 
feelings, because he's really a pretty good little router.  Let's see if we 
can figure out why.  In this case, there are lots of possible places where 
things can behave in ways you don't like.

First question... when you say "we poll SNMP on any interface" -- do you 
mean you're changing the target IP address for where you point the SNMP 
manager, where sometimes it's the management ethernet address and sometimes 
a regular interface address?  This matters because IN GENERAL (yes, I 
know...) the system behaves differently here.  Packets pointed at the 
management ethernet are run through a different set of policers than if 
you're pointed at a data plane interface.  IN GENERAL the "best" way to do 
something like this is with a loopback interface, as the defaults are 
"better" tuned for that config compared to a direct zap at the actual 
interface IP.  This also has the benefit of virtualizing the loopback so 
you aren't tied to a single point of failure, but that's a separate thing.

I'm not remotely surprised that the behavior is different from the 9901 to 
the 9902.  At risk of being an apologist for my implementation, even within 
a product family there are always (sometimes stupid) differences in the 
implementations.  

I can ABSOLUTELY ASSURE you that there is nowhere in the code that says 
"make 62% of the SNMP polls fail because we hate Drew".  This is not how 
our system works... somewhere in the path there's a policer or a meter that 
is either dropping some of the inbound requests, or the SNMP process is 
choking on something and timing out, or something like that.  But there is 
no such thing on the router side as an SNMP polling timeout - that is a 
client side thing.  The SNMP process on the router gets a request, and it 
sends a response, that's all.  If something (either external or within the 
labyrinth of internal protections) drops the request on the way in, SNMP 
never sees it, so it can't respond.  Then the client has to figure out what 
to do, which often is throw a timeout and/or retry -- but this is dependent 
on the implementation of the SNMP client, and there's nothing that the 
router OS can do about it.
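
To make the "timeout is a client-side thing" point concrete, here is a toy model, not real SNMP code: an agent that answers every request it sees, a path that may drop requests before they arrive, and a client that treats silence as a timeout and retries. All names and the retry count are made up for the sketch.

```python
class Agent:
    """Stands in for the router's SNMP process: one response per request seen."""

    def __init__(self):
        self.requests_seen = 0

    def handle(self, request_id):
        self.requests_seen += 1
        return f"response-{request_id}"


def deliver(agent, request_id, dropped_ids):
    """Path between client and agent; None means the request was dropped."""
    if request_id in dropped_ids:
        return None  # the agent never sees it, so it cannot respond
    return agent.handle(request_id)


def client_poll(agent, first_id, dropped_ids, retries=2):
    """Client-side policy: no answer counts as a timeout, then retry."""
    timeouts = 0
    for attempt in range(retries + 1):
        response = deliver(agent, first_id + attempt, dropped_ids)
        if response is not None:
            return response, timeouts
        timeouts += 1
    return None, timeouts
```

If requests 100 and 101 are dropped on the way in, the client records two timeouts before 102 gets through, while the agent's own counter shows exactly one request and no errors. That mismatch between client-side timeouts and agent-side counters is the hint that the drops sit somewhere in between.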

As someone mentioned along the way, the right way to troubleshoot this is 
to find the commands in XR that will show you the counters and potential 
drops between "the packet arrives at the box" and "SNMP did its thing with 
the packet".  I have to sadly admit that here I'm one of those old-ass Air 
Force Colonels who USED to be a hot-shit pilot, but now I fly a desk.  12 
years ago I could have told you chapter and verse what the commands are and 
where all the drop/meter counters live, but father time is undefeated and 
now I spend time apologizing on NANOG lists instead of having an actual lab 
to work on.  That said, your expectation that someone in TAC can figure out 
what's happening and explain it to you is totally reasonable, and if you're 
not getting those answers then escalating is correct.  We might not be able 
(or willing) to change the behavior to do things the way you like them, but 
we absolutely owe you an explanation of what's actually happening.  If you 
can't get this from TAC, let me know and I will attempt to shake that tree.

At LEAST the following things would need to be chased down, some of which 
we'd have to get from the customer side...
* which interface(s) are being polled?  MgmtEth, loopback, physical?  
* at what rate does the SNMP station generate and send request packets?  
(Time windows matter here.  A short but very fast burst of requests might 
trip the meter, stuff like that)
* can this rate be changed?
* how much stuff (i.e. MIBs) are you polling? 
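
On the "time windows matter" point, a small sketch of why the average rate alone does not tell you whether a meter trips. It assumes a token-bucket meter; `count_drops` and the 50 pps / 20 packet values are hypothetical, not ASR9k defaults.

```python
def count_drops(arrival_times, rate_pps, burst):
    """Count packets a token-bucket meter would drop for sorted arrival times."""
    if not arrival_times:
        return 0
    tokens = float(burst)  # bucket starts full
    last = arrival_times[0]
    drops = 0
    for t in arrival_times:
        # Refill for the gap since the previous packet, capped at burst size.
        tokens = min(burst, tokens + (t - last) * rate_pps)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            drops += 1
    return drops


# 600 requests per 60-second cycle in both cases, i.e. 10 requests/s on average.
paced = [i * 0.1 for i in range(600)]   # one request every 100 ms
bursty = [0.0] * 600                    # everything fired at once

paced_drops = count_drops(paced, rate_pps=50, burst=20)
bursty_drops = count_drops(bursty, rate_pps=50, burst=20)
```

Both patterns average well under the assumed 50 pps limit, yet the paced poller loses nothing while the bursty one loses 580 of 600 requests. This is why the sending pattern of the SNMP station is worth asking about before touching any policer.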

Anyway... hopefully that points you at least somewhat in the right 
direction.

--lj

-----Original Message-----
From: Mel Beckman via NANOG <nanog () lists nanog org>
Sent: Monday, August 4, 2025 10:42 AM
To: Tom Beecher <beecher () beecher cc>
Cc: nanog () lists nanog org; Mel Beckman <mel () beckman org>
Subject: Re: Cisco ASR9902 SNMP polling ... is interesting

Sorry, Tom. I’m not taking the bait.

-mel via cell

On Aug 4, 2025, at 7:02 AM, Tom Beecher <beecher () beecher cc> wrote:


Mel-

You have made multiple technical assertions in this thread that are 
demonstrably false. Quoting your earlier messages :

  1.  Also, non-management interfaces do packet processing in silicon at 
the ASIC level and don’t have the capacity to do anything more than 
statistical sampling of packets that require CPU-level processing to 
retrieve counters and generate SNMP responses. 62 % is as good a sampling 
rate as any other.
  2.  Cisco is likely to say that the control plane is only fully supported 
on the management port.
  3.  In-band SNMP to data forwarding interfaces violates that separation.

 You have attempted to frame these comments as :

honest and sincere attempts by other members to help identify the possible 
problem.

While your attempts to help may have been honest and sincere attempts to 
help the OP, they actually achieved the opposite effect. Your incorrect 
technical assertions , if anything, only hindered the OP's attempt to 
understand and identify their issue. Comment #1 is especially egregious ; 
you're telling Drew that his observations are *normal*.

Saku made 2 comments that addressed these falsehoods :

It might be easier to contribute, if there is familiarity to the subject 
matter.

some community member piled on with what can only be described as a bizarre 
drivel.

The first was a polite way of calling out the technical inaccuracies. The 
second was a more forceful way of stating "what you said was wrong". Most 
people, when they are corrected on a factual point, tend to reply with "Oh 
hey, I got that wrong, thanks for setting me straight" and move on. You 
seem to have just ignored it.

There is a massive difference between the following statements :

  1.  You are an idiot. [ Attacking the person ]
  2.  What you said was idiotic. [ Attacking the statements ]

It seems to me that you may be struggling to identify that difference, 
and taking *any* criticism as a personal attack.

Nobody is bullying you, or anybody else, in this conversation.





On Mon, Aug 4, 2025 at 9:42 AM Mel Beckman via NANOG 
<nanog () lists nanog org> wrote:
Thanks. I knew we were not so out to lunch! If you don’t push back on 
bullies, they take over the community. It crops up on nanog periodically. :(

-mel via cell

On Aug 4, 2025, at 5:54 AM, Joe Loiacono via NANOG 
<nanog () lists nanog org> wrote:

Hi Mel, for what it's worth, I could not figure out what they were 
referring to by Saku's comments. I saw no justification for their 
complaint. A bit out of character for Saku, also,

Joe

On 8/2/2025 7:23 PM, Mel Beckman via NANOG wrote:
I’ll just let the incivility of you both stand.

-mel

On Aug 2, 2025, at 3:52 PM, Tom Beecher 
<beecher () beecher cc> wrote:


Mel-

Saku did not call *you* any names. He called your *incorrect statements* 
in this thread 'bizarre drivel'. Which he is absolutely correct about. 
While your intentions may certainly have been to help, your statements 
here have been frankly dead wrong and did not accomplish that.

Probably just want to take the L here.


On Sat, Aug 2, 2025 at 5:34 PM Mel Beckman via NANOG 
<nanog () lists nanog org> wrote:
Saku,

What is actually appalling is that a member of NANOG calls “bizarre 
drivel” the honest and sincere attempts by other members to help identify 
the possible problem. There’s no cause to be uncivil, people can disagree 
without stooping to name-calling.

 -mel

On Aug 2, 2025, at 11:46 AM, Saku Ytti via NANOG 
<nanog () lists nanog org> wrote:

On Sat, 2 Aug 2025 at 21:02, Tom Beecher via NANOG 
<nanog () lists nanog org> wrote:

I don't have in depth knowledge of Cisco's SNMP implementations, or 
even the ASR platform specifically, but if Cisco TAC is telling you 
this is 'normal', they are completely full of shit, and you should 
click any and every 'escalate' button you can find.

This almost sounds like a default control plane DDOS policer / LPTS, 
something like that.

There are various complicated reasons for this; the LPTS policer is an 
unlikely culprit, but possible. Bug search will show various DDTS 
with poor SNMP performance outcome, most of them are unrelated to LPTS.

But absolutely correct, the right solution is to escalate. In the common 
case this would be the SE from your account team, who would fight for 
you internally.


It is appalling that OP came to nanog after correctly suspecting TAC 
is gaslighting them, some community member piled on with what can 
only be described as a bizarre drivel.
--
 ++ytti
_______________________________________________
NANOG mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.nanog.org
_archives_list_nanog-40lists.nanog.org_message_&d=DwIGaQ&c=euGZstcaT
DllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=OPufM5oSy-PFpzfoijO_w76wskMALE1
o4LtA3tMGmuw&m=ZRQKyw0amYQuJDrOUoUtJCSVZKWvb764kPF4UjLJKuQ_I4NVhCMVb
VNzC8h9aWfc&s=HVfyN6javj5uX9ryxhOPxQSiMh2CkQJi_x885vQNB0M&e=
7KXUNRGFI5OEVSDEDU2OL5VMY5NBGQCV/
