
RE: Cisco ASR9902 SNMP polling ... is interesting


From: LJ Wobker via NANOG <nanog () lists nanog org>
Date: Fri, 8 Aug 2025 12:16:36 -0400

By design, the LPTS default values are set to be on the "slow but safe"
side.  As I've already mentioned, picking default values is incredibly hard
for stuff like this because you've got a dramatic range of system sizes,
shapes, use cases, blah blah.  The general consensus is we'd rather force
people to open up policers explicitly than have them be too open by default.
Feel free to dismiss me as a crybaby apologist, but that's how we got here.

Said another way:  feel free to say that we made terrible choices and you
hate our defaults.  But you can't justly accuse us of not thinking about
it and/or just making shit up.

Another challenge here (yeah, I know... there goes LJ apologizing again...)
is that a "500 pps" policer is not actually 500 packets per second.  It's a
token bucket meter where the actual parameters are the token fill rate and a
burst size.  Choosing THESE values is yet another messy problem... if we
assumed that the bucket gets filled up once per second, what you'd end up
with is a meter that allows 500 packets through as fast as they can be
dequeued, but then lets nothing else through for the rest of that one-second
window.  Then we add 500 "tokens" at T = 1 sec, and lather, rinse, repeat.
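
To make that concrete, here's a toy sketch in Python of that degenerate
once-per-second refill (purely illustrative; this is not how any particular
ASIC or XR's LPTS actually implements it):

    import time

    class TokenBucket:
        """Toy policer: refill_tokens added every refill_interval seconds,
        with a bucket depth (burst size) equal to one refill."""
        def __init__(self, refill_tokens, refill_interval):
            self.refill_tokens = refill_tokens
            self.refill_interval = refill_interval
            self.tokens = refill_tokens            # start with a full bucket
            self.last_refill = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # add a full refill for each elapsed interval, capped at depth
            intervals = int((now - self.last_refill) / self.refill_interval)
            if intervals:
                self.tokens = min(self.refill_tokens,
                                  self.tokens + intervals * self.refill_tokens)
                self.last_refill += intervals * self.refill_interval
            if self.tokens >= 1:
                self.tokens -= 1
                return True        # packet passes
            return False           # packet policed

With TokenBucket(500, 1.0) you get exactly the behavior described: 500
packets pass back-to-back, then everything is dropped until T = 1 sec.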

In the real world every hardware-based meter is slightly different as far as
what burst sizes are available, how quickly the bucket refills, and so on.
But if we circle back to this particular case, we might well not have a 500
PPS policer, but rather one that is "50 packets every tenth of a second" or
"5 packets every hundredth of a second".
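
Using the toy bucket from above, all three of these are "500 pps" on paper
but treat a burst very differently (numbers are illustrative, not actual
LPTS parameters):

    coarse = TokenBucket(refill_tokens=500, refill_interval=1.0)   # 500/sec
    medium = TokenBucket(refill_tokens=50,  refill_interval=0.1)   # 50/100ms
    fine   = TokenBucket(refill_tokens=5,   refill_interval=0.01)  # 5/10ms

    # a 100-packet burst arriving effectively all at once:
    for name, bucket in [("coarse", coarse), ("medium", medium),
                         ("fine", fine)]:
        passed = sum(bucket.allow() for _ in range(100))
        print(name, "passed", passed, "of 100")
    # coarse passes all 100, medium passes 50, fine passes only 5

The fine-grained policer drops 95% of that burst even though the sender
never exceeds 500 pps averaged over a full second.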


This is where you have to know something about the other side (here, the
SNMP client)... does it send a burst of packets all at once?  How large is
that burst?  Does that burst overrun the policer on the router?  If it
times out and retries, does that make it worse or better?
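
Here's a sketch of why retries can make it worse (hypothetical client
behavior, not any specific SNMP implementation): if the client fires a
request window bigger than the policer's burst allowance, the tail gets
dropped, the client waits out its timeout, retransmits, and the
retransmission overruns the policer all over again:

    # hypothetical model: client sends `requests` in one burst, the policer
    # admits at most `burst` of them per round, and the client retransmits
    # the remainder after `timeout` seconds
    def walk_time(requests, burst, interval, timeout):
        t, pending = 0.0, requests
        while pending > 0:
            pending -= min(burst, pending)
            # if anything was dropped we eat a full retry timeout,
            # otherwise just the normal request/response interval
            t += timeout if pending else interval
        return t

    print(walk_time(requests=64, burst=5, interval=0.01, timeout=1.0))
    # -> ~12 s for a 64-request window, vs ~0.13 s if nothing were dropped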

Your complaint about the thing silently accepting a value that can't be
supported in the hardware is 100% valid.  We should not let you say "police
this to a rate of 4 billion", reply "OK, no problem", and then not actually
do it.  Please ask your TAC engineer to file a bug for this ... we might or
might not ever get around to fixing it, but it at least needs to be
documented somewhere.  (I would do it myself, but they deemed me too
dangerous to allow continued access to the DDTS database many years ago...)
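
For what it's worth, what's being asked for is just clamp-and-warn at
commit time, something like this (hypothetical pseudologic, obviously not
the actual XR validation code; HW_MAX_PPS and the function name are made
up):

    HW_MAX_PPS = 50000   # whatever this platform/NPU actually supports

    def validate_policer_rate(requested_pps):
        # warn and clamp instead of silently accepting the impossible
        if requested_pps > HW_MAX_PPS:
            print(f"warning: {requested_pps} pps exceeds the hardware "
                  f"limit; programming {HW_MAX_PPS} pps instead")
            return HW_MAX_PPS
        return requested_pps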

As to why it takes so much longer to do the same thing on a non-management
interface, I'm truly curious about that one.  5 seconds is a bonkers amount
of time on a system like this... my best guess at this point is something
like:
- because the rate limiters are different for mgmt vs non-mgmt, somehow
we're getting a partial completion each "cycle" and we've got tons of
retries in there (rough arithmetic below).
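
If that guess is right, the 5 seconds falls out of a handful of retry
rounds on a roughly 1-second timer.  Reusing the walk_time() sketch from
above (every number here is a guess, not a measured value):

    # ~5 retry rounds on a 1 s timer lands right around the observed 5 s
    print(walk_time(requests=30, burst=5, interval=0.01, timeout=1.0))
    # -> ~5.01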

Drew, did you ever get the output of something like "debug snmp packet" or
whatever it was the TAC guys asked for?  I'd be specifically interested in
comparing those traces for the two {mgmt, not} cases... once the SNMP
process generates its replies, the data plane on the way OUT is pretty much
non-blocking, so I'd want to see if somehow we're pacing the arrival of the
requests into the SNMP process, and/or if it thinks it's generating the
responses in the same amount of time....

--lj


-----Original Message-----
From: Nick Hilliard via NANOG <nanog () lists nanog org> 
Sent: Friday, August 8, 2025 11:39 AM
To: North American Network Operators Group <nanog () lists nanog org>
Cc: Nick Hilliard <nick () foobar org>
Subject: Re: Cisco ASR9902 SNMP polling ... is interesting

Drew Weaver via NANOG wrote on 08/08/2025 14:31:
It couldn't be bothered to simply set it to 50000 if you set it to the
configured maximum of 4294967295.  It couldn't be bothered to simply say:
"Hey, we know the max for this platform is 50000, so we set it to 50000,
but you probably shouldn't be using 50000 for this value anyway."
It could be bothered to do absolutely nothing and silently reject the
command, which made me laugh for about 5 minutes this morning.

Some years ago I was fighting with a low-level pps rate limiter for a
telemetry service on a long-obsolete platform.  The default limit caused
packets to be dropped, and we finally settled on an updated figure based on
the usual compromise of performance vs consequence.  But: if we increased
the limiter above what we had measured to be reasonable, this fairly quickly
caused a performance cliff which affected other services, e.g. snmp / lacp
timeouts, etc., i.e. production impact.  Although this was in the days of
in-house NOS schedulers, I'd be fairly cautious in this area - particularly
on RTOS platforms like XR.

If Cisco have implemented a pps limiter of 50k/s, that's a lot of snmp pps.
Is this a realistic number of requests to be properly serviced per second?
SNMP packet encapsulation / general handling is one thing, but stats
collection / intermediation can be more heavyweight.  Bear in mind that the
failure modes in this sort of situation are often non-linear.
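
As back-of-envelope context for how tight that is (simple arithmetic; the
single-threaded framing is an assumption):

    # at 50k requests/sec, a single-threaded agent has a 20 microsecond
    # budget per request for parsing, varbind lookup, counter fetch, etc.
    budget_us = 1e6 / 50_000
    print(budget_us)   # -> 20.0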

For sure it's a bit annoying that they don't warn that this is the maximum
(possibly a platform / LC limit? i.e. it's possible that this is not a
generic limit across all SPs on all types of unit), but at least the box
won't fall over in production just because someone tweaked a parameter
beyond what the hardware was likely capable of handling.

Nick

_______________________________________________
NANOG mailing list 
https://lists.nanog.org/archives/list/nanog () lists nanog org/message/I4HVKRYUCVQ3QOVSQQXS7D5KWUQWNWDE/

