nanog mailing list archives

Re: Cisco ASR9902 SNMP polling ... is interesting


From: Tom Beecher via NANOG <nanog () lists nanog org>
Date: Sat, 2 Aug 2025 13:45:03 -0400


As I said elsewhere, the control plane was invented to separate management
functions from the data forwarding process. In-band SNMP to data forwarding
interfaces violates that separation.


Uh, no it doesn't.

A control plane is just a separate compute/processing space that isn't used
for traffic forwarding. In many cases, the 'management interface' is, by
itself, not even part of the control plane. It's just a separate forwarding
plane that isn't supposed to be able to send or receive anything from the
'main' forwarding plane. (But as many of us have seen over time, that
isn't always true.)



On Fri, Aug 1, 2025 at 3:07 PM Mel Beckman via NANOG <nanog () lists nanog org>
wrote:

Drew,

As I said elsewhere, the control plane was invented to separate management
functions from the data forwarding process. In-band SNMP to data forwarding
interfaces violates that separation. I’d say all bets are off. As they say
in mathematics, this behavior is undefined. :)


-mel via cell

On Aug 1, 2025, at 11:42 AM, Drew Weaver via NANOG <
nanog () lists nanog org> wrote:

Hi,

Just to correct:

I was saying that 62% of the polls time out and only 38% actually
result in responses, and those 38% of responses take several times longer
to complete when polling on an in-line interface.

This is just with a simple bash script running "time check_interfaces
<args>" from the Nagios-Tools package and doing hundreds of poll runs in a
row with various pauses between pollings.
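
A test loop of the kind described above could be sketched roughly as follows. This is not Drew's actual script. The function name `run_polls`, its parameters, and the placeholder command are assumptions made for illustration, and a plugin that handles its own timeout may report it as a non-zero exit rather than by hanging, which is why failures are counted separately.

```python
import subprocess
import sys
import time


def run_polls(cmd, runs=100, timeout_s=10.0, pause_s=0.0):
    """Run cmd repeatedly and record the outcome and wall-clock time of each run."""
    durations = []  # seconds, for runs that finished before the timeout
    timeouts = 0    # runs killed because they exceeded timeout_s
    failures = 0    # runs that finished but exited non-zero
    for _ in range(runs):
        start = time.monotonic()
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            timeouts += 1
        else:
            durations.append(time.monotonic() - start)
            if result.returncode != 0:
                failures += 1
        time.sleep(pause_s)
    completed = len(durations)
    return {
        "runs": runs,
        "completed": completed,
        "timeouts": timeouts,
        "failures": failures,
        "timeout_pct": 100.0 * timeouts / runs if runs else 0.0,
        "mean_s": sum(durations) / completed if completed else None,
        "max_s": max(durations) if durations else None,
    }


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real poller invocation,
    # e.g. the check_interfaces command line with its arguments.
    print(run_polls([sys.executable, "-c", "pass"], runs=5, timeout_s=5.0))
```

Running the same loop against the management interface and against an in-band interface, with the same pause between polls, would give directly comparable timeout percentages and response times.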

It would be a little less of a concern if any other product did this but
the idea that they just sort of left it 62% broken and shipped it that way
is really making me wonder what else only functions at 38%.

We don't have a huge budget, and the ASR9902 costs almost twice as much
as the Arista devices we would've preferred to buy (the Arista device in
question has 30x100GE ports, and the ASR9902 is basically an 8x100GE router
with a very poorly configured midplane/gearbox that ties into some sort of
switch that nobody at Cisco seems to understand, either).

If we had an unlimited budget we'd just mulligan this thing and buy the
DCS devices that we want but we're stuck with it and if we're stuck with it
I don't think it's insane to expect it to operate at least as well as an
ASR9001.

Thanks,
-Drew




-----Original Message-----
From: Saku Ytti via NANOG <nanog () lists nanog org>
Sent: Friday, August 1, 2025 2:28 PM
To: North American Network Operators Group <nanog () lists nanog org>
Cc: Saku Ytti <saku () ytti fi>
Subject: Re: Cisco ASR9902 SNMP polling ... is interesting

On Fri, 1 Aug 2025 at 16:44, Mel Beckman via NANOG <
nanog () lists nanog org> wrote:

Also, non-management interfaces do packet processing in silicon at the
ASIC level and don’t have the capacity to do anything more than statistical
sampling of packets that require CPU-level processing to retrieve counters
and generate SNMP responses. 62% is as good a sampling rate as any other.

Absolutely not. We expect to process 100% of legitimate control-plane
traffic, e.g. BGP, ISIS, LDP, ARP, SNMP etc.

62% would be devastating.

In fair weather this is easy; in bad weather you need hardware-based
discrimination between expected good traffic and unexpected bad
traffic.
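
As a purely conceptual illustration of that kind of discrimination, here is a toy software model: packets from known, expected sources get a generous rate limit, and everything else shares a tight one. This is not how LPTS or any vendor's hardware is implemented. The class names, the `(proto, src)` key, and the rates are all assumptions made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate_pps: float     # tokens added per second
    burst: float        # maximum number of tokens the bucket can hold
    tokens: float = 0.0
    last: float = 0.0   # timestamp of the previous update, in seconds

    def __post_init__(self):
        # Start with a full bucket so an initial burst is admitted.
        self.tokens = self.burst

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ControlPlanePolicer:
    """Hypothetical two-class policer: expected sources vs. everything else."""

    def __init__(self, expected_sources, expected_pps, default_pps):
        self.expected_sources = set(expected_sources)
        self.expected = TokenBucket(expected_pps, expected_pps)
        self.default = TokenBucket(default_pps, default_pps)

    def admit(self, proto: str, src: str, now: float) -> bool:
        # Traffic from a configured (protocol, source) pair uses the generous
        # bucket; unknown traffic competes for the small default bucket.
        if (proto, src) in self.expected_sources:
            return self.expected.allow(now)
        return self.default.allow(now)


if __name__ == "__main__":
    policer = ControlPlanePolicer({("snmp", "192.0.2.10")}, 100, 1)
    good = sum(policer.admit("snmp", "192.0.2.10", 0.0) for _ in range(50))
    bad = sum(policer.admit("snmp", "203.0.113.99", 0.0) for _ in range(50))
    print(good, bad)
```

The point of the sketch is that a flood of unexpected packets drains only the default bucket, so polls from the configured poller are still admitted.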

Drew is in the right to expect functioning SNMP and is experiencing
significant regression in behaviour compared to previous devices from the
same vendor.


It would take a very long time to explain how to troubleshoot this, as
it is an extremely complicated topic with a lot of nuance that even the
best experts at Cisco are unaware of. I've regularly had TAC handwave
problems away with 'sometimes it be like that' because they didn't want to
do the work. Once, our NOC spent months on a case where TAC was blaming our
QoS configuration for BGP flaps. When I got on it, I escalated it to
Xander, and initially even Xander agreed with TAC that we needed to look
into the QoS configuration, until I reminded him that LPTS is not subject to
QoS or ACLs (which is a terrible design choice, for reasons I'm happy to
elaborate on). That immediately reminded him how LPTS works, and the TAC
case finally got some traction.

This is a completely untenable situation: IOS-XR regularly has
complicated problems that TAC is not equipped to solve, and the expectation
is that the user has deep enough knowledge to rebuff them.


--
 ++ytti
_______________________________________________
NANOG mailing list

https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.nanog.org_archives_list_nanog-40lists.nanog.org_message_KK73RTHMIZXLUMICYPEECO2AQXILKHIQ_&d=DwIGaQ&c=euGZstcaTDllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=OPufM5oSy-PFpzfoijO_w76wskMALE1o4LtA3tMGmuw&m=d_XQ0w1ltWzu7JBKSWfGAfci8ywpv0Vz_Lg6Q-eS5pZAWpgoZ9PBnm_qnf2BAqbd&s=CmbeUcr_Ltz9nrzW2h4l3azL_KBEqloxrF9Rl9GuEpQ&e=
