nanog mailing list archives
Re: BFD vs network brownouts
From: David Zimmerman via NANOG <nanog () nanog org>
Date: Thu, 9 Jan 2025 22:31:49 +0000
Thanks for the feedback, Jason, Saku, Tore, Tom, and Alex. Agreed that trying to brute-force the (mis)use of BFD as I described is misdirected. To some degree I'm trying to reinforce a "why this doesn't work" argument internally as part of a larger narrative.

Thanks specifically to Jason for reminding me about 802.1ag and Y.1731; OAM is where I'll spend some time digging if there's any benefit, as I'm pretty ignorant of that (pretty big) space. Towards Saku's, Tore's, and Tom's comments about watching error counters, I'll keep that in mind, though I expect I'll want to cover situations where frames are simply lost rather than errored: for example (tapping into Alex's point), an L2VPN circuit with carrier underlay congestion where the last-mile circuits are otherwise clean.

-dp

From: David Zimmerman <dzimmerman () linkedin com>
Date: Wednesday, January 8, 2025 at 2:20 PM
To: North American Network Operators' Group <nanog () nanog org>
Subject: BFD vs network brownouts

Hi, all. BFD is well known for what it brings to the table for improving link failure detection; however, even at a reasonably athletic 300ms Control rate, you're not going to catch a significant percentage of brownout situations where you have packet loss but not a full outage. I'm trying to:

1. Find any formal or semi-formal writing quantifying BFD's effectiveness. For example, my mental picture is a 3D graph where, for a given Control rate and corresponding Detection Time, the X axis is the percentage of packet loss, the Y axis is the Control/Detection timer tuple, and the Z axis is the likelihood that BFD will fully engage (i.e., miss all three Control packets). Beyond what I believe is a visualization complexity needing some single malt scotch nearby, letting even a single Control packet through resets your Detection timer.

2. Ask whether folks in the Real World use BFD towards this end, or have other mechanisms as a data-plane loss instrumentation vehicle. For example, in my wanderings I've found an environment that offloads the diagnostic load to adjacent compute nodes, but they reach out to orchestration to trigger further router actions in a full-circle cycle measured in minutes. Short of that, really aggressive BFD timers (solving through brute force) quickly hit platform limits for scale, unless perhaps you can offload the BFD to something inline (e.g., the Ciena 5170 can be dialed down to a 3.3ms Control timer).

Any thoughts appreciated. I'm also pursuing ways of having my internal "customer" signal me upon their own packet-loss observation (e.g., 1% loss for most folks is a TCP retransmission, but 1% loss for them is crying eyeballs and an escalation).

-dp
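[Archive note: as a rough illustration of the Z axis in the mental picture above, the sketch below (not from the thread) estimates the chance that BFD fully engages, i.e. that some run of three consecutive Control packets is lost. It assumes independent Bernoulli loss and a detect multiplier of 3; real-world loss is bursty, which changes the numbers considerably, so treat this as a back-of-the-envelope model only.]

```python
import random

def window_loss_probability(p, multiplier=3):
    """Probability that one specific window of `multiplier` consecutive
    Control packets is entirely lost, assuming independent loss at rate p.
    With the default multiplier of 3 this is simply p**3."""
    return p ** multiplier

def detection_probability(p, n_packets, multiplier=3, trials=20000, seed=42):
    """Monte Carlo estimate of the chance that BFD declares the session
    down at least once while n_packets Control packets are sent, i.e.
    the chance of any run of `multiplier` consecutive losses.
    (At a 300ms Control rate, n_packets=200 is roughly one minute.)"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        run = 0  # current streak of consecutive lost packets
        for _ in range(n_packets):
            run = run + 1 if rng.random() < p else 0
            if run >= multiplier:
                hits += 1
                break
    return hits / trials
```

The single-window figure makes the brownout problem concrete: at 5% loss, any given three-packet window is all-lost with probability 0.05**3 = 0.0000125, so even over a minute of 300ms Control packets BFD almost never fires, while eyeball traffic at 5% loss is already suffering badly.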
Current thread:
- BFD vs network brownouts David Zimmerman via NANOG (Jan 08)
- Re: BFD vs network brownouts Jason Iannone (Jan 08)
- Re: BFD vs network brownouts Saku Ytti (Jan 08)
- Re: BFD vs network brownouts Tom Beecher (Jan 09)
- Re: BFD vs network brownouts Alex Buie (Jan 09)
- Re: BFD vs network brownouts Tom Beecher (Jan 09)
- Re: BFD vs network brownouts Tore Anderson (Jan 09)
- Re: BFD vs network brownouts David Zimmerman via NANOG (Jan 09)
- Re: BFD vs network brownouts Saku Ytti (Jan 09)
- Re: BFD vs network brownouts Jason Iannone (Jan 10)
- Re: BFD vs network brownouts Saku Ytti (Jan 10)
- Re: BFD vs network brownouts Tore Anderson (Jan 12)
- Re: BFD vs network brownouts Saku Ytti (Jan 09)
