Re: BFD vs network brownouts


From: Jason Iannone <jason.iannone () gmail com>
Date: Wed, 8 Jan 2025 20:36:22 -0500

BFD is binary. Service OAM (IEEE 802.1ag / ITU-T Y.1731) generates time
series data that speaks to service reliability and SLAs. OAM offers
interface shutdown and fault propagation as well, which makes it both an
observability tool and an operational one. BFD is just not the thing for
measuring the reliability of network services.
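
To make the binary-vs-time-series contrast concrete: Y.1731 single-ended
loss measurement (ETH-LM) exchanges counters in LMM/LMR frames and yields a
per-interval frame loss figure you can trend. Here's a minimal sketch of the
arithmetic in Python; the counter names (TxFCf, RxFCf, TxFCb, RxFCl) come
from the standard, but the sampling scaffolding around them is hypothetical:

    from dataclasses import dataclass

    @dataclass
    class LmSample:
        """Counters captured when an LMR arrives (one measurement instant)."""
        tx_fcf: int  # TxFCf: our tx counter, stamped into the LMM
        rx_fcf: int  # RxFCf: peer's rx counter when the LMM arrived
        tx_fcb: int  # TxFCb: peer's tx counter when the LMR was sent
        rx_fcl: int  # RxFCl: our rx counter when the LMR arrived

    def frame_loss(prev: LmSample, curr: LmSample) -> tuple[int, int]:
        # Per Y.1731's single-ended LM formulas, loss over the interval
        # between two consecutive LMRs.
        # Far-end loss: frames we sent that the peer never counted.
        far_end = (curr.tx_fcf - prev.tx_fcf) - (curr.rx_fcf - prev.rx_fcf)
        # Near-end loss: frames the peer sent that we never counted.
        near_end = (curr.tx_fcb - prev.tx_fcb) - (curr.rx_fcl - prev.rx_fcl)
        return far_end, near_end

Each LMR gives you one point; poll every second and you have exactly the
loss time series that BFD, being up/down only, can't produce.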


On Wed, Jan 8, 2025, 5:20 PM David Zimmerman via NANOG <nanog () nanog org>
wrote:

Hi, all.  BFD is well known for what it brings to the table for improving
link failure detection; however, even at a reasonably athletic 300ms
Control rate, you're not going to catch a significant percentage of
brownout situations where you have packet loss but not a full outage.  I'm
trying to:



   1. find any formal or semi-formal writing quantifying BFD's
   effectiveness.  For example, my mental picture is a 3D graph where the X
   axis is percentage of packet loss, the Y axis is the Control/Detection
   timer tuple, and the Z axis is the likelihood that BFD will fully engage
   (i.e., miss all three Control packets).  Beyond what I believe is a
   visualization complexity needing some single malt scotch nearby, letting
   even a single Control packet through resets your Detection timer (see the
   rough simulation sketched after this list).
   2. ask if folks in the Real World use BFD towards this end, or have
   other mechanisms as a data plane loss instrumentation vehicle.  For
   example, in my wanderings, I've found an environment that offloads the
   diagnostic load to adjacent compute nodes, but they reach out to
   orchestration to trigger further router actions in a full-circle cycle
   measured in *minutes*.  Short of that, really aggressive timers
   (solving through brute force) on BFD quickly hit platform limits for scale
   unless perhaps you can offboard the BFD to something inline (e.g. the Ciena
   5170 can be dialed down to a 3.3ms Control timer).

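On item 1, the surface is easy to approximate numerically even if it's ugly
to visualize. A rough Monte Carlo sketch in Python (function name and
parameters are illustrative), assuming independent per-packet loss and
treating "fully engage" as the three-consecutive-misses model above; real
brownout loss is bursty, so treat these numbers as a floor:

    import random

    def bfd_trip_probability(loss_rate: float, interval_ms: float,
                             detect_mult: int = 3, window_s: float = 60.0,
                             trials: int = 10_000) -> float:
        """P(BFD declares down at least once during a brownout lasting
        window_s seconds), given i.i.d. per-packet loss. A single delivered
        Control packet resets the Detection timer."""
        packets = int(window_s * 1000 / interval_ms)
        trips = 0
        for _ in range(trials):
            consecutive = 0
            for _ in range(packets):
                if random.random() < loss_rate:
                    consecutive += 1
                    if consecutive >= detect_mult:
                        trips += 1
                        break
                else:
                    consecutive = 0
        return trips / trials

    # 1% loss, 300 ms Control rate, 60 s brownout: ~200 packets, and each
    # 3-packet run trips with probability 0.01^3 = 1e-6, so the overall odds
    # are on the order of 2e-4. BFD essentially never sees this brownout.
    print(bfd_trip_probability(0.01, 300))

    # Dialing down to a 3.3 ms timer (the Ciena 5170 case) gives ~90x more
    # detection windows in the same 60 s, which is why brute force helps but
    # still leaves low-loss brownouts mostly invisible (~2% trip odds here).
    # Fewer trials: this call is slow in pure Python (~18k packets/trial).
    print(bfd_trip_probability(0.01, 3.3, trials=2_000))

The closed form for small loss rates is roughly n * p^k (n packets in the
window, k the detect multiplier), which makes the Z axis of the mental
graph above fall off a cliff as soon as p drops below tens of percent.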


Any thoughts appreciated.  I'm also pursuing ways of having my internal
"customer" signal me upon their own packet loss observation (e.g., 1% loss
for most folks is a TCP retransmission, but 1% loss for them is crying
eyeballs and an escalation).



-dp



