nanog mailing list archives

Re: BFD vs network brownouts

From: Tom Beecher <beecher () beecher cc>
Date: Thu, 9 Jan 2025 14:57:18 -0500


Hi, all.  BFD is well known for what it brings to the table for improving
link failure detection; however, even at a reasonably athletic 300ms
Control rate, you're not going to catch a significant percentage of
brownout situations where you have packet loss but not a full outage.  I'm
trying to:



BFD doesn't improve link failure detection. It's the exact opposite; it's
there to detect *reachability* failure faster than the protocols themselves
would, in those cases where a link failure, which would otherwise do the
same thing, does NOT occur.

Beyond that, I agree with Jason and Saku that BFD is not the correct tool
for what you're trying to achieve anyways. Aside from monitoring interface
counters, there are software options out there to detect loss % on explicit
paths that would suit your need much better.



On Thu, Jan 9, 2025 at 2:58 AM Saku Ytti <saku () ytti fi> wrote:

On Thu, 9 Jan 2025 at 00:23, David Zimmerman via NANOG <nanog () nanog org>
wrote:

find any formal or semi-formal writing about quantification of BFD's
effectiveness.  For example, my mental picture is a 3D graph where, for a
given Control rate and corresponding Detection Time, the X axis is
percentage of packet loss, the Y axis is the Control/Detection timer tuple,
and the Z axis is the likelihood that BFD will fully engage (i.e., missing
all three Control packets).  Beyond what I believe is a visualization
complexity needing some single malt scotch nearby, letting even a single
Control packet through resets your Detection timer.
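
The likelihood described above can be put into rough numbers. The following is a minimal sketch, assuming each Control packet is dropped independently with the same probability (real brownouts are usually bursty, so treat this as a simplified model and not a measurement), and assuming a detect multiplier of 3 to match the "all three Control packets" wording. The function names are illustrative only.

```python
def detection_time_ms(interval_ms: float, multiplier: int = 3) -> float:
    """Detection Time = Control interval x detect multiplier."""
    return interval_ms * multiplier


def p_window_lost(loss: float, multiplier: int = 3) -> float:
    """Probability that `multiplier` consecutive Control packets are all lost,
    assuming independent drops with probability `loss` (an assumption)."""
    return loss ** multiplier


def expected_seconds_until_down(loss: float, interval_ms: float,
                                multiplier: int = 3) -> float:
    """Expected time until the first run of `multiplier` consecutive losses.

    Uses the standard result for the expected number of Bernoulli trials
    needed to see a run of m successes: (p**-m - 1) / (1 - p).
    """
    if not 0.0 < loss < 1.0:
        raise ValueError("loss must be strictly between 0 and 1")
    packets = (loss ** -multiplier - 1.0) / (1.0 - loss)
    return packets * interval_ms / 1000.0


if __name__ == "__main__":
    # Sweep the packet-loss axis for a fixed 300 ms Control interval.
    for loss in (0.01, 0.05, 0.10, 0.50):
        t = expected_seconds_until_down(loss, interval_ms=300.0)
        print(f"loss={loss:.0%}  P(3 in a row)={p_window_lost(loss):.2e}  "
              f"expected time to Down ~ {t:,.0f} s")
```

Under these assumptions, 1% loss at a 300 ms interval gives a one-in-a-million chance per three-packet window and an expected wait of roughly 3.5 days before the session drops, which illustrates why a single Control packet getting through is enough to hide a brownout.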

ask if folks in the Real World use BFD towards this end, or have other
mechanisms as a data plane loss instrumentation vehicle.  For example, in
my wanderings, I've found an environment that offloads the diagnostic load
to adjacent compute nodes, but they reach out to orchestration to trigger
further router actions in a full-circle cycle measured in minutes.  Short
of that, really aggressive timers (solving through brute force) on BFD
quickly hit platform limits for scale unless perhaps you can offboard the
BFD to something inline (e.g. the Ciena 5170 can be dialed down to a 3.3ms
Control timer).




Any thoughts appreciated.  I'm also pursuing ways of having my internal
"customer" signal me upon their own packet loss observation (e.g. 1% loss
for most folks is a TCP retransmission, but 1% loss for them means crying
eyeballs and an escalation).

I agree with what Jason wrote, that this is not what BFD was designed for.

In SONET/SDH, and even WAN-PHY, you could declare an interface down if the
BER went beyond a threshold you consider acceptable. For more modern
interfaces your best bet is RS-FEC, with the pre-FEC error rate as a
predictor, possibly in a multimetric decision that also includes DDM data
and projections. To my knowledge vendors currently don't have software
support to assert RFI on pre-FEC counters; in fact, last time I looked you
couldn't even SNMP GET FEC counters, for which I opened Enhancement
Requests with vendors. So today you'd need to do this with screen scraping
and a manual interface down, which is a much bigger hammer than RFI
assertion.
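
As a rough illustration of the pre-FEC predictor idea, here is a minimal sketch of the decision logic only. It assumes you have already obtained a cumulative "corrected bits" counter by some means (how you scrape it is platform-specific and not shown), and the counter semantics, the threshold, and the function names are all hypothetical placeholders, not any vendor's API or a recommended value.

```python
from typing import List, Optional


def pre_fec_ber(prev_corrected_bits: int, curr_corrected_bits: int,
                line_rate_bps: float, interval_s: float) -> Optional[float]:
    """Estimate pre-FEC BER over one polling interval.

    Approximates BER as corrected bits divided by bits carried at line rate.
    Returns None if the counter went backwards (reset or wrap) or the inputs
    are not usable, so the caller can skip that sample.
    """
    delta = curr_corrected_bits - prev_corrected_bits
    if delta < 0 or interval_s <= 0 or line_rate_bps <= 0:
        return None
    return delta / (line_rate_bps * interval_s)


def should_flag(samples: List[Optional[float]], threshold: float,
                consecutive: int = 3) -> bool:
    """True when the last `consecutive` valid samples all exceed `threshold`.

    Requiring several samples in a row is a simple guard against acting on a
    single noisy poll; the count of 3 is an arbitrary illustrative choice.
    """
    recent = [s for s in samples if s is not None][-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)


if __name__ == "__main__":
    # Hypothetical counter readings taken 10 s apart on a 100G interface.
    readings = [0, 2_000_000_000, 4_500_000_000, 7_500_000_000]
    bers = [pre_fec_ber(a, b, 100e9, 10.0)
            for a, b in zip(readings, readings[1:])]
    print(bers, should_flag(bers, threshold=1e-3))
```

What to do once the check fires (alert, drain, or the manual interface down mentioned above) is left to the operator; DDM readings could be folded in as additional inputs to the same decision.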

--
  ++ytti
