nanog mailing list archives

Re: BFD vs network brownouts


From: Saku Ytti <saku () ytti fi>
Date: Thu, 9 Jan 2025 09:56:05 +0200

On Thu, 9 Jan 2025 at 00:23, David Zimmerman via NANOG <nanog () nanog org> wrote:

> [...] find any formal or semi-formal writing about quantification of
> BFD's effectiveness.  For example, my mental picture is a 3D graph
> where, for a given Control rate and corresponding Detection Time, the
> X axis is percentage of packet loss, the Y axis is the
> Control/Detection timer tuple, and the Z axis is the likelihood that
> BFD will fully engage (i.e., missing all three Control packets).
> Beyond what I believe is a visualization complexity needing some
> single malt scotch nearby, letting even a single Control packet
> through resets your Detection timer.
>
> [...] ask if folks in the Real World use BFD towards this end, or have
> other mechanisms as a data plane loss instrumentation vehicle.  For
> example, in my wanderings, I've found an environment that offloads
> the diagnostic load to adjacent compute nodes, but they reach out to
> orchestration to trigger further router actions in a full-circle
> cycle measured in minutes.  Short of that, really aggressive timers
> (solving through brute force) on BFD quickly hit platform limits for
> scale unless perhaps you can offboard the BFD to something inline
> (e.g. the Ciena 5170 can be dialed down to a 3.3ms Control timer).
>
> Any thoughts appreciated.  I'm also pursuing ways of having my
> internal "customer" signal me upon their own packet loss observation
> (e.g. 1% loss for most folks is a TCP retransmission, but 1% loss for
> them is crying eyeballs and an escalation).

I agree with what Jason wrote, that this is not what BFD was designed for.
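
A rough way to put numbers on that, as a sketch only: assume control
packet loss is independent per packet (real brownouts are usually
bursty, so this is an assumption, not a measurement) and that BFD only
fires when detect-multiplier consecutive Control packets are lost.
The function names below are made up for illustration.

```python
def miss_all_probability(loss: float, detect_mult: int = 3) -> float:
    """Chance that detect_mult consecutive control packets are all lost,
    assuming independent per-packet loss (an assumption)."""
    return loss ** detect_mult


def expected_packets_until_detection(loss: float, detect_mult: int = 3) -> float:
    """Expected number of control packets sent before the first run of
    detect_mult consecutive losses (standard run-length expectation)."""
    if not 0.0 < loss < 1.0:
        raise ValueError("loss must be strictly between 0 and 1")
    q = loss ** detect_mult
    return (1.0 - q) / ((1.0 - loss) * q)


def expected_seconds_until_detection(
    loss: float, interval_s: float, detect_mult: int = 3
) -> float:
    """Same expectation, converted to seconds for a given control interval."""
    return expected_packets_until_detection(loss, detect_mult) * interval_s


if __name__ == "__main__":
    # 3.3 ms is the interval mentioned in the thread; the others are
    # arbitrary comparison points.
    for interval_ms in (300.0, 50.0, 3.3):
        secs = expected_seconds_until_detection(0.01, interval_ms / 1000.0)
        print(f"1% loss, {interval_ms} ms interval: ~{secs:.0f} s to detect")
```

Under those assumptions, 1% loss with a multiplier of 3 averages
roughly a million Control packets between detections, which is on the
order of an hour even at a 3.3ms interval.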

In SONET/SDH, and even WAN-PHY, you could declare the interface down if
the BER went beyond a threshold you consider acceptable. For more modern
interfaces your best bet is RS-FEC, with the pre-FEC error rate as a
predictor, possibly as part of a multimetric decision that also
includes DDM data and projections. To my knowledge vendors currently
don't have software support to assert RFI based on pre-FEC counters; in
fact, last time I looked you couldn't even SNMP GET FEC counters, for
which I opened Enhancement Requests with vendors. So today you'd need
to do this with screen scraping and a manual interface down, which is a
much bigger hammer than RFI assertion.
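
As a sketch of the decision logic only: assume you have somehow
obtained two samples of a "corrected bits" FEC counter (via scraping
or otherwise), and that the platform counts bits rather than symbols
or codewords. The function names, the threshold and the
"N consecutive bad samples" rule are all hypothetical choices, not
anything a vendor ships.

```python
from typing import Optional, Sequence


def pre_fec_ber(
    prev_corrected_bits: int,
    curr_corrected_bits: int,
    interval_s: float,
    line_rate_bps: float,
) -> Optional[float]:
    """Estimate pre-FEC BER from two samples of a corrected-bits counter.

    Returns None if the counter went backwards (reset or wrap), so the
    caller can skip that sample instead of acting on garbage.
    """
    if interval_s <= 0 or line_rate_bps <= 0:
        raise ValueError("interval and line rate must be positive")
    delta = curr_corrected_bits - prev_corrected_bits
    if delta < 0:
        return None
    return delta / (line_rate_bps * interval_s)


def should_flag_interface(
    ber_samples: Sequence[Optional[float]],
    threshold: float,
    consecutive: int = 3,
) -> bool:
    """True only if the last `consecutive` samples all exceed the threshold.

    Requiring several bad samples in a row is a crude guard against
    flapping on a single noisy reading.
    """
    recent = ber_samples[-consecutive:]
    return len(recent) == consecutive and all(
        b is not None and b > threshold for b in recent
    )
```

What to do when this returns True (page someone, cost the link out,
or shut the interface) is a separate policy question.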

-- 
  ++ytti

