nanog mailing list archives

Re: Distributed Router Fabrics


From: Christopher Morrow <morrowc.lists () gmail com>
Date: Wed, 1 Jan 2025 15:30:25 -0500

On Tue, Dec 24, 2024 at 8:26 AM Mike Hammett <nanog () ics-il net> wrote:

> In the articles I've read and videos I've watched, they have mentioned
> varying amounts of reduced power. I didn't commit them to memory because
> that wasn't the part I was interested in at the moment.


I'd think that, especially as data rates climb, the power consumption is
going to get really important, fast.
When a single device requires ~50kW to run ... I think you'll want to make
sure you have space/power to deal with that :(

I'm not sure that distributed fabric plans make that problem better? (maybe
it's all the same problem in the end because the fabric interconnect is
going to be distance limited/etc too)


> Management of the things is a big thing I've been concerned about going
> into more modern systems. So often there's hand waving regarding the
> orchestration piece of non-traditional systems. From what I've seen (and I
> would love to be wrong), you either build it in-house (not a small lift) or
> you buy something that ends up taking away all of the cost advantages that
> path had.


You almost certainly get into (pretty quickly) something that smells a
bunch like:
  "here's my pile of ansible recipes for...."
  (choice of ansible here for example only, s/ansible/<whatever>/ of course
to whatever you feel like)

That's maybe fine if that's your jam? I think it's hard to
reason/plan/build without some automation plan 'now',
and it looks like a ton of folk start without that then try to retrofit
once: "omg this is very large now... ugh" happens.
  (1-10 devices? sure fine do it by hand, 5-><bunches more> you really
ought to have had an automation plan at ~5 ... my opinion clearly)


> Failure domain stuff is part of what I'm trying to learn more about, which
> goes back to more about the fundamentals of how the fabric works.


Yea... this part (reasoning about failure domains) I assume is also a tad
hard.
A scenario is:
  "I built this 200T fabric, I interconnect to the outside with ~100T max
and internally with ~100T"
now that ~100T breaks and (ideally!) everything on the outside re-routes
around to a different front-door... oops are you prepared for an extra
~100T arriving?
How do you deal with parts (fabric parts) failing in part? "oops only 50T
of my 100T can get through here and ... I also am still telling my external
neighbors all's good"
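
To put rough numbers on the scenario above, here is a minimal Python sketch.
Everything in it is an assumption made up for illustration: the door names,
the 80T loads, the even spreading of re-routed traffic (real re-routing
follows routing policy, not an even split), and both function names. Only the
~100T and 50T figures come from the scenario.

```python
def overload_after_failure(doors, failed):
    """Spread the failed front door's load evenly over the survivors and
    return each survivor's shortfall in Tbps (0.0 means the load fits)."""
    survivors = {name: d for name, d in doors.items() if name != failed}
    if not survivors:
        raise ValueError("no surviving front doors")
    # Assumption: re-routed traffic divides evenly among the survivors.
    extra = doors[failed]["load"] / len(survivors)
    return {
        name: max(0.0, d["load"] + extra - d["capacity"])
        for name, d in survivors.items()
    }


def usable_capacity(external, internal_working):
    """A front door can only usefully accept what the fabric behind it
    can still carry inward."""
    return min(external, internal_working)


# Hypothetical deployment: three ~100T front doors, each carrying 80T.
doors = {
    "door_a": {"capacity": 100.0, "load": 80.0},
    "door_b": {"capacity": 100.0, "load": 80.0},
    "door_c": {"capacity": 100.0, "load": 80.0},
}

# door_a fails entirely: each survivor is asked to take an extra 40T.
shortfall = overload_after_failure(doors, "door_a")

# Partial fabric failure: 100T of external links, only 50T gets through.
usable = usable_capacity(external=100.0, internal_working=50.0)
```

With these made-up loads each surviving door ends up 20T short, and the
partially failed door can really only use 50T of its 100T of external
capacity. That gap between what is announced and what can be delivered is
what the neighbors would need to hear about.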

Really that failure-domain problem is tightly linked to the 'manage a ton
of things' problem too.. at least for containing damage in a quick manner.
