InfiniBand vs Ethernet: Choosing the Right Interconnect for Your GPU Cluster
The networking interconnect inside a GPU cluster is one of the most consequential infrastructure decisions an AI team will make, and it is also one of the most misunderstood. Both InfiniBand NDR and 400GbE Ethernet can push 400 Gb/s per port, so the raw bandwidth numbers look identical on a spec sheet. The real difference shows up in latency, the metric that determines whether all-reduce operations across your GPU cluster complete in tens of microseconds or in hundreds. At small scale that gap barely registers. At 256 GPUs it is the difference between a training run that finishes in three days and one that takes five.
InfiniBand NDR delivers end-to-end latency of roughly 1.0 to 1.3 microseconds for small messages, while RoCE v2 over 400GbE Ethernet lands between 3.5 and 5.0 microseconds for the same payload. These numbers come from RDMA write benchmarks at 64-byte message sizes. They look close on paper, but all-reduce is not a single operation. It is a chain of dependent communications across every GPU in the job. At 64 nodes, each running 8 GPUs, a ring all-reduce touches 512 endpoints. The per-hop latency compounds through every stage of the reduction, and InfiniBand's lower, more consistent tail latency keeps the entire collective tight. Measurements on a 512-GPU H100 cluster show all-reduce completion times of 47 microseconds over InfiniBand NDR versus 210 microseconds over RoCE. That is roughly a 4.5x gap on a single collective operation, and a large language model training step can execute hundreds of collectives per iteration. Teams weighing total cost of ownership often overlook how much interconnect latency inflates their GPU-hour spend at scale.
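To get a feel for how the per-collective gap adds up over a training step, the arithmetic can be sketched in a few lines of Python. The 47 and 210 microsecond figures are the ones quoted above. The count of 300 collectives per step is an assumed stand-in for "hundreds", the function name is hypothetical, and the model pessimistically assumes collectives run back to back with no overlap with compute.

```python
def collective_time_us(latency_us: float, collectives_per_step: int) -> float:
    """Time spent in collectives per training step, in microseconds.

    Simplification: collectives are serialized and never overlap with compute.
    """
    if collectives_per_step < 0:
        raise ValueError("collectives_per_step must be non-negative")
    return latency_us * collectives_per_step


IB_ALLREDUCE_US = 47.0       # quoted above for InfiniBand NDR
ROCE_ALLREDUCE_US = 210.0    # quoted above for RoCE
COLLECTIVES_PER_STEP = 300   # assumption: illustrative value for "hundreds"

ib_us = collective_time_us(IB_ALLREDUCE_US, COLLECTIVES_PER_STEP)
roce_us = collective_time_us(ROCE_ALLREDUCE_US, COLLECTIVES_PER_STEP)

ratio = ROCE_ALLREDUCE_US / IB_ALLREDUCE_US
extra_ms_per_step = (roce_us - ib_us) / 1000.0

print(f"per-collective gap: {ratio:.2f}x")
print(f"extra communication time per step: {extra_ms_per_step:.1f} ms")
```

Under these assumptions the slower fabric adds about 49 ms of communication to every step. Real frameworks overlap part of that with computation, so treat the result as an upper bound rather than a prediction.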
At 16 GPUs or fewer, Ethernet is almost always the right call. The communication overhead at this scale is a small fraction of total step time, and the latency difference between InfiniBand and RoCE costs only single-digit percentage points of end-to-end throughput. A 16-GPU training job on a well-configured RoCE fabric will achieve 92 to 95 percent of the throughput you would see on InfiniBand. InfiniBand adds roughly 15 to 20 percent to total cluster cost at this scale, and that premium does not pay for itself. For teams at this stage of capacity planning, spending the savings on additional compute is a better investment.
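The trade-off at this scale can be checked with a rough cost-per-unit-of-work calculation. The sketch below uses only the ranges quoted above (a 15 to 20 percent InfiniBand premium, 92 to 95 percent relative throughput on RoCE). The function name is hypothetical, and treating total cluster cost as a proxy for cost per GPU-hour is a simplifying assumption.

```python
def cost_per_unit_work(relative_cost: float, relative_throughput: float) -> float:
    """Cluster cost divided by delivered throughput, both relative to a baseline."""
    if relative_throughput <= 0:
        raise ValueError("relative_throughput must be positive")
    return relative_cost / relative_throughput


# InfiniBand: 15-20% cost premium, throughput taken as the 1.0 reference.
ib_best = cost_per_unit_work(1.15, 1.0)
ib_worst = cost_per_unit_work(1.20, 1.0)

# Ethernet (RoCE): baseline cost, 92-95% of InfiniBand throughput.
eth_best = cost_per_unit_work(1.0, 0.95)
eth_worst = cost_per_unit_work(1.0, 0.92)

print(f"InfiniBand: {ib_best:.3f} to {ib_worst:.3f}")
print(f"Ethernet:   {eth_best:.3f} to {eth_worst:.3f}")
```

Even Ethernet's worst case (about 1.09) undercuts InfiniBand's best case (1.15), which is why the premium is hard to justify for a 16-GPU job.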
The calculus changes at 64 GPUs. At this scale, communication overhead grows to 15 to 25 percent of total step time for transformer models with large hidden dimensions. The lower latency of InfiniBand begins to matter in a measurable way. Teams that move from Ethernet to InfiniBand at this GPU count routinely recover 18 to 22 percent of their training throughput. That translates directly to fewer GPU-hours per training run and a lower total cost, even after accounting for the interconnect premium. If you are training models above 30 billion parameters on 64 or more GPUs, InfiniBand is not a luxury. It is a cost optimization. Many organizations running a bare-metal GPU deployment at this tier find the interconnect upgrade pays for itself within the first major training run, especially when factoring in the cost of downtime and wasted compute from stragglers on slower fabrics.
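One way to sanity-check the 18 to 22 percent figure is an Amdahl-style estimate: only the communication share of the step gets faster. The sketch below assumes the 15 to 25 percent overhead is measured on the Ethernet fabric, and it borrows the roughly 4.5x collective gap quoted earlier for a 512-GPU cluster as a stand-in for the speedup at 64 GPUs. Both are assumptions, and the function name is hypothetical.

```python
def step_speedup(comm_fraction: float, comm_speedup: float) -> float:
    """Overall step speedup when only the communication share gets faster.

    comm_fraction: share of step time spent communicating on the slower fabric.
    comm_speedup:  factor by which the faster fabric shortens that share.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)


COMM_SPEEDUP = 4.5  # assumption: reuses the 512-GPU collective gap quoted earlier

low = step_speedup(0.15, COMM_SPEEDUP)
high = step_speedup(0.25, COMM_SPEEDUP)

print(f"estimated throughput gain: {(low - 1) * 100:.0f}% to {(high - 1) * 100:.0f}%")
```

This simple model gives a gain of roughly 13 to 24 percent, a range that brackets the 18 to 22 percent teams report, so the claim is at least consistent with the overhead numbers.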
At 256 GPUs and above, InfiniBand is effectively mandatory for distributed training. The communication-to-computation ratio at this scale makes every microsecond of collective latency visible in wall-clock time. Adaptive routing and congestion control in InfiniBand fabrics handle the traffic patterns of large-scale all-reduce far more gracefully than ECN-based congestion management in Ethernet. RoCE fabrics at 256 GPUs suffer from tail latency spikes that create stragglers: a single slow node holds up the entire synchronous training step. InfiniBand's credit-based flow control virtually eliminates this problem. The throughput gap at 256 GPUs widens to 30 to 40 percent in favor of InfiniBand, making Ethernet a false economy at this tier. For teams assembling a GPU cluster of this size, choosing the right hardware generation matters just as much as the interconnect, and the two decisions should be made together.
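The straggler effect grows with node count because a synchronous step waits for the slowest participant. A minimal sketch of that reasoning follows. It assumes each node independently hits a latency spike with some small probability per step. The 1 percent value is purely illustrative, not a measurement. The 8 GPUs per node matches the configuration mentioned earlier, and the function name is hypothetical.

```python
def prob_step_delayed(num_nodes: int, spike_prob: float) -> float:
    """Probability that at least one node spikes during a synchronous step.

    Assumes spikes are independent across nodes.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if not 0.0 <= spike_prob <= 1.0:
        raise ValueError("spike_prob must be between 0 and 1")
    return 1.0 - (1.0 - spike_prob) ** num_nodes


GPUS_PER_NODE = 8   # node size mentioned earlier in the text
SPIKE_PROB = 0.01   # assumption: illustrative per-node spike probability

for gpus in (16, 64, 256):
    nodes = gpus // GPUS_PER_NODE
    delayed = prob_step_delayed(nodes, SPIKE_PROB)
    print(f"{gpus:>3} GPUs ({nodes:>2} nodes): {delayed:.1%} of steps delayed")
```

With these assumed inputs, about 2 percent of steps are delayed at 16 GPUs and about 27.5 percent at 256 GPUs. The per-node behavior is identical in both cases. Only the number of nodes that must all finish on time has changed.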
Inference is a different story entirely. Inference workloads are embarrassingly parallel across requests and do not require tight inter-node communication. A well-tuned Ethernet fabric is perfectly adequate for serving, even at hundreds of GPUs, because each request is typically handled by a single node or a small tensor-parallel group. Spending on InfiniBand for an inference-only GPU cloud deployment is almost never justified unless that same cluster also handles training jobs. Teams running multi-provider strategies often split their inference traffic across Ethernet-connected bare-metal GPU nodes at multiple partners while concentrating training workloads on a single InfiniBand-connected GPU cluster.
The lesson from provisioning hundreds of clusters across our partner network is that the interconnect decision is a function of scale and workload, not a blanket preference. Below 64 GPUs, start with Ethernet and invest the savings in more compute. At 64 GPUs and above for training, InfiniBand pays for itself in reduced GPU-hours. At 256 GPUs and above, do not even consider Ethernet for training workloads. The numbers make the decision straightforward once you stop treating interconnect as a line item and start treating it as a throughput multiplier for your GPU cluster.
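The rule of thumb above is simple enough to write down as a function. This is only a sketch of the thresholds stated in this article, not a sizing tool. The function name, the workload labels, and the return strings are all hypothetical.

```python
def recommend_interconnect(num_gpus: int, workload: str) -> str:
    """Encode the article's rule of thumb for interconnect choice.

    workload: "training" or "inference". A cluster that serves both should
    be evaluated as "training".
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if workload == "inference":
        return "ethernet"
    if workload != "training":
        raise ValueError("workload must be 'training' or 'inference'")
    if num_gpus < 64:
        return "ethernet"
    if num_gpus < 256:
        return "infiniband (pays for itself in reduced GPU-hours)"
    return "infiniband (Ethernet not recommended for training)"


for gpus, workload in [(16, "training"), (64, "training"),
                       (256, "training"), (512, "inference")]:
    print(f"{gpus:>3} GPUs, {workload}: {recommend_interconnect(gpus, workload)}")
```

A mixed cluster that serves inference but also runs training jobs should be passed in as "training", since the training traffic is what drives the interconnect requirement.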