I tried using round-robin bonding (Linux's balance-rr mode) to connect two Linux machines together, as they both support it. While I did get more available bandwidth than a single link, the result was not spectacular.
I bonded together two 10G links, in theory allowing 20 Gbps between the hosts. Sometimes I would indeed get close to this, but a lot of the time I would only get 11-15 Gbps.
The problem is that the packets don't always arrive in order, since each link delivers them with slightly different timing. For TCP, reordering can look like loss: the receiver thinks a packet has gone missing and asks for a retransmit, which lowers overall bandwidth and increases latency. For bulk transfers like file servers you still end up with a higher overall transfer rate than a single link, so it's not the end of the world.
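To make that concrete, here's a rough Python sketch of the duplicate-ACK idea (a toy model, nothing like a real TCP stack): the receiver keeps acknowledging the next packet it still needs, and after three duplicate ACKs the sender retransmits a packet that was merely delayed, not lost.

```python
# Toy model of TCP's duplicate-ACK heuristic (illustrative only, not real TCP).
# Segments are identified by sequence number; the receiver ACKs the next
# sequence number it still needs. Three duplicate ACKs trigger a retransmit.

def simulate(arrival_order):
    expected = 0            # next in-order sequence number the receiver wants
    buffered = set()        # out-of-order segments held by the receiver
    last_ack, dup_acks, retransmits = 0, 0, 0

    for seq in arrival_order:
        if seq == expected:
            expected += 1
            while expected in buffered:        # drain anything already buffered
                buffered.discard(expected)
                expected += 1
        else:
            buffered.add(seq)

        ack = expected                         # cumulative ACK
        if ack == last_ack:
            dup_acks += 1
            if dup_acks == 3:                  # fast-retransmit threshold
                retransmits += 1               # resend a segment that was only
                dup_acks = 0                   # delayed, not actually lost
        else:
            last_ack, dup_acks = ack, 0

    return retransmits

# Segment 0 is delayed behind four later segments (say it went over the other
# link), so the sender sees enough duplicate ACKs to retransmit it anyway.
print(simulate([1, 2, 3, 4, 0]))   # -> 1 spurious retransmit
print(simulate([0, 1, 2, 3, 4]))   # -> 0 when everything arrives in order
```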
However, for protocols that don't guarantee delivery, like UDP, and for applications such as realtime video streaming, it's a disaster. To keep latency as low as possible, these receivers tend to process the latest packet available and discard anything earlier that hasn't arrived yet; they don't want to wait for all the packets to arrive first, as that can introduce a lot of delay. While a lost packet here and there is no problem, with round-robin you can get extremely high numbers of packets arriving out of order, which can completely corrupt the data stream and, for video, make it unwatchable!
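As a sketch of how badly that discard policy interacts with heavy reordering, here's a hypothetical latency-first receiver in Python that plays the newest packet available and drops anything older:

```python
# Hypothetical latency-first receiver: play the newest packet available and
# drop anything that arrives with an older sequence number (typical of
# realtime video/audio pipelines that refuse to wait for stragglers).

def play_stream(arrival_order):
    newest_played = -1
    played, dropped = [], []
    for seq in arrival_order:
        if seq > newest_played:
            newest_played = seq
            played.append(seq)
        else:
            dropped.append(seq)      # too late, the stream has moved on
    return played, dropped

print(play_stream([0, 1, 2, 3, 4, 5, 6, 7]))   # everything plays, nothing dropped
print(play_stream([3, 2, 1, 0, 7, 6, 5, 4]))   # only 3 and 7 play: 75% discarded
```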
Imagine the source machine is sending a single stream out of four bonded NICs in the order NIC1, NIC2, NIC3, NIC4, NIC1, and so on. Now imagine the receiving machine processes the packets in the opposite order, first checking NIC4, then NIC3, NIC2, NIC1, then NIC4 again. The packet it picks up first will be from NIC4, which is the newest, and the next three packets off NIC3, NIC2 and NIC1 will all be older. That's 75% of the packets arriving out of order, which could cause a realtime video stream to discard 75% of its packets, easily making it unwatchable.
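A quick simulation of that exact scenario (hypothetical NIC queues, with the receiver polling them in the reverse order described above) shows where the 75% figure comes from:

```python
from collections import deque

# The sender deals packets 0..N-1 out across four NIC queues round-robin:
# NIC1 gets 0, 4, 8, ...; NIC2 gets 1, 5, 9, ...; and so on.
NUM_NICS, N = 4, 20
nics = [deque() for _ in range(NUM_NICS)]
for seq in range(N):
    nics[seq % NUM_NICS].append(seq)

# The receiver services the NICs in the opposite order: NIC4, NIC3, NIC2, NIC1, ...
received = []
while any(nics):
    for nic in reversed(nics):
        if nic:
            received.append(nic.popleft())

print(received)   # [3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, ...]

# Count packets that turn up after a newer packet has already been seen.
newest, late = -1, 0
for seq in received:
    if seq < newest:
        late += 1
    else:
        newest = seq
print(f"{late / N:.0%} of packets arrive after a newer one")   # -> 75%
```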
The only way this can work is if the receiving hardware is designed to store the packets until a complete set has arrived and then forward them on in order, which increases latency. It would also mean a switch has to decide how long to wait before declaring a packet lost and giving up on it, so before you know it you're implementing a cut-down, TCP-style guaranteed-delivery protocol on top of Ethernet. Then you have to consider what happens when someone tries to bond one 10G link and four 1G links, hoping to get 14G. The 10G link will deliver packets so much faster that the ones coming in over the 1G links look lost, even though they are just taking ten times longer to transmit. So how do you tell the difference between a packet that has been lost and one that is simply taking ages to arrive across a slower link?
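To see how quickly that turns into a mini reliability protocol, here's a hypothetical reorder-buffer sketch in Python: it holds packets back until the missing sequence numbers turn up, and it has to pick an arbitrary timeout after which it declares them lost and moves on.

```python
import time

# Hypothetical reorder buffer a switch/NIC would need for round-robin bonding:
# hold out-of-order packets, release them in sequence, and give up on a missing
# sequence number after a timeout -- at which point you are halfway to
# reimplementing TCP's sequencing and loss detection on top of Ethernet.

class ReorderBuffer:
    def __init__(self, loss_timeout=0.010):   # 10 ms is an arbitrary guess:
        self.loss_timeout = loss_timeout      # too short for slow links,
        self.next_seq = 0                     # too long for realtime traffic
        self.pending = {}                     # seq -> packet held back
        self.waiting_since = None             # when we started waiting for next_seq

    def push(self, seq, packet):
        """Accept a packet and return whatever is now safe to forward, in order."""
        self.pending[seq] = packet
        return self._drain()

    def _drain(self):
        out = []
        while True:
            if self.next_seq in self.pending:
                out.append(self.pending.pop(self.next_seq))
                self.next_seq += 1
                self.waiting_since = None
            elif self.pending:                 # a gap: start (or keep) waiting
                if self.waiting_since is None:
                    self.waiting_since = time.monotonic()
                if time.monotonic() - self.waiting_since > self.loss_timeout:
                    self.next_seq += 1         # declare it lost... or was it
                    self.waiting_since = None  # just slow on a 1G link?
                    continue
                break
            else:
                break
        return out

buf = ReorderBuffer()
print(buf.push(1, "pkt1"))   # []                 -> held back, waiting for pkt0
print(buf.push(0, "pkt0"))   # ['pkt0', 'pkt1']   -> released in order
```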
Long story short, it's complicated to make it work.
The 40G QSFP standard uses four 10 Gbps lanes in parallel to achieve 40 Gbps of bandwidth, but this only works because the splitting isn't done at the packet level. The sending switch splits each packet across the four lanes in parallel and the receiving switch reassembles it at the far end, which keeps every packet in order and the latency low. This is really the only way to do it properly, but it means coming up with a new standard, as they did with QSFP, because the data going over the wire is no longer standard Ethernet and isn't backwards compatible.
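As a heavily simplified illustration (as I understand it, real 40G Ethernet stripes 64b/66b encoded blocks across the lanes with alignment markers, not raw bytes as done here), splitting each frame below the packet level makes ordering a non-issue, because every lane carries slices of the same packet and nothing is forwarded until it has been reassembled:

```python
# Simplified picture of lane striping: each frame is chopped into fixed-size
# chunks that are dealt out across four lanes, and the far end re-interleaves
# them in the same order, so whole packets come out exactly as they went in.

LANES = 4
CHUNK = 8  # bytes per chunk per lane (arbitrary for this sketch)

def stripe(frame: bytes) -> list[list[bytes]]:
    """Deal the frame's chunks out across the lanes, round-robin."""
    chunks = [frame[i:i + CHUNK] for i in range(0, len(frame), CHUNK)]
    lanes = [[] for _ in range(LANES)]
    for i, chunk in enumerate(chunks):
        lanes[i % LANES].append(chunk)
    return lanes

def reassemble(lanes: list[list[bytes]]) -> bytes:
    """Re-interleave the chunks in the order they were dealt out."""
    out = bytearray()
    for i in range(max(len(lane) for lane in lanes)):
        for lane in lanes:
            if i < len(lane):
                out += lane[i]
    return bytes(out)

frame = bytes(range(100))
assert reassemble(stripe(frame)) == frame   # the packet survives intact, in order
```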