IPv6 ECMP Instability in the Linux Kernel
2026-03-29
I recently discovered an unfortunate behavior in the Linux kernel's IPv6 stack.
The Setup
In our infrastructure we have a management Kubernetes cluster which
has three control-plane nodes. As we do not want to expose this cluster
to the internet we rely on our router as an SSH jump host. Operators
connect to the router using ssh -D 1080 $router which opens
up a local SOCKS5 proxy to connect to endpoints within the
infrastructure.
To ensure the cluster stays reachable while we are upgrading individual nodes, all three nodes announce the same IPv6 address in addition to their own node IP address. This allows us to balance traffic across the nodes by relying on equal-cost multi-path (ECMP) routing.
The Problem
As users started using the proxy to connect to the cluster (via the
proxy-url property in their kubeconfig) I started receiving
reports of EOF errors. The errors occurred more frequently with longer
requests.
Capturing traffic on the router I could see that the TCP connections, which originate on the router as they are terminated by the proxy in between, would occasionally switch their outgoing interface mid-session. After some testing it became clear that the flipping of the path was what broke the TCP session as the flow ended up on a different control-plane node which was unaware of the TCP session and replied with a TCP RST. This in turn caused the proxy to close the connection towards the client as the upstream connection was gone.
After some fruitless attempts at understanding what was going wrong I started consulting Claude Code. After I explained the setup of the router, which includes FRR, it pointed me towards the possibility of a route change invalidating the cached route lookup result for the active connection.
A quick search revealed that this appears to be a known issue that has been discussed on the Linux kernel mailing list. Apparently the issue has been mitigated for IPv4, but the same mitigation was never applied to the IPv6 path.
To get confirmation I built this bpftrace script, which monitors the addresses and ECMP hash of active connections:
#!/usr/bin/env bpftrace

// On entry, remember the flowi6 pointer (third argument) per thread.
kprobe:fib6_select_path {
    @fl6[tid] = arg2;
}

// On return, mp_hash has been populated; print the flow and its hash.
kretprobe:fib6_select_path /@fl6[tid]/ {
    $fl6 = (struct flowi6 *)@fl6[tid];
    // 10 is AF_INET6
    printf("%-8s\tpid=%-6d\tsrc=%-16s dst=%-16s\thash=%u\n",
        comm, pid,
        ntop(10, $fl6->saddr.in6_u.u6_addr8),
        ntop(10, $fl6->daddr.in6_u.u6_addr8),
        $fl6->mp_hash);
    delete(@fl6[tid]);
}

END {
    clear(@fl6);
}
Reproducing the setup with network namespaces, veth pairs, and netcat revealed this output:
nc pid=1185 src=:: dst=fd00:ecec:f::1 hash=1922781570
nc pid=1185 src=fd00:ecec:1::1 dst=ff02::1:ff00:2 hash=394913401
nc pid=1185 src=fd00:ecec:1::2 dst=fd00:ecec:1::1 hash=838337919
nc pid=1185 src=fd00:ecec:1::1 dst=fd00:ecec:f::1 hash=0
nc pid=1185 src=fd00:ecec:f::1 dst=fd00:ecec:1::1 hash=0
nc pid=1185 src=fd00:ecec:f::1 dst=fd00:ecec:1::1 hash=0
nc pid=1185 src=fd00:ecec:1::1 dst=fd00:ecec:f::1 hash=0
nc pid=1185 src=fd00:ecec:f::1 dst=fd00:ecec:1::1 hash=0
nc pid=1185 src=fd00:ecec:1::1 dst=fd00:ecec:f::1 hash=1761710900
nc pid=1185 src=fd00:ecec:f::1 dst=fd00:ecec:1::1 hash=0
You can see how the first lookup is done with the source address
being ::, the unspecified address. After that the source address is
set, but as no further lookups are performed, traffic follows the path
initially selected. After replacing the routes of the ECMP group and
sending another message, the hash is calculated again, this time with
the selected source address, resulting in a different hash.
Depending on the setup this could be completely harmless, or, as in our case, break all active TCP connections originating from that router.
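To make the effect concrete, here is a minimal user-space sketch in C. It is not kernel code: toy_hash is a plain FNV-1a over the two addresses and toy_pick_nexthop is a simple modulo, both stand-ins chosen for illustration, and every toy_* name is hypothetical. The only point it demonstrates is that a hash-based selection is stable exactly as long as its inputs are, so a lookup done with src=:: and a later one with the real source address are free to disagree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* FNV-1a step over a byte range. Stand-in only, NOT the kernel's hash. */
static uint32_t fnv1a(uint32_t h, const unsigned char *p, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Toy flow hash: covers source and destination address only. */
static uint32_t toy_hash(const struct in6_addr *saddr,
			 const struct in6_addr *daddr)
{
	uint32_t h = 2166136261u;

	h = fnv1a(h, saddr->s6_addr, sizeof(saddr->s6_addr));
	h = fnv1a(h, daddr->s6_addr, sizeof(daddr->s6_addr));
	return h;
}

/* Convenience wrapper taking textual addresses. */
static uint32_t toy_flow_hash(const char *src, const char *dst)
{
	struct in6_addr s, d;
	int ok_s = inet_pton(AF_INET6, src, &s);
	int ok_d = inet_pton(AF_INET6, dst, &d);

	assert(ok_s == 1 && ok_d == 1);
	(void)ok_s;
	(void)ok_d;
	return toy_hash(&s, &d);
}

/* Simplified nexthop choice: index into an ECMP group of n members. */
static unsigned int toy_pick_nexthop(uint32_t hash, unsigned int n)
{
	assert(n > 0);
	return hash % n;
}
```

Feeding toy_flow_hash("::", "fd00:ecec:f::1") and toy_flow_hash("fd00:ecec:1::1", "fd00:ecec:f::1") into toy_pick_nexthop with n = 3 shows whether the two lookups would land on the same group member. Whenever they do not, a connection pinned by the first lookup moves to another node on the next lookup.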
The Fix
We start out in net/ipv6/tcp_ipv6.c, in the function tcp_v6_connect,
which calls ip6_dst_lookup_flow in net/ipv6/ip6_output.c, which in
turn calls ip6_dst_lookup_tail, where the magic happens.
If the source address is not set, an initial route lookup is performed
by calling ip6_route_output. That call eventually reaches
fib6_select_path, which populates the ECMP hash based on the
still unspecified source address.
static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
			       struct dst_entry **dst, struct flowi6 *fl6)
{
	// ...
	if (ipv6_addr_any(&fl6->saddr)) {
		// ...
		*dst = ip6_route_output(net, sk, fl6);
		// ...
		err = ip6_route_get_saddr(net, from, &fl6->daddr,
					  sk ? READ_ONCE(inet6_sk(sk)->srcprefs) : 0,
					  fl6->flowi6_l3mdev,
					  &fl6->saddr);
		// ...
	}

	if (!*dst)
		*dst = ip6_route_output_flags(net, sk, fl6, flags);
	// ...
}
If the initial lookup was successful, its result is used as is, because the second lookup only runs when *dst is NULL. We can rely on that: by invalidating the destination whenever an ECMP hash was involved in the selection of the source address, we make sure that the lookup is performed again:
if (fl6->mp_hash) {
	fl6->mp_hash = 0;
	dst_release(*dst);
	*dst = NULL;
}
Now the route lookup will be repeated with the already selected source address. The assumption is that, when ECMP groups are involved, the source address will be the same for every interface in the group.
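The control flow can be replayed in user space with a small model. This is only a sketch under assumptions: toy_connect imitates the shape of ip6_dst_lookup_tail as described above, toy_route_lookup fills the hash only when it is still 0 (which is what resetting mp_hash in the patch suggests), and the hash itself is an FNV-1a stand-in rather than the kernel's. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

struct toy_flow {
	struct in6_addr saddr;
	struct in6_addr daddr;
	uint32_t mp_hash;	/* 0 means "not computed yet" */
};

/* Parse a textual IPv6 address; aborts on malformed input. */
static struct in6_addr toy_addr(const char *s)
{
	struct in6_addr a;
	int ok = inet_pton(AF_INET6, s, &a);

	assert(ok == 1);
	(void)ok;
	return a;
}

/* FNV-1a over both addresses; stand-in for the kernel's multipath hash. */
static uint32_t toy_hash(const struct toy_flow *f)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < sizeof(f->saddr.s6_addr); i++)
		h = (h ^ f->saddr.s6_addr[i]) * 16777619u;
	for (size_t i = 0; i < sizeof(f->daddr.s6_addr); i++)
		h = (h ^ f->daddr.s6_addr[i]) * 16777619u;
	return h | 1u;	/* never 0, so 0 can keep meaning "unset" */
}

/* Model of a route lookup: computes the hash only if none is cached. */
static void toy_route_lookup(struct toy_flow *f)
{
	if (!f->mp_hash)
		f->mp_hash = toy_hash(f);
}

/* Model of the connect-time lookup, with or without the proposed change. */
static uint32_t toy_connect(struct toy_flow *f, const struct in6_addr *src,
			    bool with_fix)
{
	bool have_dst = false;

	if (IN6_IS_ADDR_UNSPECIFIED(&f->saddr)) {
		toy_route_lookup(f);	/* first lookup, saddr is still :: */
		have_dst = true;
		f->saddr = *src;	/* source address selection */
		if (with_fix && f->mp_hash) {
			f->mp_hash = 0;	/* drop the hash and the result */
			have_dst = false;
		}
	}
	if (!have_dst)
		toy_route_lookup(f);	/* second lookup, saddr is set */
	return f->mp_hash;
}

/* Hash a later lookup computes once the cached result is gone. */
static uint32_t toy_relookup(const struct toy_flow *f)
{
	struct toy_flow fresh = *f;

	fresh.mp_hash = 0;
	toy_route_lookup(&fresh);
	return fresh.mp_hash;
}
```

In this model, toy_connect with with_fix set to true returns the same value that toy_relookup produces afterwards, so a later route change cannot move the flow. With with_fix set to false the returned hash is the one derived from ::, and whether it matches the later one is left to chance.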
After applying this change I rebuilt the kernel and repeated the experiment. Lo and behold, it now calculates the hash twice and reaches a stable hash before sending the first packet:
nc pid=1333 src=:: dst=fd00:ecec:f::1 hash=916816949
nc pid=1333 src=fd00:ecec::1 dst=fd00:ecec:f::1 hash=656520448
nc pid=1333 src=fd00:ecec::1 dst=ff02::1:ff00:2 hash=765176184
nc pid=1333 src=fd00:ecec::2 dst=fd00:ecec::1 hash=749775625
nc pid=1333 src=fd00:ecec::1 dst=fd00:ecec:f::1 hash=0
nc pid=1333 src=fd00:ecec:f::1 dst=fd00:ecec::1 hash=0
nc pid=1333 src=fd00:ecec:f::1 dst=fd00:ecec::1 hash=0
nc pid=1333 src=fd00:ecec::1 dst=fd00:ecec:f::1 hash=0
nc pid=1333 src=fd00:ecec:f::1 dst=fd00:ecec::1 hash=0
nc pid=1333 src=fd00:ecec::1 dst=fd00:ecec:f::1 hash=656520448
nc pid=1333 src=fd00:ecec:f::1 dst=fd00:ecec::1 hash=0
After replacing the routes, the hash gets recalculated, but since it had already stabilized, the new hash value is identical!
Update: submitted the patch.