IPv6 ECMP Instability in the Linux Kernel

2026-03-29

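I recently discovered an unfortunate behavior in the Linux kernel's IPv6 stack: a TCP connection that originates on a host and is routed via an ECMP group can switch to a different path mid-session when the routes of that group are replaced.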

The Setup

In our infrastructure we have a management Kubernetes cluster with three control-plane nodes. As we do not want to expose this cluster to the internet, we rely on our router as an SSH jump host. Operators connect to the router using ssh -D 1080 $router, which opens a local SOCKS5 proxy through which they can reach endpoints within the infrastructure.

To ensure the cluster stays reachable when we are upgrading individual nodes, all three nodes announce the same IPv6 address in addition to their own node IP address. This allows us to balance traffic across the nodes by relying on ECMP.
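To make the ECMP part concrete, here is a minimal sketch in C of how a flow hash can pin a connection to one of several equal-cost nexthops. Everything in it is an assumption for illustration: struct flow_key, toy_flow_hash (plain FNV-1a) and pick_nexthop (a simple modulo) are hypothetical names, and the kernel uses its own hash function and selection logic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key: only the two IPv6 addresses, 16 bytes each.
 * A real flow hash typically covers more fields (ports, protocol, ...). */
struct flow_key {
    uint8_t saddr[16];
    uint8_t daddr[16];
};

/* FNV-1a over the raw key bytes. This is only a stand-in for the
 * kernel's flow hash, chosen because it is short and deterministic. */
static uint32_t toy_flow_hash(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(*k); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Map a hash onto one of n equal-cost nexthops (simplified: modulo). */
static unsigned int pick_nexthop(uint32_t hash, unsigned int n)
{
    return hash % n;
}
```

Calling pick_nexthop(toy_flow_hash(&k), 3) repeatedly for the same key always yields the same index. That property is what keeps a TCP session on one control-plane node, but only as long as the inputs to the hash do not change.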

The Problem

As users started using the proxy to connect to the cluster (via the proxy-url property in their kubeconfig), I started receiving reports of EOF errors. The errors occurred more frequently with longer requests.

Capturing traffic on the router, I could see that the TCP connections (which originate on the router, as the proxy terminates the client connections in between) would occasionally switch their outgoing interface mid-session. After some testing it became clear that this path flip was what broke the TCP session: the flow ended up on a different control-plane node, which was unaware of the TCP session and replied with a TCP RST. This in turn caused the proxy to close the connection towards the client, as the upstream connection was gone.

After some fruitless attempts at understanding what was going wrong, I started consulting Claude Code. After I explained the setup of the router, which includes FRR, it pointed me towards the possibility of a route change invalidating the cached route lookup result for the active connection.

A quick search revealed that this appears to be a known issue that has been discussed on the Linux kernel mailing list. Apparently it has been mitigated for IPv4, but the same mitigation was never applied to the IPv6 path.

To get confirmation I built this bpftrace script, which prints the addresses and ECMP hash for every call to fib6_select_path:

#!/usr/bin/env bpftrace

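// On entry, remember the flowi6 pointer (third argument) for this thread.
kprobe:fib6_select_path {
    @fl6[tid] = arg2;
}

// On return, print the addresses and mp_hash as the function left them.
kretprobe:fib6_select_path {
    $fl6 = (struct flowi6 *)@fl6[tid];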

    printf("%-8s\tpid=%-6d\tsrc=%-16s dst=%-16s\thash=%u\n",
        comm, pid,
        ntop(10, $fl6->saddr.in6_u.u6_addr8),
        ntop(10, $fl6->daddr.in6_u.u6_addr8),
        $fl6->mp_hash);

    delete(@fl6[tid]);
}

END {
    clear(@fl6);
}

Reproducing the setup with network namespaces, veth pairs, and netcat revealed this output:

nc          pid=1185    src=::               dst=fd00:ecec:f::1     hash=1922781570
nc          pid=1185    src=fd00:ecec:1::1   dst=ff02::1:ff00:2     hash=394913401
nc          pid=1185    src=fd00:ecec:1::2   dst=fd00:ecec:1::1     hash=838337919
nc          pid=1185    src=fd00:ecec:1::1   dst=fd00:ecec:f::1     hash=0
nc          pid=1185    src=fd00:ecec:f::1   dst=fd00:ecec:1::1     hash=0
nc          pid=1185    src=fd00:ecec:f::1   dst=fd00:ecec:1::1     hash=0
nc          pid=1185    src=fd00:ecec:1::1   dst=fd00:ecec:f::1     hash=0
nc          pid=1185    src=fd00:ecec:f::1   dst=fd00:ecec:1::1     hash=0
nc          pid=1185    src=fd00:ecec:1::1   dst=fd00:ecec:f::1     hash=1761710900
nc          pid=1185    src=fd00:ecec:f::1   dst=fd00:ecec:1::1     hash=0

You can see that the first lookup is done with the source address still unset (::), which yields hash=1922781570. After that the source address is set, but as no further lookups are performed, traffic follows the path initially selected. After replacing the routes of the ECMP group and sending another message, the hash is calculated again, this time with the selected source address, resulting in a different hash (1761710900).

Depending on the setup this could be completely harmless, or, as in our case, break all active TCP connections originating from that router.

The Fix

We start in net/ipv6/tcp_ipv6.c, in the function tcp_v6_connect, which calls ip6_dst_lookup_flow in net/ipv6/ip6_output.c, which in turn calls ip6_dst_lookup_tail, where the magic happens.

If the source address is not set, an initial route lookup is performed via ip6_route_output, which eventually calls fib6_select_path and thereby populates the ECMP hash based on the empty source address.

static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
                   struct dst_entry **dst, struct flowi6 *fl6)
{
    // ...

    if (ipv6_addr_any(&fl6->saddr)) {
        // ...
        *dst = ip6_route_output(net, sk, fl6);
        // ...
        err = ip6_route_get_saddr(net, from, &fl6->daddr,
                      sk ? READ_ONCE(inet6_sk(sk)->srcprefs) : 0,
                      fl6->flowi6_l3mdev,
                      &fl6->saddr);
        //...
    }

    if (!*dst)
        *dst = ip6_route_output_flags(net, sk, fl6, flags);
    
    // ...
}

If the initial lookup succeeds, its result is used as-is: *dst is already set, so the second lookup at the bottom of the function is skipped. We can use this to force a second lookup whenever an ECMP hash was computed during the initial lookup, by resetting the hash and invalidating the destination once the source address has been selected:

        if (fl6->mp_hash) {
            fl6->mp_hash = 0;
            dst_release(*dst);
            *dst = NULL;
        }

Now the route lookup will be repeated with the already selected source address. The assumption is that, when ECMP groups are involved, the source address will be the same for every interface in the group.
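The effect of the change can be modeled in userspace. The following is a rough sketch in C, not kernel code: struct toy_flow, toy_hash (plain FNV-1a), toy_route_lookup, toy_connect and toy_relookup are hypothetical stand-ins, and the sketch assumes that a hash of 0 means "not computed yet", so a lookup only computes the hash while the field is unset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct flowi6. */
struct toy_flow {
    uint8_t saddr[16];  /* all zero means unspecified (::) */
    uint8_t daddr[16];
    uint32_t mp_hash;   /* assumption: 0 means "not computed yet" */
};

/* FNV-1a over saddr and daddr; never returns 0 so that 0 stays reserved. */
static uint32_t toy_hash(const struct toy_flow *fl)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < 16; i++) { h ^= fl->saddr[i]; h *= 16777619u; }
    for (size_t i = 0; i < 16; i++) { h ^= fl->daddr[i]; h *= 16777619u; }
    return h ? h : 1;
}

static bool toy_addr_any(const uint8_t a[16])
{
    for (size_t i = 0; i < 16; i++)
        if (a[i])
            return false;
    return true;
}

/* Stand-in for the route lookup: computes the hash only if it is unset. */
static void toy_route_lookup(struct toy_flow *fl)
{
    if (!fl->mp_hash)
        fl->mp_hash = toy_hash(fl);
}

/* Models the connect-time lookup. Returns the hash that selects the
 * path for the first packet. */
static uint32_t toy_connect(struct toy_flow *fl, const uint8_t src[16],
                            bool patched)
{
    if (toy_addr_any(fl->saddr)) {
        toy_route_lookup(fl);        /* hash over (::, daddr) */
        memcpy(fl->saddr, src, 16);  /* source address selection */
        if (patched && fl->mp_hash) {
            fl->mp_hash = 0;         /* discard the first result ... */
            toy_route_lookup(fl);    /* ... and hash over (saddr, daddr) */
        }
    }
    return fl->mp_hash;
}

/* Models the fresh lookup after the cached route has been invalidated. */
static uint32_t toy_relookup(struct toy_flow *fl)
{
    fl->mp_hash = 0;
    toy_route_lookup(fl);
    return fl->mp_hash;
}
```

With patched set to false, toy_connect returns the hash over (::, daddr) while a later toy_relookup returns the hash over (saddr, daddr), so the two calls hash different inputs and can land on different nexthops. With patched set to true, both are computed over the same inputs and are therefore equal.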

After applying this change I rebuilt the kernel and repeated the experiment. Lo and behold, the hash is now calculated twice and is stable before the first packet is sent:

nc          pid=1333    src=::               dst=fd00:ecec:f::1     hash=916816949
nc          pid=1333    src=fd00:ecec::1     dst=fd00:ecec:f::1     hash=656520448
nc          pid=1333    src=fd00:ecec::1     dst=ff02::1:ff00:2     hash=765176184
nc          pid=1333    src=fd00:ecec::2     dst=fd00:ecec::1       hash=749775625
nc          pid=1333    src=fd00:ecec::1     dst=fd00:ecec:f::1     hash=0
nc          pid=1333    src=fd00:ecec:f::1   dst=fd00:ecec::1       hash=0
nc          pid=1333    src=fd00:ecec:f::1   dst=fd00:ecec::1       hash=0
nc          pid=1333    src=fd00:ecec::1     dst=fd00:ecec:f::1     hash=0
nc          pid=1333    src=fd00:ecec:f::1   dst=fd00:ecec::1       hash=0
nc          pid=1333    src=fd00:ecec::1     dst=fd00:ecec:f::1     hash=656520448
nc          pid=1333    src=fd00:ecec:f::1   dst=fd00:ecec::1       hash=0

After replacing the routes the hash is recalculated, but since it had already stabilized, the new value is identical (656520448)!

Update: submitted the patch.