
[BUG] svclb-traefik* won't start after host crash and restart. #1021

Closed
bayeslearnerold opened this issue Mar 23, 2022 · 8 comments
bayeslearnerold commented Mar 23, 2022

What did you do

  • How was the cluster created?

    • Only 1 node, with a volume mapping for /var/rancher.../storage (a reconstructed command is sketched after this list).
  • What did you do afterwards?
    My host crashed; after restarting it and restarting k3d, I am no longer able to connect to any app service through ingress.
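
A minimal sketch of a matching create command, assuming a hypothetical cluster name and host path (neither is given in the report):

# Hypothetical reconstruction of the setup described above; "mycluster" and
# /data/k3d-storage are placeholders, and the in-container path is assumed
# to be the usual k3s local-path storage directory behind the elided path.
k3d cluster create mycluster \
  --servers 1 \
  --volume /data/k3d-storage:/var/lib/rancher/k3s/storage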

What did you expect to happen

Ingress should work

Screenshots or terminal output

[rockylinux@rockylinux8 infra_k3d]$ kubectl -n kube-system logs svclb-traefik-dkgkq lb-port-80
+ trap exit TERM INT
+ echo 10.43.70.41
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 1 '!=' 1 ]
+ iptables -t nat -I PREROUTING '!' -s 10.43.70.41/32 -p TCP --dport 80 -j DNAT --to 10.43.70.41:80
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded. 
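
For context: the modprobe failures above happen because the klipper-lb container has no /lib/modules mounted, so it cannot load kernel modules itself; the legacy nat table only exists if the host already has the corresponding modules loaded. A hedged way to check and work around this from the host (the module names are an assumption based on the error text):

# Check whether the legacy iptables modules are loaded on the host:
lsmod | grep -E 'ip_tables|iptable_nat'
# If nothing shows up, loading them on the host usually lets the
# container's iptables call succeed on its next restart:
sudo modprobe -a ip_tables iptable_nat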

Which OS & Architecture

  • Linux, Windows, MacOS / amd64, x86, ...?
    Linux rockylinux8.linuxvmimages.local 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Which version of k3d

  • output of k3d version
k3d version v5.3.0
k3s version v1.22.6-k3s1 (default)

Which version of docker

  • output of docker version and docker info
    [rockylinux@rockylinux8 infra_k3d]$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.0-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 3
  Running: 2
  Paused: 0
  Stopped: 1
 Images: 5
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.18.0-348.20.1.el8_5.x86_64
 Operating System: Rocky Linux 8.5 (Green Obsidian)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.19GiB
 Name: rockylinux8.linuxvmimages.local
 ID: RI32:V7KA:PDQG:Q2Z2:DNET:CMMP:3MMG:23OF:RMTN:W6J2:WOQO:N4YA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
bayeslearnerold added the bug label Mar 23, 2022
bayeslearnerold commented Mar 23, 2022

Events:
  Type     Reason          Age                   From     Message
  ----     ------          ----                  ----     -------
  Warning  BackOff         58m (x391 over 138m)  kubelet  Back-off restarting failed container
  Normal   SandboxChanged  47m                   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Started         46m (x2 over 47m)     kubelet  Started container lb-port-80
  Normal   Pulled          46m (x2 over 47m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         46m (x2 over 47m)     kubelet  Created container lb-port-443
  Normal   Started         46m (x2 over 47m)     kubelet  Started container lb-port-443
  Warning  BackOff         46m (x5 over 47m)     kubelet  Back-off restarting failed container
  Normal   Pulled          46m (x3 over 47m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         46m (x3 over 47m)     kubelet  Created container lb-port-80
  Warning  BackOff         22m (x125 over 47m)   kubelet  Back-off restarting failed container
  Normal   SandboxChanged  17m                   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Started         16m (x2 over 17m)     kubelet  Started container lb-port-80
  Normal   Pulled          16m (x2 over 17m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         16m (x2 over 17m)     kubelet  Created container lb-port-443
  Normal   Started         16m (x2 over 17m)     kubelet  Started container lb-port-443
  Warning  BackOff         16m (x5 over 17m)     kubelet  Back-off restarting failed container
  Normal   Pulled          16m (x3 over 17m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         16m (x3 over 17m)     kubelet  Created container lb-port-80
  Warning  BackOff         119s (x78 over 17m)   kubelet  Back-off restarting failed container
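
For reference, events like these come from describing the pod (pod name taken from the log earlier in the issue):

kubectl -n kube-system describe pod svclb-traefik-dkgkq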

bayeslearnerold commented
There is definitely something wrong; however, the problem went away.

I deleted the cluster and recreated it. The error was still there. Then I restarted the VM, and the error was there again for about 10 minutes. After that, everything was fine. What the heck?! :-)

iwilltry42 self-assigned this Mar 24, 2022
iwilltry42 added this to the Backlog milestone Mar 24, 2022
iwilltry42 commented
Hi @bayeslearner, thanks for opening this issue and providing all the information.
This does look like some weird incompatibility between klipper and your kernel 🤔
I can't really figure out what the issue is there, but in case it comes back, I'd recommend opening an issue for klipper: https://github.com/k3s-io/klipper-lb

Sorry I can't help you there. Maybe jump on the rancher-users Slack and drop a question in e.g. the #k3s channel.

If you think there's anything we can do on k3d's side, feel free to reopen this issue 👍

bayeslearnerold commented Apr 4, 2022

It did come back. This seems to be caused by an application trying to start another svclb-yyy in the same Docker container.

The issue is that I have no clue how to recover from this error after removing the other application.
I tried deleting the pod from the DaemonSet, etc., but the new pod has the same problem.
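
Something along these lines (the DaemonSet name is an assumption based on the pod names above) recreates the pods, though as described the new pod comes back with the same error:

# Delete the crashing pod so the DaemonSet recreates it:
kubectl -n kube-system delete pod svclb-traefik-dkgkq
# Or restart the whole DaemonSet:
kubectl -n kube-system rollout restart daemonset/svclb-traefik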

iwilltry42 commented
@bayeslearner so you have multiple svclb pods trying to map the same port, and when you delete one of them, the other still doesn't work?
Can you provide some logs or kubectl output showing the situation?
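
For example, something like (pod and container names follow the pattern shown earlier in the issue):

kubectl -n kube-system get pods -o wide | grep svclb
kubectl -n kube-system logs <svclb-pod> lb-port-80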

guhuajun commented Dec 24, 2022

I had the same issue. I am using Rocky Linux.

uname -r
5.14.0-162.6.1.el9_1.0.1.x86_64

cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.1 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.1"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.1 (Blue Onyx)"    
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"      
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"        
ROCKY_SUPPORT_PRODUCT_VERSION="9.1"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.1"

And the following patch works for me:
k3s-io/klipper-lb#34 (comment)
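
(Not reproducing the linked patch here; on RHEL-family hosts, a related workaround is to make the relevant modules load at boot. A hedged sketch, assuming the missing modules are the legacy iptables ones from the errors earlier in the thread:)

# Persist module loading across reboots via systemd's modules-load.d:
cat <<'EOF' | sudo tee /etc/modules-load.d/iptables.conf
ip_tables
iptable_nat
iptable_filter
EOF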

r-ushil commented May 23, 2023

Getting the same issue, running:

k3d version v5.5.1
k3s version v1.26.4-k3s1 (default)

kubectl get all -A shows me this:

kube-system   pod/svclb-traefik-e6de6385-m4hzd              0/2     CrashLoopBackOff   10 (80s ago)   4m8s
kube-system   pod/svclb-traefik-e6de6385-79shf              0/2     CrashLoopBackOff   10 (78s ago)   4m8s
kube-system   pod/svclb-traefik-e6de6385-xrtz7              0/2     CrashLoopBackOff   10 (69s ago)   4m8s

A deeper dive into the logs for one of the pods shows:

Defaulted container "lb-tcp-80" out of: lb-tcp-80, lb-tcp-443
+ trap exit TERM INT
+ BIN_DIR=/sbin
+ check_iptables_mode
+ set +e
+ lsmod
+ grep nf_tables
[INFO]  legacy mode detected
+ '[' 1 '=' 0 ]
+ mode=legacy
+ set -e
+ info 'legacy mode detected'
+ echo '[INFO] ' 'legacy mode detected'
+ set_legacy
+ ln -sf /sbin/xtables-legacy-multi /sbin/iptables
+ ln -sf /sbin/xtables-legacy-multi /sbin/iptables-save
+ ln -sf /sbin/xtables-legacy-multi /sbin/iptables-restore
+ ln -sf /sbin/xtables-legacy-multi /sbin/ip6tables
+ start_proxy
+ echo 0.0.0.0/0
+ grep -Eq :
+ iptables -t filter -I FORWARD -s 0.0.0.0/0 -p TCP --dport 80 -j ACCEPT
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
iptables v1.8.8 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

My host's iptables version is v1.8.4 (nf_tables), compared to the container's v1.8.8 (nf_tables), so it shouldn't be an nftables compatibility error.
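
(For reference, the backend is printed in parentheses by the binary itself:)

# Prints e.g. "iptables v1.8.4 (nf_tables)" or "... (legacy)":
iptables --version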

Any idea as to what's going on? The hack above doesn't seem to work when using rancher/klipper-lb:v0.4.3. Completely stumped.

r-ushil commented May 23, 2023

After a bit of digging, I've found that the problem lies in the entry script for the rancher/klipper-lb:v0.4.3 container.

Its check_iptables_mode() function uses lsmod and modprobe when it doesn't need to, and as a result it always detects the iptables mode as legacy, meaning it doesn't work with nft.
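
From the set -x trace above, the original detection logic looks roughly like this (a reconstruction from the trace, not copied out of the image):

check_iptables_mode() {
    set +e
    lsmod | grep nf_tables
    # The trace shows [ 1 = 0 ]: grep found nothing, so legacy wins
    # even on hosts whose iptables actually uses the nft backend.
    if [ $? = 0 ]; then
        mode=nft
    else
        mode=legacy
    fi
    set -e
}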

Here's a fix for iptables_nft host operating systems:

FILE: entry

#!/bin/sh
set -ex

trap exit TERM INT

BIN_DIR="/sbin"

info()
{
    echo '[INFO] ' "$@"
}
fatal()
{
    echo '[ERROR] ' "$@" >&2
    exit 1
}

check_iptables_mode() {
    # Skip the lsmod/modprobe probing entirely and force nft mode.
    mode=nft
}

set_nft() {
    for i in iptables iptables-save iptables-restore ip6tables; do
        ln -sf /sbin/xtables-nft-multi "$BIN_DIR/$i";
    done
}

set_legacy() {
    for i in iptables iptables-save iptables-restore ip6tables; do
        ln -sf /sbin/xtables-legacy-multi "$BIN_DIR/$i";
    done
}

start_proxy() {
    # Set up the forward/DNAT/MASQUERADE rules for the service; the env vars
    # (SRC_RANGES, DEST_IPS, SRC_PORT, DEST_PORT, DEST_PROTO) are injected
    # into the pod by klipper-lb.
    for src_range in ${SRC_RANGES}; do
    if echo ${src_range} | grep -Eq ":"; then
        ip6tables -t filter -I FORWARD -s ${src_range} -p ${DEST_PROTO} --dport ${SRC_PORT} -j ACCEPT
    else
        iptables -t filter -I FORWARD -s ${src_range} -p ${DEST_PROTO} --dport ${SRC_PORT} -j ACCEPT
    fi
    done

    for dest_ip in ${DEST_IPS}; do
        if echo ${dest_ip} | grep -Eq ":"; then
            [ $(cat /proc/sys/net/ipv6/conf/all/forwarding) == 1 ] || exit 1
            ip6tables -t filter -A FORWARD -d ${dest_ip}/128 -p ${DEST_PROTO} --dport ${DEST_PORT} -j DROP
            ip6tables -t nat -I PREROUTING ! -s ${dest_ip}/128 -p ${DEST_PROTO} --dport ${SRC_PORT} -j DNAT --to [${dest_ip}]:${DEST_PORT}
            ip6tables -t nat -I POSTROUTING -d ${dest_ip}/128 -p ${DEST_PROTO} -j MASQUERADE
        else
            [ $(cat /proc/sys/net/ipv4/ip_forward) == 1 ] || exit 1
            iptables -t filter -A FORWARD -d ${dest_ip}/32 -p ${DEST_PROTO} --dport ${DEST_PORT} -j DROP
            iptables -t nat -I PREROUTING ! -s ${dest_ip}/32 -p ${DEST_PROTO} --dport ${SRC_PORT} -j DNAT --to ${dest_ip}:${DEST_PORT}
            iptables -t nat -I POSTROUTING -d ${dest_ip}/32 -p ${DEST_PROTO} -j MASQUERADE
        fi
    done
}

check_iptables_mode
case $mode in
nft)
    info "nft mode detected"
    set_nft
    ;;
legacy)
    info "legacy mode detected"
    set_legacy
    ;;
*)
    fatal "invalid iptables mode"
    ;;
esac
start_proxy

# Block forever reading from a FIFO so the container keeps running.
if [ ! -e /pause ]; then
    mkfifo /pause
fi
</pause

FILE: Dockerfile

FROM rancher/klipper-lb:v0.4.3
COPY entry /usr/bin/entry
CMD ["entry"]

Run docker build -t rancher/klipper-lb:v0.4.3 . and then k3d image import -c MY_CLUSTER_NAME rancher/klipper-lb:v0.4.3 to inject the modified image into your running k3d cluster.
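
To confirm the patched image took effect, something like the following (pod and container names follow the pattern shown earlier in this thread):

kubectl -n kube-system get pods | grep svclb
kubectl -n kube-system logs <svclb-pod> lb-tcp-80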
