
[BUG] svclb-traefik* won't start after host crash and restart. #1021

Closed
bayeslearnerold opened this issue Mar 23, 2022 · 8 comments
bayeslearnerold commented Mar 23, 2022

What did you do

  • How was the cluster created?

    • Only 1 node, with a volume mapping for /var/rancher.../storage (a reconstructed command is sketched after this list).
  • What did you do afterwards?
    My host crashed; after restarting it and restarting k3d, I am no longer able to connect to any app service through ingress.
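
A minimal sketch of a matching create command, assuming a hypothetical cluster name and host path (neither is given in the report):

# Hypothetical reconstruction of the setup described above; "mycluster" and
# /data/k3d-storage are placeholders, and the in-container path is assumed
# to be the usual k3s local-path storage directory behind the elided path.
k3d cluster create mycluster \
  --servers 1 \
  --volume /data/k3d-storage:/var/lib/rancher/k3s/storage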

What did you expect to happen

Ingress should work

Screenshots or terminal output

[rockylinux@rockylinux8 infra_k3d]$ kubectl -n kube-system logs svclb-traefik-dkgkq lb-port-80
+ trap exit TERM INT
+ echo 10.43.70.41
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 1 '!=' 1 ]
+ iptables -t nat -I PREROUTING '!' -s 10.43.70.41/32 -p TCP --dport 80 -j DNAT --to 10.43.70.41:80
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded. 
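
For context: the modprobe failures above happen because the klipper-lb container has no /lib/modules mounted, so it cannot load kernel modules itself; the legacy nat table only exists if the host already has the corresponding modules loaded. A hedged way to check and work around this from the host (the module names are an assumption based on the error text):

# Check whether the legacy iptables modules are loaded on the host:
lsmod | grep -E 'ip_tables|iptable_nat'
# If nothing shows up, loading them on the host usually lets the
# container's iptables call succeed on its next restart:
sudo modprobe -a ip_tables iptable_nat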

Which OS & Architecture

  • Linux, Windows, MacOS / amd64, x86, ...?
    Linux rockylinux8.linuxvmimages.local 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Which version of k3d

  • output of k3d version
k3d version v5.3.0
k3s version v1.22.6-k3s1 (default)

Which version of docker

  • output of docker version and docker info
    [rockylinux@rockylinux8 infra_k3d]$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.0-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 3
  Running: 2
  Paused: 0
  Stopped: 1
 Images: 5
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.18.0-348.20.1.el8_5.x86_64
 Operating System: Rocky Linux 8.5 (Green Obsidian)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.19GiB
 Name: rockylinux8.linuxvmimages.local
 ID: RI32:V7KA:PDQG:Q2Z2:DNET:CMMP:3MMG:23OF:RMTN:W6J2:WOQO:N4YA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
bayeslearnerold added the bug label Mar 23, 2022
bayeslearnerold commented Mar 23, 2022

Events:
  Type     Reason          Age                   From     Message
  ----     ------          ----                  ----     -------
  Warning  BackOff         58m (x391 over 138m)  kubelet  Back-off restarting failed container
  Normal   SandboxChanged  47m                   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Started         46m (x2 over 47m)     kubelet  Started container lb-port-80
  Normal   Pulled          46m (x2 over 47m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         46m (x2 over 47m)     kubelet  Created container lb-port-443
  Normal   Started         46m (x2 over 47m)     kubelet  Started container lb-port-443
  Warning  BackOff         46m (x5 over 47m)     kubelet  Back-off restarting failed container
  Normal   Pulled          46m (x3 over 47m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         46m (x3 over 47m)     kubelet  Created container lb-port-80
  Warning  BackOff         22m (x125 over 47m)   kubelet  Back-off restarting failed container
  Normal   SandboxChanged  17m                   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Started         16m (x2 over 17m)     kubelet  Started container lb-port-80
  Normal   Pulled          16m (x2 over 17m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         16m (x2 over 17m)     kubelet  Created container lb-port-443
  Normal   Started         16m (x2 over 17m)     kubelet  Started container lb-port-443
  Warning  BackOff         16m (x5 over 17m)     kubelet  Back-off restarting failed container
  Normal   Pulled          16m (x3 over 17m)     kubelet  Container image "rancher/klipper-lb:v0.3.4" already present on machine
  Normal   Created         16m (x3 over 17m)     kubelet  Created container lb-port-80
  Warning  BackOff         119s (x78 over 17m)   kubelet  Back-off restarting failed container
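
For reference, events like these come from describing the pod (pod name taken from the log earlier in the issue):

kubectl -n kube-system describe pod svclb-traefik-dkgkq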

bayeslearnerold commented
There is definitely something wrong; however, the problem went away.

I deleted the cluster and recreated it. The error was still there. Then I restarted the VM, and the error was there again for about 10 minutes. After that, everything was fine. What the heck?! :-)

iwilltry42 self-assigned this Mar 24, 2022
iwilltry42 added this to the Backlog milestone Mar 24, 2022
iwilltry42 commented
Hi @bayeslearner, thanks for opening this issue and providing all the information.
This does look like some weird incompatibility between klipper and your kernel 🤔
I can't really figure out what the issue is there, but in case it comes back, I'd recommend opening an issue for klipper: https://github.com/k3s-io/klipper-lb

Sorry I can't help you there. Maybe jump on the rancher-users Slack and drop a question in e.g. the #k3s channel.

If you think there's anything we can do on k3d's side, feel free to reopen this issue 👍

bayeslearnerold commented Apr 4, 2022

It did come back. This seems to be caused by an application trying to start another svclb-yyy in the same Docker container.

The issue is that I have no clue how to recover from this error after removing the other application.
I tried deleting the pod from the DaemonSet, etc., but the new pod has the same problem.
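
Something along these lines (the DaemonSet name is an assumption based on the pod names above) recreates the pods, though as described the new pod comes back with the same error:

# Delete the crashing pod so the DaemonSet recreates it:
kubectl -n kube-system delete pod svclb-traefik-dkgkq
# Or restart the whole DaemonSet:
kubectl -n kube-system rollout restart daemonset/svclb-traefik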

iwilltry42 commented
@bayeslearner so you have multiple svclb pods trying to map the same port, and when you delete one of them, the other still doesn't work?
Can you provide some logs or kubectl output showing the situation?
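
For example, something like (pod and container names follow the pattern shown earlier in the issue):

kubectl -n kube-system get pods -o wide | grep svclb
kubectl -n kube-system logs <svclb-pod> lb-port-80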

guhuajun commented Dec 24, 2022

I had the same issue. I am using Rocky Linux.

uname -r
5.14.0-162.6.1.el9_1.0.1.x86_64

cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.1 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.1"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.1 (Blue Onyx)"    
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"      
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"        
ROCKY_SUPPORT_PRODUCT_VERSION="9.1"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.1"

And the following patch works for me:
k3s-io/klipper-lb#34 (comment)
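
(Not reproducing the linked patch here; on RHEL-family hosts, a related workaround is to make the relevant modules load at boot. A hedged sketch, assuming the missing modules are the legacy iptables ones from the errors earlier in the thread:)

# Persist module loading across reboots via systemd's modules-load.d:
cat <<'EOF' | sudo tee /etc/modules-load.d/iptables.conf
ip_tables
iptable_nat
iptable_filter
EOF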

r-ushil commented May 23, 2023

Getting the same issue, running:

k3d version v5.5.1
k3s version v1.26.4-k3s1 (default)

kubectl get all -A shows me this:

kube-system   pod/svclb-traefik-e6de6385-m4hzd              0/2     CrashLoopBackOff   10 (80s ago)   4m8s
kube-system   pod/svclb-traefik-e6de6385-79shf              0/2     CrashLoopBackOff   10 (78s ago)   4m8s
kube-system   pod/svclb-traefik-e6de6385-xrtz7              0/2     CrashLoopBackOff   10 (69s ago)   4m8s

A deeper dive into the logs for one of the pods shows:

Defaulted container "lb-tcp-80" out of: lb-tcp-80, lb-tcp-443
+ trap exit TERM INT
+ BIN_DIR=/sbin
+ check_iptables_mode
+ set +e
+ lsmod
+ grep nf_tables
[INFO]  legacy mode detected
+ '[' 1 '=' 0 ]
+ mode=legacy
+ set -e
+ info 'legacy mode detected'
+ echo '[INFO] ' 'legacy mode detected'
+ set_legacy
+ ln -sf /sbin/xtables-legacy-multi /sbin/iptables
+ ln -sf /sbin/xtables-legacy-multi /sbin/iptables-save
+ ln -sf /sbin/xtables-legacy-multi /sbin/iptables-restore
+ ln -sf /sbin/xtables-legacy-multi /sbin/ip6tables
+ start_proxy
+ echo 0.0.0.0/0
+ grep -Eq :
+ iptables -t filter -I FORWARD -s 0.0.0.0/0 -p TCP --dport 80 -j ACCEPT
modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
iptables v1.8.8 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

My host's iptables version is v1.8.4 (nf_tables), compared to the container's v1.8.8 (nf_tables), so it shouldn't be an nftables compatibility error.
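
(For reference, the backend is printed in parentheses by the binary itself:)

# Prints e.g. "iptables v1.8.4 (nf_tables)" or "... (legacy)":
iptables --version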

Any idea as to what's going on? The hack above doesn't seem to work when using rancher/klipper-lb:v0.4.3. Completely stumped.

r-ushil commented May 23, 2023

After a bit of digging, I've found that the problem lies in the entry script for the rancher/klipper-lb:v0.4.3 container.

Its check_iptables_mode() function uses lsmod and modprobe when it doesn't need to, and as a result it always detects the iptables mode as legacy, meaning it doesn't work with nft.
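
From the set -x trace above, the original detection logic looks roughly like this (a reconstruction from the trace, not copied out of the image):

check_iptables_mode() {
    set +e
    lsmod | grep nf_tables
    # The trace shows [ 1 = 0 ]: grep found nothing, so legacy wins
    # even on hosts whose iptables actually uses the nft backend.
    if [ $? = 0 ]; then
        mode=nft
    else
        mode=legacy
    fi
    set -e
}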

Here's a fix for iptables_nft host operating systems:

FILE: entry

#!/bin/sh
set -ex

trap exit TERM INT

BIN_DIR="/sbin"

info()
{
    echo '[INFO] ' "$@"
}
fatal()
{
    echo '[ERROR] ' "$@" >&2
    exit 1
}

check_iptables_mode() {
    # Skip the lsmod/modprobe probing entirely and force nft mode.
    mode=nft
}

set_nft() {
    for i in iptables iptables-save iptables-restore ip6tables; do
        ln -sf /sbin/xtables-nft-multi "$BIN_DIR/$i";
    done
}

set_legacy() {
    for i in iptables iptables-save iptables-restore ip6tables; do
        ln -sf /sbin/xtables-legacy-multi "$BIN_DIR/$i";
    done
}

start_proxy() {
    # Set up the forward/DNAT/MASQUERADE rules for the service; the env vars
    # (SRC_RANGES, DEST_IPS, SRC_PORT, DEST_PORT, DEST_PROTO) are injected
    # into the pod by klipper-lb.
    for src_range in ${SRC_RANGES}; do
    if echo ${src_range} | grep -Eq ":"; then
        ip6tables -t filter -I FORWARD -s ${src_range} -p ${DEST_PROTO} --dport ${SRC_PORT} -j ACCEPT
    else
        iptables -t filter -I FORWARD -s ${src_range} -p ${DEST_PROTO} --dport ${SRC_PORT} -j ACCEPT
    fi
    done

    for dest_ip in ${DEST_IPS}; do
        if echo ${dest_ip} | grep -Eq ":"; then
            [ $(cat /proc/sys/net/ipv6/conf/all/forwarding) == 1 ] || exit 1
            ip6tables -t filter -A FORWARD -d ${dest_ip}/128 -p ${DEST_PROTO} --dport ${DEST_PORT} -j DROP
            ip6tables -t nat -I PREROUTING ! -s ${dest_ip}/128 -p ${DEST_PROTO} --dport ${SRC_PORT} -j DNAT --to [${dest_ip}]:${DEST_PORT}
            ip6tables -t nat -I POSTROUTING -d ${dest_ip}/128 -p ${DEST_PROTO} -j MASQUERADE
        else
            [ $(cat /proc/sys/net/ipv4/ip_forward) == 1 ] || exit 1
            iptables -t filter -A FORWARD -d ${dest_ip}/32 -p ${DEST_PROTO} --dport ${DEST_PORT} -j DROP
            iptables -t nat -I PREROUTING ! -s ${dest_ip}/32 -p ${DEST_PROTO} --dport ${SRC_PORT} -j DNAT --to ${dest_ip}:${DEST_PORT}
            iptables -t nat -I POSTROUTING -d ${dest_ip}/32 -p ${DEST_PROTO} -j MASQUERADE
        fi
    done
}

check_iptables_mode
case $mode in
nft)
    info "nft mode detected"
    set_nft
    ;;
legacy)
    info "legacy mode detected"
    set_legacy
    ;;
*)
    fatal "invalid iptables mode"
    ;;
esac
start_proxy

# Block forever reading from a FIFO so the container keeps running.
if [ ! -e /pause ]; then
    mkfifo /pause
fi
</pause

FILE: Dockerfile

FROM rancher/klipper-lb:v0.4.3
COPY entry /usr/bin/entry
CMD ["entry"]

Run docker build -t rancher/klipper-lb:v0.4.3 . and then k3d image import -c MY_CLUSTER_NAME rancher/klipper-lb:v0.4.3 to inject the modified image into your running k3d cluster.
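
To confirm the patched image took effect, something like the following (pod and container names follow the pattern shown earlier in this thread):

kubectl -n kube-system get pods | grep svclb
kubectl -n kube-system logs <svclb-pod> lb-tcp-80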
