monitor_collector_main coredumps on startup in a container #171

Open
234278700 opened this issue Mar 14, 2025 · 8 comments

234278700 commented Mar 14, 2025

Environment:
Host OS: Anolis OS release 8.8 amd64, GNU libc 2.28, kernel 5.10
Container: ubuntu:22.04

Container startup:
docker run -it --network=host --name 3fs3 --device=/dev/infiniband:/dev/infiniband -v /etc/libibverbs.d:/etc/libibverbs.d --cap-add=NET_RAW --cap-add=IPC_LOCK --cap-add=CAP_NET_ADMIN --privileged 3fs:v1 /bin/bash

Inside the container:
show_gids
DEV          PORT  INDEX  GID                                      IPv4         VER  DEV
mlx5_bond_0  1     0      fe80:0000:0000:0000:0ac0:ebff:fe5a:2008               v1   bond0
mlx5_bond_0  1     1      fe80:0000:0000:0000:0ac0:ebff:fe5a:2008               v2   bond0
mlx5_bond_0  1     2      0000:0000:0000:0000:0000:ffff:0ac7:2516  192.168.1.2  v1   bond0.600
mlx5_bond_0  1     3      0000:0000:0000:0000:0000:ffff:0ac7:2516  192.168.1.2  v2   bond0.600

Startup:
/opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml

[2025-03-14T01:04:43.799468495+00:00 monitor_collect:81240 IBDevice.cc:169 INFO] ibdev2netdev: mlx5_bond_0 port 1 ==> bond0 (Up)
[2025-03-14T01:04:43.799533255+00:00 monitor_collect:81240 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_bond_0 => bond0
[2025-03-14T01:04:43.799691852+00:00 monitor_collect:81240 IfAddrs.h:102 INFO] Get ifaddr of bond0.600, addr 192.168.1.2, subnet 192.168.1.0/24, up true
[2025-03-14T01:04:43.802160844+00:00 monitor_collect:81240 IBDevice.cc:386 WARNING] IfAddr of mlx5_bond_0:1 -> bond0 not found, maybe running in container!
[2025-03-14T01:04:43.802173396+00:00 monitor_collect:81240 IBDevice.cc:441 CRITICAL] IBDevice mlx5_bond_0:1 can't set zone by IP, fallback to UNKNOWN
[2025-03-14T01:04:43.802249521+00:00 monitor_collect:81240 IBDevice.cc:367 INFO] IBDevice mlx5_bond_0 add active port 1, linklayer ETHERNET, addrs , zones UNKNOWN, RoCE v2 GID 0:0:0:0:0:0:0:0:0:0:ff:ff:a:c7:25:16
[2025-03-14T01:04:43.802260518+00:00 monitor_collect:81240 IBDevice.cc:256 INFO] IBDevice add mlx5_bond_0, id 0, 1 available ports
[2025-03-14T01:04:43.803790460+00:00 IBManager:81267 EventLoop.cc:116 INFO] EventLoop::loop() started.
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] Folly log json configure: {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "categories": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] ".": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "level": "INFO",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "inherit": true,
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "propagate": "NONE",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "handlers": [
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "normal",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "err",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "fatal"
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] ]
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] },
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "handlers": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "normal": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "type": "file",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "options": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "path": "/var/log/3fs/monitor_collector_main.log",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "async": "true",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "rotate": "true",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "max_files": "10",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "max_file_size": "104857600",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "rotate_on_open": "false"
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] },
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "err": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "type": "file",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "options": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "level": "ERR",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "path": "/var/log/3fs/monitor_collector_main-err.log",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "async": "false",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "rotate": "true",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "max_files": "10",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "max_file_size": "104857600",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "rotate_on_open": "false"
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] },
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "fatal": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "type": "stream",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "options": {
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "level": "FATAL",
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] "stream": "stderr"
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.803936435+00:00 monitor_collect:81240 LogConfig.cc:96 INFO] }
[2025-03-14T01:04:43.804024793+00:00 monitor_collect:81240 OnePhaseApplication.h:87 INFO] LogConfig: {"categories":{".":{"level":"INFO","inherit":true,"propagate":"NONE","handlers":["normal","err","fatal"]}},"handlers":{"normal":{"type":"file","options":{"path":"/var/log/3fs/monitor_collector_main.log","async":"true","rotate":"true","max_files":"10","max_file_size":"104857600","rotate_on_open":"false"}},"err":{"type":"file","options":{"level":"ERR","path":"/var/log/3fs/monitor_collector_main-err.log","async":"false","rotate":"true","max_files":"10","max_file_size":"104857600","rotate_on_open":"false"}},"fatal":{"type":"stream","options":{"level":"FATAL","stream":"stderr"}}}}
Segmentation fault (core dumped)

gdb:
#0 0x00001494f414391f in make_request (pid=81294, fd=30) at ../sysdeps/unix/sysv/linux/check_pf.c:147
147 ../sysdeps/unix/sysv/linux/check_pf.c: No such file or directory.
[Current thread is 1 (Thread 0x1494b96f8640 (LWP 81343))]
(gdb) bt full
#0 0x00001494f414391f in make_request (pid=81294, fd=30) at ../sysdeps/unix/sysv/linux/check_pf.c:147
__result =
result_len = 0
nladdr = {nl_family = 16, nl_pad = 0, nl_pid = 0, nl_groups = 0}
buf = '\000' <repeats 2468 times>...
seen_ipv6 =
result_cap = 32
req = {nlh = {nlmsg_len = 20, nlmsg_type = 22, nlmsg_flags = 769, nlmsg_seq = 1741914359, nlmsg_pid = 0}, g = {rtgen_family = 0 '\000'}, pad = "\000\000"}
done =
seen_ipv4 =
result = 0x0
buf_size = 4096
iov = {iov_base = 0x1494b92ddd80, iov_len = 4096}
result =
result_len =
result_cap =
req =
nladdr =
PRETTY_FUNCTION =
buf_size =
buf =
iov =
out_fail =
done =
seen_ipv4 =
seen_ipv6 =
out =
__result =
msg =
read_len =
nlmh =
__result =
ifam =
rta =
len =
local =
address =
info =
__a =
#1 __check_pf (seen_ipv4=seen_ipv4@entry=0x1494b92defd6, seen_ipv6=seen_ipv6@entry=0x1494b92defd7, in6ai=in6ai@entry=0x1494b92defe8,
in6ailen=in6ailen@entry=0x1494b92deff0) at ../sysdeps/unix/sysv/linux/check_pf.c:329

@echaozh

@haohaiwei
Contributor

haohaiwei commented Mar 14, 2025

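The core dump looks like it is caused by every network interface inside the container having a name that starts with ib. Currently, 3FS needs at least one interface whose name does not start with ib in order to establish RDMA connections; otherwise the service cannot start.
In the source, src/common/net/Listener.cc shows that when setting up the TCP listener, it checks whether the interface name starts with en, eth, or bond.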
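
To make the described check concrete, here is a minimal sketch of a name-prefix filter. It is not the code from src/common/net/Listener.cc: the function name hasTcpListenPrefix and the exact matching rules (case-sensitive, these three prefixes only) are assumptions made for illustration.

```cpp
#include <array>
#include <string_view>

// Hypothetical helper, NOT the actual 3FS implementation.
// Returns true when the interface name starts with one of the prefixes
// mentioned above: "en", "eth", or "bond" (case-sensitive comparison).
bool hasTcpListenPrefix(std::string_view ifname) {
  constexpr std::array<std::string_view, 3> kPrefixes{"en", "eth", "bond"};
  for (std::string_view prefix : kPrefixes) {
    // substr clamps the length, so names shorter than the prefix are safe.
    if (ifname.substr(0, prefix.size()) == prefix) {
      return true;
    }
  }
  return false;
}
```

Under these assumptions, bond0.600 and eth0 would pass, while ib0 would not. A case-sensitive filter like this one would also reject a name such as Bond0; whether the real check behaves that way is not confirmed here.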

@234278700
Author

234278700 commented Mar 14, 2025

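> The core dump looks like it is caused by every network interface inside the container having a name that starts with ib. Currently, 3FS needs at least one interface whose name does not start with ib in order to establish RDMA connections; otherwise the service cannot start. In the source, src/common/net/Listener.cc shows that when setting up the TCP listener, it checks whether the interface name starts with en, eth, or bond.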

The container is started in the host's network namespace, so the TCP interfaces are present:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether ec:38:8f:69:a7:09 brd ff:ff:ff:ff:ff:ff
3: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether ec:38:8f:69:a7:0a brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 08:c0:eb:5a:20:08 brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 08:c0:eb:5a:20:08 brd ff:ff:ff:ff:ff:ff permaddr 08:c0:eb:5a:20:09
6: eth4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 90:e2:ba:3f:22:44 brd ff:ff:ff:ff:ff:ff
7: eth5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 90:e2:ba:3f:22:45 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 08:c0:eb:5a:20:08 brd ff:ff:ff:ff:ff:ff
9: bond0.600@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 08:c0:eb:5a:20:08 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.2/24 brd 10.199.37.255 scope global noprefixroute bond0.500
valid_lft forever preferred_lft forever

I also tried configuring the IP address directly on bond0.

@234278700
Author

monitorCollectorOperator_ = std::make_unique(config_.monitor_collector()); // the coredump occurs here

Result MonitorCollectorServer::beforeStart() {
monitorCollectorOperator_ = std::make_unique(config_.monitor_collector()); // the coredump occurs here
RETURN_ON_ERROR(addSerdeService(std::make_unique(*monitorCollectorOperator_), true));
return Void{};
}

@Icedroid

I'm hitting a coredump as well. Does anyone know how to fix it?

root@localhost:/opt/3fs# /opt/3fs/bin/monitor_collector_main --cfg /opt/3fs/etc/monitor_collector_main.toml
[2025-03-14T13:08:16.512822042+00:00 monitor_collect:  156 IBDevice.cc:169 INFO] ibdev2netdev: mlx5_0 port 1 ==> Bond0 (Up)
[2025-03-14T13:08:16.512822042+00:00 monitor_collect:  156 IBDevice.cc:169 INFO] mlx5_1 port 1 ==> Bond0 (Up)
[2025-03-14T13:08:16.512822042+00:00 monitor_collect:  156 IBDevice.cc:169 INFO] mlx5_2 port 1 ==> Bond0 (Up)
[2025-03-14T13:08:16.512822042+00:00 monitor_collect:  156 IBDevice.cc:169 INFO] mlx5_3 port 1 ==> Bond0 (Up)
[2025-03-14T13:08:16.512876530+00:00 monitor_collect:  156 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_0 => Bond0
[2025-03-14T13:08:16.512880240+00:00 monitor_collect:  156 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_1 => Bond0
[2025-03-14T13:08:16.512882982+00:00 monitor_collect:  156 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_2 => Bond0
[2025-03-14T13:08:16.512887535+00:00 monitor_collect:  156 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_3 => Bond0
[2025-03-14T13:08:16.513531239+00:00 monitor_collect:  156 IfAddrs.h:102 INFO] Get ifaddr of Bond0.2503, addr 192.168.3.25/24, subnet 192.168.3.0/24, up true
[2025-03-14T13:08:16.513539473+00:00 monitor_collect:  156 IfAddrs.h:102 INFO] Get ifaddr of Bond0.211, addr 192.168.211.25/24, subnet 192.168.211.0/24, up true
[2025-03-14T13:08:16.513545590+00:00 monitor_collect:  156 IfAddrs.h:102 INFO] Get ifaddr of Bond0.202, addr 192.168.202.25/24, subnet 192.168.202.0/24, up true
[2025-03-14T13:08:16.513550796+00:00 monitor_collect:  156 IfAddrs.h:102 INFO] Get ifaddr of Bond1.193, addr 10.129.185.25/24, subnet 10.129.185.0/24, up true
[2025-03-14T13:08:16.517824232+00:00 monitor_collect:  156 IBDevice.cc:386 WARNING] IfAddr of mlx5_0:1 -> Bond0 not found, maybe running in container!
[2025-03-14T13:08:16.517831953+00:00 monitor_collect:  156 IBDevice.cc:441 CRITICAL] IBDevice mlx5_0:1 can't set zone by IP, fallback to UNKNOWN
[2025-03-14T13:08:16.517864933+00:00 monitor_collect:  156 IBDevice.cc:367 INFO] IBDevice mlx5_0 add active port 1, linklayer ETHERNET, addrs , zones UNKNOWN, RoCE v2 GID 0:0:0:0:0:0:0:0:0:0:ff:ff:c0:a8:3:19
[2025-03-14T13:08:16.517871053+00:00 monitor_collect:  156 IBDevice.cc:256 INFO] IBDevice add mlx5_0, id 0, 1 available ports
[2025-03-14T13:08:16.520906330+00:00 monitor_collect:  156 IBDevice.cc:386 WARNING] IfAddr of mlx5_1:1 -> Bond0 not found, maybe running in container!
[2025-03-14T13:08:16.520910725+00:00 monitor_collect:  156 IBDevice.cc:441 CRITICAL] IBDevice mlx5_1:1 can't set zone by IP, fallback to UNKNOWN
[2025-03-14T13:08:16.520927516+00:00 monitor_collect:  156 IBDevice.cc:367 INFO] IBDevice mlx5_1 add active port 1, linklayer ETHERNET, addrs , zones UNKNOWN, RoCE v2 GID 0:0:0:0:0:0:0:0:0:0:ff:ff:c0:a8:3:19
[2025-03-14T13:08:16.520932036+00:00 monitor_collect:  156 IBDevice.cc:256 INFO] IBDevice add mlx5_1, id 1, 1 available ports
[2025-03-14T13:08:16.524056057+00:00 monitor_collect:  156 IBDevice.cc:386 WARNING] IfAddr of mlx5_2:1 -> Bond0 not found, maybe running in container!
[2025-03-14T13:08:16.524060248+00:00 monitor_collect:  156 IBDevice.cc:441 CRITICAL] IBDevice mlx5_2:1 can't set zone by IP, fallback to UNKNOWN
[2025-03-14T13:08:16.524076743+00:00 monitor_collect:  156 IBDevice.cc:367 INFO] IBDevice mlx5_2 add active port 1, linklayer ETHERNET, addrs , zones UNKNOWN, RoCE v2 GID 0:0:0:0:0:0:0:0:0:0:ff:ff:c0:a8:3:19
[2025-03-14T13:08:16.524080967+00:00 monitor_collect:  156 IBDevice.cc:256 INFO] IBDevice add mlx5_2, id 2, 1 available ports
[2025-03-14T13:08:16.527724009+00:00 monitor_collect:  156 IBDevice.cc:386 WARNING] IfAddr of mlx5_3:1 -> Bond0 not found, maybe running in container!
[2025-03-14T13:08:16.527728256+00:00 monitor_collect:  156 IBDevice.cc:441 CRITICAL] IBDevice mlx5_3:1 can't set zone by IP, fallback to UNKNOWN
[2025-03-14T13:08:16.527744727+00:00 monitor_collect:  156 IBDevice.cc:367 INFO] IBDevice mlx5_3 add active port 1, linklayer ETHERNET, addrs , zones UNKNOWN, RoCE v2 GID 0:0:0:0:0:0:0:0:0:0:ff:ff:c0:a8:3:19
[2025-03-14T13:08:16.527748674+00:00 monitor_collect:  156 IBDevice.cc:256 INFO] IBDevice add mlx5_3, id 3, 1 available ports
[2025-03-14T13:08:16.529142628+00:00 IBManager:  210 EventLoop.cc:116 INFO] EventLoop::loop() started.
[2025-03-14T13:08:16.529273685+00:00 monitor_collect:  156 OnePhaseApplication.h:87 INFO] LogConfig: {"categories":{".":{"level":"INFO","inherit":true,"propagate":"NONE","handlers":["normal","err","fatal"]}},"handlers":{"normal":{"type":"file","options":{"path":"/var/log/3fs/monitor_collector_main.log","async":"true","rotate":"true","max_files":"10","max_file_size":"104857600","rotate_on_open":"false"}},"err":{"type":"file","options":{"level":"ERR","path":"/var/log/3fs/monitor_collector_main-err.log","async":"false","rotate":"true","max_files":"10","max_file_size":"104857600","rotate_on_open":"false"}},"fatal":{"type":"stream","options":{"level":"FATAL","stream":"stderr"}}}}
[2025-03-14T13:08:16.559326006+00:00 monitor_collect:  156 OnePhaseApplication.h:101 FATAL] Setup server failed: RPC::ListenFailed(2011)
*** Aborted at 1741957696 (Unix time, try 'date -d @1741957696') ***
*** Signal 6 (SIGABRT) (0x9c) received by PID 156 (pthread TID 0x7f5f45088600) (linux TID 156) (maybe from PID 156, UID 0) (code: -6), stack trace: ***
    @ 00000000004e9e9f (unknown)
    @ 000000000004251f (unknown)
    @ 00000000000969fc pthread_kill
    @ 0000000000042475 raise
    @ 00000000000287f2 abort
    @ 0000000000551b60 (unknown)
    @ 0000000000551016 (unknown)
    @ 0000000000551269 (unknown)
    @ 0000000000240491 (unknown)
    @ 0000000000262d46 (unknown)
    @ 0000000000029d8f (unknown)
    @ 0000000000029e3f __libc_start_main
    @ 000000000018dfe4 (unknown)
Aborted (core dumped)

@vsxen

vsxen commented Mar 15, 2025

Take a look at this: #178 (comment)

@Icedroid

@vsxen Thanks, that fixed it. However, mgmtd now fails to start as well, even though initialization succeeded.

mgmtd_main:   29 Utils.cc:161 INFO] LogConfig: {"categories":{".":{"level":"INFO","inherit":true,"propagate":"NONE","handlers":["normal","err","fatal"]}},"handlers":{"normal":{"type":"file","options":{"path":"Mgmtd.log","async":"true","rotate":"true","max_files":"100","max_file_size":"10485760","rotate_on_open":"false"}},"err":{"type":"file","options":{"level":"ERR","path":"Mgmtd.err.log","async":"false","rotate":"true","max_files":"100","max_file_size":"10485760","rotate_on_open":"false"}},"fatal":{"type":"stream","options":{"level":"FATAL","stream":"stderr"}}}}
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]: [2025-03-14T16:47:57.680441025+00:00 mgmtd_main:   29 TwoPhaseApplication.h:59 FATAL] Init server failed: RPC::ListenFailed(2011)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]: *** Aborted at 1741970877 (Unix time, try 'date -d @1741970877') ***
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]: *** Signal 6 (SIGABRT) (0x1d) received by PID 29 (pthread TID 0x7fc198aa9600) (linux TID 29) (maybe from PID 29, UID 0) (code: -6), stack trace: ***
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 00000000008302ff (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 000000000004251f (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 00000000000969fc pthread_kill
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 0000000000042475 raise
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 00000000000287f2 abort
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 0000000000897fc0 (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 0000000000897476 (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 00000000008976c9 (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 0000000000231c78 (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 00000000005ac046 (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 000000000022fe97 (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 0000000000029d8f (unknown)
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 0000000000029e3f __libc_start_main
Mar 14 16:47:57 localhost.localdomain mgmtd_main[29]:     @ 000000000022fda4 (unknown)

@Icedroid

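For the mgmtd failure, check tail -f /var/log/3fs/mgmtd_main.log. It turned out that mgmtd_main also listens on port 9000, which conflicted with my local ssh port. Changing the port in mgmtd_main.toml to 9001 resolved it.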
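
A port conflict like this can also be probed before starting the service. The sketch below is not part of 3FS; canBindTcpPort is a hypothetical helper that simply tries to bind a TCP socket to the IPv4 wildcard address on the given port and reports whether bind() succeeded. It assumes a POSIX system.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical helper, not part of 3FS.
// Returns true if a TCP socket can currently be bound to 0.0.0.0:port,
// false if socket() or bind() fails (for example with EADDRINUSE).
bool canBindTcpPort(std::uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;
  }
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  bool ok = ::bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0;
  ::close(fd);  // release the port again; this is only a probe
  return ok;
}
```

Calling canBindTcpPort(9000) on the affected machine would be expected to return false while another process holds that port. The result is only a snapshot, since another process can take or release the port right after the probe, and ports below 1024 usually need extra privileges to bind.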

@234278700
Author

@vsxen This doesn't look like the same problem. Even after removing the bond interface, it still coredumps:

[2025-03-18T07:15:04.921363415+00:00 monitor_collect: 2655 IBDevice.cc:169 INFO] ibdev2netdev: mlx5_0 port 1 ==> eth0 (Up)
[2025-03-18T07:15:04.921363415+00:00 monitor_collect: 2655 IBDevice.cc:169 INFO] mlx5_1 port 1 ==> eth1 (Up)
[2025-03-18T07:15:04.921415536+00:00 monitor_collect: 2655 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_0 => eth0
[2025-03-18T07:15:04.921425657+00:00 monitor_collect: 2655 IBDevice.cc:186 INFO] ibdev2netdev parsed: mlx5_1 => eth1
[2025-03-18T07:15:04.921694684+00:00 monitor_collect: 2655 IfAddrs.h:102 INFO] Get ifaddr of eth0.600, addr 192.168.1.2/24, subnet 192.168.1.0/24, up true
[2025-03-18T07:15:04.924192950+00:00 monitor_collect: 2655 IBDevice.cc:386 WARNING] IfAddr of mlx5_0:1 -> eth0 not found, maybe running in container!
[2025-03-18T07:15:04.924214790+00:00 monitor_collect: 2655 IBDevice.cc:441 CRITICAL] IBDevice mlx5_0:1 can't set zone by IP, fallback to UNKNOWN
[2025-03-18T07:15:04.924295808+00:00 monitor_collect: 2655 IBDevice.cc:367 INFO] IBDevice mlx5_0 add active port 1, linklayer ETHERNET, addrs , zones UNKNOWN, RoCE v2 GID 0:0:0:0:0:0:0:0:0:0:ff:ff:a:c7:25:16
[2025-03-18T07:15:04.924307839+00:00 monitor_collect: 2655 IBDevice.cc:256 INFO] IBDevice add mlx5_0, id 0, 1 available ports
[2025-03-18T07:15:04.926079354+00:00 monitor_collect: 2655 IBDevice.cc:305 INFO] Skip device mlx5_1, port 1 because it's not in device filter.
[2025-03-18T07:15:04.926089484+00:00 monitor_collect: 2655 IBDevice.cc:253 INFO] IBDevice skip mlx5_1 because it doesn't have available ports.
[2025-03-18T07:15:04.927926197+00:00 IBManager: 2692 EventLoop.cc:116 INFO] EventLoop::loop() started.
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] Folly log json configure: {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "categories": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] ".": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "level": "INFO",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "inherit": true,
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "propagate": "NONE",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "handlers": [
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "normal",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "err",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "fatal"
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] ]
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] },
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "handlers": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "normal": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "type": "file",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "options": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "path": "/var/log/3fs/monitor_collector_main.log",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "async": "true",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "rotate": "true",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "max_files": "10",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "max_file_size": "104857600",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "rotate_on_open": "false"
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] },
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "err": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "type": "file",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "options": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "level": "ERR",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "path": "/var/log/3fs/monitor_collector_main-err.log",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "async": "false",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "rotate": "true",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "max_files": "10",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "max_file_size": "104857600",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "rotate_on_open": "false"
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] },
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "fatal": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "type": "stream",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "options": {
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "level": "FATAL",
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] "stream": "stderr"
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928124749+00:00 monitor_collect: 2655 LogConfig.cc:96 INFO] }
[2025-03-18T07:15:04.928180632+00:00 monitor_collect: 2655 OnePhaseApplication.h:87 INFO] LogConfig: {"categories":{".":{"level":"INFO","inherit":true,"propagate":"NONE","handlers":["normal","err","fatal"]}},"handlers":{"normal":{"type":"file","options":{"path":"/var/log/3fs/monitor_collector_main.log","async":"true","rotate":"true","max_files":"10","max_file_size":"104857600","rotate_on_open":"false"}},"err":{"type":"file","options":{"level":"ERR","path":"/var/log/3fs/monitor_collector_main-err.log","async":"false","rotate":"true","max_files":"10","max_file_size":"104857600","rotate_on_open":"false"}},"fatal":{"type":"stream","options":{"level":"FATAL","stream":"stderr"}}}}
Segmentation fault
