feat: Implement systemd-notify directly #26456

srett · 2025-02-21T10:26:28Z

Remove the dependency on sd-notify, which in turn wraps the native libsystemd.so to use the systemd notification mechanism. This also means we have one less build dependency (libsystemd-dev on debian an derivatives).
The protocol is now implemented in TypeScript on top of unix-dgram.

While at it, I also made it so that z2m would exit immediately if loading the sd-notify module fails for whatever reason (unix-dgram unavailable) and z2m is run with a WatchdogTimeout configured, since there is no way to avoid the service getting killed over and over again. The accompanying log message hopefully makes it easier to quickly find the root cause, see #21463 / #26390

Fun fact: While the systemd developers have taken measures to reduce dependencies of libsystemd, this was famously how openssh was backdoored last year, since sshd also used libsystemd for sd_notify, and also loaded liblzma for unrelated reasons, which contained the actual backdoor targeting sshd.

lib/controller.ts

lib/types/unix-dgram.d.ts

lib/util/sd-notify.ts

test/sd-notify.test.ts

lib/util/sd-notify.ts

srett · 2025-02-22T15:01:23Z

Here's attempt v2. Most of the issues should've been resolved, and I even tried to write a proper test.

Since signalling readiness now also automatically starts the watchdog interval if applicable, I decided to rename the functions to what they mean from the controller's pov rather than what they technically do, so ready() + startWatchdog() is now just started(), and stopWatchdog() is now stopped(). (stopping() stays as it is. :-))

Also, all checks regarding availability of unix sockets, and whether we actually run under a service manager using the sd_notify interface now happen in the sd-notify module, so as much logic as possible is moved out of controller.ts. The newly introduced init() will do all the checks and initialization, and throw an exception if the notification protocol should be used, but there was an error when creating the socket.
All functions now become no-ops if not applicable, to avoid any checks and conditionals in controller.ts.

lib/util/sd-notify.ts

Nerivec · 2025-02-22T17:13:40Z

import logger from './logger';

/**
 * Handle sd_notify protocol, @see https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html
 * No-op if running on unsupported platforms or without Type=notify
 */
export async function initSdNotify(): Promise<{stopping: () => void; stop: () => void} | undefined> {
    if (!process.env.NOTIFY_SOCKET) {
        return;
    }

    try {
        const {createSocket} = await import('unix-dgram');
        const socket = createSocket('unix_dgram');
        const sendToSystemd = (msg: string): void => {
            const buffer = Buffer.from(msg);

            socket.send(buffer, 0, buffer.byteLength, process.env.NOTIFY_SOCKET!, (err) => {
                if (err) {
                    logger.warning(`Failed to send "${msg}" to systemd: ${err.message}`);
                }
            });
        };
        const stopping = (): void => sendToSystemd('STOPPING=1');

        sendToSystemd('READY=1');

        const wdUSec = process.env.WATCHDOG_USEC !== undefined ? Math.max(0, parseInt(process.env.WATCHDOG_USEC, 10)) : -1;

        if (wdUSec > 0) {
            // Convert us to ms, send twice as frequently as the timeout
            const watchdogInterval = setInterval(() => sendToSystemd('WATCHDOG=1'), wdUSec / 1000 / 2);

            return {
                stopping,
                stop: (): void => clearInterval(watchdogInterval),
            };
        }

        if (wdUSec !== -1) {
            logger.warning(`WATCHDOG_USEC invalid: "${process.env.WATCHDOG_USEC}", parsed to "${wdUSec}"`);
        }

        return {
            stopping,
            stop: (): void => {},
        };
    } catch (error) {
        // Ignore error on Windows if not running on WSL, as UNIX sockets don't exist on Windows.
        // Unlikely that NOTIFY_SOCKET is set anyways but better be safe.
        if (process.platform !== 'win32' || process.env.WSL_DISTRO_NAME) {
            throw error;
        }
    }
}

I'd go with something like this (quick draft/untested). The return is stored in Controller, and functions are called as needed. That should keep only the minimum amount of refs, and completely no-op when unsupported.

Note: can use Awaited<ReturnType<typeof initSdNotify>> for the ref type in Controller.

@Koenkk what do you think?

Koenkk · 2025-02-22T19:23:06Z

@Nerivec sounds good!

However I'm wondering if all of this code should be directly in Z2M, I think this deserves its own NPM package?

Nerivec · 2025-02-22T19:51:12Z

It's tailored pretty specifically for Z2M though (the whole ready/stop thingy). You'd have to make it more generic for an NPM package, and lose the optimizations.
It's just one function and a few tests. I think it's okay in Z2M. It allows a tiny footprint in Controller (3 lines, one to assign+init, one for stopping, one for stop).

Nerivec · 2025-02-22T21:14:57Z

@srett I added the changes, and full coverage, but I can't actually test it on a real setup at the moment. Can you check?

srett · 2025-02-23T16:25:52Z

@Nerivec can give it the smoke test in my local setup tomorrow.

Regarding the naming of the methods, I opted for past tense since we're calling those two functions when the action in question finished happening, i.e. stopped() after all the subsystems have been stopped, before exit. And started() is called after all the initialization of the zigbee adapter, mqtt connection etc. is done, so when z2m is fully started up. But I'm not married to that scheme though, just thought it's more expressive. We could also call them shutdown() and startupComplete() or whatnot...

There is one potential problem with your merge of init() and started() into initSdNotify() - if there's a problem with initialization of the sd-notify module (for example if loading the shared object for unix-dgram fails), we'll only find out after all the general setup has already taken place, and we announced all the entities to home assistant and so on. That's why I left it at the very top, where the old sd-notify import also happened, so we can bail out before anything else is done, since a broken sd-notify interface to systemd will guarantee that the process will get killed once the startup timeout or watchdog timeout is reached.
Also I'm personally a fan of adding error messages that tell the user what actions they can take to fix the problem. :)

Generally the idea of this whole sd_notify is:

systemd starts the service, passing along NOTIFY_SOCKET and optionally WATCHDOG_USEC
systemd considers the service to be in "Starting" state, for which there is a default timeout of 90 seconds (on most distros)
As soon as the service has finished initialization/startup, it will send READY=1 to signal systemd that it is up and running

(If the service doesn't send READY=1 within that startup timeout, the service is considered hung and will be killed)

If WATCHDOG_USEC is set, from this moment on the service has to send WATCHDOG=1 periodically to tell systemd it is not frozen/deadlocked. If this takes longer than WATCHDOG_USEC, the process will be killed. The recommendation is to ping twice as often as the timeout interval.

(If WATCHDOG_USEC is not set, we don't need to do anything else after 3)

Nerivec · 2025-02-23T17:21:55Z

If loading a module fails we have way bigger problems though, I don't think it's much of an issue (even though we technically have to cover the code path for it in this scenario). The previous code had special handling because the imported module hooked into a non-node system that could fail. The new code should only fail if on unsupported system (i.e. createSocket throws), which means no-op.

Dependency on sd-notify removed. sd-notify is wrapping libsystemd, which requires installing libsystemd-dev for building, and loads libsystemd.so at runtime. Since we only need the sd_notify part of the lib, and the protocol is trivial to implement, do so. We also now bail out early in case a watchdog timeout is set and the notification module could not be loaded, as in this case systemd would constantly kill and restart zigbee2mqtt, resulting in unstable operation that might not be obvious at first sight.

srett · 2025-02-23T19:40:48Z

It would also fail if you don't have a c++ compiler installed, which coincidentally is also the case in the CI, hence the first version failed during the tests with "unix_dgram.node not found".
This would require to make unix-dgram a non-optional dependency if you want to avoid running into this error state. I made it optional as sd-notify was optional too. I assumed this is to make it possible to build z2m in an environment without a c++ compiler if you don't care about sd_notify.

Nerivec · 2025-02-23T20:51:22Z

@Koenkk is --ignore-scripts in CI>tests still relevant? Can we remove it so the installs are in line with regular ones?

Koenkk · 2025-02-23T20:54:04Z

@Nerivec we can try removing it for now, and if it creates problems, only skip it for Windows

srett · 2025-02-25T07:15:37Z

@Nerivec Have been running this for ~20 hours now, all seems good. :)

Have you considered making unix-dgram non-optional? It should basically make the case where we run into an exception when trying to load the library go away. But in either case this is already better than before...

srett force-pushed the sd-notify branch from 5efc595 to f6192c8 Compare February 21, 2025 10:50