Split up IPC messages #3868

Open · clumens wants to merge 24 commits into main from ipc-split

Conversation

@clumens (Contributor) commented Apr 17, 2025

This PR currently causes cts-lab failures in the Reattach and Remote* tests, which I am still in the middle of tracking down.

@clumens added the "status: in progress" label (PRs that aren't ready yet) Apr 17, 2025
}

free(part_text);

if (flags & crm_ipc_server_free) {
    pcmk_free_ipc_event(iov);
}

Contributor Author (@clumens) commented:

I've found this patch useful for debugging purposes, but we don't necessarily have to keep it. One annoyance is that our logging cuts off after a certain number of characters, so very long XML messages won't get logged in their entirety anyway. Still, it's usually enough to see just the beginning to know what's going on.

@clumens (Contributor Author) commented Apr 17, 2025

The "Retry sending in crm_ipcs_flush_events" patch fixes the problem in the Remote* tests, but something is still causing other failures:

Apr 17 16:40:58 Running test SimulStop              (rhel9-ctslab-3) [  6]
Apr 17 16:46:44 Warn: All nodes stopped but CTS didn't detect: ['rhel9-ctslab-2\\W.*Unloading all Corosync service engines']
Apr 17 16:46:44 Test SimulStopLite                  FAILED: Missing log message: ['rhel9-ctslab-2\\W.*Unloading all Corosync service engines']
Apr 17 16:46:44 Test SimulStop                      FAILED: Stopall failed
Apr 17 16:59:48 Running test Reattach               (rhel9-ctslab-1) [ 14]
Apr 17 17:02:10 ResourceActivity: 2025-04-17T17:02:04-0400 rhel9-ctslab-1 pacemaker-controld[16925]:  notice: Result of start operation for stateful-1 on rhel9-ctslab-1: OK
Apr 17 17:03:15 Test Reattach                       FAILED: Resources stopped or started after resource management was re-enabled
Apr 17 17:06:29 Running test NearQuorumPoint        (rhel9-ctslab-3) [ 16]
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Could not contact the scheduler: Resource temporarily unavailable
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Input I_ERROR received in state S_POLICY_ENGINE from do_pe_invoke_callback
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: 7 resources were active at shutdown
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Could not recover from internal error
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemakerd[27986]:  error: pacemaker-controld[27993] exited with status 1 (Error occurred)

These test failures appear to be somewhat transient. I can't reliably reproduce them on every test every time, but they will happen fairly frequently. It's difficult to tell from looking at the cts-lab logs which failures represent real errors and which are merely due to the lab killing things and restarting VMs.

@clumens removed the "status: in progress" label (PRs that aren't ready yet) Apr 18, 2025
@clumens marked this pull request as ready for review April 18, 2025 15:00
@clumens (Contributor Author) commented Apr 18, 2025

> These test failures appear to be somewhat transient. I can't reliably reproduce them on every test every time, but they will happen fairly frequently. It's difficult to tell from looking at the cts-lab logs which failures represent real errors and which are merely due to the lab killing things and restarting VMs.

This was fixed by the following:

-        if (pcmk_all_flags_set(flags, crm_ipc_multipart)) {
-            bool is_end = pcmk_all_flags_set(flags, crm_ipc_multipart_end);
+        if (pcmk__ipc_msg_is_multipart(header)) {
+            bool is_end = pcmk__ipc_msg_is_multipart_end(header);

@clumens force-pushed the ipc-split branch 2 times, most recently from ce2a405 to 15e03aa on April 21, 2025 13:48
@clumens (Contributor Author) commented Apr 21, 2025

Rebased onto main to reflect that I merged the compression PR; no other code changes.

@nrwahl2 (Contributor) left a comment:

VERY partial. Just getting these early comments out while I continue reviewing.

    g_byte_array_free(client->buffer, TRUE);
}

client->buffer = g_byte_array_sized_new(client->buf_size);

Contributor commented:

> Second, because there's no way to clear out a GByteArray and insert
> new bytes directly at the beginning, we just allocate and free it
> every time we do a read.

Are we sure about that? Looks like setting the size to 0 has the effect of truncating the array. (I checked the implementation to confirm.)
https://docs.gtk.org/glib/type_func.ByteArray.set_size.html
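
For reference, a minimal sketch of what reusing the array might look like,
assuming client->buffer stays a GByteArray * as in this PR:

/* Sketch: reuse the existing GByteArray between reads instead of
 * freeing and reallocating it for every read. */
if (client->buffer == NULL) {
    client->buffer = g_byte_array_sized_new(crm_ipc_default_buffer_size());
} else {
    /* Truncates to zero length but keeps the allocated storage */
    g_byte_array_set_size(client->buffer, 0);
}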


client->buffer[0] = 0;
client->msg_size = qb_ipcc_event_recv(client->ipc, client->buffer,
buffer = pcmk__assert_alloc(crm_ipc_default_buffer_size(), sizeof(char));

@nrwahl2 (Contributor) commented Apr 21, 2025:

We could make buffer a guint8 * and allocate it with g_malloc0() (the GLib equivalent of calloc() with assertion on memory error). Libqb expects a void *, so it doesn't care about the type of buffer.

Then instead of calling g_byte_array_append(), we call g_byte_array_new_take(buffer, client->msg_size), and skip freeing buffer on success. This avoids the copy by transferring ownership of the existing buffer to the byte array.

Now, this only works when creating the byte array. As of this commit, we're only appending to it once unless we truncate it first. If we append more data later without truncating (as I expect we will do), then there's no way for the byte array to take ownership of the data in the buffer in those later appends. Though we could either reuse the fixed-size buffer without freeing it, or do some realloc trickery.

https://docs.gtk.org/glib/type_func.ByteArray.new_take.html
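
A minimal sketch of that suggestion, assuming the client fields used
elsewhere in this PR and with a placeholder timeout argument:

/* Sketch: read into a heap buffer, then hand ownership of it to the
 * byte array, avoiding the copy g_byte_array_append() would make. */
guint8 *buffer = g_malloc0(crm_ipc_default_buffer_size());
ssize_t rc = qb_ipcc_event_recv(client->ipc, buffer,
                                crm_ipc_default_buffer_size(),
                                0 /* ms_timeout: placeholder */);

if (rc > 0) {
    /* new_take() adopts buffer, so don't free it on this path */
    client->buffer = g_byte_array_new_take(buffer, (gsize) rc);
} else {
    g_free(buffer);
}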

Contributor commented:

Another thought: I'm not especially concerned about dynamic memory allocation, but we could avoid it here by using an internal macro constant for buffer size instead of calling crm_ipc_default_buffer_size() in these places. Then we could just have a static array for the buffer.

Up to you. I see the appeal of wanting to use that function consistently even internally, but the buffer size is always the same now.
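
Something along these lines, where the constant name and value are
placeholders rather than existing Pacemaker symbols:

/* Sketch of the macro-constant idea: a fixed-size static read buffer
 * instead of one allocated per read. */
#define PCMK__IPC_BUFFER_SIZE (128 * 1024)

static char read_buffer[PCMK__IPC_BUFFER_SIZE];

...
client->msg_size = qb_ipcc_event_recv(client->ipc, read_buffer,
                                      sizeof(read_buffer), 0);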

qb_rc = qb_ipcs_event_sendv(c->ipcs, event, 2);
do {
    qb_rc = qb_ipcs_event_sendv(c->ipcs, event, 2);
} while (qb_rc == -EAGAIN);

@nrwahl2 (Contributor) left a comment:

Another partial review

@clumens added the "review: in progress" label (PRs that are currently being reviewed) Apr 28, 2025
clumens added 15 commits April 28, 2025 14:37
This has the built-in advantage of various library functions that can be
used to manipulate it instead of using pointer arithmetic.  For the
moment this isn't important, but it will become important later when
we're splitting up IPC messages.

On the other hand, there are a couple of other changes here that are not
advantages.  First, instead of reading directly into the GByteArray, we
now have to allocate a buffer to read into and then append it to the
GByteArray.  This also means we need to be careful about freeing that
buffer.

Second, because there's no way to clear out a GByteArray and insert
new bytes directly at the beginning, we just allocate and free it every
time we do a read.  This also means we need to be more careful about
checking whether it's NULL.  This will be less of a problem later when
we are reading into the same GByteArray multiple times for a split-up
message.
...instead of an xmlNode.  The idea here is that, to deal with split-up
messages, this function can be called repeatedly from a loop.  Each call
will prepare the next chunk of the message to be sent.  The first step
in doing that is simply changing the type of the parameter.  This does
result in a little code duplication in the callers, but there are only a
couple.
Functions should return this error code if a single IPC message is too
large to fit into a single buffer and so has to be split up into
multiple chunks.  It can also be returned when one chunk of a split-up
message has been read, indicating that the caller needs to continue
reading.
The function now prepares only as much for one I/O vector as will fit in
a single buffer.

The idea is that you can call the function repeatedly in a loop,
preparing and transmitting individual chunks of a single XML message
(see the sketch after this list of commits).  As long as
pcmk__ipc_prepare_iov returns pcmk_rc_ipc_more, you know there's more
that needs to be prepared.  Keep a running count of the bytes it has
prepared and pass that in as the offset.

Note that clients are not doing this yet, so very large IPC messages are
still going to fail.
This does technically add flags to a public enum, but on the other hand
they are added after the "these are for pacemaker's internal use only"
comment, and we are hoping to eventually make this entire enum private.
We could do this in the callers, but this function has everything it
needs to decide which flags should be set.
…ges.

If the IPC message is too large to fit in a single buffer, this function
now knows how to loop over the message and send it one chunk at a time.
This handles the sending side for servers that use the newer style IPC
code.  This also takes care of sending IPC events from a server.
The libqb read buffer is now a fixed size, so we can just use
crm_ipc_default_buffer_size() anywhere we need that.  And anywhere we
need to know the size of the buffer (nowhere at the moment, but possibly
in the future), we can just use GByteArray->len to figure that out.  So
there's no need to keep an extra struct member around anymore.
This function can be called in a loop on the receive side of new-style
IPC code to build up a complete IPC message from individual parts.
We create a new fixed-size temporary buffer inside crm_ipc_send and
receive from libqb into that in order to keep changes to a minimum.
Then, we add the contents of that temporary buffer to the client's IPC
buffer, which is allowed to grow as more of the message is received.

Also note that crm_ipc_send has an extra receive block at the top for
reading and discarding replies that previously timed out.  This block
also has to be modified to handle multipart messages, but since we are
just throwing those away, there's not much to do.
This is similar to internal_ipc_get_reply, except it deals with IPC
events and doesn't use timeouts.
We need this for the same reason that there's a similar member in
crm_ipc_s: we need somewhere to store a potentially split-up IPC
message while we are reading it piece by piece.
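
A rough sketch of the send-side loop described in the pcmk__ipc_prepare_iov
commit above; the exact parameter list is illustrative, and only the
pcmk_rc_ipc_more protocol and the running byte offset come from the commit
message:

/* Sketch: keep preparing and sending chunks until the whole message
 * has been covered.  The parameter list here is hypothetical. */
size_t offset = 0;
int rc = pcmk_rc_ipc_more;

while (rc == pcmk_rc_ipc_more) {
    struct iovec *iov = NULL;
    size_t bytes_prepared = 0;

    rc = pcmk__ipc_prepare_iov(request_id, message, offset,
                               &iov, &bytes_prepared);
    if ((rc != pcmk_rc_ok) && (rc != pcmk_rc_ipc_more)) {
        break;  /* a real error, not "more chunks to prepare" */
    }

    /* ... transmit iov, for example with qb_ipcs_event_sendv() ... */

    offset += bytes_prepared;
    pcmk_free_ipc_event(iov);
}
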
clumens added 9 commits April 29, 2025 11:26
...at least, when it's sending a server event.  In this case, the iov
will be added to the send queue and EAGAIN will be returned.  We will
attempt to send the event again the next time pcmk__ipc_send_iov is
called, or when the timer pops to flush the event queue.

Thus, we are already handling the EAGAIN return code properly elsewhere
and just need to proceed as if everything will be fine.
header->size doesn't mean the same thing anymore now that we have split
up IPC messages.  If pcmk__client_data2xml is given an IPC message that
did not have to be split up, then the size member will be the same as
the size of the entire message.  However, if it was given an IPC message
that had to be split up for transmit and then reassembled, the size
member only refers to the size of the last chunk received.

Thus, checking for a NULL character at the position given by size would
have us checking somewhere in the middle of the reassembled string,
which is obviously wrong.
This is basically how every daemon is going to work (see the sketch
below).  The daemon needs to delay calling pcmk__client_data2xml until
it has the complete message.  The hard work of reassembling the message
and determining whether it's a multipart message in the first place is
already done elsewhere.
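
A rough sketch of that daemon-side pattern.  The callback and header type
names are placeholders; only pcmk__ipc_msg_is_multipart(),
pcmk__ipc_msg_is_multipart_end(), and pcmk__client_data2xml() are taken
from this PR, and the rest, including exact argument lists, is
illustrative with the surrounding flow simplified:

/* Sketch of a daemon dispatch callback that waits for the final chunk
 * before converting the reassembled message to XML. */
static int32_t
example_ipc_dispatch(qb_ipcs_connection_t *c, void *data, size_t size)
{
    uint32_t id = 0;
    uint32_t flags = 0;
    pcmk__client_t *client = pcmk__find_client(c);
    struct example_ipc_header *header = data;   /* placeholder type */
    xmlNode *msg = NULL;

    if (pcmk__ipc_msg_is_multipart(header)
        && !pcmk__ipc_msg_is_multipart_end(header)) {
        /* Reassembly happens in library code; just wait for more chunks */
        return 0;
    }

    /* Only now is the complete message available */
    msg = pcmk__client_data2xml(client, data, &id, &flags);
    if (msg != NULL) {
        /* ... handle the request, free msg, and send a reply ... */
    }
    return 0;
}
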
@clumens (Contributor Author) commented Apr 29, 2025

I've updated this in various ways that make it worth looking at from the beginning again:

  • Rebased on main, unfortunately.
  • Added a new patch that adds a pcmk_rc_ipc_more return code and stopped using EAGAIN, except to interface with libqb, throughout.
  • Replaced the patch that looped on EAGAIN with one that handles that return code differently for server events.

I have one concern here that comes up occasionally in testing. I've been setting the buffer size to something small, like 16k, to ensure that messages get split up so I can test all the splitting/combining code. This results in a lot more IPC traffic, which results in a lot more EAGAIN return codes, and also sometimes means things don't work - for instance, remote nodes sometimes hang when you try to run cibadmin -Q on them.

On the one hand, this is kind of an invented error. The buffer size is never going to be that small. I've created this problem by making it easier to test. On the other hand, I'm concerned that with a very large cluster, we're basically going to see the same problem where there's a lot of IPC traffic, but just with large messages instead of small ones.

Labels: review: in progress
2 participants