Split up IPC messages #3868

Open · clumens wants to merge 24 commits into main from ipc-split

Conversation

@clumens (Contributor) commented Apr 17, 2025

This PR currently causes cts-lab failures in the Reattach and Remote* tests, which I am still in the middle of tracking down.

@clumens added the "status: in progress" label (PRs that aren't ready yet) Apr 17, 2025
}

free(part_text);

if (flags & crm_ipc_server_free) {
    pcmk_free_ipc_event(iov);
}

Contributor Author (@clumens) commented:

I've found this patch useful for debugging purposes, but we don't necessarily have to keep it. One annoyance is that our logging cuts off after a certain number of characters, so very long XML messages won't get logged in their entirety anyway. Still, it's usually enough to see just the beginning to know what's going on.

@clumens (Contributor Author) commented Apr 17, 2025

The "Retry sending in crm_ipcs_flush_events" patch fixes the problem in the Remote* tests, but something is still causing other failures:

Apr 17 16:40:58 Running test SimulStop              (rhel9-ctslab-3) [  6]
Apr 17 16:46:44 Warn: All nodes stopped but CTS didn't detect: ['rhel9-ctslab-2\\W.*Unloading all Corosync service engines']
Apr 17 16:46:44 Test SimulStopLite                  FAILED: Missing log message: ['rhel9-ctslab-2\\W.*Unloading all Corosync service engines']
Apr 17 16:46:44 Test SimulStop                      FAILED: Stopall failed
Apr 17 16:59:48 Running test Reattach               (rhel9-ctslab-1) [ 14]
Apr 17 17:02:10 ResourceActivity: 2025-04-17T17:02:04-0400 rhel9-ctslab-1 pacemaker-controld[16925]:  notice: Result of start operation for stateful-1 on rhel9-ctslab-1: OK
Apr 17 17:03:15 Test Reattach                       FAILED: Resources stopped or started after resource management was re-enabled
Apr 17 17:06:29 Running test NearQuorumPoint        (rhel9-ctslab-3) [ 16]
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Could not contact the scheduler: Resource temporarily unavailable
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Input I_ERROR received in state S_POLICY_ENGINE from do_pe_invoke_callback
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: 7 resources were active at shutdown
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemaker-controld[27993]:  error: Could not recover from internal error
Apr 17 17:07:10 BadNews: 2025-04-17T17:06:37-0400 rhel9-ctslab-1 pacemakerd[27986]:  error: pacemaker-controld[27993] exited with status 1 (Error occurred)

These test failures appear to be somewhat transient. I can't reliably reproduce them on every test every time, but they will happen fairly frequently. It's difficult to tell from looking at the cts-lab logs which failures represent real errors and which are merely due to the lab killing things and restarting VMs.

@clumens removed the "status: in progress" label (PRs that aren't ready yet) Apr 18, 2025
@clumens marked this pull request as ready for review April 18, 2025 15:00
@clumens (Contributor Author) commented Apr 18, 2025

> These test failures appear to be somewhat transient. I can't reliably reproduce them on every test every time, but they will happen fairly frequently. It's difficult to tell from looking at the cts-lab logs which failures represent real errors and which are merely due to the lab killing things and restarting VMs.

This was fixed by the following:

-        if (pcmk_all_flags_set(flags, crm_ipc_multipart)) {
-            bool is_end = pcmk_all_flags_set(flags, crm_ipc_multipart_end);
+        if (pcmk__ipc_msg_is_multipart(header)) {
+            bool is_end = pcmk__ipc_msg_is_multipart_end(header);

@clumens force-pushed the ipc-split branch 2 times, most recently from ce2a405 to 15e03aa on April 21, 2025 13:48
@clumens (Contributor Author) commented Apr 21, 2025

Rebased onto main to reflect that I merged the compression PR; no other code changes.

@nrwahl2 (Contributor) left a comment:

VERY partial. Just getting these early comments out while I continue reviewing.

    g_byte_array_free(client->buffer, TRUE);
}

client->buffer = g_byte_array_sized_new(client->buf_size);

Contributor commented:

> Second, because there's no way to clear out a GByteArray and insert
> new bytes directly at the beginning, we just allocate and free it
> every time we do a read.

Are we sure about that? Looks like setting the size to 0 has the effect of truncating the array. (I checked the implementation to confirm.)
https://docs.gtk.org/glib/type_func.ByteArray.set_size.html
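
For reference, a minimal sketch of what reusing the array might look like,
assuming client->buffer stays a GByteArray * as in this PR:

/* Sketch: reuse the existing GByteArray between reads instead of
 * freeing and reallocating it for every read. */
if (client->buffer == NULL) {
    client->buffer = g_byte_array_sized_new(crm_ipc_default_buffer_size());
} else {
    /* Truncates to zero length but keeps the allocated storage */
    g_byte_array_set_size(client->buffer, 0);
}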


client->buffer[0] = 0;
client->msg_size = qb_ipcc_event_recv(client->ipc, client->buffer,
buffer = pcmk__assert_alloc(crm_ipc_default_buffer_size(), sizeof(char));

@nrwahl2 (Contributor) commented Apr 21, 2025:

We could make buffer a guint8 * and allocate it with g_malloc0() (the GLib equivalent of calloc() with assertion on memory error). Libqb expects a void *, so it doesn't care about the type of buffer.

Then instead of calling g_byte_array_append(), we call g_byte_array_new_take(buffer, client->msg_size), and skip freeing buffer on success. This avoids the copy by transferring ownership of the existing buffer to the byte array.

Now, this only works when creating the byte array. As of this commit, we're only appending to it once unless we truncate it first. If we append more data later without truncating (as I expect we will do), then there's no way for the byte array to take ownership of the data in the buffer in those later appends. Though we could either reuse the fixed-size buffer without freeing it, or do some realloc trickery.

https://docs.gtk.org/glib/type_func.ByteArray.new_take.html
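
A minimal sketch of that suggestion, assuming the client fields used
elsewhere in this PR and with a placeholder timeout argument:

/* Sketch: read into a heap buffer, then hand ownership of it to the
 * byte array, avoiding the copy g_byte_array_append() would make. */
guint8 *buffer = g_malloc0(crm_ipc_default_buffer_size());
ssize_t rc = qb_ipcc_event_recv(client->ipc, buffer,
                                crm_ipc_default_buffer_size(),
                                0 /* ms_timeout: placeholder */);

if (rc > 0) {
    /* new_take() adopts buffer, so don't free it on this path */
    client->buffer = g_byte_array_new_take(buffer, (gsize) rc);
} else {
    g_free(buffer);
}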

Contributor commented:

Another thought: I'm not especially concerned about dynamic memory allocation, but we could avoid it here by using an internal macro constant for buffer size instead of calling crm_ipc_default_buffer_size() in these places. Then we could just have a static array for the buffer.

Up to you. I see the appeal of wanting to use that function consistently even internally, but the buffer size is always the same now.
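
Something along these lines, where the constant name and value are
placeholders rather than existing Pacemaker symbols:

/* Sketch of the macro-constant idea: a fixed-size static read buffer
 * instead of one allocated per read. */
#define PCMK__IPC_BUFFER_SIZE (128 * 1024)

static char read_buffer[PCMK__IPC_BUFFER_SIZE];

...
client->msg_size = qb_ipcc_event_recv(client->ipc, read_buffer,
                                      sizeof(read_buffer), 0);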

qb_rc = qb_ipcs_event_sendv(c->ipcs, event, 2);
do {
    qb_rc = qb_ipcs_event_sendv(c->ipcs, event, 2);
} while (qb_rc == -EAGAIN);

@nrwahl2 (Contributor) left a comment:

Another partial review

@clumens added the "review: in progress" label (PRs that are currently being reviewed) Apr 28, 2025
clumens added 15 commits April 28, 2025 14:37
This has the built-in advantage of various library functions that can be
used to manipulate it instead of using pointer arithmetic.  For the
moment this isn't important, but it will become important later when
we're splitting up IPC messages.

On the other hand, there are a couple of other changes here that are not
advantages.  First, instead of reading directly into the GByteArray, we
now have to allocate a buffer to read into and then append it to the
GByteArray.  This also means we need to be careful about freeing that
buffer.

Second, because there's no way to clear out a GByteArray and insert
new bytes directly at the beginning, we just allocate and free it every
time we do a read.  This also means we need to be more careful about
checking whether it's NULL.  This will be less of a problem later when
we are reading into the same GByteArray multiple times for a split-up
message.
...instead of an xmlNode.  The idea here is that, to deal with split-up
messages, this function can be called repeatedly from a loop.  Each call
will prepare the next chunk of the message to be sent.  The first step
in doing that is simply changing the type of the parameter.  This does
result in a little code duplication in the callers, but there are only a
couple.
Functions should return this error code if a single IPC message is too
large to fit into a single buffer and so has to be split up into
multiple chunks.  It can also be returned when one chunk of a split-up
message has been read, indicating that the caller needs to continue
reading.
The function now prepares only as much for one I/O vector as will fit in
a single buffer.

The idea is that you can call the function repeatedly in a loop,
preparing and transmitting individual chunks of a single XML message
(see the sketch after this list of commits).  As long as
pcmk__ipc_prepare_iov returns pcmk_rc_ipc_more, you know there's more
that needs to be prepared.  Keep a running count of the bytes it has
prepared and pass that in as the offset.

Note that clients are not doing this yet, so very large IPC messages are
still going to fail.
This does technically add flags to a public enum, but on the other hand
they are added after the "these are for pacemaker's internal use only"
comment, and we are hoping to eventually make this entire enum private.
We could do this in the callers, but this function has everything it
needs to decide which flags should be set.
…ges.

If the IPC message is too large to fit in a single buffer, this function
now knows how to loop over the message and send it one chunk at a time.
This handles the sending side for servers that use the newer style IPC
code.  This also takes care of sending IPC events from a server.
The libqb read buffer is now a fixed size, so we can just use
crm_ipc_default_buffer_size() anywhere we need that.  And anywhere we
need to know the size of the buffer (nowhere at the moment, but possibly
in the future), we can just use GByteArray->len to figure that out.  So
there's no need to keep an extra struct member around anymore.
This function can be called in a loop on the receive side of new-style
IPC code to build up a complete IPC message from individual parts.
We create a new fixed-size temporary buffer inside crm_ipc_send and
receive from libqb into that in order to keep changes to a minimum.
Then, we add the contents of that temporary buffer to the client's IPC
buffer, which is allowed to grow as more of the message is received.

Also note that crm_ipc_send has an extra receive block at the top for
reading and discarding replies that previously timed out.  This block
also has to be modified to handle multipart messages, but since we are
just throwing those away, there's not much to do.
This is similar to internal_ipc_get_reply, except it deals with IPC
events and doesn't use timeouts.
We need this for the same reason that there's a similar member in
crm_ipc_s: we need somewhere to store a potentially split-up IPC
message while we are reading it piece by piece.
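
A rough sketch of the send-side loop described in the pcmk__ipc_prepare_iov
commit above; the exact parameter list is illustrative, and only the
pcmk_rc_ipc_more protocol and the running byte offset come from the commit
message:

/* Sketch: keep preparing and sending chunks until the whole message
 * has been covered.  The parameter list here is hypothetical. */
size_t offset = 0;
int rc = pcmk_rc_ipc_more;

while (rc == pcmk_rc_ipc_more) {
    struct iovec *iov = NULL;
    size_t bytes_prepared = 0;

    rc = pcmk__ipc_prepare_iov(request_id, message, offset,
                               &iov, &bytes_prepared);
    if ((rc != pcmk_rc_ok) && (rc != pcmk_rc_ipc_more)) {
        break;  /* a real error, not "more chunks to prepare" */
    }

    /* ... transmit iov, for example with qb_ipcs_event_sendv() ... */

    offset += bytes_prepared;
    pcmk_free_ipc_event(iov);
}
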
clumens added 9 commits April 29, 2025 11:26
...at least, when it's sending a server event.  In this case, the iov
will be added to the send queue and EAGAIN will be returned.  We will
attempt to send the event again the next time pcmk__ipc_send_iov is
called, or when the timer pops to flush the event queue.

Thus, we are already handling the EAGAIN return code properly elsewhere
and just need to proceed as if everything will be fine.
header->size doesn't mean the same thing anymore now that we have split
up IPC messages.  If pcmk__client_data2xml is given an IPC message that
did not have to be split up, then the size member will be the same as
the size of the entire message.  However, if it was given an IPC message
that had to be split up for transmit and then reassembled, the size
member only refers to the size of the last chunk received.

Thus, checking for a NULL character at the position given by size would
have us checking somewhere in the middle of the reassembled string,
which is obviously wrong.
This is basically how every daemon is going to work (see the sketch
below).  The daemon needs to delay calling pcmk__client_data2xml until
it has the complete message.  The hard work of reassembling the message
and determining whether it's a multipart message in the first place is
already done elsewhere.
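
A rough sketch of that daemon-side pattern.  The callback and header type
names are placeholders; only pcmk__ipc_msg_is_multipart(),
pcmk__ipc_msg_is_multipart_end(), and pcmk__client_data2xml() are taken
from this PR, and the rest, including exact argument lists, is
illustrative with the surrounding flow simplified:

/* Sketch of a daemon dispatch callback that waits for the final chunk
 * before converting the reassembled message to XML. */
static int32_t
example_ipc_dispatch(qb_ipcs_connection_t *c, void *data, size_t size)
{
    uint32_t id = 0;
    uint32_t flags = 0;
    pcmk__client_t *client = pcmk__find_client(c);
    struct example_ipc_header *header = data;   /* placeholder type */
    xmlNode *msg = NULL;

    if (pcmk__ipc_msg_is_multipart(header)
        && !pcmk__ipc_msg_is_multipart_end(header)) {
        /* Reassembly happens in library code; just wait for more chunks */
        return 0;
    }

    /* Only now is the complete message available */
    msg = pcmk__client_data2xml(client, data, &id, &flags);
    if (msg != NULL) {
        /* ... handle the request, free msg, and send a reply ... */
    }
    return 0;
}
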
@clumens (Contributor Author) commented Apr 29, 2025

I've updated this in various ways that make it worth looking at from the beginning again:

  • Rebased on main, unfortunately.
  • Added a new patch that adds a pcmk_rc_ipc_more return code and stopped using EAGAIN, except to interface with libqb, throughout.
  • Replaced the patch that looped on EAGAIN with one that handles that return code differently for server events.

I have one concern here that comes up occasionally in testing. I've been setting the buffer size to something small, like 16k, to ensure that messages get split up so I can test all the splitting/combining code. This results in a lot more IPC traffic, which results in a lot more EAGAIN return codes, and also sometimes means things don't work - for instance, remote nodes sometimes hang when you try to run cibadmin -Q on them.

On the one hand, this is kind of an invented error. The buffer size is never going to be that small. I've created this problem by making it easier to test. On the other hand, I'm concerned that with a very large cluster, we're basically going to see the same problem where there's a lot of IPC traffic, but just with large messages instead of small ones.

Labels: review: in progress
2 participants