Clarify WAN queue capacity behaviour [v/5.4] (#1652)

github-actions[bot] · JamesHazelcast · web-flow · commit d9a7e0e2226c · 2025-04-08T14:29:26.000+01:00
Backport of #1637

We only expose the primary events queue when we talk about "outbound queue size", which makes sense from a practical standpoint, but it can lead to confusion when you encounter scenarios such as partition promotions during migration. This can lead to metrics showing the WAN queue size exceeding the configured capacity, but it's only a shift of existing data, not a population of new data.

Without proper explanation, customers can be confused and concerned that the memory footprint of their WAN queues has suddenly increased, even though this is not the case.

This PR adds more information when discussing queue capacity in WAN, and also adds a bit more detail to the outbound queue metric description.

Fixes https://hazelcast.atlassian.net/browse/HZG-308

Co-authored-by: James Holgate <130981049+JamesHazelcast@users.noreply.github.com>
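For reviewers who want to see the promotion scenario described above in miniature, here is a rough sketch. It is a toy model only: `WanQueueModel`, its methods, and the "reject when full" rule are hypothetical names and assumptions made for illustration, not Hazelcast internals or the real queue-full behaviour. It only shows how moving existing backup events into the primary queue can push the reported primary size past the configured capacity while the combined size stays the same.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: names and behaviour are illustrative, not Hazelcast internals.
class WanQueueModel {
    private final int capacity; // applied separately to each of the two queues
    private final Deque<String> primary = new ArrayDeque<>();
    private final Deque<String> backup = new ArrayDeque<>();

    WanQueueModel(int capacity) {
        this.capacity = capacity;
    }

    // Assumption for this sketch: a full queue simply rejects new events.
    boolean offerPrimary(String event) {
        if (primary.size() >= capacity) {
            return false;
        }
        primary.addLast(event);
        return true;
    }

    boolean offerBackup(String event) {
        if (backup.size() >= capacity) {
            return false;
        }
        backup.addLast(event);
        return true;
    }

    // Promotion moves events that are already in memory from backup to primary.
    // Capacity is deliberately not enforced here, so no event is dropped.
    int promote(int count) {
        int moved = 0;
        while (moved < count && !backup.isEmpty()) {
            primary.addLast(backup.pollFirst());
            moved++;
        }
        return moved;
    }

    int primarySize() {
        return primary.size();
    }

    int backupSize() {
        return backup.size();
    }

    int totalSize() {
        return primary.size() + backup.size();
    }

    public static void main(String[] args) {
        WanQueueModel model = new WanQueueModel(3);
        for (String e : new String[] {"p1", "p2", "p3"}) {
            model.offerPrimary(e);
        }
        for (String e : new String[] {"b1", "b2"}) {
            model.offerBackup(e);
        }
        int totalBefore = model.totalSize();
        model.promote(2); // e.g. a member leaves and backups become primaries
        System.out.println("primary=" + model.primarySize()
                + " total=" + model.totalSize()
                + " totalBefore=" + totalBefore);
    }
}
```

In this model the primary size (the figure the outbound queue size metric corresponds to) ends up above the capacity of 3 after promotion, while `totalSize()` is unchanged, which mirrors the "shift of existing data" point above.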
diff --git a/docs/modules/ROOT/pages/list-of-metrics.adoc b/docs/modules/ROOT/pages/list-of-metrics.adoc
@@ -2269,7 +2269,7 @@ Based on your latency tolerance in your business use case, you can define a thre
 
 |`wan.outboundQueueSize`
 |count
-|Outbound WAN queue size on this member
+|Total number of WAN events currently placed in the WAN queues of primary partitions on this member
 
 |`wan.removeCount`
 |count
diff --git a/docs/modules/wan/pages/tuning.adoc b/docs/modules/wan/pages/tuning.adoc
@@ -181,7 +181,18 @@ For clusters with high data mutation rates or with long expected periods of disr
 you might need to increase the replication queue size. The default queue size for replication queues is `10000`.
 This means, if you have heavy put/update/remove rates or if the target/passive cluster is unavailable for too long,
 you might exceed the queue size so that the oldest, not yet replicated, updates might get lost.
-Note that a separate queue is used for each WAN Replication configured for IMap and ICache.
+Separate queues are used for each WAN Replication configured for IMap and ICache.
+
+Two queues are used for each WAN Replication: Primary and Backup. The primary queue is offered events from owned
+partitions, while the backup queue is offered events from partitions owned by other members.
+The configured queue capacity applies separately to each of the primary and backup event queues.
+Metrics for WAN outbound queue sizes report the primary event queue size.
+
+NOTE: During partition migrations, particularly when a cluster size shrinks, some elements in the backup events queue
+can be promoted to the primary events queue, resulting in an increased outbound queue size which can exceed the configured
+queue capacity. The total memory footprint of both queues combined will not increase in this case as the backup events
+queue was already in memory, so this temporary increase in the primary queue size is expected and necessary to prevent
+WAN event data loss.
 
 Queue capacity can be set for each target cluster by modifying the related `WanBatchPublisherConfig`.
 
@@ -892,4 +903,4 @@ multiplying this duration by the connection's total failed connection attempts,
 used in the back-off strategy for the connection health checks. Default is `12`.
 
 NOTE: You can enable the legacy behavior of static endpoints, where the connection health is not checked
-by setting  the `hazelcast.wan.static.discovery.legacy` property to `true`.
+by setting  the `hazelcast.wan.static.discovery.legacy` property to `true`.