Commit d9a7e0e

Clarify WAN queue capacity behaviour [v/5.4] (#1652)
Backport of #1637

We only expose the primary events queue when we talk about "outbound queue size", which makes sense from a practical standpoint, but it can lead to confusion when you encounter scenarios such as partition promotions during migration. This can lead to metrics showing the WAN queue size exceeding the configured capacity, but it's only a shift of existing data, not a population of new data. Without proper explanation, customers can be confused and concerned that the memory footprint of their WAN queues has suddenly increased, even though this is not the case.

This PR adds more information when discussing queue capacity in WAN, and also adds a bit more detail to the outbound queue metric description.

Fixes https://hazelcast.atlassian.net/browse/HZG-308

Co-authored-by: James Holgate <130981049+JamesHazelcast@users.noreply.github.com>
1 parent 9e48cb7 commit d9a7e0e

2 files changed: 14 additions, 3 deletions

docs/modules/ROOT/pages/list-of-metrics.adoc

Lines changed: 1 addition & 1 deletion
@@ -2269,7 +2269,7 @@ Based on your latency tolerance in your business use case, you can define a thre
 
 |`wan.outboundQueueSize`
 |count
-|Outbound WAN queue size on this member
+|Total number of WAN events currently placed in the WAN queues of primary partitions on this member
 
 |`wan.removeCount`
 |count

docs/modules/wan/pages/tuning.adoc

Lines changed: 13 additions & 2 deletions
@@ -181,7 +181,18 @@ For clusters with high data mutation rates or with long expected periods of disr
 you might need to increase the replication queue size. The default queue size for replication queues is `10000`.
 This means, if you have heavy put/update/remove rates or if the target/passive cluster is unavailable for too long,
 you might exceed the queue size so that the oldest, not yet replicated, updates might get lost.
-Note that a separate queue is used for each WAN Replication configured for IMap and ICache.
+Separate queues are used for each WAN Replication configured for IMap and ICache.
+
+Two queues are used for each WAN Replication: Primary and Backup. The primary queue is offered events from owned
+partitions, while the backup queue is offered events from partitions owned by other members.
+The configured queue capacity applies separately to each of the primary and backup event queues.
+Metrics for WAN outbound queue sizes report the primary event queue size.
+
+NOTE: During partition migrations, particularly when a cluster size shrinks, some elements in the backup events queue
+can be promoted to the primary events queue, resulting in an increased outbound queue size which can exceed the configured
+queue capacity. The total memory footprint of both queues combined will not increase in this case as the backup events
+queue was already in memory, so this temporary increase in the primary queue size is expected and necessary to prevent
+WAN event data loss.
 
 Queue capacity can be set for each target cluster by modifying the related `WanBatchPublisherConfig`.
 
@@ -892,4 +903,4 @@ multiplying this duration by the connection's total failed connection attempts,
 used in the back-off strategy for the connection health checks. Default is `12`.
 
 NOTE: You can enable the legacy behavior of static endpoints, where the connection health is not checked
-by setting the `hazelcast.wan.static.discovery.legacy` property to `true`.
+by setting the `hazelcast.wan.static.discovery.legacy` property to `true`.
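
For reviewers, the promotion scenario described in the commit message and in the added NOTE can be modelled with a small toy simulation. This is only a sketch and not Hazelcast code: the `WanQueues` class and all of its method names are hypothetical, and rejecting an offer when a queue is full is a simplification (the real behaviour on a full queue depends on configuration). It assumes, as the NOTE describes, that promotion moves already-held backup events into the primary queue without applying the capacity check.

```python
from collections import deque


class WanQueues:
    """Toy model (hypothetical, not Hazelcast API) of one member's WAN event queues."""

    def __init__(self, capacity):
        # The configured capacity applies separately to each queue.
        self.capacity = capacity
        self.primary = deque()  # events from partitions this member owns
        self.backup = deque()   # events from partitions owned by other members

    def _offer(self, queue, event):
        # Simplification: reject the event when this queue is at capacity.
        if len(queue) >= self.capacity:
            return False
        queue.append(event)
        return True

    def offer_primary(self, event):
        return self._offer(self.primary, event)

    def offer_backup(self, event):
        return self._offer(self.backup, event)

    def promote_backups(self):
        # Assumption taken from the NOTE: promotion shifts existing events
        # and is not limited by the configured capacity.
        moved = len(self.backup)
        self.primary.extend(self.backup)
        self.backup.clear()
        return moved

    def outbound_queue_size(self):
        # Mirrors what the metric is documented to report: primary events only.
        return len(self.primary)

    def total_events(self):
        # Stand-in for the combined footprint of both queues.
        return len(self.primary) + len(self.backup)


if __name__ == "__main__":
    queues = WanQueues(capacity=3)
    for i in range(3):
        queues.offer_primary(f"primary-{i}")
        queues.offer_backup(f"backup-{i}")

    before = (queues.outbound_queue_size(), queues.total_events())
    queues.promote_backups()
    after = (queues.outbound_queue_size(), queues.total_events())
    print("before promotion (outbound, total):", before)
    print("after promotion  (outbound, total):", after)
```

In this model the reported outbound size rises above the configured capacity after promotion while the combined event count stays the same, which is the distinction the new documentation text is meant to make explicit.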
