
[9.0] New threadpool-based merge scheduler which is disk space aware #129134


Conversation

albertzaharovits
Contributor

@albertzaharovits albertzaharovits commented Jun 9, 2025

This is the backport of a few PRs related to the implementation of the new threadpool-based merge scheduler.
The new merge scheduler uses a node-level threadpool (sized to the number of CPU cores) to execute all merges across all shards on the node, limiting the number of concurrently executing merges irrespective of how many shards the node hosts. Smaller merges continue to take priority over larger ones.
In addition, the new merge scheduler monitors the available disk space on the node, so that it won't start executing any new merges when disk space becomes scarce, i.e. when the used disk space rises above the indices.merge.disk.watermark.high limit (95%, the same as the allocation flood stage, the limit at which shards on the node flip to read-only).
The new merge scheduler is now enabled by default (indices.merge.scheduler.use_thread_pool is true).
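For reference, here is a minimal sketch of how the settings mentioned above could look in elasticsearch.yml; the values shown are just the defaults stated in this description, so check the 9.0 docs for the authoritative syntax:

```yaml
# Sketch only; setting names are from this PR's description, values are the stated defaults.
indices.merge.scheduler.use_thread_pool: true   # enable the threadpool-based merge scheduler
indices.merge.disk.watermark.high: "95%"        # don't start new merges above this used-disk level
```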

Here is the complete list of backported PRs:

See also: #129152
Relates: ES-11701 ES-10046

albertzaharovits and others added 9 commits June 9, 2025 08:13
This adds a new merge scheduler implementation that uses a (new)
dedicated thread pool to run the merges. This way the number of
concurrent merges is limited to the number of threads in the pool
(i.e. the number of processors allocated to the ES JVM).

It implements dynamic IO throttling (roughly the same target IO rate
for all merges, with caveats) that is adjusted based on the number
of currently active (queued + running) merges.
Smaller merges are always preferred to larger ones, irrespective of
the index shard they come from.
The implementation also supports the per-shard "max thread count"
and "max merge count" settings, the latter being the one used today for indexing throttling.
Note that IO throttling, max merge count, and max thread count work similarly,
but not identically, to their counterparts in the ConcurrentMergeScheduler.

The per-shard merge statistics are not affected, and the thread-pool statistics should
reflect them (i.e. the thread pool's completed-task count reflects the total
number of merges executed, across shards, per node).
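To illustrate the scheduling policy described in this commit, here is a small, hypothetical Java sketch (not the actual Elasticsearch implementation): a fixed number of worker threads, one per available processor, draining a single node-wide priority queue so that the smallest pending merge always runs first.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch (not the real Elasticsearch code): N worker threads,
// N = available processors, draining one node-wide queue that always yields
// the smallest pending merge, regardless of which shard submitted it.
class ThreadPoolMergeSchedulerSketch {

    record MergeTask(long estimatedSizeInBytes, Runnable doMerge) {}

    private final PriorityBlockingQueue<MergeTask> queue =
        new PriorityBlockingQueue<>(64, Comparator.comparingLong(MergeTask::estimatedSizeInBytes));

    void start() {
        int threads = Runtime.getRuntime().availableProcessors(); // pool sized to CPU cores
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        queue.take().doMerge().run(); // smallest merge first
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "merge-worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    void submit(long estimatedSizeInBytes, Runnable doMerge) {
        queue.add(new MergeTask(estimatedSizeInBytes, doMerge));
    }
}
```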
…ep up with the merge load (elastic#125654)

Fixes an issue where indexing throttling kicked in while merge disk IO was still being throttled.
Instead, disk IO should be unthrottled first, and only then, if merging still cannot keep up, should indexing throttling kick in.

Fixes elastic/elasticsearch-benchmarks#2437
Relates elastic#120869
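A hypothetical sketch of the escalation order this fix establishes (class, method, and constant names are illustrative, not the real code):

```java
// Hypothetical sketch of the escalation order: when merges back up, raise the
// merge IO rate first; only once it is already at its ceiling, throttle indexing.
class MergePressureSketch {
    private static final double MAX_IO_RATE_MB_PER_SEC = 10_240; // illustrative ceiling
    private double targetIoRateMbPerSec = 5;                     // illustrative floor
    private boolean indexingThrottled = false;

    void onMergeBacklog(int backloggedMerges, int maxMergeCount) {
        if (backloggedMerges <= maxMergeCount) {
            indexingThrottled = false;            // keeping up: no throttling needed
        } else if (targetIoRateMbPerSec < MAX_IO_RATE_MB_PER_SEC) {
            targetIoRateMbPerSec *= 1.25;         // step 1: unthrottle merge disk IO
        } else {
            indexingThrottled = true;             // step 2: only now throttle indexing
        }
    }
}
```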
The intent here is to keep fewer pending merges enqueued for execution,
and to unthrottle disk IO at a faster rate when the queue grows longer.
Overall this results in less merge disk throttling.

Relates elastic/elasticsearch-benchmarks#2437 elastic#120869
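As a rough illustration of this idea, a hypothetical sketch in which the IO rate step size grows with the backlog (all names and constants are illustrative):

```java
// Hypothetical sketch: the longer the merge queue relative to the pool, the
// larger the step used to raise the shared target IO rate, draining backlogs faster.
final class IoRateAdjusterSketch {
    static final double MIN_RATE_MB_PER_SEC = 5;
    static final double MAX_RATE_MB_PER_SEC = 10_240;

    static double nextTargetIoRate(double currentRate, int activeMerges, int mergeThreads) {
        double backlogRatio = (double) activeMerges / mergeThreads; // queued + running vs. pool size
        if (backlogRatio > 1.0) {
            // Queue longer than the pool: grow the rate faster the longer it gets.
            return Math.min(currentRate * (1.0 + 0.25 * backlogRatio), MAX_RATE_MB_PER_SEC);
        }
        // Little or no backlog: decay back toward the floor.
        return Math.max(currentRate * 0.9, MIN_RATE_MB_PER_SEC);
    }
}
```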
…nCatchesUp (elastic#125956)

We don't know how many semaphore merge permits we need to release, or how many are already released.

Fixes elastic#125744
Ensures proper cleanup in the testThrottleStats test.

Fixes elastic#125910 elastic#125907 elastic#125912
…27613)

This PR introduces 3 new settings:
indices.merge.disk.check_interval, indices.merge.disk.watermark.high, and indices.merge.disk.watermark.high.max_headroom,
which control whether the threadpool merge executor starts executing new merges when disk space is running low.

The intent of this change is to avoid the situation where in-progress merges exhaust the available disk space on the node's local filesystem.
To this end, the thread pool merge executor periodically monitors the available disk space, together with the disk space estimates of all in-progress (currently running) merges on the node, and will NOT schedule any new merges if disk space is running low (by default, when available disk space falls below 5% of the total disk space or 100 GB, whichever is smaller; the same limit as the disk allocation flood stage).
@albertzaharovits albertzaharovits self-assigned this Jun 9, 2025
@albertzaharovits albertzaharovits added >feature :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. backport v9.0.3 labels Jun 9, 2025
pxsalehi and others added 5 commits June 9, 2025 10:10
Relates to an effort to consolidate the stateless merge scheduler with the current (stateful) merge scheduler in main ES. This PR brings over the features required to maintain parity with the stateless scheduler. Specifically, a few methods are added for the stateless scheduler to override (a sketch of these hooks follows below):

Adds an overridable method shouldSkipMerge to test whether a merge should be skipped
Adds 2 additional lifecycle callbacks to the scheduler, invoked when a merge is enqueued and when a merge is executed or aborted. These are used by stateless to track active + queued merges per shard
Adds overridable methods for enabling/disabling IO/thread/merge-count throttling

Other functionality required by the stateless merge scheduler can use the existing callbacks from the stateful scheduler:

beforeMerge can be overridden to prewarm
afterMerge can be overridden to refresh after big merges

Relates ES-10264
---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
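To make the extension points above concrete, here is a hypothetical sketch of the overridable hooks; the actual signatures in Elasticsearch may differ:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the extension points named in this commit message;
// the real hook signatures in Elasticsearch may differ.
abstract class MergeSchedulerHooksSketch {
    protected boolean shouldSkipMerge(Object merge) { return false; } // overridable skip test
    protected void onMergeQueued(Object merge) {}                     // lifecycle: merge enqueued
    protected void onMergeExecutedOrAborted(Object merge) {}          // lifecycle: merge done/aborted
    protected void beforeMerge(Object merge) {}                       // e.g. prewarm (stateless)
    protected void afterMerge(Object merge) {}                        // e.g. refresh after big merges
}

// A stateless-style subclass could then track active + queued merges:
class StatelessSchedulerSketch extends MergeSchedulerHooksSketch {
    private final AtomicInteger activeOrQueued = new AtomicInteger();

    @Override protected void onMergeQueued(Object merge) { activeOrQueued.incrementAndGet(); }
    @Override protected void onMergeExecutedOrAborted(Object merge) { activeOrQueued.decrementAndGet(); }
}
```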
@albertzaharovits albertzaharovits changed the title [9.0] Threadpool merge scheduler that is disk space aware [9.0] New threadpool-based merge scheduler that is disk space aware Jun 9, 2025
@albertzaharovits albertzaharovits merged commit 707d994 into elastic:9.0 Jun 9, 2025
16 checks passed
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Jun 9, 2025
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Jun 9, 2025
@albertzaharovits albertzaharovits changed the title [9.0] New threadpool-based merge scheduler that is disk space aware [9.0] New threadpool-based merge scheduler which is disk space aware Jun 9, 2025