
I/O imbalance across bookies #24010

Closed
Jayer23 opened this issue Feb 20, 2025 · 10 comments

Comments

@Jayer23

Jayer23 commented Feb 20, 2025

Discussed in #24009

Originally posted by Jayer23 February 20, 2025
Description

We observed an abnormal write pattern in our Pulsar cluster:

The bookie-level metrics show little difference in read and write I/O between nodes. The producers publish with compression enabled.

(screenshot attached)

However, node-level monitoring shows that the actual disk I/O is greatly amplified and varies widely between nodes: some nodes write less than 50 MB/s while others write more than 450 MB/s.

(screenshot attached)
Journal disks on some nodes show very high write throughput (e.g., 400+ MB/s).

Environment

  • Pulsar version: 3.0.7
  • Storage config:
    • Journal: 3×SSD
    • Ledger: 9×HDD
  • Replication: 3 replicas, 2 quorum writes

Expected Behavior

Write operations should be balanced across bookies.

@thetumbled
Member

Apache BookKeeper does not natively support load balancing between bookies.
To mitigate the imbalance, you can try this PR: apache/bookkeeper#4246, or implement your own load-balancing feature suited to your specific environment.

@Shawyeok
Contributor

You probably have some naughty partitions with heavy write throughput; you can check with the following Prometheus query:

pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"} > 5 * 1024^2

If you discover partitions with excessive write loads, consider increasing the number of partitions for the affected topics. This will help distribute the write throughput more evenly across your bookie instances.
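For example, if the hot topic is already a partitioned topic, its partition count can be raised with pulsar-admin (the topic name below is just a placeholder):

pulsar-admin topics update-partitioned-topic persistent://public/default/my-hot-topic --partitions 8

Depending on client settings, producers may take up to the partition auto-update interval before they start publishing to the new partitions.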

Notes about Pulsar message publishing:

  • Each partition maintains exactly one writable ledger at a time
  • With configuration ensembleSize=2, writeQuorum=2:
    • A partition with 100MB/s write throughput will generate approximately 100MB/s write load on each bookie in the ensemble
    • Bookies not in the current ensemble will not receive any writes from this ledger

@Jayer23
Author

Jayer23 commented Feb 26, 2025

> If you discover partitions with excessive write loads, consider increasing the number of partitions for the affected topics. This will help distribute the write throughput more evenly across your bookie instances.

  1. There are only 3 partitions in the cluster with write rates exceeding 2 MB/s.
  2. The producers use LZ4 compression. Does the Prometheus query threshold need to be adjusted?

@Shawyeok
Contributor

Shawyeok commented Feb 26, 2025

> There are only 3 partitions in the cluster with write rates exceeding 2 MB/s.
> The producers use LZ4 compression. Does the Prometheus query threshold need to be adjusted?

You need to figure out which ledgers contribute the most write load on the bookies with high write throughput.

How much difference is there between the following queries?

sum(pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"}) / 1024^2

sum(rate(bookie_WRITE_BYTES{group="YOUR_CLUSTER_NAME"}[1m])) / 1024^2
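As a rough sanity check (assuming bookie_WRITE_BYTES is the bookie-side byte counter and the cluster/group labels match your deployment), the ratio of the two sums should be roughly your write quorum:

sum(rate(bookie_WRITE_BYTES{group="YOUR_CLUSTER_NAME"}[1m])) / sum(pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"})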

Btw, what's the query behind this graph?

(screenshot attached)

@Jayer23
Author

Jayer23 commented Feb 26, 2025

> How much difference is there between the following queries?
> Btw, what's the query behind this graph?

  1. sum(pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"}) / 1024^2

     (screenshot attached)

  2. sum(rate(bookie_WRITE_BYTES{group="YOUR_CLUSTER_NAME"}[1m])) / 1024^2

     (screenshot attached)

  3. We collected host-level disk I/O metrics covering both journal disks and ledger disks, and found that I/O writes on some hosts are much larger than on others.
     The query is: sum(diskio.bytes_written{instance=~"sd.*"}) by (host)

     (screenshot attached)

@Shawyeok
Contributor

Is there any other process that might be generating I/O writes in your deployment? You can check this using pidstat -d 1.
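For example, with sysstat's pidstat, sampling disk I/O once per second for five samples (the kB_wr/s column shows per-process write throughput):

pidstat -d 1 5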

@Jayer23
Author

Jayer23 commented Feb 26, 2025

> Is there any other process that might be generating I/O writes in your deployment? You can check this using pidstat -d 1.

We are sure there is no other process, so we are also confused about why the bookie_WRITE_BYTES metric differs so much from diskio.bytes_written.

bookie_WRITE_BYTES

(screenshot attached)

diskio.bytes_written

(screenshot attached)

@Shawyeok
Contributor

Is bookie_WRITE_BYTES consistent with the I/O write metric for the journal disks? You may also check compaction activity on the ledger disks for further investigation.
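Since the journal and ledger directories sit on separate devices in this setup, per-device throughput is a quick way to tell journal writes apart from ledger/compaction writes (the write column, typically wkB/s, sampled every second):

iostat -x 1

Compare the journal SSDs against the ledger HDDs while the amplification is happening.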

Can you provide a demo to reproduce this problem?

@Jayer23
Author

Jayer23 commented Feb 27, 2025

> Is bookie_WRITE_BYTES consistent with the I/O write metric for the journal disks?

No, they are inconsistent.

> You may check compaction activity on the ledger disks for further investigation.

Ok, thanks.

> Can you provide a demo to reproduce this problem?

We don't know how to reproduce it either, but we found the same problem in several clusters.

@Jayer23
Author

Jayer23 commented Mar 6, 2025

We found the root cause: with the default configuration journalSyncData=true, a large number of small (8 B) records are written to the journal disk. Because of the SSD's 4 KB alignment, these small writes are heavily amplified. The problem was solved after configuring journalSyncData=false.
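For reference, a minimal sketch of the change, assuming the stock conf/bookkeeper.conf layout. Note the trade-off: with journalSyncData=false the bookie acknowledges adds before the journal is fsynced, so entries that have not yet been flushed can be lost if the host crashes (the 3-replica setup reduces, but does not eliminate, that risk). The amplification itself is plausible: an 8 B sync record padded to a 4 KB aligned block is up to a 512x blow-up for those writes.

# conf/bookkeeper.conf
# Do not fsync the journal on every add request; rely on periodic group flushes.
# Trade-off: acked-but-unflushed entries can be lost on a host crash.
journalSyncData=false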

Jayer23 closed this as completed Mar 6, 2025