
I/O imbalance across bookies #24010

Closed
Jayer23 opened this issue Feb 20, 2025 · 10 comments

Comments

@Jayer23

Jayer23 commented Feb 20, 2025

Discussed in #24009

Originally posted by Jayer23 February 20, 2025
Description

We observed an abnormal write pattern in our Pulsar cluster:

The bookie-level metrics show little difference in read and write I/O between nodes. The producers publish with compression enabled.

(screenshot attached)

However, node-level monitoring shows that the actual disk I/O is greatly amplified and varies widely between nodes: some nodes write less than 50 MB/s while others write more than 450 MB/s.

(screenshot attached)
Journal disks on some nodes show very high write throughput (e.g., 400+ MB/s).

Environment

  • Pulsar version: 3.0.7
  • Storage config:
    • Journal: 3×SSD
    • Ledger: 9×HDD
  • Replication: 3 replicas, 2 quorum writes

Expected Behavior

Write operations should be balanced across bookies.

@thetumbled
Member

Apache BookKeeper does not natively support load balancing between bookies.
To mitigate the imbalance, you can try this PR: apache/bookkeeper#4246, or implement your own load-balancing feature suited to your specific environment.

@Shawyeok
Contributor

You probably have some naughty partitions with heavy write throughput; you can check with the following Prometheus query:

pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"} > 5 * 1024^2

If you discover partitions with excessive write loads, consider increasing the number of partitions for the affected topics. This will help distribute the write throughput more evenly across your bookie instances.
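For example, if the hot topic is already a partitioned topic, its partition count can be raised with pulsar-admin (the topic name below is just a placeholder):

pulsar-admin topics update-partitioned-topic persistent://public/default/my-hot-topic --partitions 8

Depending on client settings, producers may take up to the partition auto-update interval before they start publishing to the new partitions.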

Notes about Pulsar message publishing:

  • Each partition maintains exactly one writable ledger at a time
  • With configuration ensembleSize=2, writeQuorum=2:
    • A partition with 100MB/s write throughput will generate approximately 100MB/s write load on each bookie in the ensemble
    • Bookies not in the current ensemble will not receive any writes from this ledger

@Jayer23
Author

Jayer23 commented Feb 26, 2025

> If you discover partitions with excessive write loads, consider increasing the number of partitions for the affected topics. This will help distribute the write throughput more evenly across your bookie instances.

  1. There are only 3 partitions in the cluster with write rates exceeding 2 MB/s.
  2. The producers use LZ4 compression. Does the Prometheus query threshold need to be adjusted?

@Shawyeok
Contributor

Shawyeok commented Feb 26, 2025

> There are only 3 partitions in the cluster with write rates exceeding 2 MB/s.
> The producers use LZ4 compression. Does the Prometheus query threshold need to be adjusted?

You need to figure out which ledgers contribute the most write load on the bookies with high write throughput.

How much difference is there between the following queries?

sum(pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"}) / 1024^2

sum(rate(bookie_WRITE_BYTES{group="YOUR_CLUSTER_NAME"}[1m])) / 1024^2
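As a rough sanity check (assuming bookie_WRITE_BYTES is the bookie-side byte counter and the cluster/group labels match your deployment), the ratio of the two sums should be roughly your write quorum:

sum(rate(bookie_WRITE_BYTES{group="YOUR_CLUSTER_NAME"}[1m])) / sum(pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"})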

Btw, what's the query behind this graph?

(screenshot attached)

@Jayer23
Author

Jayer23 commented Feb 26, 2025

> How much difference is there between the following queries?
> Btw, what's the query behind this graph?

  1. sum(pulsar_throughput_in{cluster="YOUR_CLUSTER_NAME"}) / 1024^2

     (screenshot attached)

  2. sum(rate(bookie_WRITE_BYTES{group="YOUR_CLUSTER_NAME"}[1m])) / 1024^2

     (screenshot attached)

  3. We collected host-level disk I/O metrics covering both journal disks and ledger disks, and found that I/O writes on some hosts are much larger than on others.
     The query is: sum(diskio.bytes_written{instance=~"sd.*"}) by (host)

     (screenshot attached)

@Shawyeok
Contributor

Is there any other process that might be generating I/O writes in your deployment? You can check this using pidstat -d 1.
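For example, with sysstat's pidstat, sampling disk I/O once per second for five samples (the kB_wr/s column shows per-process write throughput):

pidstat -d 1 5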

@Jayer23
Author

Jayer23 commented Feb 26, 2025

> Is there any other process that might be generating I/O writes in your deployment? You can check this using pidstat -d 1.

We are sure there is no other process, so we are also confused about why the bookie_WRITE_BYTES metric differs so much from diskio.bytes_written.

bookie_WRITE_BYTES

(screenshot attached)

diskio.bytes_written

(screenshot attached)

@Shawyeok
Contributor

Is bookie_WRITE_BYTES consistent with the I/O write metric for the journal disks? You may also check compaction activity on the ledger disks for further investigation.
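Since the journal and ledger directories sit on separate devices in this setup, per-device throughput is a quick way to tell journal writes apart from ledger/compaction writes (the write column, typically wkB/s, sampled every second):

iostat -x 1

Compare the journal SSDs against the ledger HDDs while the amplification is happening.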

Can you provide a demo to reproduce this problem?

@Jayer23
Author

Jayer23 commented Feb 27, 2025

> Is bookie_WRITE_BYTES consistent with the I/O write metric for the journal disks?

No, they are inconsistent.

> You may check compaction activity on the ledger disks for further investigation.

Ok, thanks.

> Can you provide a demo to reproduce this problem?

We don't know how to reproduce it either, but we found the same problem in several clusters.

@Jayer23
Author

Jayer23 commented Mar 6, 2025

We found the root cause: with the default configuration journalSyncData=true, a large number of small (8 B) records are written to the journal disk. Because of the SSD's 4 KB alignment, these small writes are heavily amplified. The problem was solved after configuring journalSyncData=false.
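For reference, a minimal sketch of the change, assuming the stock conf/bookkeeper.conf layout. Note the trade-off: with journalSyncData=false the bookie acknowledges adds before the journal is fsynced, so entries that have not yet been flushed can be lost if the host crashes (the 3-replica setup reduces, but does not eliminate, that risk). The amplification itself is plausible: an 8 B sync record padded to a 4 KB aligned block is up to a 512x blow-up for those writes.

# conf/bookkeeper.conf
# Do not fsync the journal on every add request; rely on periodic group flushes.
# Trade-off: acked-but-unflushed entries can be lost on a host crash.
journalSyncData=false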

Jayer23 closed this as completed Mar 6, 2025