diff --git a/pip/pip-314.md b/pip/pip-314.md new file mode 100644 index 0000000000000..b23189d88f854 --- /dev/null +++ b/pip/pip-314.md @@ -0,0 +1,59 @@ +# PIP-314: Add metrics pulsar_subscription_redelivery_messages + +# Background knowledge + +## Delivery of messages in normal + +First, the `motivation & background` of this proposal is only relevant to `Key_Shared` subscriptions. + +To simplify the description of the mechanism, let's take the policy [Auto Split Hash Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range) as an example: + +| `0 ~ 16,384` | `16,385 ~ 32,768` | `32,769 ~ 65,536` | +|-------------------|-------------------|--------------------------------| +| ------- C1 ------ | ------- C2 ------ | ------------- C3 ------------- | + +- If the entry key is between `-1(non-include) ~ 16,384(include)`, it is delivered to C1 +- If the entry key is between `16,384(non-include) ~ 32,768(include)`, it is delivered to C2 +- If the entry key is between `32,768(non-include) ~ 65,536(include)`, it is delivered to C3 + +# Motivation + +For the example above, if `C1` is stuck or consumed slowly, the Broker will push the entries that should be delivered to `C1` into a memory collection `redelivery_messages` and read next entries continue, then the collection `redelivery_messages` becomes larger and larger and take up a lot of memory. When sending messages, it will also determine the key of the entries in the collection `redelivery_messages`, affecting performance. + +# Goals +- Add metrics + - Broker level: + - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` to indicate the max one `{redelivery_messages}` of subscriptions in the broker, used by pushing an alert if it is too large. Nit: The Broker will print a log, which contains the name of the subscription(which has the maximum count of redelivery messages). This will help find the issue subscription. + - Add a metric `pulsar_broker_memory_usage_of_redelivery_messages_bytes` to indicate the memory usage of all `redelivery_messages` in the broker. This is helpful for memory health checks. +- Improve `Topic stats`. + - Add an attribute `redeliveryMessageCount` under `SubscriptionStats` + +Differ between `redelivery_messages` and `pulsar_subscription_unacked_messages & pulsar_subscription_back_log` + +- `pulsar_subscription_unacked_messages`: the messages have been delivered to the client but have not been acknowledged yet. +- `pulsar_subscription_back_log`: how many messages should be acknowledged, contains delivered messages, and the messages which should be delivered. + +### Public API + +SubscriptionStats.java +```java +long getRedeliveryMessageCount(); +``` + +### Metrics + +**pulsar_broker_max_subscription_redelivery_messages_total** +- Description: the max one `{redelivery_messages}` of subscriptions in the broker. +- Attributes: `[cluster]` +- Unit: `Gauge` + +**pulsar_broker_memory_usage_of_redelivery_messages_bytes** +- Description: the memory usage of all `redelivery_messages` in the broker. +- Attributes: `[cluster]` +- Unit: `Gauge` + + +# Monitoring + +- Push an alert if `pulsar_broker_max_subscription_redelivery_messages_total` is too large. This indicates that some `Key_Shared` consumers may be stuck, and it will affect the consumption speed and increase CPU usage. +- Push an alert if `pulsar_broker_memory_usage_of_redelivery_messages_bytes` is too large. This means Redelivery messages used too much memory, which may cause OOM.