[Draft] [improve] [pip] PIP-314 Add metrics for redelivery_messages #21488

# PIP-314: Add metrics pulsar_subscription_redelivery_messages

# Background knowledge

## Normal delivery of messages

To simplify the description of the mechanism, let's take the policy [Auto Split Hash Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range) as an example:

| `0 ~ 16,384` | `16,385 ~ 32,768` | `32,769 ~ 65,536` |
|--------------|-------------------|--------------------|
| C1           | C2                | C3                 |

- If the entry key is between `-1` (exclusive) and `16,384` (inclusive), it is delivered to C1
- If the entry key is between `16,384` (exclusive) and `32,768` (inclusive), it is delivered to C2
- If the entry key is between `32,768` (exclusive) and `65,536` (inclusive), it is delivered to C3
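
To make the range mapping concrete, here is a minimal sketch of selecting a consumer by key under a hash-range policy. It is illustrative only (Pulsar's real Key_Shared selector hashes keys with Murmur3 and manages the ranges dynamically); the consumer names and the `65,536` range size are taken from the example above.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class HashRangeExample {

    // Upper bound (inclusive) of each range -> consumer owning that range,
    // mirroring the static table above.
    private static final NavigableMap<Integer, String> RANGES = new TreeMap<>();
    static {
        RANGES.put(16_384, "C1");
        RANGES.put(32_768, "C2");
        RANGES.put(65_536, "C3");
    }

    // Map a message key to the consumer whose range contains its hash slot.
    static String selectConsumer(String key) {
        int slot = Math.floorMod(key.hashCode(), 65_536);  // hash into 0 ~ 65,535
        return RANGES.ceilingEntry(slot).getValue();       // first range whose upper bound >= slot
    }

    public static void main(String[] args) {
        // Prints C1, C2 or C3 depending on where the key hashes.
        System.out.println(selectConsumer("order-123"));
    }
}
```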

# Motivation

For the example above, if `C1` is stuck or consuming slowly, the Broker pushes the entries that should be delivered to `C1` into an in-memory collection called `redelivery_messages` and continues reading the next entries. The collection `redelivery_messages` then grows larger and larger and takes up a lot of memory. When dispatching messages, the Broker also has to check the keys of newly read entries against the collection `redelivery_messages`, which affects performance.

> **Comment:** When you write "C1 is stuck", do you mean the internal queue of C1 is full? When that happens, messages continue to be read from the topic, but messages whose keys belong to C1 are placed in a buffer called redelivery messages?
>
> **Reply:** Yes. After a client-side consumer's incoming queue is full, the Broker will stop delivering messages to it and just push these messages into the queue. If the …
>
> **Comment:** What do you mean by "determine" the key? Also, why would doing that hurt performance?
>
> **Reply:** After reading new messages, the Broker has to filter out the messages whose key is already stuck in `redelivery_messages`. The Broker uses the data structure …
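
To make the mechanism under discussion concrete, here is a simplified, hypothetical sketch (the class and field names are illustrative, not the Broker's real dispatcher code): when delivery to the owning consumer fails because its queue is full, the entry is parked for redelivery, and any later entry with the same key is also parked so that per-key ordering is preserved.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Simplified model of the behaviour described above: entries whose key is
// already parked for redelivery must not be dispatched ahead of the parked ones.
class RedeliveryBufferSketch {

    record Entry(String key, byte[] payload) {}

    private final Queue<Entry> redeliveryMessages = new ArrayDeque<>();
    private final Set<String> stuckKeys = new HashSet<>();

    void onEntriesRead(List<Entry> entries) {
        for (Entry e : entries) {
            if (stuckKeys.contains(e.key()) || !tryDeliver(e)) {
                // Park the entry and remember its key, so later entries with
                // the same key are also parked until the consumer drains.
                redeliveryMessages.add(e);
                stuckKeys.add(e.key());
            }
        }
    }

    // Placeholder for "deliver to the consumer owning this key"; returns false
    // when that consumer's incoming queue is full.
    private boolean tryDeliver(Entry e) {
        return true;
    }
}
```

The growth of `redeliveryMessages` (and of the key bookkeeping around it) is what the proposed metrics are meant to expose.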

# Goals

- Add metrics (broker level):
  - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` to expose the largest `redelivery_messages` count held by any single subscription in the broker, so an alert can be pushed when it grows too large. Nit: the Broker will also print a log containing the name of the subscription that has the maximum count of redelivery messages, which helps identify the problematic subscription. (A sketch of this aggregation appears after this list.)

> **Comment:** It doesn't make sense as a broker-level metric, right? Because we will not have topic-level metrics for the redelivery messages. And I'm confused here: how can they work together?
>
> **Comment:** This name is super confusing: max of a (single) subscription of redelivery messages, then total? I'll add suggestions soon.
>
> **Reply:** Agree with you; the original type of this metric is … Waiting for your suggestions.

  - Add a metric `pulsar_broker_memory_usage_of_redelivery_messages_bytes` to indicate the memory usage of all `redelivery_messages` in the broker. This is helpful for memory health checks.
- Improve `Topic stats`:
  - Add an attribute `redeliveryMessageCount` under `SubscriptionStats`
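
Here is a hypothetical sketch of the broker-level aggregation referenced above. The interface and class names are illustrative; only `getRedeliveryMessageCount()` comes from this proposal. It iterates every subscription, takes the maximum, and logs the subscription that holds it.

```java
import java.util.Map;

class RedeliveryMetricsAggregator {

    /** Hypothetical view of a subscription exposing the proposed counter. */
    interface SubscriptionView {
        long getRedeliveryMessageCount();
    }

    /** Returns the largest redelivery count among all subscriptions on this broker. */
    static long maxRedeliveryMessages(Map<String, SubscriptionView> subscriptions) {
        long max = 0;
        String maxSubscription = null;
        for (Map.Entry<String, SubscriptionView> e : subscriptions.entrySet()) {
            long count = e.getValue().getRedeliveryMessageCount();
            if (count > max) {
                max = count;
                maxSubscription = e.getKey();
            }
        }
        if (maxSubscription != null) {
            // The log line mentioned in the Goals section: name the worst subscription.
            System.out.printf("Max redelivery messages: %d on subscription %s%n", max, maxSubscription);
        }
        return max;
    }
}
```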

The difference between `redelivery_messages` and `pulsar_subscription_unacked_messages` / `pulsar_subscription_back_log`:

- `pulsar_subscription_unacked_messages`: messages that have been delivered to the client but have not been acknowledged yet.
- `pulsar_subscription_back_log`: how many messages still need to be acknowledged; this includes both messages already delivered and messages yet to be delivered.

### Public API

**SubscriptionStats.java**

```java
long getRedeliveryMessageCount();
```
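
If the PIP is adopted, the new stat could be read alongside the existing subscription stats. Below is a hedged usage sketch with the Pulsar admin client; the service URL and topic name are placeholders, and `getRedeliveryMessageCount()` is the method proposed above, not something available in current releases.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class RedeliveryStatsExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            TopicStats stats = admin.topics().getStats("persistent://public/default/my-topic");
            // getRedeliveryMessageCount() is the accessor proposed by this PIP.
            stats.getSubscriptions().forEach((name, sub) ->
                    System.out.println(name + " -> " + sub.getRedeliveryMessageCount()));
        }
    }
}
```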

### Metrics

**pulsar_broker_max_subscription_redelivery_messages_total**

- Description: the largest `redelivery_messages` count maintained by any single subscription in the broker.
- Attributes: `[cluster]`
- Unit: `Counter`

**pulsar_broker_memory_usage_of_redelivery_messages_bytes**

- Description: the memory usage of all `redelivery_messages` in the broker.

> **Comment:** How do you calculate the memory? Message count * X?
>
> **Reply:** There are three collections under the instance …
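
As a rough illustration of the "message count * X" idea raised above (purely illustrative; the per-entry constant below is an assumption, not a value the Broker actually uses):

```java
// Rough estimate of redelivery-buffer memory: tracked entries multiplied by an
// assumed fixed per-entry bookkeeping overhead (position + key hash).
class RedeliveryMemoryEstimate {

    // Assumed average bytes of bookkeeping per parked entry; not a real Pulsar constant.
    private static final long BYTES_PER_ENTRY = 64;

    static long estimateBytes(long trackedEntries) {
        return trackedEntries * BYTES_PER_ENTRY;
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes(1_000_000)); // ~64 MB for one million parked entries
    }
}
```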

> **Comment:** @asafm Do you think it makes sense to add topic-level metrics, but only for the top X topics?
>
> **Reply:** Let's take a similar problem in MySQL or Elasticsearch. You have so many queries entering the system, and you want to know which ones are the slowest. You can enable a slow-query log, for example, and it will log the slowest queries (above a certain threshold). We can decide to have a histogram for buffer size, with 0 buckets (basically count and max), on a namespace level. Users can be alerted when an unknown topic in a namespace has a max redelivery buffer size greater than a certain threshold. If you want to query which topic that was, we can choose between two mechanisms: …
>
> @codelipenghui Your suggestion is to also define a threshold, but expose the topics/subscriptions via Prometheus metrics and not logs, right?
>
> **Reply:** Yes. Pulsar transactions, for example, also have a slow-transactions endpoint to query all the slow transactions. Let's take an example: backlogs. If we only have a broker-level backlog, we can have a backlog limitation for each broker, but the limitation is not easy to set because it depends on the topics. Set it to 100k, for example, but with maybe 10k topics where one topic only has a backlog of 10, it shouldn't be a problem. But if we have topic-level metrics only for the top 100 topics with the largest backlogs, we can set the backlog threshold to 10k, so that we detect at most 100 topics with backlog issues. A REST API can also integrate with alert systems, but it's hard to check historical data. A logs solution works for this case (you can get historical data), but if you want the trend of the backlogs, it's not easy; we must set up another dashboard (Kibana) based on the logs. The trend is also important when troubleshooting problems: if the bottleneck is the consumer side, we should see things get better after consumers scale up. Sorry, I think I provided a wrong example before; latency is not a good case, while Counter and Gauge should be good cases. I will continue to think about the essential differences between the different solutions.
>
> **Comment:** OK, I have given it some thought. Here's what I think. I'm going to separate the solution into long term and short term.
>
> Long term: with each feature added, you're bound to have more metrics you'll need to add, which might be topic- or subscription-level (granularity). The core problem is that we export too many metrics, right? I think the solution lies within PIP-264. In summary, I think PIP-264 solves this problem.
>
> Short term: still, for now, we need a way to get alerted when topic A crosses threshold X of a certain metric. OK, now that you know a certain namespace is in trouble, you need to know which topics. For that I was thinking of several ideas which are not finalized, so I'm just bringing up ideas: …

- Attributes: `[cluster]`
- Unit: `Counter`

# Monitoring

- Push an alert if `pulsar_broker_max_subscription_redelivery_messages_total` is too large. This indicates that some `Key_Shared` consumers may be stuck, which will slow down consumption and increase CPU usage.
- Push an alert if `pulsar_broker_memory_usage_of_redelivery_messages_bytes` is too large. This means redelivery messages are using too much memory, which may cause an OOM.

> **Comment:** This describes Key_Shared. Is redelivery of messages only relevant for Key_Shared subscriptions? If so, I would state that in the background.
>
> **Reply:** Yes. In `Shared` mode, messages can be delivered to any consumer after a new read, so the Broker will deliver messages to other consumers when one consumer is stuck, and the messages will not be pushed into `redelivery_messages`.
>
> **Reply:** Added.