Skip to content

notifications: Exit protocols on handle drop to save up CPU of minimal-relay-chains #376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 23, 2025

Conversation

lexnv
Copy link
Collaborator

@lexnv lexnv commented Apr 17, 2025

This PR ensures that notification tasks are fully terminated when the associated handle is dropped, addressing an issue specific to minimal-relay-chains. We have operated from the assumption that the handle is never supposed to get dropped, since protocols cannot be dynamically injected after the backend is started.

This PR effectively reduces the CPU usage of minimal-relay-chains from 7% to 0.1%.

We have noticed that the CPU consumption of minimal-relay-chains has increased compared to libp2p backends. However, our metrics for CPU consumption of full relay chains have drastically improved.

The issue is the following:

  • substrate needs a block-announces notification config to initiate the network components.
  • the minimal-relay-chains do not need this notification around, and drop the associated handle.
  • this behaviour causes a CPU loop that continuously spams the following log line:
litep2p::notification: [Parachain] user protocol has exited, exiting protocol=/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/block-announces/1

CPU before and after:
Screenshot 2025-04-17 at 17 44 23

Fixes: paritytech/polkadot-sdk#7998

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv added the bug Something isn't working label Apr 17, 2025
@lexnv lexnv self-assigned this Apr 17, 2025
@bkchr
Copy link
Member

bkchr commented Apr 17, 2025

Generally could use some test. Make a test that directly drops this stream and then poll the run function.

lexnv and others added 3 commits April 22, 2025 12:51
@lexnv lexnv merged commit 403ec00 into master Apr 23, 2025
8 checks passed
@lexnv lexnv deleted the lexnv/fix-how-cpu-parachains branch April 23, 2025 10:13
lexnv added a commit that referenced this pull request May 1, 2025
## [0.9.4] - 2025-04-29

This release brings several improvements and fixes to litep2p, advancing
its stability and readiness for production use.

### Performance Improvements

This release addresses an issue where notification protocols failed to
exit on handle drop, lowering CPU usage in scenarios like
minimal-relay-chains from 7% to 0.1%.

### Robustness Improvements

- Kademlia:
- Optimized address store by sorting addresses based on dialing score,
bounding memory consumption and improving efficiency.
- Limited `FIND_NODE` responses to the replication factor, reducing data
stored in the routing table.
- Address store improvements enhance robustness against routing table
alterations.

- Identify Codec:
- Enhanced message decoding to manage malformed or unexpected messages
gracefully.

- Bitswap:
- Introduced a write timeout for sending frames, preventing protocol
hangs or delays.

### Testing and Reliability

- Fuzzing Harness: Added a fuzzing harness by SRLabs to uncover and
resolve potential issues, improving code robustness. Thanks to @R9295
for the contribution!

- Testing Enhancements: Improved notification state machine testing.
Thanks to Dominique (@Imod7) for the contribution!

### Dependency Management

- Updated all dependencies for stable feature flags (default and
"websocket") to their latest versions.

- Reorganized dependencies under specific feature flags, shrinking the
default feature set and avoiding exposure of outdated dependencies from
experimental features.

### Fixed

- notifications: Exit protocols on handle drop to save up CPU of
`minimal-relay-chains`
([#376](#376))
- identify: Improve identify message decoding
([#379](#379))
- crypto/noise: Set timeout limits for the noise handshake
([#373](#373))
- kad: Improve robustness of addresses from the routing table
([#369](#369))
- kad: Bound kademlia messages to the replication factor
([#371](#371))
- codec: Decode smaller payloads for identity to None
([#362](#362))

### Added

- bitswap: Add write timeout for sending frames
([#361](#361))
- notif/tests: check test state
([#360](#360))
- SRLabs: Introduce simple fuzzing harness
([#367](#367))
- SRLabs: Introduce Fuzzing Harness
([#365](#365))

### Changed

- features: Move quic related dependencies under feature flag
([#359](#359))
- tests/substrate: Remove outdated substrate specific conformace testing
([#370](#370))
- ci: Update stable dependencies
([#375](#375))
- build(deps): bump hex-literal from 0.4.1 to 1.0.0
([#381](#381))
- build(deps): bump tokio from 1.44.1 to 1.44.2 in /fuzz/structure-aware
([#378](#378))
- build(deps): bump Swatinem/rust-cache from 2.7.7 to 2.7.8
([#363](#363))
- build(deps): bump tokio from 1.43.0 to 1.43.1
([#368](#368))
- build(deps): bump openssl from 0.10.70 to 0.10.72
([#366](#366))

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request May 1, 2025
# Litep2p Release

This release brings several improvements and fixes to litep2p, advancing
its stability and readiness for production use.

### Performance Improvements

This release addresses an issue where notification protocols failed to
exit on handle drop, lowering CPU usage in scenarios like
minimal-relay-chains from 7% to 0.1%.

### Robustness Improvements

- Kademlia:
- Optimized address store by sorting addresses based on dialing score,
bounding memory consumption and improving efficiency.
- Limited `FIND_NODE` responses to the replication factor, reducing data
stored in the routing table.
- Address store improvements enhance robustness against routing table
alterations.

- Identify Codec:
- Enhanced message decoding to manage malformed or unexpected messages
gracefully.

- Bitswap:
- Introduced a write timeout for sending frames, preventing protocol
hangs or delays.

### Testing and Reliability

- Fuzzing Harness: Added a fuzzing harness by SRLabs to uncover and
resolve potential issues, improving code robustness. Thanks to @R9295
for the contribution!

- Testing Enhancements: Improved notification state machine testing.
Thanks to Dominique (@Imod7) for the contribution!

### Dependency Management

- Updated all dependencies for stable feature flags (default and
"websocket") to their latest versions.

- Reorganized dependencies under specific feature flags, shrinking the
default feature set and avoiding exposure of outdated dependencies from
experimental features.

### Fixed

- notifications: Exit protocols on handle drop to save up CPU of
`minimal-relay-chains`
([#376](paritytech/litep2p#376))
- identify: Improve identify message decoding
([#379](paritytech/litep2p#379))
- crypto/noise: Set timeout limits for the noise handshake
([#373](paritytech/litep2p#373))
- kad: Improve robustness of addresses from the routing table
([#369](paritytech/litep2p#369))
- kad: Bound kademlia messages to the replication factor
([#371](paritytech/litep2p#371))
- codec: Decode smaller payloads for identity to None
([#362](paritytech/litep2p#362))

### Added

- bitswap: Add write timeout for sending frames
([#361](paritytech/litep2p#361))
- notif/tests: check test state
([#360](paritytech/litep2p#360))
- SRLabs: Introduce simple fuzzing harness
([#367](paritytech/litep2p#367))
- SRLabs: Introduce Fuzzing Harness
([#365](paritytech/litep2p#365))

### Changed

- features: Move quic related dependencies under feature flag
([#359](paritytech/litep2p#359))
- tests/substrate: Remove outdated substrate specific conformace testing
([#370](paritytech/litep2p#370))
- ci: Update stable dependencies
([#375](paritytech/litep2p#375))
- build(deps): bump hex-literal from 0.4.1 to 1.0.0
([#381](paritytech/litep2p#381))
- build(deps): bump tokio from 1.44.1 to 1.44.2 in /fuzz/structure-aware
([#378](paritytech/litep2p#378))
- build(deps): bump Swatinem/rust-cache from 2.7.7 to 2.7.8
([#363](paritytech/litep2p#363))
- build(deps): bump tokio from 1.43.0 to 1.43.1
([#368](paritytech/litep2p#368))
- build(deps): bump openssl from 0.10.70 to 0.10.72
([#366](paritytech/litep2p#366))

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
castillax pushed a commit to paritytech/polkadot-sdk that referenced this pull request May 12, 2025
# Litep2p Release

This release brings several improvements and fixes to litep2p, advancing
its stability and readiness for production use.

### Performance Improvements

This release addresses an issue where notification protocols failed to
exit on handle drop, lowering CPU usage in scenarios like
minimal-relay-chains from 7% to 0.1%.

### Robustness Improvements

- Kademlia:
- Optimized address store by sorting addresses based on dialing score,
bounding memory consumption and improving efficiency.
- Limited `FIND_NODE` responses to the replication factor, reducing data
stored in the routing table.
- Address store improvements enhance robustness against routing table
alterations.

- Identify Codec:
- Enhanced message decoding to manage malformed or unexpected messages
gracefully.

- Bitswap:
- Introduced a write timeout for sending frames, preventing protocol
hangs or delays.

### Testing and Reliability

- Fuzzing Harness: Added a fuzzing harness by SRLabs to uncover and
resolve potential issues, improving code robustness. Thanks to @R9295
for the contribution!

- Testing Enhancements: Improved notification state machine testing.
Thanks to Dominique (@Imod7) for the contribution!

### Dependency Management

- Updated all dependencies for stable feature flags (default and
"websocket") to their latest versions.

- Reorganized dependencies under specific feature flags, shrinking the
default feature set and avoiding exposure of outdated dependencies from
experimental features.

### Fixed

- notifications: Exit protocols on handle drop to save up CPU of
`minimal-relay-chains`
([#376](paritytech/litep2p#376))
- identify: Improve identify message decoding
([#379](paritytech/litep2p#379))
- crypto/noise: Set timeout limits for the noise handshake
([#373](paritytech/litep2p#373))
- kad: Improve robustness of addresses from the routing table
([#369](paritytech/litep2p#369))
- kad: Bound kademlia messages to the replication factor
([#371](paritytech/litep2p#371))
- codec: Decode smaller payloads for identity to None
([#362](paritytech/litep2p#362))

### Added

- bitswap: Add write timeout for sending frames
([#361](paritytech/litep2p#361))
- notif/tests: check test state
([#360](paritytech/litep2p#360))
- SRLabs: Introduce simple fuzzing harness
([#367](paritytech/litep2p#367))
- SRLabs: Introduce Fuzzing Harness
([#365](paritytech/litep2p#365))

### Changed

- features: Move quic related dependencies under feature flag
([#359](paritytech/litep2p#359))
- tests/substrate: Remove outdated substrate specific conformace testing
([#370](paritytech/litep2p#370))
- ci: Update stable dependencies
([#375](paritytech/litep2p#375))
- build(deps): bump hex-literal from 0.4.1 to 1.0.0
([#381](paritytech/litep2p#381))
- build(deps): bump tokio from 1.44.1 to 1.44.2 in /fuzz/structure-aware
([#378](paritytech/litep2p#378))
- build(deps): bump Swatinem/rust-cache from 2.7.7 to 2.7.8
([#363](paritytech/litep2p#363))
- build(deps): bump tokio from 1.43.0 to 1.43.1
([#368](paritytech/litep2p#368))
- build(deps): bump openssl from 0.10.70 to 0.10.72
([#366](paritytech/litep2p#366))

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[stable2412-3] high CPU load in Kusama System Chains
2 participants