Add Raft Node Support for Health Checks and Metrics Collection #643

Merged
merged 7 commits into from
Dec 22, 2024

Collaborator

@sinadarbouy sinadarbouy commented Dec 22, 2024

Ticket(s)

Description

This pull request adds Raft health check integration as part of the Raft cluster management enhancements. It makes the following changes:

  • Added Raft-specific health checks to the existing health check endpoint.
  • Included leader election status in health checks.
  • Added cluster state validation in health checks.
  • Exposed metrics about Raft cluster health, including health status, leader status, and last contact latency.
  • Updated the liveness function to incorporate Raft node health checks.
  • Enhanced test coverage for health checks with Raft nodes.
  • Updated docker-compose-raft.yaml to include health checks for services.

Development Checklist

  • I have added a descriptive title to this PR.
  • I have squashed related commits together.
  • I have rebased my branch on top of the latest main branch.
  • I have performed a self-review of my own code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added docstring(s) to my code.
  • I have made corresponding changes to the documentation (docs).
  • I have updated docs using make gen-docs command.
  • I have added tests for my changes.
  • I have signed all the commits.

Legal Checklist

Add functionality to monitor Raft node health status and expose key metrics:
- Add new Prometheus metrics for Raft health status, leader status, and last contact latency
- Implement GetHealthStatus() method to track node health state
- Monitor leadership status and communication with leader
- Track last contact time with leader
- Add helper functions for metric value conversion and time parsing

The metrics will help monitor cluster health and leadership changes in production.
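
The helper functions for metric value conversion and time parsing could look roughly like the sketch below. It assumes the last-contact value arrives as a string that is `never`, `0`, or a Go duration such as `12.5ms`; that format and the names `boolToFloat` and `parseLastContact` are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// boolToFloat converts a boolean into the 0/1 value a gauge metric expects.
// Hypothetical helper name, used here for illustration only.
func boolToFloat(b bool) float64 {
	if b {
		return 1
	}
	return 0
}

// parseLastContact converts a last-contact string into a duration.
// Assumption: the value is "never", "0", or a Go duration string.
// The second result is false when no contact has been recorded or the
// value cannot be parsed.
func parseLastContact(s string) (time.Duration, bool) {
	switch s {
	case "", "never":
		return 0, false
	case "0":
		return 0, true
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, false
	}
	return d, true
}

func main() {
	for _, s := range []string{"0", "12.5ms", "never"} {
		d, ok := parseLastContact(s)
		fmt.Printf("%q -> %v (known=%v)\n", s, d, ok)
	}
	fmt.Println(boolToFloat(true), boolToFloat(false))
}
```

Returning a separate `known` flag keeps "no contact yet" distinguishable from a genuine zero latency, so a gauge is not set to a misleading value.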

- Added Raft node dependency to `Options` and `HealthChecker` structs.
- Updated `liveness` function to include Raft node health status.
- Modified health check logic in `healthcheck.go` and `http_server.go` to consider Raft node status.
- Enhanced `healthcheck_test.go` to include tests for Raft node integration.
- Ensured Raft node is properly initialized and cleaned up in tests.
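
One way a liveness check can take Raft node health into account is sketched below. This is an illustration under stated assumptions, not the PR's code: the `HealthStatus` fields, the `raftHealth` interface, the `/healthz` path, and the choice of HTTP 503 for an unhealthy node are all hypothetical. Only the method name `GetHealthStatus` comes from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// HealthStatus is a hypothetical shape for what GetHealthStatus reports.
type HealthStatus struct {
	IsHealthy bool   `json:"healthy"`
	IsLeader  bool   `json:"leader"`
	Error     string `json:"error,omitempty"`
}

// raftHealth is the only capability the handler needs from a Raft node.
type raftHealth interface {
	GetHealthStatus() HealthStatus
}

// livenessHandler answers 200 when the node reports healthy and 503 otherwise.
func livenessHandler(node raftHealth) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		status := node.GetHealthStatus()
		code := http.StatusOK
		if !status.IsHealthy {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(status)
	}
}

// stubNode is a fixed-answer stand-in for a real Raft node.
type stubNode struct{ status HealthStatus }

func (s stubNode) GetHealthStatus() HealthStatus { return s.status }

// probe sends one request through the handler and returns the status code.
func probe(node raftHealth) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	livenessHandler(node).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe(stubNode{status: HealthStatus{IsHealthy: true, IsLeader: true}}))
	fmt.Println(probe(stubNode{status: HealthStatus{Error: "no leader"}}))
}
```

Depending on a small interface rather than a concrete node type lets tests substitute a stub, which fits the note above about initializing and cleaning up the Raft node in tests.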

- Introduced a new test `TestGetHealthStatus` to check the health status of Raft nodes.
- The test covers three scenarios:
  1. When the node is the leader, it should be healthy and recognize itself as the leader.
  2. When the node is a follower, it should be healthy and recognize the leader.
  3. When no leader is available, the node should not be healthy and should report an error.
- Utilized temporary directories and a test logger for setup.
- Ensured proper shutdown of nodes after tests to prevent resource leaks.
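
The three scenarios above can be expressed as a small decision function plus a table of cases. The sketch below is a simplified stand-in, assuming health depends only on whether the node is the leader or knows a leader address; the types `nodeView` and `health` and the function `evaluate` are hypothetical and do not mirror the PR's actual test.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoLeader is a hypothetical sentinel for the "no leader" scenario.
var errNoLeader = errors.New("no leader available")

// nodeView is the minimal state the decision needs.
type nodeView struct {
	isLeader   bool
	leaderAddr string // empty when no leader is known
}

// health is the outcome of one evaluation.
type health struct {
	healthy  bool
	isLeader bool
	err      error
}

// evaluate maps a node's view of the cluster onto the three scenarios.
func evaluate(v nodeView) health {
	switch {
	case v.isLeader:
		return health{healthy: true, isLeader: true}
	case v.leaderAddr != "":
		return health{healthy: true}
	default:
		return health{err: errNoLeader}
	}
}

func main() {
	cases := []struct {
		name string
		view nodeView
	}{
		{"leader", nodeView{isLeader: true, leaderAddr: "127.0.0.1:2223"}},
		{"follower", nodeView{leaderAddr: "127.0.0.1:2223"}},
		{"no leader", nodeView{}},
	}
	for _, c := range cases {
		h := evaluate(c.view)
		fmt.Printf("%-10s healthy=%v leader=%v err=%v\n", c.name, h.healthy, h.isLeader, h.err)
	}
}
```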

- Corrected variable naming from `leaderId` to `leaderID` for consistency.
- Added missing periods to comments for proper punctuation.
- Used `%w` in `fmt.Errorf` for error wrapping.
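
For context on the `%w` change: wrapping with `%w` keeps the original error reachable through `errors.Is` and `errors.As`, whereas `%v` would flatten it into text. A minimal sketch follows; the sentinel `errNoLeader` and the functions `checkLeader` and `isNoLeader` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoLeader is a hypothetical sentinel error.
var errNoLeader = errors.New("no leader")

// checkLeader wraps the sentinel with context when no leader ID is known.
func checkLeader(leaderID string) error {
	if leaderID == "" {
		return fmt.Errorf("raft health check failed: %w", errNoLeader)
	}
	return nil
}

// isNoLeader reports whether err wraps errNoLeader anywhere in its chain.
func isNoLeader(err error) bool {
	return errors.Is(err, errNoLeader)
}

func main() {
	err := checkLeader("")
	fmt.Println(err)             // prints "raft health check failed: no leader"
	fmt.Println(isNoLeader(err)) // prints "true"
}
```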

- Updated the `runCmd` in `cmd/run.go` to include `RaftNode` in the API configuration.
- This change ensures that the `RaftNode` is initialized and passed to the API object, so the health checks can take Raft node status into account.

- Added port mappings for 9090 to each gatewayd service.
- Introduced health checks for gatewayd-2 and gatewayd-3 services to ensure they are running correctly.
- Added service dependencies for write-postgres, read-postgres, redis, and install_plugins to ensure proper startup order.
- Linked gatewayd-2 and gatewayd-3 services to write-postgres, read-postgres, and redis for improved connectivity.

- Added a comment to document the GetHealthStatus method, explaining its purpose.
- Removed an unnecessary comment about caching commonly used values, as it was redundant with the code's functionality.

Overview

| Image reference | ghcr.io/gatewayd-io/gatewayd:ed25ae1 | gatewaydio/gatewayd:latest |
|---|---|---|
| - digest | 0b0ffb03aba6 | 383013efa302 |
| - tag | ed25ae1 | latest |
| - provenance | | b6df86a |
| - vulnerabilities | critical: 0 high: 0 medium: 1 low: 0 | critical: 0 high: 1 medium: 1 low: 0 |
| - platform | linux/amd64 | linux/amd64 |
| - size | 20 MB | 18 MB (-2.3 MB) |
| - packages | 144 | 140 (-4) |
| Base Image | alpine:3 (also known as: 3.20, 3.20.3, latest) | alpine:3.20 (also known as: 3, 3.20.3, latest) |
| - vulnerabilities | critical: 0 high: 0 medium: 1 low: 0 | critical: 0 high: 0 medium: 1 low: 0 |
Packages and Vulnerabilities (6 package changes and 1 vulnerability changes)
  • ➖ 3 packages removed
  • ♾️ 3 packages changed
  • 133 packages unchanged
  • ❗ 1 vulnerabilities added
Changes for packages of type apk (3 changes)
Package Version
ghcr.io/gatewayd-io/gatewayd:ed25ae1
Version
gatewaydio/gatewayd:latest
ca-certificates 20240705-r0
openssl 3.3.2-r0
pax-utils 1.3.7-r2
Changes for packages of type golang (3 changes)

| Package | Version in ghcr.io/gatewayd-io/gatewayd:ed25ae1 | Version in gatewaydio/gatewayd:latest |
|---|---|---|
| ♾️ github.com/gatewayd-io/gatewayd | (devel) | 0.0.0-20241214123014-b6df86a6fe94 |
| ♾️ golang.org/x/net | 0.33.0 | 0.32.0 |
| ♾️ stdlib | go1.23.4 | 1.23.4 |

Vulnerabilities for golang.org/x/net: critical: 0 high: 1 medium: 0 low: 0
Added vulnerabilities (1):
  • high : CVE-2024-45338

@sinadarbouy sinadarbouy marked this pull request as ready for review December 22, 2024 20:21
@sinadarbouy sinadarbouy requested a review from mostafa December 22, 2024 20:21
Member

@mostafa mostafa left a comment

LGTM!

@mostafa mostafa merged commit eec4eba into main Dec 22, 2024
5 checks passed
@mostafa mostafa deleted the feature/add-raft-health-check branch December 22, 2024 23:28
2 participants