Prometheus collector and exporter of Slurm cluster metrics. A Slinky project.
Metrics are collected through the Slurm REST API and exported as Prometheus metric types. The exporter authenticates with a JWT, so it runs with the privileges of the Slurm cluster user that the token belongs to.
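As a rough illustration of this kind of token-based access, the sketch below queries the Slurm REST API with curl. It is not the exporter's own code. The host `slurmrestd.example.com`, the port `6820`, the `v0.0.41` API version path, the `SLURM_USER` variable, and the helper names `slurm_url` and `slurm_get` are assumptions for illustration; adjust them to your cluster.

```shell
#!/bin/sh
# Hypothetical defaults: replace with your slurmrestd endpoint and API version.
SLURMRESTD_URL="${SLURMRESTD_URL:-http://slurmrestd.example.com:6820}"
API_VERSION="${API_VERSION:-v0.0.41}"

# Build the URL for a Slurm REST API resource (e.g. nodes, partitions, jobs).
slurm_url() {
  printf '%s/slurm/%s/%s\n' "$SLURMRESTD_URL" "$API_VERSION" "$1"
}

# Fetch a resource, authenticating with the JWT stored in $SLURM_JWT.
# The user name defaults to the current shell user (assumption).
slurm_get() {
  curl --silent --fail \
    -H "X-SLURM-USER-NAME: ${SLURM_USER:-$USER}" \
    -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
    "$(slurm_url "$1")"
}

# Usage (requires a reachable slurmrestd and a valid token):
#   export $(scontrol token)   # sets SLURM_JWT for the current user
#   slurm_get nodes
```

Requests made this way are authorized as the user named in the token, so the data returned is limited to what that Slurm user may see.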
The recommended way to deploy the exporter is with its Helm chart.
Node metrics (number of nodes in each state):
- Allocated: nodes which have been allocated to one or more jobs.
- Completing: nodes on which all associated jobs are in the process of COMPLETING.
- Down: nodes which are unavailable for use.
- Drain: nodes which are marked as drain; they are unavailable for future work but can complete their current work.
- Idle: nodes which are not allocated to any jobs and are available for use.
- Maintenance: nodes which are currently in a maintenance reservation.
- Mixed: nodes which have some but not all of their CPUs ALLOCATED, or on which suspended jobs still have TRES (e.g. memory) allocated.
- Reserved: nodes which are in an advanced reservation and not generally available.
Partition metrics:
- Nodes: number of nodes associated with the partition.
- CPUs: number of CPUs associated with the partition.
- Jobs: number of incomplete (e.g. pending, running) jobs in the partition.
- Allocated CPUs: number of ALLOCATED CPUs among all nodes in the partition.
- Idle CPUs: number of IDLE CPUs among all nodes in the partition.
- Pending Jobs: number of pending jobs in the partition.
- Pending Jobs, Max Nodes: maximum number of nodes requested by any pending job in the partition.
- Running Jobs: number of running jobs in the partition.
- Held Jobs: number of held jobs in the partition.
User metrics:
- Job Count: number of incomplete (e.g. pending, running) jobs for the user.
- Pending Jobs: number of pending jobs for the user.
- Running Jobs: number of running jobs for the user.
- Held Jobs: number of held jobs for the user.
Currently only a minimal set of metrics is collected. More metrics may be added in the future.
- Slurm Version: >= 24.05
Install kube-prometheus-stack
for metrics collection and observation.
Prometheus is also used as an extension API server so custom Slurm metrics may
be used with autoscaling.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace=prometheus --create-namespace
Install the slurm-exporter:
helm install slurm-exporter oci://ghcr.io/slinkyproject/charts/slurm-exporter \
--namespace=slurm-exporter --create-namespace
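One way to check that the release was deployed is sketched below, using the release name and namespace from the install command above. The function name `verify_install` is hypothetical, and the check only confirms that Helm knows the release and that pods exist in the namespace; it does not validate the metrics themselves.

```shell
#!/bin/sh
# Hypothetical helper: report the Helm release status, then list the pods
# in the namespace used by the install command.
verify_install() {
  helm status slurm-exporter --namespace=slurm-exporter &&
    kubectl get pods --namespace=slurm-exporter
}

# Usage (requires helm and kubectl configured for the target cluster):
#   verify_install
```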
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.