SlinkyProject/slurm-exporter

Slurm Cluster Metrics Exporter

Prometheus collector and exporter of Slurm cluster metrics. A Slinky project.

Overview

Metrics are collected from Slurm through the Slurm REST API and exported as Prometheus metric types. The exporter authenticates with a JSON Web Token (JWT), so it runs with the privileges of the Slurm cluster user that the token was issued for.
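To make the flow above concrete, here is a minimal Python sketch of the two halves: building an authenticated request for the Slurm REST API, and rendering a result in the Prometheus text exposition format. It is an illustration only, not the exporter's implementation. The host name, the `v0.0.41` API version path, the function names, and the `example_slurm_nodes` metric name are all assumptions chosen for the example; the request is built but never sent.

```python
import urllib.request


def build_nodes_request(base_url: str, token: str,
                        api_version: str = "v0.0.41") -> urllib.request.Request:
    """Build (but do not send) a GET request for the node list.

    The base URL and API version are assumptions for this sketch.
    """
    url = f"{base_url.rstrip('/')}/slurm/{api_version}/nodes"
    # slurmrestd reads the JWT from the X-SLURM-USER-TOKEN header.
    return urllib.request.Request(url, headers={"X-SLURM-USER-TOKEN": token})


def render_gauge(name: str, help_text: str,
                 samples: dict[str, float], label: str) -> str:
    """Render one gauge family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    request = build_nodes_request("http://slurmrestd.example:6820", "<jwt>")
    print(request.full_url)
    # Hypothetical metric name and values, for illustration only.
    print(render_gauge("example_slurm_nodes", "Nodes by state.",
                       {"idle": 2, "allocated": 1}, "state"), end="")
```

In practice a Prometheus client library would handle the exposition format; it is written out by hand here only to show what "exported as Prometheus types" looks like on the wire.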

The recommended deployment mechanism is the Helm chart.

Features

Nodes

  • Allocated: nodes which have been allocated to one or more jobs.
  • Completing: nodes on which all associated jobs are in the process of COMPLETING.
  • Down: nodes which are unavailable for use.
  • Drain: nodes which are marked as drain, unavailable for future work but can complete their current work.
    • Drained: nodes which are unavailable for use (per system administrator request) and have no more work to complete.
    • Draining: nodes which are currently allocated to one or more jobs but will not be allocated additional jobs.
  • Idle: nodes which are not allocated to any jobs and are available for use.
  • Maintenance: nodes which are currently in a maintenance reservation.
  • Mixed: nodes which have some but not all of their CPUs ALLOCATED, or suspended jobs have TRES (e.g. Memory) still allocated.
  • Reserved: nodes which are in an advanced reservation and not generally available.
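The node states above are not mutually exclusive: a node can, for example, be both allocated and marked as drain. The following Python sketch shows one way such per-state counts could be derived. It assumes each node record carries a `state` list of upper-case strings (for example `["IDLE", "DRAIN"]`), and the drained/draining split is a simplification based on the definitions above; neither is taken from the exporter's source.

```python
from collections import Counter

# States counted directly when present in a node's state list.
COUNTED_STATES = ("ALLOCATED", "COMPLETING", "DOWN", "IDLE",
                  "MAINTENANCE", "MIXED", "RESERVED")
# A draining node still has work; a drained node has none (simplified).
BUSY_STATES = {"ALLOCATED", "MIXED", "COMPLETING"}


def count_node_states(nodes):
    """Count nodes per state; one node may contribute to several states."""
    counts = Counter()
    for node in nodes:
        flags = {state.upper() for state in node.get("state", [])}
        for state in COUNTED_STATES:
            if state in flags:
                counts[state.lower()] += 1
        if "DRAIN" in flags:
            counts["drain"] += 1
            counts["draining" if flags & BUSY_STATES else "drained"] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical node records, for illustration only.
    sample = [
        {"name": "n1", "state": ["IDLE"]},
        {"name": "n2", "state": ["ALLOCATED", "DRAIN"]},
        {"name": "n3", "state": ["IDLE", "DRAIN"]},
    ]
    print(count_node_states(sample))
```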

Partitions

  • Nodes: number of nodes associated with the partition.
  • CPUs: number of CPUs associated with the partition.
  • Jobs: number of incomplete (e.g. pending, running) jobs in the partition.
  • Allocated CPUs: number of ALLOCATED CPUs among all nodes in the partition.
  • Idle CPUs: number of IDLE CPUs among all nodes in the partition.
  • Pending Jobs: number of pending jobs in the partition.
  • Pending Jobs, Max Nodes: max number of nodes requested among all pending jobs in the partition.
  • Running Jobs: number of running jobs in the partition.
  • Held Jobs: number of held jobs in the partition.
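As a rough illustration of how the partition metrics relate to the underlying node and job data, the Python sketch below aggregates a few of them. The flat record shapes (`partitions`, `cpus`, `alloc_cpus`, `partition`, `state`, `node_count`) are hypothetical and simpler than the actual REST API responses, and idle CPUs are approximated as total minus allocated, which ignores nodes that are down.

```python
from collections import defaultdict


def summarize_partitions(nodes, jobs):
    """Aggregate node and job records into per-partition totals."""
    summary = defaultdict(lambda: {
        "nodes": 0, "cpus": 0, "alloc_cpus": 0, "idle_cpus": 0,
        "pending_jobs": 0, "pending_max_nodes": 0,
    })
    for node in nodes:
        # A node may belong to several partitions and counts in each.
        for partition in node["partitions"]:
            row = summary[partition]
            row["nodes"] += 1
            row["cpus"] += node["cpus"]
            row["alloc_cpus"] += node["alloc_cpus"]
            # Simplification: every unallocated CPU is treated as idle.
            row["idle_cpus"] += node["cpus"] - node["alloc_cpus"]
    for job in jobs:
        if job["state"] == "PENDING":
            row = summary[job["partition"]]
            row["pending_jobs"] += 1
            row["pending_max_nodes"] = max(row["pending_max_nodes"],
                                           job["node_count"])
    return {name: dict(row) for name, row in summary.items()}


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    nodes = [{"partitions": ["debug"], "cpus": 8, "alloc_cpus": 2}]
    jobs = [{"partition": "debug", "state": "PENDING", "node_count": 4}]
    print(summarize_partitions(nodes, jobs))
```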

User Statistics

  • Job Count: number of incomplete (e.g. pending, running) jobs for the user.
  • Pending Jobs: number of pending jobs for the user.
  • Running Jobs: number of running jobs for the user.
  • Held Jobs: number of held jobs for the user.
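The per-user statistics can be pictured as a grouped count over the job list. The Python sketch below assumes hypothetical flat job records with `user`, `state`, and an optional boolean `held` field, and treats pending, running, and suspended jobs as "incomplete"; the real API fields and the exporter's exact rules may differ.

```python
from collections import Counter

# Assumed set of states that count as "incomplete" for this sketch.
INCOMPLETE_STATES = {"PENDING", "RUNNING", "SUSPENDED"}


def count_user_jobs(jobs):
    """Return a Counter keyed by (user, metric) for incomplete jobs."""
    counts = Counter()
    for job in jobs:
        state = job["state"]
        if state not in INCOMPLETE_STATES:
            continue  # completed, failed, or cancelled jobs are ignored
        user = job["user"]
        counts[(user, "jobs")] += 1
        if state == "PENDING":
            counts[(user, "pending")] += 1
            # Held jobs are counted as a subset of pending jobs here.
            if job.get("held", False):
                counts[(user, "held")] += 1
        elif state == "RUNNING":
            counts[(user, "running")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical job records, for illustration only.
    sample = [
        {"user": "alice", "state": "PENDING", "held": True},
        {"user": "alice", "state": "RUNNING"},
    ]
    print(count_user_jobs(sample))
```

Because a `Counter` returns 0 for missing keys, a user with no running jobs can still be queried without a special case.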

Limitations

Currently, only a minimal set of metrics is collected. More metrics may be added in the future.

  • Slurm Version: >= 24.05

Installation

Install kube-prometheus-stack for metrics collection and observation. Prometheus is also used as an extension API server so custom Slurm metrics may be used with autoscaling.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace=prometheus --create-namespace

Install the slurm-exporter:

helm install slurm-exporter oci://ghcr.io/slinkyproject/charts/slurm-exporter \
  --namespace=slurm-exporter --create-namespace

License

Copyright (C) SchedMD LLC.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
