Slurm Cluster Metrics Exporter

Prometheus collector and exporter of Slurm cluster metrics. A Slinky project.

Overview

Metrics are collected from Slurm through the Slurm REST API and exported as Prometheus metric types. The exporter authenticates with a JWT, so it runs with the privileges of the Slurm cluster user that the token was issued for.
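
As a rough illustration of the collection path described above, the following Python sketch builds an authenticated request to the Slurm REST API. It is not the exporter's own implementation. The base URL `http://localhost:6820`, the API version path `v0.0.41`, the user name, and the helper names `build_slurm_request` and `fetch_json` are assumptions chosen for the example. The `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers are the ones slurmrestd accepts for JWT authentication.

```python
import json
import urllib.request


def build_slurm_request(base_url, path, user, token):
    """Build a GET request carrying slurmrestd's JWT authentication headers."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    headers = {
        "X-SLURM-USER-NAME": user,    # Slurm user the token was issued for
        "X-SLURM-USER-TOKEN": token,  # the JWT itself
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (needs a reachable slurmrestd)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Assumed values for illustration only; adjust them to your cluster.
request = build_slurm_request(
    "http://localhost:6820", "/slurm/v0.0.41/nodes", "slurm", "<jwt>"
)
print(request.full_url)
```

Calling `fetch_json(request)` against a live slurmrestd would return the decoded JSON document. The token's user determines what the response is allowed to contain.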

The recommended deployment mechanism is the Helm chart.

Features

Nodes

Allocated: nodes which have been allocated to one or more jobs.
Completing: nodes on which all associated jobs are in the process of completing.
Down: nodes which are unavailable for use.
Drain: nodes which are marked as drain; they are unavailable for future work but can complete their current work.
- Drained: nodes which are unavailable for use (per system administrator request) and have no more work to complete.
- Draining: nodes which are currently allocated to one or more jobs but will not be allocated additional jobs.
Idle: nodes which are not allocated to any jobs and are available for use.
Maintenance: nodes which are currently in a maintenance reservation.
Mixed: nodes which have some but not all of their CPUs allocated, or whose suspended jobs still have TRES (e.g. memory) allocated.
Reserved: nodes which are in an advanced reservation and not generally available.
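
The node categories above can overlap: a draining node, for example, is also allocated or mixed. The sketch below shows one way such counts could be derived. It assumes each node record carries a `state` list of upper-case flag strings (an assumption about the REST API's JSON shape), and the function name `count_node_states` and the drained/draining rule are illustrative rather than the exporter's actual logic.

```python
from collections import Counter

# State flags that map one-to-one onto a reported category.
SIMPLE_STATES = (
    "ALLOCATED", "COMPLETING", "DOWN", "IDLE",
    "MAINTENANCE", "MIXED", "RESERVED",
)
# Flags taken to mean the node still has work to finish (assumption).
BUSY_STATES = {"ALLOCATED", "MIXED", "COMPLETING"}


def count_node_states(nodes):
    """Count nodes per category; a node may fall into several categories."""
    counts = Counter()
    for node in nodes:
        flags = {flag.upper() for flag in node.get("state", [])}
        for name in SIMPLE_STATES:
            if name in flags:
                counts[name.lower()] += 1
        if "DRAIN" in flags:
            counts["drain"] += 1
            # Draining: still running work. Drained: nothing left to finish.
            counts["draining" if flags & BUSY_STATES else "drained"] += 1
    return dict(counts)


example_nodes = [
    {"name": "node1", "state": ["IDLE"]},
    {"name": "node2", "state": ["IDLE", "DRAIN"]},
    {"name": "node3", "state": ["MIXED", "DRAIN"]},
]
print(count_node_states(example_nodes))
```

Each resulting count would correspond to one gauge value, since node state totals can rise and fall between scrapes.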

Partitions

Nodes: number of nodes associated with the partition.
CPUs: number of CPUs associated with the partition.
Jobs: number of incomplete (e.g. pending, running) jobs in the partition.
Allocated CPUs: number of ALLOCATED CPUs among all nodes in the partition.
Idle CPUs: number of IDLE CPUs among all nodes in the partition.
Pending Jobs: number of pending jobs in the partition.
Pending Jobs, Max Nodes: largest number of nodes requested by any pending job in the partition.
Running Jobs: number of running jobs in the partition.
Held Jobs: number of held jobs in the partition.
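
To make the job-related partition metrics concrete, here is a minimal sketch that aggregates them from simplified job records. The record fields (`partition`, `state`, `nodes`, `held`) and the function name `partition_job_metrics` are hypothetical, and treating only pending and running jobs as incomplete is a simplification. The exporter's real data model may differ.

```python
from collections import defaultdict


def new_entry():
    """Fresh zeroed counters for one partition."""
    return {
        "jobs": 0,
        "pending": 0,
        "running": 0,
        "held": 0,
        "pending_max_nodes": 0,
    }


def partition_job_metrics(jobs):
    """Aggregate per-partition job counts from simplified job records."""
    metrics = defaultdict(new_entry)
    for job in jobs:
        state = job["state"].upper()
        if state not in ("PENDING", "RUNNING"):
            continue  # simplification: all other states count as complete
        entry = metrics[job["partition"]]
        entry["jobs"] += 1
        if state == "PENDING":
            entry["pending"] += 1
            # Track the largest node request among pending jobs.
            entry["pending_max_nodes"] = max(
                entry["pending_max_nodes"], job.get("nodes", 0)
            )
        else:
            entry["running"] += 1
        if job.get("held", False):  # hypothetical boolean field
            entry["held"] += 1
    return dict(metrics)


example_jobs = [
    {"partition": "debug", "state": "PENDING", "nodes": 4},
    {"partition": "debug", "state": "RUNNING", "nodes": 1},
]
print(partition_job_metrics(example_jobs))
```

Grouping by a user field instead of `partition` would yield the per-user job counts in the same way.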

User Statistics

Job Count: number of incomplete (e.g. pending, running) jobs for the user.
Pending Jobs: number of pending jobs for the user.
Running Jobs: number of running jobs for the user.
Held Jobs: number of held jobs for the user.

Limitations

Currently only a minimal set of metrics is collected. More metrics may be added in the future.

Slurm Version: >= 24.05

Installation

Install kube-prometheus-stack for metrics collection and observation. Prometheus is also used as an extension API server so custom Slurm metrics may be used with autoscaling.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace=prometheus --create-namespace

Install the slurm-exporter:

helm install slurm-exporter oci://ghcr.io/slinkyproject/charts/slurm-exporter \
  --namespace=slurm-exporter --create-namespace

License

Copyright (C) SchedMD LLC.

Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
