[WIP] Turn panel descriptions into reference docs #99

66 changes: 66 additions & 0 deletions docs/reference/dashboards/cluster.md
@@ -0,0 +1,66 @@
# Cluster Information

The cluster dashboard contains several panels that show relevant cluster-wide information.

```{warning}
This section is a Work in Progress!
```

## Cluster Stats

### Running Users

Number of currently running users per hub. Common shapes this visualization may take:

1. A large number of users starting servers at exactly the same time shows up here as a single spike, and may cause stability issues. Since all hubs share the same cluster, such a spike on a *different* hub may still affect your hub.

### Memory commitment %

Percentage of memory in cluster guaranteed to user workloads. Common shapes:

1. If this is consistently low (<50%), you are paying for cloud compute that you do not need. Consider reducing the size of your nodes, or increasing the amount of memory guaranteed to your users. Some variability based on time of day is to be expected.
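
As a rough illustration of what this panel measures, the sketch below derives a commitment percentage from per-pod memory requests and per-node allocatable memory. The helper name and the numbers are made up for the example and are not the dashboard's actual query.

```python
# Rough sketch (not the panel's actual PromQL): commitment is the memory
# *guaranteed* to pods (their requests) as a share of what the cluster can
# allocate, regardless of how much memory is actually used.

def memory_commitment_pct(pod_requests_bytes, node_allocatable_bytes):
    """Return requested memory as a percentage of allocatable memory."""
    return 100 * sum(pod_requests_bytes) / sum(node_allocatable_bytes)

GiB = 2**30
# e.g. 40 user pods each guaranteed 1 GiB, running on four 16 GiB nodes
print(f"{memory_commitment_pct([1 * GiB] * 40, [16 * GiB] * 4):.1f}%")  # 62.5%
```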

### CPU commitment %

Percentage of total CPU in the cluster currently guaranteed to user workloads.

JupyterHub workloads are most commonly *memory bound*, not CPU bound, so this graph is usually less informative than the memory commitment graph.

Common shapes:
1. If this is *consistently high* but shaped differently from your memory commitment graph, consider adjusting the amount of CPU guaranteed to your users.

### Node count

Number of nodes in each nodepool in this cluster.

### Pods not in Running state

Pods in states other than 'Running'.
In a functioning cluster, pods should not remain in non-Running states for long.

## Node stats

### Node CPU Commit %

Percentage of each node's CPU guaranteed to pods on it.

### Node Memory Commit %

Percentage of each node's memory guaranteed to pods on it. When this hits 100%, the autoscaler will spawn a new node and the scheduler will stop putting pods on the old node.

### Node Memory Utilization %

Percentage of available memory currently in use.

### Node CPU Utilization %

Percentage of available CPUs currently in use.

### Out of Memory kill count

Number of Out of Memory (OOM) kills in a given node.

When users use up more memory than they are allowed, the notebook kernel they
were running usually gets killed and restarted. This graph shows the number of times
that happens on any given node, and helps validate that a notebook kernel restart was
in fact caused by an OOM.
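
If you need to confirm an OOM kill outside of Grafana, one way is to search the node's kernel log. The sketch below assumes you already have a shell on the node (for example via `kubectl debug node/...`) and that `dmesg` is readable there; the exact message wording varies across kernel versions.

```python
# Assumption-laden sketch: look for OOM killer messages in the kernel ring
# buffer of a node you have a shell on (may require root). Not a replacement
# for the panel, just a way to cross-check a specific incident.
import subprocess

log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
oom_lines = [
    line for line in log.splitlines()
    if "out of memory" in line.lower() or "oom-kill" in line.lower()
]
print("\n".join(oom_lines) or "No OOM kills found in the kernel ring buffer.")
```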
9 changes: 9 additions & 0 deletions docs/reference/dashboards/global.md
@@ -0,0 +1,9 @@
# Global Usage

Contains "global" dashboards with useful stats computed across all datasources.

```{warning}
This section is a Work in Progress!
```

## Active users (over 7 days)
73 changes: 73 additions & 0 deletions docs/reference/dashboards/jupyterhub.md
@@ -0,0 +1,73 @@
# JupyterHub Dashboard

The JupyterHub dashboard contains several panels with useful stats about usage & diagnostics.

```{warning}
This section is a Work in Progress!
```

## Currently Active Users

## Daily Active Users

Number of unique users who were active within the preceding 24h period.

Requires JupyterHub 3.1.

## Weekly Active Users

Number of unique users who were active within the preceding 7d period.

Requires JupyterHub 3.1.

## Monthly Active Users

Number of unique users who were active within the preceding 30d period.

Requires JupyterHub 3.1.
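
If you want these numbers outside Grafana, you can query Prometheus directly. In the sketch below, the metric name `jupyterhub_active_users` and its `period` label are assumptions based on what JupyterHub 3.1 exposes — check your hub's `/metrics` endpoint for the exact names — and the Prometheus URL is a placeholder.

```python
# Sketch: query the assumed JupyterHub active-users metric from Prometheus.
# Verify the metric name against your hub's /metrics endpoint before relying on it.
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # placeholder for your Prometheus

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": 'jupyterhub_active_users{period="24h"}'},
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```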

## Hub DB Disk Space Availability %

% of disk space left on the disk storing the JupyterHub sqlite database. If this reaches 0, the hub will fail.
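
For a quick spot check from inside the hub pod, something like the sketch below works. The `/srv/jupyterhub` path is an assumption (a common mount point for the hub database volume in z2jh installs); adjust it to wherever your database volume is mounted.

```python
# Sketch: report free space on the volume holding jupyterhub.sqlite.
# /srv/jupyterhub is an assumed mount point -- adjust for your deployment.
import shutil

usage = shutil.disk_usage("/srv/jupyterhub")
print(f"{100 * usage.free / usage.total:.1f}% free")
```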

## Server Start Times

## Server Start Failures

Failed attempts by users to start their servers.

## Users per node

## Non Running Pods

Pods in a non-running state in the hub's namespace.

Pods stuck in non-running states often indicate an error condition.

## Free space (%) in shared volume (Home directories, etc.)

% of disk space left in a shared storage volume, typically used for users' home directories.

Requires an additional node_exporter deployment to work. If this graph is empty, look at the README for jupyterhub/grafana-dashboards to see what extra deployment is needed.

## Very old user pods

User pods that have been running for a long time (>8h).

This often indicates problems with the idle culler.
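
To see which pods this panel is counting, a sketch using the official Kubernetes Python client is below. The `jhub` namespace and the `component=singleuser-server` label selector are assumptions based on a typical z2jh deployment; match them to your install.

```python
# Sketch: list user server pods that have been running for more than 8 hours.
# Namespace and label selector are assumptions -- adjust for your deployment.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(hours=8)
pods = v1.list_namespaced_pod("jhub", label_selector="component=singleuser-server")
for pod in pods.items:
    if pod.status.start_time and pod.status.start_time < cutoff:
        print(pod.metadata.name, pod.status.start_time)
```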

## User Pods with high CPU usage (>0.5)

User pods using a lot of CPU.

This could indicate a runaway process consuming resources unnecessarily.

## User pods with high memory usage (>80% of limit)

User pods getting close to their memory limit.

Once they hit their memory limit, user kernels will start dying.

## Images used by user pods

Number of user servers using each container image.
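
A rough equivalent outside Grafana is to tally images over the user pods with the Kubernetes Python client, as in the sketch below; again, the namespace and label selector are assumptions from a typical z2jh install.

```python
# Sketch: count how many user pods run each container image.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("jhub", label_selector="component=singleuser-server")
images = Counter(c.image for pod in pods.items for c in pod.spec.containers)
for image, count in images.most_common():
    print(count, image)
```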
28 changes: 28 additions & 0 deletions docs/reference/dashboards/support.md
@@ -0,0 +1,28 @@
# NFS and Support Information

The NFS and Support Information dashboard contains several panels with useful information about support resources.

```{warning}
This section is a Work in Progress!
```

## User Nodes NFS Ops

## NFS Operation Types on user nodes

## NFS Server CPU

## NFS Server Disk ops

## NFS Server disk read latency

## NFS Server disk write latency

## Prometheus Memory (Working Set)

## Prometheus CPU

## Prometheus Free Disk space

## Prometheus Network Usage

13 changes: 13 additions & 0 deletions docs/reference/dashboards/usage-report.md
@@ -0,0 +1,13 @@
# Usage Report

```{warning}
This section is a Work in Progress!
```

## User pod memory usage

## Dask-gateway worker pod memory usage

## Dask-gateway scheduler pod memory usage

## GPU pod memory usage
35 changes: 35 additions & 0 deletions docs/reference/dashboards/user.md
@@ -0,0 +1,35 @@
# User Diagnostics

```{warning}
This section is a Work in Progress!
```

## Memory Usage

Per-user per-server memory usage.

## CPU Usage

Per-user per-server CPU usage.

## Home Directory Usage (on shared home directories)

Per-user home directory size, when using a shared home directory.

Requires https://github.com/yuvipanda/prometheus-dirsize-exporter to
be set up.

Similar to server pod names, user names will be *encoded* here
using the escapism python library (https://github.com/minrk/escapism).
You can decode them with the following python snippet:

```python
from escapism import unescape

# '-' is the escape character used when the user name was encoded
print(unescape('<escaped-username>', '-'))
```

## Memory Requests

Per-user per-server memory requests.

## CPU Requests

Per-user per-server CPU requests.
6 changes: 6 additions & 0 deletions docs/reference/index.md
@@ -17,4 +17,10 @@ Please see our [contributing guide](contributing) if you'd like to add to it.
% that they appear in the table of contents
```{toctree}
:maxdepth: 2
dashboards/cluster.md
dashboards/jupyterhub.md
dashboards/support.md
dashboards/usage-report.md
dashboards/user.md
dashboards/global.md
```