[WIP] Turn panel descriptions into reference docs #99

66 changes: 66 additions & 0 deletions docs/reference/dashboards/cluster.md
@@ -0,0 +1,66 @@
# Cluster Information

The cluster dashboard contains several panels that show relevant cluster-wide information.

```{warning}
This section is a Work in Progress!
```

## Cluster Stats

### Running Users

Number of currently running users per hub. Common shapes this visualization may take:

1. A large number of users starting servers at exactly the same time shows up here as a single spike, and may cause stability issues. Since all hubs share the same cluster, such a spike on a *different* hub may still affect your hub.

### Memory commitment %

Percentage of memory in cluster guaranteed to user workloads. Common shapes:

1. If this is consistently low (<50%), you are paying for cloud compute that you do not need. Consider reducing the size of your nodes, or increasing the amount of memory guaranteed to your users. Some variability based on time of day is to be expected.
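
As a rough illustration of what this panel measures, the sketch below derives a commitment percentage from per-pod memory requests and per-node allocatable memory. The helper name and the numbers are made up for the example and are not the dashboard's actual query.

```python
# Rough sketch (not the panel's actual PromQL): commitment is the memory
# *guaranteed* to pods (their requests) as a share of what the cluster can
# allocate, regardless of how much memory is actually used.

def memory_commitment_pct(pod_requests_bytes, node_allocatable_bytes):
    """Return requested memory as a percentage of allocatable memory."""
    return 100 * sum(pod_requests_bytes) / sum(node_allocatable_bytes)

GiB = 2**30
# e.g. 40 user pods each guaranteed 1 GiB, running on four 16 GiB nodes
print(f"{memory_commitment_pct([1 * GiB] * 40, [16 * GiB] * 4):.1f}%")  # 62.5%
```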

### CPU commitment %

Percentage of total CPU in the cluster currently guaranteed to user workloads.

JupyterHub workloads are most commonly *memory bound*, not CPU bound, so this graph is usually less informative than the memory commitment graph.

Common shapes:
1. If this is *consistently high* but shaped differently from your memory commitment graph, consider adjusting the amount of CPU guaranteed to your users.

### Node count

Number of nodes in each nodepool in this cluster.

### Pods not in Running state

Pods in states other than 'Running'.
In a functioning cluster, pods should not remain in non-Running states for long.

## Node stats

### Node CPU Commit %

Percentage of each node's CPU guaranteed to pods on it.

### Node Memory Commit %

Percentage of each node's memory guaranteed to pods on it. When this hits 100%, the autoscaler will spawn a new node and the scheduler will stop putting pods on the old node.

### Node Memory Utilization %

Percentage of available memory currently in use.

### Node CPU Utilization %

Percentage of available CPUs currently in use.

### Out of Memory kill count

Number of Out of Memory (OOM) kills in a given node.

When users use up more memory than they are allowed, the notebook kernel they
were running usually gets killed and restarted. This graph shows the number of times
that happens on any given node, and helps validate that a notebook kernel restart was
in fact caused by an OOM.
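
If you need to confirm an OOM kill outside of Grafana, one way is to search the node's kernel log. The sketch below assumes you already have a shell on the node (for example via `kubectl debug node/...`) and that `dmesg` is readable there; the exact message wording varies across kernel versions.

```python
# Assumption-laden sketch: look for OOM killer messages in the kernel ring
# buffer of a node you have a shell on (may require root). Not a replacement
# for the panel, just a way to cross-check a specific incident.
import subprocess

log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
oom_lines = [
    line for line in log.splitlines()
    if "out of memory" in line.lower() or "oom-kill" in line.lower()
]
print("\n".join(oom_lines) or "No OOM kills found in the kernel ring buffer.")
```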
9 changes: 9 additions & 0 deletions docs/reference/dashboards/global.md
@@ -0,0 +1,9 @@
# Global Usage

Contains "global" dashboards with useful stats computed across all datasources.

```{warning}
This section is a Work in Progress!
```

## Active users (over 7 days)
73 changes: 73 additions & 0 deletions docs/reference/dashboards/jupyterhub.md
@@ -0,0 +1,73 @@
# JupyterHub Dashboard

The JupyterHub dashboard contains several panels with useful stats about usage & diagnostics.

```{warning}
This section is a Work in Progress!
```

## Currently Active Users

## Daily Active Users

Number of unique users who were active within the preceding 24h period.

Requires JupyterHub 3.1.

## Weekly Active Users

Number of unique users who were active within the preceding 7d period.

Requires JupyterHub 3.1.

## Monthly Active Users

Number of unique users who were active within the preceding 30d period.

Requires JupyterHub 3.1.
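
If you want these numbers outside Grafana, you can query Prometheus directly. In the sketch below, the metric name `jupyterhub_active_users` and its `period` label are assumptions based on what JupyterHub 3.1 exposes — check your hub's `/metrics` endpoint for the exact names — and the Prometheus URL is a placeholder.

```python
# Sketch: query the assumed JupyterHub active-users metric from Prometheus.
# Verify the metric name against your hub's /metrics endpoint before relying on it.
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # placeholder for your Prometheus

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": 'jupyterhub_active_users{period="24h"}'},
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```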

## Hub DB Disk Space Availability %

% of disk space left on the disk storing the JupyterHub sqlite database. If this reaches 0, the hub will fail.
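
For a quick spot check from inside the hub pod, something like the sketch below works. The `/srv/jupyterhub` path is an assumption (a common mount point for the hub database volume in z2jh installs); adjust it to wherever your database volume is mounted.

```python
# Sketch: report free space on the volume holding jupyterhub.sqlite.
# /srv/jupyterhub is an assumed mount point -- adjust for your deployment.
import shutil

usage = shutil.disk_usage("/srv/jupyterhub")
print(f"{100 * usage.free / usage.total:.1f}% free")
```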

## Server Start Times

## Server Start Failures

Failed attempts by users to start their servers.

## Users per node

## Non Running Pods

Pods in a non-running state in the hub's namespace.

Pods stuck in non-running states often indicate an error condition.

## Free space (%) in shared volume (Home directories, etc.)

% of disk space left in a shared storage volume, typically used for users' home directories.

Requires an additional node_exporter deployment to work. If this graph is empty, look at the README for jupyterhub/grafana-dashboards to see what extra deployment is needed.

## Very old user pods

User pods that have been running for a long time (>8h).

This often indicates problems with the idle culler.
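
To see which pods this panel is counting, a sketch using the official Kubernetes Python client is below. The `jhub` namespace and the `component=singleuser-server` label selector are assumptions based on a typical z2jh deployment; match them to your install.

```python
# Sketch: list user server pods that have been running for more than 8 hours.
# Namespace and label selector are assumptions -- adjust for your deployment.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(hours=8)
pods = v1.list_namespaced_pod("jhub", label_selector="component=singleuser-server")
for pod in pods.items:
    if pod.status.start_time and pod.status.start_time < cutoff:
        print(pod.metadata.name, pod.status.start_time)
```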

## User Pods with high CPU usage (>0.5)

User pods using a lot of CPU.

This could indicate a runaway process consuming resources unnecessarily.

## User pods with high memory usage (>80% of limit)

User pods getting close to their memory limit.

Once they hit their memory limit, user kernels will start dying.

## Images used by user pods

Number of user servers using each container image.
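
A rough equivalent outside Grafana is to tally images over the user pods with the Kubernetes Python client, as in the sketch below; again, the namespace and label selector are assumptions from a typical z2jh install.

```python
# Sketch: count how many user pods run each container image.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("jhub", label_selector="component=singleuser-server")
images = Counter(c.image for pod in pods.items for c in pod.spec.containers)
for image, count in images.most_common():
    print(count, image)
```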
28 changes: 28 additions & 0 deletions docs/reference/dashboards/support.md
@@ -0,0 +1,28 @@
# NFS and Support Information

The NFS and Support Information dashboard contains several panels with useful information about support resources.

```{warning}
This section is a Work in Progress!
```

## User Nodes NFS Ops

## NFS Operation Types on user nodes

## NFS Server CPU

## NFS Server Disk ops

## NFS Server disk read latency

## NFS Server disk write latency

## Prometheus Memory (Working Set)

## Prometheus CPU

## Prometheus Free Disk space

## Prometheus Network Usage

13 changes: 13 additions & 0 deletions docs/reference/dashboards/usage-report.md
@@ -0,0 +1,13 @@
# Usage Report

```{warning}
This section is a Work in Progress!
```

## User pod memory usage

## Dask-gateway worker pod memory usage

## Dask-gateway scheduler pod memory usage

## GPU pod memory usage
35 changes: 35 additions & 0 deletions docs/reference/dashboards/user.md
@@ -0,0 +1,35 @@
# User Diagnostics

```{warning}
This section is a Work in Progress!
```

## Memory Usage

Per-user per-server memory usage.

## CPU Usage

Per-user per-server CPU usage.

## Home Directory Usage (on shared home directories)

Per-user home directory size, when using a shared home directory.

Requires https://github.com/yuvipanda/prometheus-dirsize-exporter to
be set up.

Similar to server pod names, user names will be *encoded* here
using the escapism python library (https://github.com/minrk/escapism).
You can decode them with the following python snippet:

```python
from escapism import unescape

# '-' is the escape character used when the user name was encoded
print(unescape('<escaped-username>', '-'))
```

## Memory Requests

Per-user per-server memory requests.

## CPU Requests

Per-user per-server CPU requests.
6 changes: 6 additions & 0 deletions docs/reference/index.md
@@ -17,4 +17,10 @@ Please see our [contributing guide](contributing) if you'd like to add to it.
% that they appear in the table of contents
```{toctree}
:maxdepth: 2
dashboards/cluster.md
dashboards/jupyterhub.md
dashboards/support.md
dashboards/usage-report.md
dashboards/user.md
dashboards/global.md
```