gpu: Update installation instructions (#20112)

gjulianm · rtrieu · web-flow · commit 680afbf9cd24 · 2025-04-22T10:03:16.000Z
* Update README

* Add requirements

* Update gpu/README.md

Co-authored-by: Rosa Trieu <107086888+rtrieu@users.noreply.github.com>

---------

Co-authored-by: Rosa Trieu <107086888+rtrieu@users.noreply.github.com>
diff --git a/gpu/README.md b/gpu/README.md
@@ -9,6 +9,12 @@ Supported vendors: NVIDIA.
 - Track utilization of GPU devices and retrieve performance and health metrics.
 - Monitor processes that are using GPU devices and their performance.
 
+## Requirements
+
+- NVIDIA driver version: 450.51 and above
+- Supported OS: Linux only
+- Linux kernel version: 5.8 and above
+
 ## Setup
 
 ### Installation
@@ -26,14 +32,21 @@ The check also uses eBPF probes to assign GPU usage and performance metrics to p
 
 #### Host
 
-Enabling the `gpu` integration requires `system-probe` to have the configuration option enabled.  Inside the `system-probe.yaml` configuration file, the following parameters must be set:
+The Agent must be configured to enable GPU-related features. Add the following parameters to the `/etc/datadog-agent/datadog.yaml` configuration file and then restart the Agent:
+
+```yaml
+collect_gpu_tags: true
+enable_nvml_detection: true
+```
+
+Enabling the `gpu` integration requires `system-probe` to have the configuration option enabled for collecting per-process metrics. Inside the `/etc/datadog-agent/system-probe.yaml` configuration file, the following parameters must be set:
 
 ```yaml
 gpu_monitoring:
   enabled: true
 ```
 
-The check in the Agent configuration file is enabled by default whenever NVIDIA GPUs and their drivers are detected in the system. However, it can also be configured manually following these steps:
+The check in the Agent configuration file is enabled by default whenever NVIDIA GPUs and their drivers are detected in the system, as long as the `enable_nvml_detection` parameter is set to `true`. However, it can also be configured manually following these steps:
 
 1. Edit the `gpu.d/conf.yaml` file, in the `conf.d/` folder at the root of your
    Agent's configuration directory, to start collecting your GPU performance data.
@@ -46,6 +59,43 @@ This check is automatically enabled when the Agent is running on a host with NVI
 <!-- xxz tab xxx -->
 <!-- xxx tab "Containerized" xxx -->
 
+#### Docker
+
+The GPU monitoring feature requires the `system-probe` component to be enabled, so in addition to the configuration above for the `datadog.yaml` and `system-probe.yaml` files, the following needs to be added to the `docker run` command:
+
+```bash
+docker run --cgroupns host \
+  --pid host \
+  -e DD_API_KEY="<DATADOG_API_KEY>" \
+  -e DD_GPU_MONITORING_ENABLED=true \
+  -v /var/run/docker.sock:/var/run/docker.sock:ro \
+  -v /proc/:/host/proc/:ro \
+  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
+  -v /sys/kernel/debug:/sys/kernel/debug \
+  -v /lib/modules:/lib/modules:ro \
+  -v /usr/src:/usr/src:ro \
+  -v /var/tmp/datadog-agent/system-probe/build:/var/tmp/datadog-agent/system-probe/build \
+  -v /var/tmp/datadog-agent/system-probe/kernel-headers:/var/tmp/datadog-agent/system-probe/kernel-headers \
+  -v /etc/apt:/host/etc/apt:ro \
+  -v /etc/yum.repos.d:/host/etc/yum.repos.d:ro \
+  -v /etc/zypp:/host/etc/zypp:ro \
+  -v /etc/pki:/host/etc/pki:ro \
+  -v /etc/yum/vars:/host/etc/yum/vars:ro \
+  -v /etc/dnf/vars:/host/etc/dnf/vars:ro \
+  -v /etc/rhsm:/host/etc/rhsm:ro \
+  -e HOST_ROOT=/host/root \
+  --security-opt apparmor:unconfined \
+  --cap-add=SYS_ADMIN \
+  --cap-add=SYS_RESOURCE \
+  --cap-add=SYS_PTRACE \
+  --cap-add=NET_ADMIN \
+  --cap-add=NET_BROADCAST \
+  --cap-add=NET_RAW \
+  --cap-add=IPC_LOCK \
+  --cap-add=CHOWN \
+  gcr.io/datadoghq/agent:latest
+```
+
 #### Important: Running on Helm/Kubernetes in mixed environments
 
 One important thing to note when deploying to Kubernetes clusters is that the Datadog Agent pods need access to both the GPUs and NVIDIA's NVML library (`libnvidia-ml.so`). Due to the design of NVIDIA's Kubernetes Device Plugin, the Agent pods must run with the `nvidia` runtime class to get that access. This means that the Agent pods cannot run in the default runtime class.
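
Note on the new Requirements section: the version floors it lists (NVIDIA driver 450.51 and above, Linux kernel 5.8 and above) could be checked before installing with something like the sketch below. This is not part of the change or of the Agent. The helper names `parse_version` and `meets_minimum` are hypothetical, and reading the kernel version from `platform.release()` is an assumption about how the check would be wired up.

```python
import platform
import re

# Minimums taken from the Requirements section added in this change.
MIN_KERNEL = (5, 8)
MIN_DRIVER = (450, 51)


def parse_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints.

    Hypothetical helper: "5.15.0-91-generic" becomes (5, 15, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum):
    """Return True if the version string is at or above the minimum tuple."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    is_linux = platform.system() == "Linux"
    print("Linux host:", is_linux)
    if is_linux:
        # Assumption: platform.release() reports the running kernel version.
        print("Kernel OK:", meets_minimum(platform.release(), MIN_KERNEL))
```

The same `meets_minimum` call can be applied to a driver version string with `MIN_DRIVER`, for example one obtained from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The comparison is a plain tuple comparison, so a string with fewer components than the minimum (such as `"5"`) is treated as below it.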