Commit 680afbf

gjulianm and rtrieu authored
gpu: Update installation instructions (#20112)
* Update README
* Add requirements
* Update gpu/README.md

Co-authored-by: Rosa Trieu <107086888+rtrieu@users.noreply.github.com>

---------

Co-authored-by: Rosa Trieu <107086888+rtrieu@users.noreply.github.com>
1 parent 8ffbb76 commit 680afbf

1 file changed: +52 -2 lines changed

gpu/README.md

Lines changed: 52 additions & 2 deletions
@@ -9,6 +9,12 @@ Supported vendors: NVIDIA.
 - Track utilization of GPU devices and retrieve performance and health metrics.
 - Monitor processes that are using GPU devices and their performance.
 
+## Requirements
+
+- NVIDIA driver version: 450.51 and above
+- Supported OS: Linux only
+- Linux kernel version: 5.8 and above
+
 ## Setup
 
 ### Installation
@@ -26,14 +32,21 @@ The check also uses eBPF probes to assign GPU usage and performance metrics to p
 
 #### Host
 
-Enabling the `gpu` integration requires `system-probe` to have the configuration option enabled. Inside the `system-probe.yaml` configuration file, the following parameters must be set:
+The agent needs to be configured to enable GPU-related features. Add the following parameters to the `/etc/datadog-agent/datadog.yaml` configuration file and then restart the Agent:
+
+```yaml
+collect_gpu_tags: true
+enable_nvml_detection: true
+```
+
+Enabling the `gpu` integration requires `system-probe` to have the configuration option enabled for collecting per-process metrics. Inside the `/etc/datadog-agent/system-probe.yaml` configuration file, the following parameters must be set:
 
 ```yaml
 gpu_monitoring:
   enabled: true
 ```
 
-The check in the Agent configuration file is enabled by default whenever NVIDIA GPUs and their drivers are detected in the system. However, it can also be configured manually following these steps:
+The check in the Agent configuration file is enabled by default whenever NVIDIA GPUs and their drivers are detected in the system, as long as the `enable_nvml_detection` parameter is set to `true`. However, it can also be configured manually following these steps:
 
 1. Edit the `gpu.d/conf.yaml` file, in the `conf.d/` folder at the root of your
    Agent's configuration directory, to start collecting your GPU performance data.
@@ -46,6 +59,43 @@ This check is automatically enabled when the Agent is running on a host with NVI
 <!-- xxz tab xxx -->
 <!-- xxx tab "Containerized" xxx -->
 
+#### Docker
+
+The GPU monitoring feature requires the `system-probe` component to be enabled, so in addition to the configuration above for the `datadog.yaml` and `system-probe.yaml` files, the following needs to be added to the `docker run` command:
+
+```bash
+docker run --cgroupns host \
+  --pid host \
+  -e DD_API_KEY="<DATADOG_API_KEY>" \
+  -e DD_GPU_MONITORING_ENABLED=true \
+  -v /var/run/docker.sock:/var/run/docker.sock:ro \
+  -v /proc/:/host/proc/:ro \
+  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
+  -v /sys/kernel/debug:/sys/kernel/debug \
+  -v /lib/modules:/lib/modules:ro \
+  -v /usr/src:/usr/src:ro \
+  -v /var/tmp/datadog-agent/system-probe/build:/var/tmp/datadog-agent/system-probe/build \
+  -v /var/tmp/datadog-agent/system-probe/kernel-headers:/var/tmp/datadog-agent/system-probe/kernel-headers \
+  -v /etc/apt:/host/etc/apt:ro \
+  -v /etc/yum.repos.d:/host/etc/yum.repos.d:ro \
+  -v /etc/zypp:/host/etc/zypp:ro \
+  -v /etc/pki:/host/etc/pki:ro \
+  -v /etc/yum/vars:/host/etc/yum/vars:ro \
+  -v /etc/dnf/vars:/host/etc/dnf/vars:ro \
+  -v /etc/rhsm:/host/etc/rhsm:ro \
+  -e HOST_ROOT=/host/root \
+  --security-opt apparmor:unconfined \
+  --cap-add=SYS_ADMIN \
+  --cap-add=SYS_RESOURCE \
+  --cap-add=SYS_PTRACE \
+  --cap-add=NET_ADMIN \
+  --cap-add=NET_BROADCAST \
+  --cap-add=NET_RAW \
+  --cap-add=IPC_LOCK \
+  --cap-add=CHOWN \
+  gcr.io/datadoghq/agent:latest
+```
+
 #### Important: Running on Helm/Kubernetes in mixed environments
 
 One important thing to note in the deployment for Kubernetes clusters is that, in order to access the GPUs, the Datadog Agent pods needs access to both the GPUs and NVIDIA's NVML library (`libnvidia-ml.so`). Due to the design of NVIDIA's Kubernetes Device Plugin, in order to have access to those features the Agent pods will need to run with the `nvidia` runtime class. This means that the Agent pods will not be able to run in the default runtime class.
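
The requirements added in this change (NVIDIA driver 450.51 and above, Linux only, kernel 5.8 and above) can be verified on a host before the integration is enabled. The following is a minimal sketch and is not part of the README or the Agent: the function names `version_ge`, `kernel_base`, and `check_gpu_requirements` are hypothetical, and it assumes that `nvidia-smi` is on the `PATH` and that `sort` supports `-V` (as in GNU coreutils).

```shell
#!/bin/sh
# Hypothetical pre-flight check; thresholds are copied from the Requirements list.
MIN_DRIVER="450.51"
MIN_KERNEL="5.8"

# Return 0 if dotted version $1 is greater than or equal to dotted version $2.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix from a kernel release string,
# for example "5.15.0-91-generic" becomes "5.15.0".
kernel_base() {
  printf '%s\n' "$1" | sed 's/[^0-9.].*//'
}

# Check OS, kernel version, and NVIDIA driver version in that order.
# Prints the first failed requirement and returns 1, or prints a summary and returns 0.
check_gpu_requirements() {
  os="$(uname -s)"
  if [ "$os" != "Linux" ]; then
    echo "unsupported OS: $os (Linux only)"
    return 1
  fi

  kernel="$(kernel_base "$(uname -r)")"
  if ! version_ge "$kernel" "$MIN_KERNEL"; then
    echo "kernel $kernel is older than $MIN_KERNEL"
    return 1
  fi

  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; cannot determine the NVIDIA driver version"
    return 1
  fi

  # Query the driver version; with several GPUs, the first line is enough.
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if ! version_ge "$driver" "$MIN_DRIVER"; then
    echo "NVIDIA driver $driver is older than $MIN_DRIVER"
    return 1
  fi

  echo "OK: kernel $kernel, NVIDIA driver $driver"
}
```

Calling `check_gpu_requirements` after sourcing the script reports the first unmet requirement. The version comparison is deliberately kept in a separate function so it can be exercised without a GPU present.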
