gpu/README.md
52 additions & 2 deletions
@@ -9,6 +9,12 @@ Supported vendors: NVIDIA.
9
9
- Track utilization of GPU devices and retrieve performance and health metrics.
10
10
- Monitor processes that are using GPU devices and their performance.
11
11
12
+
## Requirements
13
+
14
+
- NVIDIA driver version: 450.51 and above
15
+
- Supported OS: Linux only
16
+
- Linux kernel version: 5.8 and above
17
+
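The version thresholds in the Requirements list can be checked programmatically before enabling the integration. The following is only a minimal sketch, not part of the Agent: the names `parse_version`, `check_requirements`, and `check_this_host` are hypothetical, and the driver version string is assumed to be supplied by the caller (for example, copied from the output of `nvidia-smi`).

```python
import platform
import re

# Minimum versions taken from the Requirements list above.
MIN_DRIVER = (450, 51)
MIN_KERNEL = (5, 8)


def parse_version(text):
    """Return the leading dotted-number prefix of `text` as a tuple of ints.

    "5.15.0-91-generic" -> (5, 15, 0); "450.51.06" -> (450, 51, 6).
    """
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_requirements(system, kernel_release, driver_version):
    """Return a list of unmet requirements; an empty list means all are met."""
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Linux only)")
    elif parse_version(kernel_release) < MIN_KERNEL:
        problems.append(f"kernel {kernel_release} is older than 5.8")
    if parse_version(driver_version) < MIN_DRIVER:
        problems.append(f"NVIDIA driver {driver_version} is older than 450.51")
    return problems


def check_this_host(driver_version):
    """Check the current host; the driver version must be passed in."""
    return check_requirements(
        platform.system(), platform.release(), driver_version
    )
```

Calling `check_this_host("<driver version>")` on the target host returns the list of requirements that are not met. Versions are compared as integer tuples, so a three-part driver version such as `450.80.02` compares correctly against the two-part minimum `450.51`.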
12
18
## Setup
13
19
14
20
### Installation
@@ -26,14 +32,21 @@ The check also uses eBPF probes to assign GPU usage and performance metrics to p
26
32
27
33
#### Host
28
34
29
-
Enabling the `gpu` integration requires `system-probe` to have the configuration option enabled. Inside the `system-probe.yaml` configuration file, the following parameters must be set:
35
+
The Agent must be configured to enable GPU-related features. Add the following parameters to the `/etc/datadog-agent/datadog.yaml` configuration file, then restart the Agent:
36
+
37
+
```yaml
38
+
collect_gpu_tags: true
39
+
enable_nvml_detection: true
40
+
```
41
+
42
+
Collecting per-process metrics with the `gpu` integration also requires GPU monitoring to be enabled in `system-probe`. Set the following parameters in the `/etc/datadog-agent/system-probe.yaml` configuration file:
30
43
31
44
```yaml
32
45
gpu_monitoring:
33
46
  enabled: true
34
47
```
35
48
36
-
The check in the Agent configuration file is enabled by default whenever NVIDIA GPUs and their drivers are detected in the system. However, it can also be configured manually following these steps:
49
+
The check is enabled by default whenever NVIDIA GPUs and their drivers are detected on the system, as long as the `enable_nvml_detection` parameter is set to `true`. It can also be configured manually by following these steps:
37
50
38
51
1. Edit the `gpu.d/conf.yaml` file, in the `conf.d/` folder at the root of your
39
52
Agent's configuration directory, to start collecting your GPU performance data.
@@ -46,6 +59,43 @@ This check is automatically enabled when the Agent is running on a host with NVI
46
59
<!-- xxz tab xxx -->
47
60
<!-- xxx tab "Containerized" xxx -->
48
61
62
+
#### Docker
63
+
64
+
The GPU monitoring feature requires the `system-probe` component to be enabled, so in addition to the configuration above for the `datadog.yaml` and `system-probe.yaml` files, the following needs to be added to the `docker run` command:
#### Important: Running on Helm/Kubernetes in mixed environments
50
100
51
101
In Kubernetes clusters, the Datadog Agent pods need access to both the GPUs and NVIDIA's NVML library (`libnvidia-ml.so`). Due to the design of NVIDIA's Kubernetes Device Plugin, the Agent pods must run with the `nvidia` runtime class to get that access, which means they cannot run in the default runtime class.