Commit 97ebe8e

val06, mbertrone, and urseberry authored
[GPUM:] updated onboarding documentation for deployment with dd-operator (#20407)
* updated gpum onboarding documentation when using dd-operator for k8s deployments

* Update gpu/README.md

Co-authored-by: Matteo Bertrone <m.bertrone@gmail.com>

* fixed indentation

* CR fixes

Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>

* added extra config flags to helm deployment section in the guide

---------

Co-authored-by: Matteo Bertrone <m.bertrone@gmail.com>
Co-authored-by: Ursula Chen <58821586+urseberry@users.noreply.github.com>
1 parent f32fa4b commit 97ebe8e

1 file changed: +26 -0 lines changed

gpu/README.md

Lines changed: 26 additions & 0 deletions
@@ -108,6 +108,8 @@ For Helm configurations where all the nodes have GPUs, you can set up the Datado
 
 ```yaml
 datadog:
+  enable_nvml_detection: true
+  collect_gpu_tags: true
   gpuMonitoring:
     enabled: true
 ```
@@ -142,6 +144,8 @@ Additionally, if you need to select nodes based on the presence of a label key,
 # GPU-specific values-gpu.yaml (for GPU nodes)
 datadog:
   kubeStateMetricsEnabled: false # Disabled as we're joining an existing Cluster Agent
+  enable_nvml_detection: true
+  collect_gpu_tags: true
   gpuMonitoring:
     enabled: true
 
@@ -174,6 +178,8 @@ helm install -f values.yaml -f values-gpu.yaml datadog-gpu datadog
 
 #### Datadog Operator
 
+_**Minimum required operator version: 1.14**_
+
 To enable the GPU feature in clusters where all the nodes have GPUs, set the `features.gpu.enabled` parameter in the DatadogAgent manifest:
 
 ```yaml
@@ -185,6 +191,18 @@ spec:
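   features:
     gpu:
       enabled: true
+  # for operator versions 1.14.x and 1.15.x add this section
+  override:
+    nodeAgent:
+      containers:
+        agent:
+          env:
+            # add this env var, if using operator version 1.14.x
+            - name: DD_ENABLE_NVML_DETECTION
+              value: "true"
+            # add this env var, if using operator versions 1.14.x or 1.15.x
+            - name: DD_COLLECT_GPU_TAGS
+              value: "true"
 ```
 
 For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only: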
For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only:
@@ -210,6 +228,14 @@ spec:
             env:
               - name: DD_GPU_MONITORING_ENABLED
                 value: "true"
+          # add this env var, if using operator version 1.14.x
+          agent:
+            env:
+              - name: DD_ENABLE_NVML_DETECTION
+                value: "true"
+              # add this env var, if using operator versions 1.14.x or 1.15.x
+              - name: DD_COLLECT_GPU_TAGS
+                value: "true"
 ```
 
 <!-- xxz tab xxx -->
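
Reading aid (not part of the commit): the comments in the two operator hunks attach `DD_ENABLE_NVML_DETECTION` to operator 1.14.x only, and `DD_COLLECT_GPU_TAGS` to 1.14.x and 1.15.x. Taken at face value, a DatadogAgent manifest for operator 1.15.x would carry just the second variable, as in the minimal sketch below. The `metadata.name` value is a placeholder, and the `apiVersion` and the rest of the surrounding manifest are assumptions, since the diff shows only the `spec` fragment.

```yaml
# Sketch for operator 1.15.x only, derived from the comments in the diff above.
apiVersion: datadoghq.com/v2alpha1   # assumed; not shown in the diff
kind: DatadogAgent
metadata:
  name: datadog                      # hypothetical placeholder name
spec:
  features:
    gpu:
      enabled: true
  override:
    nodeAgent:
      containers:
        agent:
          env:
            # Per the diff comments, needed on 1.14.x and 1.15.x
            - name: DD_COLLECT_GPU_TAGS
              value: "true"
            # DD_ENABLE_NVML_DETECTION is omitted: the diff lists it for 1.14.x only
```

The diff comments also suggest that operator versions after 1.15.x need no `override` section for these flags, but the commit does not state this explicitly.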
