|
1 |
| -# AMD GPU device plugin for Kubernetes |
2 |
| -[](https://goreportcard.com/report/github.com/RadeonOpenCompute/k8s-device-plugin) |
| 1 | +# DCU vGPU device plugin for HAMi |
3 | 2 |
|
4 | 3 | ## Introduction
|
5 |
| -This is a [Kubernetes][k8s] [device plugin][dp] implementation that enables the registration of AMD GPU in a container cluster for compute workload. With the approrpriate hardware and this plugin deployed in your Kubernetes cluster, you will be able to run jobs that require AMD GPU. |
6 |
| - |
7 |
| -More information about [RadeonOpenCompute (ROCm)][rocm] |
| 4 | +This is a [Kubernetes][k8s] [device plugin][dp] implementation that enables the registration of Hygon DCUs in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you will be able to run jobs that require a Hygon DCU. It supports DCU virtualization by using hy-virtual, provided by dtk. |
8 | 5 |
|
9 | 6 |
|
10 | 7 | ## Prerequisites
|
11 |
| -* [ROCm capable machines][sysreq] |
12 |
| -* [kubeadm capable machines][kubeadm] (if you are using kubeadm to deploy your k8s cluster) |
13 |
| -* [ROCm kernel][rock] ([Installation guide][rocminstall]) or latest AMD GPU Linux driver ([Installation guide][amdgpuinstall]) |
14 |
| -* A [Kubernetes deployment][k8sinstall] |
15 |
| -* `--allow-privileged=true` for both kube-apiserver and kubelet (only needed if the device plugin is deployed via DaemonSet since the device plugin container requires privileged security context to access `/dev/kfd` for device health check) |
| 8 | +* dtk >= 24.04 |
| 9 | +* hy-smi == v1.6.0 |
16 | 10 |
|
17 | 11 |
|
18 | 12 | ## Limitations
|
19 | 13 | * This plugin targets Kubernetes v1.18+.
|
20 | 14 |
|
21 | 15 | ## Deployment
|
22 |
| -The device plugin needs to be run on all the nodes that are equipped with AMD GPU. The simplist way of doing so is to create a Kubernetes [DaemonSet][ds], which run a copy of a pod on all (or some) Nodes in the cluster. We have a pre-built Docker image on [DockerHub][dhk8samdgpudp] that you can use for with your DaemonSet. This repository also have a pre-defined yaml file named `k8s-ds-amdgpu-dp.yaml`. You can create a DaemonSet in your Kubernetes cluster by running this command: |
23 |
| -``` |
24 |
| -$ kubectl create -f k8s-ds-amdgpu-dp.yaml |
25 | 16 | ```
|
26 |
| -or directly pull from the web using |
| 17 | +$ kubectl apply -f k8s-dcu-rbac.yaml |
| 18 | +$ kubectl apply -f k8s-dcu-plugin.yaml |
27 | 19 | ```
|
28 |
| -kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml |
29 |
| -``` |
30 |
| - |
31 |
| -If you want to enable the experimental device health check, please use `k8s-ds-amdgpu-dp-health.yaml` **after** `--allow-privileged=true` is set for kube-apiserver and kublet. |
32 | 20 |
|
33 |
| -## Example workload |
34 |
| -You can restrict work to a node with GPU by adding `resources.limits` to the pod definition. An example pod definition is provided in `example/pod/alexnet-gpu.yaml`. This pod runs the timing benchmark for AlexNet on AMD GPU and then go to sleep. You can create the pod by running: |
| 21 | +## Build |
35 | 22 | ```
|
36 |
| -$ kubectl create -f alexnet-gpu.yaml |
37 |
| -``` |
38 |
| - |
39 |
| -or |
40 |
| - |
41 |
| -``` |
42 |
| -$ kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/example/pod/alexnet-gpu.yaml |
43 |
| -``` |
44 |
| - |
45 |
| -and then check the pod status by running |
46 |
| -``` |
47 |
| -$ kubectl describe pods |
48 |
| -``` |
49 |
| - |
50 |
| -After the pod is created and running, you can see the benchmark result by running: |
| 23 | +docker build . |
51 | 24 | ```
|
52 |
| -$ kubectl logs alexnet-tf-gpu-pod alexnet-tf-gpu-container |
53 |
| -``` |
54 |
| - |
55 |
| -For comparison, an example pod definition of running the same benchmark with CPU is provided in `example/pod/alexnet-cpu.yaml`. |
56 |
| - |
57 |
| -## Labelling node with additional GPU properties |
58 |
| - |
59 |
| -Please see [AMD GPU Kubernetes Node Labeller](cmd/k8s-node-labeller/README.md) for details. An example configuration is in [k8s-ds-amdgpu-labeller.yaml](k8s-ds-amdgpu-labeller.yaml): |
60 |
| -``` |
61 |
| -$ kubectl create -f k8s-ds-amdgpu-labeller.yaml |
62 |
| -``` |
63 |
| - |
64 |
| -or |
65 |
| - |
66 |
| -``` |
67 |
| -$ kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml |
68 |
| -``` |
69 |
| - |
70 |
| - |
71 |
| -## Notes |
72 |
| -* This plugin uses [`go modules`][gm] for dependencies management |
73 |
| -* Please consult the `Dockerfile` on how to build and use this plugin independent of a docker image |
74 | 25 |
|
75 |
| -## TODOs |
76 |
| -* Add proper GPU health check (health check without `/dev/kfd` access.) |
| 26 | +## Maintainer |
77 | 27 |
|
78 |
| -[ds]: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/ |
79 |
| -[dp]: https://kubernetes.io/docs/concepts/cluster-administration/device-plugins/ |
80 |
| -[rocm]: https://rocm.github.io/ |
81 |
| -[rock]: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver |
82 |
| -[rocminstall]: http://rocm-documentation.readthedocs.io/en/latest/Installation_Guide/ROCk-kernel.html#rock-kernel |
83 |
| -[amdgpuinstall]: https://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Install.aspx |
84 |
| -[sysreq]: http://rocm-documentation.readthedocs.io/en/latest/Installation_Guide/Installation-Guide.html#system-requirement |
85 |
| -[gm]: https://blog.golang.org/using-go-modules |
86 |
| -[kubeadm]: https://kubernetes.io/docs/setup/independent/install-kubeadm/#before-you-begin |
87 |
| -[k8sinstall]: https://kubernetes.io/docs/setup/independent/install-kubeadm |
88 |
| -[k8s]: https://kubernetes.io |
89 |
| -[dhk8samdgpudp]: https://hub.docker.com/r/rocm/k8s-device-plugin/ |
| 28 | +limengxuan@4paradigm.com |