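Installing NVIDIA GPU drivers on Kubernetes

k8s-device-plugin is the plugin that lets Kubernetes use GPUs.

Once installed, NVIDIA's k8s plugin runs as a DaemonSet on every node. While running, it reports the number of GPUs on each node to the control plane (master), which makes it possible to run pods that require GPUs.

The prerequisites for installing k8s-device-plugin are:

  • NVIDIA driver version ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0
  • nvidia-container-runtime configured as the default runtime
  • Kubernetes version >= 1.10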
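
The version requirements above can be checked mechanically. The following is a minimal sketch, assuming GNU `sort -V` is available and treating each requirement as a simple minimum (the `~=` constraint on the driver may be stricter than that). The helper name `version_ge` is hypothetical and not part of any NVIDIA tool.

```shell
#!/bin/sh
# Sketch only: version_ge is a hypothetical helper, not part of any NVIDIA tool.
# It relies on `sort -V` (version sort), which GNU coreutils provides.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    # The smaller of the two versions sorts first; if that is the required
    # version, the installed one is at least as new.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with literal values; on a real node the first argument could come
# from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.
if version_ge "470.57.02" "384.81"; then
    echo "driver version OK"
else
    echo "driver too old"
fi
```

The same helper could be reused for the Kubernetes and nvidia-container-toolkit minimums, for example `version_ge "1.28.2" "1.10"`.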

Install the driver

Install the NVIDIA driver on every node.

Install nvidia-container-toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | tee /etc/apt/sources.list.d/libnvidia-container.list

apt-get update && apt-get install -y nvidia-container-toolkit

Configuring Kubernetes with the Docker runtime
Change Docker's default runtime to nvidia-container-runtime: create the file /etc/docker/daemon.json and add the following content:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then restart Docker:

systemctl daemon-reload
systemctl restart docker
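
After the restart it can be worth confirming that Docker actually picked up the new default runtime. The sketch below assumes the `docker` CLI is on the PATH and reads the daemon's `DefaultRuntime` field; the helper names `docker_default_runtime` and `check_docker_runtime` are hypothetical.

```shell
#!/bin/sh
# Sketch only: docker_default_runtime and check_docker_runtime are
# hypothetical helper names.

# Print the default runtime the Docker daemon reports.
docker_default_runtime() {
    docker info --format '{{.DefaultRuntime}}'
}

# Succeed only when that runtime is "nvidia", the name used in daemon.json.
check_docker_runtime() {
    runtime=$(docker_default_runtime) || return 1
    if [ "$runtime" = "nvidia" ]; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is '$runtime', expected nvidia" >&2
        return 1
    fi
}

# Usage on a node, after restarting Docker:
#   check_docker_runtime
```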

Configuring Kubernetes with the containerd runtime
Change containerd's default runtime to nvidia-container-runtime: edit /etc/containerd/config.toml and add the following content:

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
...

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"

[Figure: screenshot of the reference configuration]

Then restart containerd:

systemctl restart containerd
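
A similar check is possible for containerd. This is a rough sketch, assuming the `containerd` binary is on the PATH and that `containerd config dump` prints a `default_runtime_name = "..."` line for the CRI plugin; the helper name `containerd_default_runtime` is hypothetical.

```shell
#!/bin/sh
# Sketch only: containerd_default_runtime is a hypothetical helper name.

# Print the value of default_runtime_name from containerd's effective config.
containerd_default_runtime() {
    containerd config dump |
        sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' |
        head -n 1
}

# Usage on a node, after restarting containerd:
#   [ "$(containerd_default_runtime)" = "nvidia" ] && echo "default runtime is nvidia"
```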

Install k8s-device-plugin

There are two ways to install k8s-device-plugin:

Install from a YAML file
Run the following command to deploy nvidia-device-plugin as a DaemonSet:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

Deployed this way, nvidia-device-plugin is just a simple static DaemonSet, and only its most basic features are available.

Install with helm

nvidia-device-plugin exposes several configurable parameters:

Flag                    Envvar                  Default Value
--mig-strategy          $MIG_STRATEGY           "none"
--fail-on-init-error    $FAIL_ON_INIT_ERROR     true
--nvidia-driver-root    $NVIDIA_DRIVER_ROOT     "/"
--pass-device-specs     $PASS_DEVICE_SPECS      false
--device-list-strategy  $DEVICE_LIST_STRATEGY   "envvar"
--device-id-strategy    $DEVICE_ID_STRATEGY     "uuid"
--config-file           $CONFIG_FILE            ""

These parameters can be set when deploying nvidia-device-plugin with helm, using a configuration file like the one below:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"

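Sharing GPUs with time-slicing

nvidia-device-plugin allows GPUs to be oversubscribed so that multiple processes share the same GPU. However, nothing isolates those processes from one another in terms of compute or GPU memory, and they all run in the same fault domain. This means that if one workload crashes, all of them crash.

The relevant section of the configuration file looks like this:

version: v1
sharing:
  timeSlicing:
    renameByDefault: <bool>
    failRequestsGreaterThanOne: <bool>
    resources:
    - name: <resource-name>
      replicas: <num-replicas>
    ...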

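When replicas is set for an entry under resources, each GPU is advertised as that many replicas, so multiple processes can share the same GPU.

If renameByDefault is set to true, each resource is advertised as <resource-name>.shared rather than simply <resource-name>.

If failRequestsGreaterThanOne is set to true, the plugin refuses to allocate resources to any container that requests more than one GPU. The container's pod fails with UnexpectedAdmissionError and has to be manually deleted, updated, and redeployed.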

An example configuration:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 15

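Once this configuration is applied, a node with 1 GPU advertises 15 nvidia.com/gpu resources. The plugin simply creates 15 references to each GPU and hands them out to containers indiscriminately.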
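
The effect of replicas and renameByDefault on what a node advertises comes down to a little arithmetic and a name suffix. The following is a small illustrative sketch of that rule as described above; `advertised_resource` is a hypothetical helper and not something the plugin provides.

```shell
#!/bin/sh
# Sketch only: advertised_resource is a hypothetical helper, not a plugin API.

# Print the resource name and count a node would advertise.
#   $1 resource name, $2 physical GPUs, $3 replicas, $4 renameByDefault (true/false)
advertised_resource() {
    name=$1
    # With renameByDefault=true the resource name gains a ".shared" suffix.
    if [ "$4" = "true" ]; then
        name="$name.shared"
    fi
    # Every physical GPU is exposed as <replicas> schedulable resources.
    echo "$name $(( $2 * $3 ))"
}

advertised_resource nvidia.com/gpu 1 15 false   # prints "nvidia.com/gpu 15"
advertised_resource nvidia.com/gpu 4 15 true    # prints "nvidia.com/gpu.shared 60"
```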

Deploy the plugin

First, install helm:

curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | tee /usr/share/keyrings/helm.gpg > /dev/null
apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | tee /etc/apt/sources.list.d/helm-stable-debian.list
apt-get update
apt-get install helm

Add the plugin's helm repository:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update

Check that the plugin chart is available:

$ helm search repo nvdp --devel
NAME                       CHART VERSION  APP VERSION  DESCRIPTION
nvdp/nvidia-device-plugin  0.15.0         0.15.0       A Helm chart for ...

Basic deployment of the plugin:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.0

Configuring time-slicing

1. Deploy with a configuration file

Create the YAML configuration file:

cat << EOF > dp-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 15
EOF

Deploy the plugin using the YAML configuration file:

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.0 \
  --set-file config.map.config=dp-config.yaml

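2. Deploy with a pre-created ConfigMap

If you would rather not have Helm create the ConfigMap for you, you can point it at a pre-created ConfigMap instead, as follows.

Create the namespace:

$ kubectl create ns nvidia-device-plugin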

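Create the ConfigMap from a configuration file (here /tmp/dp-example-config0.yaml, in the same format as dp-config.yaml above):

$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=/tmp/dp-example-config0.yaml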

Deploy the plugin using the ConfigMap:

$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.name=nvidia-plugin-configs

After installation, a node with 1 GPU reports 15 nvidia.com/gpu resources.
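
One way to see that number is to read the node's allocatable resources with kubectl. The sketch below assumes `kubectl` is configured for the cluster; `gpu_allocatable` is a hypothetical helper name, and `<node-name>` in the usage line is a placeholder for a real node.

```shell
#!/bin/sh
# Sketch only: gpu_allocatable is a hypothetical helper name.

# Print how many nvidia.com/gpu resources a node reports as allocatable.
#   $1 node name
gpu_allocatable() {
    # The dots in the resource name are escaped so that JSONPath treats
    # "nvidia.com/gpu" as a single key.
    kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Usage, with <node-name> replaced by a real node:
#   gpu_allocatable <node-name>
```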

Running a GPU workload
Once the nvidia-device-plugin DaemonSet is deployed, Kubernetes can schedule pods that use GPUs:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

Check the pod's logs:

$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
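
For scripted checks, the same log output can be tested for the "Test PASSED" line. This is only a sketch; `logs_passed` is a hypothetical helper name, and it assumes the sample prints that line exactly as shown above.

```shell
#!/bin/sh
# Sketch only: logs_passed is a hypothetical helper name.

# Read log text on stdin; succeed if it contains the line "Test PASSED".
logs_passed() {
    grep -qx 'Test PASSED'
}

# Usage against the gpu-pod pod:
#   kubectl logs gpu-pod | logs_passed && echo "GPU sample passed"
```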

At this point, the GPU is ready to use.