본문 바로가기

DKE/Kubernetes

[Kubernetes] NVIDIA device plugin 설치하기 / 2023.06.15

Kubernetes에서 GPU를 사용하기 위해 NVIDIA device plugin을 설치했다

아래 github 주소를 참고!!!

https://github.com/NVIDIA/k8s-device-plugin#quick-start

 

GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes

NVIDIA device plugin for Kubernetes. Contribute to NVIDIA/k8s-device-plugin development by creating an account on GitHub.

github.com

Step 1. nvidia-container-toolkit 설치하기

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

Step 2. Docker 구성하기 

sudo vim /etc/docker/daemon.json

# 아래 코드 복사
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

 

Docker 재시작

sudo systemctl restart docker

Step 3. nvidia-container-runtime을 low-level로 설정하기

sudo vim /etc/containerd/config.toml

#아래 코드 복사
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

containerd 재시작

sudo systemctl restart containerd

Step 4. nvidia-device-plugin Daemonset 배포하기

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Step 5. GPU를 사용하는 pod 생성하기

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

로그 확인 시, 아래와 같이 나오면 성공!!

kubectl logs gpu-pod