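I ran into a problem while deploying ollama + deepseek-r1 on a k8s cluster. The k8s nodes use the containerd container runtime, and during deployment the pod fails to start with the following error: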
FailedScheduling 1 Insufficient nvidia.com/gpu
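Install the NVIDIA Container Toolkit the same way as for docker; the only difference is setting runtime=containerd when running nvidia-ctk runtime configure.
The full installation commands are: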
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
This is also covered in Installing the NVIDIA Container Toolkit in NVIDIA's official documentation; I just didn't find it at first.
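To check whether the scheduler can see any GPUs at all, it may help to list what each node advertises for nvidia.com/gpu. The following is a minimal sketch, not part of the original answer: the function name gpu_allocatable is hypothetical, and it assumes kubectl is configured for the cluster and that the NVIDIA device plugin is already running on the GPU nodes.

```shell
#!/bin/sh
# Hypothetical helper: list every node together with the number of
# nvidia.com/gpu resources it reports as allocatable.
# Assumes kubectl can reach the cluster and the NVIDIA device plugin is deployed.
gpu_allocatable() {
    # The dot in "nvidia.com" is escaped so kubectl treats it as part of the key.
    kubectl get nodes \
        -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Calling gpu_allocatable on a machine with cluster access prints one row per node. A node that does not advertise the resource typically shows <none> in the GPU column, which would be consistent with the Insufficient nvidia.com/gpu error.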
I recommend using the following command instead, so that containers run with the nvidia runtime by default:
nvidia-ctk runtime configure --runtime=containerd --set-as-default=true
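Running this command modifies the default_runtime_name setting in /etc/containerd/config.toml, changing it from runc to nvidia (restart containerd afterwards for the change to take effect):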
default_runtime_name = "nvidia"
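One way to confirm that the setting was actually written is to read it back from the config file. This is only a sketch under stated assumptions: the function name containerd_default_runtime is hypothetical, and it assumes the value is written on a single line in the quoted form shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of default_runtime_name from a
# containerd config file. The path defaults to /etc/containerd/config.toml.
containerd_default_runtime() {
    config_file="${1:-/etc/containerd/config.toml}"
    # Match lines such as:   default_runtime_name = "nvidia"
    # and print only the quoted value of the first match.
    sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' \
        "$config_file" | head -n 1
}
```

Running containerd_default_runtime with no arguments on a node reads the default config path; if the command above took effect, it should print nvidia rather than runc.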
I have now successfully deployed ollama + deepseek-r1:r7 on the k8s cluster; see the blog post https://www.cnblogs.com/dudu/p/18713973 for details.
A Practical Guide to Running NVIDIA GPUs on Kubernetes
– dudu, 1 week ago
https://github.com/NVIDIA/k8s-device-plugin/issues/348
– dudu, 1 week ago