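I am preparing to deploy Tongyi Qianwen (Qwen) in a container on Ubuntu 22.04, and installed nvidia-container-toolkit by following the "Installing the NVIDIA Container Toolkit" guide.
After the installation, I configured it with the following commands: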
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then I ran the following command to test it:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
but it failed with the following error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
How can I solve this problem?
I changed the Alibaba Cloud ECS instance type to a lightweight GPU instance, vgn6i-vws / ecs.vgn6i-m4-vws.xlarge (4 vCPU, 23 GiB), but the problem persisted.
Running the command
ldconfig -p | grep -E 'nvidia|cuda'
on the host showed that libnvidia-ml.so.1 was not installed there either, so I installed it with the following command:
apt install libnvidia-ml-dev
After the installation, the "load library failed: libnvidia-ml.so.1" error disappeared, but a new error appeared:
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
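The two errors above point at different layers on the host: first the NVML library was missing, then the library was present but the kernel driver was not loaded. A minimal sketch of a host-side check for both, plus a final nvidia-smi call, might look like the following. The function name check_nvidia_host is hypothetical, and using lsmod to detect the kernel module is an assumption about the host, not something taken from the thread.

```shell
#!/usr/bin/env bash
# Sketch only: check_nvidia_host is a hypothetical helper, not part of any NVIDIA tool.
# It checks, on the host, what nvidia-container-cli complained about in the errors above.

check_nvidia_host() {
    local status=0

    # 1. Is libnvidia-ml.so.1 known to the dynamic linker? (first error in the thread)
    if ldconfig -p 2>/dev/null | grep -q 'libnvidia-ml\.so\.1'; then
        echo "OK: libnvidia-ml.so.1 found"
    else
        echo "MISSING: libnvidia-ml.so.1"
        status=1
    fi

    # 2. Is the nvidia kernel module loaded? (second error: "driver not loaded")
    #    Assumption: lsmod is available and lists the module as "nvidia".
    if lsmod 2>/dev/null | grep -q '^nvidia '; then
        echo "OK: nvidia kernel module loaded"
    else
        echo "MISSING: nvidia kernel module"
        status=1
    fi

    # 3. Can nvidia-smi actually talk to the driver?
    if nvidia-smi >/dev/null 2>&1; then
        echo "OK: nvidia-smi works"
    else
        echo "FAIL: nvidia-smi missing or cannot reach the driver"
        status=1
    fi

    return "$status"
}
```

Calling check_nvidia_host on the host prints one line per check and returns non-zero if any of them fails. If the second or third check fails, the container test cannot succeed no matter how the container toolkit is configured.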
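Finally solved! The cause was that the NVIDIA driver had not been installed successfully. The Alibaba Cloud lightweight GPU instance vgn6i-vws requires the GRID driver.
The correct installation commands are the following ones provided by Alibaba Cloud (from the help documentation on the Alibaba Cloud website):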
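# If the GRID driver install plugin is already present locally, remove it first
if acs-plugin-manager --list --local | grep grid_driver_install > /dev/null 2>&1
then
    acs-plugin-manager --remove --plugin grid_driver_install
fi
# Run the GRID driver install plugin
acs-plugin-manager --exec --plugin grid_driver_install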
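Once the driver install has finished, it may be worth confirming that the host can see the GPU before retrying the container test. The sketch below does that in two steps; the function name verify_gpu_container is hypothetical, and it assumes docker can be run without sudo (otherwise prefix the docker command with sudo, as in the question).

```shell
#!/usr/bin/env bash
# Sketch only: verify_gpu_container is a hypothetical helper.
# Step 1 checks the host driver, step 2 repeats the container test from the question.

verify_gpu_container() {
    # Host first: if nvidia-smi fails here, the container test cannot succeed either.
    if ! nvidia-smi >/dev/null 2>&1; then
        echo "host: nvidia-smi failed, driver still not usable" >&2
        return 1
    fi
    echo "host: nvidia-smi OK"

    # Same test command as in the question.
    # Assumption: the current user may run docker; otherwise use sudo.
    if docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi; then
        echo "container: GPU visible"
    else
        echo "container: test failed" >&2
        return 2
    fi
}
```

A return code of 1 points back at the host driver, while 2 means the host is fine and the remaining problem is on the Docker or container toolkit side.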
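Blog post recording the installation process: "Installing the NVIDIA Driver on an Alibaba Cloud Lightweight GPU Instance" (阿里云轻量级 GPU 实例安装 NVIDIA 驱动)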
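The server is an Alibaba Cloud ECS instance: general-purpose compute type u1 (ecs.g7.2xlarge), 8 cores, 32 GB
– dudu, 10 months ago
libnvidia-ml.so.1 is an NVIDIA user-space driver library; for details see https://www.cnblogs.com/wuchangsoft/p/9767170.html
– dudu, 10 months ago
"Installing the NVIDIA Driver and CUDA on an Alibaba Cloud GPU Compute ECS Instance" (阿里云GPU计算型ECS实例安装NVIDIA驱动和CUDA)
– dudu, 10 months ago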