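Installing the PyTorch NGC Container environment on Alibaba Cloud Ubuntu 22.04 (A10)
Hardware configuration
An ecs.gn7i-c16g1.4xlarge instance based on the NVIDIA A10
The operating system is Ubuntu 22.04, 64-bit
Prerequisites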
- NVIDIA GPU 驱动
- Docker Engine
- NVIDIA Container Toolkit
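Alibaba Cloud can install the GPU driver automatically when you purchase the instance; I chose CUDA 12.0.1. After logging in over SSH, the automatic installation output appears: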
CHECKING AUTO INSTALL, DRIVER_VERSION=525.105.17 CUDA_VERSION=12.0.1 CUDNN_VERSION=8.9.1.23 INSTALL AIACC-Training=FALSE INSTALL AIACC-Inference=FALSE , INSTALL RDMA=FALSE, INSTALL eRDMA=FALSE PLEASE WAIT ......
The script automatically downloads and installs a NVIDIA GPU driver and CUDA, CUDNN library. if you choose install RDMA or ERDMA, RDMA or ERDMA software will install.
if you choose install perseus, perseus environment will install as well.
1. The installation takes 15 to 20 minutes, depending on the intranet bandwidth and the quantity of vCPU cores of the instance. Please do not operate the GPU or install any GPU-related software until the GPU driver is installed successfully.
2. After the GPU is installed successfully, the instance will restarts automatically.
CUDA-12.0.1 installing, it tasks 2 to 5 minutes. Remaining installation time 9 to 12 minutes!
| #################################################################################################### | 100%
cuDNN-8.9.1.23 installing, it takes about 10 seconds. Remaining installation time 6 to 9 minutes!
| #################################################################################################### | 100%
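When the installation finishes, the machine reboots automatically; the whole process takes a little over ten minutes. Note the driver version shown here, 525.105.17: it is needed later when choosing the PyTorch image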
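Because the driver version decides which PyTorch image can be used later, it helps to read it back in a scriptable form after the reboot. A minimal sketch: `driver_major` and `MIN_MAJOR` are hypothetical names of mine, and 525 is simply the branch installed in this walkthrough.

```shell
#!/usr/bin/env bash
MIN_MAJOR=525   # assumption: the driver branch installed in this walkthrough

# driver_major VERSION: print everything before the first dot (525.105.17 -> 525).
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Query the installed driver only when nvidia-smi is present.
if command -v nvidia-smi >/dev/null 2>&1; then
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if [ -n "$version" ] && [ "$(driver_major "$version")" -ge "$MIN_MAJOR" ]; then
        echo "driver $version is on branch $MIN_MAJOR or newer"
    else
        echo "driver '$version' is older than branch $MIN_MAJOR or could not be read"
    fi
fi
```

On a machine without the driver the script prints nothing, so it is safe to run before the automatic installation has finished.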
Install Docker Engine
Use the official one-step install script
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
Verify that the installation succeeded
sudo docker run hello-world
This downloads the hello-world image and runs it; seeing Hello from Docker! means the installation succeeded.
Install the NVIDIA Container Toolkit
Add the Toolkit repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
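The one-liner above is dense, so here is what its two key pieces do. `$distribution` is the `ID` and `VERSION_ID` fields of /etc/os-release joined together (ubuntu22.04 on this machine), and the sed step adds a `signed-by` option to every `deb` line so that apt verifies the repository with the key saved by the gpg step. A small sketch that runs the same sed expression on a sample line; the sample repository line is illustrative and not copied from the real list file.

```shell
#!/usr/bin/env bash
# Keyring path used by the repository setup command.
keyring=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# Illustrative input line (an assumption; the real list file may differ).
sample='deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /'

# Same substitution as in the setup command: insert [signed-by=...] after "deb".
rewritten=$(printf '%s\n' "$sample" | sed "s#deb https://#deb [signed-by=${keyring}] https://#g")
printf '%s\n' "$rewritten"

# How $distribution is built: source os-release in a subshell, then join ID and VERSION_ID.
if [ -r /etc/os-release ]; then
    distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
    echo "distribution=$distribution"
fi
```

After the real command has run, `cat /etc/apt/sources.list.d/nvidia-container-toolkit.list` should show lines carrying the same `signed-by` option.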
Install the package
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
Configure Docker to recognize the NVIDIA Container Runtime
sudo nvidia-ctk runtime configure --runtime=docker
Restart Docker
sudo systemctl restart docker
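As far as I know, `nvidia-ctk runtime configure --runtime=docker` works by registering an `nvidia` runtime in /etc/docker/daemon.json, which is why Docker has to be restarted afterwards. A rough sketch for checking that entry before moving on; `has_nvidia_runtime` is a hypothetical helper name, and the grep is a plain text match rather than a real JSON parse.

```shell
#!/usr/bin/env bash
# has_nvidia_runtime FILE: succeed if FILE mentions the nvidia-container-runtime binary.
# Hypothetical helper; a text match, not a JSON parser.
has_nvidia_runtime() {
    local file=${1:-/etc/docker/daemon.json}
    [ -r "$file" ] && grep -q 'nvidia-container-runtime' "$file"
}

if has_nvidia_runtime; then
    echo "nvidia runtime is registered in /etc/docker/daemon.json"
else
    echo "nvidia runtime not found in /etc/docker/daemon.json"
fi
```

`sudo docker info | grep -i runtimes` gives a second opinion, since it lists the runtimes the running daemon actually loaded.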
Verify the configuration
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
The image used here matches our system (CUDA 12.0.1 on Ubuntu 22.04), and the command prints the following
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:00:07.0 Off |                    0 |
|  0%   30C    P8    12W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
This matches the output of running nvidia-smi directly on the host, which confirms that the configuration works
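The comparison can also be scripted instead of read by eye. A sketch, assuming the banner format shown above: `driver_version_from_banner` (a hypothetical helper) pulls the Driver Version field out of nvidia-smi output on stdin, so the host and container values can be compared as plain strings.

```shell
#!/usr/bin/env bash
# driver_version_from_banner: read nvidia-smi output on stdin, print the driver version.
# Hypothetical helper; relies on the "Driver Version: X.Y.Z" text in the banner.
driver_version_from_banner() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Only attempt the real comparison where both tools are installed.
if command -v nvidia-smi >/dev/null 2>&1 && command -v docker >/dev/null 2>&1; then
    host=$(nvidia-smi | driver_version_from_banner)
    container=$(sudo docker run --rm --runtime=nvidia --gpus all \
        nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi | driver_version_from_banner)
    if [ "$host" = "$container" ]; then
        echo "driver versions match: $host"
    else
        echo "mismatch: host=$host container=$container"
    fi
fi
```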
Run PyTorch with Docker
sudo docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.02-py3
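The tag follows the official release notes: 23.02 is the newest release that supports the 525 driver. Pulling the image takes a while; on success you land inside the PyTorch container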
=============
== PyTorch ==
=============
NVIDIA Release 23.02 (build 53420872)
PyTorch Version 1.14.0a0+44dac51
...
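Once inside the container, it is worth confirming that PyTorch itself can reach the GPU. A minimal sketch to run at the container prompt; `check_torch_cuda` is a hypothetical wrapper of mine, while `torch.cuda.is_available()` and `torch.cuda.get_device_name()` are standard PyTorch calls.

```shell
#!/usr/bin/env bash
# check_torch_cuda: print the PyTorch version, CUDA availability and,
# when a GPU is visible, the name of device 0.
check_torch_cuda() {
    python3 -c '
import torch
print("torch", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
'
}

# Run the check only where PyTorch is importable (i.e. inside the container).
if python3 -c 'import torch' >/dev/null 2>&1; then
    check_torch_cuda
fi
```

If CUDA is reported as unavailable here even though nvidia-smi worked earlier, the first thing I would check is whether the container was started with `--gpus all`.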