Hardware configuration

An ecs.gn7i-c16g1.4xlarge instance, based on the NVIDIA A10

The operating system is Ubuntu 22.04 (64-bit).

Prerequisites

  • NVIDIA GPU driver
  • Docker Engine
  • NVIDIA Container Toolkit

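When you purchase the instance on Alibaba Cloud, you can choose to have the GPU driver installed automatically. I selected CUDA 12.0.1, and after logging in over SSH I saw the automatic installation output: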

CHECKING AUTO INSTALL, DRIVER_VERSION=525.105.17 CUDA_VERSION=12.0.1 CUDNN_VERSION=8.9.1.23 INSTALL AIACC-Training=FALSE INSTALL AIACC-Inference=FALSE , INSTALL RDMA=FALSE, INSTALL eRDMA=FALSE PLEASE WAIT ......
The script automatically downloads and installs a NVIDIA GPU driver and CUDA, CUDNN library. if you choose install RDMA or ERDMA, RDMA or ERDMA software will install.
if you choose install perseus, perseus environment will install as well.
1. The installation takes 15 to 20 minutes, depending on the intranet bandwidth and the quantity of vCPU cores of the instance. Please do not operate the GPU or install any GPU-related software until the GPU driver is installed successfully.
2. After the GPU is installed successfully, the instance will restarts automatically.

CUDA-12.0.1 installing, it tasks 2 to 5 minutes. Remaining installation time 9 to 12 minutes!
| #################################################################################################### | 100% 

cuDNN-8.9.1.23 installing, it takes about 10 seconds. Remaining installation time 6 to 9 minutes!
| #################################################################################################### | 100%

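When it finishes, the machine reboots automatically. The whole process takes about ten minutes. Note the driver version, 525.105.17; it is needed later when choosing the PyTorch image.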
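After the reboot, the driver version can also be read back from the driver itself instead of from the install log. The following is a minimal sketch: `driver_branch` is a hypothetical helper name introduced here, and the script only prints a message when `nvidia-smi` is not on the PATH yet.

```shell
# Hypothetical helper: reduce a driver version such as 525.105.17
# to its branch number (the part before the first dot).
driver_branch() {
  echo "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Query only the driver version; head keeps the first GPU's line.
  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  echo "driver $version (branch $(driver_branch "$version"))"
else
  echo "nvidia-smi not found; the driver may still be installing"
fi
```

On the machine described here, the branch would be 525.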

Installing Docker Engine

Use the official convenience script:

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

Verify the installation:

sudo docker run hello-world

This downloads the hello-world image and runs it. If you see Hello from Docker!, the installation succeeded.

Installing the NVIDIA Container Toolkit

Add the Toolkit repository:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
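The command above does two things that are easy to miss: it builds the distribution string from /etc/os-release, and it rewrites each `deb` line so apt verifies the repository with the downloaded key. Here is a small sketch of both steps in isolation. The function names `distribution_from` and `add_signed_by` and the sample repository line are made up for illustration; the sed expression is the one from the command above.

```shell
# Build the distribution string the same way the command above does,
# but from any os-release style file passed as $1.
distribution_from() {
  ( . "$1"; echo "$ID$VERSION_ID" )
}

# Same sed expression as above: tag every "deb https://" entry with the keyring.
add_signed_by() {
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
}

# On a real system this reads /etc/os-release (skipped if the file is absent).
if [ -r /etc/os-release ]; then
  echo "distribution: $(distribution_from /etc/os-release)"
fi

# Illustrative input line only; the real list file's content may differ.
echo 'deb https://example.invalid/repo /' | add_signed_by
```

On Ubuntu 22.04 the distribution string comes out as ubuntu22.04, because os-release sets ID=ubuntu and VERSION_ID="22.04".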

Install:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Configure Docker to recognize the NVIDIA Container Runtime:

sudo nvidia-ctk runtime configure --runtime=docker

Restart Docker:

sudo systemctl restart docker
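Before running a GPU container, it can help to confirm that the configure step actually registered the runtime. The sketch below assumes the Docker daemon config lives at the default /etc/docker/daemon.json, which is the file `nvidia-ctk runtime configure --runtime=docker` edits by default; `has_nvidia_runtime` is a hypothetical helper, and the check is a plain text match rather than a real JSON parse.

```shell
# Hypothetical helper: succeed if the given daemon config mentions
# a runtime named "nvidia". $1 is the path to the config file.
has_nvidia_runtime() {
  [ -r "$1" ] && grep -q '"nvidia"' "$1"
}

config=/etc/docker/daemon.json   # assumed default location
if has_nvidia_runtime "$config"; then
  echo "nvidia runtime is registered in $config"
else
  echo "nvidia runtime not found in $config"
fi
```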

Verify the configuration:

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi

The image used here matches our setup (CUDA 12.0.1 on Ubuntu 22.04). The result looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:00:07.0 Off |                    0 |
|  0%   30C    P8    12W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This matches the output of running nvidia-smi directly on the host, which shows that the configuration works.

Running PyTorch with Docker

sudo docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.02-py3

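According to the official release notes, 23.02 is the latest release that supports the 525 driver. Downloading the image takes a while; if it succeeds, you land inside the PyTorch container: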
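The choice of tag can be written down as a small lookup from driver branch to image, so the reasoning is explicit. This is only a sketch: `image_for_branch` is a hypothetical helper, and the table holds just the one pairing stated above (branch 525 with 23.02). Any other branch would need its entry filled in from the release notes.

```shell
# Hypothetical lookup: driver branch -> PyTorch NGC image.
# Only the 525 -> 23.02 pairing comes from this article; other
# branches are deliberately left out rather than guessed.
image_for_branch() {
  case "$1" in
    525) echo "nvcr.io/nvidia/pytorch:23.02-py3" ;;
    *)   return 1 ;;
  esac
}

branch=525   # from driver 525.105.17 on this machine
if image=$(image_for_branch "$branch"); then
  echo "image to use: $image"
else
  echo "no known image for driver branch $branch; check the release notes" >&2
fi
```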

=============
== PyTorch ==
=============

NVIDIA Release 23.02 (build 53420872)
PyTorch Version 1.14.0a0+44dac51

...
