leafee98-blog/content/essays/use-cuda-in-container.md
leafee98 a5ac8c83e5
new essays: use-cuda-in-container
2023-06-16 11:03:40 +08:00

---
title: "Using CUDA in a Container"
date: 2023-06-16T10:18:21+08:00
tags: []
categories: []
weight: 50
show_comments: true
draft: false
---
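> An update to the nvidia package left the nvidia.com device files referenced by the containers' CDI spec out of date. That made me realize how important it is to write the process down, hence this post.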
<!--more-->
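## General installation steps
> This section was written from memory and from the documentation. I did not rerun the steps while writing, so details may differ from what actually happens.

Using CUDA in a container requires a reasonably recent driver and a working container runtime (podman, docker, etc.), plus [nvidia-container-toolkit](https://aur.archlinux.org/packages/nvidia-container-toolkit).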
```
pacman -S podman
paru -S nvidia-container-toolkit
```
Generate the CDI spec file:
```
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
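After that, pull a [cuda image](https://hub.docker.com/r/nvidia/cuda) and check whether CUDA is usable (you will likely need to append an explicit tag from the image's tag list, since the repository's `latest` tag has been deprecated):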
```
podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda nvidia-smi -L
```
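Everything summarized above is covered in [Installation Guide — NVIDIA Cloud Native Technologies documentation][1]; refer to it for the details.
## The problem I ran into
In short, the NVIDIA driver was updated, and containers then failed to start with the following error.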
```
Error: unable to start container "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": crun: error stat'ing file `/usr/lib/libEGL_nvidia.so.530.41.03`: No such file or directory: OCI runtime attempted to invoke a command that was not found
```
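The root cause is that the container was started with `--device nvidia.com/gpu=all` (that is, through CDI). This mechanism depends on the files under `/etc/cdi/`, and some of the files listed in `/etc/cdi/nvidia.yaml` (excerpt below) no longer exist because `nvidia` was upgraded from 530.41.03-17 to 535.54.03-2. The runtime tries to bind-mount a library that is gone, which produces the error above.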
```
- containerPath: /usr/lib/libEGL_nvidia.so.530.41.03
  hostPath: /usr/lib/libEGL_nvidia.so.530.41.03
  options:
  - ro
  - nosuid
  - nodev
  - bind
```
Given that, regenerating the file fixes the problem (the command below also appears in [Installation Guide — NVIDIA Cloud Native Technologies documentation][1]).
```
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
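To confirm that a spec no longer points at missing files (or to diagnose the error before regenerating), the `hostPath` entries can be checked against the host filesystem. The following is a minimal sketch, not part of the toolkit: `find_stale_cdi_paths` is a hypothetical helper name, and it assumes each `hostPath:` value sits unquoted on its own line, as in the excerpt above.

```shell
# Print every hostPath listed in a CDI spec that does not exist on the host.
# Usage: find_stale_cdi_paths [spec-file]   (defaults to /etc/cdi/nvidia.yaml)
# Assumption: hostPath values are unquoted and one per line.
find_stale_cdi_paths() {
    spec_file="${1:-/etc/cdi/nvidia.yaml}"
    # Strip leading indentation/list dashes and the "hostPath:" key,
    # leaving only the path; lines without hostPath are not printed.
    sed -n 's/^[[:space:]-]*hostPath:[[:space:]]*//p' "$spec_file" |
    while IFS= read -r host_path; do
        # Report the path only if nothing exists at that location.
        [ -e "$host_path" ] || printf '%s\n' "$host_path"
    done
}
```

Running `find_stale_cdi_paths` with no arguments checks `/etc/cdi/nvidia.yaml`. Empty output means every referenced path exists; with the stale spec described above, it would be expected to list entries such as `/usr/lib/libEGL_nvidia.so.530.41.03`.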
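## Automation
To keep this from happening again, I created the following pacman hook, which regenerates `/etc/cdi/nvidia.yaml` whenever `nvidia` is updated. Since `nvidia` has only just been updated, I do not yet know whether the hook actually works.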
```
# This file located at /etc/pacman.d/hooks/nvidia-generate-cdi.hook
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia
[Action]
Description=Update cdi for container
Depends=nvidia-container-toolkit
When=PostTransaction
NeedsTargets
Exec=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
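Because the hook has not yet been exercised by a real upgrade, one way to check it afterwards is to compare the driver version embedded in the spec with the installed package version. The sketch below is only an illustration: `cdi_driver_version` is a hypothetical helper name, and it assumes the spec still lists `libEGL_nvidia.so.<version>` under `hostPath`, as in the excerpt shown earlier.

```shell
# Print the driver version embedded in the spec's libEGL_nvidia entry,
# e.g. 530.41.03 for /usr/lib/libEGL_nvidia.so.530.41.03.
# Usage: cdi_driver_version [spec-file]   (defaults to /etc/cdi/nvidia.yaml)
# Assumption: the spec contains a hostPath line ending in libEGL_nvidia.so.<version>.
cdi_driver_version() {
    spec_file="${1:-/etc/cdi/nvidia.yaml}"
    # Keep only the numeric suffix after "libEGL_nvidia.so."; take the first
    # match in case the library is listed more than once (e.g. lib32).
    sed -n 's/.*hostPath:.*libEGL_nvidia\.so\.\([0-9][0-9.]*\)[[:space:]]*$/\1/p' "$spec_file" |
    head -n 1
}
```

After the next upgrade, the value printed by `cdi_driver_version` should match the version part of `pacman -Q nvidia` (for example `535.54.03` for package version `535.54.03-2`). If it still prints the old version, the hook probably did not run or failed.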
[1]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html "Installation Guide — NVIDIA Cloud Native Technologies documentation"