new essays: use-cuda-in-container
All checks were successful
ci/woodpecker/push/deploy Pipeline was successful
parent d77074ef75
commit a5ac8c83e5
87
content/essays/use-cuda-in-container.md
Normal file
@ -0,0 +1,87 @@
---
title: "Using CUDA in Containers"
date: 2023-06-16T10:18:21+08:00
tags: []
categories: []
weight: 50
show_comments: true
draft: false
---

> An nvidia package update left the nvidia.com device entries in the CDI files used by my containers out of date. That made me realize how important it is to write the process down, hence this post.

<!--more-->

## General installation steps

> This section was written from memory and the documentation. I did not rerun the steps while writing, so details may differ from what you actually see.

To use CUDA in a container, you need a reasonably recent driver, a working container runtime (podman, docker, etc.), and [nvidia-container-toolkit](https://aur.archlinux.org/packages/nvidia-container-toolkit).

```
pacman -S podman
paru -S nvidia-container-toolkit
```

Generate the CDI specification file:

```
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
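
Before starting a container, it can help to confirm that the generated spec defines the device name you intend to pass to `--device` (in this post, `nvidia.com/gpu=all`). The following is a minimal sketch, assuming your nvidia-container-toolkit version provides the `nvidia-ctk cdi list` subcommand, which older releases may lack. The function name `list_cdi_devices` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the CDI device names known to nvidia-ctk,
# or fail with a message if nvidia-ctk is not installed.
list_cdi_devices() {
    if ! command -v nvidia-ctk >/dev/null 2>&1; then
        echo "nvidia-ctk not found; install nvidia-container-toolkit first" >&2
        return 127
    fi
    # Assumption: this toolkit version supports the "cdi list" subcommand.
    nvidia-ctk cdi list
}

# Usage: list_cdi_devices
```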

You can then pull a [cuda image](https://hub.docker.com/r/nvidia/cuda) and check whether CUDA is usable:

```
podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda nvidia-smi -L
```

Note that the `nvidia/cuda` repository does not publish a `latest` tag, so you will likely need to append an explicit tag from the Docker Hub page to the image name.

The steps summarized above are all covered in the [Installation Guide — NVIDIA Cloud Native Technologies documentation][1]; refer to it for details.

## The problem I ran into

In short, the nvidia driver was updated, and the following error appeared.

```
Error: unable to start container "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": crun: error stat'ing file `/usr/lib/libEGL_nvidia.so.530.41.03`: No such file or directory: OCI runtime attempted to invoke a command that was not found
```

The root cause is that the container was started with `--device nvidia.com/gpu=all` (that is, via CDI), which depends on the files under `/etc/cdi/`. Some of the files referenced in `/etc/cdi/nvidia.yaml` (excerpt below) no longer exist because nvidia was upgraded from 530.41.03-17 to 535.54.03-2, and that produced the error above.

```
- containerPath: /usr/lib/libEGL_nvidia.so.530.41.03
  hostPath: /usr/lib/libEGL_nvidia.so.530.41.03
  options:
    - ro
    - nosuid
    - nodev
    - bind
```
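
To see which entries have gone stale, you can check every `hostPath` in the spec against the filesystem. The following is a minimal sketch, assuming each `hostPath:` value sits unquoted on its own line as in the excerpt above. The function name `stale_cdi_paths` is hypothetical and not part of nvidia-container-toolkit.

```shell
#!/bin/sh
# Hypothetical helper: print every hostPath in a CDI spec that no longer
# exists on the host. Assumes unquoted "hostPath: /some/path" lines.
stale_cdi_paths() {
    spec="${1:-/etc/cdi/nvidia.yaml}"
    # Strip leading indentation, an optional "- " list marker, and the key,
    # leaving only the path; lines without a hostPath key are skipped.
    sed -n 's/^[[:space:]]*\(- \)\{0,1\}hostPath:[[:space:]]*//p' "$spec" |
    while IFS= read -r path; do
        # Report the path only if nothing exists at that location.
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# Usage: stale_cdi_paths /etc/cdi/nvidia.yaml
```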

Given that, regenerating the file fixes the problem (the command below is also mentioned in the [Installation Guide — NVIDIA Cloud Native Technologies documentation][1]).

```
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

## Automation

To keep this from happening again, I created the following hook to regenerate `/etc/cdi/nvidia.yaml` automatically whenever nvidia is updated. Since I have only just updated nvidia, I do not yet know whether the hook works correctly.

```
# This file is located at /etc/pacman.d/hooks/nvidia-generate-cdi.hook
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia

[Action]
Description=Update cdi for container
Depends=nvidia-container-toolkit
When=PostTransaction
NeedsTargets
Exec=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
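
Until the next upgrade confirms that the hook fires, one way to check the spec by hand is to compare the driver version embedded in its library names with the installed package version. The following is a minimal sketch, assuming the spec references versioned files such as `libEGL_nvidia.so.530.41.03`. The function name `cdi_driver_versions` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct driver versions that appear as
# ".so.<major>.<minor>[.<patch>]" suffixes in a CDI spec.
cdi_driver_versions() {
    spec="${1:-/etc/cdi/nvidia.yaml}"
    # Match suffixes with at least two numeric components, so plain
    # sonames such as "libcuda.so.1" are ignored.
    grep -o '\.so\.[0-9][0-9]*\.[0-9][0-9]*\(\.[0-9][0-9]*\)\{0,1\}' "$spec" |
    sed 's/^\.so\.//' |
    sort -u
}

# Usage: compare the output with the installed package version, e.g.
#   cdi_driver_versions /etc/cdi/nvidia.yaml
#   pacman -Q nvidia
```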
[1]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html "Installation Guide — NVIDIA Cloud Native Technologies documentation"