Error details

Exception has occurred: RuntimeError
NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'
File "/cyb/LAVIS/lavis/common/dist_utils.py", line 114, in init_distributed_mode
torch.distributed.barrier()
File "/cyb/LAVIS/train.py", line 96, in main
init_distributed_mode(cfg.run_cfg)  # initialize distributed training mode
File "/cyb/LAVIS/train.py", line 119, in <module>
main()
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'

This error means the distributed-training setup hit CUDA out-of-memory: the job seems to default to GPU 0 first, but GPU 0 is already fully occupied by another process, so the CUDA call inside NCCL fails when the barrier tries to initialize on that device.
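To see why rank 0 always lands on the busy card, it helps to remember that PyTorch numbers devices *after* the `CUDA_VISIBLE_DEVICES` filter is applied. The helper below is a small illustration of that mapping (it is an assumption for explanation, not code from LAVIS or PyTorch):

```python
from typing import Optional


def visible_to_physical(visible: Optional[str], logical_index: int) -> int:
    """Map a logical CUDA device index (what PyTorch sees, e.g. cuda:0)
    to the physical GPU index reported by nvidia-smi.

    When CUDA_VISIBLE_DEVICES is unset, logical index N is physical GPU N,
    so rank 0 of a single-process job always lands on GPU 0 -- the card
    that is already full in our case.
    """
    if not visible:
        return logical_index
    physical = [int(tok) for tok in visible.split(",") if tok.strip()]
    return physical[logical_index]


# Default behaviour: logical cuda:0 -> physical GPU 0 (full -> OOM).
print(visible_to_physical(None, 0))        # -> 0
# With CUDA_VISIBLE_DEVICES=1, logical cuda:0 is really physical GPU 1.
print(visible_to_physical("1", 0))         # -> 1
# With CUDA_VISIBLE_DEVICES=1,4,5,6, logical cuda:2 is physical GPU 5.
print(visible_to_physical("1,4,5,6", 2))   # -> 5
```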


Current GPU usage:

(ab) (base) root@b11a13895df1:/# nvidia-smi
Thu Mar 14 13:07:58 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1B:00.0 Off | N/A |
| 73% 69C P2 343W / 350W | 23875MiB / 24576MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:1C:00.0 Off | N/A |
| 30% 26C P8 21W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:1D:00.0 Off | N/A |
| 30% 30C P8 10W / 350W | 20465MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:1E:00.0 Off | N/A |
| 70% 67C P2 342W / 350W | 23875MiB / 24576MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... Off | 00000000:3D:00.0 Off | N/A |
| 30% 24C P8 26W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... Off | 00000000:3F:00.0 Off | N/A |
| 30% 26C P8 26W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... Off | 00000000:40:00.0 Off | N/A |
| 30% 29C P8 23W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... Off | 00000000:41:00.0 Off | N/A |
| 65% 64C P2 349W / 350W | 20109MiB / 24576MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1702935 C 23857MiB |
| 2 N/A N/A 2085163 C 20461MiB |
| 3 N/A N/A 877638 C 23857MiB |
| 7 N/A N/A 2094000 C 20107MiB |
+-----------------------------------------------------------------------------+

From the output above, the memory of GPU 0, GPU 2, GPU 3, and GPU 7 is almost full.
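Rather than eyeballing the table, the same check can be scripted from `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`. The helper below is a hypothetical convenience (the function name and threshold are my own), shown against the figures from the table above:

```python
def idle_gpus(csv_lines, max_used_mib=1024):
    """Return indices of GPUs whose used memory is below the threshold."""
    free = []
    for line in csv_lines:
        idx, used, total = (int(x) for x in line.split(","))
        if used < max_used_mib:
            free.append(idx)
    return free


# Memory figures copied from the nvidia-smi output above.
sample = [
    "0, 23875, 24576",
    "1, 0, 24576",
    "2, 20465, 24576",
    "3, 23875, 24576",
    "4, 2, 24576",
    "5, 0, 24576",
    "6, 0, 24576",
    "7, 20109, 24576",
]
print(idle_gpus(sample))  # -> [1, 4, 5, 6]
```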


Solution

Several GPUs in the system (GPU 1, GPU 4, GPU 5, and GPU 6) are almost unused, each holding at most 2 MiB. Point the training process at one of these GPUs by setting the CUDA_VISIBLE_DEVICES environment variable.

Original command:

python -m torch.distributed.run --nproc_per_node=1 --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml 

New command (run on GPU 1):

CUDA_VISIBLE_DEVICES=1 python -m torch.distributed.run --nproc_per_node=1 --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml
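The same idea extends to all four idle GPUs. The sketch below (my own extension, not from the original fix) makes the four free cards visible, derives the worker count from the list, and builds the launch command; inside the job they appear as cuda:0 through cuda:3:

```shell
# Expose only the idle GPUs (1, 4, 5, 6 in the nvidia-smi output above).
export CUDA_VISIBLE_DEVICES=1,4,5,6

# One worker per visible device: count the comma-separated entries.
NGPU=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')

# Assemble the launch command with one process per visible GPU.
CMD="python -m torch.distributed.run --nproc_per_node=$NGPU --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml"
echo "$CMD"
```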