Error details
Exception has occurred: RuntimeError
NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 'out of memory'

  File "/cyb/LAVIS/lavis/common/dist_utils.py", line 114, in init_distributed_mode
    torch.distributed.barrier()
  File "/cyb/LAVIS/train.py", line 96, in main
    init_distributed_mode(cfg.run_cfg)
  File "/cyb/LAVIS/train.py", line 119, in <module>
    main()
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3
ncclUnhandledCudaError: Call to CUDA function failed. Last error: Cuda failure 'out of memory'
This error means CUDA ran out of memory while setting up distributed training: the GPU the process tried to use was already heavily occupied by other jobs. It appears that the process defaults to GPU 0, and since GPU 0 was nearly full, the NCCL initialization (the `torch.distributed.barrier()` call) failed with an out-of-memory error.
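Why does the failure land on GPU 0? `torch.distributed.run` exports a `LOCAL_RANK` environment variable for each worker, and if a process never binds itself to its own GPU (via `torch.cuda.set_device`) before the first collective call, NCCL falls back to device 0. A minimal sketch of the device-binding step (the `local_rank_from_env` helper is mine for illustration, not part of LAVIS):

```python
import os

def local_rank_from_env(env=os.environ):
    """torch.distributed.run exports LOCAL_RANK for each worker; default to 0."""
    return int(env.get("LOCAL_RANK", 0))

# Inside an init_distributed_mode-style function, the idea (hypothetical sketch)
# is to bind the device *before* the first NCCL collective:
#
#   torch.cuda.set_device(local_rank_from_env())   # keeps rank N off GPU 0
#   torch.distributed.init_process_group(backend="nccl")
#   torch.distributed.barrier()
```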
Current GPU usage:
Thu Mar 14 13:07:58 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 73%   69C    P2   343W / 350W |  23875MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 30%   26C    P8    21W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 30%   30C    P8    10W / 350W |  20465MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 70%   67C    P2   342W / 350W |  23875MiB / 24576MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 30%   24C    P8    26W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:3F:00.0 Off |                  N/A |
| 30%   26C    P8    26W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:40:00.0 Off |                  N/A |
| 30%   29C    P8    23W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:41:00.0 Off |                  N/A |
| 65%   64C    P2   349W / 350W |  20109MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    1702935      C                                  23857MiB |
|    2   N/A  N/A    2085163      C                                  20461MiB |
|    3   N/A  N/A     877638      C                                  23857MiB |
|    7   N/A  N/A    2094000      C                                  20107MiB |
+-----------------------------------------------------------------------------+
From the output above, GPUs 0, 2, 3, and 7 are nearly full (each using 20+ GiB of the 24 GiB available).
Solution
Several GPUs in the system (GPU 1, 4, 5, and 6) are essentially idle, each using at most 2 MiB of memory. Direct the training process to one of these idle GPUs by setting the CUDA_VISIBLE_DEVICES environment variable.
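Picking the idle GPUs can be made mechanical: scan the Memory-Usage column and keep the devices below some small threshold. A short illustrative helper (`pick_idle_gpus` and the 100 MiB threshold are my own, not from any standard tool):

```python
def pick_idle_gpus(mem_used_mib, threshold_mib=100):
    """Return indices of GPUs whose used memory is below the threshold."""
    return [i for i, used in enumerate(mem_used_mib) if used < threshold_mib]

# Memory-Usage column from the nvidia-smi output above, in MiB per GPU
usage = [23875, 0, 20465, 23875, 2, 0, 0, 20109]
print(pick_idle_gpus(usage))  # [1, 4, 5, 6]
```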
Original command:
python -m torch.distributed.run --nproc_per_node=1 --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml
New command (pinned to GPU 1):
CUDA_VISIBLE_DEVICES=1 python -m torch.distributed.run --nproc_per_node=1 --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml
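Note that CUDA_VISIBLE_DEVICES also re-indexes devices: inside the launched process, physical GPU 1 appears as `cuda:0`, so code that "defaults to GPU 0" now lands on the idle card. The same mechanism lets you hand all four idle GPUs to a multi-process run (e.g. `CUDA_VISIBLE_DEVICES=1,4,5,6` together with `--nproc_per_node=4`). A quick sketch of the logical-to-physical mapping (the helper name is mine, for illustration only):

```python
def visible_to_physical(env_value):
    """Map logical CUDA index -> physical GPU id for a CUDA_VISIBLE_DEVICES value."""
    return {logical: int(phys) for logical, phys in enumerate(env_value.split(","))}

print(visible_to_physical("1"))        # {0: 1} -- cuda:0 in-process is physical GPU 1
print(visible_to_physical("1,4,5,6"))  # {0: 1, 1: 4, 2: 5, 3: 6}
```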