Error details

Exception has occurred: RuntimeError
NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'
File "/cyb/LAVIS/lavis/common/dist_utils.py", line 114, in init_distributed_mode
torch.distributed.barrier()
File "/cyb/LAVIS/train.py", line 96, in main
init_distributed_mode(cfg.run_cfg)  # initialize distributed training mode
File "/cyb/LAVIS/train.py", line 119, in <module>
main()
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:121, unhandled cuda error, NCCL version 2.14.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'

This error means the distributed-training setup hit CUDA out-of-memory: the job seems to default to GPU 0 first, but GPU 0 is already fully occupied by another process, so the CUDA call inside NCCL fails when the barrier tries to initialize on that device.
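To see why rank 0 always lands on the busy card, it helps to remember that PyTorch numbers devices *after* the `CUDA_VISIBLE_DEVICES` filter is applied. The helper below is a small illustration of that mapping (it is an assumption for explanation, not code from LAVIS or PyTorch):

```python
from typing import Optional


def visible_to_physical(visible: Optional[str], logical_index: int) -> int:
    """Map a logical CUDA device index (what PyTorch sees, e.g. cuda:0)
    to the physical GPU index reported by nvidia-smi.

    When CUDA_VISIBLE_DEVICES is unset, logical index N is physical GPU N,
    so rank 0 of a single-process job always lands on GPU 0 -- the card
    that is already full in our case.
    """
    if not visible:
        return logical_index
    physical = [int(tok) for tok in visible.split(",") if tok.strip()]
    return physical[logical_index]


# Default behaviour: logical cuda:0 -> physical GPU 0 (full -> OOM).
print(visible_to_physical(None, 0))        # -> 0
# With CUDA_VISIBLE_DEVICES=1, logical cuda:0 is really physical GPU 1.
print(visible_to_physical("1", 0))         # -> 1
# With CUDA_VISIBLE_DEVICES=1,4,5,6, logical cuda:2 is physical GPU 5.
print(visible_to_physical("1,4,5,6", 2))   # -> 5
```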


Current GPU usage:

(ab) (base) root@b11a13895df1:/# nvidia-smi
Thu Mar 14 13:07:58 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:1B:00.0 Off | N/A |
| 73% 69C P2 343W / 350W | 23875MiB / 24576MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:1C:00.0 Off | N/A |
| 30% 26C P8 21W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:1D:00.0 Off | N/A |
| 30% 30C P8 10W / 350W | 20465MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:1E:00.0 Off | N/A |
| 70% 67C P2 342W / 350W | 23875MiB / 24576MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... Off | 00000000:3D:00.0 Off | N/A |
| 30% 24C P8 26W / 350W | 2MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... Off | 00000000:3F:00.0 Off | N/A |
| 30% 26C P8 26W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... Off | 00000000:40:00.0 Off | N/A |
| 30% 29C P8 23W / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... Off | 00000000:41:00.0 Off | N/A |
| 65% 64C P2 349W / 350W | 20109MiB / 24576MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1702935 C 23857MiB |
| 2 N/A N/A 2085163 C 20461MiB |
| 3 N/A N/A 877638 C 23857MiB |
| 7 N/A N/A 2094000 C 20107MiB |
+-----------------------------------------------------------------------------+

From the output above, the memory of GPU 0, GPU 2, GPU 3, and GPU 7 is almost full.
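Rather than eyeballing the table, the same check can be scripted from `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`. The helper below is a hypothetical convenience (the function name and threshold are my own), shown against the figures from the table above:

```python
def idle_gpus(csv_lines, max_used_mib=1024):
    """Return indices of GPUs whose used memory is below the threshold."""
    free = []
    for line in csv_lines:
        idx, used, total = (int(x) for x in line.split(","))
        if used < max_used_mib:
            free.append(idx)
    return free


# Memory figures copied from the nvidia-smi output above.
sample = [
    "0, 23875, 24576",
    "1, 0, 24576",
    "2, 20465, 24576",
    "3, 23875, 24576",
    "4, 2, 24576",
    "5, 0, 24576",
    "6, 0, 24576",
    "7, 20109, 24576",
]
print(idle_gpus(sample))  # -> [1, 4, 5, 6]
```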


Solution

Several GPUs in the system (GPU 1, GPU 4, GPU 5, and GPU 6) are almost unused, each holding at most 2 MiB. Point the training process at one of these GPUs by setting the CUDA_VISIBLE_DEVICES environment variable.

Original command:

python -m torch.distributed.run --nproc_per_node=1 --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml 

New command (run on GPU 1):

CUDA_VISIBLE_DEVICES=1 python -m torch.distributed.run --nproc_per_node=1 --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml
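The same idea extends to all four idle GPUs. The sketch below (my own extension, not from the original fix) makes the four free cards visible, derives the worker count from the list, and builds the launch command; inside the job they appear as cuda:0 through cuda:3:

```shell
# Expose only the idle GPUs (1, 4, 5, 6 in the nvidia-smi output above).
export CUDA_VISIBLE_DEVICES=1,4,5,6

# One worker per visible device: count the comma-separated entries.
NGPU=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')

# Assemble the launch command with one process per visible GPU.
CMD="python -m torch.distributed.run --nproc_per_node=$NGPU --master_port=2564 train.py --cfg-path lavis/projects/blip2/train/pretrain_stage1.yaml"
echo "$CMD"
```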