
CUDA OOM on Slurm but not locally, even if Slurm has more GPUs

I am working on a Slurm-based cluster. I debug my code on the login node, which has 2 GPUs. There I can run it fine using model = nn.DataParallel(model), but my Slurm jobs crash with

RuntimeError: CUDA out of memory. Tried to allocate 246.00 MiB (GPU 0; 15.78 GiB total capacity; 2.99 GiB already allocated; 97.00 MiB free; 3.02 GiB reserved in total by PyTorch)
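Roughly, the wrapping looks like this (a minimal sketch with a hypothetical, much smaller stand-in for my real model, just to show the setup):

    import torch
    import torch.nn as nn

    # hypothetical stand-in for the real (much bigger) model
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
    model = nn.DataParallel(model).cuda()  # replicates the model across all visible GPUs

    x = torch.randn(32, 1024).cuda()       # dummy batch
    y = model(x)                           # each forward pass splits the batch across the GPUs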

I submit Slurm jobs using submitit.SlurmExecutor with the following parameters:

    executor.update_parameters(
        time=1000,
        nodes=1,
        ntasks_per_node=4,
        num_gpus=4,
        job_name='test',
        mem="256GB",
        cpus_per_task=8,
    )
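For completeness, the submission itself is along these lines (the log folder and the train function are placeholders, not my real code):

    import submitit

    def train():
        # placeholder: build the model, wrap it in nn.DataParallel, run the training loop
        pass

    executor = submitit.SlurmExecutor(folder="slurm_logs")  # placeholder log folder
    executor.update_parameters(
        time=1000,
        nodes=1,
        ntasks_per_node=4,
        num_gpus=4,
        job_name='test',
        mem="256GB",
        cpus_per_task=8,
    )
    job = executor.submit(train)  # train() then runs inside the Slurm allocation
    print(job.job_id)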

I am even requesting more GPUs (4 instead of 2), yet the job still crashes.
I checked that all 4 GPUs are visible to the job, and they are.
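The visibility check is along these lines (a minimal sketch, not the exact code):

    import os
    import torch

    # printed at the start of the job to confirm which GPUs Slurm gives me
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("torch.cuda.device_count() =", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))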

The weird thing is that if I reduce the network size:

  • With nn.DataParallel I still get a CUDA OOM.
  • Without it, everything works and the jobs do not crash. But I need to use the bigger model, so this is not a solution.

Why? Is it due to nn.DataParallel?

EDIT
My model has an LSTM inside, and I noticed that I get the following warning:

/private/home/xxx/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py:679: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at  /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:924.)
  result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,

After a Google search, it seems I have to call flatten_parameters() before calling the LSTM, but I cannot find a definitive answer about where exactly to call it (the kind of placement I mean is sketched below the error). Also, after adding flatten_parameters() the code still works locally, but the Slurm jobs now crash with

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
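To be concrete, the placement I mean is the following (a simplified stand-in for my real model, not the exact code):

    import torch
    import torch.nn as nn

    class LSTMModel(nn.Module):  # simplified stand-in for my real model
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(input_size=128, hidden_size=256,
                                num_layers=2, batch_first=True)
            self.fc = nn.Linear(256, 10)

        def forward(self, x):
            # compact the LSTM weights right before the call, as the warning suggests;
            # placed inside forward() so it also runs on each DataParallel replica
            self.lstm.flatten_parameters()
            out, _ = self.lstm(x)
            return self.fc(out[:, -1])

    model = nn.DataParallel(LSTMModel()).cuda()
    x = torch.randn(8, 16, 128).cuda()  # dummy batch: (batch, seq_len, features)
    y = model(x)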

parallel-processing, pytorch, out-of-memory, lstm, slurm
