
My code structure is as follows:

import copy
import torch
from torch.multiprocessing import Pool

def pro(epoch, model, device):
    # train a neural network for an epoch
    ...

device0 = torch.device('cuda:0')
device1 = torch.device('cuda:1')

devices = [device0, device0, device1, device1]
models = [copy.deepcopy(model) for i in range(4)]

for i in range(100):
    j = [i]*4
    pool = Pool(processes=4)
    results = pool.starmap(pro, zip(j, models, devices))
    #do more stuff with results

If my neural network and dataset are small, everything works fine, but as I increase their size, the process just hangs.

I have seen this issue asked many times here and on other websites. I have tried the solutions I found, but none of them has worked for me.

Among the solutions I have tried are:

  • One answer suggested using apply_async instead of starmap, but nothing changed.
  • Another solution I found here said that I should wrap all my calls to Pool in a function, which I did, but that also didn't change anything.
  • Another suggested setting set_start_method to "spawn", but that didn't do anything either.
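For reference, my "spawn" attempt looked roughly like this (a minimal sketch: pro is stubbed out, torch.nn.Linear is a hypothetical stand-in for my real model, and the pool is created once inside a guarded main):

```python
import copy

import torch
import torch.multiprocessing as mp  # drop-in for multiprocessing; shares CUDA tensors


def pro(epoch, model, device):
    # placeholder for the real per-epoch training loop
    return epoch


def main():
    model = torch.nn.Linear(8, 8)  # hypothetical stand-in for my network
    device0 = torch.device('cuda:0')
    device1 = torch.device('cuda:1')
    devices = [device0, device0, device1, device1]
    models = [copy.deepcopy(model) for _ in range(4)]

    with mp.Pool(processes=4) as pool:  # pool created once, closed on exit
        for i in range(100):
            results = pool.starmap(pro, zip([i] * 4, models, devices))
            # do more stuff with results


if __name__ == '__main__':
    # the start method must be set before any pool/process is created
    mp.set_start_method('spawn')
    main()
```

Even with this structure (everything behind the `__main__` guard, spawn set first), the behaviour is the same for me once the model and dataset get large.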

I suspect my problem is discussed in this SO thread, but I don't know how to translate their solution to my case.

I would appreciate any help.

Schach21
