My code structure is as follows:

    import copy
    import torch
    from multiprocessing import Pool

    def pro(epoch, model, device):
        # train the neural network for one epoch on the given device
        ...

    device0 = torch.device('cuda:0')
    device1 = torch.device('cuda:1')
    devices = [device0, device0, device1, device1]
    models = [copy.deepcopy(model) for _ in range(4)]

    for i in range(100):
        j = [i] * 4
        pool = Pool(processes=4)
        results = pool.starmap(pro, zip(j, models, devices))
        # do more stuff with results

If my neural network is small and my dataset is small, everything works fine, but as I increase the size of either, the processes just hang.
I have seen this issue asked many times here and on other websites. I have tried the solutions I have found, but none of them has worked for me.
Some of the solutions I have tried are listed below (a sketch combining all three attempts follows the list):
- This, which suggests using `apply_async` instead of `starmap`, but nothing changed.
- Another solution I found here, which said I should put all my calls to `Pool` inside a function; I did that, but it also didn't change anything.
- This, which suggests setting `set_start_method` to `"spawn"`, but that also didn't do anything.
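
For reference, here is roughly how those three attempts looked when combined (the linked answers aren't reproduced here, and `run_epochs` is just a placeholder name for my wrapper function):

    import copy
    import torch
    import torch.multiprocessing as mp

    def pro(epoch, model, device):
        # train the neural network for one epoch on the given device
        ...

    def run_epochs(model):
        # second suggestion: keep all Pool usage inside a function
        device0 = torch.device('cuda:0')
        device1 = torch.device('cuda:1')
        devices = [device0, device0, device1, device1]
        models = [copy.deepcopy(model) for _ in range(4)]
        for i in range(100):
            with mp.Pool(processes=4) as pool:
                # first suggestion: apply_async instead of starmap
                handles = [pool.apply_async(pro, (i, m, d))
                           for m, d in zip(models, devices)]
                results = [h.get() for h in handles]
            # do more stuff with results

    if __name__ == '__main__':
        # third suggestion: use the "spawn" start method
        mp.set_start_method('spawn')
        model = ...  # the actual model is built here
        run_epochs(model)
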
I suspect my problem is discussed in this SO thread, but I don't know how to translate their solution to my case.
I would appreciate any help.