I am writing an R package with tests using the testthat package. The tests pass locally and on Travis.
I want to plot the benefit of parallelisation on up to 24 cores, so I set up a virtual machine programmatically on Azure:
az vm create \
--resource-group <resource group> \
--name <name> \
--image microsoft-dsvm:linux-data-science-vm-ubuntu:linuxdsvmubuntu:18.12.01 \
--size Standard_NV24 \
--admin-username <azure user> \
--generate-ssh-keys
I call devtools::test() and the virtual machine gets stuck at testthat for hours with 0% CPU usage:
✔ checking for unstated dependencies in ‘tests’ ...
─ checking tests ...
Running ‘testthat.R’
devtools::test() has no specific arguments to print some output, and it calls testthat::test_dir(), which also has no arguments to print output.
When I remove the tests, the code stalls in a function call. I stopped and relaunched multiple times, so I think it's a problem with Azure.
Other sizes of machines, Standard_F16s_v2 and Standard DS3 v2, have the same problem.
How can I fix it?
Update
I stumbled upon the culprit when I was running another piece of code. The VM hangs when calling the R parallel package on multiple cores (it works fine on just one core).
I added some printouts around that call:
[1] "num_cores = 2"
[1] "Entering parallel"
[1] "Drawing 1"
[1] "Drawing 2"
and then it hangs. Following the suggestion of strace, I launched again and see this output around this last printout:
[pid 8571] 21:41:48.718625 write(1, "\n", 1[1] "num_cores = 2"
) = 1
[pid 8571] 21:41:48.719486 write(1, "[1]", 3) = 3
[pid 8571] 21:41:48.719550 write(1, " \"num_cores = 2\"", 16) = 16
[pid 8571] 21:41:48.719615 write(1, "\n", 1) = 1
[pid 8703] 21:41:48.730707 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8705] 21:41:48.730745 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8701] 21:41:48.730757 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8704] 21:41:48.730794 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8702] 21:41:48.730841 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8700] 21:41:48.730852 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8706] 21:41:48.731199 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4032, NULL <unfinished ...>
[pid 8571] 21:41:48.742710 futex(0x498fcf4, FUTEX_WAKE_PRIVATE, 2147483647) = 7
[pid 8706] 21:41:48.742831 <... futex resumed> ) = 0
[pid 8705] 21:41:48.742847 <... futex resumed> ) = 0
[pid 8704] 21:41:48.742858 <... futex resumed> ) = 0
[pid 8703] 21:41:48.742868 <... futex resumed> ) = 0
[pid 8702] 21:41:48.742878 <... futex resumed> ) = 0
[pid 8701] 21:41:48.742888 <... futex resumed> ) = 0
[pid 8700] 21:41:48.742899 <... futex resumed> ) = 0
[pid 8700] 21:41:48.742931 futex(0x87da514, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[1] "Entering parallel"
[pid 8700] 21:41:48.743001 futex(0x87da514, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 8571] 21:41:48.743411 write(1, "[1]", 3) = 3
[pid 8571] 21:41:48.743485 write(1, " \"Entering parallel\"", 20) = 20
[pid 8571] 21:41:48.743545 write(1, "\n", 1) = 1
[pid 8571] 21:41:48.743946 pipe([4, 11]) = 0
[pid 8571] 21:41:48.744019 pipe([12, 13]) = 0
[pid 8571] 21:41:48.744096 rt_sigaction(SIGCHLD, {0x7fdc52d45d40, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fdc693d1390}, NULL, 8) = 0
[pid 8571] 21:41:48.744160 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdc6f002a90) = 8708
strace: Process 8708 attached
[pid 8708] 21:41:48.746022 set_robust_list(0x7fdc6f002aa0, 24) = 0
[pid 8571] 21:41:48.746478 close(11) = 0
[pid 8708] 21:41:48.746529 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fdc6902b4b0}, <unfinished ...>
[pid 8571] 21:41:48.746549 close(12 <unfinished ...>
[pid 8708] 21:41:48.746559 <... rt_sigaction resumed> {0x7fdc52d45d40, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fdc693d1390}, 8) = 0
[pid 8571] 21:41:48.746571 <... close resumed> ) = 0
[pid 8708] 21:41:48.746581 close(4) = 0
[pid 8708] 21:41:48.746643 dup2(12, 0) = 0
[pid 8708] 21:41:48.746704 close(12) = 0
[pid 8708] 21:41:48.746764 rt_sigaction(SIGUSR1, {0x7fdc52d45a90, [USR1], SA_RESTORER|SA_RESTART, 0x7fdc6902b4b0}, {0x7fdc6ce94550, [USR1], SA_RESTORER|SA_RESTART, 0x7fdc6902b4b0}, 8) = 0
[pid 8571] 21:41:48.748205 pipe([11, 12]) = 0
[pid 8571] 21:41:48.748280 pipe([14, 15]) = 0
[pid 8571] 21:41:48.748350 rt_sigaction(SIGCHLD, {0x7fdc52d45d40, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fdc693d1390}, NULL, 8) = 0
[pid 8571] 21:41:48.748410 clone([1] "Drawing 1"
<unfinished ...>
[pid 8708] 21:41:48.749158 write(1, "[1]", 3) = 3
[pid 8708] 21:41:48.749242 write(1, " \"Drawing 1\"", 12) = 12
[pid 8708] 21:41:48.749300 write(1, "\n", 1) = 1
[pid 8571] 21:41:48.750225 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdc6f002a90) = 8709
strace: Process 8709 attached
[pid 8709] 21:41:48.750313 set_robust_list(0x7fdc6f002aa0, 24) = 0
[pid 8571] 21:41:48.750630 close(12) = 0
[pid 8571] 21:41:48.750695 close(14) = 0
[pid 8709] 21:41:48.750829 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fdc6902b4b0}, {0x7fdc52d45d40, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7fdc693d1390}, 8) = 0
[pid 8709] 21:41:48.750908 close(11) = 0
[pid 8709] 21:41:48.750976 dup2(14, 0) = 0
[pid 8709] 21:41:48.751050 close(14) = 0
[pid 8709] 21:41:48.751119 rt_sigaction(SIGUSR1, {0x7fdc52d45a90, [USR1], SA_RESTORER|SA_RESTART, 0x7fdc6902b4b0}, {0x7fdc6ce94550, [USR1], SA_RESTORER|SA_RESTART, 0x7fdc6902b4b0}, 8) = 0
[1] "Drawing 2"
[pid 8571] 21:41:48.752325 wait4(-1, 0x7ffd95fddc44, WNOHANG, NULL) = 0
[pid 8571] 21:41:48.752392 select(12, [4 11], NULL, NULL, {1, 0} <unfinished ...>
[pid 8709] 21:41:48.753053 write(1, "[1]", 3) = 3
[pid 8709] 21:41:48.753138 write(1, " \"Drawing 2\"", 12) = 12
[pid 8709] 21:41:48.753202 write(1, "\n", 1) = 1
[pid 8708] 21:41:48.755113 futex(0x498fcf4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 8704] 21:41:48.756406 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8703] 21:41:48.756424 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8701] 21:41:48.756557 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8702] 21:41:48.756625 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8705] 21:41:48.756769 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8700] 21:41:48.757066 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8706] 21:41:48.757184 futex(0x498fcf4, FUTEX_WAIT_PRIVATE, 4040, NULL <unfinished ...>
[pid 8709] 21:41:48.758569 futex(0x498fcf4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 8708] 21:41:48.768419 futex(0x88b9a04, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8709] 21:41:48.772260 futex(0x88b9a04, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8602] 21:41:48.993871 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 8602] 21:41:48.993914 clock_gettime(CLOCK_REALTIME, {1558734108, 993928682}) = 0
[pid 8602] 21:41:48.993978 futex(0x10e40a8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 8602] 21:41:48.994032 clock_gettime(CLOCK_REALTIME, {1558734108, 994048584}) = 0
[pid 8602] 21:41:48.994096 futex(0x10e40dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 25, {1558734109, 994048584}, ffffffff <unfinished ...>
[pid 8571] 21:41:49.753515 <... select resumed> ) = 0 (Timeout)
[pid 8571] 21:41:49.753623 wait4(-1, 0x7ffd95fddc44, WNOHANG, NULL) = 0
[pid 8571] 21:41:49.753688 select(12, [4 11], NULL, NULL, {1, 0} <unfinished ...>
[pid 8602] 21:41:49.994210 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 8602] 21:41:49.994299 clock_gettime(CLOCK_REALTIME, {1558734109, 994341104}) = 0
[pid 8602] 21:41:49.994442 futex(0x10e40a8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 8602] 21:41:49.994590 clock_gettime(CLOCK_REALTIME, {1558734109, 994614207}) = 0
[pid 8602] 21:41:49.994663 futex(0x10e40dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 27, {1558734110, 994614207}, ffffffff <unfinished ...>
[pid 8571] 21:41:50.754833 <... select resumed> ) = 0 (Timeout)
[pid 8571] 21:41:50.754971 wait4(-1, 0x7ffd95fddc44, WNOHANG, NULL) = 0
[pid 8571] 21:41:50.755117 select(12, [4 11], NULL, NULL, {1, 0} <unfinished ...>
[pid 8602] 21:41:50.994707 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 8602] 21:41:50.994748 clock_gettime(CLOCK_REALTIME, {1558734110, 994763726}) = 0
[pid 8602] 21:41:50.994816 futex(0x10e40a8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 8602] 21:41:50.994879 clock_gettime(CLOCK_REALTIME, {1558734110, 994894827}) = 0
[pid 8602] 21:41:50.994944 futex(0x10e40dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 29, {1558734111, 994894827}, ffffffff <unfinished ...>
The rest of the log is a repetition of the last few lines.
I think that the parent process is 8571, the first spawned process is 8708, which prints Drawing 1, and the second spawned process is 8709, which prints Drawing 2. Then the parent process lists a wait4(-1, 0x7ffd95fddc44, WNOHANG, NULL) = 0, process 8602 lists a <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out). I have never used strace and could not go beyond that.
I have pushed the code to this Github branch, commit 1b3d1f6, if you want to reproduce the problem. Pull the code from the branch, and run azure-parallel/deploy_azure.sh on a terminal (if you have the Azure Command-Line-Interface and are logged in), or change the code a bit to use the IP address of a existing virtual machine.
Has anyone succeeded in running parallel on Azure recently? If so, what were the virtual machine and image?
strace. I updated the question. – miguelmorin May 24 '19 at 23:18