0

I'm looking into ray to distribute many computation tasks that could run in parallel. Ray code looks like:

ray.init(f"ray://{head_ip}:10001")

@ray.remote
def compute(compute_params):
    # do some long computation
    return some_result

# ==== driver part
ray_res = [compute.remote(cp) for cp in many_computations]
remote_res=ray.get(ray_res)

What is the proper way to stop such computation?

Suppose every computation might take a couple of hours, and for some reason, the driver code is killed/stopped/crashed, how is that possible to stop the tasks on the worker machines? Maybe to have some special configuration for workers that will understand that driver is dead...?

Julias
  • 5,562
  • 16
  • 56
  • 80

1 Answers1

0

Looking into RAY documentation

Remote functions can be canceled by calling ray.cancel on the returned Object ref. Remote actor functions can be stopped by killing the actor using the ray.kill interface.

In your example it would be something like:

ray.init(f"ray://{head_ip}:10001")

@ray.remote
def compute(compute_params):
    # do some long computation
    return some_result
    

# ==== driver part
ray_res = [compute.remote(cp) for cp in many_computations]

for x in ray_res:
    try:
        remote_res=ray.get(x)
    except TaskCancelledError:
        ....
    
running.t
  • 4,559
  • 3
  • 26
  • 47
  • I'm actually looking for a way to kill a task from driver or driver helper . I can call [ray.cancel.(r) for r in ray_res]. But if my code failed, how can I recreate ray_res to kill long-running tasks? – Julias Oct 18 '21 at 12:33
  • Instead of recreating, you should kill them before program exits. You [can't handle system SIGKILL or SIGSTOP](https://stackoverflow.com/questions/33242630/how-to-handle-os-system-sigkill-signal-inside-python) signal, but you can handle almost any other internal exception [including e.g SIGTERM](https://stackoverflow.com/questions/18499497/how-to-process-sigterm-signal-gracefully) etc. The simplest way is to surround **all** your code with big `try: ... except (BaseException, Exception):` block. – running.t Oct 18 '21 at 12:46
  • in my usecase, the driver is a bash script that is usually killed with kill -9. How can i handle that? – Julias Oct 18 '21 at 12:48