Scheduler

A scheduler executes a task graph on parallel hardware. Dask provides two types of scheduler:

  1. Single machine
  2. Distributed scheduler

Single machine

The single-machine scheduler runs tasks on the local machine's threads or processes.

import dask

dask.config.set(scheduler="processes")  # multiprocessing scheduler on a single machine

dask.config.set(scheduler="threads")  # thread-pool scheduler on a single machine

The settings above apply globally. The same call also works as a context manager, which limits the setting to the enclosed block (here d is any Dask collection):

with dask.config.set(scheduler="threads"):
    d.compute()

A scheduler can also be passed directly to compute, which affects only that call:

d.compute(scheduler="processes")
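To make the options above concrete, here is a minimal sketch using dask.delayed. The function square and the variable names are illustrative assumptions, not part of the original notes. The global form is left out so the example does not change process-wide state.

```python
import dask
from dask import delayed


@delayed
def square(x):
    # Hypothetical task used only for illustration.
    return x * x


# Lazy graph: square 0..4, then sum the results. Nothing runs yet.
total = delayed(sum)([square(i) for i in range(5)])

# 1. Per call: the scheduler applies to this compute() only.
per_call = total.compute(scheduler="threads")

# 2. Context manager: the scheduler applies inside the with-block only.
with dask.config.set(scheduler="synchronous"):
    in_context = total.compute()

print(per_call, in_context)
```

Both calls evaluate the same graph, so the choice of scheduler changes how the work runs, not the result.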

Local threads

The threaded scheduler is easy to set up and has low overhead, because all tasks share memory inside one process. It gives good parallelism when the work is dominated by numerical libraries such as NumPy, which release the GIL in much of their compiled code. Pure-Python code gains little from it, because the GIL (global interpreter lock) allows only one thread to execute Python bytecode at a time.
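As a rough illustration of a NumPy-heavy workload on the threaded scheduler, the sketch below sums a chunked Dask array. The array shape, chunk size, and num_workers value are arbitrary assumptions chosen for the example.

```python
import dask.array as da

# A 2000 x 2000 array of ones split into 16 chunks of 500 x 500.
x = da.ones((2000, 2000), chunks=(500, 500))

# Each chunk is processed by NumPy, so threads can work on chunks in parallel.
total = (x * 2).sum()

# num_workers caps the size of the thread pool for this call.
result = total.compute(scheduler="threads", num_workers=4)
print(result)
```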

Local processes

To work around the GIL limitation of the threaded scheduler, the multiprocessing scheduler runs tasks in separate processes.

It works best when the workflow is mostly linear and little data moves between tasks, because passing data between processes means serializing and copying it, which is expensive.
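A minimal sketch of a pure-Python, GIL-bound workload sent to the multiprocessing scheduler follows. The helper names slow_square and build_total are hypothetical. The compute call sits under an if __name__ == "__main__" guard so that spawned worker processes, which may re-import the script, do not start the computation again.

```python
from dask import delayed


def slow_square(x):
    # Pure-Python loop that holds the GIL; equivalent to x * x.
    total = 0
    for _ in range(x):
        total += x
    return total


def build_total(numbers):
    # Independent tasks with small inputs and outputs, so little data
    # has to be transferred between processes.
    return delayed(sum)([delayed(slow_square)(n) for n in numbers])


if __name__ == "__main__":
    total = build_total([10, 20, 30])
    print(total.compute(scheduler="processes"))
```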

Synchronous

The synchronous scheduler runs every task in a single thread, with no concurrency or parallelism. This makes it well suited to debugging and profiling.

dask.config.set(scheduler="synchronous")
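One way to see the single-threaded behaviour is to have each task report the thread it runs in. This is a small sketch; running_thread is a hypothetical helper introduced for the example.

```python
import threading

import dask
from dask import delayed


@delayed
def running_thread(_):
    # Report the name of the thread this task executes in.
    return threading.current_thread().name


tasks = [running_thread(i) for i in range(4)]
names = dask.compute(*tasks, scheduler="synchronous")

# With the synchronous scheduler, tasks run in the thread that called compute.
caller = threading.current_thread().name
print(names, caller)
```

Because tasks run in the calling thread, tools such as pdb and standard profilers see them directly.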

Note

Code that creates a local cluster with multiple processes should be enclosed in an if __name__ == "__main__": block. Otherwise, each spawned subprocess re-imports the same script and tries to create its own client.

if __name__ == "__main__":
    from dask.distributed import Client
    client = Client()

Distributed Scheduler

The distributed scheduler runs tasks on a set of workers. It can be set up on a single local machine or on a multi-node cluster.
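The sketch below shows the local-machine case, assuming the distributed package is installed. The helpers increment, build_total, and run_locally and the worker counts are illustrative assumptions. While a Client is active, compute() uses it by default. For a multi-node cluster, the scheduler's address would be passed to Client instead.

```python
from dask import delayed
from dask.distributed import Client


def increment(x):
    return x + 1


def build_total(numbers):
    # Lazy graph: increment each number, then sum the results.
    return delayed(sum)([delayed(increment)(n) for n in numbers])


def run_locally(numbers):
    # Client() without an address starts a local cluster of worker processes.
    # Leaving the with-block shuts the client and its cluster down.
    with Client(n_workers=2, threads_per_worker=1):
        return build_total(numbers).compute()


if __name__ == "__main__":
    print(run_locally([1, 2, 3]))
```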

References

  1. https://docs.dask.org/en/stable/scheduling.html