Scheduler
A scheduler executes a task graph on parallel hardware. There are two types of scheduler:
- Single-machine scheduler
- Distributed scheduler
Single machine
The single-machine scheduler uses the local machine's threads or processes.
import dask
dask.config.set(scheduler="processes")  # multiprocessing scheduler on a single machine
dask.config.set(scheduler="threads")  # thread-pool scheduler on a single machine
The settings above apply globally. They can also be scoped to a block with a context manager:
with dask.config.set(scheduler="threads"):
    d.compute()
A scheduler can also be passed directly when calling compute:
d.compute(scheduler="processes")
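The three ways of choosing a scheduler can be compared side by side. The following is a minimal sketch, assuming a small dask.delayed graph; the square helper and the variable names are illustrative and not part of the notes above.

```python
import dask


@dask.delayed
def square(x):
    # Hypothetical task used only to build a small graph
    return x * x


# d is a lazy task graph that sums 0*0 + 1*1 + 2*2 + 3*3
d = dask.delayed(sum)([square(i) for i in range(4)])

# 1. Per call: applies to this compute() only
per_call = d.compute(scheduler="threads")

# 2. Scoped: applies inside the with block only
with dask.config.set(scheduler="synchronous"):
    scoped = d.compute()

# 3. Global: applies to every later compute() without an explicit scheduler
dask.config.set(scheduler="threads")
global_result = d.compute()

print(per_call, scoped, global_result)
```

All three calls return the same value; only the place where the tasks run changes.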
Local threads
The threaded scheduler is easy to set up and works well for simple computations. It provides good parallelism when used with numerical libraries such as NumPy, because they release the GIL during heavy computation. Pure-Python code gains much less, because the GIL (global interpreter lock) allows only one thread to execute Python bytecode at a time.
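The effect of releasing the GIL can be sketched with the standard library alone. In this example time.sleep stands in for a GIL-releasing call such as a NumPy kernel; that substitution is an assumption made for illustration, and wait_briefly is a hypothetical helper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_briefly(seconds):
    # time.sleep releases the GIL, so other threads can run meanwhile
    time.sleep(seconds)
    return seconds


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Four 0.2 s tasks overlap instead of running back to back
    results = list(pool.map(wait_briefly, [0.2] * 4))
elapsed = time.perf_counter() - start

print(f"4 tasks finished in {elapsed:.2f} s")
```

Run serially, the four tasks would take about 0.8 s. With threads they overlap, which is the behavior the threaded scheduler relies on.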
Local Processes
To overcome the limitations of the threaded scheduler, we can use the multiprocessing scheduler, where Dask runs the computations in multiple processes. Each process has its own interpreter, so the GIL does not hold back pure-Python code.
This scheduler works best when the workflow is linear, because little data then has to be transferred between processes.
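The cost of moving data between processes comes from serialization. The following rough sketch uses the standard pickle module to show how much has to be copied; Dask's own serialization path may differ, and chunk is a hypothetical intermediate result.

```python
import pickle

# Hypothetical intermediate result: one million integers
chunk = list(range(1_000_000))

# Bytes that would have to cross the process boundary
payload = pickle.dumps(chunk)
# What the receiving process would have to rebuild
restored = pickle.loads(payload)

# Reducing inside the worker first leaves only a tiny value to send back
summary_payload = pickle.dumps(sum(chunk))

print(f"full chunk: {len(payload)} bytes, summary: {len(summary_payload)} bytes")
```

Tasks that reduce their data before returning keep this overhead small, which is why workflows with little data exchange suit the process-based scheduler.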
Synchronous
A single-threaded scheduler with no concurrency or parallelism. It is good for debugging and profiling.
dask.config.set(scheduler="synchronous")
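One way to see why this scheduler suits debugging is to check which thread runs a task. This is a minimal sketch, assuming a dask.delayed task; where_am_i is a hypothetical helper.

```python
import threading

import dask


@dask.delayed
def where_am_i():
    # Report the name of the thread that executes this task
    return threading.current_thread().name


caller = threading.current_thread().name
sync_thread = where_am_i().compute(scheduler="synchronous")
pool_thread = where_am_i().compute(scheduler="threads")

print("caller:", caller)
print("synchronous task ran in:", sync_thread)
print("threaded task ran in:", pool_thread)
```

With the synchronous scheduler the task runs in the calling thread, so breakpoints, pdb, and tracebacks behave as they do in ordinary Python code.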
Note
Code that creates a local cluster with multiple processes should be enclosed in an if __name__ == "__main__" block. Otherwise, each subprocess loads the same script again when it is spawned and tries to create its own client.
if __name__ == "__main__":
    from dask.distributed import Client
    client = Client()
Distributed Scheduler
The distributed scheduler runs tasks on workers. It can be set up on a single local machine or on a multi-node cluster.