Slurm
Slurm is a fault-tolerant, highly scalable cluster management and job scheduling system for Linux clusters of any size. As a workload manager, it allocates resources to jobs and schedules and monitors those jobs.
Components
The key components of Slurm are:
Component | Description |
---|---|
slurmctld | Slurm control daemon, which runs on the management node. It manages resources and schedules and monitors jobs. A backup control daemon can be configured to take over in case the primary one fails. |
slurmd | Slurm compute node daemon, which runs on each worker node and executes the tasks assigned to that node. |
slurmdbd | Slurm database daemon, which maintains a database of jobs, resources, etc. |
Commands | User commands such as scontrol, sbatch, squeue, and srun. |
![]() |
---|
Fig: Slurm architecture |
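
To make the role of the commands more concrete, the following is a minimal sketch of how a job might be handed to `slurmctld` through `sbatch`. The helper names `build_batch_script` and `submit`, the job name, and the resource values are illustrative assumptions, not part of Slurm itself. The sketch relies only on two documented `sbatch` behaviors: it reads the script from standard input when no file is given, and `--parsable` prints the job ID (optionally followed by `;cluster`).

```python
import shutil
import subprocess
from typing import Optional


def build_batch_script(job_name: str, ntasks: int, time_limit: str, command: str) -> str:
    """Return the text of a minimal Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",  # name shown by squeue
        f"#SBATCH --ntasks={ntasks}",      # number of tasks to allocate
        f"#SBATCH --time={time_limit}",    # wall-clock limit, HH:MM:SS
        "",
        f"srun {command}",                 # srun launches the tasks on the allocated nodes
    ]
    return "\n".join(lines) + "\n"


def submit(script: str) -> Optional[str]:
    """Submit the script with sbatch; return the job ID, or None if sbatch is not installed."""
    if shutil.which("sbatch") is None:
        return None
    result = subprocess.run(
        ["sbatch", "--parsable"],  # --parsable prints "jobid" or "jobid;cluster"
        input=script,              # sbatch reads the script from stdin when no file is given
        text=True,
        capture_output=True,
        check=True,
    )
    return result.stdout.strip().split(";")[0]


if __name__ == "__main__":
    script = build_batch_script("hello", 2, "00:05:00", "hostname")
    print(script)
    job_id = submit(script)
    if job_id is None:
        print("sbatch not found; script was only generated.")
    else:
        print(f"Submitted job {job_id}")
```

On a machine with Slurm installed, the returned job ID could then be passed to `squeue -j <job_id>` to check the job's state. Generating the script as a string keeps the example runnable on machines without Slurm, where `submit` simply returns `None`.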