Slurm

Slurm is a fault-tolerant, highly scalable cluster management and job scheduling system for large and small Linux clusters. As a workload manager, it allocates resources to jobs and schedules and monitors those jobs.

Components

The key components of Slurm are:

| Component | Description |
|-----------|-------------|
| slurmctld | Slurm control daemon, which runs on the management node. It manages resources and schedules and monitors jobs. More than one control daemon can be configured, so that a backup takes over if the active one fails. |
| slurmd | Slurm daemon that runs on each worker node and executes the tasks assigned to that node. |
| slurmdbd | Slurm database daemon, which maintains the database of jobs, resources, etc. |
| Commands | Command-line tools such as scontrol, sbatch, squeue, and srun. |

Fig: Slurm architecture
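
To make the roles of the commands listed above more concrete, here is a minimal sketch of submitting and inspecting a job. The job name, file names, and resource values are illustrative assumptions rather than anything prescribed by this page, and the cluster commands run only if the Slurm client tools are installed.

```shell
#!/bin/sh
# Write a minimal batch script. The job name, file names, and resource
# values below are illustrative assumptions.
cat > hello.sbatch <<'EOF'
#!/bin/sh
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --output=hello-%j.out

# srun launches the task on the node(s) allocated to this job.
srun hostname
EOF

# Only contact the cluster if the Slurm client commands are available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello.sbatch      # submit the job to the control daemon
    squeue -u "$USER"        # list this user's pending and running jobs
    scontrol show nodes      # show node state as seen by the control daemon
else
    echo "sbatch not found; wrote hello.sbatch only"
fi
```

On a working cluster, sbatch hands the script to slurmctld, which picks a worker node; the slurmd on that node then runs the job, and its output is written to a file named after the job ID (the %j in the --output pattern).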

Resources

  1. https://slurm.schedmd.com/quickstart.html