Slurm

Slurm is a fault-tolerant, highly scalable cluster management and job scheduling system for large and small Linux clusters. As a workload manager, it allocates resources and schedules and monitors jobs.

Components

The key components of Slurm are:

Component	Description
`slurmctld`	Slurm control daemon, which runs on the management node. It manages resources and schedules and monitors jobs. Multiple control daemons can be configured so that a backup takes over if one fails.
`slurmd`	Slurm compute node daemon. It runs on each worker node and executes the tasks assigned to that node.
`slurmdbd`	Slurm database daemon, which maintains the database of jobs, resources, and so on.
Commands	Command-line tools such as `scontrol`, `sbatch`, `squeue`, and `srun`.
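
To make the command row more concrete, here is a minimal Python sketch of how the listed commands might fit together: it renders a small batch script for `sbatch`, and builds `squeue` and `scontrol` command lines for checking on a job. The helper names (`render_batch_script`, `queue_command`, `show_job_command`, `submit`), the job name `demo`, and the user `alice` are hypothetical and chosen only for illustration. The `#SBATCH` options shown (`--job-name`, `--ntasks`, `--time`) are standard, but which options a given cluster needs (partition, account, and so on) depends on its configuration.

```python
import shutil
import subprocess


def render_batch_script(job_name, command, ntasks=1, time_limit="00:05:00"):
    """Return the text of a minimal Slurm batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun launches the command as tasks inside the job's allocation.
        f"srun {command}",
        "",
    ]
    return "\n".join(lines)


def queue_command(user):
    """Build the argv for listing one user's jobs with squeue."""
    return ["squeue", f"--user={user}"]


def show_job_command(job_id):
    """Build the argv for inspecting one job with scontrol."""
    return ["scontrol", "show", "job", str(job_id)]


def submit(script_text):
    """Submit a script to sbatch on stdin.

    Returns sbatch's output, or None when sbatch is not installed
    (for example, when this runs outside a Slurm cluster).
    """
    if shutil.which("sbatch") is None:
        return None
    result = subprocess.run(
        ["sbatch"], input=script_text, text=True,
        capture_output=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(render_batch_script("demo", "hostname", ntasks=2))
    print(" ".join(queue_command("alice")))
```

In this sketch, `submit` passes the script on standard input, which `sbatch` reads when no script file is given. The request is handled by `slurmctld`, and the tasks themselves are run by `slurmd` on the allocated worker nodes.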


Fig: Slurm architecture

Resources

https://slurm.schedmd.com/quickstart.html
