Slurm Cheat Sheet
The Cambridge HPC uses slurm to manage resources and schedule jobs. The following are a list of commands that may prove useful when managing slurm jobs.
For more detailed descriptions of slurm commands there are several available online resources including:
Additional information about each of the commands described, for example sinfo, can also be obtained using the man command, for example man sinfo.
Account Balance
Use the following command to get information about the amount of hours available for use on each of the accounts to which you have access:
(host) $ mybalance
This will return a list similar to the following:
User / Usage |
Account / Usage |
Account Limit |
Available (hours) |
|---|---|---|---|
dc-webs1 / 1 |
DIRAC-TP001-CPU / 12,000 |
500,000 |
488,000 |
dc-webs1 / 0 |
DIRAC-TP001-GPU /0 |
2,000 |
2,000 |
User is the name of the user whose account information is being shown
Account shows the account names available to the user along with the amount of hours that have already been spent against that account
Account Limit details the maximum number of hours allocated to each account
Available lists the amount of hours still available to be spent
Getting Cluster Info
To get information about the available nodes and partitions use the following command:
(host) $ sinfo
This will return a list similar to the following. Each partition is listed with separate entries for each unique node state.
PARTITION |
AVAIL |
TIMELIMIT |
NODES |
STATE |
NODELIST |
|---|---|---|---|---|---|
icelake |
up |
7-00:00:00 |
18 |
idle |
cpu-q-[58,97,99-100,169-176,397-400,411-412] |
icelake |
down |
7-00:00:00 |
1 |
down |
cpu-q-244 |
PARTITION is the partition name
AVAIL states whether the partition is currently available or not
TIMELIMIT gives the maximum time limit for any user job in days-hours:minutes:seconds. infinite is displayed if there is no time limit
NODES gives the number of nodes in this state
STATE gives the current state of the nodes. The most common states are allocated if the node has been allocated to a job, down if the node is unavailable and idle if the node is unallocated
NODELIST lists the names of every node in this state
Monitoring Jobs
To see the status of all jobs submitted by an individual user use the following command replacing username with your user name:
(host) $ squeue -u username
This will return a list similar to the following:
JOBID |
PARTITION |
NAME |
USER |
ST |
TIME |
NODES |
NODELIST(REASON) |
|---|---|---|---|---|---|---|---|
51610182 |
icelake |
VLA-Combining-Images |
dc-webs1 |
R |
5:12 |
1 |
cpu-q-79 |
51610183 |
icelake |
VLA-Basic-Imaging |
dc-webs1 |
PD |
0:00 |
1 |
JOBID is the id assigned to the job by slurm
PARTITION describes which partition the job is scheduled to be run on, icelake in the above example.
NAME gives the name of the script as defined using the
#SBATCH -JcommandUSER is the name of the user who submitted the job, dc-webs1 in the above example
ST is the state of the job. The most common states are R if the job is running, PD if the job is pending, F if the job has failed and CD if the job completed successfully
TIME is the amount of time used by the process
NODES is the number of nodes allocated to the job
NODELIST(REASON) lists the names of the nodes allocated to the job
Controlling Jobs
To cancel a job use the following command replacing jobid with the identifier of the job to be cancelled:
(host) $ scancel jobid