compute_slurm [2021/03/05 17:06] pgh5a
compute_slurm [2024/03/15 17:20] (current) – external edit 127.0.0.1
Line 1: Line 1:
 +==== Scheduling a Job using the SLURM job scheduler ====
 +
 +The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]].  The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue. The job scheduler looks at the requirements stated in the job's script and allocates to the job a server (or servers) which matches the requirements specified in the job script. For example, if the job script specifies that a job needs 192GB of memory, the job scheduler will find a server with at least that much memory free.
 +
 +The job scheduler supports a direct login option (see below) that allows direct interactive logins to servers controlled by the scheduler, without the need for a job script.
 +
 +===== Updates =====
 +  * As of 26-Sep-2023, a default QoS was configured on all SLURM partitions in the CS cluster, **limiting the number of concurrent jobs per user to 64**.
 +    * Reservations may circumvent this limit by using the QoS ''csresnolim'' in ''srun'' commands or ''sbatch'' scripts, via the parameter ''%%-q csresnolim%%'' or ''%%--qos=csresnolim%%''. See [[compute_slurm_reservations|SLURM Reservations]] for more details.
 +    
 +  * As of 09-Jan-2024, the SLURM job scheduler was updated to allow for GPU resource tracking. GPUs are now trackable resources, which enables flags such as ''%%-G, --gpus=[type:]<number>%%'', ''%%--gpus-per-node=[type:]<number>%%'', and others. For example, use ''%%-p gpu --gpus=nvidia_titan_x:1 ...%%'' to allocate a node with 1 Nvidia Titan X card. To see which GPUs a node has, run ''%%scontrol show node <node name>%%'' or consult our [[compute_resources|Computing resources]] page.
 +----
 +
 +==== Using SLURM ====
 +**[[https://slurm.schedmd.com/pdfs/summary.pdf|Slurm Commands Cheat Sheet]]**
 +
 +The SLURM commands below are ONLY available on the portal cluster of servers. They are not installed on the gpusrv* or the SLURM controlled nodes themselves.
 +
 +----
 +
 +==== Information Gathering ====
 +
 +To view information about compute nodes in the SLURM system, use the command ''sinfo''.
 +
 +<code>
 +[abc1de@portal04 ~]$ sinfo
 +PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 +main*        up   infinite      3  drain falcon[3-5]
 +main*        up   infinite     27   idle cortado[01-10],falcon[1-2,6-10],lynx[08-12],slurm[1-5]
 +gpu          up   infinite      4    mix ai[02-03,05],lynx07
 +gpu          up   infinite      1  alloc ai04
 +gpu          up   infinite     12   idle ai[01,06],lynx[01-06],ristretto[01-04]
 +</code>
 +
 +With ''sinfo'' we can see a listing of the job queues or "partitions" and a list of nodes associated with these partitions.  A partition is a grouping of nodes: for example, our //main// partition is a group of all general purpose nodes, and the //gpu// partition is a group of nodes that each contain GPUs.  Sometimes a host is listed in two or more partitions.
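 +
 +As a rough sketch of how the ''sinfo'' listing can be narrowed down, the helper below prints only the idle nodes of one partition. The function name ''idle_nodes'' is an illustration, not part of SLURM; it relies on the standard ''sinfo'' options ''-h'', ''-N'', ''-p'', ''-t'' and ''-o''.
 +
```shell
#!/bin/bash
# Hypothetical helper: list the idle nodes of a partition, one per line.
# Usage: idle_nodes [partition]   (assumes "main" when no partition is given)
idle_nodes() {
    local partition="${1:-main}"
    # -h: no header, -N: one line per node, -p: partition,
    # -t idle: only idle nodes, -o "%N": print the node name only
    sinfo -h -N -p "$partition" -t idle -o "%N"
}
```
 +
 +For example, ''idle_nodes gpu'' would list the idle nodes of the //gpu// partition; with no argument it assumes //main//.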
 +
 +To view jobs running on the queue, we can use the command ''squeue''. Say we have submitted one job to the main partition; running ''squeue'' will look like this:
 +
 +<code>
 +abc1de@portal01 ~ $ squeue
 +             JOBID PARTITION     NAME     USER    ST     TIME  NODES NODELIST(REASON)
 +            467039      main    my_job    abc1de  R      0:06      1 ai01
 +</code>
 +
 +Now that a node has been allocated, that node (''ai01'') will show as //alloc// in ''sinfo'':
 +
 +<code>
 +abc1de@portal01 ~ $ sinfo
 +PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
 +main*            up   infinite     37   idle hermes[1-4],artemis[2-7],slurm[1-5],nibbler[1-4],trillian[1-3],granger[1-6],granger[7-8],ai0[1-6]
 +main*            up   infinite      1  alloc ai01
 +qdata            up   infinite      8   idle qdata[1-8]
 +qdata-preempt    up   infinite      8   idle qdata[1-8]
 +falcon           up   infinite     10   idle falcon[1-10]
 +intel            up   infinite     24   idle artemis7,slurm[1-5],granger[1-6],granger[7-8],nibbler[1-4],ai0[1-6]
 +amd              up   infinite     13   idle hermes[1-4],artemis[1-6],trillian[1-3]
 +</code>
 +
 +You can also see which resources (such as GPUs) a node has available by running the command ''scontrol show node <nodename>'':
 +
 +<code>
 +abc1de@portal01 ~$ scontrol show node ai10
 +NodeName=ai10 Arch=x86_64 CoresPerSocket=8
 +   CPUAlloc=0 CPUTot=32 CPULoad=0.01
 +   AvailableFeatures=(null)
 +   ActiveFeatures=(null)
 +   Gres=gpu:nvidia_geforce_gtx_1080:4
 +   NodeAddr=ai10 NodeHostName=ai10 Version=20.11.9
 +   OS=Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019
 +   RealMemory=128000 AllocMem=0 FreeMem=120054 Sockets=2 Boards=1
 +   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
 +   Partitions=gnolim
 +   BootTime=2023-02-15T10:01:04 SlurmdStartTime=2023-03-14T12:29:33
 +   CfgTRES=cpu=32,mem=125G,billing=32
 +   AllocTRES=
 +   CapWatts=n/a
 +   CurrentWatts=0 AveWatts=0
 +   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
 +   Comment=(null)
 +</code>
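 +
 +When only the GPU information is of interest, the ''Gres'' field can be filtered out of that output. The sketch below assumes the field appears on its own line as ''Gres=...'', as in the listing above; ''node_gres'' is a hypothetical helper name.
 +
```shell
#!/bin/bash
# Hypothetical helper: read "scontrol show node" output on stdin and
# print only the value of the Gres= field.
node_gres() {
    # strip leading whitespace and the "Gres=" label; print matching lines only
    sed -n 's/^[[:space:]]*Gres=//p'
}
```
 +
 +Run it as ''scontrol show node ai10 | node_gres''; for the node shown above this would print ''gpu:nvidia_geforce_gtx_1080:4''.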
 +
 +----
 +
 +==== Jobs ====
 +
 +To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller will then send your job to compute nodes for execution, after which your results will be returned. There is also a //direct login option// (see below) that doesn't require a job script.
 +
 +Users can submit SLURM jobs from ''portal.cs.virginia.edu''. You can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]].  Let's look at a very simple example script and ''sbatch'' command.
 +
 +Here is our script. All it does is print the hostname of the server running the script.  We add ''#SBATCH'' lines to the script to set the SLURM options for the job.
 +
 +<code bash>
 +#!/bin/bash
 +# --- this job will be run on any available node
 +# and simply output the node's hostname to
 +# my_job.output
 +#SBATCH --job-name="Slurm Simple Test Job"
 +#SBATCH --error="my_job.err"
 +#SBATCH --output="my_job.output"
 +echo "$HOSTNAME"
 +</code>
 +
 +We run the script with ''sbatch'' and the results will be put in the file we specified with ''%%--output%%''. If no output file is specified, output will be saved to a file named after the SLURM job ID (by default ''slurm-<jobid>.out'').
 +
 +<code>
 +[abc1de@portal04 ~]$ sbatch slurm.test
 +Submitted batch job 640768
 +[abc1de@portal04 ~]$ more my_job.output
 +cortado06
 +</code>
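 +
 +If a script needs the job ID for later commands, one option is the ''%%--parsable%%'' flag of ''sbatch'', which prints just the ID instead of the "Submitted batch job" message. The wrapper below is a minimal sketch; the name ''submit_job'' is hypothetical.
 +
```shell
#!/bin/bash
# Hypothetical wrapper: submit a job script and print only its job ID.
# Usage: submit_job <script> [sbatch options...]
submit_job() {
    local out
    # --parsable prints the job ID, followed by ";cluster" on multi-cluster setups
    out="$(sbatch --parsable "$@")" || return 1
    # drop the optional ";cluster" suffix
    echo "${out%%;*}"
}
```
 +
 +A possible use is ''%%jobid=$(submit_job slurm.test) && squeue -j "$jobid"%%''.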
 +
 +Here is a similar example using ''srun'' running on multiple nodes:
 +
 +<code>
 +abc1de@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
 +slurm4
 +slurm1
 +slurm2
 +slurm3
 +slurm5
 +</code>
 +
 +If the node to be used is NOT in the main (default) "partition" (or queue), then you must specify the partition in your job script:
 +
 +<code bash>
 +#!/bin/bash
 +# --- this job will be run on any available node in the "gpu" partition
 +# and simply output the node's hostname to
 +# my_job.output
 +#SBATCH --job-name="Slurm Simple Test Job"
 +#SBATCH --error="my_job.err"
 +#SBATCH --output="my_job.output"
 +# --- specify the partition (queue) name
 +#SBATCH --partition="gpu"
 +echo "$HOSTNAME"
 +</code>
 +
 +If you are trying to use a node that is not in the default partition, and you don't specify the partition in your job script, you will get a message from ''srun'' saying "queued and waiting for resources", but the job will not start.
 +
 +----
 +
 +==== Direct login to servers (without a job script) ====
 +
 +You can use ''srun'' to log in directly to a server controlled by the SLURM job scheduler.  This can be useful for debugging purposes as well as running your applications without using a job script.
 +
 +We must pass the ''%%--pty%%'' option to ''srun'' so output is directed to a pseudo-terminal. 
 +
 +For example, to open a direct login job on the node "cortado04", use:
 +
 +<code>
 +abc1de@portal ~$ srun -w cortado04 --pty bash -i -l -
 +abc1de@cortado04 ~$ hostname
 +cortado04
 +abc1de@cortado04 ~$
 +</code>
 +
 +The ''%%-w%%'' argument selects the server to log in to. The ''%%-i%%'' argument tells ''bash'' to run as an interactive shell.  The ''%%-l%%'' argument tells ''bash'' that this is a login shell. This, along with the final ''%%-%%'', is important for resetting environment variables that might otherwise cause issues when using [[linux_environment_modules|Environment Modules]].
 +
 +If a node is in a partition (see below for partition information) other than the default "main" partition (for example, the "gpu" partition), then you //must// specify the partition in your command, for example:
 +<code>
 +abc1de@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
 +</code>
 +
 +If you are using a reservation, you must specify the ''%%--reservation=<reservationname>%%'' option to srun. Be sure to use ''%%-p <partition name>%%'' if the node you want to use is in ''gpu, nolim, gnolim'' or any other non-default partition.
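 +
 +The options described in this section can be combined mechanically. The following sketch only prints the resulting ''srun'' command rather than running it; ''login_cmd'' and the reservation name ''myres'' used in the example are hypothetical.
 +
```shell
#!/bin/bash
# Hypothetical helper: print the srun command for a direct login.
# Usage: login_cmd <node> [partition] [reservation]
login_cmd() {
    local node="$1" partition="${2:-}" reservation="${3:-}"
    local cmd="srun -w $node"
    # non-default partitions must be named explicitly
    if [ -n "$partition" ]; then cmd="$cmd -p $partition"; fi
    # reservations are selected with --reservation=<name>
    if [ -n "$reservation" ]; then cmd="$cmd --reservation=$reservation"; fi
    # interactive login shell on a pseudo-terminal, as described above
    echo "$cmd --pty bash -i -l -"
}
```
 +
 +''login_cmd lynx05 gpu'' prints ''%%srun -w lynx05 -p gpu --pty bash -i -l -%%''; adding a third argument such as ''myres'' inserts ''%%--reservation=myres%%''.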
 +
 +----
 +
 +==== Terminating Jobs ====
 +
 +Please be aware of jobs you start and make sure that they finish executing.  If your job does not exit gracefully, it will continue running on the server, taking up resources and preventing others from running their jobs.
 +
 +To cancel a running job, use the ''scancel [jobid]'' command:
 +
 +<code>
 +abc1de@portal01 ~ $ squeue
 +             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 +            467039      main    sleep    abc1de  R       0:06      1 artemis1           <--  Running job
 +abc1de@portal01 ~ $ scancel 467039
 +</code>
 +
 +The default signal sent to a running job is SIGTERM (terminate). If you wish to send a different signal to the job's processes (for example, a SIGKILL, which is often needed if a SIGTERM doesn't terminate the process), use the ''%%--signal%%'' argument to ''scancel'', for example:
 +<code>
 +abc1de@portal01 ~ $ scancel --signal=KILL 467039
 +</code>
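 +
 +The escalation from SIGTERM to SIGKILL can be scripted. The sketch below assumes that a job which no longer appears in ''squeue'' has ended; ''cancel_job'' and the 30-second default grace period are illustrative choices, not site policy.
 +
```shell
#!/bin/bash
# Hypothetical helper: cancel a job, then escalate to SIGKILL if it is
# still listed after a grace period.
# Usage: cancel_job <jobid> [grace_seconds]
cancel_job() {
    local jobid="$1" grace="${2:-30}"
    # default signal (SIGTERM)
    scancel "$jobid"
    sleep "$grace"
    # squeue -h prints no header, so empty output means the job left the queue
    if [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; then
        scancel --signal=KILL "$jobid"
    fi
}
```
 +
 +For instance, ''cancel_job 467039 60'' would wait one minute before sending SIGKILL.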
 +
 +----
 +
 +==== Queues/Partitions ====
 +
 +Slurm refers to job queues as //partitions//. We group similar systems into separate queues. For example, there is a "main" queue for general purpose systems, and a "gpu" queue for systems with GPUs. These queues can have unique constraints such as compute nodes, max runtime, resource limits, etc.
 +
 +If no partition is specified in your job script or when using the ''srun'' command, it will go to the default partition "main".
 +
 +The "main" and "gpu" partitions have a time limit set, so jobs will terminate after a specified number of days, as shown in the output of the ''sinfo'' command. However, there are two additional partitions, "nolim" and "gnolim", that have a time limit of 20 days, which is effectively unlimited time.
 +
 +The partition is specified with ''%%-p partname%%'' or ''%%--partition partname%%''.
 +
 +To specify a partition in an ''sbatch'' script:
 +
 +<code>
 +#SBATCH --partition=gpu
 +</code>
 +
 +Or from the command line with ''srun'':
 +
 +<code>
 +-p gpu
 +</code>
 +
 +Here is an example that runs the command ''hostname'' on the //main// partition. It will run on any node in the partition:
 +
 +<code>
 +srun -p main hostname
 +</code>
 +
 +----
 +
 +==== Long Running Jobs ====
 +
 +If a job is expected to run longer than the default time limit for a given partition, two other partitions with unlimited runtime, ''nolim'' and ''gnolim'', exist for jobs that have long runtimes.
 +
 +The partition ''nolim'' is for long running jobs that do not require a GPU, and ''gnolim'' is for long running jobs that require a GPU.
 +
 +To utilize these partitions, simply specify the name of the partition in ''srun'' or ''sbatch''.
 +
 +----
 +
 +==== Using GPUs ====
 +
 +Slurm handles GPUs and other non-CPU computing resources using what are called [[https://slurm.schedmd.com/gres.html|GRES]] (Generic Resources).  To use the GPU(s) on a system, with either ''sbatch'' or ''srun'', you must request them using the ''%%--gres=gpu:<count>%%'' option: the ''gres'' flag, followed by the resource name ''gpu'', a '':'', and the quantity of resources.
 +
 +Say we want to use 4 GPUs on a system. We would use the following ''sbatch'' option:
 +
 +<code>
 +#SBATCH --gres=gpu:4
 +</code>
 +
 +Or from the command line:
 +
 +<code>
 +--gres=gpu:4
 +</code>
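 +
 +Putting the pieces together, a complete GPU job script might look like the sketch below. It assumes the //gpu// partition and relies on SLURM exporting ''CUDA_VISIBLE_DEVICES'' for jobs that request GPUs, which is the usual behaviour but depends on the cluster configuration; the job and file names are made up.
 +
```shell
#!/bin/bash
# Sketch of a GPU job script; job name and output file are illustrative.
#SBATCH --job-name="gpu_test"
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output="gpu_test.output"

# Assumption: SLURM sets CUDA_VISIBLE_DEVICES for GRES GPU jobs.
# Fall back to "none" when it is absent (for example, outside a job).
gpu_ids="${CUDA_VISIBLE_DEVICES:-none}"
echo "host=$HOSTNAME gpus=$gpu_ids"
```
 +
 +Submitted with ''sbatch'', the script writes one line to ''gpu_test.output'' naming the node and the GPU index (or indices) made visible to the job.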
 +
 +----
 +
 +==== Reservations ====
 +
 +Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>. For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].
 +
 +----
 +
 +==== Job Accounting ====
 +The SLURM scheduler implements the accounting features of SLURM, so users can run the ''sacct'' command to find job accounting information such as job IDs, job names, the partition a job ran on, allocated CPUs, job state, and exit codes. Numerous other options are supported; type ''man sacct'' on portal to see them all.
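 +
 +As an illustration, the fields listed above can be requested explicitly with the ''%%--format%%'' option of ''sacct''. The wrapper name ''job_summary'' is hypothetical; the field names are standard ''sacct'' fields.
 +
```shell
#!/bin/bash
# Hypothetical wrapper: show the accounting fields mentioned above for one job.
# Usage: job_summary <jobid>
job_summary() {
    sacct -j "$1" --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode
}
```
 +
 +For example, ''job_summary 640768'' would report on the job submitted earlier on this page.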
 +
 +----
 +
 +==== Using "module" to load software in a job ====
 +
 +Because ''sbatch'' spawns its bash session as a non-login shell, initialization files are not loaded from ''/etc/profile.d''. This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system, so you will not be able to load software modules.
 +
 +To fix this, simply include the following line in your sbatch scripts:
 +
 +<code bash>
 +source /etc/profile.d/modules.sh
 +</code>
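 +
 +A complete job script using this fix might look as follows. This is a sketch: the job name and output file are invented, and the commented ''module load'' line is a placeholder for whatever module the job needs.
 +
```shell
#!/bin/bash
# Sketch of a job script that initializes Environment Modules first.
#SBATCH --job-name="module_test"
#SBATCH --output="module_test.output"

# sbatch starts a non-login shell, so load the modules setup by hand
if [ -r /etc/profile.d/modules.sh ]; then
    source /etc/profile.d/modules.sh
fi

# "module" is a shell function once initialized; check before using it
if command -v module >/dev/null 2>&1; then
    module_status="available"
    # module load <name>   # placeholder: pick a name from "module avail"
    module list
else
    module_status="missing"
    echo "Environment Modules not initialized on $HOSTNAME" >&2
fi
```
 +
 +The check around ''module'' is optional; it only makes the failure explicit in the job's error output if the initialization file is absent on a node.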