==== SLURM Reservations ====

SLURM provides "reservations" of compute nodes. A reservation sets aside a node (or nodes) for your exclusive use. If you need a node free of other users' jobs, or need a node for an extended period of time, we can reserve the node (or nodes) for your use only.
 +
=== Getting a Reservation ===

Reservations for nodes can be requested by sending an email to <cshelpdesk@virginia.edu>. We can reserve an entire compute node or a node group (e.g. "cortado01" or "cortado[01-05]").
 +
In your request for a reservation, please be specific about:

  - Which servers are needed (e.g. "cortado[01-05]")
  - The start and end time of the reservation (e.g. today through Dec 31st)
  - Who can use the reservation (a list of user IDs)
 +
//Note: the requested node must be free of running jobs before a reservation can take effect. If you request a reservation on a node where one of your jobs is running, we will create the reservation, but it will not take effect until that job completes. Please cancel (''%%scancel%%'') your job before requesting a reservation.//
 +
//Note: nodes may be reserved for a maximum of fourteen days, and reservations may be extended in one-week increments. Reservations that are not being used (no jobs running) will be removed.//
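
To find and cancel your own running jobs before requesting a reservation, you can use ''%%squeue%%'' and ''%%scancel%%''. A sketch (the job ID ''12345'' and the output line are illustrative, not real cluster output):

<code>
[abc1de@portal03 ~]$ squeue -u abc1de      # list your running and pending jobs
  JOBID PARTITION   NAME   USER ST   TIME NODES NODELIST(REASON)
  12345      main  myjob abc1de  R  10:02     1 cortado01
[abc1de@portal03 ~]$ scancel 12345         # cancel the job by its job ID
</code>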
 +
=== Using reservations when submitting jobs ===

If you are granted a reservation, you will receive a reservation tag: a short string that is required to submit jobs. This string must be included in any ''%%srun%%'' or ''%%sbatch%%'' submission via the flag ''%%--reservation=...%%''.

If your tag is "abc1de_4", then you would submit jobs using the flag ''%%--reservation=abc1de_4%%''.
 +
This can be done at the command line:

<code>
[abc1de@portal03 ~]$ sbatch --reservation=abc1de_4 slurm.sh
...
[abc1de@portal03 ~]$ srun --reservation=abc1de_4 slurm.sh
</code>
 +
Or you can include the flag in the header of your sbatch file:

<code>
[abc1de@portal03 ~]$ cat reservation.sh
#!/bin/bash
#SBATCH --job-name="Reservation Example"
#
#SBATCH --error="stderr.txt"
#SBATCH --output="stdout.txt"
#
#SBATCH --reservation=abc1de_4               <- Include reservation tag
...
[abc1de@portal03 ~]$ sbatch reservation.sh
</code>
 +
If the reserved node is in a queue ("partition") other than the default "main" partition, you must also specify the partition on the command line or in your job script:

<code>
[abc1de@portal03 ~]$ sbatch --reservation=abc1de_4 -p gpu slurm.sh
...
[abc1de@portal03 ~]$ srun --reservation=abc1de_4 -p gpu slurm.sh
</code>
 +
=== Listing Reservations ===

You can see a listing of all active reservations with the ''%%scontrol%%'' command:

<code>
[abc1de@portal03 ~]$ scontrol show reservation
ReservationName=abc1de_4 StartTime=2019-06-25T00:00:00 EndTime=2019-07-02T16:00:00
   Nodes=slurm1 NodeCnt=1 CoreCnt=12 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=24
   Users=abc1de Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
</code>
 +
=== Avoiding Default Cluster Limits with Reservations ===

The CS cluster has a default SLURM QoS named ''csdefault'' which limits the number of concurrent jobs a user may run. Jobs submitted to a reservation may bypass this limit by using the QoS ''csresnolim'', specified in ''srun'' commands or ''sbatch'' scripts with the parameter ''%%-q csresnolim%%'' or ''%%--qos=csresnolim%%''.

This QoS can only be used together with a reservation; that is, you must also include ''%%--reservation=abc1de_4%%'' (your own reservation tag) in your ''srun'' commands or ''sbatch'' scripts.
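
For example, combining both flags on one submission (the tag ''abc1de_4'' is illustrative; substitute your own):

<code>
[abc1de@portal03 ~]$ sbatch --reservation=abc1de_4 --qos=csresnolim slurm.sh
</code>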
=== Deleting idle reservations ===

Computing resources are finite, and demand for nodes, especially the higher-end GPU nodes, is considerable. Reserving a node denies that resource to everyone else. If the support staff see that a reserved node has sat idle for a few hours, they will delete the reservation.
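
To check whether any jobs are currently running under your reservation (again using the illustrative tag ''abc1de_4''), you can filter ''%%squeue%%'' by reservation name:

<code>
[abc1de@portal03 ~]$ squeue --reservation=abc1de_4
</code>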
  
compute_slurm_reservations.txt · Last modified: 2024/02/26 17:33 by 127.0.0.1