--- compute_slurm [2020/08/26 20:59] – [Information Gathering] pgh5a
+++ compute_slurm [2025/11/05 23:04] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+====== CS Resource Manager (Job Scheduler) ======
+===== General Information =====
+The UVA Computer Science department utilizes //**(__[[https://slurm.schedmd.com/quickstart.html |SLURM]]__)**// to manage most server resources. {{ wiki:slurm_logo.png?nolink&0x256|slurm Logo}}
+Slurm acts as the "job scheduler". A job scheduler allocates computational resources (individual servers) to users who submit **job(s)** to a queue. The scheduler reads the requirements stated in the job's **command** or **script** and allocates server(s) that match them. For example, if a job script specifies that a job needs 64GBs of memory, the scheduler will find a server with at least that much memory free.
+==== Terminology & Important General Information ====
+  * Servers managed by the **slurm scheduler** are referred to as **nodes**, **slurm nodes**, or **compute nodes**
+  * A collection of nodes controlled by the **slurm scheduler** is referred to as a **cluster**
+  * **tasks** in slurm can be considered as individual processes
+  * **CPUs** in slurm can be considered as individual cores on a processor
+    * A job that allocates a single CPU is a single-process program within a single task
+    * A job that allocates multiple CPUs on the same node is a multicore program within a single task
+    * A job that allocates CPU(s) on multiple nodes (a distributed program) runs one task on each node
+  * **GPUs** in slurm are referred to as a Generic Resource (**GRES**)
+    * Using a specific **GRES** requires specifying the string associated with the GRES
+    * For example, using ''%%--gres=gpu:1%%'' or ''%%--gpus=1%%'' will allocate the first available GPU, regardless of type
+    * Using ''%%#SBATCH --gres=gpu:1%%'' with ''%%--constraint="a100_40gb"%%'' will require that an A100 GPU with 40GBs of memory be used for your job
+  * **SSH logins to slurm nodes are disabled**. Interactive jobs are required for accessing a server's command line
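+As a rough illustration of the GRES bullets above, the following batch script header requests one GPU and narrows it to a 40GB A100. The ''%%gpu%%'' partition and the ''%%a100_40gb%%'' feature come from the tables on this page; the 30 minute time limit, the echoed message, and the reliance on ''%%CUDA_VISIBLE_DEVICES%%'' (commonly exported by slurm for GPU jobs, assumed here) are illustrative only.

```shell
#!/bin/bash
# Partition that holds the GPU nodes.
#SBATCH -p gpu
# One GPU of any type ...
#SBATCH --gres=gpu:1
# ... narrowed to a node that advertises the a100_40gb feature.
#SBATCH --constraint="a100_40gb"
# Illustrative 30 minute wall time.
#SBATCH -t 00:30:00

# Assumption: slurm exports CUDA_VISIBLE_DEVICES for GPU allocations.
# Outside of a job the variable is unset, so fall back to "none".
gpu_list="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpu_list}"
```

+Dropping the ''%%--constraint%%'' line would accept the first available GPU of any type.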
+==== Environment Overview ====
+The slurm scheduler runs as a service or daemon process named ''%%slurmctld%%'' on a single designated non-compute node. On each individual compute node, a daemon process named ''%%slurmd%%'' is running that the scheduler service ''%%slurmctld%%'' communicates with. This allows the scheduler to assign and control jobs on individual nodes and perform status checks.
+Head login nodes such as ''%%portal%%'' then will contact ''%%slurmctld%%'' when commands such as ''%%sinfo%%'', ''%%salloc%%'', and ''%%sbatch%%'' are invoked to define a job and required resources to run a program. In turn, ''%%slurmctld%%'' will contact the appropriate ''%%slurmd%%'' on a node that has the resources required for a job, and then will queue the job to run on the node when the resources become available.
+Software modules are made available throughout the CS environment, including slurm nodes. Please see the CS wiki about **__[[software_modules|Software Modules]]__** for further details.
+{{wiki:slurm_topology.png?nolink&0x512|slurm Topology}}
+----
+===== Updates & Announcements =====
+  * **January 2025**
+    * **Updates**
+      * Resource restrictions for each partition have been added, replacing a concurrent running job limit. Further, jobs are now assigned a default time limit
+        * See the ([[compute_slurm#resource_limits|Resource Limits Section]]) for more details
+      * Reservations may circumvent this limit by using the QoS ''csresnolim'' in ''srun'' commands or ''sbatch'' scripts, via the parameter ''%%-q csresnolim%%'' or ''%%--qos=csresnolim%%''. See the section regarding **reservations**
+      * Jobs that do not specify a time limit with ''%%-t%%'' or ''%%--time%%'' will have a maximum default time limit set. These vary based on the partition a job is run in
+        * See the ([[compute_slurm#resource_limits|Resource Limits Section]]) for more details
+    * **Nodes Added**
+      * 1 GPU node with 126 cores (AMD Epyc 9534), 500GBs of memory, and includes 1x Nvidia H100 NVL 94GB GPU with 94GBs of VMEM
+  * **June 2025**
+    * **Nodes Added**
+      * 4 GPU nodes EACH with 64 cores (AMD Epyc 9354), 1.5TBs of memory, and 2x Nvidia H100 NVL GPUs with 94GBs of VMEM
+      * 1 CPU node with 32 cores (AMD Epyc 7313p), and 124GBs of memory
+  * **July 2025**
+    * **Nodes Moved**
+      * 1 CPU node with 22 cores and 500GBs of CPU memory has been moved to ''%%nolim%%''
+      * 1 CPU node with 22 cores and 250GBs of CPU memory has been moved to ''%%nolim%%''
+      * 3 CPU nodes with 46 cores and 500GBs of CPU memory have been moved to ''%%nolim%%''
+  * **October 2025**
+    * **Node(s) Updated**
+      * 1 GPU node has been upgraded to have 4x Nvidia RTX 4000 Ada Generation GPUs with 20GBs of VMEM, up from a 2080TI which had 11GBs of VMEM
+      * 1 CPU node has been returned to 1.5TBs of CPU memory after being lowered to 1.4TBs
+      * 1 GPU node has been upgraded to have 4x RTX 2080TIs, up from 1x RTX 1080 and 2x RTX 1080TIs
+      * 1 GPU node has been upgraded to have 4x RTX 2080TIs, up from 4x RTX 1080TIs
+      * 1 GPU node has been upgraded to have 4x RTX 2080TIs, up from 3x Titan X and 1 GeForce GTX 1080
+      * 6 GPU nodes have been upgraded to have 4x RTX 2080TIs, up from 4x RTX 1080TIs
+    * **Node(s) Removed - Hardware Failure**
+      * 1 GNoLim node with 18 cores, 58GBs of CPU memory, and 1 Titan X GPU
+----
+===== Resources Available =====
+The tables below describe the resources available by partition name in the CS slurm cluster.
+**Each column lists values on a per-node or per-resource basis**.
+At least two CPUs are reserved for the system on each node.
+==== cpu Partition Nodes ====
+^ Nodes ^ CPU \\ Type ^ #CPUs \\ Total ^ #CPU \\ Sockets ^ #Cores \\ /Socket ^ #Threads \\ /Core ^ CPU \\ Mem(GB) ^ Features ^
+| ''%%bigcat[01-06]%%'' | Intel | 62 | 2 | 16 | 2 | 1500 | skylake |
+| ''%%affogato03%%'' | Intel | 14 | 1 | 8 | 2 | 96 | skylake |
+| ''%%affogato02%%'' | Intel | 62 | 2 | 16 | 2 | 122 | skylake |
+| ''%%affogato[06-10]%%'' | Intel | 14 | 1 | 8 | 2 | 125 | skylake |
+| ''%%struct[01-09]%%'' | Intel | 26 | 1 | 14 | 2 | 125 | broadwell |
+| ''%%panther01%%'' | Intel | 14 | 1 | 8 | 2 | 500 | skylake |
+| ''%%lynx[08-09]%%'' | Intel | 30 | 2 | 8 | 2 | 62 | skylake |
+| ''%%affogato01%%'' | Intel | 30 | 1 | 16 | 2 | 125 | skylake |
+| ''%%affogato[04-05]%%'' | Intel | 30 | 2 | 8 | 2 | 125 | skylake |
+| ''%%cortado[01-10]%%'' | Intel | 46 | 2 | 12 | 2 | 500 | skylake |
+| ''%%hydro%%'' | Intel | 62 | 2 | 16 | 2 | 250 | skylake |
+| ''%%puma01%%'' | Intel | 158 | 2 | 40 | 2 | 252 | icelake |
+| ''%%sdscpu01%%'' | AMD | 30 | 1 | 16 | 2 | 125 | amd_epyc_7313p |
+==== gpu Partition Nodes ====
+All available GPU cards are manufactured by **Nvidia**. No **AMD** GPUs are available.
+**Each column lists values on a per-node or per-resource basis**.
+At least two CPUs are reserved for the system on each node.
+^ Nodes ^ CPU \\ Type ^ #CPUs \\ Total ^ #CPU \\ Sockets ^ #Cores \\ /Socket ^ #Threads \\ /Core ^ CPU \\ Mem(GB) ^ GPU \\ Type ^ #GPUs ^ GPU \\ Mem(GB) ^ Features ^
+| ''%%affogato11%%'' | Intel | 30 | 2 | 8 | 2 | 125 | GeForce RTX 2080ti | 4 | 11 | rtx_2080ti, \\ broadwell |
+| ''%%affogato[13-15]%%'' | Intel | 30 | 2 | 8 | 2 | 125 | GeForce GTX 1080Ti | 4 | 11 | gtx_1080ti, \\ broadwell |
+| ''%%ai[01-04]%%'' | Intel | 30 | 2 | 8 | 2 | 62 | GeForce RTX 2080ti | 4 | 11 | rtx_2080ti, \\ broadwell |
+| ''%%ai06%%'' | Intel | 30 | 2 | 8 | 2 | 62 | GeForce RTX 2080ti | 3 | 11 | rtx_2080ti, \\ broadwell |
+| ''%%lynx[02-04]%%'' | Intel | 30 | 2 | 8 | 2 | 60 | GeForce GTX 1080Ti | 4 | 11 | gtx_1080ti, \\ broadwell |
+| ''%%lynx10%%'' | Intel | 30 | 2 | 8 | 2 | 60 | GeForce RTX 2080ti | 4 | 11 | rtx_2080ti, \\ broadwell |
+| ''%%lynx11%%'' | Intel | 30 | 2 | 8 | 2 | 60 | GeForce RTX 2080ti | 4 | 11 | rtx_2080ti, \\ broadwell |
+| ''%%lynx01%%'' | Intel | 30 | 2 | 8 | 2 | 60 | Titan X | 4 | 12 | titan_x, \\ broadwell |
+| ''%%lynx[05-07]%%'' | Intel | 30 | 2 | 8 | 2 | 60 | Tesla P100 | 4 | 16 | tesla_p100, \\ broadwell |
+| ''%%cheetah02%%'' | Intel | 70 | 2 | 18 | 2 | 1000 | RTX 4000 Ada Generation | 4 | 20 | rtx_4000_ada, \\ skylake |
+| ''%%cheetah03%%'' | Intel | 70 | 2 | 18 | 2 | 1000 | GeForce RTX 2080ti | 2 | 11 | rtx_2080ti, \\ skylake |
+| ''%%adriatic[01-06]%%'' | Intel | 30 | 2 | 8 | 2 | 1000 | Quadro RTX 4000 | 4 | 8 | rtx_4000, \\ skylake |
+| ''%%jaguar05%%'' | Intel | 14 | 1 | 8 | 2 | 250 | Quadro RTX 4000 | 4 | 8 | rtx_4000, \\ skylake |
+| ''%%lotus%%'' | Intel | 78 | 2 | 20 | 2 | 250 | Quadro RTX 6000 | 8 | 24 | rtx_6000, \\ skylake |
+| ''%%jaguar03%%'' | AMD | 222 | 2 | 56 | 2 | 1000 | RTX A4500 | 8 | 20 | a4500, \\ amd_epyc_7663 |
+| ''%%jaguar02%%'' | Intel | 30 | 2 | 8 | 2 | 1000 | A16 | 8 | 16 | a16, \\ icelake |
+| ''%%jaguar06%%'' | Intel | 46 | 2 | 12 | 2 | 122 | A40 | 2 | 48 | a40, \\ icelake |
+| ''%%jaguar01%%'' | Intel | 62 | 2 | 16 | 2 | 1000 | A40 **(1)*** | 4 | 48 | a40, \\ nvlink, \\ skylake |
+| ''%%jaguar04%%'' | Intel | 62 | 2 | 16 | 2 | 250 | A40 **(1)*** | 4 | 48 | a40, \\ nvlink, \\ icelake |
+| ''%%cheetah01%%'' | AMD | 28 | 2 | 8 | 2 | 250 | A100 | 4 | 40 | a100_40gb, \\ amd_epyc_7252 |
+| ''%%cheetah04%%'' | AMD | 254 | 2 | 64 | 2 | 1000 | A100 **(2)*** | 4 | 80 | a100_80gb, \\ nvlink, \\ amd_epyc_7742 |
+| ''%%cheetah[08-09]%%'' | Intel | 38 | 2 | 10 | 2 | 500 | RTX A4000 | 4 | 16 | a4000, \\ skylake |
+| ''%%serval03%%'' | AMD | 126 | 1 | 64 | 2 | 500 | H100 NVL | 1 | 94 | h100_94gb, \\ amd_epyc_9534 |
+| ''%%serval[06-09]%%'' | AMD | 62 | 1 | 32 | 2 | 1500 | H100 NVL | 2 | 94 | h100_94gb, \\ amd_epyc_9354 |
+==== nolim Partition Nodes ====
+**Each column lists values on a per-node or per-resource basis**.
+At least two CPUs are reserved for the system on each node.
+^ Nodes ^ CPU \\ Type ^ #CPUs \\ Total ^ #CPU \\ Sockets ^ #Cores \\ /Socket ^ #Threads \\ /Core ^ CPU \\ Mem(GB) ^ Features ^
+| ''%%heartpiece%%'' | Intel | 38 | 2 | 10 | 2 | 160 | skylake |
+| ''%%epona%%'' | Intel | 6 | 1 | 4 | 2 | 62 | skylake |
+| ''%%slurm1%%'' | Intel | 22 | 2 | 12 | 1 | 500 | haswell |
+| ''%%slurm5%%'' | Intel | 22 | 1 | 12 | 2 | 250 | haswell |
+| ''%%slurm[2-4]%%'' | Intel | 46 | 2 | 12 | 2 | 500 | haswell |
+==== gnolim Partition Nodes ====
+All available GPU cards are manufactured by **Nvidia**. No **AMD** GPUs are available.
+**Each column lists values on a per-node or per-resource basis**.
+At least two CPUs are reserved for the system on each node.
+^ Nodes ^ CPU \\ Type ^ #CPUs \\ Total ^ #CPU \\ Sockets ^ #Cores \\ /Socket ^ #Threads \\ /Core ^ CPU \\ Mem(GB) ^ GPU \\ Type ^ #GPUs ^ GPU \\ Mem(GB) ^ Features ^
+| ''%%ai[05,10]%%'' | Intel | 30 | 2 | 8 | 2 | 125 | GeForce GTX 1080 | 4 | 8 | gtx_1080, \\ skylake |
+| ''%%jinx[01-02]%%'' | Intel | 22 | 1 | 12 | 2 | 220 | GeForce GTX 1080 | 2 | 8 | gtx_1080, \\ haswell |
+| ''%%ai[07-08]%%'' | Intel | 30 | 2 | 8 | 2 | 125 | GeForce GTX 1080Ti | 4 | 11 | gtx_1080ti, \\ skylake |
+| ''%%ai[09]%%'' | Intel | 30 | 2 | 8 | 2 | 112 | GeForce GTX 1080Ti | 4 | 11 | gtx_1080ti, \\ skylake |
+| ''%%titanx[02-03]%%'' | Intel | 22 | 1 | 12 | 2 | 250 | Titan X | 1 | 12 | titan_x, \\ haswell |
+| ''%%titanx05%%'' | Intel | 18 | 1 | 10 | 2 | 58 | Titan X | 1 | 12 | titan_x, \\ broadwell |
+  * Listed features can be used with the ''%%--constraint%%'' flag when submitting a job
+    * For example, ''%%--constraint="a100_80gb&nvlink"%%'', will request A100 cards that share an NVLink
+  * (1)* Scaled to 192GB by pairing GPUs with NVLink technology that provides a maximum bi-directional bandwidth of 112GB/s between the paired A40s
+  * (2)* Scaled to 320GB by aggregating all four GPUs with NVLink technology
+==== Resource Limits ====
+Any number of jobs can be submitted to the cluster. However, the resources allocated to a user's concurrently running jobs are restricted by the per-partition limits listed below. Note, this does not apply to time limits.
+In other words, a user can have at most the amount(s) listed below allocated at any one time. Jobs that would push a user's running total past these limits are simply queued until other jobs complete.
+Regarding **Default Time Limits**, these are set when jobs do not specify a time limit with the ''%%-t%%'' or ''%%--time%%'' job parameters. For example, jobs submitted to the **cpu** partition that do not specify ''%%-t%%'' or ''%%--time%%'' will have a maximum WALL time of 2 hours.
+^ Partition ^ Max CPU Cores ^ Max Memory ^ Max GPUs ^ Default Time Limit ^ Max Time Limit ^
+| cpu | 400 | 4TBs | | 2 Hours | 4 days |
+| gpu | 400 | 4TBs | 40 | 2 Hours | 4 days |
+| nolim | 80 | 1TB | | 8 Hours | 20 days |
+| gnolim | 80 | 1TB | 20 | 8 Hours | 20 days |
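+A minimal sketch of a batch script header that sets an explicit time limit rather than relying on the partition default. The 12 hour value and the job name are placeholders chosen for this example; any value up to the partition maximum in the table above would work the same way.

```shell
#!/bin/bash
# cpu partition: 2 hour default and 4 day maximum, per the table above.
#SBATCH -p cpu
# Hypothetical job name, used only for this sketch.
#SBATCH -J timelimit-example
# Request 12 hours of wall time in d-hh:mm:ss form.
#SBATCH -t 0-12:00:00

# Record when the job body actually began running.
start_time="$(date)"
echo "Job started at ${start_time}"
```

+Without the ''%%-t%%'' line, this job would be limited to the 2 hour default of the **cpu** partition.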
+----
+===== Information Gathering =====
+Slurm produces a significant amount of information regarding node statuses and job statuses. These are a few of the commonly used tools for querying job data.
+==== Viewing Partitions ====
+Quick overview of all partitions and all nodes and respective statuses. Include ''%%-l%%'' for more details
+<code>
+~$ sinfo
+~$ sinfo -l
+</code>
+Take note of the **TIMELIMIT** column. This describes the maximum **WALL** time that a job may **run** for on a given node. It is formatted as ''%%days-hours:minutes:seconds%%'', which is shortened to ''%%d-hh:mm:ss%%''
+<code>
+~$ sinfo
+PARTITION   AVAIL   TIMELIMIT   NODES  STATE NODELIST
+gpu            up  4-00:00:00       1   idle  gpunode
+...
+</code>
+This shows a maximum WALL time of 4 days for any job run in the **gpu** partition.
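+For scripting around these values, the ''%%d-hh:mm:ss%%'' format can be converted to seconds with a small shell helper. This is only a sketch: ''%%slurm_time_to_seconds%%'' is a hypothetical name, not a slurm command, and it assumes the input is either ''%%d-hh:mm:ss%%'' or ''%%hh:mm:ss%%'' (other forms that slurm accepts, such as bare minutes, are not handled).

```shell
# Convert "d-hh:mm:ss" or "hh:mm:ss" to a number of seconds.
slurm_time_to_seconds() {
    t="$1"
    days=0
    # Split off the optional "d-" day prefix.
    case "$t" in
        *-*) days="${t%%-*}"; t="${t#*-}" ;;
    esac
    # awk reads "04" as decimal 4, avoiding the octal pitfall of shell arithmetic.
    echo "${days}:${t}" | awk -F: '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

slurm_time_to_seconds 4-00:00:00   # prints 345600 (4 days)
```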
+==== Viewing Job Queues ====
+To display all jobs that are running, queued, or in another state such as pending (PD)
+<code>
+~$ squeue
+</code>
+To display your jobs that are running, queued, or in another state such as pending (PD) (replace <userid> with your username)
+<code>
+~$ squeue -u <userid>
+</code>
+==== Node Status ====
+A full list of node states and symbol usage can be found on the **__[[https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES | SINFO webpage]]__**.
+At times, a reason can be viewed for why a node is in a certain state such as **DOWN** or **DRAINING/DRAINED** with ''%%sinfo -R%%''. When a node is administratively set to drain or be down, relevant information from an admin will be found here
+<code>
+~$ sinfo -R
+OR
+~$ sinfo --list-reasons
+REASON               USER      TIMESTAMP           NODELIST
+Not responding       userid    2024-04-16T09:53:47    node0
+Server Upgrade       userid    2024-04-16T09:53:47    node1
+</code>
+To view a specific node for all details including a state reason
+<code>
+~$ scontrol show node <nodename>
+</code>
+==== Job Status ====
+A full list of job states can be found on the **__[[https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | SQUEUE webpage]]__**.
+After submitting a job, be sure to check the state the job enters into. A job will likely enter a **PENDING (PD)** state while waiting for resources to become available, and then begin **RUNNING (R)** (replace <userid> with your username)
+<code>
+~$ squeue -u <userid>
+             JOBID PARTITION     NAME     USER   ST       TIME  NODES NODELIST(REASON)
+       cpu   myjob    userid  PD   00:00:64       1 (Resources)
+       cpu   myjob    userid  R    00:00:64       1 node0
+</code>
+==== Individual Node Details (scontrol) ====
+Most information found here can also be found via the **SINFO** command and the resource tables.
+To view full details about a given node in the cluster, including node state, reasons, resources allocated/available, and features
+<code>
+~$ scontrol show node <nodename>
+NodeName=node0 Arch=x86_64 CoresPerSocket=8
+   CPUAlloc=2 CPUEfctv=32 CPUTot=32 CPULoad=4.35
+   AvailableFeatures=(null)
+   ActiveFeatures=(null)
+   Gres=gpu:a100_40gb:4(S:0-1)
+   NodeAddr=node0 NodeHostName=node0 Version=22.05.9
+   OS=Linux
+   RealMemory=256000 AllocMem=0 FreeMem=148379 Sockets=2 Boards=1
+   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+   Partitions=gpu
+   BootTime=2024-03-22T16:11:55 slurmdStartTime=2024-03-24T17:36:28
+   LastBusyTime=2024-04-16T00:34:17
+   CfgTRES=cpu=32,mem=250G,billing=32,gres/gpu=4
+   AllocTRES=cpu=2
+   CapWatts=n/a
+   CurrentWatts=0 AveWatts=0
+   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
+</code>
+==== Job Accounting ====
+A distinction between active (running) and completed jobs is made as information for recently completed jobs is only available to ''%%scontrol%%'' for five minutes after the job completes.
+=== Active & Completed Jobs ===
+Utilize the command ''%%sacct%%'' to obtain details about your job(s). The amount of detail available is vast, so only a few options are shown here as examples. For a full list of available fields, visit the [[ https://slurm.schedmd.com/sacct.html#OPT_helpformat | sacct webpage]].
+Note, time output is formatted as ''%%days-hours:minutes:seconds%%'' and shortened to ''%%d-hh:mm:ss%%''. For example, a job with ''%%3-04:05:16%%'' has been running for three days, four hours, five minutes, and sixteen seconds.
+To query for all of your recently completed or actively running jobs
+<code>
+~$ sacct -o "jobid,jobname,state,exitcode,elapsed"
+JobID           JobName      State ExitCode    Elapsed
+------------ ---------- ---------- -------- ----------
+              myjob  RUNNING        0:0   00:05:02
+.0            myjob  RUNNING        0:0   00:05:02
+</code>
+To query for a single completed or active job, include ''%%-j <jobid>%%''
+<code>
+~$ sacct -j <jobid> -o "jobname,state,exitcode,elapsed"
+   JobName      State ExitCode    Elapsed
+---------- ---------- -------- ----------
+     myjob  COMPLETED      0:0   00:30:02
+     myjob  COMPLETED      0:0   00:30:02
+</code>
+----
+==== Utilization Metrics ====
+=== CPU Utilization ===
+To view utilization details of a **completed job** such as CPU and RAM utilization
+<code>
+~$ seff <jobid>
+Job ID: 123456
+Cluster: cs
+User/Group: userid/group
+State: COMPLETED (exit code 0)
+Nodes: 1
+Cores per node: 2
+CPU Utilized: 00:00:00
+CPU Efficiency: 0.00% of 00:01:42 core-walltime
+Job Wall-clock time: 00:00:51
+Memory Utilized: 4.88 MB
+Memory Efficiency: 0.06% of 8.00 GB
+</code>
+=== GPU Utilization ===
+There are no built-in methods in slurm to easily check GPU utilization metrics. Instead, the following can be run to view GPU usage for an actively running job
+<code>
+~$ srun --pty --overlap --jobid=<jobid> nvidia-smi
+</code>
+----
+===== Reservations =====
+Slurm reservations are made available when dedicated usage is needed for a subset of nodes and their respective resources from the cluster.
+Computing resources are finite, and there is considerable demand for nodes, especially the higher end GPU nodes. __Reserving a node means that you are not allowing anyone else to use the node, denying that resource to others__. So if the support staff sees that a reserved node has sat idle for a few hours, they will delete the reservation. **As a result, reservations are generally discouraged unless one is absolutely necessary, and all resources will be used throughout its duration**.
+==== Important Notes ====
+  * **__Reservations will be deleted if a job is not submitted/actively running for a period of time__**
+  * Reservations should be made in advance when possible
+    * Same-day reservation requests will not cancel active jobs on the node(s) that provide the resources requested
+  * Nodes and their resources may be reserved for an initial **maximum of fourteen days**
+    * Extensions may be requested in increments of one week at most
+==== Requesting a reservation ====
+  * Review the available resource tables above and determine which node(s) will meet your resource requirements
+  * Send a request to ''%%cshelpdesk@virginia.edu%%'' with the following details
+    - Resources needed
+      - Number of Nodes
+      - Number of CPUs (per node)
+      - Amount of RAM (per node)
+      - (optional) Type and number of GPUs
+    - Reservation duration (maximum is two weeks)
+    - Users that should have access (userids)
+==== Using a reservation ====
+Include the reservation name in your slurm commands.
+Be sure to include the slurm QoS ''%%csresnolim%%'' to avoid cluster limits such as the per-partition resource limits.
+Using ''%%salloc%%'' or ''%%srun%%'' (replace <reservation name> with the name of the reservation)
+<code>
+~$ srun --reservation=<reservation name> --qos=csresnolim
+~$ salloc --reservation=<reservation name> --qos=csresnolim
+</code>
+Using an ''%%sbatch%%'' script
+<code>
+#SBATCH --reservation=<reservation name>
+#SBATCH --qos=csresnolim
+</code>
+==== Viewing Current Reservations ====
+To view existing reservations
+<code>
+~$ scontrol show res
+</code>
+----
+===== Submitting & Controlling Jobs =====
+==== Submitting Jobs ====
+To submit a job to run on slurm nodes in the CS cluster, you must be logged into one of the login nodes. Currently, these are the **(__[[compute_portal|Portal SSH cluster]]__)** or the **(__[[nx_lab|NoMachine Remote Desktop cluster]]__)**. After logging into the head nodes, you are able to run slurm commands such as ''%%salloc%%'', ''%%srun%%'', and ''%%sbatch%%''.
+  * **salloc** when run will allocate resources/nodes but will not run anything ([[https://slurm.schedmd.com/salloc.html |official salloc documentation]])
+    * One use is parallel processing: allocate multiple nodes, then use **srun** to execute a command across the allocated nodes
+  * **srun** can utilize a resource allocation from **salloc**, or can create one itself ([[https://slurm.schedmd.com/srun.html |official srun documentation]])
+  * **sbatch** is used to submit a script that describes the resources required and program execution procedure ([[https://slurm.schedmd.com/sbatch.html |official sbatch documentation]])
+Sample command execution
+<code>
+# submit an sbatch script
+~$ sbatch mysbatchscript
+</code>
+==== Important Job Submission Notes ====
+  * If you submit a job and it does not run immediately, it may be waiting for resources that are reserved or in use, or you may have reached the resource limits for the partition (see the Resource Limits section). Jobs can be left in the queue and will run as soon as resources are available again
+==== Common Job Options ====
+Flags and parameters are passed directly to **salloc** and **srun** via the command line, or can be defined in an **sbatch** script by prefixing the flag with ''%%#SBATCH%%'' followed by a space. For example, ''%%#SBATCH --nodes=1%%'' attempts to allocate a single node.
+This list contains several of the common flags to request resources when submitting a job. **//__When submitting a job, you should avoid specifying a hostname, and instead specify required resources__//**.
+Note, the syntax of ''%%<...>%%'' denotes what should be replaced by various input such as a number or name. A full list of available options can be found on the official SLURM website for [[https://slurm.schedmd.com/salloc.html |salloc]], [[https://slurm.schedmd.com/srun.html |srun]], and [[https://slurm.schedmd.com/sbatch.html |sbatch]].
+<code>
+-J or --job-name=<jobname>                The name of your job
+-N <n> or --nodes=<n>                     Number of nodes to allocate
+-n <n> or --ntasks=<n>                    Number of tasks to run
+--ntasks-per-node=<n>                     Number of tasks to be run on each allocated node
+--ntasks-per-core=<n>                     Number of tasks to be run on each allocated core
+--ntasks-per-gpu=<n>                      Number of tasks to be run on each allocated GPU
+-p <partname> or --partition=<partname>   Submit a job to a specified partition
+-c <n> or --cpus-per-task=<n>             Number of cores to allocate per process,
+                                          primarily for multithreaded jobs,
+                                          default is one core per process/task
+--mem=<n>                                 System memory required for each node specified in MBs,
+                                          (Ex. --mem=4000 requires 4000MBs of memory per node)
+                                          (Ex. --mem=4G requires 4GBs = 4096MBs of memory per node)
+                                          (Note, --mem=0 requires ALL memory to be available on
+                                           a node for your job to run. It is recommended to avoid
+                                           specifying '0' as this requires an entire node to be
+                                           idle (i.e. no other jobs running) to process your job)
+--mem-per-cpu=<n>                         Minimum CPU memory required for each allocated core,
+                                          specified in MBs
+--mem-per-gpu=<n>                         Minimum CPU memory required for each allocated GPU,
+                                          specified in MBs
+-t D-HH:MM:SS or --time=D-HH:MM:SS        Maximum WALL clock time for a job, which
+                                          should always be specified for interactive jobs.
+                                          Estimate high, adding at least six hours more than you
+                                          expect your program to run for.
+                                          Defaults to partition time limit default if not set.
+                                          Cannot exceed partition maximum WALL time.
+--gres=<list>:<n>                         Comma separated list of GRES (such as GPUs) to include
+                                          (Ex. --gres=gpu:1, requests the first available GPU)
+                                          (Ex. --gres=gpu:2 and --constraint="a100_40gb",
+                                           requests 2 A100 GPUs that each have 40GBs of GPU memory)
+-C <features> or --constraint=<features>  Specify unique resource requirements.
+                                          Logical operators are valid
+                                          OR: |
+                                          AND: &
+                                          (Ex. --constraint="rtx_4000" requests a node with an
+                                           RTX 4000 GPU)
+                                          (Ex. --constraint="amd_epyc_7742&a100_80gb" requests
+                                           a node that has an AMD Epyc processor AND
+                                           an A100 80GB GPU available)
+                                          (Ex. --constraint="a100_80gb|a100_40gb" requests
+                                           a node that has an A100 80GB OR an A100 40GB GPU available)
+--mail-type=<type>                        Specify the job state that should generate an email.
+                                          Valid types are: none,begin,end,fail,requeue,all
+                                          (Ex. --mail-type=begin,end, send emails for when a job
+                                           starts and completes)
+--mail-user=<userid>@virginia.edu         Specify the recipient virginia email address for email
+                                          notifications. All other domains such as '@gmail.com'
+                                          are silently ignored.
+                                          Your 'userid' is the same as your Computing ID
+</code>
+  * Note, ''%%--mem%%'', ''%%--mem-per-cpu%%'', and ''%%--mem-per-gpu%%'' options are mutually exclusive and should not be used together
+  * Note, for memory specifications, **1G = 1024MBs**. For example, a node with 4000MBs of available memory will not accept jobs specifying ''%%--mem=4G%%'' since 4G = 4096MBs. Thus, it's recommended to specify in MBs ''%%--mem=1000%%'' instead of GBs ''%%--mem=1G%%''
+  * Note, when allocating memory with ''%%--mem%%'', this allocates System or CPU memory only. When allocating a GPU, your job is allocated all VMEM for that GPU
+  * A full list of **constraint options** can be found here (__[[https://slurm.schedmd.com/sbatch.html#OPT_constraint|SLURM constraint options]]__)
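+The unit pitfall in the notes above comes down to simple arithmetic. The helpers below are a sketch (''%%mem_to_mb%%'' and ''%%fits_on_node%%'' are hypothetical names, and only the ''%%G%%'' and ''%%M%%'' suffixes are handled) showing why ''%%--mem=4G%%'' does not fit on a node with 4000MBs available.

```shell
# Convert a --mem style value to MBs, using 1G = 1024MBs.
mem_to_mb() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo "$1" ;;   # a bare number is already in MBs
    esac
}

# Succeeds when the request ($1) fits within the available MBs ($2).
fits_on_node() {
    [ "$(mem_to_mb "$1")" -le "$2" ]
}

mem_to_mb 4G                                              # prints 4096
fits_on_node 4G 4000 || echo "4G does not fit in 4000MBs"
fits_on_node 4000 4000 && echo "4000 fits in 4000MBs"
```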
+==== Environment Variables ====
+A full list of slurm environment variables can be found here **__[[https://slurm.schedmd.com/sbatch.html#SECTION_INPUT-ENVIRONMENT-VARIABLES|slurm environment variables]]__**. When a job is submitted, all environment variables are carried forward into the job unless otherwise specified. This behavior can be modified and is primarily changed for **sbatch** scripts when needed. When a job is submitted, by default, slurm will ''%%cd%%'' to the directory the job was submitted (using **salloc**, **sbatch**, or **srun**) from.
+To not carry environment variables from your shell forward
+<code>
+#SBATCH --export=NONE
+</code>
+To export individual variables
+<code>
+#SBATCH --export=var0,var1
+</code>
+Variables can be set and exported with
+<code>
+#SBATCH --export=var0=value0,var1=value1
+</code>
+==== Log files: Standard Output (STDOUT) and Standard Error (STDERR) ====
+By default, slurm aggregates the STDOUT and STDERR streams into a single file named ''%%slurm-%j.out%%'', where ''%%%j%%'' is the jobid. For job arrays, the default name is ''%%slurm-%A_%a.out%%'', where ''%%%A%%'' is the array's master jobid and ''%%%a%%'' is the array index. The file is written to the directory that **sbatch** is executed in. The file name, the streams included, and the location can all be modified.
+See more information about **__[[https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E|file name patterns]]__**.
+Modify combined output file name
+<code>
+#SBATCH -o <output file name>
+or
+#SBATCH --output=<output file name>
+</code>
+To separate the output files, define both file names for STDOUT and STDERR
+<code>
+#SBATCH --output=<output file name>
+#SBATCH -e <error file name>
+or
+#SBATCH --output=<output file name>
+#SBATCH --error=<error file name>
+</code>
+A path can be specified for either option, and patterns can be used in the file name as well
+<code>
+#SBATCH --output="/p/myproject/slurmlogs/%A.out"
+</code>
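+To see what a pattern turns into, the substitution slurm performs can be imitated with ''%%sed%%''. This is a sketch only: ''%%expand_log_name%%'' is a hypothetical helper that handles just ''%%%A%%'' and ''%%%a%%'', and the jobid ''%%12345%%'' and array index ''%%7%%'' are made-up values.

```shell
# Expand %A (jobid) and %a (array index) in a log file name pattern.
# $1 = pattern, $2 = jobid, $3 = array index
expand_log_name() {
    printf '%s\n' "$1" | sed -e "s/%A/$2/g" -e "s/%a/$3/g"
}

expand_log_name "slurm-%A_%a.out" 12345 7                 # prints slurm-12345_7.out
expand_log_name "/p/myproject/slurmlogs/%A.out" 12345 7   # prints /p/myproject/slurmlogs/12345.out
```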
+==== Interactive Job ====
+**Direct SSH connections are disabled** to slurm nodes. Instead, an interactive job can be initialized to run commands directly on a node.
+Note, idle interactive sessions deny resource allocations for other jobs. **As such, idle interactive jobs when found are terminated.** Generally, interactive sessions should be used for testing and debugging with the intention of creating an SBATCH script to run your job.
+Note, an interactive session will **time out after one hour** if no commands are executed during the hour.
+The following example creates a resource allocation within the CPU partition for one node with two cores, 4GBs of memory, and a time limit of 30 minutes. Then, a BASH shell is initialized within the allocation:
+<code>
+userid@portal01~$ salloc -p cpu -c 2 --mem=4000 -J InteractiveJob -t 30
+salloc: Granted job allocation 12345
+userid@portal01~$ srun --pty bash -i -l --
+userid@node01~$
+... testing performed ...
+userid@node01~$ exit
+userid@portal01~$ exit
+salloc: Relinquishing job allocation 12345
+salloc: Job allocation 12345 has been revoked.
+userid@portal01~$
+</code>
+**Be sure to type ''%%exit%%'' twice**. The first ''%%exit%%'' closes the shell started by **srun** on the node; the second relinquishes the allocation made with **salloc**.
+==== View Active Job Details ====
+To view all of your job(s) and their respective statuses
+<code>
+~$ squeue -u <userid>
+</code>
+To view individual job details in full
+<code>
+~$ scontrol show job <jobid>
+</code>
+To view full job utilization and allocation details
+<code>
+~$ sstat <jobid>
+</code>
+An estimated start time for a pending job can sometimes be obtained by running
+<code>
+~$ squeue --start -j <jobid>
+</code>
+==== Canceling Jobs ====
+To obtain jobids, use the ''%%squeue%%'' command as shown in the previous section.
+To cancel an individual job
+<code>
+~$ scancel <jobid>
+</code>
+To send a different signal to job processes (default is SIGTERM/terminate), use the ''%%--signal=<signal>%%'' flag
+<code>
+~$ scancel --signal=KILL <jobid>
+</code>
+To cancel all of your jobs regardless of state
+<code>
+~$ scancel -u <userid>
+</code>
+To cancel all of your PENDING (PD) jobs, include the ''%%-t <state>%%'' or ''%%--state=<state>%%'' flag
+<code>
+~$ scancel -u <userid> -t PENDING
+</code>
+To completely restart a job, that is to cancel and restart from the beginning
+<code>
+~$ scontrol requeue <jobid>
+</code>
+==== Jupyter Notebook ====
+A Jupyter notebook can be opened within a SLURM job.
+**Note, you MUST be on-grounds using __eduroam__ wifi or running a UVA VPN.**
+=== Interactive Job ===
+To open a Jupyter notebook during an interactive session, firstly load the miniforge module
+<code>
+~$ module load miniforge
+</code>
+Then, run the following command, and find the URL output to access the Jupyter instance
+<code>
+~$ jupyter notebook --no-browser --ip=$(hostname -A)
+... output omitted ...
+Or copy and paste one of these URLs:
+        http://hostname.cs.Virginia.EDU:8888/tree?token=12345689abcdefg
+</code>
+Copy and paste the generated URL into your browser.
+=== SBATCH Job ===
+Another option is to attach a Jupyter notebook to resources allocated via an SBATCH script.
+**Note, once you are finished, you must run ''%%~$ scancel <jobid>%%'', using the assigned ''%%<jobid>%%'', to free the allocated resources**
+The following SBATCH script is an example; you will need to modify it depending on the resource requirements of your job. **Enabling email notifications is recommended**
+<code>
+#!/bin/bash
+#SBATCH -n 1
+#SBATCH -t 00:30:00
+#SBATCH -p cpu
+#SBATCH --mail-type=begin
+#SBATCH --mail-user=<userid>@virginia.edu
+module purge
+module load miniforge
+jupyter notebook --no-browser --ip=$(hostname -A) > ~/slurm_jupyter_info 2>&1
+</code>
+Using the above SBATCH script, submit the job and wait until the resources are allocated
+<code>
+~$ sbatch sbatch_jupyter_example
+</code>
+You will receive an email when your job starts (if email notifications are enabled), or you can check your queue
+<code>
+~$ squeue -u $USER
+             JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON)
+           <jobid>       cpu sbatch_j   <userid>  R       0:04      1 <node name>
+</code>
+Once your job has started running, i.e. has a state of **R**, then output the notebook connection info, and copy/paste the generated URL into your browser
+<code>
+~$ cat ~/slurm_jupyter_info
+... output omitted ...
+Or copy and paste one of these URLs:
+        http://hostname.cs.Virginia.EDU:8888/tree?token=12345689abcdefg
+</code>
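+If the info file contains a lot of output, the URL can be pulled out directly. The following is a small sketch. The ''%%jupyter_url%%'' helper is hypothetical, and it assumes the file contains at least one line with an ''%%http://%%'' or ''%%https://%%'' URL as in the output above.

```shell
#!/bin/bash
# Hypothetical helper: print the first URL found in the Jupyter info file.
# Defaults to the file written by the example SBATCH script above.
jupyter_url() {
    local info_file="${1:-$HOME/slurm_jupyter_info}"
    grep -Eo 'https?://[^[:space:]]+' "$info_file" | head -n 1
}

# Example call once the job is in state R:
#   jupyter_url ~/slurm_jupyter_info
```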
+**Note, once you are finished, you must run ''%%~$ scancel <jobid>%%'', using the assigned ''%%<jobid>%%'', to free the allocated resources**
+<code>
+~$ scancel 12345
+</code>
+==== CPU Socket & Thread Specifications ====
+SLURM nodes typically will have 1-2 CPU sockets, along with multithread support. Jobs can be built to require a certain number of cores per socket, sockets per node, etc., and assign them to a number of tasks.
+As an example, a parameter can be provided to instruct SLURM to request a server that matches the following criteria
+<code>
+--extra-node-info=(SOCKETS):(CORES PER SOCKET):(THREADS PER CORE)
+</code>
+For example, the following parameters require a single task to be run across 64 CPUs, on a node with 64 available CPU cores on a single socket, and only to use one thread per CPU core (even if two are available)
+<code>
+srun ... -n 1 -c 64 --extra-node-info=1:64:1 --cpu-bind=verbose ...
+</code>
+An alternative is to provide a __hint__ to the scheduler, which will have a similar effect
+<code>
+srun ... -n 1 -c 64 --cores-per-socket=64 --hint=compute_bound,nomultithread --cpu-bind=verbose ...
+</code>
+For building such job parameters and for further reading, please see the official SLURM documentation found **__([[https://slurm.schedmd.com/mc_support.html |here]])__**.
+==== Email Notifications ====
+Email notifications are available for various job state changes, for example, when a job starts, completes, or fails. The scheduler checks **__every five minutes__** for emails to send out.
+For emails about completed jobs, the ''%%seff <jobID>%%'' command is executed to provide utilization statistics for allocated CPUs and memory.
+To enable email notifications, include in a given SBATCH script
+<code>
+#SBATCH --mail-type=<type>
+#SBATCH --mail-user=<userid>@virginia.edu
+</code>
+Replace ''%%<type>%%'' with the option(s) (comma separated) ''%%none,begin,end,fail,requeue,all%%'', and replace ''%%<userid>%%'' with your UVA computing ID.
+Note, all other email domains such as **gmail.com** are silently ignored.
+Note, for job arrays, the scheduler only generates a single email for ''%%<jobID>_*%%'' and not each individual job such as ''%%<jobID>_1%%''. Further, the ''%%seff%%'' command is run on the last array job that completes.
+For example, to receive a notification when a job starts and finishes
+<code>
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+</code>
+Official documentation for SLURM emailing functionality can be found **__([[https://slurm.schedmd.com/sbatch.html#OPT_mail-type |here]])__**
+----
+===== Example SBATCH Scripts =====
+**sbatch** scripts are **bash** scripts whose commands are executed sequentially. Slurm parameters (''%%#SBATCH <flag>%%'') are defined along with the commands to run a program. The commands should be the same ones that you would use in your terminal.
+When submitting an sbatch script, be aware of file paths. An sbatch script will use the current directory from where it was submitted.
+To submit an sbatch script from a login node, replace ''%%<script file name>%%'' with the name of your sbatch script
+<code>
+~$ sbatch <script file name>
+</code>
+The examples below give a starting point for creating a job script. Modify them as needed for your job(s).
+=== Single Process Program ===
+The following is an example of a single process job that runs on the CPU partition. This will allocate the default amount of memory for the cpu partition per core, which is only 256MBs
+<code>
+#!/bin/bash
+#SBATCH -n 1
+#SBATCH -t 04:00:00
+#SBATCH -p cpu
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+./myprogram
+</code>
+=== Simple GPU Program (allocates first available GPU) ===
+The following allocates the first available GPU, regardless of model/type, along with 8GBs of CPU memory
+<code>
+#!/bin/bash
+#SBATCH --gres=gpu:1
+#SBATCH --mem=8000
+#SBATCH -n 1
+#SBATCH -t 04:00:00
+#SBATCH -p gpu
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+module purge
+module load gcc
+module load python
+module load cuda
+python3 myprogram.py
+</code>
+=== Simple GPU Program (allocates a specific GPU) ===
+The following requests a single A100 GPU card with 40GBs of memory when one is available, along with 16GBs of CPU memory
+<code>
+#!/bin/bash
+#SBATCH --gres=gpu:1
+#SBATCH --constraint="a100_40gb"
+#SBATCH --mem=16000
+#SBATCH -n 1
+#SBATCH -t 04:00:00
+#SBATCH -p gpu
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+module purge
+module load gcc
+module load python
+module load cuda
+python3 myprogram.py
+</code>
+Learn more about allocating GPUs for your job **__[[https://slurm.schedmd.com/gres.html#Running_Jobs|here]]__**.
+==== Simple Parallel Program ====
+Jobs that use the Message Passing Interface (MPI) typically need to request several nodes/processors.
+The following requests two servers, each to run 8 tasks, for a total of 16 tasks.
+<code>
+#!/bin/bash
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=8
+#SBATCH -t 06:00:00
+#SBATCH -p cpu
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+module purge
+module load gcc
+module load openmpi
+srun ./myparallel_program
+</code>
+==== Job Arrays ====
+An option using slurm is to submit several jobs simultaneously for processing separate data.
+The following requests a total of 16 array tasks, each of which requires 1 CPU (the script sets ''%%--ntasks=1%%'' and does not request additional CPUs per task) and 1GB of memory. The resulting output files have the format ''%%examplearray_%A-%a.out%%'', where ''%%%A%%'' is the jobid, and ''%%%a%%'' is the task number.
+<code>
+#!/bin/bash
+#SBATCH --job-name=myjobarray
+#SBATCH --array=1-16
+#SBATCH --ntasks=1
+#SBATCH --mem=1000
+#SBATCH -t 00:04:00
+#SBATCH -p cpu
+#SBATCH --output=examplearray_%A-%a.out
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+echo "Hello world from $HOSTNAME, slurm taskid is: $SLURM_ARRAY_TASK_ID"
+</code>
+The following example uses dummy option input parameters from a file ''%%~/slurm_input_example%%''
+<code>
+--input_0=123 --normalize=True  --seed=0
+--input_0=321 --normalize=False --seed=8
+</code>
+Then, the following SBATCH script can be used. ''%%SLURM_ARRAY_TASK_ID%%'' increments by one for each individual job within the job array ''%%SLURM_ARRAY_TASK_ID={1,2,...}%%'', which then corresponds to each line within the option file
+<code>
+#!/bin/bash
+#
+#SBATCH --job-name=myjobarray
+#SBATCH --array=1-2
+#SBATCH --ntasks=1
+#SBATCH --mem=1000
+#SBATCH -t 00:02:00
+#SBATCH -p cpu
+#SBATCH --output=examplearray_%A-%a.out
+#SBATCH --mail-type=begin,end
+#SBATCH --mail-user=<userid>@virginia.edu
+# set up modules and/or environment if needed
+module purge
+module load ...
+# use $HOME rather than a quoted "~", which bash would not expand
+OPTS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$HOME/slurm_input_example")
+python my_python_script $OPTS
+</code>
+For the first job, ''%%SLURM_ARRAY_TASK_ID=1%%'' which sets ''%%OPTS="--input_0=123 --normalize=True --seed=0"%%''. Then for the second job, ''%%SLURM_ARRAY_TASK_ID=2%%'' which sets ''%%OPTS="--input_0=321 --normalize=False --seed=8"%%''
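+The line lookup can be checked on a login node before submitting, by setting ''%%SLURM_ARRAY_TASK_ID%%'' by hand. The following is a dry-run sketch only. The ''%%opts_for_task%%'' helper and the temporary file are illustrative stand-ins, and the Python command is only printed, not executed.

```shell
#!/bin/bash
# Dry run outside of Slurm: a temporary file stands in for the option file.
opts_file=$(mktemp)
cat > "$opts_file" <<'EOF'
--input_0=123 --normalize=True  --seed=0
--input_0=321 --normalize=False --seed=8
EOF

# Hypothetical helper: print line N (first argument) of a file (second argument).
opts_for_task() {
    sed -n "${1}p" "$2"
}

# Simulate the two array tasks by setting the task ID manually.
for SLURM_ARRAY_TASK_ID in 1 2; do
    OPTS=$(opts_for_task "$SLURM_ARRAY_TASK_ID" "$opts_file")
    echo "task $SLURM_ARRAY_TASK_ID would run: python my_python_script $OPTS"
done

rm -f "$opts_file"
```

+A task ID beyond the last line of the file yields an empty ''%%OPTS%%'', so the ''%%--array%%'' range should match the number of lines in the option file.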
+==== Job Ordering (Dependencies) ====
+Job dependencies allow for defining a condition which specifies when a job should run. Only one separator is allowed. If conditions are separated using a comma ''%%,%%'', then they must all be met. If conditions are separated using a ''%%?%%'' character, then only one must be met.
+For example, to specify that ''%%jobid=4%%'' should not run unless ''%%jobid=3%%'' successfully runs (exit code is 0), this would be accomplished by using ''%%-d afterok:3%%'' or ''%%--dependency=afterok:3%%''.
+To specify that a job should start after two others have finished, regardless of their exit status, use ''%%afterany%%'': ''%%--dependency=afterany:<jobid>:<jobid>%%''.
+For further reading, see **([[https://slurm.schedmd.com/sbatch.html#OPT_dependency|SLURM Job Dependencies]])**.
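+When chaining jobs from a script, the job ID of one submission has to be passed to the next. The following is a minimal sketch. Both helpers are hypothetical, and the script names in the comments are placeholders. ''%%parse_jobid%%'' assumes the ''%%Submitted batch job <jobid>%%'' message that ''%%sbatch%%'' prints on success.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's
# "Submitted batch job <jobid>" message.
parse_jobid() {
    awk '/^Submitted batch job/ {print $4}' <<< "$1"
}

# Hypothetical helper: build a --dependency flag from a type and job IDs.
# Usage: dependency_flag afterok 3 4  ->  --dependency=afterok:3:4
dependency_flag() {
    local dep_type="$1"
    shift
    local IFS=:   # join the remaining arguments with ":"
    printf -- '--dependency=%s:%s\n' "$dep_type" "$*"
}

# Example chain (step1.sh and step2.sh are placeholder script names):
#   first=$(parse_jobid "$(sbatch step1.sh)")
#   sbatch "$(dependency_flag afterok "$first")" step2.sh
```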
+=== Canceling Array Tasks ===
+To cancel one single task from an array
+<code>
+~$ scancel <jobid>_<taskid>
+</code>
+To cancel a range of tasks
+<code>
+~$ scancel <jobid>_[<task0>-<task3>]
+</code>
+A list format can also be provided
+<code>
+~$ scancel <jobid>_[<task0>,<task1>,<task2>]
+</code>
+----