{{ ::introtoslurm.pdf | Intro to SLURM slides }}
  
The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]].  The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue.  The job scheduler looks at the requirements stated in the job's script and allocates to the job a server (or servers) that matches those requirements.  For example, if the job script specifies that a job needs 192GB of memory, the job scheduler will find a server with at least that much memory free.
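
As a sketch, a job script states such a requirement with an ''%%SBATCH%%'' directive; the 192GB figure is just an illustration, and ''%%my_program%%'' is a placeholder for your own executable:

<code bash>
#!/bin/bash
#SBATCH --mem=192G    # ask the scheduler for a node with at least 192GB of memory available

./my_program          # placeholder for the program you want to run
</code>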
  
===== Using SLURM =====
  
<code>
[pgh5a@portal04 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3  drain falcon[3-5]
main*        up   infinite     27   idle cortado[01-10],falcon[1-2,6-10],lynx[08-12],slurm[1-5]
gpu          up   infinite      4    mix ai[02-03,05],lynx07
gpu          up   infinite      1  alloc ai04
gpu          up   infinite     12   idle ai[01,06],lynx[01-06],ristretto[01-04]
</code>
  
With ''%%sinfo%%'' we can see a listing of the job queues, or "partitions", and the nodes associated with each partition.  A partition is a grouping of nodes; for example, our //main// partition is a group of all general purpose nodes, and the //gpu// partition is a group of nodes that each contain GPUs.  A host can be listed in two or more partitions.
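
To see only the nodes in a particular partition, you can pass a partition name to ''%%sinfo%%'' with ''%%-p%%'', for example:

<code>
sinfo -p gpu
</code>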
  
To view jobs running on the queue, we can use the command ''%%squeue%%''.  Say we have submitted one job to the main partition; running ''%%squeue%%'' will look like this:
<code>
pgh5a@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    my_job    pgh5a  R       0:06      1 artemis1
</code>
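
To list only your own jobs, ''%%squeue%%'' can filter by user name, for example:

<code>
squeue -u pgh5a
</code>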
  
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller will then send your job to compute nodes for execution, after which your results will be returned.
  
Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''.  From a shell, you can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]].  Let's look at a very simple example script and ''%%sbatch%%'' command.
  
Here is our script; all it does is print the hostname of the server running the script.  We must add ''%%SBATCH%%'' options to our script to set various SLURM options.
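
A minimal ''%%test.sh%%'' along these lines would produce the output files shown below; the job name and file names here are just an example:

<code bash>
#!/bin/bash
#SBATCH --job-name=my_job          # name shown by squeue
#SBATCH --output=my_job.output     # file that captures standard output
#SBATCH --error=my_job.err         # file that captures standard error

hostname
</code>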
Now we make the script executable and submit it with ''%%sbatch%%'':
  
<code>
pgh5a@portal01 ~ $ cd slurm-test/
pgh5a@portal01 ~/slurm-test $ chmod +x test.sh
pgh5a@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
pgh5a@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
pgh5a@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
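
While a job is queued or running, ''%%scontrol%%'' can show its full details, using the job id that ''%%sbatch%%'' printed:

<code>
scontrol show job 466977
</code>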
We can also run a command directly on specific nodes with ''%%srun%%''.  Here ''%%-w%%'' names the nodes and ''%%-N5%%'' requests five of them, running ''%%hostname%%'' on each:
  
<code>
pgh5a@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
</code>
==== Terminating Jobs ====
  
Please be aware of jobs you start and make sure that they finish executing.  If your job does not exit gracefully, it will continue running on the server, taking up resources and preventing others from running their jobs.
  
To cancel a running job, use the ''%%scancel [jobid]%%'' command:
  
<code>
ktm5j@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    ktm5j  R       0:06      1 artemis1           <--  Running job
ktm5j@portal01 ~ $ scancel 467039
</code>
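
To cancel every job you have in the queue at once, ''%%scancel%%'' also accepts a user name:

<code>
scancel -u ktm5j
</code>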
  
  
Here is an example running the command ''%%hostname%%'' on the //main// partition; it will run on any node in the partition:
  
<code>
srun -p main hostname
</code>
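
The same idea applies to batch jobs; as a sketch, a job script can select the partition and request resources with ''%%SBATCH%%'' directives (the CPU and memory numbers below are arbitrary examples):

<code bash>
#!/bin/bash
#SBATCH -p main              # submit to the main partition
#SBATCH --cpus-per-task=4    # example CPU request
#SBATCH --mem=8G             # example memory request

hostname
</code>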
  
==== Interactive Shell ====
  
We can use ''%%srun%%'' to spawn an interactive shell on a server controlled by the SLURM job scheduler.  This can be useful for debugging, as well as for running jobs without a job script.  Note that creating an interactive session also reserves the node for your exclusive use.

To spawn a shell, we must pass the ''%%--pty%%'' option to ''%%srun%%'' so that output is directed to a pseudo-terminal:
  
<code>
pgh5a@portal ~$ srun -w slurm1 --pty bash -i -l -
pgh5a@slurm1 ~$ hostname
slurm1
pgh5a@slurm1 ~$
</code>
  
If a node is in a partition other than the default "main" partition (for example, the "gpu" partition), then you must specify the partition in your command, for example:
<code>
pgh5a@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>
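
If the interactive session also needs a GPU allocated to it, the ''%%--gres%%'' option can request one; this is a sketch, and the exact GRES names available depend on the cluster configuration:

<code>
pgh5a@portal ~$ srun -p gpu --gres=gpu:1 --pty bash -i -l -
</code>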
  
==== Reservations ====
  
Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>.  For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].

===== Note on Modules in Slurm =====

Due to the way sbatch spawns a bash session (non-login session), some init files are not loaded from ''%%/etc/profile.d%%''.  This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system and will prevent you from loading software modules.

To fix this, simply include the following line in your sbatch scripts:

<code bash>
source /etc/profile.d/modules.sh
</code>
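
With that line in place, modules can then be loaded inside the job script as usual; the module name below is only an illustration, and ''%%module avail%%'' lists what is actually installed:

<code bash>
#!/bin/bash
#SBATCH --job-name=module_example

source /etc/profile.d/modules.sh   # initialize Environment Modules in the non-login shell
module load gcc                    # hypothetical module name; run "module avail" to see real ones

gcc --version
</code>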
  