====== Scheduling a Job using the SLURM job scheduler ======

{{ ::introtoslurm.pdf |Intro to SLURM slides }}

The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]].  The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue.  The job scheduler looks at the requirements stated in the job's script and allocates to the job a server (or servers) which matches the requirements specified in the job script.  For example, if the job script specifies that a job needs 192GB of memory, the job scheduler will find a server with at least that much memory free.
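
As a sketch of what such a requirement looks like in practice, the standard ''%%--mem%%'' option in an ''%%sbatch%%'' script asks for that much memory; the job name and values below are only illustrative:

<code bash>
#!/bin/bash
#SBATCH --job-name=big_memory_job   # illustrative job name
#SBATCH --mem=192G                  # ask the scheduler for a node with at least 192 GB of memory
#SBATCH --time=01:00:00             # illustrative wall-clock time limit
hostname                            # placeholder for the real work
</code>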
  
===== Using SLURM =====

**[[https://slurm.schedmd.com/pdfs/summary.pdf|Slurm Commands Cheat Sheet]]**

==== Information Gathering ====

To see the current state of the nodes and partitions, use the ''%%sinfo%%'' command:
  
<code>
[pgh5a@portal04 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3  drain falcon[3-5]
main*        up   infinite     27   idle cortado[01-10],falcon[1-2,6-10],lynx[08-12],slurm[1-5]
gpu          up   infinite      4    mix ai[02-03,05],lynx07
gpu          up   infinite      1  alloc ai04
gpu          up   infinite     12   idle ai[01,06],lynx[01-06],ristretto[01-04]
</code>
  
With ''%%sinfo%%'' we can see a listing of the job queues, or "partitions", and a list of nodes associated with these partitions.  A partition is a grouping of nodes; for example, our //main// partition is a group of all general purpose nodes, and the //gpu// partition is a group of nodes that each contain GPUs.  Sometimes hosts can be listed in two or more partitions.
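
For example (a usage sketch relying only on standard ''%%sinfo%%'' options), the listing can be restricted to a single partition or expanded to one line per node:

<code>
sinfo -p gpu          # show only the gpu partition
sinfo -N -l -p main   # long format, one line per node, main partition only
</code>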
  
To view jobs running on the queue, we can use the command ''%%squeue%%''.  Say we have submitted one job to the main partition; running ''%%squeue%%'' will look like this:
  
<code>
pgh5a@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    my_job    pgh5a  R      0:06      1 artemis1
</code>
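
To see only your own jobs, ''%%squeue%%'' also accepts the standard ''%%-u%%'' option (a sketch; substitute your own userid):

<code>
squeue -u pgh5a     # list only jobs belonging to user pgh5a
</code>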
  
We can run ''%%sinfo%%'' again to see the state of the nodes while the job is running:

<code>
pgh5a@portal01 ~ $ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*            up   infinite     37   idle hermes[1-4],artemis[2-7],slurm[1-5],nibbler[1-4],trillian[1-3],granger[1-6],granger[7-8],ai0[1-6]
</code>
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller will then send your job to compute nodes for execution, after which time your results will be returned.
  
Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''.  From a shell, you can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]].  Let's look at a very simple example script and ''%%sbatch%%'' command.
  
Here is our script; all it does is print the hostname of the server running the script.  We must add ''%%SBATCH%%'' options to our script to handle various SLURM options.
<code bash>
#!/bin/bash
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=pgh5a@virginia.edu
#
#SBATCH --error="my_job.err"                    # Where to write std err
#SBATCH --output="my_job.output"                # Where to write std out
hostname                                        # Print the hostname of the node running the job
</code>

Now we make the script executable and submit it with ''%%sbatch%%'':
  
<code>
pgh5a@portal01 ~ $ cd slurm-test/
pgh5a@portal01 ~/slurm-test $ chmod +x test.sh
pgh5a@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
pgh5a@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
pgh5a@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
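
Once a job has been submitted, the standard ''%%scontrol%%'' and ''%%sacct%%'' commands can be used to inspect it (a sketch using the job ID from the example above):

<code>
scontrol show job 466977   # detailed information about a pending or running job
sacct -j 466977            # accounting information, including job state and exit code
</code>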

We can also use ''%%srun%%'' to run a command directly on one or more nodes.  For example, to run ''%%hostname%%'' on the five nodes ''%%slurm[1-5]%%'':

<code>
pgh5a@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
slurm2
slurm3
slurm5
</code>

To cancel a running job, use ''%%scancel%%'' with the job ID shown by ''%%squeue%%'':
  
<code>
ktm5j@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    ktm5j  R       0:06      1 artemis1           <--  Running job
ktm5j@portal01 ~ $ scancel 467039
</code>
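
''%%scancel%%'' can also cancel jobs in bulk using its standard options (a sketch; substitute your own userid and job name):

<code>
scancel -u ktm5j        # cancel all of your own jobs
scancel --name my_job   # cancel jobs by job name
</code>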
  
To submit a job to a specific partition, add the ''%%-p%%'' option.  For example, to use the //gpu// partition:

<code>
-p gpu
</code>

An example running the command ''%%hostname%%'' on the //main// partition; this will run on any node in the partition:

<code>
srun -p main hostname
</code>
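
For a batch job, the partition can be set inside the job script instead.  The sketch below assumes the //gpu// partition exposes GPUs as SLURM generic resources (''%%--gres%%''):

<code bash>
#!/bin/bash
#SBATCH -p gpu                 # submit to the gpu partition
#SBATCH --gres=gpu:1           # request one GPU (assumes gres is configured for GPUs)
#SBATCH --output=gpu_job.out
nvidia-smi                     # illustrative command: show the GPU allocated to the job
</code>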
  
==== Interactive Shell ====
  
We can use ''%%srun%%'' to spawn an interactive shell on a server controlled by the SLURM job scheduler.  This can be useful for debugging purposes as well as for running jobs without using a job script.  Creating an interactive session also reserves the node for your exclusive use.

To spawn a shell, we must pass the ''%%--pty%%'' option to ''%%srun%%'' so output is directed to a pseudo-terminal:
  
<code>
pgh5a@portal ~$ srun -w slurm1 --pty bash -i -l -
pgh5a@slurm1 ~$ hostname
slurm1
pgh5a@slurm1 ~$
</code>
  
The ''%%-i%%'' argument tells ''%%bash%%'' to run as an interactive shell.  The ''%%-l%%'' argument instructs bash that this is a login shell; this, along with the final ''%%-%%'', is important to reset environment variables that otherwise might cause issues using [[linux_environment_modules|Environment Modules]].

If a node is in a partition other than the default //main// partition (for example, the //gpu// partition), then you must specify the partition in your command:

<code>
pgh5a@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>

===== Reservations =====

Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>.  For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].
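
Once a reservation has been created for you, jobs can be submitted against it with the standard ''%%--reservation%%'' option (a sketch; the reservation name is hypothetical):

<code>
sbatch --reservation=my_reservation test.sh
srun --reservation=my_reservation --pty bash -i -l -
</code>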

===== Note on Modules in Slurm =====

Due to the way sbatch spawns a bash session (non-login session), some init files are not loaded from ''%%/etc/profile.d%%''.  This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system and will prevent you from loading software modules.

To fix this, simply include the following line in your sbatch scripts:

<code bash>
source /etc/profile.d/modules.sh
</code>
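
For example, a batch script that loads a software module might look like the following sketch (the module name is hypothetical; run ''%%module avail%%'' on a node to see what is installed):

<code bash>
#!/bin/bash
#SBATCH --output=module_job.out
source /etc/profile.d/modules.sh   # initialize Environment Modules in the non-login shell
module load gcc                    # hypothetical module name
gcc --version                      # illustrative use of the loaded software
</code>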