====== Scheduling a Job using the SLURM job scheduler ======

{{ ::introtoslurm.pdf | Intro to SLURM slides }}

The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]].  The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue. The job scheduler looks at the requirements stated in the job's script and allocates to the job a server (or servers) which matches those requirements. For example, if the job script specifies that a job needs 192GB of memory, the job scheduler will find a server with at least that much memory free.
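
For instance, a job script can state such a memory requirement with an ''%%SBATCH%%'' directive.  A one-line sketch (the amount simply mirrors the 192GB example above and is not a recommendation):

<code bash>
#SBATCH --mem=192G     # ask the scheduler for a node with 192GB of memory available
</code>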
  
===== Using SLURM =====

**[[https://slurm.schedmd.com/pdfs/summary.pdf|Slurm Commands Cheat Sheet]]**

==== Information Gathering ====
  
To see what computational resources are available, start with the ''%%sinfo%%'' command:

<code>
[pgh5a@portal04 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3  drain falcon[3-5]
main*        up   infinite     27   idle cortado[01-10],falcon[1-2,6-10],lynx[08-12],slurm[1-5]
gpu          up   infinite      4    mix ai[02-03,05],lynx07
gpu          up   infinite      1  alloc ai04
gpu          up   infinite     12   idle ai[01,06],lynx[01-06],ristretto[01-04]
</code>
  
With ''%%sinfo%%'' we can see a listing of the job queues or "partitions" and a list of nodes associated with these partitions.  A partition is a grouping of nodes; for example, our //main// partition is a group of all general purpose nodes, and the //gpu// partition is a group of nodes that each contain GPUs.  Sometimes hosts can be listed in two or more partitions.
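
To look at a single partition, ''%%sinfo%%'' can be limited with its ''%%-p%%'' flag, and ''%%-N%%'' switches to a node-oriented listing.  A short sketch using the //gpu// partition shown above:

<code>
sinfo -p gpu        # show only the gpu partition
sinfo -N -p gpu     # list the gpu nodes one per line
</code>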
  
To view jobs running on the queue, we can use the command ''%%squeue%%''.  Say we have submitted one job to the main partition; running ''%%squeue%%'' will look like this:
  
<code>
pgh5a@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    my_job    pgh5a  R       0:06      1 artemis1
</code>
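
On a busy cluster the full queue can be long.  ''%%squeue%%'' accepts a ''%%-u%%'' flag to show only one user's jobs (the username below is just the example user from above):

<code>
squeue -u pgh5a     # list only jobs belonging to user pgh5a
</code>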
  
While the job is running, ''%%sinfo%%'' shows that the node it was assigned (''%%artemis1%%'' above) is no longer listed as idle:
  
<code>
pgh5a@portal01 ~ $ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*            up   infinite     37   idle hermes[1-4],artemis[2-7],slurm[1-5],nibbler[1-4],trillian[1-3],granger[1-6],granger[7-8],ai0[1-6]
</code>

==== Submitting a Job ====

To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller will then send your job to compute nodes for execution, after which your results will be returned.
  
Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''.  From a shell, you can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]].  Let's look at a very simple example script and ''%%sbatch%%'' command.
  
Here is our script.  All it does is print the hostname of the server running the script.  We must add ''%%SBATCH%%'' options to our script to handle various SLURM options.

<code bash>
#!/bin/bash
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=pgh5a@virginia.edu
#
#SBATCH --error="my_job.err"                    # Where to write std err
#SBATCH --output="my_job.output"                # Where to write std out

hostname
</code>

We make the script executable and submit it with ''%%sbatch%%'':
  
<code>
pgh5a@portal01 ~ $ cd slurm-test/
pgh5a@portal01 ~/slurm-test $ chmod +x test.sh
pgh5a@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
pgh5a@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
pgh5a@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
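
Real jobs usually also tell the scheduler what resources they need.  The sketch below shows a few of the standard ''%%SBATCH%%'' resource options; the job name and the specific values are placeholders to adapt, not recommendations:

<code bash>
#!/bin/bash
#SBATCH --job-name=example            # illustrative job name
#SBATCH --ntasks=1                    # number of tasks to launch
#SBATCH --cpus-per-task=4             # CPU cores per task
#SBATCH --mem=8G                      # memory per node
#SBATCH --time=01:00:00               # wall-clock time limit (HH:MM:SS)

./my_program                          # placeholder for the actual work
</code>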

''%%srun%%'' can also run a command directly on specific nodes.  For example, here we ask for the five ''%%slurm%%'' nodes and run ''%%hostname%%'' on each:
  
<code>
pgh5a@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
slurm2
slurm3
slurm5
</code>
  
To cancel a running job, use ''%%scancel%%'' with the job ID reported by ''%%squeue%%'':

<code>
ktm5j@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    ktm5j  R       0:06      1 artemis1           <--  Running job
ktm5j@portal01 ~ $ scancel 467039
</code>
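
''%%scancel%%'' can also act on several jobs at once; for example, it accepts a ''%%-u%%'' flag to cancel every job belonging to a user (the username is illustrative):

<code>
scancel -u pgh5a     # cancel all of this user's jobs
</code>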
  
To run a job on a partition other than the default, pass the partition name to ''%%srun%%'' or ''%%sbatch%%'' with the ''%%-p%%'' option.  For example, to use the //gpu// partition, add:

<code>
-p gpu
</code>

An example running the command ''%%hostname%%'' on the //main// partition; this will run on any node in the partition:

<code>
srun -p main hostname
</code>
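
Jobs that actually need a GPU typically also request one explicitly.  A hedged sketch using the standard ''%%--gres%%'' option is below; the exact GRES names and counts depend on how each node is configured, so check before relying on them:

<code>
srun -p gpu --gres=gpu:1 nvidia-smi     # ask for one GPU on a gpu-partition node and show it
</code>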
  
==== Interactive Shell ====
  
We can use ''%%srun%%'' to spawn an interactive shell on a server controlled by the SLURM job scheduler.  This can be useful for debugging purposes as well as running jobs without using a job script. Creating an interactive session also reserves the node for your exclusive use.

To spawn a shell we must pass the ''%%--pty%%'' option to ''%%srun%%'' so output is directed to a pseudo-terminal:
  
<code>
pgh5a@portal ~$ srun -w slurm1 --pty bash -i -l -
pgh5a@slurm1 ~$ hostname
slurm1
pgh5a@slurm1 ~$
</code>
  
The ''%%-i%%'' argument tells ''%%bash%%'' to run as interactive.  The ''%%-l%%'' argument instructs bash that this is a login shell; this, along with the final ''%%-%%'', is important to reset environment variables that could otherwise cause issues when using [[linux_environment_modules|Environment Modules]].

If a node is in a partition other than the default "main" partition (for example, the "gpu" partition), then you must specify the partition in your command, for example:

<code>
pgh5a@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>
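
Because an interactive session reserves the node until you leave, exit the shell when you are finished so the scheduler can give the node to someone else.  You can also ask for specific resources when starting the session; a brief sketch (the values and prompts are placeholders):

<code>
pgh5a@portal ~$ srun -p main --mem=8G --cpus-per-task=2 --pty bash -i -l -
pgh5a@slurm1 ~$ exit                # leaving the shell ends the job and frees the node
</code>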

===== Reservations =====

Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>.  For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].

===== Note on Modules in Slurm =====

Due to the way sbatch spawns a bash session (non-login session), some init files are not loaded from ''%%/etc/profile.d%%''.  This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system and will prevent you from loading software modules.

To fix this, simply include the following line in your sbatch scripts:

<code bash>
source /etc/profile.d/modules.sh
</code>
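
Putting it together, a job script that needs a software module might look like the sketch below.  The module name is purely illustrative; substitute one that ''%%module avail%%'' actually lists:

<code bash>
#!/bin/bash
#SBATCH --error="my_job.err"
#SBATCH --output="my_job.output"

source /etc/profile.d/modules.sh    # make the module command available in the batch shell
module load gcc                     # illustrative module name
./my_program                        # placeholder for the real work
</code>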