==== Scheduling a Job using the SLURM job scheduler ====
  
{{ ::introtoslurm.pdf | Intro to SLURM slides }}
The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]].  The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue. The job scheduler looks at the requirements stated in the job's script and allocates to the job a server (or servers) which matches the requirements specified in the job script. For example, if the job script specifies that a job needs 192GB of memory, the job scheduler will find a server with at least that much memory free.
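
For instance, such a memory requirement is expressed in the job script with a directive like the following (an illustrative sketch using standard Slurm syntax, not taken from a specific script on this page):

<code>
#SBATCH --mem=192G     # request 192GB of memory; the job is only placed on a node that can provide it
</code>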
  
=== Using SLURM ===
**[[https://slurm.schedmd.com/pdfs/summary.pdf|Slurm Commands Cheat Sheet]]**
  
=== Information Gathering ===
  
To view information about compute nodes in the SLURM system, use the command ''%%sinfo%%''.
  
<code>
[abc1de@portal04 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3  drain falcon[3-5]
</code>

To see jobs that are currently queued or running, use the command ''%%squeue%%'':
  
<code>
abc1de@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER    ST     TIME  NODES NODELIST(REASON)
            467039      main    my_job   abc1de    R     0:06      1 artemis1
</code>
  
''%%sinfo%%'' also shows which nodes are idle and therefore available to accept jobs:
  
<code>
abc1de@portal01 ~ $ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*            up   infinite     37   idle hermes[1-4],artemis[2-7],slurm[1-5],nibbler[1-4],trillian[1-3],granger[1-6],granger[7-8],ai0[1-6]
</code>
  
=== Jobs ===
  
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller will then send your job to compute nodes for execution, after which time your results will be returned.
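
Jobs are described in a batch script: a shell script whose ''%%#SBATCH%%'' comment lines set the job's options. Below is a minimal sketch of such a script, here called ''%%test.sh%%''; the shebang, job name, output file, and final ''%%hostname%%'' command are illustrative and should be adjusted to suit your job.

<code>
#!/bin/bash
#SBATCH --job-name="my_job"                     # Name shown in squeue (illustrative)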
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=abc1de@virginia.edu
#
#SBATCH --error="my_job.err"                    # Where to write std err
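#SBATCH --output="my_job.output"                # Where to write std out (illustrative)

hostname
</code>

Make the script executable and submit it with ''%%sbatch%%''; the job ID is printed on submission, and the output files appear once the job completes: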
  
<code>
abc1de@portal01 ~ $ cd slurm-test/
abc1de@portal01 ~/slurm-test $ chmod +x test.sh
abc1de@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
abc1de@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
abc1de@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
You can also run a command directly on compute nodes with ''%%srun%%''. For example, to run ''%%hostname%%'' on five of the ''%%slurm%%'' nodes:
  
<code>
abc1de@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
slurm5
</code>

=== Direct login to servers (without a job script) ===
  
You can use ''%%srun%%'' to log in directly to a server controlled by the SLURM job scheduler.  This can be useful for debugging purposes as well as for running your applications without using a job script. Directly logging in also reserves the node for your exclusive use.

To spawn a shell, pass the ''%%--pty%%'' option to ''%%srun%%'' so that output is directed to a pseudo-terminal:

<code>
abc1de@portal ~$ srun -w cortado04 --pty bash -i -l -
abc1de@cortado04 ~$ hostname
cortado04
abc1de@cortado04 ~$
</code>

The ''%%-w%%'' argument selects the server to log in to. The ''%%-i%%'' argument tells ''%%bash%%'' to run as an interactive shell.  The ''%%-l%%'' argument instructs bash that this is a login shell; this, along with the final ''%%-%%'', is important to reset environment variables that might otherwise cause issues when using [[linux_environment_modules|Environment Modules]].

If a node is in a partition (see below for partition information) other than the default "main" partition (for example, the "gpu" partition), then you //must// specify the partition in your command, for example:

<code>
abc1de@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>

=== Terminating Jobs ===
  
Please be aware of jobs you start and make sure that they finish executing.  If your job does not exit gracefully, it will continue running on the server, taking up resources and preventing others from running their jobs.
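
To cancel a job, pass its job ID (as shown by ''%%squeue%%'') to ''%%scancel%%''. For example, using the job ID from the earlier ''%%squeue%%'' output (your job ID will differ):

<code>
abc1de@portal01 ~ $ scancel 467039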
</code>
  
The default signal sent to a running job is SIGTERM (terminate). If you wish to send a different signal to the job's processes (for example, a SIGKILL, which is often needed if a SIGTERM doesn't terminate the process), use the ''%%-s%%'' argument to ''%%scancel%%'', for example:

<code>
abc1de@portal01 ~ $ scancel --signal=KILL 467039
</code>

=== Queues/Partitions ===
  
Slurm refers to job queues as //partitions//.  We group similar systems into separate queues. For example, there is a "main" queue for general purpose systems, and a "gpu" queue for systems with GPUs. These queues can have unique constraints such as compute nodes, max runtime, resource limits, etc.
  
A partition is specified with ''%%-p partname%%'' or ''%%--partition partname%%''.
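
For example, to run a command in the ''%%main%%'' partition explicitly (an illustrative command; in a batch script the equivalent directive is ''%%#SBATCH --partition=main%%''):

<code>
abc1de@portal01 ~ $ srun -p main hostname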
</code>
  
=== Using GPUs ===
  
Slurm handles GPUs and other non-CPU computing resources using what are called [[https://slurm.schedmd.com/gres.html|GRES]] (Generic RESources).  To use the GPU(s) on a system, with either ''%%sbatch%%'' or ''%%srun%%'', you must request them with the ''%%--gres%%'' option, giving the resource name followed by ''%%:%%'' and the quantity, e.g. ''%%--gres=gpu:1%%''.
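
For example, to request a single GPU in the ''%%gpu%%'' partition for an interactive command (an illustrative sketch; ''%%nvidia-smi%%'' is assumed to be available on the GPU nodes):

<code>
abc1de@portal01 ~ $ srun -p gpu --gres=gpu:1 nvidia-smi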
</code>
  
=== Reservations ===
  
Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>.  For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].
  
=== Note on Modules in Slurm ===
  
Due to the way sbatch spawns a bash session (non-login session), some init files are not loaded from ''%%/etc/profile.d%%''.  This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system and will prevent you from loading software modules.
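
A minimal sketch of one common workaround is to initialize Environment Modules yourself at the top of the batch script; the init script path and module name below are assumptions, not site-specific values:

<code>
#!/bin/bash
# Manually initialize Environment Modules inside the sbatch session
# (conventional init script location; it may differ on this system)
source /etc/profile.d/modules.sh
module load gcc        # example module name (assumed)
</code>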