**[[https://slurm.schedmd.com/pdfs/summary.pdf|Slurm Commands Cheat Sheet]]**
The SLURM commands below are ONLY available on the portal cluster of servers. They are not installed on the gpusrv* servers or on the SLURM-controlled nodes themselves.
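As a quick sanity check (an illustrative session, not real output), log in to a portal node first and confirm the commands are available; ''%%--version%%'' is a standard flag on all SLURM commands:

<code>
abc1de@mymachine ~ $ ssh abc1de@portal.cs.virginia.edu
abc1de@portal01 ~ $ squeue --version
slurm x.y.z
</code>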
  
=== Information Gathering ===
<code>
abc1de@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER    ST     TIME  NODES NODELIST(REASON)
            467039      main    my_job    abc1de  R      0:06      1 artemis1
</code>
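Another useful information-gathering command is ''%%sinfo%%'', which lists the partitions and the state of their nodes. The partitions and node names below are illustrative; run it yourself to see the current layout:

<code>
abc1de@portal01 ~ $ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3   idle slurm[1,3,5]
gpu          up   infinite      1   idle lynx05
</code>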
  
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller then sends your job to compute nodes for execution and returns your results when the job finishes.
  
Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''.  You can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]].
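The simplest use of ''%%srun%%'' is to run a single command directly on a compute node; the hostname printed below is just an example:

<code>
abc1de@portal01 ~ $ srun hostname
cortado06
</code>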
  
Let's look at a very simple example script and ''%%sbatch%%'' command.  Here is our script; all it does is print the hostname of the server running the script.  We must add ''%%SBATCH%%'' options to our script to handle various SLURM options.
<code bash>
#!/bin/bash
# --- this job will be run on any available node
# and simply output the node's hostname to
# my_job.output
#SBATCH --job-name="Slurm Simple Test Job"
#SBATCH --error="my_job.err"
#SBATCH --output="my_job.output"
echo "$HOSTNAME"
</code>
  
We run the script with ''%%sbatch%%'' and the results will be put in the file we specified with ''%%--output%%''.  If no output file is specified, output will be saved to a file named after the SLURM jobid (''%%slurm-<jobid>.out%%'').
  
<code>
[abc1de@portal04 ~]$ sbatch slurm.test
Submitted batch job 640768
[abc1de@portal04 ~]$ more my_job.output
cortado06
</code>
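While the job is queued or running, you can watch just your own jobs by passing your userid to ''%%squeue%%'' with the ''%%-u%%'' flag (the output below is illustrative; note that squeue truncates long job names):

<code>
[abc1de@portal04 ~]$ squeue -u abc1de
             JOBID PARTITION     NAME     USER    ST     TIME  NODES NODELIST(REASON)
            640768      main Slurm Si    abc1de  R      0:02      1 cortado06
</code>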
  
slurm3
slurm5
</code>
  
=== Direct login to servers (without a job script) ===
  
You can use ''%%srun%%'' to log in directly to a server controlled by the SLURM job scheduler.  This can be useful for debugging purposes as well as for running your applications without using a job script.  This feature also reserves the server for your exclusive use.
  
We must pass the ''%%--pty%%'' option to ''%%srun%%'' so output is directed to a pseudo-terminal:
  
<code>
abc1de@portal ~$ srun -w slurm3 --pty bash -i -l -
abc1de@slurm3 ~$
</code>
The ''%%-w%%'' argument selects the server into which to log in.  The ''%%-i%%'' argument tells ''%%bash%%'' to run as an interactive shell.  The ''%%-l%%'' argument instructs bash that this is a login shell; this, along with the final ''%%-%%'', is important to reset environment variables that might otherwise cause issues when using [[linux_environment_modules|Environment Modules]].
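Because a direct login reserves the node exclusively, release it as soon as you are done by exiting the shell (a sketch of the expected session):

<code>
abc1de@slurm3 ~$ exit
logout
abc1de@portal ~$
</code>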
  
If a node is in a partition (see below for partition information) other than the default "main" partition (for example, the "gpu" partition), then you //must// specify the partition in your command, for example:
<code>
abc1de@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>
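If the GPUs on a node are managed as a consumable resource, you may also need to request them explicitly with ''%%--gres%%''; whether this is required, and how many GPUs to request, depends on the cluster's configuration (the count below is just an example):

<code>
abc1de@portal ~$ srun -w lynx05 -p gpu --gres=gpu:1 --pty bash -i -l -
</code>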

=== Terminating Jobs ===

Please be aware of jobs you start and make sure that they finish executing.  If your job does not exit gracefully, it will continue running on the server, taking up resources and preventing others from running their jobs.

To cancel a running job, use the ''%%scancel [jobid]%%'' command:

<code>
abc1de@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    abc1de  R       0:06      1 artemis1           <--  Running job
abc1de@portal01 ~ $ scancel 467039
</code>

The default signal sent to a running job is SIGTERM (terminate).  If you wish to send a different signal to the job's processes (for example, a SIGKILL, which is often needed if a SIGTERM doesn't terminate the process), use the ''%%--signal%%'' argument to ''%%scancel%%'', i.e.:

<code>
abc1de@portal01 ~ $ scancel --signal=KILL 467039
</code>
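''%%scancel%%'' can also select jobs by attribute instead of by jobid; for example, the ''%%-u%%'' flag cancels all of your own running and pending jobs at once (use with care):

<code>
abc1de@portal01 ~ $ scancel -u abc1de
</code>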
  
=== Queues/Partitions ===