</code>
  
With ''%%sinfo%%'' we can see a listing of the job queues or "partitions" and a list of nodes associated with these partitions.  A partition is a grouping of nodes; for example, our //main// partition is a group of all general purpose nodes, and the //gpu// partition is a group of nodes that each contain GPUs.  Sometimes hosts can be listed in two or more partitions.
  
To view jobs running on the queue, we can use the command ''%%squeue%%''.  Say we have submitted one job to the main partition; running ''%%squeue%%'' will look like this:
pgh5a@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    my_job    pgh5a  R       0:06      1 artemis1
</code>
  
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller.  The controller will then send your job to compute nodes for execution, after which your results will be returned.
  
Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''.  From a shell, you can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]].  Let's look at a very simple example script and ''%%sbatch%%'' command.
  
Here is our script; all it does is print the hostname of the server running the script.  We must add ''%%SBATCH%%'' options to our script to handle various SLURM options.
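A minimal sketch of what such a ''%%test.sh%%'' might look like (the ''%%SBATCH%%'' option values here are assumptions chosen to match the job name and output files shown in this example):

<code>
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown by squeue (assumed)
#SBATCH --output=my_job.output   # file for the job's stdout (assumed)
#SBATCH --error=my_job.err       # file for the job's stderr (assumed)

# Print the hostname of the compute node running this job
hostname
</code>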
  
<code>
pgh5a@portal01 ~ $ cd slurm-test/
pgh5a@portal01 ~/slurm-test $ chmod +x test.sh
pgh5a@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
pgh5a@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
pgh5a@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
  
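We can also use ''%%srun%%'' to run a command directly on specific nodes.  In the session below, ''%%-w%%'' selects the node list and ''%%-N%%'' the number of nodes: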
<code>
pgh5a@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
==== Terminating Jobs ====
  
Please be aware of jobs you start and make sure that they finish executing.  If your job does not exit gracefully, it will continue running on the server, taking up resources and preventing others from running their jobs.
  
To cancel a running job, use the ''%%scancel [jobid]%%'' command:
  
<code>
ktm5j@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    ktm5j  R       0:06      1 artemis1           <--  Running job
ktm5j@portal01 ~ $ scancel 467039
</code>
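You can also cancel all of your own jobs at once by passing your userid to ''%%scancel%%''.  This uses the standard ''%%--user%%'' option; the session below is a sketch:

<code>
ktm5j@portal01 ~ $ scancel -u ktm5j
</code>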
  
</code>
  
An example running the command ''%%hostname%%'' on the //main// partition; this will run on any node in the partition:
  
<code>
srun -p main hostname
</code>
  