{{ ::introtoslurm.pdf | Intro to SLURM slides }}
The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]]. The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue. The scheduler looks at the requirements stated in the job's script and allocates the job a server (or servers) that matches those requirements. For example, if the job script specifies that a job needs 192GB of memory, the scheduler will find a server with at least that much memory free.
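Resource requirements like this are expressed as ''%%#SBATCH%%'' directives at the top of the job script. A minimal sketch (the program name is only a placeholder):

<code bash>
#!/bin/bash
#SBATCH --mem=192G    # request a node with at least 192GB of memory available

./my_program          # placeholder for the program the job actually runs
</code>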
===== Using SLURM =====
To see the state of the cluster's partitions and nodes, run the ''%%sinfo%%'' command:
<code>
[pgh5a@portal04 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3  drain falcon[3-5]
main*        up   infinite     27   idle cortado[01-10],falcon[1-2,6-10],lynx[08-12],slurm[1-5]
gpu          up   infinite      4    mix ai[02-03,05],lynx07
gpu          up   infinite      1  alloc ai04
gpu          up   infinite     12   idle ai[01,06],lynx[01-06],ristretto[01-04]
</code>
With ''%%sinfo%%'' we can see a listing of the job queues, or "partitions", and the nodes associated with each partition. A partition is a grouping of nodes; for example, our //main// partition contains the general purpose nodes, while the //gpu// partition contains the nodes that have GPUs. A host can appear in two or more partitions.
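To look at a single partition, you can pass its name to ''%%sinfo%%'':

<code>
sinfo -p gpu
</code>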
To view the jobs running in the queue, we can use the command ''%%squeue%%''. Say we have submitted one job to the main partition; running ''%%squeue%%'' will look like this:
<code>
pgh5a@portal01 ~ $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
467039 main my_job pgh5a R 0:06 1 artemis1
</code>
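To list only your own jobs instead of the whole queue, you can filter ''%%squeue%%'' by user:

<code>
squeue -u pgh5a
</code>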
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller. The controller will then send your job to compute nodes for execution and return your results when it finishes.

Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''. From a shell, you can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]]. Let's look at a very simple example script and ''%%sbatch%%'' command.

Here is our script; all it does is print the hostname of the server running the script. We must add ''%%SBATCH%%'' options to the script to handle various SLURM options.
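A minimal sketch of ''%%test.sh%%'' is shown below; the output and error file names match the files created in the session that follows, but the exact ''%%SBATCH%%'' directives in your own script may differ:

<code bash>
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in the squeue NAME column
#SBATCH --output=my_job.output   # file that receives standard output
#SBATCH --error=my_job.err       # file that receives standard error

hostname                         # print the hostname of the compute node running the job
</code>

Make the script executable and submit it with ''%%sbatch%%'':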
<code>
pgh5a@portal01 ~ $ cd slurm-test/
pgh5a@portal01 ~/slurm-test $ chmod +x test.sh
pgh5a@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
pgh5a@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
pgh5a@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
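While a submitted job is pending or running, you can check on it using the job id that ''%%sbatch%%'' printed:

<code>
squeue -j 466977
</code>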
You can also use ''%%srun%%'' to run a command directly on a set of nodes. For example, to run ''%%hostname%%'' on the five nodes ''%%slurm1%%'' through ''%%slurm5%%'':
<code>
pgh5a@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
</code>
==== Terminating Jobs ====
Please be aware of the jobs you start and make sure that they finish executing. If your job does not exit gracefully, it will continue running on the server, taking up resources and preventing others from running their jobs.

To cancel a running job, use the ''%%scancel [jobid]%%'' command:
<code>
ktm5j@portal01 ~ $ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
467039 main sleep ktm5j R 0:06 1 artemis1    <-- Running job
ktm5j@portal01 ~ $ scancel 467039
</code>
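''%%scancel%%'' can also cancel every job belonging to a user at once, which is useful if you have submitted many jobs by mistake:

<code>
scancel -u ktm5j
</code>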
An example running the command ''%%hostname%%'' on the //main// partition; this will run on any node in the partition:
<code>
srun -p main hostname
</code>
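The partition option can be combined with other resource requests in the same way; the values below are only an illustration and should be adjusted to your job:

<code>
srun -p main -c 4 --mem=8G hostname
</code>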
==== Interactive Shell ====
We can use ''%%srun%%'' to spawn an interactive shell on a server controlled by the SLURM job scheduler. This can be useful for debugging, as well as for running jobs without a job script. Creating an interactive session also reserves the node for your exclusive use.

To spawn a shell, we must pass the ''%%--pty%%'' option to ''%%srun%%'' so that output is directed to a pseudo-terminal:
<code>
pgh5a@portal ~$ srun -w slurm1 --pty bash -i -l -
pgh5a@slurm1 ~$ hostname
slurm1
pgh5a@slurm1 ~$
</code>
If a node is in a partition other than the default "main" partition (for example, the "gpu" partition), then you must specify the partition in your command, for example:
<code>
pgh5a@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>
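If the interactive session needs a GPU, you will typically also have to request one explicitly. A sketch, assuming the GPUs are exposed as a generic resource named ''%%gpu%%'':

<code>
srun -p gpu --gres=gpu:1 --pty bash -i -l -
</code>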
==== Reservations ====
Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>. For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].
+ | |||
+ | ===== Note on Modules in Slurm ===== | ||
+ | |||
+ | Due to the way sbatch spawns a bash session (non-login session), some init files are not loaded from ''%%/etc/profile.d%%''. This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system and will prevent you from loading software modules. | ||
+ | |||
+ | To fix this, simply include the following line in your sbatch scripts: | ||
+ | |||
+ | <code bash> | ||
+ | source /etc/profile.d/modules.sh | ||
+ | </code> | ||
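For example, a batch script that loads a module before running its program could look like the following sketch; ''%%example-module%%'' and ''%%my_program%%'' are placeholders for whatever your job actually needs:

<code bash>
#!/bin/bash
#SBATCH --job-name=module_example
#SBATCH --output=module_example.output

source /etc/profile.d/modules.sh   # initialize the Environment Modules system
module load example-module         # placeholder module name
./my_program                       # placeholder program
</code>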