{{ ::introtoslurm.pdf | Intro to SLURM slides }}
The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]]. The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue. The scheduler looks at the requirements stated in a job's script and allocates a server (or servers) that matches them. For example, if the job script specifies that a job needs 192GB of memory, the scheduler will find a server with at least that much memory free.
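A requirement like that is expressed with SLURM's standard ''%%--mem%%'' option in the job script. The sketch below is only an illustration; the output file name and the command to run are placeholders:

<code bash>
#!/bin/bash
#SBATCH --mem=192G                  # ask for a node with at least 192GB of free memory
#SBATCH --output="mem_test.output"  # where to write std out (placeholder name)

./my_program                        # placeholder for the actual workload
</code>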
===== Using SLURM =====
The ''%%sinfo%%'' command shows the partitions and nodes managed by SLURM:
<code>
[pgh5a@portal04 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      3  drain falcon[3-5]
main*        up   infinite     27   idle cortado[01-10],falcon[1-2,6-10],lynx[08-12],slurm[1-5]
gpu          up   infinite      4    mix ai[02-03,05],lynx07
gpu          up   infinite      1  alloc ai04
gpu          up   infinite     12   idle ai[01,06],lynx[01-06],ristretto[01-04]
</code>
With ''%%sinfo%%'' we can see a listing of the job queues or "partitions" and the nodes associated with each. A partition is a grouping of nodes; for example, our //main// partition is a group of all SLURM nodes that are not reserved and can be used by anyone, and the //gpu// partition is a group of nodes that each contain GPUs. A host can be listed in two or more partitions.
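''%%sinfo%%'' can also be limited to a single partition with the ''%%-p%%'' option, and ''%%-N%%'' prints one line per node instead of one line per state. A quick sketch (the output will vary with cluster load):

<code bash>
sinfo -p gpu       # show only the gpu partition
sinfo -N -p gpu    # same partition, listed node by node
</code>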
To view jobs running on the queue, we can use the command ''%%squeue%%''. Say we have submitted one job to the main partition; running ''%%squeue%%'' will look like this:
<code>
pgh5a@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    pgh5a  R       0:06      1 artemis1
</code>
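On a busy cluster the full queue can be long, so it is often handy to filter ''%%squeue%%'' by user (the username here is just an example):

<code bash>
squeue -u pgh5a    # list only jobs belonging to user pgh5a
</code>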
While that job is running, ''%%sinfo%%'' no longer lists the node it occupies (here ''%%artemis1%%'') as idle:
<code>
pgh5a@portal01 ~ $ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite     37   idle hermes[1-4],artemis[2-7],slurm[1-5],nibbler[1-4],trillian[1-3],granger[1-6],granger[7-8],ai0[1-6]
</code>
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller. The controller will then send your job to compute nodes for execution, after which time your results will be returned.
Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''. From a shell, you can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]]. Let's look at a very simple example script and ''%%sbatch%%'' command.
Here is our script; all it does is print the hostname of the server running the script. We must add ''%%SBATCH%%'' options to our script to handle various SLURM options.
<code bash>
#!/bin/bash
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=pgh5a@virginia.edu
#
#SBATCH --error="my_job.err"        # Where to write std err
#SBATCH --output="my_job.output"    # Where to write std out

hostname                            # Print the hostname of the node running this job
</code>

Make the script executable and submit it with ''%%sbatch%%'':
<code>
pgh5a@portal01 ~ $ cd slurm-test/
pgh5a@portal01 ~/slurm-test $ chmod +x test.sh
pgh5a@portal01 ~/slurm-test $ sbatch test.sh
Submitted batch job 466977
pgh5a@portal01 ~/slurm-test $ ls
my_job.err  my_job.output  test.sh
pgh5a@portal01 ~/slurm-test $ cat my_job.output
slurm1
</code>
We can also use ''%%srun%%'' to run a command directly on compute nodes. Here we run ''%%hostname%%'' on five of the ''%%slurm%%'' nodes at once:
<code>
pgh5a@portal01 ~ $ srun -w slurm[1-5] -N5 hostname
slurm4
slurm1
slurm3
slurm2
slurm5
</code>

To cancel a running job, use ''%%scancel%%'' with the job ID, which you can find with ''%%squeue%%'':
<code>
ktm5j@portal01 ~ $ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            467039      main    sleep    ktm5j  R       0:06      1 artemis1    <-- Running job
ktm5j@portal01 ~ $ scancel 467039
</code>
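''%%scancel%%'' also accepts a user filter, which cancels every job you have queued or running (the username here is just an example):

<code bash>
scancel -u pgh5a    # cancel all jobs owned by user pgh5a
</code>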
An example running the command ''%%hostname%%'' on the //main// partition; this will run on any node in the partition:
<code>
srun -p main hostname
</code>
==== Interactive Shell ====
We can use ''%%srun%%'' to spawn an interactive shell on a server controlled by the SLURM job scheduler. This can be useful for debugging purposes, as well as for running jobs without a job script. Creating an interactive session also reserves the node for your exclusive use.

To spawn a shell we must pass the ''%%--pty%%'' option to ''%%srun%%'' so output is directed to a pseudo-terminal:
<code>
pgh5a@portal ~$ srun -w slurm1 --pty bash -i -l -
pgh5a@slurm1 ~$ hostname
slurm1
pgh5a@slurm1 ~$
</code>
If a node is in a partition other than the default "main" partition (for example, the "gpu" partition), then you must specify the partition in your command:
<code>
pgh5a@portal ~$ srun -w lynx05 -p gpu --pty bash -i -l -
</code>
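If the job actually needs a GPU (and not just a node in the //gpu// partition), SLURM's generic resource option can be added to the same command. This is only a sketch; the number of GPUs you request depends on your job:

<code bash>
srun -p gpu --gres=gpu:1 --pty bash -i -l -    # interactive shell with one GPU allocated
</code>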
==== Reservations ====
Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>. For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].
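Once a reservation has been created for you, jobs can request it with the standard ''%%--reservation%%'' option to ''%%sbatch%%'' or ''%%srun%%'' (the reservation name below is a placeholder):

<code bash>
sbatch --reservation=my_reservation test.sh             # batch job inside the reservation
srun --reservation=my_reservation --pty bash -i -l -    # interactive shell inside the reservation
</code>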

===== Note on Modules in Slurm =====

Due to the way sbatch spawns a bash session (non-login session), some init files are not loaded from ''%%/etc/profile.d%%''. This prevents the initialization of the [[linux_environment_modules|Environment Modules]] system and will prevent you from loading software modules.

To fix this, simply include the following line in your sbatch scripts:

<code bash>
source /etc/profile.d/modules.sh
</code>
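For example, an sbatch script that uses Environment Modules might look like the sketch below; the module name is only an illustration, so substitute one listed by ''%%module avail%%'':

<code bash>
#!/bin/bash
#SBATCH --output="module_test.output"

# Initialize Environment Modules (not loaded automatically in non-login batch sessions)
source /etc/profile.d/modules.sh

# Load the software the job needs (example module name; check "module avail")
module load gcc

gcc --version
</code>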