The Computer Science Department uses a "job scheduler" called [[https://en.wikipedia.org/wiki/Slurm_Workload_Manager|SLURM]]. The purpose of a job scheduler is to allocate computational resources (servers) to users who submit "jobs" to a queue. The scheduler reads the requirements stated in the job script and allocates a server (or servers) that matches them. For example, if the job script specifies that a job needs 192GB of memory, the scheduler will find a server with at least that much memory free.
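For instance, a memory requirement like that is stated with an ''%%SBATCH%%'' directive at the top of the job script. The lines below are only a sketch; ''%%my_program%%'' is a placeholder for your own executable.
<code bash>
#!/bin/bash
#SBATCH --mem=192G    # ask the scheduler for a node with 192GB of memory available

./my_program          # placeholder for your own program
</code>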

The job scheduler supports a direct login option (see below) that allows interactive logins to servers controlled by the scheduler, without the need for a job script.

As of 10-Jan-2022, the SLURM job scheduler was updated to the latest revision, which enforces memory limits. As a result, jobs that exceed their requested memory size will be terminated by the scheduler.

As of 02-Apr-2022, the time limit enforcement policy within SLURM has changed. All jobs submitted with time limits are extended 60 minutes past the user-submitted time limit. For example, if a user submits a job with a time limit of 10 minutes using the parameter "-t 10", SLURM will kill the job after 70 minutes.
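To make the time limit arithmetic concrete, the sketch below submits a hypothetical script ''%%job.sh%%'' with a 10-minute limit; per the policy above, SLURM would kill the job after 70 minutes at the latest.
<code bash>
# -t 10 requests a 10-minute time limit; with the 60-minute extension described
# above, the job is killed after 70 minutes if it is still running
sbatch -t 10 job.sh
</code>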
=== Using SLURM ===
**[[https://slurm.schedmd.com/pdfs/summary.pdf|Slurm Commands Cheat Sheet]]**

The SLURM commands below are ONLY available on the portal cluster of servers. They are not installed on the gpusrv* or the SLURM-controlled nodes themselves.
=== Information Gathering ===
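A generic starting point (a sketch only; see the cheat sheet above for the full command set): ''%%sinfo%%'' shows the partitions and nodes the scheduler manages, and ''%%squeue%%'' shows queued and running jobs. The username ''%%abc1de%%'' below is just the example account used elsewhere on this page.
<code bash>
# list partitions/nodes and their current state
sinfo

# list queued and running jobs; -u restricts the output to one user
squeue -u abc1de
</code>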
=== Jobs ===
To use SLURM resources, you must submit your jobs (program/script/etc.) to the SLURM controller. The controller will then send your job to compute nodes for execution, after which your results will be returned. There is also a //direct login option// (see below) that doesn't require a job script.

Users can submit SLURM jobs from ''%%portal.cs.virginia.edu%%''. You can submit jobs using the commands [[https://slurm.schedmd.com/srun.html|srun]] or [[https://slurm.schedmd.com/sbatch.html|sbatch]]. Let's look at a very simple example script and ''%%sbatch%%'' command.

Here is our script; all it does is print the hostname of the server running the script. We must add ''%%SBATCH%%'' options to our script to handle various SLURM options.
<code bash>
#!/bin/bash
# --- this job will be run on any available node
# and simply output the node's hostname to
# my_job.output
#SBATCH --job-name="Slurm Simple Test Job"
#SBATCH --error="my_job.err"
#SBATCH --output="my_job.output"
echo "$HOSTNAME"
</code>
We run the script with ''%%sbatch%%'', and the results will be put in the file we specified with ''%%--output%%''. If no output file is specified, output will be saved to a file named after the SLURM job id.
<code>
[abc1de@portal04 ~]$ sbatch slurm.test
Submitted batch job 640768
[abc1de@portal04 ~]$ more my_job.output
cortado06
</code>
=== Direct login to servers (without a job script) ===
You can use ''%%srun%%'' to log in directly to a server controlled by the SLURM job scheduler. This can be useful for debugging purposes as well as for running your applications without using a job script. This feature also reserves the server for your exclusive use.

We must pass the ''%%--pty%%'' option to ''%%srun%%'' so output is directed to a pseudo-terminal.
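As a minimal sketch (the node name ''%%cortado06%%'' is only a placeholder; choose whichever node you actually need), a direct login looks something like this:
<code bash>
# request an interactive shell on a specific node:
#   -w / --nodelist picks the node, --pty attaches your terminal,
#   and "bash -i -l" starts an interactive login shell there
srun -w cortado06 --pty bash -i -l
</code>
Exiting the shell releases the node back to the scheduler.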
To cancel a running job, pass the job id shown by ''%%squeue%%'' to ''%%scancel%%'':
<code>
 467039      main    sleep   abc1de  R       0:06      1 artemis1 <-- Running job
abc1de@portal01 ~ $ scancel 467039
</code>

The default signal sent to a running job is SIGTERM (terminate). If you wish to send a different signal to the job's processes (for example a SIGKILL, which is often needed if a SIGTERM doesn't terminate the process), use the ''%%--signal%%'' argument to ''%%scancel%%'', e.g.:
<code>
abc1de@portal01 ~ $ scancel --signal=KILL 467039
</code>
Reservations for specific resources or nodes can be made by submitting a request to <cshelpdesk@virginia.edu>. For more information about using reservations, see the main article on [[compute_slurm_reservations|SLURM Reservations]].

=== Job Accounting ===
The SLURM scheduler has accounting enabled, so users can run the ''%%sacct%%'' command to view job accounting information such as job ids, job names, the partition a job ran on, allocated CPUs, job state, and exit codes. Many other options are supported; type ''%%man sacct%%'' on portal to see them all.
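For example, the sketch below queries a single job by id (the id is just the one from the ''%%sbatch%%'' example above; substitute your own) and selects a few common columns:
<code bash>
# show accounting fields for one job; --format selects the columns to display
sacct -j 640768 --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode
</code>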
=== Note on Modules in Slurm ===