SLURM

The department clusters are managed using the SLURM batch scheduling system. Jobs are submitted to the cluster(s) by creating a SLURM job command file that specifies certain attributes of the job, such as how long the job is expected to run and how many nodes of the cluster are needed (e.g. for parallel programs). SLURM then schedules when the job is to start running on the cluster (based in part on those attributes), runs and monitors the job at the scheduled time, and returns any output to the user once the job completes.

Instructions below are given for using the UVA CS Department systems, but the Official SLURM Documentation has a quickstart guide which is a good place to start.

When to use SLURM: Whenever you need to run a job that will last more than a few hours, consume more than 16GB of RAM, or use more than two cores, you should be using SLURM. HINT: if you would otherwise invoke the job under a SCREEN session, it's likely the job is appropriate for the compute cluster.

Accessing the Cluster(s)

Access to each cluster is restricted through the scheduler, and jobs are submitted via front-end nodes. You cannot directly log onto a cluster node because resources are scheduled by SLURM for users. This way jobs don't compete for resources and run more efficiently.

Front Ends

The general purpose front-end nodes are power[1-6].cs.virginia.edu; these interactive servers can be used for running any general interactive programs. They have the same software environment as the compute nodes and are the appropriate place to build and debug your job(s) before submitting large runs of them on the compute nodes. The common environment includes storage (/bigtemp, /lustre, /localtmp and /home) and all packages and libraries. If your job will not run on a power node, please email root to ask us to add or fix packages.

There are also three nodes equipped with CUDA-capable cards (cuda[1-3].cs.virginia.edu). If you have a GPU (Nvidia) job, you should log onto one of those and use it to debug your job/code before submitting the job to the CUDA-equipped compute nodes.

All of these can be reached using the NoMachine NX client or SSH (bundled with most *nix distros and macOS; SecureCRT is an option on Windows).

If your jobs are failing and exiting suddenly, please debug the actual job outside of the job script by running your program on an interactive front-end node! Send any error output (for example, messages about incorrect or missing libraries) to root@cs.virginia.edu.

Compute Nodes

The individual nodes are lumped together into "partitions" of nodes and are only accessible through the job scheduler - no direct login is permitted for users. This ensures that users actually get full use of the resources the scheduler assigns them, and it arbitrates conflicts over single-user resources like the GPUs.

User ID and authentication are handled auto-magically by MUNGE.

The main "partition" (collection of compute nodes) is open to all department users, with some soft resource limits. This includes a range of CPU types and memory size, so if you need to run your job on specific hardware, it's important to request the resources correctly.

The DEFAULT partition nodes range from relatively old quad-core Xeons to newer Phenom processors, and memory ranges from 32GB to 256GB. It is important to specify enough resources that your big-memory job doesn't wind up on a small-memory node (and get killed, or cause the OS to thrash); similarly, only ask for as much as you need so that your jobs can be scheduled sooner.

There are a handful of semi-private partitions which have ACL-type use restrictions.
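
To see which partitions are visible to your account, along with their limits and node counts, the standard sinfo command can be run from any front-end node. A quick sketch (the partition name and script name below are only placeholders):

sinfo -s                                    # one summary line per partition
sbatch --partition=SOME_PARTITION job.sh    # submit to a specific partition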

Partition Information

The default partition (group of machines) where users can run jobs has 9 nodes:

NodeName   CPUs                     RAM    /localtmp   GPU
hermes1    4 x 16c (64c) AMD 6276   256G   500GB       n/a
hermes2    4 x 16c (64c) AMD 6276   256G   500GB       n/a
hermes3    4 x 16c (64c) AMD 6276   256G   500GB       n/a
hermes4    4 x 16c (64c) AMD 6276   256G   500GB       n/a
artemis1   2 x 16c (32c) AMD 6276   128G   1TB         K20c (5GB)
artemis2   2 x 16c (32c) AMD 6276   128G   1TB         K20c (5GB)
artemis3   2 x 16c (32c) AMD 6276   128G   1TB         K20c (5GB)
artemis4   2 x 16c (32c) AMD 6276   128G   1TB         K20c (5GB)
artemis5   2 x 16c (32c) AMD 6276   128G   1TB         K20c (5GB)
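
To check the current resources and state of a node before requesting it, the standard sinfo and scontrol tools will report them; for example (hermes1 is taken from the table above):

sinfo -N -l                  # one line per node: CPUs, memory, state and partition
scontrol show node hermes1   # full detail for a single node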

Submitting Jobs

Jobs should be scheduled using the sbatch or srun commands, invoked from an interactive front-end. SLURM does a set-uid and fork to execute an srun job, not an entire shell, so srun should only be used directly for the simplest (single-step) jobs. Any job which involves multiple steps should be submitted using sbatch, with each job step run inside the sbatch script via srun.

Simple Job

The simplest jobs are the single-CPU commands you might otherwise run on an interactive node, but which need particularly large memory or a long run time. Rather than starting the job with the standard command line, you can invoke it with "srun", which requests the resources you want for your job and then launches it on the first available node that has them.

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; srun hostname
artemis1

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; srun --nodelist=hermes2 hostname
hermes2

Invoking a simple multi-step (multiple command) job via sbatch:

First, the sbatch script file - this is a shell script, so it requires a "shebang" at the beginning:

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; cat simple.sh 
#!/bin/bash
# This is a template for a simple SLURM sbatch job script file
#
# First, SBATCH options - these could be passed via the command line, but this
# is more convenient
#
#SBATCH --job-name="Slurm Simple Test Job" #Name of the job which appears in squeue
#
#SBATCH --mail-type=ALL #What notifications are sent by email
#SBATCH --mail-user=ruffner@cs.virginia.edu
#
# Set up your user environment!!
#SBATCH --get-user-env
#
#SBATCH --error="my_job.err"                    # Where to write std err
#SBATCH --output="my_job.output"                # Where to write stdout

srun hostname
srun pwd
srun df -h /localtmp
srun hostinfo

This file contains all the job running options. You can use this as a template for your job. Please change the email address.

Then run the job:

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; sbatch simple.sh 
Submitted batch job 76172

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; ls
my_job.err  my_job.output  simple.sh

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; cat my_job.output 
artemis1
/net/af13/jpr9c/work/slurm/test_jobs/simple
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       499G   70M  474G   1% /localtmp

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; cat my_job.err 
slurmd[artemis1]: execve(): hostinfo: No such file or directory
srun: error: artemis1: task 0: Exited with exit code 2
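
While a batch job is queued or running you can watch or cancel it with the standard SLURM commands below; these are not part of the walkthrough above, but the job ID (76172) is the one printed by sbatch:

squeue -u $USER          # list your pending and running jobs
scontrol show job 76172  # detailed information about one job
scancel 76172            # cancel the job if necessary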

Interactive Shell

The simplest job is an interactive shell; while this is appropriate and useful for debugging a job - one which presumably runs correctly on a power node - it is not appropriate for most work, because an idle shell still ties up the resources allocated to it. The "--pty" argument tells srun to run the task in pseudo-terminal mode, so input and output behave as they would in a regular terminal:

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/simple ; srun --pty bash -i
jpr9c@artemis1:~/work/slurm/test_jobs/simple$ 
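
An alternative not shown on this page is salloc, another standard SLURM command: it holds an allocation and drops you into a shell on the front end, and srun commands typed in that shell run as job steps on the allocated node. A minimal sketch (the resource values are just examples):

salloc -c 2 --mem=4096   # hold an allocation of 2 CPUs and 4096 MB
srun hostname            # job steps typed here run on the allocated node
exit                     # release the allocation when finished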

Requesting Resources

You may request specific resources from the nodes, or specific nodes to run your jobs on. The most common of these are RAM and nCPUs (for parallel jobs).

Memory, CPUs and multiple nodes

For more complex and parallel jobs, you may request multiple CPUs, multiple nodes and a minimum amount of RAM. A detailed listing of all the various flags for selecting resources is available using "man srun" or "man sbatch". Send mail to root if you're having trouble coming up with the right combination to get the resource allocation you need.

Minimum RAM

If you have a job working with large datasets, be sure to request a large enough minimum quantity of RAM to avoid heavy swapping. Two useful flags for this are:

--mem=<size> (minimum memory per node; a bare number is interpreted as megabytes)

--mem-per-cpu=<size> (minimum memory per allocated CPU; also megabytes by default)

jpr9c@power5
: /af13/jpr9c ; srun --mem=128 --pty bash -i 
jpr9c@artemis1:~$ free
             total       used       free     shared    buffers     cached
Mem:     132020192    1835088  130185104       9836     119444     480240
-/+ buffers/cache:    1235404  130784788
Swap:    250000380          0  250000380
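
The same request can go on an #SBATCH line in a batch script. A minimal sketch (the job name, output file and program name are placeholders):

#!/bin/bash
#SBATCH --job-name=big-mem-job     # placeholder name
#SBATCH --mem=65536                # 65536 MB = 64 GB minimum per node
#SBATCH --output=big-mem-job.out   # placeholder output file

srun ./my_big_memory_program       # placeholder executable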

nCPUS

If your job runs in parallel (for example, MPI), you may wish to use several CPUs while binding your job to a single node, but you may not want all of the cores to come from the same CPU (socket):

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/mpi ; srun -n 5 -c 5 --cpus-per-task 1 --pty bash -i
jpr9c@artemis1:~/work/slurm/test_jobs/mpi$ set | grep SLURM | more
SLURMD_NODENAME=artemis1
SLURM_CPUS_PER_TASK=1
SLURM_NODELIST=artemis1
SLURM_NPROCS=5
SLURM_NTASKS=5
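
For batch submission the equivalent flags go on #SBATCH lines; a minimal sketch (the output file and program name are placeholders):

#!/bin/bash
#SBATCH --ntasks=5            # five tasks
#SBATCH --cpus-per-task=4     # four cores per task (e.g. for threaded code)
#SBATCH --output=cpus.out     # placeholder output file

srun ./my_parallel_program    # placeholder executable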

nNODES

In some cases you may wish to spread your job out across nodes rather than having many tasks run on the same node; you can request a number of nodes with -N:

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/mpi ; srun -N 5 --pty bash -i
jpr9c@artemis5:~/work/slurm/test_jobs/mpi$ set | grep SLURM_NODELIST
SLURM_NODELIST='artemis5,hermes[1-4]'
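
The same thing in a batch script; a minimal sketch (the output file name is a placeholder):

#!/bin/bash
#SBATCH --nodes=5            # spread the allocation across five nodes
#SBATCH --ntasks=5           # one task per node in this sketch
#SBATCH --output=nodes.out   # placeholder output file

srun hostname                # each task prints the name of the node it landed on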

GPUs

The cluster includes 5 Nvidia Tesla K20c (5GB) GPUs which can be used for GPU specific jobs. Because these resources are limited to one user at a time, they're ideal candidates for batch job submission.

As with CPUs or RAM, the resource required is passed as an option to srun or sbatch:

jpr9c@power5
: /af13/jpr9c/work/slurm/test_jobs/gres ; srun --gres=gpu --pty nvidia-smi
Sun Oct 25 11:56:16 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.96     Driver Version: 346.96         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   29C    P0    52W / 225W |     12MiB /  4799MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
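
In a batch script the GPU is requested the same way on an #SBATCH line; a minimal sketch (the output file and program name are placeholders):

#!/bin/bash
#SBATCH --gres=gpu:1          # request one of the K20c GPUs
#SBATCH --output=gpu-job.out  # placeholder output file

srun ./my_cuda_program        # placeholder executable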

Parallel Jobs

Both MPICH2 and OpenMPI are configured to work properly with SLURM on the CS clusters, and both sets of libraries are installed. MPICH2 is used for the examples below.

mpi "hello world"

A simple "hello world" mpich program:

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; cat testMPI.c
#include "mpi.h"
#include "stdlib.h"
#include <iostream>

using namespace std;

int main(int argc, char *argv[]) {
   // Initialize MPI
   MPI_Init(&argc, &argv);
   // Get the processor number of the current node
        int rank; // can't be unsigned
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   // Get the total number of nodes
        int numNodes; // can't be unsigned
   MPI_Comm_size(MPI_COMM_WORLD, &numNodes);
        cout << "My rank is: " << rank << " and I think there are "
        << numNodes << " nodes to talk to." << endl;
   MPI_Finalize();
}

Build our program:

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; mpic++ -o testMPI.mpich testMPI.c

-n (number of tasks)

Each task is an individual process; mpiexec launches the tasks on the allocated resources, by default allocating one core per task. (Under PBS this was approximated with -ncpus.) This is the simplest case; first, the batch submission script:

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; cat 10-cores.sh 
#!/bin/bash
#
# This is a simple MPI sbatch submission script for SLURM and MPICH2
#
#SBATCH -o 10-cores.out
#SBATCH -n 10

export HYDRA_BOOTSTRAP=slurm

echo "nodelist = $SLURM_NODELIST"
/usr/cs/bin/mpiexec ./testMPI.mpich

Our results:

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; sbatch 10-cores.sh 
Submitted batch job 76335

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; cat 10-cores.out 
nodelist = artemis1
My rank is: 1 and I think there are 10 nodes to talk to.
My rank is: 2 and I think there are 10 nodes to talk to.
My rank is: 3 and I think there are 10 nodes to talk to.
My rank is: 4 and I think there are 10 nodes to talk to.
My rank is: 5 and I think there are 10 nodes to talk to.
My rank is: 6 and I think there are 10 nodes to talk to.
My rank is: 8 and I think there are 10 nodes to talk to.
My rank is: 9 and I think there are 10 nodes to talk to.
My rank is: 0 and I think there are 10 nodes to talk to.
My rank is: 7 and I think there are 10 nodes to talk to.

So SLURM tries to pack jobs onto a single host (and, in this case, a single socket - each socket in these nodes has 16 cores), allocating one core per task.
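
If you want to limit how densely the tasks are packed without requesting a fixed node count, the standard --ntasks-per-node option can be added; command-line options override the #SBATCH lines inside the script. For example, to cap the ten tasks at two per node:

sbatch --ntasks=10 --ntasks-per-node=2 10-cores.sh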

-N nodes (distributing your job across multiple nodes)

If you wish to distribute your tasks across a number of nodes rather than having them all packed onto the same node (perhaps for disk or network I/O reasons), you can ask SLURM to allocate multiple nodes, and it will spread the tasks across them:

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; cat 5-nodes.sh 
#!/bin/bash
#
# This is a simple MPI sbatch submission script for SLURM and MPICH2
#
#SBATCH -o 5-nodes.out
#SBATCH -N 5
#SBATCH -n 10

export HYDRA_BOOTSTRAP=slurm

echo "nodelist = $SLURM_NODELIST"
/usr/cs/bin/mpiexec ./testMPI.mpich

There are still ten tasks, but now they are distributed over five nodes:

jpr9c@power6
: /af13/jpr9c/work/slurm/test_jobs/mpi ; cat 5-nodes.out 
nodelist = artemis[1-5]
My rank is: 0 and I think there are 10 nodes to talk to.
My rank is: 1 and I think there are 10 nodes to talk to.
My rank is: My rank is: 5 and I think there are 10 nodes to talk to.
4 and I think there are 10 nodes to talk to.
My rank is: 6 and I think there are 10 nodes to talk to.
My rank is: 8 and I think there are 10 nodes to talk to.
My rank is: 3 and I think there are 10 nodes to talk to.
My rank is: 7 and I think there are 10 nodes to talk to.
My rank is: 9 and I think there are 10 nodes to talk to.
My rank is: 2 and I think there are 10 nodes to talk to.

There are a very large number of options for distributing jobs and binding tasks to particular CPUs. If you need more information on selecting a particular set of resources, consult the sbatch or srun man pages, or send mail to root.
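
As one example of those options, srun's -m/--distribution flag controls how tasks are laid out across the allocated nodes; a quick sketch using cyclic (round-robin) placement:

srun -N 2 -n 8 -m cyclic hostname   # tasks alternate between the two allocated nodes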