Welcome! This workshop familiarizes new users with the compute resources available under the job scheduler. It is not comprehensive; it is designed as a quick introduction to the SLURM resources available to you after being granted a CS account.
After finishing this page, please be sure to visit our (wiki page about SLURM).
Presently, the CS department uses the SLURM software to control access to most compute resources in the CS environment.
Simply put, SLURM manages and grants access to server resources such as CPU cores, CPU memory, and GPUs.
The portal and NoMachine Remote Linux Desktop clusters are connected to the SLURM Job Scheduler, which is in turn connected to the SLURM compute nodes.
From the login clusters, you are able to request specific resources in the form of a job. When the scheduler detects that a server with the specified resources is available, it assigns your job to the respective compute node (server).
These jobs can be either interactive, i.e. a terminal is opened, or non-interactive. In either case, your job runs commands on a compute node, using the requested resources, just as you would run them in your own terminal.
For example, you can request to allocate an A100 or H100 GPU, along with a certain number of CPU cores, and a certain amount of CPU memory, to train a Large Language Model (LLM) or run a program.
--gres=gpu:1 or --gpus=1 will allocate the first available GPU, regardless of type
#SBATCH --gres=gpu:1 with --constraint="a100_40gb" will require that an A100 GPU with 40GBs be used for your job
The best way to view available resources is to visit our wiki page on (SLURM Compute Resources).
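For example, here is a minimal sketch of the directives that pin a job to a specific GPU type (the gpu partition and the a100_40gb feature name are taken from the examples on this page; substitute the feature that matches the GPU you need):

#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --constraint="a100_40gb"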
Using this information, you can submit a job to request that certain resources be allocated to your job.
Here are a few of the most common job options that are needed for submitting a job
Be sure to check our main wiki page for full details about these options and others (SLURM Common Job Options); a combined example is shown after the list.
-J or --job-name=<jobname> The name of your job
-n <n> or --ntasks=<n> Number of tasks to run
-p <partname> or --partition=<partname> Submit a job to a specified partition
-c <n> or --cpus-per-task=<n> Number of cores to allocate per process, primarily for multithreaded jobs; default is one core per process/task
--mem=<n> System memory required for each node, specified in MBs
-t D-HH:MM:SS or --time=D-HH:MM:SS Maximum wall clock time for a job
-C <features> or --constraint=<features> Specify unique resource requirements such as specific GPUs
--mail-type=<type> Specify the job state that should generate an email
--mail-user=<computingID>@virginia.edu Specify the recipient virginia.edu email address for email notifications (all other domains, such as 'gmail.com', are ignored)
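For reference, here is a sketch of how several of these options might appear together at the top of a job script (covered in more detail in the non-interactive job section below); the values shown, such as the cpu partition and the two-hour time limit, are placeholders to adjust for your own job, and the same options can also be passed directly to salloc on the command line.

#!/bin/bash
# Job name, task count, and cores per task
#SBATCH --job-name=example_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
# Partition, memory (in MBs), and wall clock limit
#SBATCH --partition=cpu
#SBATCH --mem=4000
#SBATCH --time=0-02:00:00
# Email notifications when the job starts and ends
#SBATCH --mail-type=begin,end
#SBATCH --mail-user=<computingID>@virginia.edu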
Submitting an interactive job is a two-step process: first, request an allocation of resources; then run a command on the allocation, i.e. start a shell
Note, a partition must always be specified
The following example creates a resource allocation within the CPU partition for one node with two cores, 4GBs of memory, and a time limit of 30 minutes
userid@portal01~$ salloc -p cpu -c 2 --mem=4000 -J InteractiveJob -t 30
salloc: Granted job allocation 12345
Then, a BASH shell is initialized within the allocation
userid@portal01~$ srun --pty bash -i -l --
userid@node01~$ echo "Hello from $(hostname)!"
Hello from node01!
Notice that the hostname (i.e. the server you're on) changes from portal01 to node01.
Be sure to fully exit and relinquish the job allocation when you have finished
userid@node01~$ exit
logout
userid@portal01~$ exit
exit
salloc: Relinquishing job allocation 12345
userid@portal01~$
The following requests to allocate the first available GPU, regardless of type
userid@portal01~$ salloc -p gpu --gres=gpu:1 -c 2 --mem=4000 -J InteractiveJob -t 30
salloc: Granted job allocation 12345
userid@portal01~$ srun --pty bash -i -l --
userid@gpunode01~$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1080 (UUID: GPU-75a60714-6650-dea8-5b2f-fa3041799070)
Be sure to fully exit and relinquish the job allocation when you have finished
userid@gpunode01~$ exit
logout
userid@portal01~$ exit
exit
salloc: Relinquishing job allocation 12345
userid@portal01~$
Submitting a non-interactive job allows for queuing a job without waiting for it to begin. This is done using SBATCH Scripts, which function similarly to salloc.
First, create a file to serve as an SBATCH script, using nano, vim, or another editor
~$ nano my_sbatch_script
Inside the file, add the necessary resource requirements for your job, prefixing each job option with #SBATCH.
The following requests the first available GPU, regardless of type, along with 16GBs of CPU memory and a time limit of 4 hours. Then the Python script myprogram.py will be run after loading the required modules.
Note that email notifications are available; they let you queue a job and receive an email when it starts or finishes
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH -n 1
#SBATCH -t 04:00:00
#SBATCH -p gpu
#SBATCH --mail-type=begin,end
#SBATCH --mail-user=<computingID>@virginia.edu

module purge
module load gcc python cuda

python3 myprogram.py
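With the script saved (as my_sbatch_script from the earlier step), submit it to the queue with sbatch; the scheduler prints the assigned job ID:

~$ sbatch my_sbatch_script
Submitted batch job 12345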
To view all jobs that you have queued or running
~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 cpu myjob userid R 52:01 1 node01
12346 cpu myjob userid R 52:01 1 node01
12347 cpu myjob userid PD 00:00 1 (Priority)
To view all of your jobs that are running, include --state=running
~$ squeue -u $USER --state=running
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 cpu myjob userid R 52:01 1 node01
12346 cpu myjob userid R 52:01 1 node01
To cancel a job, run the following command, providing a JobID which can be obtained from squeue
~$ scancel <jobid>
As shown in the diagram, software modules along with storage are connected to SLURM compute nodes.
This means that after submitting a job and being allocated resources, you can load and use modules just as you normally would when logged into portal, for example.
The following example highlights module usage for an interactive job
userid@portal01~$ salloc -p cpu -c 2 --mem=4000 -J InteractiveJob -t 30
salloc: Granted job allocation 12345
userid@portal01~$ srun --pty bash -i -l --
userid@node01~$ module purge
userid@node01~$ module load gcc python
userid@node01~$ python3
>>> print("Hello, world!")
Hello, world!
>>>
Similarly, an SBATCH script can load modules in the same way you would when logged in via SSH, and then run a program
#!/bin/bash
#SBATCH ...
#SBATCH ...

module purge
module load gcc python

python3 hello_world.py
A Jupyter notebook can be opened within a SLURM job.
Note, you MUST be on-grounds using eduroam wifi or running a UVA VPN.
To open a Jupyter notebook during an interactive session, first load the miniforge module
~$ module load miniforge
Then, run the following command, and find the URL output to access the Jupyter instance
~$ jupyter notebook --no-browser --ip=$(hostname -A)
... output omitted ...
Or copy and paste one of these URLs:
http://hostname.cs.Virginia.EDU:8888/tree?token=12345689abcdefg
Copy and paste the generated URL into your browser.
Another option is to attach a Jupyter notebook to resources allocated via an SBATCH script.
Note, once you are finished, you must run ~$ scancel <jobid>, using the assigned <jobid>, to free the allocated resources
The following SBATCH script is provided as an example; you will need to modify it depending on the resource requirements of your job. Enabling email notifications is recommended
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 00:30:00
#SBATCH -p cpu
#SBATCH --mail-type=begin
#SBATCH --mail-user=<userid>@virginia.edu

module purge
module load miniforge

jupyter notebook --no-browser --ip=$(hostname -A) > ~/slurm_jupyter_info 2>&1
Using the above SBATCH script, submit the job and wait until the resources are allocated
~$ sbatch sbatch_jupyter_example
You will receive an email when your job starts (if email notifications are enabled), or you can check your queue
~$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 cpu sbatch_j <userid> R 0:04 1 <node name>
Once your job is running, i.e. has a state of R, output the notebook connection info and copy/paste the generated URL into your browser
~$ cat ~/slurm_jupyter_info
... output omitted ...
Or copy and paste one of these URLs:
http://hostname.cs.Virginia.EDU:8888/tree?token=12345689abcdefg
Note, once you are finished, you must run ~$ scancel <jobid>, using the assigned <jobid>, to free the allocated resources
~$ scancel 12345
Here are a few general tips and recommendations for using SLURM in CS
Run ~$ module purge at the start of your job to clear any previously loaded modules before loading only the ones your job needs