1. Getting Started with Slurm
- Slurm is a very capable workload manager.
- Think of it as our own miniature version of a compute cloud (such as Amazon EC2 or Google Compute Engine).
- Slurm can manage a bunch of computing resources (CPUs, RAM, GPUs, etc.).
- It is a shared grid: resources are shared with other users.
- We can request resources when we need them, run our jobs, and release the resources back.
- Slurm keeps track of who is using what, avoids conflicts, decides priorities, queues jobs and runs them whenever resources become available, does the bookkeeping, and so on.
- Slurm has a command line interface ⇒ there are a bunch of commands we need to learn.
2. Commands
- squeue : Shows jobs in the queue, running or pending. You may filter jobs based on many criteria.
- srun : Runs a command in an interactive shell with STDIN/STDOUT connected. This is great for running an interactive shell (such as bash) on a worker node when you need input from your keyboard.
- sbatch : Runs a command in batch mode. This is great for running shell scripts on worker node(s) that DO NOT need input from your keyboard.
- scancel : Cancels any job (owned by you) in the queue.
There are more Slurm commands, but these are the essentials for doing any basic work with Slurm.
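To get a feel for how these fit together, here is a minimal session sketch; the partition name isi, the script myjob.sh, and the job id are placeholders for your own:
squeue -u $USER                      # what do I have in the queue?
srun -p isi -t 1:00:00 --pty bash    # grab a node interactively
sbatch -p isi -t 1:00:00 myjob.sh    # submit a batch script
scancel 123456                       # cancel a job, using an id from squeue
Each of these flags is explained in the sections below.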
3. Logistics
You need to know your partition.
For example, USC HPC has many partitions: https://hpcc.usc.edu/resources/infrastructure/node-allocation/
I have access to the isi partition and I am going to use it in the following examples.
You may want to use large/main/quick etc. based on your allocations.
4. Check the jobs in queue
squeue             # all jobs
squeue -p isi      # jobs in the isi partition
squeue -u $USER    # my jobs
Here are some handy aliases
format="\"%8i %8g %9u %5P %30j %2t %12M %12l %5D %3C %10R %10b\""
flags="-S +i -o ${format}"
alias q="squeue -u $USER ${flags}"      # my jobs
alias qa="squeue -p isi -a ${flags}"    # jobs in the partition I care about
alias qaa="squeue -a ${flags}"          # jobs from all partitions
I suggest pasting the above contents into ~/slurm-env.sh and sourcing it from your .bashrc:
echo "source ~/slurm-env.sh" >> ~/.bashrc
You'd most often use q for checking your jobs.
If you want to keep an eye on who else is using (or competing for) resources from the partition, use qa.
If you want to see jobs across all partitions, try qaa.
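If you want that view to refresh on its own, you can wrap the underlying squeue call in watch (assuming watch is available on the login node; the 30-second interval is just a suggestion):
watch -n 30 "squeue -p isi -a"    # refresh the partition's queue every 30 seconds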
5. Get an interactive shell
To get a bash shell on one of the worker nodes:
srun -p isi -t 4:00:00 --mem=4G -c 1 --pty bash
Note the parameters: the isi partition, a time limit of 4:00:00 (i.e. 4 hours), 4G of memory, and -c 1 CPU core.
--pty bash gives you a bash shell after allocating a node.
Some versions of Slurm use -n 1 instead of -c 1; refer to man slurm | grep cpu.
This assumes you need only one node (-N 1 is the default), but for some versions of Slurm you have to set it explicitly.
You can increase -c 10 to get 10 CPU cores, or --mem=20G to get 20GB of memory.
Press CTRL+D or type exit to leave the shell.
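For instance, a bigger interactive request along those lines might look like this (a sketch; the partition and time limit are placeholders, so adjust them to your allocation):
srun -p isi -t 2:00:00 --mem=20G -c 10 -N 1 --pty bash    # 10 CPU cores, 20GB RAM on a single node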
To request 1 GPU, add --gres=gpu:1
srun -p isi -t 4:00:00 --mem=8G -c 4 --gres=gpu:1 --pty bash
And verify that you got a GPU by typing nvidia-smi.
You may also be more specific about GPU types; a full example request follows this list.
- --gres=gpu:p100:2 : get 2 GPUs of type P100
- --gres=gpu:v100:1 : get 1 GPU of type V100
- --gres=gpu:k80:4 : get 4 GPUs of type K80
- --gres=gpu:k80:4 -N 2 : get 4 GPUs of type K80 on each of 2 such nodes ⇒ 2x4 = 8 GPUs
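Putting one of those together with the earlier srun flags, a request for two P100s might look like this (a sketch; partition, time, memory, and core counts are placeholders):
srun -p isi -t 4:00:00 --mem=16G -c 8 --gres=gpu:p100:2 --pty bash
nvidia-smi    # should list two P100 devices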
Go to your cluster's overview page to see what kinds of GPUs your cluster has.
6. Submit a batch job
If you swap srun with sbatch, and --pty bash with yourscript.sh arg1 arg2 .. argn, you get a batch job!
A batch job comes with an extra feature: if you wish (and I recommend it), you can set the resource specs inside the job script itself, so you don't have to specify them on the command line every time you run it.
If a line begins with #SBATCH, Slurm parses the rest of that line as if it were a command line argument.
Example:
#!/usr/bin/env bash
#SBATCH --partition=isi,main,large
#SBATCH --mem=40g
#SBATCH --time=23:59:00
#SBATCH --nodes=1 --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:p100:2
#SBATCH --output=R-%x.out.%j
#SBATCH --error=R-%x.err.%j
#SBATCH --export=NONE
# rest of the script.
hostname -f
echo "Hello World : args: ${@}"
sbatch myjob.sh arg1 arg2
By default, Slurm uses the script name as the job name. If you run the same script with different args, you may wish to set the job name manually so you can easily identify them in the queue.
sbatch -J myjob1 myjob.sh arg1 arg2
sbatch -J myjob2 myjob.sh arg3 arg4
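With named jobs you can then filter the queue by name and follow a job's output file. This sketch relies on the --output=R-%x.out.%j pattern from the example script above; the job id 123456 is hypothetical:
squeue -u $USER -n myjob1      # show only the job named myjob1
tail -f R-myjob1.out.123456    # follow its stdout (123456 is the job id from squeue)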
7. Kill a job
scancel <jobid>
You may get the <jobid> from squeue or its pretty alias q.
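scancel also accepts filters, which saves looking up ids one by one; for example (assuming the job name myjob1 from earlier):
scancel -u $USER     # cancel all of my jobs
scancel -n myjob1    # cancel my jobs named myjob1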
8. Setting up a Spark cluster inside a Slurm cluster
8.1. Download Spark
Get it from https://spark.apache.org/downloads.html
mkdir ~/libs/ && cd ~/libs/
wget http://apache.mirrors.pair.com/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
tar xvf spark-2.4.6-bin-hadoop2.7.tgz
ln -s spark-2.4.6-bin-hadoop2.7 spark
export SPARK_HOME=$(realpath spark)
8.2. Request a bunch of nodes from slurm
First, let's try to do it in an interactive shell (srun --pty bash); later you can sbatch this.
$ srun -p isi -t 0-4:00:00 -c 16 -N 4 --pty bash
where isi is my partition name.
If some nodes happen to have GPUs and you want to avoid them, use -x to exclude them.
It's best not to take up GPU nodes if you don't use them.
$ srun -p isi -t 0-4:00:00 -c 16 -N 4 -x hpc43[25-30] --pty bash
Next, let's see which nodes we got
$ echo $SLURM_NODELIST
hpc[3842-3843,3850-3851]
# but we need to expand the hostnames to work with spark
$ scontrol show hostnames
hpc3842
hpc3843
hpc3850
hpc3851
Great! Now we are good to start the Spark cluster.
8.3. Start Spark Cluster
Assumptions:
- your bash/shell session is on one of the worker nodes allotted by Slurm
- you can get the full list of nodes with scontrol show hostnames
- you can SSH to each of them; you may verify it with this command: scontrol show hostnames | while read h; do ssh -n $h hostname -f ; done
- you have downloaded Spark and extracted it at $SPARK_HOME: export SPARK_HOME=/path/to/spark/extract/dir
- all nodes have a shared file system (NFS) that can take the load of multiple reads and writes (see the check after this list); if not, please set up HDFS with YARN. This is for a standalone Spark cluster.
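A quick way to sanity-check the shared-filesystem assumption is to write a marker file from the current node and look for it from every other node (a sketch; the marker filename is arbitrary):
touch $HOME/.nfs-check
scontrol show hostnames | while read h; do ssh -n $h ls $HOME/.nfs-check; done
rm $HOME/.nfs-check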
Now, the remaining documentation is at https://spark.apache.org/docs/latest/spark-standalone.html
Note that the Apache Spark developers have already provided an awesome set of scripts:
- sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
- sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
- sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
- sbin/start-all.sh - Starts both a master and a number of slaves as described above.
- sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
- sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
- sbin/stop-all.sh - Stops both the master and the slaves as described above.
So, sbin/start-all.sh and sbin/stop-all.sh should take care of starting and stopping, except that we have to tell them which worker nodes to use.
Digging inside the scripts, I found that if we set the SPARK_SLAVES variable to point to a plain text file with a list of hosts, one host per line, the rest works as expected.
To start
$ scontrol show hostnames > hostnames.txt
$ export SPARK_SLAVES=$(realpath hostnames.txt)
$SPARK_HOME/sbin/start-all.sh
To stop
$SPARK_HOME/sbin/stop-all.sh
If there are any issues, check $SPARK_HOME/logs for logs.
The Spark master usually runs on the node where you ran start-all.sh.
So the master URL is spark://<firsthost>:7077
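To check that the cluster is usable, you can point a Spark application at that master URL; for example, the Pi example that ships with the binary distribution (a sketch, assuming you are still on the master node and the default port 7077):
MASTER_URL="spark://$(hostname -f):7077"
$SPARK_HOME/bin/spark-submit --master $MASTER_URL \
    $SPARK_HOME/examples/src/main/python/pi.py 100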