Using High Performance Computing Cluster

Vihari Piratla, Diptesh Kanojia, Sabyasachi Ghosh · September 13, 2020

The Computer Center (CC) provides access to 64 GPU nodes with P100 GPUs (decent GPUs at the time of this post). You can learn more about the facility, called SpaceTime, at this (IITB-internal) link.

SpaceTime has 64 P100 GPU nodes (16 GB each, cutting-edge at the time of writing). Your faculty supervisor must create an account for you to be able to access the facility; they can write to spacetime@iitb.ac.in to get an account created.

Each faculty member gets one account, which means all of their students will likely end up using the same (shared) credentials to run experiments. Based on our experience, we used the following directives among lab members to ensure fair usage.

  1. Every job should have a student identifier in the job name.
  2. There must be an upper cap on the number of jobs per student.
  3. Each student must create their own directory under the faculty account provided, and all operations and logs must be stored in that directory only.
  4. Ensure that each job execution script contains exactly one aprun command.
  5. SpaceTime also has a feature that limits the execution time per job; one should consider using it so that waiting jobs get their turn in a round-robin fashion.

These directives are just suggestions based on past experience. You can, of course, make your own rules!
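
For illustration, a job header that follows these conventions might look like the sketch below. The student tag ("asmith"), model name, log path, and 24-hour walltime are placeholder values chosen for this example, to be agreed upon within your lab, not settings mandated by SpaceTime.

## Hypothetical header following the lab conventions above
#PBS -N asmith_bert                                  ## student identifier ("asmith") in the job name
#PBS -l walltime=24:00:00                            ## per-job execution time cap agreed within the lab
#PBS -o <faculty home>/asmith/logs/bert.log          ## logs kept inside the student's own directory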

It is not straightforward to use PBS to submit jobs, and working with HPC entails a steep learning curve.

You should submit your job through PBS. The typical queue wait time for a job is around 12-24 hours, and there is an upper limit of 120 hours on how long a job can run. The number of parallel jobs that a single (faculty) account can run at a time is also limited to five. If you need that limit increased, you should be able to show that the current limit is insufficient and discuss the matter with Mr. Trirag, who sits in the CC (SpaceTime) office on the ground floor of the CC building.
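
Concretely, submitting and monitoring a job from the login node looks roughly like the following sketch; job.sh, <username>, and the job ID 12345 are placeholders, and the exact output format depends on the PBS version installed.

qsub job.sh          ## submit the job script; prints a job ID such as 12345.<server>
qstat -u <username>  ## list your queued and running jobs with their states
qstat -Q gpuq        ## show queue-level information (limits, running/queued counts) for gpuq
qdel 12345           ## delete a queued or running job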

You need to know about the available software and libraries, debugging, MPI threads, required code changes, the architecture, network adaptations, etc. The provided manuals are no match for the support you will need. A sample job script is shown below:

## PBS Script
#PBS -N PP|<model name>
#PBS -q gpuq
##PBS -q gpu_nodes
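## Request one node (chunk) with 18 CPU cores and a GPU accelerator on a Cray compute vnode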
#PBS -l select=1:ncpus=18:accelerator=True:vntype=cray_compute
#PBS -l place=pack
#PBS -o <model name>.log
#PBS -V

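## Cray programming-environment modules: Broadwell CPU target and NVIDIA compute-capability 6.0 (P100) GPU support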
module load craype-broadwell
module load craype-accel-nvidia60

source activate <env_name>
cd <Your Folder Path>
## If using m GPUs, set select=m, set aprun -n m -d m*ncpus, set place=scatter
## For 1 GPU, set place=pack
## Use exactly one aprun command per job script


## Launch the job: 1 MPI process with 18 CPU cores on the GPU node
aprun -n 1 -N 1 -d 18 -cc none -a xt <executable file> <arguments>

Painless PBS

Submitting jobs to a server through PBS can be taxing, especially if you are a beginner. The following interfaces make working with PBS somewhat easier.

The Dask library for distributed computing in Python provides a PBS backend. It should allow you to transparently run jobs on PBS without worrying too much about the internal details. It also supports other backends, such as threads, processes, and an SSH cluster.
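
If you want to try this route, Dask's PBS support lives in the dask-jobqueue package, which provides a PBSCluster class for launching Dask workers as PBS jobs. A minimal install into the environment your job script activates might look like:

pip install dask distributed dask-jobqueue   ## dask-jobqueue provides PBSCluster for submitting Dask workers through PBS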

Similar support is available for Julia using the Distributed library.
