Goal: Discuss what workflows can look like, being a good cluster citizen, and some best practices.
...
```
#!/bin/bash
#SBATCH --account=arccanetrain
#SBATCH --time=1:00
#SBATCH --reservation=HPC_workshop
#SBATCH --partition=tetonmb-gpul40s
#SBATCH --gres=gpu:1

echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L

# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c1859587-9722-77f3-1b3a-63e9d4fe9d4f)
```
...
Info: When requesting ARCC help, what you ran is then documented within your scripts, which you can share with us. This is good reproducibility practice.
...
What does a general workflow look like?
...
Develop/Try/Test:
Typically use an interactive session (salloc) where you’re typing/trying/testing (see the example after this list).
Are modules available? If not, submit an HPC Software Consultation request to start the discussion.
Develop code/scripts.
Understand how the command line works – what commands/scripts to call, and with which options.
Understand if parallelization is available – can you optimize your code/application?
Test against a subset of data: something that runs quickly, maybe a couple of minutes or hours.
Do the results look correct?
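For example, an interactive session might look like this (the account name, module version, and test script are placeholders for illustration, not ARCC specifics):

```
# Request a short interactive session on a compute node.
salloc --account=<project-name> --time=01:00:00 --cpus-per-task=1

# On the node: see what software is available, load what you need, run a quick test.
module avail
module load python/3.10.6     # assumed module name/version
python test_subset.py         # hypothetical test against a small data subset
```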
...
Put it all together within a bash Slurm script (a sketch follows this list):
Request appropriate resources using
#SBATCH
Request appropriate wall time – hours, days…
Load modules:
module load …
Run scripts/command-line.
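Putting those pieces together, a minimal sketch of such a script – the account, module, and program names are assumptions for illustration:

```
#!/bin/bash
#SBATCH --account=<project-name>     # your project/account
#SBATCH --time=1-00:00:00            # wall time: 1 day
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load the same modules you tested with interactively.
module load python/3.10.6            # assumed module name/version

# Run against the complete set of data.
python my_analysis.py --input full_dataset/   # hypothetical script and options
```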
Finally, submit your job to the cluster (sbatch) using a complete set of data.
Use:
sbatch <script-name.sh>
Monitor your jobs’ progress (see the example below).
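For example, using the standard Slurm commands squeue and scancel alongside sbatch:

```
sbatch <script-name.sh>   # prints the assigned job ID: "Submitted batch job <jobid>"
squeue -u $USER           # list your queued and running jobs
scancel <jobid>           # cancel a job if something looks wrong
```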
...
How can I be a good cluster citizen?
Don’t run intensive applications on the login nodes.
Understand your software/application.
Shared resource (multi-tenancy): jobs from other users may run on the same node as yours.
Jobs running on the same node should not affect each other, so stay within the resources you request.
Don’t ask for everything. Don’t use:
--mem=0 (which allocates all of a node’s memory)
the --exclusive flag (which reserves a whole node) unless you truly need it.
Only ask for a GPU if you know it’ll be used.
Use /lscratch for I/O-intensive tasks rather than accessing /gscratch over the network. You will need to copy files back before the job ends (see the sketch after this list).
Track usage and job performance:
seff <jobid>
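A sketch of the /lscratch staging pattern described above; the per-job directory layout and file names are assumptions, so check ARCC’s documentation for the exact convention:

```
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=02:00:00
#SBATCH --mem=8G                      # ask for what you need, not --mem=0

# Stage input from network storage to node-local scratch (paths are assumptions).
WORKDIR=/lscratch/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp /gscratch/<project>/input.dat "$WORKDIR"/
cd "$WORKDIR"

./process input.dat > results.out     # hypothetical I/O-intensive step

# Copy results back to network storage before the job ends.
cp results.out /gscratch/<project>/
```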
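seff summarizes a finished job’s CPU and memory efficiency; sacct (shown here with a few common fields) gives per-step accounting detail:

```
seff <jobid>
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State
```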
...