Goal: Discuss what workflows can look like, being a good cluster citizen, and some best practices.
...
```
#!/bin/bash
#SBATCH --account=arccanetrain
#SBATCH --time=1:00
#SBATCH --reservation=HPC_workshop
#SBATCH --partition=tetonmb-gpul40s
#SBATCH --gres=gpu:1

echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
nvidia-smi -L

# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c1859587-9722-77f3-1b3a-63e9d4fe9d4f)
```
...
Info: When requesting ARCC help, what you ran is then documented within your scripts, which you can share with us. This is good reproducibility practice.
...
What does a general workflow look like?
...
Develop/Try/Test:
Typically use an interactive session (salloc) where you’re typing/trying/testing (see the example after this list).
Are modules available? If not, submit an HPC Software Consultation request to start the discussion.
Develop code/scripts.
Understand how the command line works – what commands/scripts to call, and with which options.
Understand if parallelization is available – can you optimize your code/application?
Test against a subset of data: something that runs quickly, maybe a couple of minutes or hours.
Do the results look correct?
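For example, an interactive session might look like this (the account name, module version, and test script are placeholders for illustration, not ARCC specifics):

```
# Request a short interactive session on a compute node.
salloc --account=<project-name> --time=01:00:00 --cpus-per-task=1

# On the node: see what software is available, load what you need, run a quick test.
module avail
module load python/3.10.6     # assumed module name/version
python test_subset.py         # hypothetical test against a small data subset
```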
...
Put it all together within a bash Slurm script (a sketch follows this list):
Request appropriate resources using
#SBATCH
Request appropriate wall time – hours, days…
Load modules:
module load …
Run scripts/command-line.
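Putting those pieces together, a minimal sketch of such a script – the account, module, and program names are assumptions for illustration:

```
#!/bin/bash
#SBATCH --account=<project-name>     # your project/account
#SBATCH --time=1-00:00:00            # wall time: 1 day
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load the same modules you tested with interactively.
module load python/3.10.6            # assumed module name/version

# Run against the complete set of data.
python my_analysis.py --input full_dataset/   # hypothetical script and options
```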
Finally, submit your job to the cluster (sbatch) using a complete set of data.
Use:
sbatch <script-name.sh>
Monitor your jobs’ progress (see the example below).
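For example, using the standard Slurm commands squeue and scancel alongside sbatch:

```
sbatch <script-name.sh>   # prints the assigned job ID: "Submitted batch job <jobid>"
squeue -u $USER           # list your queued and running jobs
scancel <jobid>           # cancel a job if something looks wrong
```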
...
How can I be a good cluster citizen?
Don’t run intensive applications on the login nodes.
Understand your software/application.
Shared resource (multi-tenancy): jobs from other users may run on the same node as yours.
Jobs running on the same node should not affect each other, so stay within the resources you request.
Don’t ask for everything. Don’t use:
--mem=0 (which allocates all of a node’s memory)
the --exclusive flag (which reserves a whole node) unless you truly need it.
Only ask for a GPU if you know it’ll be used.
Use /lscratch for I/O-intensive tasks rather than accessing /gscratch over the network. You will need to copy files back before the job ends (see the sketch after this list).
Track usage and job performance:
seff <jobid>
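A sketch of the /lscratch staging pattern described above; the per-job directory layout and file names are assumptions, so check ARCC’s documentation for the exact convention:

```
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=02:00:00
#SBATCH --mem=8G                      # ask for what you need, not --mem=0

# Stage input from network storage to node-local scratch (paths are assumptions).
WORKDIR=/lscratch/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp /gscratch/<project>/input.dat "$WORKDIR"/
cd "$WORKDIR"

./process input.dat > results.out     # hypothetical I/O-intensive step

# Copy results back to network storage before the job ends.
cp results.out /gscratch/<project>/
```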
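seff summarizes a finished job’s CPU and memory efficiency; sacct (shown here with a few common fields) gives per-step accounting detail:

```
seff <jobid>
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State
```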
...