Goal: Discuss what workflows can look like, being a good cluster citizen, and some best practices.

Info

When you request help from ARCC, having your workflow documented within the scripts you submit with sbatch helps us replicate exactly what you’ve done.

This is good reproducibility practice.

...

What does a general workflow look like?

...

Develop/Try/Test:

Typically use an interactive session (salloc) where you’re typing/trying/testing.
Are modules available? If not, submit a New HPC Software Request to get the software installed, or a Consultation request to start the discussion.
Develop code/scripts.
Understand how the command-line works – what commands/scripts to call with options.
Understand if parallelization is available – can you optimize your code/application?
Test against a subset of data. Something that runs quickly – maybe a couple of minutes/hours.
Do the results look correct?
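
The subset test above can be sketched as follows. This is a minimal sketch rather than a prescribed method: `full_data.csv` and `subset.csv` are hypothetical file names, the generated file is only a stand-in so the snippet runs anywhere, and `wc -l` is a placeholder for your own command.

```shell
# Stand-in for your real input (hypothetical name): 100000 numbered lines.
seq 1 100000 > full_data.csv

# Take the first 1000 lines as a quick test subset.
head -n 1000 full_data.csv > subset.csv

# Time a run against the subset. wc -l is a placeholder for your own command.
start=$(date +%s)
lines=$(wc -l < subset.csv | tr -d ' ')
end=$(date +%s)

echo "Processed $lines lines in $((end - start)) second(s)"
```

Once the subset run finishes quickly and the results look right, the elapsed time gives a rough basis for estimating the wall time of the full run.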

...

Put it all together within a bash Slurm script:
- Request appropriate resources using #SBATCH
- Request appropriate wall time – hours, days…
- Load modules: module load …
- Run scripts/command-line.
Finally, submit your job to the cluster (sbatch) using a complete set of data.
- Use: sbatch <script-name.sh>
- Monitor job(s) progress.
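
Putting these steps together might look like the sketch below. It writes a batch script to `my_job.sh` so the snippet can be run anywhere. Every value in it is a hypothetical placeholder, not an ARCC default: the job name, the account `my_project`, the module and command `my_app` with its `--threads` and `--input` options, the file `full_data.csv`, and the resource amounts.

```shell
# Write a hypothetical batch script; the quoted 'EOF' keeps $VARIABLES unexpanded.
cat > my_job.sh <<'EOF'
#!/bin/bash
# Resources and wall time (all values are placeholders).
#SBATCH --job-name=my_analysis
#SBATCH --account=my_project
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=my_analysis_%j.out

# Load the software the job needs (hypothetical module name).
module load my_app

# Run against the complete data set (hypothetical command and options).
my_app --threads "$SLURM_CPUS_PER_TASK" --input full_data.csv
EOF

echo "Wrote my_job.sh"

# On the cluster you would then submit and monitor the job:
#   sbatch my_job.sh
#   squeue -u "$USER"
```

In the `--output` file name, `%j` is replaced by the job ID, so each submission writes to its own log file.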

...

Threads - multiple cpus/cores: Single node, single task, multiple cores.
- Example: Chime
OpenMP: Single task, multiple cores. Set environment variable.
- an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.
- Example: ImageMagick
MPI: Message Passing Interface: Multiple nodes, multiple tasks
- OpenMPI: see ARCC Wiki: OpenMPI and oneAPI Compiling.
Hybrid: MPI / OpenMP and/or threads.
- Examples: DFTB and Quantum Espresso
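
How these models map onto a resource request can be sketched as below. The directive combinations in the comments are common Slurm patterns with illustrative counts, not ARCC-specific recommendations. `OMP_NUM_THREADS` is the standard OpenMP thread-count variable, and the fallback to 1 is an assumption so the snippet also runs outside a job.

```shell
# Threads / OpenMP: one task with several cores on a single node, e.g.
#   #SBATCH --nodes=1
#   #SBATCH --ntasks=1
#   #SBATCH --cpus-per-task=8
#
# MPI: several tasks that may span nodes, e.g.
#   #SBATCH --nodes=2
#   #SBATCH --ntasks-per-node=16
#
# Hybrid: several tasks, each with several cores
# (combine --ntasks-per-node with --cpus-per-task).

# Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested,
# so fall back to a single thread when it is unset.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OpenMP threads: $OMP_NUM_THREADS"
```

Tying the thread count to the allocation this way keeps the application from starting more threads than the cores it was given.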

...

What does it mean for an application to be GPU enabled?

...

GPU / NVIDIA / CUDA?
Examples:
- Applications: AlphaFold and GPU Blast
  - Via conda based environments built with GPU libraries - and converted to Jupyter kernels:
  - Examples: TensorFlow and PyTorch
  - Jupyter Kernels: PyTorch 1.13.1
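
One way to check that a job actually sees a GPU is sketched below. It assumes the GPU was requested with a directive such as `#SBATCH --gres=gpu:1` (your cluster may also require a specific partition) and that the NVIDIA `nvidia-smi` tool is on the path. The `gpu_visible` variable is a hypothetical name used only for this illustration.

```shell
# Query the GPU name and total memory if nvidia-smi exists and can reach a device.
if command -v nvidia-smi >/dev/null 2>&1 \
   && nvidia-smi --query-gpu=name,memory.total --format=csv; then
    gpu_visible=yes
else
    echo "No usable GPU detected"
    gpu_visible=no
fi

# Inside a GPU job this variable typically lists the devices assigned to the job.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "GPU visible: $gpu_visible"
```

Seeing a device only shows that one was allocated. Whether the application uses it depends on it being built against GPU libraries.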

...

Policies
Don’t run intensive applications on the login nodes.
Understand your software/application.
Shared resource - multi-tenancy.
- Jobs running on the same node do not affect each other.
Don’t ask for everything:
- Don’t use --mem=0, which requests all of the memory on a node.
- Don’t use the --exclusive flag.
- Only ask for a GPU if you know it’ll be used.
Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network.
- You will need to copy files back before the job ends.
Track usage and job performance: seff <jobid>
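
The /lscratch guidance above can be sketched as a stage-in, run, stage-out pattern. This is a minimal sketch under stated assumptions: `input.txt` and `output.txt` are hypothetical file names, `tr` is a placeholder for your real I/O-intensive command, the submit directory stands in for your location on /gscratch, and the fallback to a temporary directory exists only so the snippet runs on machines without /lscratch.

```shell
# Directory the job was submitted from; fall back to the current directory outside Slurm.
SUBMIT_DIR="${SLURM_SUBMIT_DIR:-$PWD}"

# Stand-in input file (hypothetical name) so the sketch is self-contained.
printf 'sample\n' > "$SUBMIT_DIR/input.txt"

# Use node-local /lscratch when it is available, otherwise a temporary directory.
if [ -d /lscratch ] && [ -w /lscratch ]; then
    WORK_DIR=$(mktemp -d /lscratch/job_XXXXXX)
else
    WORK_DIR=$(mktemp -d)
fi

# Stage in: copy the input to local storage.
cp "$SUBMIT_DIR/input.txt" "$WORK_DIR/"

# Run against the local copy (tr is a placeholder for the real command).
tr 'a-z' 'A-Z' < "$WORK_DIR/input.txt" > "$WORK_DIR/output.txt"

# Stage out: copy results back before the job ends, then clean up.
cp "$WORK_DIR/output.txt" "$SUBMIT_DIR/"
rm -rf "$WORK_DIR"

echo "Results copied back to $SUBMIT_DIR/output.txt"
```

Once the job has finished, running `seff` with its job ID shows how much of the requested CPU and memory was actually used, which helps right-size the next request.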

...
