Goal: Discuss what workflows can look like, how to be a good cluster citizen, and some best practices.

...

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=1:00
#SBATCH --reservation=<reservation-name>
#SBATCH --partition=mb-l40s
#SBATCH --gres=gpu:1
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES

nvidia-smi -L
# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c1859587-9722-77f3-1b3a-63e9d4fe9d4f)

...

Info

Typically, modules loaded and environment variables set on the login nodes will be inherited when you create an interactive salloc session or call sbatch.

Code Block
[salexan5@mblog1 ~]$ module purge
[salexan5@mblog1 ~]$ module load gcc/13.2.0 r/4.4.0
[salexan5@mblog1 ~]$ ml
Currently Loaded Modules:
  1) slurm/latest           (S)  42) libxau/1.0.8
...
 41) xproto/7.0.31

[salexan5@mblog1 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1243593
salloc: Nodes mbcpu-025 are ready for job

[salexan5@mbcpu-025 ~]$ ml
Currently Loaded Modules:
  1) slurm/latest    (S)  15) libxml2/2.10.3          29) perl/5.38.0           43) libxdmcp/1.1.4      57) curl/8.4.0                    71) openjdk/11.0.20.1_1
 ...
 14) xz/5.4.1             28) gdbm/1.23               42) libxau/1.0.8          56) nghttp2/1.57.0      70) openblas/0.3.24

...

Info

When requesting ARCC help, having your module loads documented within the scripts you sbatch helps us replicate exactly what you’ve done.

This is good reproducibility practice.

...

What does a general workflow look like?

...

Develop/Try/Test:

  • Typically use an interactive session (salloc) where you’re typing/trying/testing.

  • Are modules available? If not, submit a Consultation request to start the discussion.

  • Develop code/scripts.

  • Understand how the command-line works – what commands/scripts to call with options.

  • Understand if parallelization is available – can you optimize your code/application?

  • Test against a subset of data: something that runs quickly, maybe a couple of minutes or hours (a minimal interactive test is sketched after this list).

  • Do the results look correct?
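
For instance, a quick interactive test might look like the sketch below; the project name, module versions, script name, and input file are illustrative assumptions rather than ARCC-specific values.

Code Block
# Short interactive session for trying things out (adjust account/time to your project).
$ salloc -A <project-name> -t 30:00

# Load the software you intend to use (versions are only examples).
$ module load gcc/13.2.0 r/4.4.0

# Run against a small subset of the data and check the results look correct.
$ Rscript my_analysis.R --input subset.csv --output test_results.csv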

...

  • Put it all together within a bash Slurm script (a template is sketched after this list):

    • Request appropriate resources using #SBATCH

    • Request appropriate wall time – hours, days…

    • Load modules: module load …

    • Run scripts/command-line.

  • Finally, submit your job to the cluster (sbatch) using a complete set of data.

    • Use: sbatch <script-name.sh>

    • Monitor job(s) progress.
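
A minimal batch script pulling these steps together might look like the following sketch; the account, resource numbers, module versions, and file names are assumptions to adapt to your own work.

Code Block
#!/bin/bash

# Request appropriate resources and wall time for the full run.
#SBATCH --account=<project-name>
#SBATCH --time=1-00:00:00               # 1 day (days-hours:minutes:seconds)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --job-name=full_analysis
#SBATCH --output=full_analysis_%j.out   # %j expands to the job ID

# Load the same modules you tested with interactively.
module load gcc/13.2.0 r/4.4.0

# Run your scripts/command line against the complete set of data.
Rscript my_analysis.R --input full_dataset.csv --output results.csv

Submit it with sbatch full_analysis.sh, check progress with squeue -u $USER while it runs, and review seff <jobid> once it completes.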

...

  • Threads - multiple CPUs/cores: Single node, single task, multiple cores.

    • Example: Chime

  • OpenMP: Single task, multiple cores. Set the relevant environment variable (typically OMP_NUM_THREADS).

    • An application programming interface (API) that supports multi-platform shared-memory multiprocessing in C, C++, and Fortran.

    • Example: ImageMagick

  • MPI: Message Passing Interface: Multiple nodes, multiple tasks

  • Hybrid: MPI / OpenMP and/or threads.

    • Examples: DFTB and Quantum Espresso (illustrative Slurm requests for these parallelization models are sketched below).
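
As a rough illustration of how the resource requests differ between these models, the excerpts below sketch the relevant #SBATCH lines for three separate job scripts; the program names and core/task counts are assumptions.

Code Block
# Threads / OpenMP: one task on one node, several cores.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # OpenMP reads this variable
./my_openmp_app

# MPI: several tasks, potentially spread over multiple nodes.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
srun ./my_mpi_app

# Hybrid: a few MPI tasks per node, each using several OpenMP threads/cores.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_app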

...

What does it mean for an application to be GPU-enabled?

...

  • GPU / NVIDIA / CUDA?

  • Examples:

    • Applications: AlphaFold and GPU Blast

    • Via conda-based environments built with GPU libraries and converted to Jupyter kernels (a quick GPU check is sketched after this list):

      • Examples: TensorFlow and PyTorch

      • Jupyter Kernels: PyTorch 1.13.1
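
A quick way to confirm that a GPU-enabled environment can actually see the allocated GPU is sketched below; the partition, conda module, and environment name are assumptions to adjust to your own setup.

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --partition=<gpu-partition>   # a partition that has GPUs
#SBATCH --gres=gpu:1

# Confirm Slurm has allocated a GPU to the job.
nvidia-smi -L

# From a conda-based environment built with GPU libraries (names are illustrative),
# check that PyTorch can see CUDA.
module load miniconda3                # assumption: your site's conda module
conda activate my_gpu_env             # assumption: your own GPU-enabled environment
python -c "import torch; print(torch.cuda.is_available())"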

...

How can I be a good cluster citizen?

  • Policies

  • Don’t run intensive applications on the login nodes.

  • Understand your software/application.

  • Shared resource - multi-tenancy.

    • Jobs from different users can run on the same node and should not affect each other.

  • Don’t ask for everything. Don’t use:

    • --mem=0

    • the --exclusive flag.

  • Only ask for a GPU if you know it’ll be used.

  • Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network (a sketch follows this list).

    • You will need to copy files back before the job ends.

  • Track usage and job performance: seff <jobid>
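
For example, an I/O-heavy job might stage its data through node-local /lscratch and copy results back before finishing; the paths, file names, and wall time below are placeholders.

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=4:00:00

# Stage input onto the node-local scratch space.
WORKDIR=/lscratch/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp /gscratch/$USER/input_data.tar "$WORKDIR"
cd "$WORKDIR"
tar -xf input_data.tar

# ... run the I/O-intensive work here, reading and writing under $WORKDIR ...

# Copy the results back to /gscratch before the job ends,
# because /lscratch is local to the node and cleaned up afterwards.
cp -r results/ /gscratch/$USER/

After the job finishes, seff <jobid> reports how much of the requested memory and CPU was actually used, which helps you right-size the next request.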

...