Goal: Discuss what workflows can look like, how to be a good cluster citizen, and some best practices.

...

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=1:00
#SBATCH --reservation=<reservation-name>
#SBATCH --partition=mb-l40s
#SBATCH --gres=gpu:1
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES

nvidia-smi -L
# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c1859587-9722-77f3-1b3a-63e9d4fe9d4f)

...

Info

Typically, modules loaded and environment variables set on the login nodes will be inherited when you create an interactive salloc session or call sbatch.

Code Block
[salexan5@mblog1 ~]$ module purge
[salexan5@mblog1 ~]$ module load gcc/13.2.0 r/4.4.0
[salexan5@mblog1 ~]$ ml
Currently Loaded Modules:
  1) slurm/latest           (S)  42) libxau/1.0.8
...
 41) xproto/7.0.31

[salexan5@mblog1 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1243593
salloc: Nodes mbcpu-025 are ready for job

[salexan5@mbcpu-025 ~]$ ml
Currently Loaded Modules:
  1) slurm/latest    (S)  15) libxml2/2.10.3          29) perl/5.38.0           43) libxdmcp/1.1.4      57) curl/8.4.0                    71) openjdk/11.0.20.1_1
 ...
 14) xz/5.4.1             28) gdbm/1.23               42) libxau/1.0.8          56) nghttp2/1.57.0      70) openblas/0.3.24

...

Info

When requesting ARCC help, having your module loads documented within the scripts you sbatch helps us replicate exactly what you’ve done.

This is good reproducibility practice.

...

What does a general workflow look like?

...

Develop/Try/Test:

  • Typically use an interactive session (salloc) where you’re typing/trying/testing.

  • Are modules available? If not, submit a Consultation request to start the discussion.

  • Develop code/scripts.

  • Understand how the command-line works – what commands/scripts to call with options.

  • Understand if parallelization is available – can you optimize your code/application?

  • Test against a subset of data: something that runs quickly, maybe a couple of minutes or hours (a minimal interactive test is sketched after this list).

  • Do the results look correct?
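
For instance, a quick interactive test might look like the sketch below; the project name, module versions, script name, and input file are illustrative assumptions rather than ARCC-specific values.

Code Block
# Short interactive session for trying things out (adjust account/time to your project).
$ salloc -A <project-name> -t 30:00

# Load the software you intend to use (versions are only examples).
$ module load gcc/13.2.0 r/4.4.0

# Run against a small subset of the data and check the results look correct.
$ Rscript my_analysis.R --input subset.csv --output test_results.csv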

...

  • Put it all together within a bash Slurm script (a template is sketched after this list):

    • Request appropriate resources using #SBATCH

    • Request appropriate wall time – hours, days…

    • Load modules: module load …

    • Run scripts/command-line.

  • Finally, submit your job to the cluster (sbatch) using a complete set of data.

    • Use: sbatch <script-name.sh>

    • Monitor job(s) progress.
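
A minimal batch script pulling these steps together might look like the following sketch; the account, resource numbers, module versions, and file names are assumptions to adapt to your own work.

Code Block
#!/bin/bash

# Request appropriate resources and wall time for the full run.
#SBATCH --account=<project-name>
#SBATCH --time=1-00:00:00               # 1 day (days-hours:minutes:seconds)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --job-name=full_analysis
#SBATCH --output=full_analysis_%j.out   # %j expands to the job ID

# Load the same modules you tested with interactively.
module load gcc/13.2.0 r/4.4.0

# Run your scripts/command line against the complete set of data.
Rscript my_analysis.R --input full_dataset.csv --output results.csv

Submit it with sbatch full_analysis.sh, check progress with squeue -u $USER while it runs, and review seff <jobid> once it completes.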

...

  • Threads - multiple CPUs/cores: Single node, single task, multiple cores.

    • Example: Chime

  • OpenMP: Single task, multiple cores. Set the relevant environment variable (typically OMP_NUM_THREADS).

    • An application programming interface (API) that supports multi-platform shared-memory multiprocessing in C, C++, and Fortran.

    • Example: ImageMagick

  • MPI: Message Passing Interface: Multiple nodes, multiple tasks

  • Hybrid: MPI / OpenMP and/or threads.

    • Examples: DFTB and Quantum Espresso (illustrative Slurm requests for these parallelization models are sketched below).
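
As a rough illustration of how the resource requests differ between these models, the excerpts below sketch the relevant #SBATCH lines for three separate job scripts; the program names and core/task counts are assumptions.

Code Block
# Threads / OpenMP: one task on one node, several cores.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # OpenMP reads this variable
./my_openmp_app

# MPI: several tasks, potentially spread over multiple nodes.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
srun ./my_mpi_app

# Hybrid: a few MPI tasks per node, each using several OpenMP threads/cores.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_app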

...

What does it mean for an application to be GPU-enabled?

...

  • GPU / NVIDIA / CUDA?

  • Examples:

    • Applications: AlphaFold and GPU Blast

    • Via conda-based environments built with GPU libraries and converted to Jupyter kernels (a quick GPU check is sketched after this list):

      • Examples: TensorFlow and PyTorch

      • Jupyter Kernels: PyTorch 1.13.1
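
A quick way to confirm that a GPU-enabled environment can actually see the allocated GPU is sketched below; the partition, conda module, and environment name are assumptions to adjust to your own setup.

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --partition=<gpu-partition>   # a partition that has GPUs
#SBATCH --gres=gpu:1

# Confirm Slurm has allocated a GPU to the job.
nvidia-smi -L

# From a conda-based environment built with GPU libraries (names are illustrative),
# check that PyTorch can see CUDA.
module load miniconda3                # assumption: your site's conda module
conda activate my_gpu_env             # assumption: your own GPU-enabled environment
python -c "import torch; print(torch.cuda.is_available())"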

...

How can I be a good cluster citizen?

  • Policies

  • Don’t run intensive applications on the login nodes.

  • Understand your software/application.

  • Shared resource - multi-tenancy.

    • Jobs from different users can run on the same node and should not affect each other.

  • Don’t ask for everything. Don’t use:

    • --mem=0

    • the --exclusive flag.

  • Only ask for a GPU if you know it’ll be used.

  • Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network (a sketch follows this list).

    • You will need to copy files back before the job ends.

  • Track usage and job performance: seff <jobid>
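
For example, an I/O-heavy job might stage its data through node-local /lscratch and copy results back before finishing; the paths, file names, and wall time below are placeholders.

Code Block
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=4:00:00

# Stage input onto the node-local scratch space.
WORKDIR=/lscratch/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp /gscratch/$USER/input_data.tar "$WORKDIR"
cd "$WORKDIR"
tar -xf input_data.tar

# ... run the I/O-intensive work here, reading and writing under $WORKDIR ...

# Copy the results back to /gscratch before the job ends,
# because /lscratch is local to the node and cleaned up afterwards.
cp -r results/ /gscratch/$USER/

After the job finishes, seff <jobid> reports how much of the requested memory and CPU was actually used, which helps you right-size the next request.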

...