Goal: Discuss what workflows can look like, being a good cluster citizen, and some best practices.

...

When you request ARCC help, this is already documented in the scripts you submitted with sbatch, which helps us replicate exactly what you did.

This is good reproducibility practice.

...

What does a general workflow look like?

...

Develop/Try/Test:

  • Typically use an interactive session (salloc) where you’re typing/trying/testing.

  • Are modules available? If not, submit a New HPC Software Request to get the software installed, or a Consultation request to start the discussion.

  • Develop code/scripts.

  • Understand how the command line works – which commands/scripts to call, and with which options.

  • Understand if parallelization is available – can you optimize your code/application?

  • Test against a subset of data: something that runs quickly, maybe a couple of minutes or hours.

  • Do the results look correct?
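
The subset test above could be sketched as follows. This is a minimal sketch, assuming line-oriented text input. The file names (`full_data.txt`, `subset.txt`) and the `wc -l` stand-in for the real command are hypothetical placeholders. In practice you would run something like this inside an interactive session started with salloc.

```shell
#!/bin/bash
# Sketch: build a small subset of a dataset and time a quick test run on it.
# Assumptions: input is line-oriented text; all file names are hypothetical.

full_data="full_data.txt"
subset="subset.txt"

# Stand-in for a real dataset so the sketch runs anywhere (remove in practice).
[ -f "$full_data" ] || seq 1 100000 > "$full_data"

# Take the first 1000 lines as a quick test subset.
head -n 1000 "$full_data" > "$subset"

# Time the test run; replace `wc -l` with your own command and options.
start=$(date +%s)
lines=$(wc -l < "$subset")
end=$(date +%s)

echo "Processed ${lines} lines in $((end - start)) s"
```

The elapsed time on the subset gives a rough starting point for the wall time to request for the full dataset.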

...

  • Put it all together within a bash Slurm script: 

    • Request appropriate resources using #SBATCH

    • Request appropriate wall time – hours, days…

    • Load modules: module load …

    • Run scripts/command-line.

  • Finally, submit your job to the cluster (sbatch) using a complete set of data.

    • Use: sbatch <script-name.sh>

    • Monitor job(s) progress.
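
Putting the steps above together, a batch script might look like the following. This is a minimal sketch rather than a recommended template: the job name, resource amounts, and the `echo` stand-in for the real command are hypothetical, and the `module load` line is left commented out because the module names depend on your application.

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis     # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # only as many cores as the application can use
#SBATCH --mem=8G                   # a specific amount rather than --mem=0
#SBATCH --time=02:00:00            # wall time (hh:mm:ss), informed by the subset test
#SBATCH --output=%x_%j.out         # %x = job name, %j = job id

# Load the modules your application needs (placeholder, left commented out):
# module load <module-name>

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested;
# default to 1 when this script is run outside a job.
cores="${SLURM_CPUS_PER_TASK:-1}"

# Run your scripts/command line; `echo` stands in for the real command.
echo "Running with ${cores} core(s) on $(hostname)"
```

Submit it with `sbatch <script-name.sh>`. One way to monitor progress is `squeue -u $USER`, which lists your pending and running jobs.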

...

  • Threads (multiple CPUs/cores): single node, single task, multiple cores.

    • Example: Chime

  • OpenMP: single task, multiple cores. Set an environment variable (typically OMP_NUM_THREADS) to control the number of threads.

    • An application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.

    • Example: ImageMagick

  • MPI (Message Passing Interface): multiple nodes, multiple tasks.

  • Hybrid: MPI / OpenMP and/or threads.

    • Examples: DFTB and Quantum Espresso
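
The models above map onto differently shaped Slurm requests. The sketch below shows the OpenMP case, assuming the application honors the standard `OMP_NUM_THREADS` variable. The node and core counts in the comments are illustrative assumptions, not recommendations.

```shell
#!/bin/bash
# Sketch: how each parallel model typically shapes a Slurm request.
#
#   Threads / OpenMP : --nodes=1 --ntasks=1 --cpus-per-task=8
#   MPI              : --nodes=2 --ntasks-per-node=16   (launched with srun)
#   Hybrid           : --nodes=2 --ntasks-per-node=4 --cpus-per-task=4
#
# The numbers above are illustrative only.

# OpenMP: tell the runtime how many threads to use. Match the cores Slurm
# allocated to the task; fall back to 1 when not running inside a job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "OpenMP threads: ${OMP_NUM_THREADS}"
```

Tying the thread count to the allocation avoids starting more threads than the job has cores.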

...

What does it mean for an application to be GPU enabled? 

...

  • GPU / NVIDIA / CUDA?

  • Examples:

    • Applications: AlphaFold and GPU Blast

    • Via conda-based environments built with GPU libraries, and converted to Jupyter kernels:

      • Examples: TensorFlow and PyTorch

      • Jupyter Kernels: PyTorch 1.13.1
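
Before running a GPU-enabled application, it can help to confirm that a GPU is actually visible to your session. A minimal sketch, assuming an NVIDIA GPU node where `nvidia-smi` is on the PATH. How you request the GPU (for example with `--gres=gpu:1`) is also an assumption to check against your cluster's documentation.

```shell
#!/bin/bash
# Sketch: report whether an NVIDIA GPU is visible to the current session.
# Assumption: nvidia-smi is installed and on PATH on GPU nodes.

if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    gpu_status="available"
    nvidia-smi -L          # list the GPUs visible to this session
else
    gpu_status="none"
    echo "No NVIDIA GPU visible: a GPU build may fall back to CPU or fail."
fi

echo "GPU status: ${gpu_status}"
```

If this reports no GPU inside a job that requested one, check the request before blaming the application.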

...

How can I be a good cluster citizen?

  • Policies

  • Don’t run intensive applications on the login nodes.

  • Understand your software/application.

  • Shared resource - multi-tenancy.

    • Jobs sharing the same node do not affect each other.

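  • Don’t ask for everything:

    • Don’t use --mem=0 (it requests all of the memory on a node).

    • Don’t use the --exclusive flag (it stops other jobs from sharing your nodes).

    • Only ask for a GPU if you know it’ll be used.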

  • Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network. 

    • You will need to copy files back before the job ends.

  • Track usage and job performance: seff <jobid>
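
The /lscratch advice above can be sketched as a stage-in, compute, stage-out pattern. This is a minimal sketch under assumptions: the per-job directory name under /lscratch, the `input.txt` and `results/` names, and the `sort` stand-in for the real workload are all hypothetical. The sketch falls back to a temporary directory when /lscratch is not available, so it can be tried anywhere.

```shell
#!/bin/bash
# Sketch: do I/O-heavy work on node-local scratch, then copy results back.

submit_dir="$PWD"
input="$submit_dir/input.txt"
[ -f "$input" ] || printf '3\n1\n2\n' > "$input"   # stand-in input data

# Use node-local /lscratch when writable; otherwise a temporary directory.
if [ -d /lscratch ] && [ -w /lscratch ]; then
    work_dir=$(mktemp -d "/lscratch/job_${SLURM_JOB_ID:-manual}_XXXXXX")
else
    work_dir=$(mktemp -d)
fi

# Stage in: copy the input to local scratch.
cp "$input" "$work_dir/"

# Compute on the local copy; `sort` stands in for the real application.
sort -n "$work_dir/input.txt" > "$work_dir/output.txt"

# Stage out: copy results back before the job ends, then clean up.
mkdir -p "$submit_dir/results"
cp "$work_dir/output.txt" "$submit_dir/results/"
rm -rf "$work_dir"

echo "Results copied to $submit_dir/results/output.txt"
```

Once the job has finished, `seff <jobid>` summarizes how much of the requested CPU and memory was actually used, which helps right-size the next request.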

...