Goal: Discuss common issues, what workflows can look like, being a good cluster citizen, and some best practices.

...

Code Block

#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=1:00
#SBATCH --reservation=<reservation-name>
#SBATCH --partition=mb-l40s
#SBATCH --gres=gpu:1
echo "SLURM_JOB_ID:" $SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE:" $SLURM_GPUS_ON_NODE
echo "SLURM_JOB_GPUS:" $SLURM_JOB_GPUS
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES

nvidia-smi -L

# Output:
SLURM_JOB_ID: 13517905
SLURM_GPUS_ON_NODE: 1
SLURM_JOB_GPUS: 0
CUDA_VISIBLE_DEVICES: 0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c1859587-9722-77f3-1b3a-63e9d4fe9d4f)

...

Info
Typically, modules loaded and environment variables set on the login nodes are inherited when you create an interactive `salloc` session and/or submit a job with `sbatch`.

Code Block

[salexan5@mblog1 ~]$ module purge
[salexan5@mblog1 ~]$ module load gcc/13.2.0 r/4.4.0
[salexan5@mblog1 ~]$ ml
Currently Loaded Modules:
  1) slurm/latest           (S)  42) libxau/1.0.8
...
 41) xproto/7.0.31

[salexan5@mblog1 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1243593
salloc: Nodes mbcpu-025 are ready for job

[salexan5@mbcpu-025 ~]$ ml
Currently Loaded Modules:
  1) slurm/latest    (S)  15) libxml2/2.10.3          29) perl/5.38.0           43) libxdmcp/1.1.4      57) curl/8.4.0                    71) openjdk/11.0.20.1_1
 ...
 14) xz/5.4.1             28) gdbm/1.23               42) libxau/1.0.8          56) nghttp2/1.57.0      70) openblas/0.3.24

...

Info

When you ask ARCC for help, this is then documented in the scripts you submit with sbatch, which helps us replicate exactly what you did.

This is good reproducibility practice.

Common Questions

How do I know how many nodes and cores, how much memory, etc. to request for my jobs?
How do I find out whether a cluster/partition supports these resources?
How do I find out whether these resources are available on the cluster?
How long will I have to wait in the queue before my job starts? How busy is the cluster?
How do I monitor the progress of my job?

Common Questions: Suggestions

How do I know how many nodes and cores, how much memory, etc. to request for my jobs?
- Understand your software and application.
  - Read the docs – look at the help for commands/options.
  - Can it run multiple threads, using multiple cores (OpenMP) or multiple nodes (MPI)?
  - Can it use a GPU (NVIDIA CUDA)?
  - Are there suggestions on data and memory requirements?
How do I find out whether a cluster/partition supports these resources?
How do I find out whether these resources are available on the cluster?
- Consult the wiki: Beartooth Hardware Summary Table
How long will I have to wait in the queue before my job starts?
- How busy is the cluster?
- Check current cluster utilization: the sinfo and arccjobs commands, and the SouthPass status page.
How do I monitor the progress of my job?
- Slurm commands: squeue
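
As a rough sketch of how the queue-related questions above might be answered from the command line, the helper functions below wrap standard `squeue` and `sinfo` options. The function names are made up for illustration and are not part of Slurm or of any ARCC tooling.

```shell
# Hypothetical helper names; only the squeue/sinfo options are standard Slurm.

# List your own queued and running jobs.
my_jobs() {
    squeue --user "$USER"
}

# Ask Slurm for its current estimate of when your pending jobs will start.
# The estimate shifts as other jobs finish early or new jobs are submitted.
my_job_start_estimates() {
    squeue --user "$USER" --start
}

# Summarize node states per partition to get a feel for how busy things are.
partition_summary() {
    sinfo --summarize
}
```

Start-time estimates are only a guide; a job can start earlier or later than Slurm predicts.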

Common Issues

...

Not defining the account and time options.

...

The account is the name of the project you are associated with. It is not your username.
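
If you are unsure which account names you can use, one possible way to look them up is through Slurm's accounting tool. This is a sketch only: the function name is hypothetical, and whether `sacctmgr` queries are open to regular users depends on how the site is configured.

```shell
# Hypothetical helper: list the Slurm accounts (projects) the current
# user is associated with, one per line, without duplicates.
my_accounts() {
    sacctmgr --noheader --parsable2 show associations \
        user="$USER" format=Account | sort -u
}
```

Any name it prints is a candidate value for `--account`.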

...

Requesting combinations of resources that cannot be satisfied (see the Beartooth Hardware Summary Table).

For example, you cannot request 40 cores on a teton node (max of 32).
Requesting too much memory, or too many GPU devices, for a given partition.
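
Alongside the hardware summary table, the per-node limits can be read from Slurm itself. The sketch below uses standard `sinfo` format fields; the function name is hypothetical, and the partition name in the usage comment is an assumption based on the teton example above.

```shell
# Hypothetical helper: print, per partition, the name (%P), node count (%D),
# CPUs per node (%c), memory per node in MB (%m), generic resources such
# as GPUs (%G), and the maximum time limit (%l).
# Optional argument: a partition name to restrict the output to.
partition_limits() {
    if [ -n "${1:-}" ]; then
        sinfo --partition "$1" --format "%P %D %c %m %G %l"
    else
        sinfo --format "%P %D %c %m %G %l"
    fi
}

# Usage (partition name assumed):
#   partition_limits teton
```

A request that exceeds the CPUs, memory, or GPUs reported for every node in the chosen partition cannot be satisfied.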

...

Why is my job pending?

Because the resources are currently not available.
Have you unnecessarily defined a specific partition (restricted yourself) that is busy?
We only have a small number of GPUs.
This is a shared resource - sometimes you just have to be patient…
Check current cluster utilization.
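
Slurm records a reason for every pending job, which can help tell "the cluster is busy" apart from "this request can never run". A minimal sketch, with a hypothetical function name and standard `squeue` format fields:

```shell
# Hypothetical helper: show the job id (%i), partition (%P), state (%T),
# and the reason Slurm gives for that state (%r) for one job.
pending_reason() {
    squeue --jobs "$1" --format "%i %P %T %r"
}

# Usage:
#   pending_reason <jobid>
```

Reasons such as `Resources` or `Priority` usually mean the job is simply waiting its turn.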

Preemption: Users of an investment get priority on their hardware.

...


What does a general workflow look like?

...

Develop/Try/Test:

Typically use an interactive session (salloc) where you’re typing/trying/testing.
Are modules available? If not, submit a New HPC Software Consultation request to start the discussion.
Develop code/scripts.
Understand how the command-line works – what commands/scripts to call with options.
Understand if parallelization is available – can you optimize your code/application?
Test against a subset of data. Something that runs quickly – maybe a couple of minutes/hours.
Do the results look correct?

...

Put it all together within a bash Slurm script:
- Request appropriate resources using #SBATCH
- Request appropriate wall time – hours, days…
- Load modules: module load …
- Run scripts/command-line.
Finally, submit your job to the cluster (sbatch) using a complete set of data.
- Use: sbatch <script-name.sh>
- Monitor job(s) progress.
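
Put together, the steps above might look something like the following template. It is a sketch under assumptions rather than a recommended configuration: the resource values, the module versions (taken from the interactive example earlier on this page), and the `Rscript` command with its file names are placeholders to replace with your own.

```shell
#!/bin/bash
# Project name, not your username.
#SBATCH --account=<project-name>
# Wall time as hh:mm:ss (assumed value; base it on your test runs).
#SBATCH --time=02:00:00
# Assumed resources; match them to what your application can actually use.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load exactly the modules the job needs, starting from a clean slate.
module purge
module load gcc/13.2.0 r/4.4.0

# Hypothetical workload run against the complete data set.
Rscript my_analysis.R full_dataset.csv
```

Save it as a file, submit it with sbatch, and keep an eye on it with squeue.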

...

Threads - multiple CPUs/cores: Single node, single task, multiple cores.
- Example: Chime
OpenMP: Single task, multiple cores. Set an environment variable (typically OMP_NUM_THREADS).
- An application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.
- Example: ImageMagick
MPI (Message Passing Interface): Multiple nodes, multiple tasks.
- OpenMPI: ARCC Wiki: OpenMPI and oneAPI Compiling
Hybrid: MPI / OpenMP and/or threads.
- Examples: DFTB and Quantum Espresso
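
For the OpenMP case, a common pattern is to tie the thread count to the cores Slurm actually allocated. The sketch below assumes a request of 8 cores and a hypothetical program name; neither comes from this page.

```shell
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
# Single node, single task, multiple cores (8 is an assumed value).
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
# Fall back to 1 so the script still behaves sensibly outside a job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS: $OMP_NUM_THREADS"

# Hypothetical OpenMP program; replace with your own command.
# ./my_openmp_app
```

An MPI application would instead request multiple tasks, possibly across several nodes, and is typically launched with `srun`.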

...

What does it mean for an application to be GPU enabled?

...

GPU / NVIDIA / CUDA?
Examples:
- Applications: AlphaFold and GPU Blast
- Via conda-based environments built with GPU libraries and converted to Jupyter kernels:
  - Examples: TensorFlow and PyTorch
  - Jupyter Kernels: PyTorch 1.13.1

...

How can I be a good cluster citizen?

Policies
Don’t run intensive applications on the login nodes.
Understand your software/application.
Shared resource - multi-tenancy.
- Other jobs running on the same node do not affect each other.
Don’t ask for everything. Don’t use:
- --mem=0 (requests all of the memory on a node).
- the --exclusive option.
Only ask for a GPU if you know it’ll be used.
Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network.
- You will need to copy files back before the job ends.
Track usage and job performance: seff <jobid>
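
The /lscratch advice above amounts to a stage-in, run, stage-out pattern. The function below is a minimal sketch of that pattern: its name is hypothetical, `sort` is a stand-in for a real I/O-intensive step, and the directory layout in the usage comment is an assumption, so check the ARCC documentation for the actual paths.

```shell
# Hypothetical helper: copy an input file to node-local scratch, work on
# it there, then copy the result back to shared storage and clean up.
# Arguments: <input-file> <scratch-dir> <results-dir>
stage_run_collect() {
    local input_file="$1" scratch_dir="$2" results_dir="$3"

    mkdir -p "$scratch_dir" "$results_dir"
    cp "$input_file" "$scratch_dir/input.txt"

    # Stand-in for the I/O-intensive step; replace with your own command.
    sort "$scratch_dir/input.txt" > "$scratch_dir/sorted.txt"

    # Copy results back before the job ends, then free the local space.
    cp "$scratch_dir/sorted.txt" "$results_dir/"
    rm -rf "$scratch_dir"
}

# Usage inside a job script (paths are assumptions):
#   stage_run_collect /gscratch/<username>/data.txt \
#       "/lscratch/$SLURM_JOB_ID" /gscratch/<username>/results
```

Once the job has finished, `seff <jobid>` reports how much of the requested CPU time and memory was actually used, which helps size the next request.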

...


Version	Old Version 7	New Version 12
Changes made by	Simon Alexander	Simon Alexander
Saved on	Aug 01, 2024	Aug 21, 2024
