Introduction: This workshop introduces users to job management with the Slurm workload manager, demonstrating how to create interactive sessions and how to submit jobs that follow a basic workflow to the cluster queue. After the workshop, participants will understand:

  • How to create a script that defines their workflow (e.g. loading modules).

  • How to start interactive sessions to work within, and how to submit and track jobs on the cluster.

  • Prerequisites: participants will need an introductory level of experience with Linux, as well as the ability to use a text editor from the command line.

Course Goals:

Introduce:

  • Slurm: What is Slurm?

  • How to start an interactive session and perform job submission, resource selection, and monitoring.

  • How to select appropriate resource allocations.

  • How to monitor your jobs.

  • What does a general workflow look like?

  • Best practices for using HPC.

  • How to be a good cluster citizen.

...

  • You submit a job to the queue and walk away.

  • Monitor its progress/state using the command line and/or email notifications (see the notification sketch after this list).

  • Once complete, come back and analyze results.
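As an illustration of the email-notification part of this workflow, a job script can request emails via #SBATCH directives; the events and address below are placeholders, not a required configuration:

    #SBATCH --mail-type=BEGIN,END,FAIL    # job events that trigger an email
    #SBATCH --mail-user=you@example.edu   # placeholder address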

...

Submit Jobs: sbatch: Template:

...
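The site template itself is omitted above; purely as an illustrative sketch, a minimal Slurm batch script generally looks like the following (account, module, and program names are placeholders):

    #!/bin/bash
    #SBATCH --job-name=example            # name shown in the queue
    #SBATCH --account=<your-project>      # placeholder: your project/account, not your username
    #SBATCH --time=01:00:00               # wall time (hh:mm:ss)
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --mem=4G
    #SBATCH --output=%x_%j.out            # output file named after the job name and job ID

    module load gcc                       # placeholder module
    ./my_program                          # placeholder command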

  • How do I know what number of nodes, cores, memory, etc. to request for my jobs?

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads, i.e. use multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources, and whether they are currently available?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current cluster utilization: use the sinfo / arccjobs commands and the SouthPass status page (see the sketch after this list).

  • How do I monitor the progress of my job?

    • Slurm commands: squeue
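A minimal sketch of checking what resources partitions offer and of monitoring your own jobs; the format string and job ID are illustrative, and exact output varies by site:

    # Partitions with node count, CPUs, memory (MB), and GPUs (GRES) per node
    sinfo -o "%P %D %c %m %G"

    # All of your queued and running jobs
    squeue -u $USER

    # A specific job (placeholder job ID)
    squeue -j 1234567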

...

Common Issues:

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: see the Beartooth Hardware Summary Table.

    • For example, you cannot request 40 cores on a teton node (max of 32).

    • Requesting too much memory, or too many GPU devices with respect to a partition.

  • Why is my job pending? (See the reason check after this list.)

    • Because the resources are currently not available.

    • Have you unnecessarily restricted yourself to a specific partition that is busy?

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

  • Preemption: users who have invested in hardware get priority on their hardware.

    • The non-investor partition lets other users run on investment hardware, but such jobs can be preempted.
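One way to check why a particular job is pending (the job ID is a placeholder); Slurm reports a reason code such as Resources or Priority:

    # Show the job ID, state, and pending reason
    squeue -j 1234567 -o "%i %T %r"

    # Full job details, including the Reason field
    scontrol show job 1234567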

...

Develop/Try/Test:

  • Typically use an interactive session (salloc) where you’re typing/trying/testing (see the salloc sketch after this list).

  • Are modules available? If not, submit a New Software Request to have them installed.

  • Develop code/scripts.

  • Understand how the command line works – which commands/scripts to call and with which options.

  • Understand if parallelization is available – can you optimize your code/application?

  • Test against a subset of data: something that runs quickly – maybe a couple of minutes/hours.

  • Do the results look correct?
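As an illustration (the account name, module, and test command are placeholders), an interactive session might look like this:

    # Request an interactive allocation: 1 task for 1 hour
    salloc --account=<your-project> --time=01:00:00 --nodes=1 --ntasks=1

    # Once the allocation starts, work inside it interactively:
    module load python            # placeholder module
    python test_subset.py         # placeholder: quick test against a data subset
    exit                          # release the allocation when finished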

...

  • Put it all together within a bash Slurm script: 

    • Request appropriate resources using #SBATCH

    • Request appropriate wall time – hours, days…

    • Load modules: module load …

    • Run scripts/command-line.

  • Finally, submit your job to the cluster (sbatch) using a complete set of data.

    • Use: sbatch <script-name.sh>

    • Monitor the progress of your job(s) (see the sketch after this list).
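A brief sketch of the submit-and-monitor cycle; the script name and job ID are placeholders:

    # Submit the batch script; Slurm prints the assigned job ID
    sbatch my_workflow.sh

    # Check the queue for your jobs
    squeue -u $USER

    # After the job finishes, review its accounting record
    sacct -j 1234567 --format=JobID,JobName,Elapsed,State,MaxRSS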

...

How can I be a good cluster citizen?

  • Policies

  • Don’t run intensive applications on the login nodes.

  • Understand your software/application.

  • Shared resource - multi-tenancy.

    • Jobs from different users may run on the same node; each job is limited to the resources it requested, so jobs do not affect each other.

  • Don’t ask for everything:

    • Don’t use --mem=0 (which requests all of a node’s memory).

    • Don’t use the --exclusive flag.

    • Only ask for a GPU if you know it’ll be used.

  • Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network. 

    • You will need to copy files back to /gscratch before the job ends (see the sketch after this list).

  • Track usage and job performance: seff <jobid>
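A hedged sketch of the copy-back step inside a job script, followed by a post-job efficiency check; the local-scratch layout, paths, and job ID are placeholders and will differ by site:

    # Inside the job script: do I/O-heavy work in node-local scratch
    cd /lscratch/$SLURM_JOB_ID                 # placeholder: site-specific local scratch directory
    ./my_io_heavy_step                         # placeholder command
    cp -r results/ /gscratch/$USER/project/    # copy results back before the job ends

    # After the job completes, check CPU and memory efficiency
    seff 1234567                               # placeholder job ID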

...