Introduction: This workshop introduces users to job management with the Slurm workload manager, demonstrating how to create interactive sessions and how to submit jobs that follow a basic workflow to the cluster queue. After the workshop, participants will understand:

  • How to create a script that defines their workflow (e.g. loading modules).

  • How to start interactive sessions to work within, and how to submit and track jobs on the cluster.

  • Prerequisites: participants will need an introductory level of experience with Linux, as well as the ability to use a text editor from the command line.

Course Goals:

Introduce:

  • Slurm: What is Slurm?

  • How to start an interactive session and perform job submission, resource selection, and monitoring.

  • How to select appropriate resource allocations.

  • How to monitor your jobs.

  • What does a general workflow look like?

  • Best practices for using HPC.

  • How to be a good cluster citizen.

...

  • You submit a job to the queue and walk away.

  • Monitor its progress/state using the command line and/or email notifications (see the notification sketch after this list).

  • Once complete, come back and analyze results.
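As an illustration of the email-notification part of this workflow, a job script can request emails via #SBATCH directives; the events and address below are placeholders, not a required configuration:

    #SBATCH --mail-type=BEGIN,END,FAIL    # job events that trigger an email
    #SBATCH --mail-user=you@example.edu   # placeholder address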

...

Submit Jobs: sbatch: Template:

...
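The site template itself is omitted above; purely as an illustrative sketch, a minimal Slurm batch script generally looks like the following (account, module, and program names are placeholders):

    #!/bin/bash
    #SBATCH --job-name=example            # name shown in the queue
    #SBATCH --account=<your-project>      # placeholder: your project/account, not your username
    #SBATCH --time=01:00:00               # wall time (hh:mm:ss)
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --mem=4G
    #SBATCH --output=%x_%j.out            # output file named after the job name and job ID

    module load gcc                       # placeholder module
    ./my_program                          # placeholder command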

  • How do I know what number of nodes, cores, memory, etc. to request for my jobs?

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads, i.e. use multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources, and whether they are currently available?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current cluster utilization: use the sinfo / arccjobs commands and the SouthPass status page (see the sketch after this list).

  • How do I monitor the progress of my job?

    • Slurm commands: squeue
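A minimal sketch of checking what resources partitions offer and of monitoring your own jobs; the format string and job ID are illustrative, and exact output varies by site:

    # Partitions with node count, CPUs, memory (MB), and GPUs (GRES) per node
    sinfo -o "%P %D %c %m %G"

    # All of your queued and running jobs
    squeue -u $USER

    # A specific job (placeholder job ID)
    squeue -j 1234567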

...

Common Issues:

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: see the Beartooth Hardware Summary Table.

    • For example, you cannot request 40 cores on a teton node (max of 32).

    • Requesting too much memory, or too many GPU devices with respect to a partition.

  • Why is my job pending? (See the reason check after this list.)

    • Because the resources are currently not available.

    • Have you unnecessarily restricted yourself to a specific partition that is busy?

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

  • Preemption: users who have invested in hardware get priority on their hardware.

    • The non-investor partition lets other users run on investment hardware, but such jobs can be preempted.
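One way to check why a particular job is pending (the job ID is a placeholder); Slurm reports a reason code such as Resources or Priority:

    # Show the job ID, state, and pending reason
    squeue -j 1234567 -o "%i %T %r"

    # Full job details, including the Reason field
    scontrol show job 1234567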

...

Develop/Try/Test:

  • Typically use an interactive session (salloc) where you’re typing/trying/testing (see the salloc sketch after this list).

  • Are modules available? If not, submit a New Software Request to have them installed.

  • Develop code/scripts.

  • Understand how the command line works – which commands/scripts to call and with which options.

  • Understand if parallelization is available – can you optimize your code/application?

  • Test against a subset of data: something that runs quickly – maybe a couple of minutes/hours.

  • Do the results look correct?
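As an illustration (the account name, module, and test command are placeholders), an interactive session might look like this:

    # Request an interactive allocation: 1 task for 1 hour
    salloc --account=<your-project> --time=01:00:00 --nodes=1 --ntasks=1

    # Once the allocation starts, work inside it interactively:
    module load python            # placeholder module
    python test_subset.py         # placeholder: quick test against a data subset
    exit                          # release the allocation when finished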

...

  • Put it all together within a bash Slurm script: 

    • Request appropriate resources using #SBATCH

    • Request appropriate wall time – hours, days…

    • Load modules: module load …

    • Run scripts/command-line.

  • Finally, submit your job to the cluster (sbatch) using a complete set of data.

    • Use: sbatch <script-name.sh>

    • Monitor the progress of your job(s) (see the sketch after this list).
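A brief sketch of the submit-and-monitor cycle; the script name and job ID are placeholders:

    # Submit the batch script; Slurm prints the assigned job ID
    sbatch my_workflow.sh

    # Check the queue for your jobs
    squeue -u $USER

    # After the job finishes, review its accounting record
    sacct -j 1234567 --format=JobID,JobName,Elapsed,State,MaxRSS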

...

How can I be a good cluster citizen?

  • Policies

  • Don’t run intensive applications on the login nodes.

  • Understand your software/application.

  • Shared resource - multi-tenancy.

    • Jobs from different users may run on the same node; each job is limited to the resources it requested, so jobs do not affect each other.

  • Don’t ask for everything:

    • Don’t use --mem=0 (which requests all of a node’s memory).

    • Don’t use the --exclusive flag.

    • Only ask for a GPU if you know it’ll be used.

  • Use /lscratch for I/O intensive tasks rather than accessing /gscratch over the network. 

    • You will need to copy files back to /gscratch before the job ends (see the sketch after this list).

  • Track usage and job performance: seff <jobid>
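A hedged sketch of the copy-back step inside a job script, followed by a post-job efficiency check; the local-scratch layout, paths, and job ID are placeholders and will differ by site:

    # Inside the job script: do I/O-heavy work in node-local scratch
    cd /lscratch/$SLURM_JOB_ID                 # placeholder: site-specific local scratch directory
    ./my_io_heavy_step                         # placeholder command
    cp -r results/ /gscratch/$USER/project/    # copy results back before the job ends

    # After the job completes, check CPU and memory efficiency
    seff 1234567                               # placeholder job ID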

...