Introduction: This workshop will introduce users to job management using the Slurm system, demonstrating how to create interactive jobs and submit jobs to the cluster queue following a basic workflow. After the workshop, participants will understand:

...

Table of Contents

...

01: Slurm

Topics:

  • Slurm: 

    • Interactive sessions.

    • Job submission.

    • Resource selection.

    • Monitoring.

...

  1. Allocates access to appropriate compute nodes specific to your requests.

  2. Provides a framework for starting, executing, monitoring, and even canceling your jobs.

  3. Manages the queue and provides job state notifications.
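
In practice, these three roles map onto a handful of commands (the project name and job ID below are placeholders):

    salloc --account=<your-project> --time=01:00:00   # 1: request an interactive allocation
    sbatch myjob.sh                                    # 2: submit a batch job for Slurm to start and execute
    squeue -u $USER                                    # 2/3: monitor your jobs' state in the queue
    scancel <jobid>                                    # 2: cancel a running or pending job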

...

ARCC: Slurm: 

...

Exercises:

...

  • You submit a job to the queue and walk away.

  • Monitor its progress/state using command-line and/or email notifications.

  • Once complete, come back and analyze results.
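
For example, the email notifications mentioned above are enabled with two Slurm options in your submission script (the address below is a placeholder):

    #SBATCH --mail-type=BEGIN,END,FAIL    # which job events trigger an email
    #SBATCH --mail-user=you@example.edu   # where the notifications are sent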

...

Submit Jobs: sbatch: Template:
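
The workshop template itself is elided here; a minimal sketch of such a script (project name, module, and program names are placeholders) might look like:

    #!/bin/bash
    #SBATCH --account=<your-project>   # the project account - NOT your username
    #SBATCH --time=01:00:00            # wall time: HH:MM:SS
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH --job-name=example
    #SBATCH --output=slurm_%j.out      # %j expands to the job ID

    module load gcc                    # load whatever your application needs

    ./my_program input.dat             # the actual work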

...

  • How do I know how many nodes, cores, memory, etc. to request for my jobs?

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads, i.e. use multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current cluster utilization: the sinfo / arccjobs commands and the SouthPass status page (see the sketch after this list).

  • How do I monitor the progress of my job?

    • Slurm commands: squeue
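
A sketch of checking what a partition offers and how busy things are (the sinfo output fields are chosen for illustration; arccjobs is ARCC-specific):

    sinfo -o "%P %D %c %m %G"   # per partition: node count, cores per node, memory (MB), GPUs (GRES)
    arccjobs                    # ARCC-specific summary of current cluster activity
    squeue -u $USER             # the state and reason codes of your own queued jobs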

...

Common Issues:

  • Not defining the account and time options (see the example after this list).

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: Beartooth Hardware Summary Table

    • For example, you cannot request 40 cores on a teton node (max of 32).

    • Requesting too much memory, or too many GPU devices with respect to a partition.

  • My job is pending. Why? 

    • Because the resources are currently not available.

    • Have you unnecessarily defined a specific partition (restricting yourself) that is busy?

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

  • Preemption: Users of an investment get priority on their hardware.

    • We have the non-investor partition.
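
As a sketch of avoiding the first two issues above, every request should define an account and a time limit, and stay within a node's hardware (project name and values are placeholders):

    salloc --account=<your-project> --time=00:30:00          # interactive: both options defined
    sbatch --account=<your-project> --time=02:00:00 run.sh   # batch: options can also live in the script
    # e.g. a request of --cpus-per-task=40 on a 32-core node can never be satisfied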

...

02: Workflows and Best Practices

Topics:

  • What does a general workflow look like?

  • Best practices in using HPC.

  • How to be a good cluster citizen?

...

Develop/Try/Test:

  • Typically use an interactive session (salloc) where you’re typing/trying/testing (a sketch follows this list).

  • Are modules available? If not, submit a New Software Request to have the software installed.

  • Develop code/scripts.

  • Understand how the command-line works – what commands/scripts to call with options.

  • Understand if parallelization is available – can you optimize your code/application?

  • Test against a subset of data. Something that runs quickly – maybe a couple of minutes/hours.

  • Do the results look correct?
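
A sketch of such a session (the module, script, and data names are placeholders; check module avail for what actually exists on your cluster):

    salloc --account=<your-project> --time=01:00:00
    module load python/3.10                  # assuming such a module exists
    python analysis.py --input subset.dat    # quick test against a small subset of data
    exit                                     # release the allocation when done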

...

  • Put it all together within a bash Slurm script: 

    • Request appropriate resources using #SBATCH

    • Request appropriate wall time – hours, days…

    • Load modules: module load …

    • Run scripts/command-line.

  • Finally, submit your job to the cluster (sbatch) using a complete set of data.

    • Use: sbatch <script-name.sh>

    • Monitor job(s) progress.
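
As a concrete sequence (script name and job ID are placeholders; the output file name assumes the template sketched earlier):

    sbatch run.sh               # prints: Submitted batch job <jobid>
    squeue -u $USER             # watch the job move from PD (pending) to R (running)
    tail -f slurm_<jobid>.out   # follow the job’s output as it runs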

...

  • Threads - multiple cpus/cores: Single node, single task, multiple cores.

  • OpenMP: Single task, multiple cores. Set an environment variable (see the sketch after this list).

    • An application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.

    • Example: ImageMagick

  • MPI: Message Passing Interface: Multiple nodes, multiple tasks.

  • Hybrid: MPI / OpenMP and/or threads.
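
For the OpenMP case, the environment variable in question is OMP_NUM_THREADS, which can be tied to the cores Slurm grants (program name is a placeholder):

    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8                      # single task, eight cores on one node

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # match OpenMP threads to the granted cores
    ./my_openmp_program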

...

  • GPU / NVIDIA / CUDA?

  • Examples:

    • Applications: AlphaFold and GPU Blast

      • Via conda-based environments built with GPU libraries, and converted to Jupyter kernels:

      • Examples: TensorFlow and PyTorch 

      • Jupyter Kernels: PyTorch 1.13.1
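
Requesting a GPU in a job typically uses Slurm’s generic-resource flag; a sketch (partition name is a placeholder, and the CUDA module name is an assumption):

    #SBATCH --partition=<gpu-partition>   # a partition that actually provides GPUs
    #SBATCH --gres=gpu:1                  # request one GPU device

    module load cuda                      # assuming a CUDA module is provided
    nvidia-smi                            # confirm the GPU is visible before running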

...

How can I be a good cluster citizen?

  • Policies

  • Don’t run intensive applications on the login nodes.

  • Understand your software/application.

  • Shared resource - multi-tenancy.

    • Jobs from other users may run on the same node; they should not affect each other.

  • Don’t ask for everything. Don’t use:

    • mem=0

    • the exclusive tag.

  • Only ask for a GPU if you know it’ll be used.

  • Use /lscratch for I/O-intensive tasks rather than accessing /gscratch over the network (see the sketch after this list). 

    • You will need to copy files back before the job ends.

  • Track usage and job performance: seff <jobid>
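
A sketch of the /lscratch pattern and of checking a finished job’s efficiency (paths are placeholders, and the per-job /lscratch directory layout is an assumption; check your cluster’s documentation):

    # Inside a job script: stage data to node-local /lscratch, work there, copy back.
    cp /gscratch/$USER/input.dat /lscratch/$SLURM_JOB_ID/
    cd /lscratch/$SLURM_JOB_ID
    ./my_program input.dat
    cp results.dat /gscratch/$USER/    # must happen BEFORE the job ends

    seff <jobid>    # afterwards, from a login node: CPU/memory efficiency report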

...

  • Only accurate if the job is successful.

  • If the job fails, with say an Out-Of-Memory (OOM) error, the details will be inaccurate.

  • This is emailed out if you have Slurm email notifications turned on.

...

03: Intermediate/Advanced Next Steps

...

Next Steps to look at: 

Future Workshops:

...

...

Installing Software: 

Installing software comes in a number of forms:

  • Download an existing binary that has been built to run on the cluster’s operating system, i.e. a Windows executable will not run on a Linux-based platform.

  • Download source code:

    • Run using an interpreted language - Python, R.

    • Configure and compile - C/C++/Fortran.

    • Are there specific scientific libraries you need to load? What dependencies does it have?

  • Create a conda environment.

  • Create and/or use an existing Singularity (Docker) container image.

    • A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

  • Request via ARCC Portal: Request New/Updated HPC Software
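
For instance, an existing Docker image can often be pulled as a Singularity image (the image is a placeholder, and the module name is an assumption about your cluster’s setup):

    module load singularity                  # assuming Singularity is provided as a module
    singularity pull docker://ubuntu:22.04   # converts the Docker image into ubuntu_22.04.sif
    singularity exec ubuntu_22.04.sif cat /etc/os-release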

...

Conda: 

Conda:

  • is a package, dependency, and environment management system for any language - Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, Fortran, and more.

  • is an open source package management system and environment management system that runs on Windows, macOS and Linux. 

  • quickly installs, runs and updates packages and their dependencies. 

  • easily creates, saves, loads and switches between environments – which can be exported from one system and imported onto another, e.g. develop on a local desktop, then set up on the cluster.

  • can install and manage the thousands of packages at repo.anaconda.com that are built, reviewed, and maintained by Anaconda®.
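
A typical create/export/import cycle (environment and package names are placeholders):

    conda create -n myenv python=3.11 numpy   # create a named environment
    conda activate myenv                      # switch into it
    conda env export > environment.yml        # save it, e.g. from your desktop
    conda env create -f environment.yml       # recreate it, e.g. on the cluster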

...

Getting Data On/Off the Cluster: 
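
Common command-line options for this include scp and rsync (usernames, hostnames, and paths are placeholders):

    scp data.tar.gz <user>@<cluster-login-node>:/gscratch/<user>/              # one-off copy
    rsync -av results/ <user>@<cluster-login-node>:/gscratch/<user>/results/   # incremental, resumable sync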

...

04: Summary

Summary: 

Covered:

  • Slurm: Interactive sessions, job submission, resource selection and monitoring.

  • What does a general workflow look like?

  • Best practices in using HPC.

  • How to be a good cluster citizen?

  • Intermediate/advanced next steps to consider looking at.