  1. Slurm will identify a list of partitions to try to run your job against.

  2. Slurm will automatically prioritize your investment partition and its hardware, and will try to run your job there first.

  3. Slurm checks for jobs running on your investment.

    1. If:
      (tick) there are no jobs running on your investment
      (tick) and the hardware requested in your job fits within the hardware confines of your partition
      Slurm immediately starts your job.

    2. If:
      (tick) there are jobs running on your investment
      (tick) and they were submitted by HPC users who are not members of any project associated with your investment

      1. Slurm will preempt these jobs (i.e. stop them and return them to the queue).

      2. Slurm will immediately start your job.

    3. If:
      (tick) your investment hardware is 'full' with jobs from users who are members of project(s) associated with your investment

      1. Slurm will try to allocate your requested job across the other partitions

        1. On MedicineBow, the other partitions are tried in this order: mb, mb-a30, mb-l40s, mb-h100.

        2. On Beartooth, the other partitions are tried in this order: moran, teton, teton-cascade, teton-gpu, teton-hugemem.

        3. If:
          (tick) resources are available

          1. Slurm will start your job.

      2. If:
        (tick) there are no resources available to fit your job (i.e. cluster usage is very high)

        1. Slurm will place your job in the queue, where it will have a state of PENDING.

        2. Slurm will check the queue at regular intervals and run the job when suitable resources become available.
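
The decision flow above can be sketched as a small Python model. This is only an illustration of the steps as described on this page, not Slurm's actual scheduler: resources are reduced to a single core count, and the names `Partition`, `RunningJob`, `place_job` and the partition name `inv-example` are hypothetical. The fallback partition lists are the ones given above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Fallback partition order for each cluster, as listed above.
FALLBACK_PARTITIONS = {
    "medicinebow": ["mb", "mb-a30", "mb-l40s", "mb-h100"],
    "beartooth": ["moran", "teton", "teton-cascade", "teton-gpu", "teton-hugemem"],
}


@dataclass
class RunningJob:
    # True if the job owner is a member of a project tied to the investment.
    owner_is_investor: bool
    cores: int


@dataclass
class Partition:
    name: str
    total_cores: int
    running: List[RunningJob] = field(default_factory=list)

    def free_cores(self) -> int:
        return self.total_cores - sum(job.cores for job in self.running)


def place_job(
    cores_needed: int, investment: Partition, others: List[Partition]
) -> Tuple[str, Optional[str]]:
    """Return (outcome, partition name) for a simplified version of the flow."""
    # The request must fit within the hardware of the investment partition.
    if cores_needed <= investment.total_cores:
        # Enough idle hardware on the investment: start immediately.
        if cores_needed <= investment.free_cores():
            return ("start", investment.name)
        # Jobs from non-investors can be preempted to make room.
        preemptible = sum(
            job.cores for job in investment.running if not job.owner_is_investor
        )
        if cores_needed <= investment.free_cores() + preemptible:
            return ("preempt_then_start", investment.name)
    # Investment is full of investor jobs: try the other partitions in order.
    for partition in others:
        if cores_needed <= partition.free_cores():
            return ("start", partition.name)
    # Nothing fits anywhere: the job waits in the queue.
    return ("pending", None)
```

For example, with an investment partition fully occupied by jobs from its own project members, `place_job` walks the fallback list in order and returns the first partition with enough free cores, or `("pending", None)` if none has room.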

A job can be pending for several reasons, but Slurm reports only one of them at a time, which can be confusing. For example, it might report BadConstraint (the job's constraints cannot be satisfied) or Resources (the job is waiting for resources to become available). The full list of reason codes is available at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
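
To see which reason Slurm is currently reporting for a job, you can query `squeue` for the job's state and reason. The following is a minimal sketch, assuming `squeue` is on your PATH and the job is still in the queue. The helper names `parse_state_reason` and `job_state_reason` are hypothetical, and the `|` separator is just a choice made here for easy parsing.

```python
import subprocess
from typing import Optional, Tuple


def parse_state_reason(line: str) -> Tuple[str, str]:
    """Split one 'STATE|REASON' line of squeue output into its two fields."""
    state, _, reason = line.strip().partition("|")
    return state, reason


def job_state_reason(job_id: str) -> Optional[Tuple[str, str]]:
    """Ask squeue for the state (%T) and reason (%r) of a single job."""
    result = subprocess.run(
        ["squeue", "-j", job_id, "--noheader", "--format=%T|%r"],
        capture_output=True,
        text=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    # No output means squeue no longer lists the job.
    return parse_state_reason(lines[0]) if lines else None
```

Calling `job_state_reason` with your job ID returns a `(state, reason)` pair while the job is listed. Because only one reason is shown at a time, the value may change between calls.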
