Slurm: Common Questions and Issues, and How to Resolve Them

Goal: List some common questions and issues, and how to resolve them.


Common Questions

  • How do I know what number of nodes, cores, memory, etc. to request for my jobs?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? How busy is the cluster?

  • How do I monitor the progress of my job?

  • My job finished before the wall time - what happens?


Common Questions: Suggestions

  • How do I know what number of nodes, cores, memory, etc. to request for my jobs?

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads, i.e. use multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Check current cluster utilization: the sinfo / arccjobs / pestat commands, and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm commands: squeue

  • My job finished before the wall time - what happens?

    • If your job has completely finished before the wall time you requested (e.g. it finished in two hours and you requested four), i.e. its status is no longer running, then Slurm will remove the job from the queue, release any requested resources back to the cluster, and allow other jobs to start running. Your job is not sitting idle on the cluster waiting for the wall time to run down.
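Beyond a bare squeue, the state column tells you whether each job is running or pending. A minimal sketch of summarizing your jobs by state; the sample lines mimic what squeue -u $USER -h -o "%i %T" would print (the job IDs are illustrative):

```shell
# Sketch: summarize your jobs by state. The sample lines mimic the
# output of `squeue -u $USER -h -o "%i %T"` (job IDs are illustrative).
sample_output="1250349 RUNNING
1250350 PENDING
1250351 PENDING"

running=$(printf '%s\n' "$sample_output" | grep -c ' RUNNING$')
pending=$(printf '%s\n' "$sample_output" | grep -c ' PENDING$')
echo "running=$running pending=$pending"
# → running=1 pending=2
```

On the cluster itself you would pipe the real squeue output instead of the sample text.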


Common Issues

  • Not defining the account and time options.

    • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: see the MedicineBow Hardware Summary Table.

    • For example, you cannot request 40 cores on a compute node with a maximum of 32.

    • Requesting too much memory, or too many GPU devices, for a given partition.

  • My job is pending? Why? 

    • Because the resources are currently not available.

      • Be mindful of specialized nodes (such as our huge-memory nodes with 4TB of RAM); we might only have a few of them.

    • Have you unnecessarily defined a specific partition (restricting yourself) that is busy?

    • We only have a small number of GPUs.

    • Check current cluster utilization.

    • Whatever resources you are asking for are currently not available. Slurm will start your job when they become available.

    • We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.

  • Preemption: Users of an investment get priority on their hardware.

    • We have the non-investor partition.
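For pending jobs, the REASON column of squeue hints at which of the situations above applies. A sketch of translating the most common reason codes into plain language (the reason value is hard-coded for illustration, and the wording of the messages is our own):

```shell
# Sketch: translate common squeue REASON codes into plain language.
# The reason value here is hard-coded for illustration.
reason="Resources"
case "$reason" in
  Resources)       why="requested resources are currently in use" ;;
  Priority)        why="queued behind higher-priority jobs" ;;
  ReqNodeNotAvail) why="a required node is down, drained, or reserved (e.g. maintenance)" ;;
  *)               why="see the squeue man page for other codes" ;;
esac
echo "Pending because: $why"
# → Pending because: requested resources are currently in use
```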


Pending Jobs and Scheduled Cluster Maintenance

When we have scheduled maintenance on the cluster, an announcement will go out indicating the date/time it is scheduled to start.

All jobs currently running on the cluster are allowed to finish, and Slurm will accept and start further jobs only if it can complete them before the maintenance starts.

Be aware: if an announcement goes out on a Monday that maintenance starts the following Friday, there is only a four-day window within which any new job must complete.

If you submit a job with a wall time of seven days, then it cannot and will not be started, since it cannot complete before Friday.

Your job will be accepted by Slurm, queued with a status of pending, and automatically started once maintenance is complete and Slurm is restarted.
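The arithmetic behind this is simple: a job can only start if its requested walltime fits entirely before the maintenance start. A sketch with illustrative dates matching the Monday/Friday example:

```shell
# Sketch: will a 7-day job fit before maintenance? Dates are illustrative.
maint_start=$(date -d "2025-06-06 08:00" +%s)  # Friday: maintenance begins
now=$(date -d "2025-06-02 08:00" +%s)          # Monday: submission time
walltime_s=$(( 7 * 24 * 3600 ))                # requested walltime: 7 days

if (( now + walltime_s > maint_start )); then
  echo "Job cannot finish before maintenance; it will pend until after."
else
  echo "Job fits in the window and can start."
fi
```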


Required: Account and Walltime

Remember: By default you must define the project (account) you’re using and a walltime.

[]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[]$ salloc -A <project-name>
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[]$ salloc -A <project-name> -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job
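The same two required options apply to batch jobs. A minimal sbatch script sketch (replace <project-name> with your own project; the echo line is just a placeholder workload):

```shell
#!/bin/bash
# Minimal batch script: account and walltime are the required options.
#SBATCH --account=<project-name>   # your project, not your username
#SBATCH --time=10:00               # 10 minutes of walltime
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Placeholder workload:
echo "Running on $(hostname)"
```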

Correct Partitions

[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified
[]$ sinfo
PARTITION    AVAIL TIMELIMIT  NODES STATE  NODELIST
mb*          up    7-00:00:00     4 mix    mbcpu-[008,010-011,025]
mb*          up    7-00:00:00     8 alloc  mbcpu-[001-007,009]
mb*          up    7-00:00:00    13 idle   mbcpu-[012-024]
mb-a30       up    7-00:00:00     8 idle   mba30-[001-008]
mb-l40s      up    7-00:00:00     1 mix    mbl40s-001
mb-l40s      up    7-00:00:00     5 idle   mbl40s-[002-005,007]
mb-h100      up    7-00:00:00     1 drain$ mbh100-001
mb-h100      up    7-00:00:00     1 down$  mbh100-006
mb-h100      up    7-00:00:00     1 drain  mbh100-002
mb-h100      up    7-00:00:00     3 mix    mbh100-[003-005]
mb-a6000     up    7-00:00:00     1 idle   mba6000-001
inv-arcc     up    infinite       1 mix    mbcpu-025
inv-inbre    up    infinite       1 idle   mbl40s-007
inv-ssheshap up    infinite       1 idle   mba6000-001
inv-wysbc    up    infinite       1 alloc  mbcpu-001
inv-wysbc    up    infinite       1 idle   mba30-001
inv-soc      up    infinite       1 mix    mbl40s-001
inv-wildiris up    infinite       5 idle   wi[001-005]
non-investor up    7-00:00:00     1 drain$ mbh100-001
non-investor up    7-00:00:00     1 down$  mbh100-006
non-investor up    7-00:00:00     1 drain  mbh100-002
non-investor up    7-00:00:00     6 mix    mbcpu-[008,010-011],mbh100-[003-005]
non-investor up    7-00:00:00     7 alloc  mbcpu-[002-007,009]
non-investor up    7-00:00:00    24 idle   mba30-[002-008],mbcpu-[012-024],mbl40s-[002-005]
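The failed salloc above came from a partition name that does not exist (mb-l40 instead of mb-l40s). A sketch of checking a name against the partition list before submitting; the (abbreviated) list is copied from the sinfo output above, with the default-partition marker * dropped:

```shell
# Sketch: check a partition name against the cluster's partition list
# before submitting. The list is abbreviated from the sinfo output above.
partitions="mb mb-a30 mb-l40s mb-h100 mb-a6000 inv-arcc non-investor"
want="mb-l40"

if printf '%s\n' $partitions | grep -qx "$want"; then
  echo "partition $want exists"
else
  echo "invalid partition: $want (did you mean mb-l40s?)"
fi
# → invalid partition: mb-l40 (did you mean mb-l40s?)
```

On the cluster you could build the list live with sinfo instead of hard-coding it.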

Timeouts


My Jobs Need to Run Longer than 7 Days


Requested node configuration is not available

Too many cores on a node:

Must define a GPU-enabled partition:


OUT-OF-MEMORY: Segmentation Fault


My Job Stopped and Re-Started: Preemption


Why Is My Job One-of-Many on a Compute Node?


Can I Submit Jobs from within a Script?


Closing a Linux Session while Running an salloc