Goal: List some common issues and how to resolve them.

...

  • How do I know what number of nodes, cores, memory, etc. to ask for in my jobs? (See the sketch after this list.)

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads across multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Check current cluster utilization: the sinfo / arccjobs / pestat commands and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm command: squeue (see the example after this list).
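
As a concrete illustration of turning the answers above into a request, here is a minimal sketch of a batch script. It is illustrative only: the resource values are placeholders, not recommendations, and <project-name> must be replaced with your actual project.

Code Block
#!/bin/bash
# Minimal illustrative sketch - resource values are placeholders, not recommendations.
#SBATCH --account=<project-name>    # your project, not your username
#SBATCH --time=01:00:00             # walltime (required)
#SBATCH --nodes=1                   # MPI applications may span multiple nodes
#SBATCH --ntasks-per-node=1         # MPI ranks per node
#SBATCH --cpus-per-task=8           # cores/threads for an OpenMP-capable application
#SBATCH --mem=16G                   # memory per node
##SBATCH --partition=mb-a30         # uncomment for a CUDA-capable application...
##SBATCH --gres=gpu:1               # ...and request a GPU device on that partition

srun ./my_application               # placeholder executable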
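
To monitor a job, you can list your own jobs and their state with squeue. This is a minimal sketch, not output captured from a real session; the job ID, name, and node are placeholder values.

Code Block
# ST shows the job state, e.g. R (running) or PD (pending).
[salexan5@mblog2 ~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1250349        mb interact salexan5  R       5:21      1 mbcpu-025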

...

Common Issues

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: see the MedicineBow Hardware Summary Table.

    • For example, you cannot request 40 cores on a compute node that has a maximum of 32.

    • Requesting too much memory or too many GPU devices for the chosen partition.

  • My job is pending. Why? (See the example after this list.)

    • Because the resources are currently not available.

      • Be mindful of specialized nodes (such as our huge-memory nodes with 4TB of RAM); we might only have a few of them.

    • Have you unnecessarily defined a specific partition (restricting yourself) that is busy?

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

    • Whatever resources you are asking for are currently not available. Slurm will start your job when they become available.

    • We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.

  • Preemption: Users of an investment get priority on their hardware.

    • For users who are not part of an investment, there is the non-investor partition.
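
To see why a particular job is still pending, check the reason Slurm reports. This is a minimal sketch; the job ID, partition, and reason shown are illustrative values, not output from a real session.

Code Block
# REASON is typically Resources (nothing is free yet) or Priority (other jobs are ahead of yours).
[salexan5@mblog2 ~]$ squeue -u $USER -t PENDING -O JobID,Partition,State,Reason
JOBID               PARTITION           STATE               REASON
1250907             mb                  PENDING             Priority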

...

Info

Remember: By default you must define the project (account) you’re using and a walltime.

Code Block
[salexan5@mblog2 ~]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[salexan5@mblog2 ~]$ salloc -A arcc<project-name>
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[salexan5@mblog2 ~]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job

...

Info

If you need to explicitly request a partition, the name must be correct:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified

...

Expand
Example: sinfo
Code Block
[salexan5@mblog2 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      4    mix mbcpu-[008,010-011,025]
mb*             up 7-00:00:00      8  alloc mbcpu-[001-007,009]
mb*             up 7-00:00:00     13   idle mbcpu-[012-024]
mb-a30          up 7-00:00:00      8   idle mba30-[001-008]
mb-l40s         up 7-00:00:00      1    mix mbl40s-001
mb-l40s         up 7-00:00:00      5   idle mbl40s-[002-005,007]
mb-h100         up 7-00:00:00      1 drain$ mbh100-001
mb-h100         up 7-00:00:00      1  down$ mbh100-006
mb-h100         up 7-00:00:00      1  drain mbh100-002
mb-h100         up 7-00:00:00      3    mix mbh100-[003-005]
mb-a6000        up 7-00:00:00      1   idle mba6000-001
inv-arcc        up   infinite      1    mix mbcpu-025
inv-inbre       up   infinite      1   idle mbl40s-007
inv-ssheshap    up   infinite      1   idle mba6000-001
inv-wysbc       up   infinite      1  alloc mbcpu-001
inv-wysbc       up   infinite      1   idle mba30-001
inv-soc         up   infinite      1    mix mbl40s-001
inv-wildiris    up   infinite      5   idle wi[001-005]
non-investor    up 7-00:00:00      1 drain$ mbh100-001
non-investor    up 7-00:00:00      1  down$ mbh100-006
non-investor    up 7-00:00:00      1  drain mbh100-002
non-investor    up 7-00:00:00      6    mix mbcpu-[008,010-011],mbh100-[003-005]
non-investor    up 7-00:00:00      7  alloc mbcpu-[002-007,009]
non-investor    up 7-00:00:00     24   idle mba30-[002-008],mbcpu-[012-024],mbl40s-[002-005]
Code Block
# Corrected:
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job

...

Info

The maximum allowed wall time is 7 days:

Code Block
[arcc-t01@mblog2 ~]$ salloc -A arccanetrain<project-name> -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[arcc-t01@mblog2 ~]$ salloc -A arccanetrain<project-name> -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job

...

Too many cores on a node:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available

You must define a GPU-enabled partition when requesting a GPU:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job
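
Once a GPU allocation has been granted, you can sanity-check that a device is actually visible from inside the job. This is an illustrative sketch only; the UUID is a placeholder, not output from a real session.

Code Block
# Run from within the salloc session; srun inherits the job allocation.
[salexan5@mblog2 ~]$ srun nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)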

...