Goal: List some common issues and how to resolve them.

Required: Account and Walltime

Remember: By default you must define the project (account) you’re using and a walltime.

[salexan5@mblog2 ~]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[salexan5@mblog2 ~]$ salloc -A arcc
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[salexan5@mblog2 ~]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job
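
The two requirements above can be checked before calling salloc. The following is a minimal sketch, not an ARCC-provided tool: missing_required_opts is a hypothetical helper that only inspects the arguments you pass and never contacts Slurm.

```shell
# Hypothetical helper (an assumption for illustration, not part of Slurm):
# print which of the two required options are missing from an argument list.
missing_required_opts() {
    local have_account=0 have_time=0 arg
    for arg in "$@"; do
        case "$arg" in
            -A*|--account|--account=*) have_account=1 ;;  # -A arcc, -Aarcc, --account=arcc
            -t*|--time|--time=*)       have_time=1 ;;     # -t 10:00, -t10:00, --time=10:00
        esac
    done
    local missing=""
    [ "$have_account" -eq 1 ] || missing="account"
    [ "$have_time" -eq 1 ] || missing="${missing:+$missing }walltime"
    echo "$missing"
}
```

For example, missing_required_opts -A arcc prints walltime, while missing_required_opts -A arcc -t 10:00 prints an empty line because nothing is missing.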

Correct Partitions

If you need to explicitly request a partition, the name must be correct:

[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified

Use the sinfo command to list the known partitions and see their current use:

Example: sinfo

[salexan5@mblog2 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      4    mix mbcpu-[008,010-011,025]
mb*             up 7-00:00:00      8  alloc mbcpu-[001-007,009]
mb*             up 7-00:00:00     13   idle mbcpu-[012-024]
mb-a30          up 7-00:00:00      8   idle mba30-[001-008]
mb-l40s         up 7-00:00:00      1    mix mbl40s-001
mb-l40s         up 7-00:00:00      5   idle mbl40s-[002-005,007]
mb-h100         up 7-00:00:00      1 drain$ mbh100-001
mb-h100         up 7-00:00:00      1  down$ mbh100-006
mb-h100         up 7-00:00:00      1  drain mbh100-002
mb-h100         up 7-00:00:00      3    mix mbh100-[003-005]
mb-a6000        up 7-00:00:00      1   idle mba6000-001
inv-arcc        up   infinite      1    mix mbcpu-025
inv-inbre       up   infinite      1   idle mbl40s-007
inv-ssheshap    up   infinite      1   idle mba6000-001
inv-wysbc       up   infinite      1  alloc mbcpu-001
inv-wysbc       up   infinite      1   idle mba30-001
inv-soc         up   infinite      1    mix mbl40s-001
inv-wildiris    up   infinite      5   idle wi[001-005]
non-investor    up 7-00:00:00      1 drain$ mbh100-001
non-investor    up 7-00:00:00      1  down$ mbh100-006
non-investor    up 7-00:00:00      1  drain mbh100-002
non-investor    up 7-00:00:00      6    mix mbcpu-[008,010-011],mbh100-[003-005]
non-investor    up 7-00:00:00      7  alloc mbcpu-[002-007,009]
non-investor    up 7-00:00:00     24   idle mba30-[002-008],mbcpu-[012-024],mbl40s-[002-005]

# Corrected:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job
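
A partition name can also be checked against the sinfo list before submitting. This is a rough sketch under stated assumptions: partition_exists is a hypothetical helper, and it expects one partition name per line on standard input (for example from sinfo -h -o "%R", which prints partition names without a header).

```shell
# Hypothetical helper (not part of Slurm): succeed only if the partition
# named in $1 appears as a whole line in the list read from standard input.
partition_exists() {
    # -F fixed string, -x whole-line match, -q quiet (exit status only)
    grep -Fxq -- "$1"
}

# Intended use on the cluster (not executed here):
#   sinfo -h -o "%R" | partition_exists mb-l40s && echo "partition found"
```

Because the match is on the whole line, mb-l40 is rejected even though mb-l40s exists, which is exactly the typo shown above.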

Timeouts

Timeouts aren’t errors as such; they mean the wall time you requested was not long enough to complete the computation.

The maximum allowed wall time is 7 days:

[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job
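
To see why 7-00:00:01 is rejected while 7-00:00:00 is accepted, it can help to convert the time string to seconds. The sketch below is illustrative only: slurm_time_to_seconds and within_walltime_limit are hypothetical helpers, the 7 day limit is hard-coded from the text above, and the parser assumes the time formats listed in the Slurm documentation (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```shell
# Hypothetical helper: convert a Slurm time string to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        has_days = (index($0, "-") > 0)
        days = 0; rest = $0
        if (has_days) { split($0, p, "-"); days = p[1]; rest = p[2] }
        n = split(rest, f, ":")
        h = 0; m = 0; s = 0
        if (has_days)    { h = f[1]; m = f[2]; s = f[3] }  # days-hours[:minutes[:seconds]]
        else if (n == 3) { h = f[1]; m = f[2]; s = f[3] }  # hours:minutes:seconds
        else if (n == 2) { m = f[1]; s = f[2] }            # minutes:seconds
        else             { m = f[1] }                      # minutes
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

# Hypothetical helper: succeed if the request is at most 7 days (604800 seconds).
within_walltime_limit() {
    [ "$(slurm_time_to_seconds "$1")" -le 604800 ]
}
```

With these definitions, within_walltime_limit 7-00:00:00 succeeds and within_walltime_limit 7-00:00:01 fails, matching the salloc behaviour shown above.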

Do not request 7 days just because you can!

Slurm considers wall time when scheduling your job. During busy periods, a job with a shorter wall time is more likely to be backfilled (slotted into gaps on the cluster) than pending jobs with longer wall times.

My Jobs Need to Run Longer than 7 Days

ARCC can provide users with wall times longer than 7 days.

Please contact us, but we require that you demonstrate that your job cannot be optimized further. For example:

Can it run faster by using more cores, or even multiple nodes?
Can it utilize GPUs?
Can the job actually be divided up into sections that can be run concurrently across multiple jobs?

ARCC can provide assistance with trying to understand if a job can be optimized.

Requested node configuration is not available

This error occurs when you request a configuration that isn’t available, or when the request needs more detail. For example:

Too many cores on a node:

[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available

Must define a GPU enabled partition:

[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job
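
To find out which partitions can satisfy a GPU request, the sinfo output can be filtered on its generic resources (GRES) column. This is a sketch only: gpu_partitions is a hypothetical helper, and it assumes input lines of the form "partition gres" such as those printed by sinfo -h -o "%R %G", where partitions without GPUs show a GRES value that does not contain the text gpu.

```shell
# Hypothetical helper (not part of Slurm): read "<partition> <gres>" lines
# from standard input and print each partition whose GRES mentions a GPU.
gpu_partitions() {
    awk '$2 ~ /gpu/ { print $1 }' | sort -u
}

# Intended use on the cluster (not executed here):
#   sinfo -h -o "%R %G" | gpu_partitions
```

Any partition name it prints can then be passed to --partition alongside --gres=gpu:1.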

OUT-OF-MEMORY: Segmentation Fault

Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job.

Basically, your job needs more memory than it requested.

Resolved: Request more memory using either the --mem or --mem-per-cpu option.
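
The two memory options scale differently: --mem sets the memory per node, while --mem-per-cpu sets the memory for each allocated CPU, so the total grows with the number of cores. The sketch below only does that arithmetic. total_mem_gb is a hypothetical helper, and the sizes in the comments are illustrative values, not ARCC recommendations.

```shell
# Hypothetical helper: total memory in GB implied by a --mem-per-cpu request
# for a single task. $1 = CPUs requested (the -c value), $2 = GB per CPU.
total_mem_gb() {
    echo $(( $1 * $2 ))
}

# Illustrative requests (not executed here); for a single task on one node
# both ask for 8 GB in total:
#   salloc -A arcc -t 10:00 -c 4 --mem=8G
#   salloc -A arcc -t 10:00 -c 4 --mem-per-cpu=2G
```

Here total_mem_gb 4 2 prints 8, the total for the second request.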
