Common Issues and How to Resolve

...

Required: Account and Walltime

Info

Remember: By default you must define the project (account) you’re using and a walltime.

Code Block
[salexan5@mblog2 ~]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[salexan5@mblog2 ~]$ salloc -A arcc
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[salexan5@mblog2 ~]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job
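
The same two options can also be set inside a batch script. The following is a minimal sketch, assuming the account and walltime requirement applies to sbatch submissions in the same way as to salloc. The account arcc and the 10 minute walltime are taken from the example above; report_job is a hypothetical helper added only for illustration.

```shell
#!/bin/bash
#SBATCH --account=arcc   # long form of -A arcc; replace with your own project
#SBATCH --time=10:00     # long form of -t 10:00 (10 minutes)

# Hypothetical helper: print the job id and account that Slurm exports
# into the job environment. Falls back to placeholders outside of Slurm.
report_job() {
    echo "job=${SLURM_JOB_ID:-not-in-slurm} account=${SLURM_JOB_ACCOUNT:-unknown}"
}

report_job
```

Saved as, say, minimal_job.sh (an assumed file name), it would be submitted with sbatch minimal_job.sh.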

...

Correct Partitions

Info

...

Walltime and TIMEOUT:

...

If you need to explicitly request a partition, the name must be correct:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified
Info

Use the sinfo command to get a list of known partitions, along with details of their current use:

Expand
titleExample: sinfo
Code Block
[salexan5@mblog2 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      4    mix mbcpu-[008,010-011,025]
mb*             up 7-00:00:00      8  alloc mbcpu-[001-007,009]
mb*             up 7-00:00:00     13   idle mbcpu-[012-024]
mb-a30          up 7-00:00:00      8   idle mba30-[001-008]
mb-l40s         up 7-00:00:00      1    mix mbl40s-001
mb-l40s         up 7-00:00:00      5   idle mbl40s-[002-005,007]
mb-h100         up 7-00:00:00      1 drain$ mbh100-001
mb-h100         up 7-00:00:00      1  down$ mbh100-006
mb-h100         up 7-00:00:00      1  drain mbh100-002
mb-h100         up 7-00:00:00      3    mix mbh100-[003-005]
mb-a6000        up 7-00:00:00      1   idle mba6000-001
inv-arcc        up   infinite      1    mix mbcpu-025
inv-inbre       up   infinite      1   idle mbl40s-007
inv-ssheshap    up   infinite      1   idle mba6000-001
inv-wysbc       up   infinite      1  alloc mbcpu-001
inv-wysbc       up   infinite      1   idle mba30-001
inv-soc         up   infinite      1    mix mbl40s-001
inv-wildiris    up   infinite      5   idle wi[001-005]
non-investor    up 7-00:00:00      1 drain$ mbh100-001
non-investor    up 7-00:00:00      1  down$ mbh100-006
non-investor    up 7-00:00:00      1  drain mbh100-002
non-investor    up 7-00:00:00      6    mix mbcpu-[008,010-011],mbh100-[003-005]
non-investor    up 7-00:00:00      7  alloc mbcpu-[002-007,009]
non-investor    up 7-00:00:00     24   idle mba30-[002-008],mbcpu-[012-024],mbl40s-[002-005]
Code Block
# Corrected:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job
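
A partition name can be checked before submitting. The sketch below is one possible approach rather than an ARCC-provided tool: partition_exists is a hypothetical helper that reads a list of partition names on standard input, such as the output of sinfo -h -o %P, where the default partition carries a trailing asterisk.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the partition named in $1 appears in the
# list of partition names read from stdin (one name per line).
partition_exists() {
    local wanted="$1" name
    while IFS= read -r name; do
        name="${name%\*}"   # drop the trailing * that marks the default partition
        if [ "$name" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# On the cluster this would be used as:
#   sinfo -h -o %P | partition_exists mb-l40s && echo "partition found"
```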

...

Timeouts

Info

Timeouts aren’t errors as such; they mean that the time you requested was not long enough to complete the computation.

Info

The maximum allowed wall time is 7 days:

Code Block
[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job
Note

Do not request 7 days just because you can!

Wall time is considered when Slurm tries to allocate your job. In busy times, a job with a shorter wall time is more likely to be backfilled (slotted onto the cluster) than pending jobs with longer wall times.
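
To see how a requested wall time compares with the 7 day limit, the time string can be converted to seconds. This is a rough sketch that handles the time formats documented for Slurm (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds). The function names are hypothetical and the limit is hard-coded from the text above rather than queried from the cluster.

```shell
#!/bin/bash
MAX_WALLTIME_SECONDS=$(( 7 * 86400 ))   # 7 days, as stated above

# Convert a Slurm time specification to seconds.
slurm_time_to_seconds() {
    local spec="$1" days=0 rest h=0 m=0 s=0
    local -a parts
    if [[ "$spec" == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days="${spec%%-*}"
        rest="${spec#*-}"
        IFS=: read -r h m s <<< "$rest"
    else
        IFS=: read -r -a parts <<< "$spec"
        case "${#parts[@]}" in
            1) m="${parts[0]}" ;;                                    # minutes
            2) m="${parts[0]}"; s="${parts[1]}" ;;                   # minutes:seconds
            3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;;  # hours:minutes:seconds
            *) return 1 ;;
        esac
    fi
    # 10# forces base 10 so that values such as 08 are not read as octal.
    echo $(( 10#$days * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

# Succeed if the given time specification does not exceed the limit.
walltime_within_limit() {
    local secs
    secs="$(slurm_time_to_seconds "$1")" || return 2
    [ "$secs" -le "$MAX_WALLTIME_SECONDS" ]
}
```

With these definitions, walltime_within_limit 7-00:00:00 succeeds and walltime_within_limit 7-00:00:01 fails, matching the two salloc attempts shown above.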

...

My Jobs Need to Run Longer than 7 Days

Info

ARCC can provide users with wall times longer than 7 days.

However, we require you to demonstrate that your job cannot be optimized. For example:

  • Can it run faster by using more cores, or even multiple nodes?

  • Can it utilize GPUs?

  • Can the job actually be divided up into sections that can be run concurrently across multiple jobs?

ARCC can provide assistance with trying to understand if a job can be optimized.
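
One common way to divide a job into concurrent sections is a Slurm job array, where each array task handles its own slice of the work. The script below is only a sketch under assumptions: the item count (10), the number of tasks (4), and the one hour walltime are made-up values, and chunk_range is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
#SBATCH --account=arcc     # project used in the examples above; use your own
#SBATCH --time=01:00:00    # assumed per-task walltime
#SBATCH --array=0-3        # four independent array tasks (assumed count)

# Print the first and last item index (inclusive) handled by task `id`
# when `total` items are split as evenly as possible across `tasks` tasks.
chunk_range() {
    local total="$1" tasks="$2" id="$3"
    local base=$(( total / tasks )) extra=$(( total % tasks ))
    local start=$(( id * base + (id < extra ? id : extra) ))
    local size=$(( base + (id < extra ? 1 : 0) ))
    echo "$start $(( start + size - 1 ))"
}

# SLURM_ARRAY_TASK_ID is set by Slurm inside each array task; default to 0
# so the script can also be dry-run outside of Slurm.
range="$(chunk_range 10 4 "${SLURM_ARRAY_TASK_ID:-0}")"
echo "This task would process items: ${range}"
```

Each array task is scheduled as its own job, so several short tasks may start sooner than one long job.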

...

Requested node configuration is not available

Info

This error occurs when you request a configuration that isn’t available, or when the request needs more detail. For example:

Too many cores on a node:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available

GPU requests must specify a GPU-enabled partition:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job
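
Before requesting cores or GPUs, it can help to check what the nodes actually offer. The sketch below assumes the standard sinfo format fields %N (node name), %c (CPUs per node), %P (partition) and %G (generic resources). Both functions are hypothetical helpers that read sinfo output on standard input.

```shell
#!/bin/bash
# Reads lines of "<node> <cpus>" on stdin, e.g. from: sinfo -h -N -o "%N %c"
# Prints the largest CPU count found on any single node.
max_cpus_per_node() {
    awk '$2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# Reads lines of "<partition> <gres>" on stdin, e.g. from: sinfo -h -o "%P %G"
# Prints partitions whose GRES column mentions a GPU. A trailing "*"
# (the default-partition marker) is stripped from the name.
gpu_partitions() {
    awk '$2 ~ /gpu/ { sub(/\*$/, "", $1); print $1 }' | sort -u
}

# On the cluster:
#   sinfo -h -N -o "%N %c" | max_cpus_per_node
#   sinfo -h -o "%P %G"    | gpu_partitions
```

A -c value larger than the number printed by max_cpus_per_node cannot be satisfied by any single node.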

...

OUT-OF-MEMORY: Segmentation Fault

Info

Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job.

Basically, your job tried to use more memory than it requested.

Info

Resolved: Request more memory using either the --mem or the --mem-per-cpu option.
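
To confirm that memory was the problem, the job's accounting record can be inspected after it ends. This is a sketch, assuming job accounting is enabled and that Slurm recorded the failure with the OUT_OF_MEMORY state (a segmentation fault may be recorded differently). oom_steps is a hypothetical helper, and the memory sizes in the comments are assumed values.

```shell
#!/bin/bash
# Reads "JobID|State" lines on stdin, e.g. from:
#   sacct -n -P -j <jobid> --format=JobID,State
# and prints the job steps that Slurm recorded as OUT_OF_MEMORY.
oom_steps() {
    awk -F'|' '$2 == "OUT_OF_MEMORY" { print $1 }'
}

# If a step is listed, resubmit with a larger request, for example
# (8G and 2G are assumed values; pick ones that fit your application):
#   salloc -A arcc -t 10:00 --mem=8G
#   salloc -A arcc -t 10:00 -c 4 --mem-per-cpu=2G
```

Adding ReqMem and MaxRSS to the sacct --format list shows the requested memory next to the peak usage that was measured.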

...

...