Goal: List some common issues and how to resolve them.
Required: Account and Walltime
Remember: By default you must define the project (account) you’re using and a walltime.
[salexan5@mblog2 ~]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[salexan5@mblog2 ~]$ salloc -A arcc
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[salexan5@mblog2 ~]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job
Correct Partitions
If you need to explicitly request a partition, the name must be correct:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified
Use the sinfo command to get a list of known partitions, as well as details of their current use:
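For example (a sketch using standard Slurm options; the output columns shown are Slurm's defaults):

# List all partitions along with their availability, time limit, node counts and state:
[salexan5@mblog2 ~]$ sinfo
# Default columns: PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST

# Or show a condensed one-line summary per partition:
[salexan5@mblog2 ~]$ sinfo -s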
# Corrected:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job
Timeouts
Timeouts aren’t errors as such; they simply mean the time you requested was not long enough to complete the computation.
The maximum allowed wall time is 7 days:
[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job
Do not request 7 days just because you can!
Wall time is taken into account when Slurm tries to allocate your job. During busy times, a job with a shorter wall time is more likely to be backfilled (slotted onto the cluster) ahead of pending jobs with longer wall times.
My Jobs Need to Run Longer than 7 Days
ARCC can provide users with wall times longer than 7 days.
Please contact us, but we require that you can demonstrate that your job cannot be optimized, for example:
Can it run faster by using more cores, or even multiple nodes?
Can it utilize GPUs?
Can the job actually be divided up into sections that can be run concurrently across multiple jobs?
ARCC can provide assistance in working out whether a job can be optimized.
Requested node configuration is not available
This error occurs because you’re trying to request a configuration that isn’t available, or one that requires more details. For example:
Too many cores on a node:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available
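One way to check how many cores (and how much memory) the nodes in each partition actually provide is sinfo’s output formatting. This is a sketch using standard Slurm format specifiers:

# Show CPUs and memory per node for each partition:
[salexan5@mblog2 ~]$ sinfo -o "%P %c %m"
# %P = partition name, %c = CPUs per node, %m = memory per node (MB)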
You must specify a GPU-enabled partition:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job
OUT-OF-MEMORY: Segmentation Fault
Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job.
Basically, your job has used more memory than it requested.
Resolution: request more memory using either the --mem or --mem-per-cpu option.
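A minimal sketch (the 16G and 4G values are placeholders; size them to your job):

# Interactively, request 16 GB for the whole job:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --mem=16G

# Or, in a batch script, request memory per allocated core:
#SBATCH --mem-per-cpu=4G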
My Job Stopped and Re-Started: Preemption
As discussed in the Intro to HPC workshop, we have a Condominium Model: if your job is running on a compute node that is part of another project’s hardware investment, your job can be preempted.
Your job will be stopped and automatically re-queued; when resources become available on the cluster, it will be restarted.
Further details can be found on our Slurm and Preemption page, including how to use the non-investor partition to prevent this from happening.
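For example (a sketch; substitute your own project and walltime):

# Submit to the non-investor partition so the job will not be preempted:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=non-investor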
Why is my Job one of Many on a Compute Node?
When I run pestat, it appears that my job is one of many on a particular compute node.
As discussed in the Intro to HPC workshop when talking about Compute Nodes, nodes are shared resources: unless you explicitly request exclusive access, your job only reserves the cores and memory it asked for, and other jobs can run alongside it on the same node.
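To see which other jobs are sharing a node with yours, you can also ask squeue about that node (a sketch; mbcpu-025 is just an example node name):

# List all jobs currently allocated to a particular compute node:
[salexan5@mblog2 ~]$ squeue --nodelist=mbcpu-025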