Goal: List some common issues and how to resolve them.
Common Questions
How do I know how many nodes and cores, how much memory, etc. to request for my jobs?
How do I find out whether a cluster/partition supports these resources?
How do I find out whether these resources are available on the cluster?
How long will I have to wait in the queue before my job starts? How busy is the cluster?
How do I monitor the progress of my job?
Common Questions: Suggestions
How do I know how many nodes and cores, how much memory, etc. to request for my jobs?
Understand your software and application.
Read the docs – look at the help for commands/options.
Can it run multiple threads, using multiple cores (OpenMP) or multiple nodes (MPI)?
Can it use a GPU (NVIDIA CUDA)?
Are there suggestions on data and memory requirements?
How do I find out whether a cluster/partition supports these resources?
How do I find out whether these resources are available on the cluster?
Consult the wiki: Medicine Hardware Summary Table
How long will I have to wait in the queue before my job starts?
How busy is the cluster?
Current cluster utilization: the commands sinfo, arccjobs, and pestat, and OnDemand’s MedicineBow System Status page.
How do I monitor the progress of my job?
Slurm commands:
squeue
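As a minimal sketch of the monitoring step, the function below lists your own jobs and then asks Slurm for estimated start times of the pending ones. It assumes the standard Slurm client tools are on your PATH. The name `show_my_jobs` is hypothetical (it is not an ARCC-provided command), and the guard only keeps the snippet harmless on a machine without Slurm.

```shell
# show_my_jobs: hypothetical helper, not an ARCC-provided command.
show_my_jobs() {
    user="${USER:-$(id -un)}"
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found: run this on a cluster login node"
        return 0
    fi
    # All of your jobs, pending and running.
    squeue -u "$user"
    # Estimated start times for your pending jobs.
    # These are estimates only and change as the queue changes.
    squeue -u "$user" --start
}

show_my_jobs
```

The start times reported by `--start` are Slurm's current best guess, so treat them as a rough indication rather than a promise.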
Common Issues
Not defining the account and time options.
The account is the name of the project you are associated with. It is not your username.
Requesting combinations of resources that can not be satisfied: Medicine Hardware Summary Table
For example, you can not request 40 cores on a compute node with a max of 32.
Requesting too much memory, or too many GPU devices with respect to a partition.
My job is pending? Why?
Because the resources are currently not available.
Be mindful of specialized nodes (such as our huge memory nodes with 4 TB of RAM): we might only have a few of them.
Have you unnecessarily defined a specific partition (restricted yourself) that is busy?
We only have a small number of GPUs.
This is a shared resource - sometimes you just have to be patient…
Check current cluster utilization.
Whatever resources you are asking for are currently not available. Slurm will start your job when they become available.
We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.
Preemption: Users of an investment get priority on their hardware.
We have the non-investor partition.
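To see why a particular job is still pending, Slurm can report a reason alongside the job state. The sketch below assumes the standard `squeue` client is available. The helper name `why_pending` is hypothetical, and the guard only keeps the snippet harmless away from the cluster.

```shell
# why_pending: hypothetical helper that prints the reason for each of your pending jobs.
why_pending() {
    user="${USER:-$(id -un)}"
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found: run this on a cluster login node"
        return 0
    fi
    # %i = job ID, %P = partition, %T = job state, %r = reason the job is in that state
    squeue -u "$user" -t PENDING -o "%i %P %T %r"
}

why_pending
```

Reasons such as `Resources` or `Priority` simply mean the job is waiting its turn for hardware that is currently in use.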
Pending Jobs and Scheduled Cluster Maintenance
When we have scheduled maintenance on the cluster, an announcement will go out indicating the date and time at which it is scheduled to start.
All jobs currently running on the cluster are allowed to finish, and we can accept further jobs if Slurm can complete them before the maintenance starts.
Be aware: if an announcement goes out on a Monday that maintenance is to start on the following Friday, any new job has only a four-day window in which to complete.
If you submit a job with a wall time of seven days, it cannot and will not be started, since it cannot complete before Friday.
Your job will be accepted by Slurm, queued with a status of pending, and automatically started once maintenance is completed and Slurm is restarted.
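The Monday-to-Friday example can be sketched as simple arithmetic. This is only an illustration of the reasoning, not how Slurm is implemented: the function name `fits_before_maintenance` and the hour values are assumptions chosen to match the example, and in practice the window shrinks as the maintenance date approaches.

```shell
# fits_before_maintenance HOURS_UNTIL_MAINTENANCE WALLTIME_HOURS
# Succeeds (exit status 0) if a job with that wall time could finish before maintenance starts.
fits_before_maintenance() {
    [ "$2" -le "$1" ]
}

window=$((4 * 24))    # Monday announcement, Friday maintenance: 96 hours

# Compare a seven-day request and a two-day request against the window.
for walltime in $((7 * 24)) $((2 * 24)); do
    if fits_before_maintenance "$window" "$walltime"; then
        echo "${walltime}h job: can start before maintenance"
    else
        echo "${walltime}h job: stays pending until maintenance is over"
    fi
done
```

Requesting a wall time that fits inside the remaining window is what allows a job to start before the maintenance, provided the resources it needs are free.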
Required: Account and Walltime
Remember: By default you must define the project (account) you’re using and a walltime.
[]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[]$ salloc -A <project-name>
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[]$ salloc -A <project-name> -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job
Correct Partitions
If you need to explicitly request a partition, the name must be correct:
[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified
Use the sinfo command to get a list of known partitions, along with details of their current use:
# Corrected:
[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job
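A minimal sketch of looking up partition names before submitting, assuming the standard `sinfo` client is available. The helper name `list_partitions` is hypothetical, and the guard only keeps the snippet harmless on a machine without Slurm.

```shell
# list_partitions: hypothetical helper that shows partition names and what each offers.
list_partitions() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found: run this on a cluster login node"
        return 0
    fi
    # One summary line per partition: availability, time limit and node counts.
    sinfo -s
    # %P = partition, %c = CPUs per node, %m = memory per node (MB),
    # %G = generic resources such as GPUs
    sinfo -o "%P %c %m %G"
}

list_partitions
```

Copying the partition name straight from this output avoids typos such as `mb-l40` for `mb-l40s`, and the CPU, memory and GPU columns show what a node in each partition can actually provide.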
Timeouts
Timeouts aren’t errors as such; they mean that the time you requested was not long enough to complete the computation.
The maximum allowed wall time is 7 days:
[]$ salloc -A <project-name> -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[]$ salloc -A <project-name> -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job
Do not request 7 days just because you can!
Wall time is considered when Slurm tries to allocate your job.
A job with a shorter wall time is more likely to be backfilled (slotted onto the cluster) in busy times than pending jobs with longer wall times.
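One way to choose a realistic wall time is to compare how long an earlier, similar job actually ran with what it requested. The sketch below assumes job accounting is enabled on the cluster and that `sacct` is available. The helper name `job_time_report` is hypothetical, and the job ID is one you supply.

```shell
# job_time_report JOBID: hypothetical helper comparing time used with time requested.
job_time_report() {
    jobid="${1:-}"
    if [ -z "$jobid" ]; then
        echo "usage: job_time_report <jobid>" >&2
        return 2
    fi
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found: run this on a cluster login node"
        return 0
    fi
    # Elapsed = time the job actually ran, Timelimit = wall time that was requested.
    sacct -j "$jobid" --format=JobID,JobName,Elapsed,Timelimit,State
}
```

If Elapsed is consistently a small fraction of Timelimit, a shorter request with a modest safety margin gives Slurm more opportunities to backfill the job.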
My Jobs Need to Run Longer than 7 Days
ARCC can provide users with wall times longer than 7 days.
Please contact us, but we require you to demonstrate that your job cannot be optimized, for example:
Can it run faster by using more cores, or even multiple nodes?
Can it utilize GPUs?
Can the job actually be divided up into sections that can be run concurrently across multiple jobs?
ARCC can provide assistance with trying to understand if a job can be optimized.
Requested node configuration is not available
This error occurs when you request a configuration that isn’t available, or one that requires more detail. For example:
Too many cores on a node:
[]$ salloc -A <project-name> -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available
Must define a GPU enabled partition:
[]$ salloc -A <project-name> -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[]$ salloc -A <project-name> -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job
OUT-OF-MEMORY: Segmentation Fault
Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job.
Basically, your job needs more memory than it requested.
Resolution: Request more memory using either the --mem or the --mem-per-cpu option.
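The two memory options differ in what they count: `--mem` is the memory per node, while `--mem-per-cpu` is multiplied by the number of allocated cores. The sketch below works through that arithmetic for a single task on one node. The core count and memory values are illustrative assumptions, the helper `total_mem_gb` is hypothetical, and the `salloc` lines are printed rather than executed.

```shell
# total_mem_gb CPUS MEM_PER_CPU_GB: memory implied by --mem-per-cpu for one task on one node.
total_mem_gb() {
    echo $(( $1 * $2 ))
}

cpus=8              # illustrative value
mem_per_cpu_gb=4    # illustrative value
total=$(total_mem_gb "$cpus" "$mem_per_cpu_gb")

# Two ways to ask for the same amount of memory (printed, not executed):
echo "salloc -A <project-name> -t 10:00 -c $cpus --mem-per-cpu=${mem_per_cpu_gb}G"
echo "salloc -A <project-name> -t 10:00 -c $cpus --mem=${total}G"
```

Whichever form you use, the request still has to fit within the memory of a node in the partition, otherwise the job fails with the "Requested node configuration is not available" error.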
My Job Stopped and Re-Started: Preemption
As discussed in the Intro to HPC workshop, we have a Condominium Model: if your job is running on a compute node that is part of another project’s hardware investment, your job can be preempted.
Your job will be stopped and automatically re-queued, and when resources become available on the cluster, it will be restarted.
Further details can be found on our Slurm and Preemption page, including how to use the non-investor partition to prevent this from happening.
Why Is My Job One-of-Many on a Compute Node?
When I run pestat, it appears that my job is one of many on a particular compute node.
[]$ pestat -n mbl40s-001
Select only nodes in hostlist=mbl40s-001
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mbl40s-001       mb-l40s      mix  64  96   55.62*   765525   310618  1480439 hbalantr 1440260 vvarenth
As discussed in the Intro to HPC workshop when talking about Compute Nodes, this is perfectly acceptable and is one of the tasks that Slurm manages.
Remember: All jobs are independent and do not affect anyone else.
Use the following link to provide feedback on this training: https://forms.gle/bxhKoVaPns51Qhb99 or use the QR code below.