Goal: List some common issues and how to resolve them.
Required: Account and Walltime
Info |
---|
Remember: By default you must define the project (account) you’re using and a walltime. |
Code Block |
---|
[salexan5@mblog2 ~]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified
[salexan5@mblog2 ~]$ salloc -A arcc
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)
[salexan5@mblog2 ~]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified
# The bare minimum:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job |
Correct Partitions
Info |
---|
If you need to explicitly request a partition, the name must be correct: |
Code Block |
---|
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified |
Info |
---|
Use the |
Code Block |
---|
# Corrected:
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job |
Timeouts
Info |
---|
Timeouts aren’t errors as such; they mean that the wall time you requested was not long enough to complete the computation. |
Info |
---|
The maximum allowed wall time is 7 days: |
Code Block |
---|
[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)
[arcc-t01@mblog2 ~]$ salloc -A arccanetrain -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job |
Note |
---|
Do not request 7 days just because you can! Wall time is considered when Slurm tries to allocate your job. A job with a shorter wall time is more likely to be backfilled (slotted onto the cluster) in busy times than pending jobs with longer wall times. |
My Jobs Need to Run Longer than 7 Days
Info |
---|
ARCC can provide users with wall times longer than 7 days. Please contact us, but we require that you can demonstrate that your job cannot be optimized, for example:
ARCC can help you work out whether a job can be optimized. |
Requested node configuration is not available
Info |
---|
This happens when you request a configuration that isn’t available, or one that requires more details. For example: |
Too many cores on a node:
Code Block |
---|
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available |
Must define a GPU enabled partition:
Code Block |
---|
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.
[salexan5@mblog2 ~]$ salloc -A arcc -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job |
OUT-OF-MEMORY: Segmentation Fault
Info |
---|
Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job. Basically, your job needs more memory than it requested. |
...
...
Common Questions
Info |
---|
|
...
Common Questions: Suggestions
Info |
---|
|
...
Common Issues
Info |
---|
|
...
Pending Jobs and Scheduled Cluster Maintenance
Info |
---|
When we have scheduled maintenance on the cluster, an announcement will go out indicating the date/time that it is scheduled to start. All jobs currently running on the cluster are allowed to finish, and we can accept further jobs if Slurm can complete them before the maintenance starts. |
Note |
---|
Be aware: If an announcement goes out on a Monday that maintenance is to start on the following Friday, there is only a four-day window within which any new job must complete. If you submit a job with a wall time of seven days, it cannot and will not be started, since it cannot complete before Friday. Your job will be accepted by Slurm, queued with a status of pending, and automatically started once maintenance is completed and Slurm is restarted. |
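To pick a wall time that still fits before maintenance starts, one option is to convert the remaining seconds into Slurm's `D-HH:MM:SS` format. The sketch below is only an illustration: `seconds_to_slurm_time` is a hypothetical helper, and the maintenance date in the usage comment is an invented example, not a real ARCC schedule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a number of seconds ($1) as a Slurm wall time
# (D-HH:MM:SS), e.g. the time left before a maintenance window opens.
seconds_to_slurm_time() {
    printf '%d-%02d:%02d:%02d\n' \
        $(( $1 / 86400 )) \
        $(( $1 % 86400 / 3600 )) \
        $(( $1 % 3600 / 60 )) \
        $(( $1 % 60 ))
}

# Usage (GNU date assumed; the start time below is an invented example):
#   start=$(date -d "2030-01-11 08:00" +%s)
#   left=$(( start - $(date +%s) - 3600 ))   # keep an hour of safety margin
#   salloc -A <project-name> -t "$(seconds_to_slurm_time "$left")"
```

Requesting slightly less than the remaining window, as in the usage comment, leaves room for scheduling delay before the job actually starts.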
...
Required: Account and Walltime
Info |
---|
Remember: By default you must define the project (account) you’re using and a walltime. |
Code Block |
---|
[]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified
[]$ salloc -A <project-name>
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)
[]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified
# The bare minimum:
[]$ salloc -A <project-name> -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job |
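If you wrap `salloc` or `sbatch` in your own scripts, a quick pre-flight check can catch a missing account or walltime before Slurm rejects the request. The following is a minimal sketch, not an ARCC-provided tool: `missing_required` is a hypothetical helper that only looks for the `-A`/`--account` and `-t`/`--time` options named in the error messages above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print which required options are missing from a
# salloc/sbatch argument list ("account", "time", both, or an empty line).
missing_required() (
    have_account=0
    have_time=0
    for arg in "$@"; do
        case "$arg" in
            -A*|--account|--account=*) have_account=1 ;;
            -t*|--time|--time=*)       have_time=1 ;;
        esac
    done
    missing=""
    [ "$have_account" -eq 1 ] || missing="$missing account"
    [ "$have_time" -eq 1 ]    || missing="$missing time"
    printf '%s\n' "${missing# }"
)

# Usage:
#   missing_required -A <project-name>     # prints: time
```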
...
Correct Partitions
Info |
---|
If you need to explicitly request a partition, the name must be correct: |
Code Block |
---|
[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified |
Info |
---|
Use the |
Code Block |
---|
# Corrected:
[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job |
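One way to avoid a mistyped partition is to compare the name against what the cluster reports. The sketch below assumes `sinfo` is available on the login node; `partition_exists` is a hypothetical helper, and the `sinfo` options shown (`-h` to drop the header, `-o "%R"` for the bare partition name) are standard Slurm options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the partition name given as $1 appears,
# as an exact whole-line match, in the list of names read from stdin.
partition_exists() {
    grep -Fxq -- "$1"
}

# Usage on a cluster:
#   sinfo -h -o "%R" | sort -u | partition_exists mb-l40s && echo "ok"
```

The exact-match option matters here: `mb-l40` must not be accepted just because `mb-l40s` exists.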
...
Timeouts
Info |
---|
Timeouts aren’t errors as such; they mean that the wall time you requested was not long enough to complete the computation. |
Info |
---|
The maximum allowed wall time is 7 days: |
Code Block |
---|
[]$ salloc -A <project-name> -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)
[]$ salloc -A <project-name> -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job |
Note |
---|
Do not request 7 days just because you can! Wall time is considered when Slurm tries to allocate your job. A job with a shorter wall time is more likely to be backfilled (slotted onto the cluster) in busy times than pending jobs with longer wall times. |
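The 7-day limit can also be checked before submitting. The sketch below is an illustration rather than an ARCC tool: `slurm_time_to_seconds` and `within_max_walltime` are hypothetical helper names, and the parser assumes the time formats listed in the Slurm documentation (M, M:S, H:M:S, D-H, D-H:M, D-H:M:S).

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a Slurm time string ($1) to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; h = 0; m = 0; s = 0
        if (index($0, "-") > 0) {
            # D-H[:M[:S]]: the part after the dash starts with hours.
            split($0, parts, "-"); days = parts[1]
            split(parts[2], f, ":")
            h = f[1]; m = f[2]; s = f[3]
        } else {
            # Without a dash, a single number means minutes.
            n = split($0, f, ":")
            if (n == 3)      { h = f[1]; m = f[2]; s = f[3] }
            else if (n == 2) { m = f[1]; s = f[2] }
            else             { m = f[1] }
        }
        printf "%d\n", days * 86400 + h * 3600 + m * 60 + s
    }'
}

# The 7-day maximum stated above, in seconds.
MAX_WALLTIME_SECONDS=$(( 7 * 86400 ))

# Succeed if the requested wall time does not exceed the maximum.
within_max_walltime() {
    [ "$(slurm_time_to_seconds "$1")" -le "$MAX_WALLTIME_SECONDS" ]
}

# Usage:
#   within_max_walltime 7-00:00:01 || echo "exceeds the 7-day limit"
```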
...
My Jobs Need to Run Longer than 7 Days
Info |
---|
ARCC can provide users with wall times longer than 7 days. Please contact us, but we require that you can demonstrate that your job cannot be optimized, for example:
ARCC can help you work out whether a job can be optimized. |
...
Requested node configuration is not available
Info |
---|
This happens when you request a configuration that isn’t available, or one that requires more details. For example: |
Too many cores on a node:
Code Block |
---|
[]$ salloc -A <project-name> -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available |
Must define a GPU enabled partition:
Code Block |
---|
[]$ salloc -A <project-name> -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.
[]$ salloc -A <project-name> -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job |
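To see what a request can actually be matched against, you can list what each node offers. The sketch below assumes the node-oriented `sinfo` listing with the standard format fields `%R` (partition), `%c` (CPUs per node) and `%G` (generic resources); `max_cpus_per_node` and `gpu_partitions` are hypothetical helpers that only post-process that output.

```shell
#!/usr/bin/env bash
# Both helpers read "partition cpus gres" lines on stdin, i.e. the output of:
#   sinfo -N -h -o "%R %c %G"

# Print the largest per-node CPU count seen (an upper bound for -c).
max_cpus_per_node() {
    awk '$2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# Print the partitions whose nodes advertise a GPU generic resource.
gpu_partitions() {
    awk '$3 ~ /gpu/ { print $1 }' | sort -u
}

# Usage on a cluster:
#   sinfo -N -h -o "%R %c %G" | max_cpus_per_node
#   sinfo -N -h -o "%R %c %G" | gpu_partitions
```

A `-c` value above the first number, or a `--gres=gpu:...` request without one of the listed partitions, would be expected to fail as in the examples above.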
...
OUT-OF-MEMORY: Segmentation Fault
Note |
---|
Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job. Basically, your job needs more memory than it requested. |
Info |
---|
Resolved: Request more memory using either the |
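As a rough guide to how much memory to ask for, you can look at what a previous run actually used. The sketch below is an assumption-laden illustration: it relies on Slurm's standard `--mem` option and on `sacct` reporting `MaxRSS` as a whole number with a `K`, `M` or `G` suffix; `suggest_mem` is a hypothetical helper, and the 20% headroom is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a MaxRSS value as printed by sacct (e.g.
# "3500000K") into a --mem value in megabytes with 20% headroom, rounded up.
# Assumes a whole number followed by K, M or G.
suggest_mem() (
    case "$1" in
        *K) kib=${1%K} ;;
        *M) kib=$(( ${1%M} * 1024 )) ;;
        *G) kib=$(( ${1%G} * 1024 * 1024 )) ;;
        *)  echo "unsupported value: $1" >&2; exit 1 ;;
    esac
    echo "$(( (kib * 120 / 100 + 1023) / 1024 ))M"
)

# Usage on a cluster (the job id is a placeholder):
#   sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS
#   salloc -A <project-name> -t 10:00 --mem="$(suggest_mem 3500000K)"
```

Note that a job killed for running out of memory may report a `MaxRSS` close to its old limit, so the suggestion is a starting point rather than a guarantee.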
...
My Job Stopped and Re-Started: Preemption
Info |
---|
|
...
Why Is My Job One-of-Many on a Compute Node?
Info | ||
---|---|---|
When I run
As discussed in the Intro to HPC workshop when talking about Compute Nodes, this is perfectly acceptable and is one of the tasks that Slurm manages. Remember: All jobs are independent and do not affect anyone else. |
...
Can I Submit Jobs from within a Script?
Info |
---|
General Use Case: You have a job running on the cluster, from which you’d like to submit further jobs.
General Answer: Yes. If you have the scripting ability, then you can write code that creates a submission script, and then calls the
This submission will be treated as any other submission and be added to the queue, and depending on the current cluster utilization might be pending before it starts running. This can also be performed from scripts that are already running as part of a job.
Is this a good idea? Again yes. There are existing applications that do exactly this, and with some extra Slurm understanding, you can have jobs that are dependent on other jobs i.e. job B won’t start until job A is completed, basically breaking a pipeline down into a sequence of jobs. |
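The "job B won’t start until job A is completed" pattern can be sketched with Slurm's standard `sbatch --parsable` and `--dependency=afterok:<jobid>` options. This is a minimal illustration, not a recommended ARCC workflow: `submit_chain` is a hypothetical helper, `SUBMIT_CMD` is an assumed override hook for dry runs, and the script names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit the given scripts as a chain, where each job
# starts only after the previous one has finished successfully (afterok).
submit_chain() (
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            out=$("${SUBMIT_CMD:-sbatch}" --parsable "$script") || exit 1
        else
            out=$("${SUBMIT_CMD:-sbatch}" --parsable \
                --dependency="afterok:$prev" "$script") || exit 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the id.
        prev=${out%%;*}
        echo "$script submitted as job $prev"
    done
)

# Usage (placeholder script names):
#   submit_chain step_a.sh step_b.sh step_c.sh
```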
Note |
---|
Note: There is a maximum number of jobs that Slurm can accommodate within the queue (currently set at 75K). Do not try submitting more than this in one batch, and throttle your submissions, e.g. submit, say, 10 every second. If you try submitting thousands in a single call, you can adversely affect Slurm. Be a good cluster citizen. |
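The throttling suggested in the note could look something like the sketch below. It is an illustration only: `submit_throttled` is a hypothetical helper, `SUBMIT_CMD` is an assumed override hook (defaulting to `sbatch`) so the loop can be dry-run with `echo`, and the batch size and pause are whatever you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit scripts in batches, pausing between batches.
#   $1 = batch size, $2 = pause in seconds, remaining args = scripts.
submit_throttled() (
    batch_size=$1
    pause=$2
    shift 2
    count=0
    for script in "$@"; do
        "${SUBMIT_CMD:-sbatch}" "$script"
        count=$(( count + 1 ))
        # After every full batch, wait before submitting the next one.
        if [ $(( count % batch_size )) -eq 0 ]; then
            sleep "$pause"
        fi
    done
)

# Usage: submit 10 scripts, then pause one second, and so on.
#   submit_throttled 10 1 job_*.sh
```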
Possible Alternative: Use a Slurm job array, which lets you submit a single submission script that is run once per array element (current max size is 10K). For example, if you request an array of size 100, Slurm will automatically submit 100 jobs, each with a copy of the original submission script. With a little scripting you can have each copy use different input values/data while all perform the same defined workflow.
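A job array submission script along these lines might look like the sketch below. It uses Slurm's standard `--array` option and the `SLURM_ARRAY_TASK_ID` environment variable; `input_for_task` is a hypothetical helper and `inputs.txt` is an assumed file with one input per line.

```shell
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=00:10:00
#SBATCH --array=1-100

# Hypothetical helper: print line number $2 of the file $1.
input_for_task() {
    sed -n "${2}p" "$1"
}

# Inside an array job Slurm sets SLURM_ARRAY_TASK_ID (1..100 here), so each
# copy of the script picks a different line of inputs.txt.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    input=$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")
    echo "Task ${SLURM_ARRAY_TASK_ID} processing: ${input}"
fi
```

Every array element runs the same workflow; only the line selected from the input file changes.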
...
Closing a Linux Session while Running an salloc
Note |
---|
If you have a Linux session running in a terminal, in which you have an From the command-line you can not go back into it, you will have to start a new interactive session. |
...
...