Goal: List some common issues and how to resolve them.


Common Questions

Info
  • How do I know what number of nodes, cores, memory, etc. to ask for my jobs?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? How busy is the cluster?

  • How do I monitor the progress of my job?

  • My job finished before the wall time - what happens?

...

Common Questions: Suggestions

Info
  • How do I know what number of nodes, cores, memory, etc. to ask for my jobs?

    • Understand your software and application.

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads, using multiple cores (OpenMP) or nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current Cluster utilization: Commands sinfo / arccjobs / pestat and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm commands: squeue

  • My job finished before the wall time - what happens?

    • If your job has completely finished before the wall time you requested (e.g. it finished in two hours and you requested four hours), i.e. its status is no longer running, then Slurm will remove the job from the queue, release any requested resources back to the cluster, and allow other jobs to start running. Your job is not sitting idle on the cluster waiting for the wall time to run down.

...

Common Issues

Info
  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that can not be satisfied: Medicine Hardware Summary Table

    • For example, you can not request 40 cores on a compute node with a max of 32.

    • Requesting too much memory, or too many GPU devices with respect to a partition.

  • My job is pending. Why?

    • Because the resources you are asking for are currently not available. Slurm will start your job when they become available.

      • Be mindful of specialized nodes (such as our huge mem nodes with 4T of RAM): we might only have a few of them.

    • Have you unnecessarily defined a specific partition (restricted yourself) that is busy?

    • We only have a small number of GPUs.

    • Check current cluster utilization.

    • We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.

  • Preemption: Users of an investment get priority on their hardware.

    • We have the non-investor partition.

...

Pending Jobs and Scheduled Cluster Maintenance

Info

When we have scheduled maintenance on the cluster, an announcement will go out indicating the date/time that it is scheduled to start.

All jobs currently running on the cluster are allowed to finish, and we can accept further jobs if Slurm can complete them before the maintenance starts.

Note

Be aware: if an announcement goes out on a Monday that maintenance is to start on the following Friday, there is only a window of four days within which any new job must complete.

If you submit a job with a wall time of seven days, it cannot and will not be started, since it cannot complete before Friday.

Your job will be accepted by Slurm, queued with a status of pending, and automatically started once maintenance is completed and Slurm is restarted.
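
The maintenance window arithmetic above can be checked before submitting. The following is a rough sketch rather than an ARCC tool: `walltime_to_seconds` and `fits_before_maintenance` are hypothetical helper names, and the length of the window is something you would work out yourself from the announcement. It parses the time formats Slurm accepts for `--time` and compares the result against the time remaining. Slurm makes the real decision; this only helps estimate it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Slurm or any ARCC tooling.

# Convert a Slurm time string to seconds. Accepted forms: minutes,
# minutes:seconds, hours:minutes:seconds, days-hours,
# days-hours:minutes and days-hours:minutes:seconds.
walltime_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    local -a parts
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}
        IFS=: read -r -a parts <<< "${spec#*-}"
        # After a day prefix the fields are hours[:minutes[:seconds]].
        h=${parts[0]:-0}; m=${parts[1]:-0}; s=${parts[2]:-0}
    else
        IFS=: read -r -a parts <<< "$spec"
        case ${#parts[@]} in
            1) m=${parts[0]} ;;
            2) m=${parts[0]}; s=${parts[1]} ;;
            3) h=${parts[0]}; m=${parts[1]}; s=${parts[2]} ;;
        esac
    fi
    # The 10# prefix stops values such as "08" being read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed if the wall time ($1) fits inside the window ($2, in seconds).
fits_before_maintenance() {
    local walltime=$1 window_seconds=$2
    (( $(walltime_to_seconds "$walltime") <= window_seconds ))
}
```

With a four-day window (345600 seconds), `fits_before_maintenance 3-00:00:00 345600` succeeds, while `fits_before_maintenance 7-00:00:00 345600` fails, matching the seven-day example above.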

...

Required: Account and Walltime

Info

Remember: By default you must define the project (account) you’re using and a walltime.

Code Block
[]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[]$ salloc -A <project-name>
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[]$ salloc -A <project-name> -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job
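
The same two options apply to batch submissions, where they are usually written as `#SBATCH --account=...` and `#SBATCH --time=...` directives at the top of the script. As an illustration only, the hypothetical helper below (`has_required_options` is not an ARCC or Slurm command) does a rough pre-flight check that a script contains both directives. It does not know about options passed on the `sbatch` command line, so treat it as a sketch.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if a submission script
# sets both an account and a wall time through #SBATCH directives, e.g.
#   #SBATCH --account=<project-name>
#   #SBATCH --time=10:00
has_required_options() {
    local script=$1
    grep -Eq '^#SBATCH[[:space:]]+(-A|--account[=[:space:]])' "$script" &&
        grep -Eq '^#SBATCH[[:space:]]+(-t|--time[=[:space:]])' "$script"
}
```

For example, `has_required_options run.sh || echo "add --account and --time"` would flag a script (here the assumed name `run.sh`) that is missing either directive.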

...

Correct Partitions

Info

If you need to explicitly request a partition, the name must be correct:

Code Block
[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified
Info

Use the sinfo command to get a list of known partitions, along with details of their current use:

Expand
titleExample: sinfo
Code Block
[]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      4    mix mbcpu-[008,010-011,025]
mb*             up 7-00:00:00      8  alloc mbcpu-[001-007,009]
mb*             up 7-00:00:00     13   idle mbcpu-[012-024]
mb-a30          up 7-00:00:00      8   idle mba30-[001-008]
mb-l40s         up 7-00:00:00      1    mix mbl40s-001
mb-l40s         up 7-00:00:00      5   idle mbl40s-[002-005,007]
mb-h100         up 7-00:00:00      1 drain$ mbh100-001
mb-h100         up 7-00:00:00      1  down$ mbh100-006
mb-h100         up 7-00:00:00      1  drain mbh100-002
mb-h100         up 7-00:00:00      3    mix mbh100-[003-005]
mb-a6000        up 7-00:00:00      1   idle mba6000-001
inv-arcc        up   infinite      1    mix mbcpu-025
inv-inbre       up   infinite      1   idle mbl40s-007
inv-ssheshap    up   infinite      1   idle mba6000-001
inv-wysbc       up   infinite      1  alloc mbcpu-001
inv-wysbc       up   infinite      1   idle mba30-001
inv-soc         up   infinite      1    mix mbl40s-001
inv-wildiris    up   infinite      5   idle wi[001-005]
non-investor    up 7-00:00:00      1 drain$ mbh100-001
non-investor    up 7-00:00:00      1  down$ mbh100-006
non-investor    up 7-00:00:00      1  drain mbh100-002
non-investor    up 7-00:00:00      6    mix mbcpu-[008,010-011],mbh100-[003-005]
non-investor    up 7-00:00:00      7  alloc mbcpu-[002-007,009]
non-investor    up 7-00:00:00     24   idle mba30-[002-008],mbcpu-[012-024],mbl40s-[002-005]

Code Block
# Corrected:
[]$ salloc -A <project-name> -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job
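
A partition name can also be checked against the list that sinfo reports before it is used in a request. The sketch below assumes a hypothetical helper, `partition_exists`, that reads one partition name per line on standard input. It is an illustration and not something provided on the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the partition named in $1 appears in
# the list read from stdin. A trailing "*" (sinfo's marker for the
# default partition) is ignored when comparing.
partition_exists() {
    local wanted=$1 name
    while read -r name; do
        [[ ${name%\*} == "$wanted" ]] && return 0
    done
    return 1
}

# On the cluster (needs Slurm), names can be piped straight from sinfo:
#   sinfo -h -o '%P' | sort -u | partition_exists mb-l40s && echo "ok"
```

Here `sinfo -h -o '%P'` prints only the partition column without a header, so a mistyped name such as `mb-l40` fails the check while `mb-l40s` passes.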

...

Timeouts

Info

Timeouts aren’t errors as such; they just mean that the time you requested was not long enough to complete the computation.

Info

The maximum allowed wall time is 7 days:

Code Block
[]$ salloc -A <project-name> -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[]$ salloc -A <project-name> -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job
Note

Do not request 7 days just because you can!

Wall time is considered when Slurm tries to allocate your job.

A job with a shorter wall time is more likely to be backfilled (slotted onto the cluster) in busy times than pending jobs with longer wall times.

...

My Jobs Need to Run Longer than 7 Days

Info

ARCC can provide users with wall times longer than 7 days.

Please contact us, but we require that you demonstrate that your job cannot be optimized, for example:

  • Can it run faster by using more cores, or even multiple nodes?

  • Can it utilize GPUs?

  • Can the job actually be divided up into sections that can be run concurrently across multiple jobs?

ARCC can provide assistance with trying to understand if a job can be optimized.

...

Requested node configuration is not available

Info

This occurs when you request a configuration that isn’t available, or one that requires more details. For example:

Too many cores on a node:

Code Block
[]$ salloc -A <project-name> -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available

Must define a GPU enabled partition:

Code Block
[]$ salloc -A <project-name> -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[]$ salloc -A <project-name> -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job

...

OUT-OF-MEMORY: Segmentation Fault

Note

Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job.

Basically, your job has run out of the memory it requested.

Info

Resolved: Request more memory using either the --mem or --mem-per-cpu option.
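
One way to choose a new memory request is to start from the peak memory a previous run actually used (for example the MaxRSS field reported by `sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,State`) and add some headroom. The helper below is a hypothetical sketch of that arithmetic. The name `suggest_mem_mb` and the default 20% headroom are assumptions for illustration, not ARCC guidance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given an observed peak memory in MB and an
# optional headroom percentage (default 20, an arbitrary choice),
# print a value in MB that could be passed to --mem.
suggest_mem_mb() {
    local peak_mb=$1 headroom_pct=${2:-20}
    echo $(( peak_mb * (100 + headroom_pct) / 100 ))
}

# Example use in a submission script, after running the helper by hand:
#   #SBATCH --mem=14400M
```

For a job that peaked at 12000 MB, `suggest_mem_mb 12000` prints 14400, which would be requested as `--mem=14400M`.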

...

My Job Stopped and Re-Started: Preemption

Info
  • As discussed in the Intro to HPC workshop, we have a Condominium Model where, if your job is running on a compute node that is part of another project’s hardware investment, your job can be preempted.

  • Your job will be stopped and automatically re-queued, and when resources become available on the cluster, it will be restarted.

  • Further details, including how to use the non-investor partition to prevent this from happening, can be found on our Slurm and Preemption page.

...

Why Is My Job One-of-Many on a Compute Node?

Info

When I run pestat, it appears that my job is one of many on a particular compute node.

Code Block
[]$ pestat -n mbl40s-001
Select only nodes in hostlist=mbl40s-001
Hostname         Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                              State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mbl40s-001         mb-l40s     mix   64  96   55.62*   765525   310618  1480439 hbalantr 1440260 vvarenth

As discussed in the Intro to HPC workshop when talking about Compute Nodes, this is perfectly acceptable and is one of the tasks that Slurm manages.

Remember: All jobs are independent and do not affect anyone else.

...

Can I Submit Jobs from within a Script?

Info

General Use Case: You have a job running on the cluster, from which you’d like to submit further jobs.

General Answer: Yes.

If you have the scripting ability, then you can write code that creates a submission script, and then calls the sbatch command.

This submission will be treated as any other submission and be added to the queue, and depending on the current cluster utilization might be pending before it starts running.

This can also be performed from scripts that are already running as part of a job.

Is this a good idea? Again yes. There are existing applications that do exactly this, and with some extra Slurm understanding, you can have jobs that are dependent on other jobs i.e. job B won’t start until job A is completed, basically breaking a pipeline down into a sequence of jobs.
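
As a minimal sketch of the job A then job B pattern described above, the hypothetical function below (`submit_chain` is an illustrative name) submits a list of scripts so that each one waits for the previous one to finish successfully. It relies on two standard sbatch options: `--parsable`, which prints the job ID, and `--dependency=afterok:<jobid>`. The `SUBMIT_CMD` variable is an assumption added only so the function can be tried without Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit scripts as a chain where each job starts
# only after the previous one completed successfully (afterok).
# SUBMIT_CMD is an illustrative override; it defaults to sbatch.
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [[ -z $prev ]]; then
            jobid=$("${SUBMIT_CMD:-sbatch}" --parsable "$script")
        else
            jobid=$("${SUBMIT_CMD:-sbatch}" --parsable \
                --dependency="afterok:$prev" "$script")
        fi
        # --parsable may print "jobid;cluster"; keep only the job ID.
        prev=${jobid%%;*}
        echo "$script -> job $prev"
    done
}
```

Calling `submit_chain step1.sh step2.sh step3.sh` (assumed script names) queues all three at once, and Slurm holds the later steps as pending until their predecessor has finished.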

Note

Note: There is a maximum number of jobs that Slurm can accommodate within the queue (currently set at 75K). Do not try submitting more than this in one batch, and throttle your submissions, e.g. submit 10 every second.

If you try submitting 1000s in a single call then you can affect Slurm. Be a good cluster citizen.
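
The throttling described above could look something like the following sketch. `submit_throttled` is a hypothetical function, and `SUBMIT_CMD` is an assumed override that exists only so the loop can be tried without Slurm. The batch size and pause are whatever you choose; 10 and 1 mirror the "10 every second" suggestion.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit each script in turn, pausing for $2
# seconds after every $1 submissions so Slurm is not flooded.
# SUBMIT_CMD is an illustrative override; it defaults to sbatch.
submit_throttled() {
    local batch_size=$1 pause=$2 count=0 script
    shift 2
    for script in "$@"; do
        "${SUBMIT_CMD:-sbatch}" "$script"
        count=$(( count + 1 ))
        if (( count % batch_size == 0 )); then
            sleep "$pause"
        fi
    done
}
```

For example, `submit_throttled 10 1 jobs/*.sh` (with `jobs/` an assumed directory of submission scripts) submits ten scripts, waits one second, and repeats.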

Possible Alternative: Use a Slurm Job Array, which allows you to submit a single submission script that is run once for each element of the array (current max size is 10K). For example, if you request an array of size 100, Slurm will automatically submit 100 jobs, each with a copy of the original submission script. With a little scripting you can have each copy use different input values/data, while all perform the same defined workflow.
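
The "little scripting" for a job array often amounts to using the task's index to pick its input. Slurm sets the `SLURM_ARRAY_TASK_ID` environment variable in each array task. The sketch below assumes a hypothetical text file with one input per line and a hypothetical helper, `input_for_task`, that returns the line matching the task ID.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print line N of a file, where N is the array
# task ID, so each array task picks up a different input.
input_for_task() {
    local list=$1 task_id=$2
    sed -n "${task_id}p" "$list"
}

# Inside a submission script that requests `#SBATCH --array=1-100`
# (inputs.txt and my_program are assumed names):
#   input=$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")
#   ./my_program "$input"
```

Each of the 100 tasks runs the same script but reads a different line of the input list, so the workflow stays identical while the data varies.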

...

Closing a Linux Session while Running an salloc

Note

If you have a Linux session running in a terminal in which an salloc interactive session is running, and your terminal session closes or is interrupted for any reason, your interactive session will be stopped.

You cannot go back into it from the command line; you will have to start a new interactive session.

...

...