Goal: List some common issues and how to resolve them.

...

  • How do I know what number of nodes, cores, memory, etc. to ask for in my jobs? (See the sketch after this list.)

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads across multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Check current cluster utilization: the sinfo / arccjobs / pestat commands and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm command: squeue (see the example after this list).
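
As a concrete illustration of turning the answers above into a request, here is a minimal sketch of a batch script. It is illustrative only: the resource values are placeholders, not recommendations, and <project-name> must be replaced with your actual project.

Code Block
#!/bin/bash
# Minimal illustrative sketch - resource values are placeholders, not recommendations.
#SBATCH --account=<project-name>    # your project, not your username
#SBATCH --time=01:00:00             # walltime (required)
#SBATCH --nodes=1                   # MPI applications may span multiple nodes
#SBATCH --ntasks-per-node=1         # MPI ranks per node
#SBATCH --cpus-per-task=8           # cores/threads for an OpenMP-capable application
#SBATCH --mem=16G                   # memory per node
##SBATCH --partition=mb-a30         # uncomment for a CUDA-capable application...
##SBATCH --gres=gpu:1               # ...and request a GPU device on that partition

srun ./my_application               # placeholder executable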
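
To monitor a job, you can list your own jobs and their state with squeue. This is a minimal sketch, not output captured from a real session; the job ID, name, and node are placeholder values.

Code Block
# ST shows the job state, e.g. R (running) or PD (pending).
[salexan5@mblog2 ~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1250349        mb interact salexan5  R       5:21      1 mbcpu-025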

...

Common Issues

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: see the MedicineBow Hardware Summary Table.

    • For example, you cannot request 40 cores on a compute node that has a maximum of 32.

    • Requesting too much memory or too many GPU devices for the chosen partition.

  • My job is pending. Why? (See the example after this list.)

    • Because the resources are currently not available.

      • Be mindful of specialized nodes (such as our huge-memory nodes with 4TB of RAM); we might only have a few of them.

    • Have you unnecessarily defined a specific partition (restricting yourself) that is busy?

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

    • Whatever resources you are asking for are currently not available. Slurm will start your job when they become available.

    • We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.

  • Preemption: Users of an investment get priority on their hardware.

    • For users who are not part of an investment, there is the non-investor partition.
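
To see why a particular job is still pending, check the reason Slurm reports. This is a minimal sketch; the job ID, partition, and reason shown are illustrative values, not output from a real session.

Code Block
# REASON is typically Resources (nothing is free yet) or Priority (other jobs are ahead of yours).
[salexan5@mblog2 ~]$ squeue -u $USER -t PENDING -O JobID,Partition,State,Reason
JOBID               PARTITION           STATE               REASON
1250907             mb                  PENDING             Priority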

...

Info

Remember: By default you must define the project (account) you’re using and a walltime.

Code Block
[salexan5@mblog2 ~]$ salloc
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

[salexan5@mblog2 ~]$ salloc -A arcc<project-name>
salloc: error: You didn't specify a walltime (-t, --time=) for the job. Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[salexan5@mblog2 ~]$ salloc -t 10:00
salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help.
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

# The bare minimum:
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00
salloc: Granted job allocation 1250349
salloc: Nodes mbcpu-025 are ready for job

...

Info

If you need to explicitly request a partition, the name must be correct:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --partition=mb-l40
salloc: error: invalid partition specified: mb-l40
salloc: error: Job submit/allocate failed: Invalid partition name specified

...

Expand
Example: sinfo
Code Block
[salexan5@mblog2 ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
mb*             up 7-00:00:00      4    mix mbcpu-[008,010-011,025]
mb*             up 7-00:00:00      8  alloc mbcpu-[001-007,009]
mb*             up 7-00:00:00     13   idle mbcpu-[012-024]
mb-a30          up 7-00:00:00      8   idle mba30-[001-008]
mb-l40s         up 7-00:00:00      1    mix mbl40s-001
mb-l40s         up 7-00:00:00      5   idle mbl40s-[002-005,007]
mb-h100         up 7-00:00:00      1 drain$ mbh100-001
mb-h100         up 7-00:00:00      1  down$ mbh100-006
mb-h100         up 7-00:00:00      1  drain mbh100-002
mb-h100         up 7-00:00:00      3    mix mbh100-[003-005]
mb-a6000        up 7-00:00:00      1   idle mba6000-001
inv-arcc        up   infinite      1    mix mbcpu-025
inv-inbre       up   infinite      1   idle mbl40s-007
inv-ssheshap    up   infinite      1   idle mba6000-001
inv-wysbc       up   infinite      1  alloc mbcpu-001
inv-wysbc       up   infinite      1   idle mba30-001
inv-soc         up   infinite      1    mix mbl40s-001
inv-wildiris    up   infinite      5   idle wi[001-005]
non-investor    up 7-00:00:00      1 drain$ mbh100-001
non-investor    up 7-00:00:00      1  down$ mbh100-006
non-investor    up 7-00:00:00      1  drain mbh100-002
non-investor    up 7-00:00:00      6    mix mbcpu-[008,010-011],mbh100-[003-005]
non-investor    up 7-00:00:00      7  alloc mbcpu-[002-007,009]
non-investor    up 7-00:00:00     24   idle mba30-[002-008],mbcpu-[012-024],mbl40s-[002-005]
Code Block
# Corrected:
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --partition=mb-l40s
salloc: Pending job allocation 1250907
salloc: job 1250907 queued and waiting for resources
salloc: job 1250907 has been allocated resources
salloc: Granted job allocation 1250907
salloc: Nodes mbl40s-001 are ready for job

...

Info

The maximum allowed wall time is 7 days:

Code Block
[arcc-t01@mblog2 ~]$ salloc -A arccanetrain<project-name> -t 7-00:00:01
salloc: error: Job submit/allocate failed: Requested time limit is invalid (missing or exceeds some limit)

[arcc-t01@mblog2 ~]$ salloc -A arccanetrain<project-name> -t 7-00:00:00
salloc: Granted job allocation 1251651
salloc: Nodes mbcpu-010 are ready for job

...

Too many cores on a node:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 -c 100
salloc: error: CPU count per node can not be satisfied
salloc: error: Job submit/allocate failed: Requested node configuration is not available

You must define a GPU-enabled partition when requesting a GPU:

Code Block
[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --gres=gpu:1
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 1253677 has been revoked.

[salexan5@mblog2 ~]$ salloc -A arcc<project-name> -t 10:00 --gres=gpu:1 --partition=mb-a30
salloc: Granted job allocation 1253691
salloc: Nodes mba30-001 are ready for job
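
Once a GPU allocation has been granted, you can sanity-check that a device is actually visible from inside the job. This is an illustrative sketch only; the UUID is a placeholder, not output from a real session.

Code Block
# Run from within the salloc session; srun inherits the job allocation.
[salexan5@mblog2 ~]$ srun nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)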

...