Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Goal: List some common issues and how to resolve.

...

  • How do I know what number of nodes, cores, memory etc to ask for my jobs?

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads - use multi cores (OpenMP) / nodes (MPI)?

      • Can it use a GPU? Nvidia cuda.

      • Are their suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current Cluster utilization: Commands sinfo / arccjobs / pestat and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm commands: squeue

...

Common Issues

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that can not be satisfied: Medicine Hardware Summary Table

    • For example, you can not request 40 cores on a compute node with a max of 32.

    • Requesting too much memory, or too many GPU devices with respect to a partition.

  • My job is pending? Why? 

    • Because the resources are currently not available.

    • Have you unnecessarily defined a specific partition (restricted yourself) that is busy

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

  • Preemption: Users of an investment get priority on their hardware.

    • We have the non-investor partition.

...

OUT-OF-MEMORY: Segmentation Fault

Infonote

Segmentation faults are typically caused by an application trying to access memory outside what has been allocated to the job.

Basically, you job is out of memory of what it requested.

Info

Resolved: Request more memory using either the mem or mem-per-cpu .

...

My Job Stopped and Re-Started: Preemption

Info
  • As discussed in the Intro to HPC workshop, we have a Condominium Model where if your job is running an a compute node that is part of another project’s hardware investment, your job can be preempted.

  • Your job will be stopped and automatically re-queued and when resources come available on the cluster, it will be restarted.

  • Further details can be found on our Slurm and Preemption page and how to use the non-investor partition to prevent this from happening.

...

Why Is My Job One-of-Many on a Compute Node?

Info

When I run pestat, it appears that my job is one of many on a particular compute node.

Code Block
[]$ pestat -n mbl40s-001
Select only nodes in hostlist=mbl40s-001
Hostname         Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                              State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mbl40s-001         mb-l40s     mix   64  96   55.62*   765525   310618  1480439 hbalantr 1440260 vvarenth

As discussed in the Intro to HPC workshop when talking about Compute Nodes this is perfectly acceptable and one of the tasks that Slurm manages.

Remember: All jobs are independent and do not affect anyone elseselse.

...

...