Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Goal: List some common issues and how to resolve.

...

  • How do I know what number of nodes, cores, memory etc to ask for my jobs?

    • Understand your software and application. 

      • Read the docs – look at the help for commands/options.

      • Can it run multiple threads - use multi cores (OpenMP) / nodes (MPI)?

      • Can it use a GPU? Nvidia cuda.

      • Are their suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current Cluster utilization: Commands sinfo / arccjobs / pestat and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm commands: squeue

...

Common Issues

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that can not be satisfied: Medicine Hardware Summary Table

    • For example, you can not request 40 cores on a compute node with a max of 32.

    • Requesting too much memory, or too many GPU devices with respect to a partition.

  • My job is pending? Why? 

    • Because the resources are currently not available.

      • Be mindful of specialized nodes (such as our huge mem with 4T of RAM) we might only have a few of them.

    • Have you unnecessarily defined a specific partition (restricted yourself) that is busy

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

    • What ever resources you are asking for are currently not available. Slurm will start you job when they become available.

    • We do empathize and under the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.

  • Preemption: Users of an investment get priority on their hardware.

    • We have the non-investor partition.

...

...