Goal: List some common issues and how to resolve them.

...

  • How do I know how many nodes and cores, how much memory, etc. to request for my jobs?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? How busy is the cluster?

  • How do I monitor the progress of my job?

  • My job finished before the wall time - what happens?

...

Common Questions: Suggestions

  • How do I know how many nodes and cores, how much memory, etc. to request for my jobs?

    • Understand your software and application.

      • Read the docs: look at the help for commands/options.

      • Can it run multiple threads, i.e. use multiple cores (OpenMP) or multiple nodes (MPI)?

      • Can it use a GPU (NVIDIA CUDA)?

      • Are there suggestions on data and memory requirements?

  • How do I find out whether a cluster/partition supports these resources?

  • How do I find out whether these resources are available on the cluster?

  • How long will I have to wait in the queue before my job starts? 

    • How busy is the cluster? 

    • Current cluster utilization: use the commands sinfo / arccjobs / pestat, and OnDemand’s MedicineBow System Status page.

  • How do I monitor the progress of my job?

    • Slurm commands: squeue

  • My job finished before the wall time - what happens?

    • If your job has completely finished before the wall time you requested (e.g. it finished in two hours and you requested four hours), i.e. its status is no longer running, then Slurm will remove the job from the queue, release any requested resources back to the cluster, and allow other jobs to start running. Your job is not sitting idle on the cluster waiting for the wall time to run down.
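The questions above about what a partition offers, how busy it is, and how a job is progressing can be sketched with standard Slurm commands. This is a minimal sketch, not a site-specific recipe: the function names are hypothetical wrappers, and the output columns are just one reasonable choice of sinfo, squeue and sacct format fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper functions wrapping standard Slurm commands.

# Partitions: availability, time limit, node count, CPUs and memory (MB)
# per node, and generic resources such as GPUs.
partition_summary() {
    sinfo -o "%P %a %l %D %c %m %G" "$@"
}

# Your own jobs: ID, partition, name, state, time used, time limit,
# and the reason a job is pending (or the nodes it is running on).
my_jobs() {
    squeue -u "${USER:-$(id -un)}" -o "%i %P %j %T %M %l %R" "$@"
}

# Slurm's current estimate of when your pending jobs will start.
my_start_estimates() {
    squeue -u "${USER:-$(id -un)}" --start
}

# After a job has ended: compare elapsed time with the requested wall time,
# and peak memory use (MaxRSS) with the requested memory (ReqMem).
job_usage() {
    sacct -j "$1" --format=JobID,State,Elapsed,Timelimit,ReqMem,MaxRSS
}
```

Running job_usage on a few completed jobs is one way to judge whether future requests for time and memory can be reduced.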

...

Common Issues

  • Not defining the account and time options.

  • The account is the name of the project you are associated with. It is not your username.

  • Requesting combinations of resources that cannot be satisfied: Medicine Hardware Summary Table

    • For example, you cannot request 40 cores on a compute node with a max of 32.

    • Requesting more memory, or more GPU devices, than a partition provides.

  • My job is pending. Why?

    • Because the resources are currently not available.

      • Be mindful of specialized nodes (such as our huge mem nodes with 4T of RAM): we might only have a few of them.

    • Have you unnecessarily defined a specific partition (restricted yourself) that is busy?

    • We only have a small number of GPUs.

    • This is a shared resource - sometimes you just have to be patient…

    • Check current cluster utilization.

    • Whatever resources you are asking for are currently not available. Slurm will start your job when they become available.

    • We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue.

  • Preemption: Users of an investment get priority on their hardware.

    • We have the non-investor partition.
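The core-count issue above can be checked before submitting. The following is a minimal sketch, assuming the partition name is passed in by the caller; the function names are hypothetical, and only the standard sinfo command is used to look up the CPU count per node.

```shell
#!/usr/bin/env bash
# Hypothetical pre-submission check: does a core request fit on one node?

# Largest CPU count of any node in the given partition.
# -h drops the header, -e reports exact values per node configuration.
max_cpus_per_node() {
    local partition="$1"
    sinfo -h -e -p "$partition" -o "%c" | tr -d '+ ' | sort -n | tail -n 1
}

# Returns 0 if the requested cores fit on a single node, 1 if they do not,
# and 2 if no nodes were found for the partition.
fits_on_one_node() {
    local partition="$1" requested="$2" max
    max="$(max_cpus_per_node "$partition")"
    if [ -z "$max" ]; then
        echo "no nodes found in partition ${partition}" >&2
        return 2
    fi
    if [ "$requested" -gt "$max" ]; then
        echo "requested ${requested} cores, but nodes in ${partition} have at most ${max}" >&2
        return 1
    fi
    return 0
}
```

For example, `fits_on_one_node <partition> 40` fails on a partition whose nodes have at most 32 cores. The account and time options from the first issue are passed to sbatch as `--account=<project>` and `--time=<hh:mm:ss>`, either on the command line or as `#SBATCH` lines in the submission script.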

...

Info

When I run pestat, it appears that my job is one of many on a particular compute node.

Code Block
[]$ pestat -n mbl40s-001
Select only nodes in hostlist=mbl40s-001
Hostname         Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                              State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mbl40s-001         mb-l40s     mix   64  96   55.62*   765525   310618  1480439 hbalantr 1440260 vvarenth

As discussed in the Intro to HPC workshop when talking about Compute Nodes, this is perfectly acceptable and is one of the tasks that Slurm manages.

Remember: All jobs are independent and do not affect anyone else.

...

Can I Submit Jobs from within a Script?

Info

General Use Case: You have a job running on the cluster, from which you’d like to submit further jobs.

General Answer: Yes.

If you have the scripting ability, then you can write code that creates a submission script, and then calls the sbatch command.

This submission will be treated like any other submission: it will be added to the queue and, depending on the current cluster utilization, might be pending before it starts running.

This can also be performed from scripts that are already running as part of a job.

Is this a good idea? Again, yes. There are existing applications that do exactly this. With some extra Slurm understanding, you can have jobs that are dependent on other jobs, i.e. job B won’t start until job A is completed, which breaks a pipeline down into a sequence of jobs.
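The approach described above (write a submission script, call sbatch, and chain dependent jobs) might look like the following minimal sketch. The function names, the SBATCH_CMD variable, and the file and project names in the usage note are hypothetical; `--parsable` and `--dependency=afterok:<jobid>` are standard sbatch options.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for creating and submitting jobs from a script.

# Command used to submit; overridable, e.g. for a dry run.
SBATCH_CMD="${SBATCH_CMD:-sbatch}"

# Write a minimal submission script that defines the account and time options.
write_job_script() {
    local path="$1" account="$2" walltime="$3" command="$4"
    cat > "$path" <<EOF
#!/bin/bash
#SBATCH --account=${account}
#SBATCH --time=${walltime}
${command}
EOF
}

# Submit a script and print only the job ID.
# --parsable makes sbatch print "jobid" or "jobid;cluster".
submit_job() {
    local out
    out="$("$SBATCH_CMD" --parsable "$@")" || return 1
    printf '%s\n' "${out%%;*}"
}

# Submit a job that starts only after the parent job completes successfully.
# (afterok requires success; afterany would start it once the parent ends
# in any state.)
submit_after() {
    local parent_id="$1"
    shift
    submit_job --dependency="afterok:${parent_id}" "$@"
}
```

A two-step pipeline could then be submitted as `a_id="$(submit_job step_a.sh)"` followed by `submit_after "$a_id" step_b.sh`, where both script names are placeholders.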

Note

Note: There is a maximum number of jobs that Slurm can accommodate within the queue (currently set at 75K). Do not try submitting more than this in one batch, and throttle your submissions, e.g. submit, say, 10 every second.

If you try submitting thousands of jobs in a single call then you can affect Slurm. Be a good cluster citizen.

Possible Alternative: Use a Slurm Job Array, which lets you submit a single submission script that is run once for each element of the array (current max size is 10K). For example, if you request an array of size 100, Slurm will automatically submit 100 jobs, each with a copy of the original submission script. With a little scripting you can have each copy use different input values/data, while all perform the same defined workflow.
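Both suggestions in the note above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the SBATCH_CMD variable, and the inputs.txt file are hypothetical, while `--array` and the SLURM_ARRAY_TASK_ID environment variable are standard Slurm features.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for throttled submission and job arrays.

# Command used to submit; overridable, e.g. for a dry run.
SBATCH_CMD="${SBATCH_CMD:-sbatch}"

# Submit the given scripts, pausing one second after every N submissions.
# Usage: submit_throttled 10 job_*.sh
submit_throttled() {
    local per_second="$1"
    shift
    local count=0 script
    for script in "$@"; do
        "$SBATCH_CMD" "$script"
        count=$((count + 1))
        if (( count % per_second == 0 )); then
            sleep 1
        fi
    done
}

# Print line number TASK_ID of a file that lists one input per line.
input_for_task() {
    local list_file="$1" task_id="$2"
    sed -n "${task_id}p" "$list_file"
}

# Inside an array submission script (file name is a placeholder):
#   #SBATCH --array=1-100
#   input="$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")"
```

With the array form, a single sbatch call replaces 100 separate submissions, and each array task picks its own line from the input list.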

...

Closing a Linux Session while Running an salloc

Note

If you have a Linux session running in a terminal, in which you have an salloc interactive session running, and your terminal session closes or is interrupted for any reason, your interactive session will be stopped.

You cannot go back into it from the command line; you will have to start a new interactive session.

...

...