Goal: List some common issues and how to resolve them.
Table of Contents
...
Common Questions
- How do I know what number of nodes, cores, memory, etc. to ask for my jobs?
- How do I find out whether a cluster/partition supports these resources?
- How do I find out whether these resources are available on the cluster?
- How long will I have to wait in the queue before my job starts? How busy is the cluster?
- How do I monitor the progress of my job?
- My job finished before the wall time - what happens?
...
Common Questions: Suggestions
- How do I know what number of nodes, cores, memory, etc. to ask for my jobs?
- How do I find out whether a cluster/partition supports these resources?
- How do I find out whether these resources are available on the cluster?
- How long will I have to wait in the queue before my job starts?
- How do I monitor the progress of my job?
- My job finished before the wall time - what happens? If your job has completely finished before the wall time you requested (e.g. it finished in two hours and you requested four hours), i.e. its status is no longer running, then Slurm will remove the job from the queue, release any requested resources back to the cluster, and allow other jobs to start running. Your job is not sitting idle on the cluster waiting for the wall time to run down.
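As a minimal sketch, the standard Slurm commands below answer most of these questions; the job ID 123456 is a placeholder, and the partitions and limits shown will depend on the cluster.

# Show each partition's node count, cores per node, memory per node and GPUs (GRES)
sinfo -o "%P %D %c %m %G"

# Ask the scheduler for an estimated start time of a queued job
squeue -j 123456 --start

# Monitor your own queued and running jobs
squeue -u $USER

# Check elapsed time, state and memory use of a job after it has finished
sacct -j 123456 --format=JobID,JobName,Elapsed,State,MaxRSS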
...
Common Issues
Not defining the account and time options. The account is the name of the project you are associated with; it is not your username. Requesting combinations of resources that cannot be satisfied (see the Medicine Hardware Summary Table): for example, you cannot request 40 cores on a compute node with a maximum of 32, and you cannot request more memory or more GPU devices than a partition provides. A hedged example batch script is sketched below.
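A minimal batch script sketch that sets the account and time options and requests a consistent set of resources. The account name, partition name and executable are placeholders; check the hardware summary table for the real per-node limits before copying any of these numbers.

#!/bin/bash
#SBATCH --account=my_project        # your project name (placeholder), not your username
#SBATCH --time=04:00:00             # wall time; the job is released early if it finishes sooner
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8           # must not exceed the cores on one node of the partition
#SBATCH --mem=16G                   # must not exceed the memory of one node
#SBATCH --partition=gpu             # placeholder; only specify a partition if you need it
#SBATCH --gres=gpu:1                # only request GPUs on partitions that have them

srun ./my_program                   # placeholder executable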
My job is pending - why? Because the resources you requested are currently not available. Have you unnecessarily restricted yourself to a specific partition that is busy? We only have a small number of GPUs. Check the current cluster utilization: whatever resources you are asking for are not available right now, and Slurm will start your job as soon as they become available. We do empathize and understand the frustration, but this is a shared resource, and sometimes we just have to be patient and wait in the queue. Example commands for checking a pending job and cluster utilization are sketched below.
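A sketch of how to see why a job is pending and how busy the cluster is, using standard Slurm commands; the format codes are standard, but the partitions and reason codes you see depend on the site.

# List your pending jobs with the scheduler's reason code (e.g. Resources, Priority)
squeue -u $USER -t PENDING -o "%i %j %T %r"

# Summarise partition availability: nodes allocated/idle/other/total
sinfo -o "%P %a %F"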
Preemption: users of an investment get priority on their own hardware, so jobs from other users running on those nodes may be preempted.
...
Pending Jobs and Scheduled Cluster Maintenance
...