How do investment partitions work and why is my job pending if I have a partition?
This can happen for several reasons. Usually it is because your investment is already occupied by jobs that were queued before yours, or because your job cannot run entirely within your investment and is therefore waiting for additional resources outside of it.
What is the job queuing workflow if I have an investment partition?
The answer depends on whether you have explicitly defined a partition in your Slurm script:
If you explicitly define a partition in your script, Slurm will try to allocate your job only across that partition and its nodes. If you request a combination of nodes/cores/memory/GPUs that the partition cannot provide in full, Slurm will not accept the job and you will be shown an appropriate message. An explicitly defined partition takes precedence even if you have an investment.
If your project does have an investment and you do NOT define a partition, then Slurm will create a list of partitions on which to try to run your job.
Slurm will automatically try to determine if you have an investment, and attempt to run your job on the investment first.
If jobs belonging to users who are not part of your project are running on your investment, Slurm will preempt those jobs (stop them and return them to the queue) and start your job immediately.
If your investment is 'full' with jobs from users who are members of your project, then Slurm will try to allocate your job on the other partitions, provided resources are available there.
If no resources are available to fit your job (i.e. cluster usage is very high), then your job will have a state of pending (i.e. waiting in the queue). Slurm checks the queue at regular intervals and runs the job once appropriate resources become available.
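The workflow above, for a project that has an investment and does not define a partition, can be sketched as a small decision function. This is only an illustrative model of the steps described here, not how Slurm is implemented. The names `ClusterState` and `queue_outcome`, the use of core counts as the only resource, and the assumption that a job is placed wholly on the investment or wholly on the other partitions are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of available capacity, counted in cores."""
    investment_free: int         # idle cores on your project's investment
    investment_preemptible: int  # investment cores used by jobs from other projects
    general_free: int            # idle cores on the other partitions


def queue_outcome(cores_needed: int, state: ClusterState) -> str:
    """Return where a job would be placed, following the steps described above.

    Assumes no partition was defined in the job script and that the
    project has an investment (illustrative model only).
    """
    # Step 1: try the investment first, using idle cores only.
    if cores_needed <= state.investment_free:
        return "investment"

    # Step 2: jobs from other projects on the investment can be preempted,
    # which frees their cores for this job.
    if cores_needed <= state.investment_free + state.investment_preemptible:
        return "investment-after-preemption"

    # Step 3: the investment is full with jobs from your own project,
    # so fall back to the other partitions if they have room.
    if cores_needed <= state.general_free:
        return "other-partitions"

    # Step 4: nothing fits right now, so the job waits in the queue.
    return "pending"


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    state = ClusterState(investment_free=4, investment_preemptible=8, general_free=16)
    for cores in (4, 10, 16, 32):
        print(cores, queue_outcome(cores, state))
```

In this model a job ends up pending only when neither the investment (including any preemptible cores) nor the other partitions can fit it, which matches the most common reason described above for a job waiting despite an investment.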