Overview
This page describes some of the common Slurm messages a user may see after a job has been submitted and is waiting in the queue to run.
Messages
QOSGrpCpuLimit
“The job's QOS has reached its aggregate CPU limit.”
In short, the user's project group as a whole has submitted enough jobs that the total number of CPUs in use (across all of the group's jobs, potentially spanning both their investment and resources outside it) has reached the group's Quality of Service (QoS) CPU limit. This limit is currently set to 75% of total ARCC cluster CPUs (as of 2023-03-27 on Beartooth this is 4144 cores), plus the group's investment if they have one.
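As a rough illustration of the arithmetic behind this limit, the sketch below adds 75% of a cluster's CPUs to an optional investment. The function name `qos_cpu_cap`, the rounding-down behaviour, and the core counts are all assumptions made for the example; they are not real ARCC figures or Slurm settings.

```python
def qos_cpu_cap(total_cluster_cpus: int, investment_cpus: int = 0) -> int:
    """Illustrative group CPU cap: 75% of the cluster plus any investment."""
    # Integer arithmetic keeps the result exact; rounding down is an assumption.
    shared_share = total_cluster_cpus * 75 // 100
    return shared_share + investment_cpus


# Hypothetical numbers chosen for easy arithmetic, not real ARCC values.
print(qos_cpu_cap(1000))        # project without an investment -> 750
print(qos_cpu_cap(1000, 200))   # project with a 200-core investment -> 950
```

A project with an investment therefore has a higher cap than one without, but the cap still applies to the group's total usage.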
This cap is in place to prevent any one user/project from using the entire cluster.
If a user/account has an investment, jobs are first allocated to the investment nodes, provided the jobs can actually fit on those nodes. If the investment is full, i.e. no 'other account' jobs can be preempted, jobs then use nodes outside the investment.
As soon as the QoS limit is reached, Slurm holds any new jobs in the pending state, even if free cores are available on an investment.
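The pending behaviour can be pictured with a toy model: a new job is held whenever the group's CPUs already in use plus the CPUs requested would exceed the cap, regardless of where free cores happen to be. This is a simplified sketch, not Slurm's actual scheduling logic; the function `job_state`, its return strings, and the numbers are hypothetical.

```python
def job_state(cpus_in_use: int, cpus_requested: int, qos_cap: int) -> str:
    """Toy check of a group CPU limit for a newly submitted job."""
    # The limit is evaluated against the group's total usage,
    # so free cores on an investment do not change the outcome.
    if cpus_in_use + cpus_requested > qos_cap:
        return "PENDING (QOSGrpCpuLimit)"
    return "ELIGIBLE"


# Hypothetical group with a cap of 950 CPUs.
print(job_state(cpus_in_use=900, cpus_requested=32, qos_cap=950))  # fits under the cap
print(job_state(cpus_in_use=940, cpus_requested=32, qos_cap=950))  # would exceed the cap
```

In this model a job stays pending until enough of the group's running jobs finish to bring total usage back under the cap.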