Overview
This page details some of the common Slurm messages a user may see after a job has been submitted and is waiting in the queue to run.
Messages
QOSGrpCpuLimit
“The job's QOS has reached its aggregate CPU limit.”
In short, the user's project group as a whole has submitted enough jobs that the total number of CPUs in use (across all jobs, potentially spanning both their investment and resources outside it) has reached their Quality of Service (QOS) CPU limit. ARCC reviews this limit on a regular basis. As of 28 March 2023, it is set to 100% of the non-investor partition (5920 cores), plus the project's investment if it has one.
This cap is in place to prevent any one user or project from using the entire cluster.
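As a rough illustration of the arithmetic above, the sketch below computes a project's cap from the 5920-core figure plus an investment size. The function name and the 128-core investment are assumptions made for this example, not ARCC values or a Slurm API.

```python
# Cores in the non-investor partition, as stated above (28 March 2023).
NON_INVESTOR_CORES = 5920


def qos_cpu_cap(investment_cores: int = 0) -> int:
    """Return the aggregate CPU cap: 100% of the non-investor partition
    plus the project's own investment, if any (hypothetical helper)."""
    if investment_cores < 0:
        raise ValueError("investment_cores must be non-negative")
    return NON_INVESTOR_CORES + investment_cores


# A project with no investment, and one with an assumed 128-core investment.
print(qos_cpu_cap())     # 5920
print(qos_cpu_cap(128))  # 6048
```

Because the limit is reviewed regularly, the constant would need updating whenever ARCC changes it.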
If a user/account has an investment, Slurm first tries to allocate jobs to the investment nodes, provided the jobs can actually fit on them. Once the investment is full, i.e. there are no 'other account' jobs left on it that can be preempted, jobs use nodes outside the investment.
As soon as the QOS limit is reached, Slurm holds any new jobs as pending, even if free cores are available on an investment.
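To see which of your pending jobs are being held for this reason, one option is to ask squeue for each job's ID and reason and filter on QOSGrpCpuLimit. The following is a minimal sketch rather than an ARCC-provided tool: the function names and the sample data are made up for illustration, and it assumes squeue is on your PATH and prints one `jobid|reason` pair per line when given the `%i|%r` format.

```python
import getpass
import subprocess

REASON = "QOSGrpCpuLimit"


def jobs_blocked_by_qos(squeue_output: str) -> list[str]:
    """Return IDs of jobs whose pending reason is QOSGrpCpuLimit.

    Expects one 'jobid|reason' pair per line, as produced by
    squeue --format='%i|%r'.
    """
    blocked = []
    for line in squeue_output.splitlines():
        job_id, sep, reason = line.strip().partition("|")
        if sep and reason.strip() == REASON:
            blocked.append(job_id.strip())
    return blocked


def pending_jobs_output(user: str) -> str:
    """Ask squeue for the user's pending jobs (requires a Slurm cluster)."""
    cmd = ["squeue", "--noheader", "--user", user,
           "--states=PENDING", "--format=%i|%r"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Made-up sample output, so the parsing can be tried without a cluster.
sample = "1001|QOSGrpCpuLimit\n1002|Priority\n1003|QOSGrpCpuLimit\n"
print(jobs_blocked_by_qos(sample))  # ['1001', '1003']

# On a cluster:
# print(jobs_blocked_by_qos(pending_jobs_output(getpass.getuser())))
```

Jobs listed here should start on their own once enough of the project's running jobs finish to bring its total CPU use back under the limit.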