...
“The job's QOS has reached its aggregate CPU limit.”
This message appears because all projects on Beartooth are subject to a maximum concurrent core usage policy, which limits the total number of CPUs that can be allocated at one time by all users in a project.
In a nutshell, the user's project group as a whole is submitting enough jobs that the total number of CPUs in use (across all jobs, potentially spanning both their investment and resources outside their investment) is hitting the project's Quality of Service (QOS) CPU limit. ARCC reviews this limit on a regular basis. As of March 28, 2023, it is set to 75% 100% of the non-investor partition nodes/CPUs (as of 20230327 on Beartooth this is 4144 (which is 5920 cores)), plus their investment if they have one.
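To see how close a project is to this limit, one option is to add up the CPUs held by its running jobs. The following is a minimal sketch, not an ARCC-provided tool. It assumes the project corresponds to a Slurm account, that `squeue` is on the path, and that `myproject` is replaced with a real account name (it is a placeholder). The helper names `total_cpus` and `running_cpus` are made up for this illustration.

```python
import subprocess


def total_cpus(squeue_output: str) -> int:
    """Sum per-job CPU counts, given one integer per line of squeue output."""
    return sum(int(value) for value in squeue_output.split())


def running_cpus(account: str) -> int:
    """Return the CPUs currently allocated to running jobs in a Slurm account.

    Assumption: the project maps to a Slurm account of the same name.
    squeue options used: -h (no header), -A (account), -t (job state),
    -o "%C" (print only each job's CPU count).
    """
    result = subprocess.run(
        ["squeue", "-h", "-A", account, "-t", "RUNNING", "-o", "%C"],
        capture_output=True,
        text=True,
        check=True,
    )
    return total_cpus(result.stdout)


if __name__ == "__main__":
    # "myproject" is a hypothetical account name; substitute your own.
    print(running_cpus("myproject"))
```

If the number printed is at or near the project's limit, newly submitted jobs will stay pending with the message above until running jobs finish and release their CPUs.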
...