Investment FAQs

I would like to use special hardware that you don’t provide on the cluster currently. How would I get that added?

This is where investments come in. Investing is the only way to get specialty hardware that we don’t currently have installed on the cluster.

How do investment partitions work and why is my job pending if I have a partition?

This can happen for a number of reasons. Usually, either your investment is already occupied by jobs that were queued before yours, or your job cannot run entirely within your investment and is queued while it waits for additional resources that fall outside of it.
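To see which case applies to a particular job, you can ask Slurm for the reason it reports for each pending job. The following is a minimal sketch using standard `squeue` format codes; the function name `pending_reasons` is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# List pending jobs for a user together with the reason Slurm gives for the wait.
# pending_reasons is a hypothetical helper name, not a Slurm command.
pending_reasons() {
  # %i = job ID, %P = partition, %T = job state, %r = reason the job is waiting
  # -h suppresses the header line; the user defaults to $USER when no argument is given.
  squeue -u "${1:-$USER}" -t PENDING -h -o "%i %P %T %r"
}
```

Calling `pending_reasons` with no argument lists your own pending jobs. A reason such as `Resources` or `Priority` generally indicates that the job is waiting for nodes to free up or for earlier jobs to start.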

What is the job queuing workflow if I have an investment partition?

The answer depends on whether you have explicitly defined a partition in your Slurm script:

  1. If you explicitly define a partition in your script, then that partition (and its nodes) is the one that Slurm will try to allocate your job across. If you request a combination of nodes/cores/memory/GPUs that the partition cannot provide in full, then Slurm will not accept the job, and you will be given an appropriate message. An explicitly defined partition takes precedence even if you have an investment.

  2. If your project does have an investment and you do NOT define a partition, then Slurm will create a list of partitions on which to try to run your job.

    1. Slurm will automatically try to determine if you have an investment partially based on project name, and attempt to run your job on the investment first.

    2. If there are jobs running on your investment belonging to users who are not part of your project, Slurm will preempt these jobs (stop them and add them back to the queue) and immediately start your job.

    3. If your investment is 'full' with jobs from users who are members of your project, then Slurm will try to allocate your job across the other partitions if resources are available.

    4. If there are no resources available to fit your job (i.e. cluster usage is very high), then your job will have a state of pending (i.e. waiting in the queue). Slurm checks the queue at regular intervals and runs the job when appropriate resources become available.
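As an illustration of the first case, a batch script that explicitly names a partition might look like the sketch below. The account and partition names (`my_project`, `my_investment`) and the resource values are hypothetical placeholders. Removing the `--partition` line corresponds to the second case, where Slurm builds the list of partitions for you.

```shell
#!/bin/bash
# All names and values below (my_project, my_investment, sizes) are hypothetical placeholders.
#SBATCH --job-name=example
#SBATCH --account=my_project
#SBATCH --partition=my_investment
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Slurm sets SLURM_JOB_PARTITION inside a running job; fall back when run outside Slurm.
partition="${SLURM_JOB_PARTITION:-unknown}"
echo "Running on partition: ${partition}"
```

Printing the partition at the start of a job is a simple way to confirm afterwards where Slurm actually placed it.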

I am getting a QOSLimit Message/Error when submitting jobs to Slurm, and then my jobs are queued. Why?

  1. This error will occur if the total number of CPUs requested by jobs running under your project exceeds the CPU limit for your investment.

    1. This includes your own jobs and those run by other members of the same project. When the total CPUs requested by everyone in the group exceeds the limit available for your investment, you will receive this error - even if not all your investment nodes are in use.

    2. Usually this occurs because you or other project members are running jobs that extend to other nodes/resources on the cluster that are outside of your investment.

  2. You may check total usage for your project by running: squeue -A <project_name>

  3. If you only want to check the availability of your investment nodes specifically, run: sinfo -p <investment_name>

    1. Note that your project name and your investment name may not be the same.

  4. See this page to determine your investment group’s QOSGrpCpuLimit.
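To compare your project's current usage against that limit, you can total the CPU counts that `squeue` reports for running jobs. This is a rough sketch rather than an official tool: the helper names `sum_cpus` and `project_cpus_in_use` are hypothetical, and it assumes your project name is the Slurm account passed to `-A`.

```shell
#!/bin/bash
# Sum CPU counts read from stdin, one number per line. Prints 0 for empty input.
sum_cpus() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Total CPUs held by running jobs under an account (project).
# project_cpus_in_use is a hypothetical helper name, not a Slurm command.
project_cpus_in_use() {
  # %C = CPU count per job; -h suppresses the header line
  squeue -A "$1" -t RUNNING -h -o "%C" | sum_cpus
}
```

For example, `project_cpus_in_use <project_name>` prints a single number that you can compare with the CPU limit listed for your investment group.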