...
Effective January 13, 2025, changes were made to the MedicineBow job scheduler. These changes are detailed here. As a result, your jobs may not run the same way they did previously, may end in an error, or may sit in queue for a longer period of time. Please reference the troubleshooting section below for issues that may occur with jobs after maintenance, typical error messages, and the more common solutions. In the event that this troubleshooting page does not resolve your problem, please don’t hesitate to contact arcc-help@uwyo.edu for assistance.
Frequently Asked Questions
What if I want to submit a CPU job on a GPU node?
Unless you are performing computations that require a GPU, you are restricted from running CPU jobs on a GPU node, with the exception of investors. Investors and their group members may run CPU jobs on the GPU nodes that fall within their investment.
What happens if I don’t specify a QoS?
If you do not specify a QoS as part of your job, a QoS will be assigned to that job based on partition or wall-time. Different partitions and wall-times are associated with different QoS, as detailed in our published Slurm Policy. Should no QoS, partition, or wall-time be specified, the job by default will be placed in the Normal queue with a 3-day wall-time.
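For example, a QoS can be set explicitly with Slurm’s --qos flag. The lowercase QoS name below is an assumption; confirm the exact names against the published Slurm Policy.
salloc -A projectname --qos=normal -t 2-00:00:00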
What happens if I don’t specify wall-time?
Similar to jobs with an unspecified QoS, wall-time is assigned to a job based on other job specifications, such as QoS or partition. Specifying a QoS or partition in a job submission will result in the default wall-time associated with those flags. If no QoS, partition, or wall-time is specified, the job by default is placed in the Normal queue with a 3-day wall-time.
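For example, a wall-time can be set explicitly in a batch script with the --time directive; the 2-day value below is only illustrative, and projectname is a placeholder.
#SBATCH --account=projectname
#SBATCH --time=2-00:00:00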
Do I need to specify a partition?
If you are requesting a GPU, you must also specify a partition with GPU nodes. Otherwise, you are not required to specify a partition. Users requesting GPUs should use a --gres=gpu:# or --gpus-per-node flag AND a --partition flag in their job submission.
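As a sketch, a single-GPU request might look like the following; mb-gpu is a placeholder partition name, so substitute a GPU partition listed in the published Slurm Policy.
salloc -A projectname --partition=mb-gpu --gres=gpu:1 -t 4:00:00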
Why can’t I request an OnDemand job for more than 8 hours?
To encourage users to request only the time they need, all interactive jobs, including those requested through OnDemand, have been limited to 8 hours in length. Please specify a time under 8 hours in the OnDemand web form.
My job has been sitting in queue for a very long time without running. Why?
This is usually the result of the specified walltime. If you have requested a walltime over 3 days (for example, 7 days) using the --time or -t flag, your job will be placed in the “long” queue, which may result in a longer wait time. If your job doesn’t require 7 days, please try specifying a shorter walltime (ideally under 3 days). This should result in your job being placed in a queue with a shorter wait time.
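For example, in a batch script, lowering the request keeps the job out of the long queue; the 2-day value is only illustrative.
# Instead of a 7-day request such as:
# #SBATCH --time=7-00:00:00
# request only what the job needs (3 days or less stays in the Normal queue):
#SBATCH --time=2-00:00:00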
Error Message Troubleshooting
...
sbatch/salloc: error: Interactive jobs cannot be longer than 8 hours
...
salloc -A projectname -t 8:00:00
I can no longer request an OnDemand job for more than 8 hours
...
sbatch/salloc: error: You didn't specify a project account (-A,--account). Please open a ticket at arcc-help@uwyo.edu for help
If accompanied by “sbatch/salloc: error: Batch job submission failed: Invalid account or account/partition combination specified”, it’s likely you need to specify an account in your batch script or salloc command, or the account name provided after the -A or --account flag is invalid. The account flag should specify the name of the project in which you’re running your job. Example: salloc -A projectname -t 8:00:00
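In a batch script, the equivalent directives look like this, where projectname is a placeholder for your actual project/account name and the time is illustrative.
#SBATCH --account=projectname
#SBATCH --time=8:00:00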
sbatch/salloc: error: Use of --mem=0 is not permitted. Consider using --exclusive instead
Users may no longer request all memory on a node using the --mem=0 flag and are encouraged to request only the memory they require to run their job. If you know you need an entire node, replace the --mem=0 specification in your job with --exclusive to get use of the entire node and all of its resources.
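For example, in a batch script the change might look like this; the 64G value is only an illustration of requesting what you actually need.
# Previously: #SBATCH --mem=0   (no longer permitted)
# If you truly need the entire node and all of its resources:
#SBATCH --exclusive
# Otherwise, request only the memory your job needs, e.g.:
# #SBATCH --mem=64G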
...
Users must specify the interactive or debug queue, or a time under 8 hrs when requesting an interactive job.
sbatch/salloc: error: Job submit/allocate failed: Invalid qos specification
Users should specify a walltime that is within the limit for their specified queue:
Debug (<= 1 hr)
Interactive (<= 8 hrs)
Fast (<= 12 hrs)
Normal (<= 3 days)
Long (<= 7 days)
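For example, a job intended for the Fast queue should request a QoS and walltime that match; the lowercase QoS name below is an assumption, so confirm the exact names in the published Slurm Policy.
#SBATCH --qos=fast
#SBATCH --time=12:00:00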
...
This may occur for a number of reasons, but it is most likely due to the combination of nodes and hardware you’ve requested and whether that hardware is available on the requested node/partition. If you need assistance, please e-mail arcc-help@uwyo.edu with the location of the batch script or the salloc command you’re attempting to run, and the error message you receive.
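If you would like to check which partitions, nodes, and GPUs (GRES) are available before e-mailing, Slurm’s standard sinfo command can report them; the format string below selects the partition, node list, and GRES columns.
sinfo -o "%P %N %G"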
My job has been sitting in queue for a very long time without running
This is usually the result of the specified walltime. If you have requested a walltime over 3 days (for example, 7 days) using the --time or -t flag, your job will be placed in the “long” queue, which may result in a longer wait time. If your job doesn’t require 7 days, please try specifying a shorter walltime (ideally under 3 days). This should result in your job being placed in a queue with a shorter wait time.