Known Issues

Overview

The purpose of this page is to inform users of ongoing issues known to ARCC. Here we detail what each problem is and what we are doing to remedy it. If you come across an issue that is not detailed on this page, please inform ARCC by emailing arcc-help@uwyo.edu.

 


Overwhelming the SLURM controller

Dec 9, 2025

While it may not result in an error message or impact your jobs on the cluster, submitting too many jobs at a time to SLURM can overwhelm the job scheduler.

If ARCC staff contact you about the number of jobs you are running within a short window of time, try encapsulating the jobs into a job array. If that is not possible or does not fix the issue, and you are running sbatch calls from a loop or similar mechanism to spawn a large number of jobs automatically, a reasonable alternative is to add a delay with the sleep command between sbatch calls. This reduces the load of requests on the SLURM controller.
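The sleep-based approach described above might look like the following minimal sketch. The function name submit_with_delay and the variables SUBMIT_CMD and SUBMIT_DELAY are illustrative names chosen for this example, not ARCC or SLURM settings, and the one-second default delay is an assumption rather than a recommended value.

```shell
#!/bin/bash
# Submit each job script in turn, pausing between submissions so the
# SLURM controller is not hit with a burst of requests.
# SUBMIT_CMD and SUBMIT_DELAY are hypothetical names used only in this sketch.
submit_with_delay() {
    local delay="${SUBMIT_DELAY:-1}"      # seconds between submissions (assumed default)
    local submit="${SUBMIT_CMD:-sbatch}"  # set SUBMIT_CMD=echo for a dry run
    local script
    for script in "$@"; do
        "$submit" "$script" || return 1   # stop at the first failed submission
        sleep "$delay"
    done
}

# Example (commented out so that sourcing this file submits nothing):
# submit_with_delay jobs/*.sh
```

Where the jobs differ only by an index, a job array is usually the simpler option: a single call such as sbatch --array=1-100 job.sh (the range here is only an example) submits all tasks at once, and each task can read its index from the SLURM_ARRAY_TASK_ID environment variable.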


MaxJobCount limit reached (75000)

Dec 8, 2025

If the total number of cluster jobs exceeds SLURM’s maximum limit, as defined by SLURM configuration on the cluster, your job(s) will not be submitted and the cluster will not have a record of your submission.

The SLURM MaxJobCount configuration is defined here: https://slurm.schedmd.com/slurm.conf.html#OPT_MaxJobCount. Information for SLURM on Medicinebow may be found here: Slurm Workload Manager

If you reach the maximum job count limit, the issue may resolve itself as jobs complete on the cluster. Users can check the total number of queued and running jobs with the command: squeue -h | wc -l
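As a rough sketch, a submission script could compare that count against the limit before calling sbatch. The function name queue_has_room and the variables MAX_JOBS and HEADROOM are hypothetical names for this example; the 75000 default mirrors the limit named in this section's heading, and the headroom of 1000 is an assumed safety margin. The line count from squeue should be treated as an approximation of what SLURM counts toward MaxJobCount.

```shell
#!/bin/bash
# Succeed (exit status 0) if the given job count leaves room below the limit,
# fail (exit status 1) otherwise.
# MAX_JOBS and HEADROOM are hypothetical names used only in this sketch.
queue_has_room() {
    local count="$1"
    local max_jobs="${MAX_JOBS:-75000}"   # limit named in this section
    local headroom="${HEADROOM:-1000}"    # assumed safety margin
    [ "$count" -lt $((max_jobs - headroom)) ]
}

# Example (commented out so that sourcing this file runs nothing):
# count=$(squeue -h | wc -l)
# if queue_has_room "$count"; then
#     sbatch job.sh
# else
#     echo "Queue is near the limit ($count jobs); try again later."
# fi
```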

 


Failure to Submit OnDemand Session: ‘medicinebow’ can’t be reached right now

November 17, 2025

Some interactive sessions requested through Open OnDemand (the Medicinebow Web Access Portal) are never launched. After filling out the web form and clicking “Launch”, users receive the following error message:
Failed to submit session with the following error: sbatch: error: ‘medicinebow’ can’t be reached now or is an invalid entry for --cluster.

Symptoms:

When requesting an OnDemand interactive session through the Medicinebow web portal, users receive the error shown in the screenshot below after clicking the “Launch” button to submit the session request:

error2.jpeg

What to do:

Please email arcc-help@uwyo.edu with your username, your project/account, and the date and time you received the error message or tried to submit your job. This information helps our team review the logs to determine the cause and mitigate the issue going forward.

 


Job(s) Being Held Messages

November 11, 2025

Some jobs are being placed in a held pending state, which we are currently troubleshooting. This does not happen to every job, project, or user, and it appears to be random. This is not a user problem but a configuration problem that we are attempting to track down.

Symptoms:

JobHeldAdmin

user_env_retrieval_failed_requeued_held

What to do:

Please email arcc-help@uwyo.edu with the message you are getting, your job ID, and the most precise time you can give for when you submitted the job. This information helps us check a number of different configurations and track down the problem. Additionally, please try canceling the pending job(s) and resubmitting them to see if they run. If that does not work and it is during business hours, ARCC admins will release the jobs from the held state.
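To collect the job IDs for such a report, one option is to filter squeue output by the pending reason. The sketch below assumes squeue is asked for two columns, job ID and reason (format string "%i %r"), and that the reason appears exactly as the messages listed under Symptoms; the function name held_job_ids is a hypothetical name for this example.

```shell
#!/bin/bash
# Read "JOBID REASON" lines on stdin and print the IDs of jobs whose reason
# matches one of the messages listed under Symptoms.
# held_job_ids is a hypothetical helper name used only in this sketch.
held_job_ids() {
    awk '$2 == "JobHeldAdmin" || $2 == "user_env_retrieval_failed_requeued_held" { print $1 }'
}

# Example (commented out so that sourcing this file runs nothing):
# squeue -h -u "$USER" -t PENDING -o "%i %r" | held_job_ids
# Each listed job can then be canceled with scancel followed by its job ID
# and resubmitted with sbatch.
```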

 


Nano Segmentation fault error

Running the nano command results in a crash with the following output:

[user@mblog1 ~]$ nano
Segmentation fault (core dumped)

This occurs because the nano installation depends on an older version of ncurses than the one loaded for other modules. When you load modules that depend on the newer version of ncurses, nano crashes.

Workaround

Open a new, separate SSH session in a new window, then run module purge to remove the ncurses package dependencies. Then run nano as normal.

 


Jupyter Lab

18 Jan. 2023. Symptoms:

There is an issue with Jupyter that is preventing users from deleting files from within Jupyter. ARCC is working on this.
In the meantime users can:

  1. Go to the Southpass window.

  2. Click ‘Files’ -> /project/atsc5009.

  3. Select the offending files and click ‘Delete’.

 


MPI Fail

6 Jan. 2023. Symptoms:

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This should not happen.
Your Open MPI job may now hang or fail.

This is an intermittent issue with MPI that ARCC is working to resolve. If you encounter it, the current workaround is to restart your job.

Credential Caching

Description

Occasionally, some users' credentials are cached on Beartooth in a way that prevents them from logging in.

Solution

We have a script in place that clears the cache on the system every hour, on the hour. If you were previously able to log in to Beartooth, now cannot, and cannot wait for the hourly script, we can also run it manually.


arccquota Error

25 Oct. 2022: Symptoms:

Traceback (most recent call last):
  File "/apps/s/arcc/latest/bin/arccquota", line 336, in <module>
    for each_p in _user_projs[each_u]:

This is a known issue that ARCC is working to resolve. It should have no impact on your work.


SouthPass:

3/16/22: An issue has been found with running salloc and srun on Southpass/OnDemand interactive desktops. ARCC is working to resolve this.