PyTorch Based Environments and Issues
Projects Built Upon or Using PyTorch
A growing number of open source projects use the PyTorch ecosystem as their underlying machine learning engine, or integrate with it directly, building upon and extending it - typically based around LLMs.
For example:
Issues and Expectations
When submitting any issue, please be mindful of Submitting Useful Tickets via the Portal.
ARCC is receiving a growing number of questions and issues when these projects do not appear to run as expected.
Please bear the following in mind before submitting questions/issues:
You need to understand which platforms these open source projects have been tested on, and which they haven't.
Remember that the A30, the L40S, and the H100 are all different GPU architectures.
Just because something runs on one architecture doesn't necessarily mean it'll work on newer/older architectures. Please check, and ask the developer.
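As a quick way to see which architecture you have actually been allocated, PyTorch can report each device's compute capability. The sketch below assumes PyTorch is installed in your environment; the architecture names in the comment are for reference (A30 is compute capability 8.0, L40S is 8.9, H100 is 9.0):

```python
def gpu_report():
    """Return one line per visible GPU, or the reason none are usable."""
    try:
        import torch
    except ImportError:
        return ["torch not installed"]
    if not torch.cuda.is_available():
        return ["CUDA not available on this node"]
    lines = []
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        # e.g. A30 -> sm_80 (Ampere), L40S -> sm_89 (Ada), H100 -> sm_90 (Hopper)
        lines.append(f"GPU {i}: {torch.cuda.get_device_name(i)} (sm_{major}{minor})")
    return lines

print("\n".join(gpu_report()))
```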
Projects will hopefully clearly state their requirements and the versions they've been developed with.
Be aware that we have only tested and can verify (at the time of writing) PyTorch 2.4.1 under three versions of CUDA (11.8, 12.1, 12.4). If your project uses different versions, we'd suggest running similar stripped-down tests of PyTorch/CUDA to confirm your base environment works.
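A stripped-down test of this kind can be very small. The sketch below assumes PyTorch is installed in the environment you want to verify; it reports the versions in use and runs a single matrix multiply, falling back to the CPU if no GPU is allocated:

```python
def sanity_check(size=256):
    """Minimal PyTorch/CUDA sanity test: report versions, run one matmul."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)
    y = x @ x  # a single matrix multiply exercises the compute stack
    if device == "cuda":
        torch.cuda.synchronize()  # surface any asynchronous CUDA errors now
    return (f"torch {torch.__version__} (built for CUDA {torch.version.cuda}); "
            f"matmul of {tuple(y.shape)} OK on {device}")

print(sanity_check())
```

If this fails, your base environment is broken before your project's code is even involved, which narrows the problem considerably.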
Be aware of GPU and CUDA capabilities, and that some versions of PyTorch will not run with older versions of CUDA. For example, from H100 Compatibility - PyTorch with CUDA 12.2 and lower · Issue #110153 · pytorch/pytorch: "You need CUDA11.8 or newer for H100 support. You installed the 11.7 binaries. We actually dropped support for 11.7 in the latest release anyhow."
Please search a project's issue tracker for known (similar) problems. If there are related issues, you will need to discuss/work with the developer. ARCC cannot fix bugs in the projects themselves.
If you update an environment, you need to track what you updated and how, and understand how it might affect the original environment used by the developer/project. If you change it, understand its effects.
If you change your base code and something stops working, debug your code first. Is there debug logging you can turn on? ARCC typically does not debug your failing code.
If you come to ARCC, we will typically try to replicate the issue, which involves creating our own version of the environment. You should be able to provide exact details so we can build and replicate it. If you can't, it will be much harder for us to assist.
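The exact details can be gathered mechanically. A sketch of the kind of summary to include in a ticket (package lists and versions will of course differ on your system):

```python
import platform
import subprocess
import sys

def environment_summary():
    """Collect the details needed to replicate a Python/PyTorch environment."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    try:
        import torch
        lines.append(f"torch: {torch.__version__} (built for CUDA {torch.version.cuda})")
    except ImportError:
        lines.append("torch: not installed")
    # The full package list lets ARCC rebuild the environment exactly;
    # attach the complete `pip freeze` output to your ticket.
    freeze = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                            capture_output=True, text=True)
    lines.append(f"packages: {len(freeze.stdout.splitlines())} installed")
    return lines

print("\n".join(environment_summary()))
```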
Hardware can fail. ARCC monitors for and detects hardware issues, and will remove a compute node if we see a problem and get it repaired in a timely manner. But if previously running code is now failing, and you haven't changed anything, hardware failure might be the cause.
If you are using conda environments, make sure you understand how they work, especially if you are using a local version of, say, miniconda that you've installed yourself. See:
Memory Issues
Remember, GPU devices only have a finite amount of memory, just as CPUs do.
If you are running out of CUDA memory, then:
Check how much you’ve requested - can you request more?
If you’re using the full amount, then either:
Look at reducing your batch size or the size of your data set.
Request a GPU device with more memory.
Use the nvidia-smi tool to monitor memory utilization across your allocated devices.