Projects Built Upon or Using PyTorch
A growing number of open source projects use the PyTorch ecosystem as their underlying machine learning engine, or integrate with it directly and then build upon and extend it. These projects are typically based around LLMs.
For example:
Issues and Expectations
ARCC is receiving a growing number of questions and issues about these projects not running as expected.
Please bear the following in mind before submitting questions/issues:
You need to understand what platforms these open source projects have been tested on, and which they haven’t.
Remember that the A30, the L40S and the H100 are three different GPU architectures (Ampere, Ada Lovelace and Hopper respectively).
Just because something runs on one architecture doesn’t necessarily mean it’ll work on newer or older architectures. Please check and ask the developer.
Projects will hopefully state clearly their requirements and the versions they’ve been developed with.
Be aware that we have only tested and can verify (at the time of writing) PyTorch 2.4.1 under three versions of CUDA (11.8, 12.1, 12.4). If your project is using different versions, then we’d suggest running similar stripped-down tests of PyTorch/CUDA to confirm your base environment works.
Be aware of GPU and CUDA capabilities, and that some versions of PyTorch will not run with older versions of CUDA. For example: https://github.com/pytorch/pytorch/issues/110153 “You need CUDA11.8 or newer for H100 support. You installed the 11.7 binaries. We actually dropped support for 11.7 in the latest release anyhow.”
Please look through the issue tracker of any project for known (similar) problems. If there are related issues then you will need to discuss/work with the developer. ARCC cannot fix bugs related to the projects.
If you update an environment, then you need to track what you updated and how, and then understand how it might affect the original environment used by the developer/project. If you change it, understand its effects.
If you come to ARCC, we will typically try to replicate the issue, which involves creating our own version of the environment. You should be able to provide exact details so we can build and replicate. If you can’t, then it will be much harder for us to assist.
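As an illustration of the kind of stripped-down test and environment detail described above, here is a minimal sketch in Python. It is not an official ARCC script: the function name collect_environment_report and the layout of the report are hypothetical, chosen only for this example. It assumes it is run inside the environment you want to check, on a node with a GPU allocated, and it records the PyTorch version, the CUDA version PyTorch was built against, the compute capability of each visible GPU, and the result of one small matrix multiplication on the GPU.

```python
import json
import platform


def collect_environment_report():
    """Gather the version details needed to rebuild this environment elsewhere.

    Hypothetical helper for illustration; the keys of the report are our own choice.
    """
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "torch": None,
        "torch_cuda_build": None,
        "cuda_available": False,
        "gpus": [],
        "compiled_arch_list": [],
        "smoke_test": "not run",
    }
    try:
        import torch
    except ImportError:
        report["smoke_test"] = "skipped: torch is not installed"
        return report

    report["torch"] = torch.__version__
    # CUDA version the PyTorch binaries were built against (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        report["smoke_test"] = "skipped: no CUDA device visible"
        return report

    # GPU architectures (for example sm_80, sm_90) this PyTorch build ships kernels for.
    report["compiled_arch_list"] = torch.cuda.get_arch_list()
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        report["gpus"].append({
            "index": index,
            "name": torch.cuda.get_device_name(index),
            "compute_capability": f"{major}.{minor}",
        })

    # Stripped-down test: one small matrix multiplication on the default GPU.
    try:
        a = torch.rand(256, 256, device="cuda")
        b = torch.rand(256, 256, device="cuda")
        result = a @ b
        torch.cuda.synchronize()
        finite = bool(torch.isfinite(result).all())
        report["smoke_test"] = "passed" if finite else "failed: non-finite values"
    except RuntimeError as error:
        report["smoke_test"] = f"failed: {error}"
    return report


if __name__ == "__main__":
    print(json.dumps(collect_environment_report(), indent=2))
```

If the compute capability reported for a GPU has no matching entry in compiled_arch_list, that is a likely sign the installed PyTorch/CUDA build does not support that GPU. Saving this output, together with the output of pip freeze or conda list, before and after any change to the environment gives a record of what changed and the details needed to rebuild it.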