Projects Built Upon or Using PyTorch

A growing number of open source projects use the PyTorch ecosystem as their underlying machine learning engine, or integrate directly with it and then build upon and extend it. These projects are typically based around LLMs.

For example:

Issues and Expectations

ARCC is receiving a growing number of questions and issues about these projects when they appear not to run as expected.

Please bear the following in mind before submitting questions or issues:

  • You need to understand which platforms these open source projects have been tested on, and which they haven’t.

    • Remember that the A30 (Ampere), the L40S (Ada Lovelace) and the H100 (Hopper) are three different GPU architectures.

    • Just because something runs on one architecture doesn’t necessarily mean it’ll work on newer or older architectures. Please check and ask the developer.

  • Projects will hopefully state clearly their requirements and the versions they’ve been developed with.

    • Be aware that, at the time of writing, we have only tested and can only verify PyTorch 2.4.1 under three versions of CUDA (11.8, 12.1, 12.4). If your project uses different versions, we suggest running similar stripped-down tests of PyTorch/CUDA to confirm that your base environment works.

    • Be aware of GPU and CUDA capabilities, and that some versions of PyTorch will not run with older versions of CUDA. For example, from https://github.com/pytorch/pytorch/issues/110153: “You need CUDA11.8 or newer for H100 support. You installed the 11.7 binaries. We actually dropped support for 11.7 in the latest release anyhow.”

  • Please look through the issue tracker of any project for known (similar) problems. If there are related issues, you will need to discuss or work through them with the developer. ARCC cannot fix bugs in these projects.

  • If you update an environment, you need to track what you updated and how, and understand how the change might affect the original environment used by the developer/project. If you change it, understand the effects.

  • If you change your base code and something stops working, debug your code first. Is there debug logging you can turn on? ARCC typically does not debug your failing code.

  • If you come to ARCC, we will typically try to replicate the issue, which involves creating our own version of the environment. You should be able to provide exact details so we can build it and replicate the problem. If you can’t, it will be much harder for us to assist.

  • Hardware can fail. ARCC monitors for hardware issues, and will remove a compute node if we see a problem and get it repaired in a timely manner. But if previously running code is now failing - and you haven’t changed anything - this might be the cause.
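
As an illustration of the stripped-down PyTorch/CUDA test and the "exact details" mentioned above, the following is a minimal sketch, not an official ARCC tool. The function name collect_env_report and the report keys are hypothetical. It assumes it is run inside the environment your project uses, ideally on a GPU node; if PyTorch or a GPU is missing, the report simply says so.

```python
import json
import platform

try:
    import torch  # may be missing, e.g. outside your project environment
except ImportError:
    torch = None


def collect_env_report():
    """Gather version and GPU details worth attaching to a support request."""
    report = {
        "python": platform.python_version(),
        "torch_version": None,
        "torch_built_with_cuda": None,
        "cuda_available": False,
        "devices": [],
    }
    if torch is None:
        return report

    report["torch_version"] = str(torch.__version__)
    report["torch_built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    # Architectures this PyTorch binary was compiled for, e.g. "sm_80", "sm_90"
    built_archs = torch.cuda.get_arch_list()
    report["built_archs"] = built_archs
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        report["devices"].append({
            "index": index,
            "name": torch.cuda.get_device_name(index),
            "compute_capability": f"{major}.{minor}",
            # False is a hint to investigate, not proof of failure: binaries
            # built for a nearby architecture can sometimes still run.
            "arch_in_build": f"sm_{major}{minor}" in built_archs,
        })

    # Tiny end-to-end check: one matrix multiply on the default GPU
    x = torch.rand(256, 256, device="cuda")
    report["matmul_ok"] = bool(torch.isfinite(x @ x).all())
    return report


if __name__ == "__main__":
    print(json.dumps(collect_env_report(), indent=2))
```

Saving the printed JSON alongside your job script gives you a record of the environment before and after any update, and something concrete to include if you contact ARCC.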

Memory Issues

Remember, GPU devices only have a finite amount of memory, the same as CPUs.

If you are running out of CUDA memory, then:

  • Check how much you’ve requested - can you request more?

  • If you’re using the full amount, then either:

    • Look at reducing the size of your data set.

    • Request a GPU device with more memory.

  • Use the nvidia-smi tool to monitor memory utilization across your allocated devices.
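
One way to monitor memory with nvidia-smi from a script is sketched below. This is an illustrative example under stated assumptions: the helper names parse_gpu_memory and query_gpu_memory are hypothetical, it assumes nvidia-smi is on your PATH on the GPU node, and it assumes every queried field is reported as a plain number (some configurations may report values it cannot parse).

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB
QUERY_FIELDS = "index,name,memory.used,memory.total"


def parse_gpu_memory(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
            "used_pct": round(100 * int(used) / int(total), 1),
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi if it is available; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['used_mib']} / {gpu['total_mib']} MiB "
              f"({gpu['used_pct']}%)")
```

Calling query_gpu_memory() periodically while your job runs shows how close you are to the device limit. From inside a PyTorch process, torch.cuda.mem_get_info() returns the free and total memory in bytes for the current device, which can complement the nvidia-smi view.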
