Using PyTorch on Beartooth
ARCC is aware that the exact details and versions presented here are out-of-date, but the general process is still valid.
We will endeavor to update this page as soon as we can.
Overview
PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, and primarily developed by Facebook's AI Research lab. It is free and open-source software released under the Modified BSD license.
PyTorch is a Python package that provides two high-level features:
Tensor computation (like NumPy) with strong GPU acceleration
Deep neural networks built on a tape-based autograd system
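Both features can be seen in a few lines: tensors behave much like NumPy arrays, and operations on tensors created with requires_grad=True are recorded on the tape so gradients can be computed automatically. A minimal CPU-only sketch:

```python
import torch

# Tensor computation, NumPy-style
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = a * 2 + 1          # elementwise arithmetic
print(b.sum())         # tensor(24.)

# Tape-based autograd: operations on x are recorded,
# and backward() replays the tape to compute gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()     # y = 4 + 9 = 13
y.backward()
print(x.grad)          # dy/dx = 2x -> tensor([4., 6.])
```

On a GPU node the same operations run on the GPU once a tensor has been moved there, e.g. with a.cuda() or a.to("cuda").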
Using PyTorch with GPU on Teton
Here we describe setting up a Conda environment, with pytorch installed, that allows you to run related source code and utilize GPUs.
The basic environment will:
Step through creating a basic Conda environment.
Provide a template for a bash script to submit jobs using sbatch.
Provide a very simple script that tests that PyTorch can be imported and can identify the allocated GPU.
Note:
This is a short page and assumes some familiarity with using Conda. The “Package and Dependency Management with Conda” training materials can be found on ARCC’s Training/Consultation page.
The installation of pytorch within the conda environment will also install related dependencies, but nothing else. Since you’re creating the conda environment, you can extend it and install other packages. You can view the installed conda packages by running conda list while in an active environment.
The bash script only uses a single node and single core. It is up to the user to explore other configurations.
In the scripts and examples below, please remember to appropriately edit to use your account, email address, folder locations etc.
Creating the Conda Environment
Setup the basic Conda environment to run with python version 3.8:
cd /project/arcc/salexan5/conda/gpu/pytorch
module load miniconda3/4.3.30
conda create -p pytorch_env python=3.8

There are a number of conda options for how/where to install an environment. In this case, -p will create an environment called pytorch_env in the folder you’re running the command from. Once setup is complete, make a note of the installation message that indicates how to activate your environment when you want to use it.
# To activate this environment, use:
# > source activate /pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env
#
# To deactivate an active environment, use:
# > source deactivate

Activate your environment, and install the pytorch related packages. Once installation has finished, deactivate your environment.
source activate /pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
source deactivate

Bash Script to use with sbatch
Below is a basic template to use that you’ll need to insert your account and email details into.
#!/bin/bash
#SBATCH --account=<your_arcc_project>
#SBATCH --time=0:10:00
#SBATCH --job-name=pytorch_test
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your_email>
#SBATCH --output=pytorch_%A.log
#SBATCH --mem=8G
#SBATCH --partition=moran-bigmem-gpu
#SBATCH --gres=gpu:k80:1
echo "Load Modules:"
module load swset/2018.05
module load cuda/10.1.243
module load miniconda3/4.3.30
echo "Check GPU Allocation:"
echo "CUDA Visible Devices:" $CUDA_VISIBLE_DEVICES
echo "Running nvidia-smi:"
srun nvidia-smi -L
nvcc --version
echo "Activate Conda Environment"
source activate /pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env
python --version
echo "- - - - - - - - - - - - - - - - - - - - -"
srun python pytorch_test.py
echo "- - - - - - - - - - - - - - - - - - - - -"
echo "Deactivate Conda:"
source deactivate
echo "Done"

Simple Source Code Example
Below is some very simple source code that will test that your environment and GPU request are functioning properly.
It simply imports the torch package, and then uses it to check that it can identify the allocated GPU(s). To work with the bash script above, save this file as pytorch_test.py.
import torch
print("PyTorch Version: " + str(torch.__version__))
print("Cuda Available: " + str(torch.cuda.is_available()))
print("Device Name: " + str(torch.cuda.get_device_name(0)))
print("Device Count: " + str(torch.cuda.device_count()))
print("Device(0): " + str(torch.cuda.device(0)))
print("Device Current: " + str(torch.cuda.current_device()))

Requesting GPUs and Testing
We have a variety of GPUs on Teton, and depending on which you require, you'll need to adjust your bash script. The reason for the srun nvidia-smi -L within the bash script is that it will print output confirming the GPU configuration you’ve requested.
Below demonstrates the bash options for each GPU, as well as what you’d see from running the nvidia-smi -L command and source code from the bash script:
#SBATCH --partition=moran-bigmem-gpu
#SBATCH --gres=gpu:k80:1
Running nvidia-smi:
GPU 0: Tesla K80 (UUID: GPU-53acbde2-ec88-e8fa-d477-719e700fb22f)
PyTorch Version: 1.6.0
Cuda Available: True
Device Name: Tesla K80
Device Count: 1
Device(0): <torch.cuda.device object at 0x2ac6aadc6e80>
Device Current: 0

#SBATCH --partition=teton-gpu
#SBATCH --gres=gpu:p100:1
Running nvidia-smi:
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-3dc86d50-5ad9-21f2-db08-78f1b6aafb5d)
PyTorch Version: 1.6.0
Cuda Available: True
Device Name: Tesla P100-PCIE-16GB
Device Count: 1
Device(0): <torch.cuda.device object at 0x2aee95755eb0>
Device Current: 0

Depending on what you need, and available resources, you can also request multiple GPUs.
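The pytorch_test.py script above only inspects device 0. A looped variant that steps through every allocated GPU might look like the following sketch (this modified version is our illustration, not part of the original script; with no GPUs allocated, the loop simply runs zero times):

```python
import torch

print("PyTorch Version: " + str(torch.__version__))
print("Cuda Available: " + str(torch.cuda.is_available()))
print("Device Count: " + str(torch.cuda.device_count()))

# Step through every GPU the scheduler has allocated to this job.
for i in range(torch.cuda.device_count()):
    print("Device(" + str(i) + "): " + str(torch.cuda.device(i)))
    print("Device Name: " + str(torch.cuda.get_device_name(i)))

if torch.cuda.is_available():
    print("Device Current: " + str(torch.cuda.current_device()))
```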
#SBATCH --partition=moran-bigmem-gpu
#SBATCH --gres=gpu:k80:2
Running nvidia-smi:
GPU 0: Tesla K80 (UUID: GPU-53acbde2-ec88-e8fa-d477-719e700fb22f)
GPU 1: Tesla K80 (UUID: GPU-4529ea7c-6085-9b22-ebdf-07f39556d0f7)
# With a little modification to the test script you can step through each GPU device.
PyTorch Version: 1.6.0
Cuda Available: True
Device Count: 2
Device(0): <torch.cuda.device object at 0x2b4c5c160ac0>
Device Name: Tesla K80
Device(1): <torch.cuda.device object at 0x2b4c5c160ac0>
Device Name: Tesla K80
Device Current: 0

PyTorch + k20/k40 GPUs
The installed version of PyTorch (1.6) does not work on some of our earlier GPUs.
#SBATCH --gres=gpu:1
Running nvidia-smi:
GPU 0: Tesla K20m (UUID: GPU-6b95c19a-916e-f488-d5e0-1d87f752ffe6)
1.6.0
Cuda Available: True
Device Name: Tesla K20m
Device Count: 1
Device(0): <torch.cuda.device object at 0x2b0b6572ae80>
Device Current: 0
/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env/lib/python3.8/site-packages/torch/cuda/__init__.py:125: UserWarning:
Tesla K20m with CUDA capability sm_35 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the Tesla K20m GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

If you need to use these GPUs, then consider installing an older version of PyTorch, and/or contact ARCC and we can assist.
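To see which compute capability PyTorch reports for an allocated card (useful when diagnosing the warning above), you can query the device directly. A sketch, which prints a fallback message on CPU-only nodes:

```python
import torch

if torch.cuda.is_available():
    # Returns a (major, minor) tuple, e.g. (3, 5) for a Tesla K20m
    major, minor = torch.cuda.get_device_capability(0)
    print("Compute capability: sm_" + str(major) + str(minor))
else:
    print("No CUDA device visible to PyTorch")
```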
Running/Testing in an Interactive Session
If you’re just exploring, trying things out, and/or performing tests, then you can just as easily use an interactive session. Below is an example using salloc. Notice the steps are the same as if running a bash script via sbatch.
Once logged onto one of the login nodes, request an interactive session:
[salexan5@tlog2 ~]$ salloc --account=arcc --time=01:00:00 -N 1 -c 1 --partition=moran-bigmem-gpu --gres=gpu:k80:1
salloc: Granted job allocation 9974878

Load the modules you require. Since we’re using GPUs, we need to load the appropriate NVIDIA drivers.
[salexan5@mbm01 ~]$ module load miniconda3/4.3.30
[salexan5@mbm01 ~]$ module load cuda/10.1.243

If you want, you can check that the requested GPU has been allocated.
[salexan5@mbm01 ~]$ srun nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-53acbde2-ec88-e8fa-d477-719e700fb22f)

Because we are using a Conda environment, we need to activate it.
[salexan5@mbm01 ~]$ source activate /pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env

Navigate to the folder containing the source code and then run it:
(/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env) [salexan5@mbm01 ~]$ cd /project/arcc/salexan5/conda/gpu/pytorch/
(/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env) [salexan5@mbm01 pytorch]$ srun python pytorch_test.py
PyTorch Version: 1.6.0
Cuda Available: True
Device Name: Tesla K80
Device Count: 1
Device(0): <torch.cuda.device object at 0x2abf1d5fde50>
Device Current: 0

Once finished, deactivate the Conda environment and cancel your interactive session.
(/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env) [salexan5@mbm01 pytorch]$ source deactivate
[salexan5@mbm01 pytorch]$ scancel 9974878
salloc: Job allocation 9974878 has been revoked.
[salexan5@mbm01 pytorch]$ srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: m029: task 0: Killed
srun: Terminating job step 9974853.0
[salexan5@tlog2 ~]$
GPU Not Found/Detected
Remember to prefix the line where you call your application/program with srun. This runs your program as a job step that is granted access to the GPU allocation you requested.
If you forget, then you’ll see a warning like the following:
python pytorch_test.py
PyTorch Version: 1.6.0
Cuda Available: False
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCGeneral.cpp line=47 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
File "pytorch_test.py", line 5, in <module>
print("Device Name: " + str(torch.cuda.get_device_name(0)))
File "/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env/lib/python3.8/site-packages/torch/cuda/__init__.py", line 293, in get_device_name
return get_device_properties(device).name
File "/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/pfs/tsfs1/project/arcc/salexan5/conda/gpu/pytorch/pytorch_env/lib/python3.8/site-packages/torch/cuda/__init__.py", line 190, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1595629411241/work/aten/src/THC/THCGeneral.cpp:47
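One way to make the test script fail gracefully rather than raising the traceback above is to check torch.cuda.is_available() before querying any device. A defensive sketch (not the original pytorch_test.py):

```python
import torch

print("PyTorch Version: " + str(torch.__version__))

if torch.cuda.is_available():
    print("Device Name: " + str(torch.cuda.get_device_name(0)))
    print("Device Count: " + str(torch.cuda.device_count()))
    print("Device Current: " + str(torch.cuda.current_device()))
else:
    # Typically means the program was launched without srun,
    # or that no GPU was requested in the job allocation.
    print("No CUDA-capable device detected: did you launch with srun?")
```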