Goal: Work through an exercise that brings together the various concepts covered within this workshop.

Exercise: Put Things Together

Exercise: Create a self-contained Conda environment that allows a user to run a Python file that uses PyTorch.

The environment should be created within a location that can be shared with other users within a project.
Make sure the environment can utilize GPUs running with CUDA 12.4.

Using the following pytorch_gpu_test.py script, create a bash script that submits a job to the cluster to run it within the Conda environment you created. The job should use one NVIDIA H100 GPU on a single compute node, with 8 cores and 16G of memory.
Any output should be written into a slurms subfolder, with a filename of the form pytorch_<job-id>.out.
Send yourself email notifications regarding the status of the job.

pytorch_gpu_test.py

import torch
import math

print("PyTorch Version: " + str(torch.__version__))
print("Cuda Available: " + str(torch.cuda.is_available()))
print("Device Count: " + str(torch.cuda.device_count()))
print("Device Name: " + str(torch.cuda.get_device_name(0)))

dtype = torch.float
device = torch.device("cuda:0")  # Run on the first GPU

# Create input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

Ignoring any cached packages, how much storage space does this Conda environment use?
Which version of Python is being used?

Exercise: Extend with Pandas

Exercise: I also want to use the Python Pandas package, specifically version 2.1.4, to assist with some further analysis.

Can I extend my existing Conda environment?
If so, how?
Should I?

Exercise: Suggested Answers: Create the Environment

Considerations:

Asking for a shared environment: Install within a /project/<project-name>/ location, for a project that you and the users you want to share with are members of.
Read the documentation!
- The installation process offers both conda and pip install directions. Since we want a self-contained environment, we would suggest using the conda install approach. If you use pip install, you would need to set the PYTHONUSERBASE environment variable.
- The directions indicate how to install for CUDA version 12.4.
Look at the Linux du command to calculate the storage a folder uses.
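
If du is not to hand, a rough equivalent can be sketched in Python. This is a minimal illustration rather than a replacement for du: the function names directory_size_bytes and human_readable are hypothetical, and the sketch sums apparent file sizes, so its total can differ from the disk usage that du reports.

```python
import os
import tempfile


def directory_size_bytes(root):
    """Sum the apparent size of every regular file below root.

    Symbolic links are skipped so that linked files are not counted twice.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def human_readable(num_bytes):
    """Format a byte count with binary units, similar in spirit to du -h."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Demonstration on a throwaway folder holding a single 2048-byte file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "a.bin"), "wb") as handle:
        handle.write(b"\0" * 2048)
    demo_total = directory_size_bytes(demo_dir)
    print(human_readable(demo_total))
```

To measure the environment itself, you would pass the path of the environment folder (for example, the pytorch_env folder created in this exercise) to directory_size_bytes.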

Example process:

[~]$ cd /project/<project-name>/software/pytorch
[]$ module purge
[]$ module load miniconda3/24.3.0
[]$ conda create -p pytorch_env
...
# To activate this environment, use
#
#     $ conda activate /cluster/medbow/project/<project-name>/software/pytorch/pytorch_env
#
# To deactivate an active environment, use
#
#     $ conda deactivate

[]$ conda activate /cluster/medbow/project/<project-name>/software/pytorch/pytorch_env
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda deactivate
[]$

Regarding the storage space used:

[pytorch]$ du -d 1 -h
6.3G    ./pytorch_env
6.3G

During the installation of PyTorch, notice the following (or something similar):

...
python-3.12.3              |hab00c5b_0_cpython        30.5 MB  conda-forge
...

You can confirm this within the activated environment:

(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ python --version
Python 3.12.3
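
The same check can be made from inside Python, which also shows which environment the interpreter belongs to. The following is a small sketch: the helper name describe_environment is hypothetical, and it assumes the PyTorch package registers itself under the distribution name torch. A package that cannot be found is reported as None.

```python
import sys
from importlib import metadata


def describe_environment(packages=("torch",)):
    """Return the Python version, the environment prefix and package versions."""
    info = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # For an activated Conda environment this is the environment folder.
        "prefix": sys.prefix,
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            info[name] = None
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the prefix printed is not the pytorch_env folder, the script is being run by a different Python than the one in your Conda environment.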

Exercise: Suggested Answers: Run the Code

Considerations:

Decide on your working directory where you’ll have the Python test script and the bash script to submit. Do you want to share this?
Look at the required resources:
- One H100 - which partition do you need to define? Remember: if you don’t ask, you don’t get.
- How do you request 8 cores and 16G of memory?
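What are all the steps you need to perform? You can test via an interactive salloc session, and then copy the steps into your bash submission script.
Slurm does not create missing folders in the --output path, so create the slurms subfolder before submitting.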

Suggested Submission Script: run.sh

#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --job-name=pytorch_test
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-addr>
#SBATCH --output=slurms/pytorch_%A.out
#SBATCH --partition=mb-h100
#SBATCH --gres=gpu:1

module purge
module load miniconda3/24.3.0

conda activate /cluster/medbow/project/<project-name>/software/pytorch/pytorch_env
python pytorch_gpu_test.py
conda deactivate
echo "Done."
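
To confirm that the job received what the #SBATCH lines asked for, a few lines of Python can report the allocation from within the job. This is a sketch under assumptions: the helper name slurm_allocation is hypothetical, and which of these environment variables Slurm sets depends on the options used and on how the cluster is configured, so any that are missing are shown as "<not set>".

```python
import os

# Environment variables that Slurm commonly sets inside a batch job.
ALLOCATION_VARIABLES = (
    "SLURM_JOB_ID",
    "SLURM_JOB_PARTITION",
    "SLURM_CPUS_PER_TASK",
    "SLURM_MEM_PER_NODE",
    "CUDA_VISIBLE_DEVICES",
)


def slurm_allocation(environ=None):
    """Collect the allocation-related variables from the given environment."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<not set>") for name in ALLOCATION_VARIABLES}


for name, value in slurm_allocation().items():
    print(f"{name}: {value}")
```

Run outside of a job, every value will most likely be "<not set>". Run from the submission script, the values can be compared against the requested 8 cores, 16G of memory and single GPU.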

If successful, your output should take the following form:

[]$ cat slurms/pytorch_2331610.out

The following modules were not unloaded:
  (Use "module --force purge" to unload all):
  1) slurm/latest   2) arcc/1.0

PyTorch Version: 2.4.0
Cuda Available: True
Device Count: 1
Device Name: NVIDIA H100 80GB HBM3
99 176.02212524414062
199 119.40426635742188
299 81.9596939086914
399 57.194664001464844
499 40.815223693847656
599 29.981685638427734
699 22.816429138183594
799 18.07718276977539
899 14.942441940307617
999 12.869009017944336
1099 11.497486114501953
1199 10.590255737304688
1299 9.990135192871094
1399 9.59315013885498
1499 9.330546379089355
1599 9.156827926635742
1699 9.041902542114258
1799 8.965866088867188
1899 8.915557861328125
1999 8.882278442382812
Result: y = -0.0008966787136159837 + 0.8645414710044861 x + 0.0001546924322610721 x^2 + -0.09443996101617813 x^3

Done.
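
Rather than reading the output by eye, the checks can be scripted. The sketch below is illustrative only: the function name check_job_output is hypothetical, and it assumes the output has the form shown above, with lines of "<step> <loss>" and a final "Done.". The sample text is a shortened excerpt of that output.

```python
import re

# Matches the "<step> <loss>" lines printed by pytorch_gpu_test.py.
LOSS_LINE = re.compile(r"^(\d+) ([0-9.eE+-]+)$", flags=re.MULTILINE)


def check_job_output(text):
    """Summarise whether the job output looks like a successful GPU run."""
    losses = [float(match.group(2)) for match in LOSS_LINE.finditer(text)]
    decreasing = len(losses) >= 2 and all(
        later < earlier for earlier, later in zip(losses, losses[1:])
    )
    return {
        "cuda_available": "Cuda Available: True" in text,
        "loss_count": len(losses),
        "loss_decreasing": decreasing,
        "finished": text.rstrip().endswith("Done."),
    }


sample = """PyTorch Version: 2.4.0
Cuda Available: True
Device Count: 1
Device Name: NVIDIA H100 80GB HBM3
99 176.02212524414062
199 119.40426635742188
299 81.9596939086914

Done.
"""

summary = check_job_output(sample)
print(summary)
```

For a real run, you would read the text from the file in the slurms subfolder and pass it to check_job_output.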

Exercise: Suggested Answers: Extend with Pandas

Considerations:

Can you add Pandas to the environment? Yes. You can always go back to an existing environment, activate it, and update it.
How will you install this package? Using conda install or pip install?
You’re explicitly asked for version 2.1.4. Is this version available? How can you check?
If you conda install, is there anything you need to take note of during the solving stage?

Because we want to keep our environment self-contained, we’d suggest first looking at using Conda.

Perform the following to check if it is available as a conda package:

(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda search pandas
Loading channels: done
# Name                       Version           Build  Channel
...
pandas                         2.1.4  py39hddac248_0  conda-forge
...
pandas                         2.2.2  py39hfc16268_1  conda-forge

Notice this package is within the conda-forge channel. Do you have this configured?

What details are reported during the solving stage of the conda install?

(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda install pandas==2.1.4
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    numpy-1.26.4               |  py312heda63a1_0         7.1 MB  conda-forge
    pandas-2.1.4               |  py312hfb8ada1_0        14.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        21.1 MB

The following NEW packages will be INSTALLED:
  pandas             conda-forge/linux-64::pandas-2.1.4-py312hfb8ada1_0
  ...

The following packages will be DOWNGRADED:
  numpy                               2.1.0-py312h1103770_0 --> 1.26.4-py312heda63a1_0

Notice: The numpy package was installed as part of the original torch install, but is going to be downgraded from 2.1.0 to 1.26.4!

This is one reason to avoid setting the always_yes option in the ~/.condarc file. If you have this set to yes, the installation would have continued without asking you to confirm.
How does this downgraded version potentially affect torch? Without testing, we don’t know.
We would suggest not downgrading. Instead, create a separate Conda environment for Pandas so that we do not run into potential dependency issues.
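
Downgrades are easy to miss in a long solve plan, so it can help to scan the plan text for them. The sketch below assumes the plan is laid out as in the output shown above, with a "The following packages will be DOWNGRADED:" heading followed by "name old-build --> new-build" lines. The function name find_downgrades is hypothetical, and other conda versions may format the plan differently.

```python
import re

# Matches lines such as "  numpy   2.1.0-py312h1103770_0 --> 1.26.4-py312heda63a1_0".
CHANGE_LINE = re.compile(r"^\s*(?P<name>\S+)\s+(?P<old>\S+)\s+-->\s+(?P<new>\S+)\s*$")


def find_downgrades(plan_text):
    """Return (package, old_version, new_version) for each downgrade in the plan."""
    downgrades = []
    in_downgrade_section = False
    for line in plan_text.splitlines():
        if line.startswith("The following"):
            # Every section heading switches the downgrade flag on or off.
            in_downgrade_section = "DOWNGRADED" in line
            continue
        if in_downgrade_section:
            match = CHANGE_LINE.match(line)
            if match:
                # Drop the build string that follows the version number.
                old_version = match.group("old").split("-")[0]
                new_version = match.group("new").split("-")[0]
                downgrades.append((match.group("name"), old_version, new_version))
    return downgrades


plan = """The following NEW packages will be INSTALLED:
  pandas             conda-forge/linux-64::pandas-2.1.4-py312hfb8ada1_0

The following packages will be DOWNGRADED:
  numpy                               2.1.0-py312h1103770_0 --> 1.26.4-py312heda63a1_0
"""

found = find_downgrades(plan)
for name, old_version, new_version in found:
    print(f"{name}: {old_version} -> {new_version}")
```

One way to obtain the plan text without changing the environment is to run the install with conda's --dry-run option and capture what it prints.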
