Goal: Provide an exercise to work through that puts together the various concepts covered within this workshop.
Table of Contents |
---|
minLevel | 1 |
---|
maxLevel | 1 |
---|
outline | false |
---|
style | none |
---|
type | list |
---|
printable | true |
---|
|
...
Exercise: Put Things Together
Note |
---|
Exercise: Create a self-contained Conda environment that allows a user to run a Python file that uses PyTorch. |
Note |
---|
Using the following pytorch_gpu_test.py script, create a bash script that submits a job to the cluster to run it, using the created Conda environment. The job should utilize one NVIDIA H100 GPU device on a single compute node, with 8 cores and 16G of memory. Any output should be written into a slurms subfolder, with a filename of the form pytorch_<job-id>.out. Send an email notification to yourself regarding the status of the job.
|
Expand |
---|
|
Code Block |
---|
import torch
import math

print("PyTorch Version: " + str(torch.__version__))
print("Cuda Available: " + str(torch.cuda.is_available()))
print("Device Count: " + str(torch.cuda.device_count()))
print("Device Name: " + str(torch.cuda.get_device_name(0)))

dtype = torch.float
device = torch.device("cuda:0")  # Run on the GPU

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3') |
|
Note |
---|
Ignoring any cached packages, how much storage space does this Conda environment use? Which version of Python is being used?
|
...
Exercise: Extend with Pandas
Note |
---|
Exercise: I also want to use the Python Pandas package, specifically version 2.1.4, to assist with some further analysis. |
...
Exercise: Suggested Answers: Create the Environment
Info |
---|
Considerations: Asking for a shared environment? Install within a /project/<project-name>/ location - a project that both you and the users you want to share with are part of. Read the documentation! The installation process offers both conda and pip install directions. Since we want a self-contained environment, we suggest using the conda install approach. If you use pip install you would need to set the PYTHONUSERBASE environment variable. The directions indicate how to install CUDA version 12.4.
Look at the Linux du command to calculate the storage a folder takes.
|
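As a hedged sketch of the pip alternative mentioned above (the path below is an illustrative assumption built from this page's placeholder, not a verified cluster location), the user install base would be redirected into project space before any pip install --user call:

```shell
# Sketch only: redirect pip's per-user install location into project space.
# Without PYTHONUSERBASE, "pip install --user" writes into ~/.local instead.
export PYTHONUSERBASE="/project/<project-name>/software/pytorch/pip_base"
echo "pip user base set to: $PYTHONUSERBASE"
# pip install --user torch torchvision torchaudio   # would then install there
```

Note the pip install line is left commented: it is only to show where the packages would land once the variable is set.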
Expand |
---|
|
Code Block |
---|
[~]$ cd /project/<project-name>/software/pytorch
[]$ module purge
[]$ module load miniconda3/24.3.0
[]$ conda create -p pytorch_env
...
# To activate this environment, use
#
# $ conda activate /cluster/medbow/project/<project-name>/software/pytorch/pytorch_env
#
# To deactivate an active environment, use
#
# $ conda deactivate
[]$ conda activate /cluster/medbow/project/<project-name>/software/pytorch/pytorch_env
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda deactivate
[]$ |
|
Info |
---|
Regarding the storage space used: |
Code Block |
---|
[pytorch]$ du -d 1 -h
6.3G ./pytorch_env
6.3G |
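The du flags above can be tried out generically. This sketch uses a throwaway directory with a placeholder file rather than the real environment, so the sizes reported are illustrative only:

```shell
# Create a throwaway folder with a 1 MiB placeholder file to demo du flags.
mkdir -p demo_env/lib
head -c 1048576 /dev/zero > demo_env/lib/placeholder.bin

du -d 1 -h demo_env   # per-subfolder sizes, one level deep, human-readable
du -sh demo_env       # -s: a single summary total for the whole tree
du -sb demo_env       # -b: apparent size in exact bytes
```

Note that du reports disk blocks used by default, which can differ slightly from the apparent byte count shown by -b.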
Info |
---|
Notice the following (or something similar) within the output of the PyTorch installation: Code Block |
---|
...
python-3.12.3 | hab00c5b_0_cpython 30.5 MB conda-forge
... |
You can confirm within the activated environment: Code Block |
---|
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ python --version
Python 3.12.3 |
|
...
Exercise: Suggested Answers: Run the Code
Info |
---|
Considerations: Decide on the working directory where you’ll keep the Python test script and the bash submission script. Do you want to share this? Look at the required resources: one H100 - which partition do you need to define? Remember: if you don’t ask, you don’t get. How do you request 8 cores and 16G of memory?
What are all the steps you need to perform? You can test via an interactive salloc session, and then copy the working steps into your bash submission script.
|
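An interactive test session can be sketched as follows; the flags mirror the batch script later on this page, and the account and partition names are placeholders taken from this page rather than verified values:

```shell
# Hedged sketch: request an interactive session with the same resources as
# the batch job, test each step by hand, then copy them into run.sh.
if command -v salloc >/dev/null 2>&1; then
    salloc --account="<project-name>" --time=10:00 --partition=mb-h100 \
           --gres=gpu:1 --cpus-per-task=8 --mem=16G
else
    echo "salloc not found: run this from a cluster login node"
fi
```

Once the interactive shell opens, run the module load, conda activate, and python steps manually before committing them to the script.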
Expand |
---|
title | Suggested Submission Script: run.sh |
---|
|
Code Block |
---|
#!/bin/bash
#SBATCH --account=<project-name>
#SBATCH --time=10:00
#SBATCH --job-name=pytorch_test
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email-addr>
#SBATCH --output=slurms/pytorch_%A.out
#SBATCH --partition=mb-h100
#SBATCH --gres=gpu:1
module purge
module load miniconda3/24.3.0
conda activate /cluster/medbow/project/arcc/software/pytorch/pytorch_env
python pytorch_gpu_test.py
conda deactivate
echo "Done." |
|
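One point worth checking before submitting (based on standard Slurm behaviour): Slurm does not create the directory named in --output, so the slurms folder must exist first or the job will run without writing its output file. A hedged submission sketch:

```shell
# Slurm will not create the folder named in --output; make it first.
mkdir -p slurms

if command -v sbatch >/dev/null 2>&1; then
    sbatch run.sh      # prints: Submitted batch job <job-id>
    squeue --me        # check the job state (PD = pending, R = running)
else
    echo "sbatch not found: submit from a cluster login node"
fi
```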
Info |
---|
If successful, your output should take the following form: |
Expand |
---|
title | cat slurms/pytorch_2331610.out |
---|
|
Code Block |
---|
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) slurm/latest 2) arcc/1.0
PyTorch Version: 2.4.0
Cuda Available: True
Device Count: 1
Device Name: NVIDIA H100 80GB HBM3
99 176.02212524414062
199 119.40426635742188
299 81.9596939086914
399 57.194664001464844
499 40.815223693847656
599 29.981685638427734
699 22.816429138183594
799 18.07718276977539
899 14.942441940307617
999 12.869009017944336
1099 11.497486114501953
1199 10.590255737304688
1299 9.990135192871094
1399 9.59315013885498
1499 9.330546379089355
1599 9.156827926635742
1699 9.041902542114258
1799 8.965866088867188
1899 8.915557861328125
1999 8.882278442382812
Result: y = -0.0008966787136159837 + 0.8645414710044861 x + 0.0001546924322610721 x^2 + -0.09443996101617813 x^3
Done. |
|
...
Exercise: Suggested Answers: Extend with Pandas
Info |
---|
Considerations: Can you add Pandas to the environment? Yes. You can always go back to an existing environment, activate it, and update it. How will you install this package: using conda install or pip install? You’re explicitly asked for version 2.1.4. Is this version available? How can you check? If you conda install, is there anything you need to take note of during the solving stage?
|
Info |
---|
Because we want to try and keep our environment self-contained, we’d suggest first looking at using Conda. Perform the following to check if it is available as a Conda package: Code Block |
---|
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda search pandas
Loading channels: done
# Name Version Build Channel
...
pandas 2.1.4 py39hddac248_0 conda-forge
... |
Notice: This package is within the conda-forge channel. Do you have this configured? |
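A hedged sketch of how to check, and if needed add, the conda-forge channel (conda config writes its changes to ~/.condarc):

```shell
# Check which channels this conda installation is configured to search,
# and add conda-forge if it is missing.
if command -v conda >/dev/null 2>&1; then
    conda config --show channels
    conda config --add channels conda-forge   # records it in ~/.condarc
else
    echo "conda not found: load the miniconda3 module first"
fi
```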
Info |
---|
What is detailed during the solving stage of the conda install? Code Block |
---|
(/cluster/medbow/project/<project-name>/software/pytorch/pytorch_env) []$ conda install pandas==2.1.4
The following packages will be downloaded:
package | build
---------------------------|-----------------
numpy-1.26.4 | py312heda63a1_0 7.1 MB conda-forge
pandas-2.1.4 | py312hfb8ada1_0 14.0 MB conda-forge
------------------------------------------------------------
Total: 21.1 MB
The following NEW packages will be INSTALLED:
pandas conda-forge/linux-64::pandas-2.1.4-py312hfb8ada1_0
...
The following packages will be DOWNGRADED:
numpy 2.1.0-py312h1103770_0 --> 1.26.4-py312heda63a1_0 |
|
Note |
---|
Notice: The numpy package was installed as part of the original torch install, but it is going to be downgraded from 2.1.0 to 1.26.4! This is a reason to perhaps not set the always_yes option in the ~/.condarc file: if you have it set to yes, the installation would have continued regardless. How does this downgraded version affect torch? Without testing, we don’t know. We would suggest not downgrading. Instead, create a separate Conda environment for Pandas so we do not run into potential dependency issues.
|
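The separate-environment suggestion above might be sketched as follows; the prefix path and Python version are illustrative assumptions built from this page's placeholders:

```shell
# Sketch: a separate self-contained environment for Pandas, avoiding the
# numpy downgrade inside the PyTorch environment. Path is a placeholder.
ENV_PREFIX="/project/<project-name>/software/pandas/pandas_env"

if command -v conda >/dev/null 2>&1; then
    conda create -p "$ENV_PREFIX" -c conda-forge python=3.12 pandas=2.1.4
else
    echo "conda not found: load the miniconda3 module first"
fi
```

As with the PyTorch environment, this would then be activated by its full prefix path with conda activate.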
...
...