Using TensorFlow on Beartooth

Although ARCC endeavors to keep this page up to date, TensorFlow is under continuous development and this page may lag behind. If you think this page is out of date, please notify ARCC via our portal, and refer to the TensorFlow Install page for the latest approach.

Overview

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.

General Install process for Beartooth:

The basic process for setting up conda, installing TensorFlow and running a script is:

Step through creating a basic Conda environment - CPU vs GPU.
Provide a template for a bash script to submit jobs using sbatch.
Provide a very simple script that tests that TensorFlow can be imported and can identify the allocated GPU.

Note:

This is a short page and assumes some familiarity with using Conda.
The installation of TensorFlow within the conda environment will also install related dependencies, but nothing else. Since you’re creating the conda environment, you can extend it and install other packages. You can view the installed conda packages using conda list while in an active environment.
The bash script only uses a single node and a single core. It is up to the user to explore other configurations.
In the scripts and examples below, please remember to edit them to use your own account, email address, folder locations, etc.

Setting Up Conda Environment

The process below is an example. Please be aware that you’ll need to replace <project-name> with your project, and that <username> represents your username.

[<username>@blog2 <project-name>]$ pwd
/project/<project-name>
[<username>@blog2 <project-name>]$ mkdir tf
[<username>@blog2 <project-name>]$ cd tf

[<username>@blog2 tf]$ module load miniconda3/23.11.0
# The '-p' option will create the conda environment in your current location.
[<username>@blog2 tf]$ conda create -p tf_env

# Use the activation path printed at the end of the conda create step above.
[<username>@blog2 tf]$ conda activate /pfs/tc1/project/<project-name>/tf/tf_env

# Point pip's user-install location (PYTHONUSERBASE) at the conda environment,
# so Python packages are installed there rather than in your home folder.
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ export PYTHONUSERBASE=$CONDA_PREFIX
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ echo $PYTHONUSERBASE

# CPU ONLY VERSION
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ pip install tensorflow==2.15.1

# GPU VERSION
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ pip install "tensorflow[and-cuda]==2.15.1"

(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ conda deactivate
[<username>@blog2 tf]$
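
Once the environment is active, it can be useful to confirm that the package really landed inside the environment folder rather than in your home folder. The following is a minimal sketch of such a check using only the Python standard library; the function names locate_package and is_inside are hypothetical helpers introduced for illustration, and the check assumes conda has set the CONDA_PREFIX variable for the active environment. Paths are resolved before comparison, since /project/<project-name> and /pfs/tc1/project/<project-name> may refer to the same location.

```python
import importlib.util
import os
import sys


def locate_package(name="tensorflow"):
    """Return the directory a package would be imported from, or None if absent."""
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


def is_inside(path, prefix):
    """True when path lies under prefix, after resolving symbolic links."""
    if path is None or prefix is None:
        return False
    path, prefix = os.path.realpath(path), os.path.realpath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


if __name__ == "__main__":
    # CONDA_PREFIX is assumed to be set by 'conda activate'.
    env_prefix = os.environ.get("CONDA_PREFIX")
    location = locate_package("tensorflow")
    print("Interpreter prefix:", sys.prefix)
    print("CONDA_PREFIX:", env_prefix)
    print("tensorflow location:", location)
    print("Installed inside the environment:", is_inside(location, env_prefix))
```

If the last line prints False while the environment is active, the package was most likely installed somewhere else, for example under your home folder.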

Bash Script to use with sbatch (with GPU)

Below is a basic template. You’ll need to insert your account, your email address and the path to the conda environment you created:

#!/bin/bash
#SBATCH --account=<your_arcc_project>
#SBATCH --time=0:10:00
#SBATCH --job-name=tensorflow_test
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your_email>
#SBATCH --output=tensorflow_%A.log
#SBATCH --mem=8G
#SBATCH --partition=<select_appropriate_gpu_partition>
#SBATCH --gres=gpu:1

echo "Load Modules:"
module load miniconda3/23.11.0

echo "Check GPU Allocation:"
echo "Running nvidia-smi:"
nvidia-smi -L

echo "Activate Conda Environment"
conda activate /pfs/tc1/project/<project-name>/tf/tf_env

python --version
echo "- - - - - - - - - - - - - - - - - - - - -"
python tf_test.py
echo "- - - - - - - - - - - - - - - - - - - - -"

echo "Deactivate Conda:"
conda deactivate
echo "Done"
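
The job's allocation can also be inspected from inside Python, which may help when the log shows a GPU from nvidia-smi but TensorFlow reports none. The sketch below reads environment variables that Slurm and the GPU plugin typically set inside a job; whether each one is present depends on the cluster configuration, so treat the list as an assumption rather than a guarantee. The function names read_allocation and visible_gpu_count are hypothetical.

```python
import os

# Variables commonly set inside a Slurm job; some may be absent on a given cluster.
SLURM_VARIABLES = (
    "SLURM_JOB_ID",
    "SLURM_JOB_NODELIST",
    "SLURM_CPUS_PER_TASK",
    "CUDA_VISIBLE_DEVICES",
)


def read_allocation(environ=None):
    """Return the allocation-related variables, marking any that are missing."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<not set>") for name in SLURM_VARIABLES}


def visible_gpu_count(environ=None):
    """Count the comma-separated entries in CUDA_VISIBLE_DEVICES (0 if unset or empty)."""
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([item for item in value.split(",") if item.strip()])


if __name__ == "__main__":
    for name, value in read_allocation().items():
        print(f"{name}={value}")
    print("GPUs visible to this job:", visible_gpu_count())
```

With --gres=gpu:1 as in the template above, one would normally expect a single visible GPU; a count of 0 suggests the GPU request did not take effect.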

Simple Source Code Example

Below is some very simple source code that will test that your environment and GPU request are functioning properly.

It simply imports the tensorflow package and then uses it to check that it can identify the allocated GPU(s). To work with the bash script above, save this file as tf_test.py

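# Import TensorFlow and report the installed version.
import tensorflow as tf
print("TensorFlow Version: " + tf.__version__)
# List the GPUs TensorFlow can see; an empty list means none were detected.
print(tf.config.list_physical_devices('GPU'))
# Print the name of the default GPU device (an empty string if there is none).
print(tf.test.gpu_device_name())
# Run a small computation to confirm TensorFlow executes end to end.
print(tf.reduce_sum(tf.random.normal([1000, 1000])))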
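
The script above prints an empty list when no GPU is found, which is easy to miss in a log. As a hedged variation, the sketch below turns that case into a non-zero return code so the job can be made to fail visibly. The names gpu_status and main are hypothetical, and the device string "/GPU:0" assumes at least one GPU has been allocated.

```python
def gpu_status(gpu_devices):
    """Return (exit_code, message) for a list of detected GPU devices."""
    if not gpu_devices:
        return 1, "No GPU detected: check the --gres request and the [and-cuda] install."
    return 0, f"Detected {len(gpu_devices)} GPU(s)."


def main():
    # Imported here so gpu_status can be used without TensorFlow installed.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    code, message = gpu_status(gpus)
    print(message)
    if code == 0:
        # Run a small matrix multiplication explicitly on the first GPU.
        with tf.device("/GPU:0"):
            x = tf.random.normal([1000, 1000])
            print("Result device:", tf.matmul(x, x).device)
    return code


# When saved as a script, end the file with:
#     raise SystemExit(main())
```

With that final line added, the job exits with status 1 when no GPU is detected, so the failure shows up in the job state and in the email notification rather than only in the log.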

Example Output

Issues

As mentioned, TensorFlow is under constant development and errors/bugs will creep in.

For example, when installing version 2.16.1, GPUs were not being detected (see the TensorFlow issue "TF 2.16.1 Fails to work with GPUs", #63362). This is why the example above explicitly installs version 2.15.1.
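
Because the working version matters here, a job can check the installed version before doing any real work. The following is a minimal sketch using the standard library's importlib.metadata; the names PINNED_VERSION, installed_version and check_pin are hypothetical, and the pinned value simply mirrors the version installed in the example above.

```python
from importlib import metadata

# Assumption: the version pinned in the install step on this page.
PINNED_VERSION = "2.15.1"


def installed_version(package="tensorflow"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(installed, pinned=PINNED_VERSION):
    """Describe how the installed version compares with the pinned one."""
    if installed is None:
        return "tensorflow is not installed in this environment"
    if installed == pinned:
        return f"OK: tensorflow {installed} matches the pinned version"
    return f"Warning: tensorflow {installed} differs from the pinned {pinned}"


if __name__ == "__main__":
    print(check_pin(installed_version()))
```

This reads package metadata only, so it runs quickly and does not need a GPU.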
