
Overview

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.

Although ARCC will endeavor to keep this page up-to-date, TensorFlow is under continuous development and we might be playing catchup. If you think this page is out-of-date, please notify ARCC via our portal, and please refer to the TensorFlow Install page to check for the latest approach.

General Install process for Beartooth:

The basic process for setting up Conda, installing TensorFlow, and running a script is:

  1. Step through creating a basic Conda environment - CPU vs GPU.

  2. Provide a template for a bash script to submit jobs using sbatch.

  3. Provide a very simple script that tests that TensorFlow can be imported and can identify the allocated GPU.

Note:

  • This is a short page and assumes some familiarity with using Conda.

  • Installing TensorFlow within the conda environment will also install its related dependencies, but nothing else. Since you’re creating the conda environment, you can extend it and install other packages. You can view the installed packages by running conda list while the environment is active.

  • The bash script only uses a single node and single core. It is up to the user to explore other configurations.

  • In the scripts and examples below, please remember to appropriately edit to use your account, email address, folder locations etc.

Setting Up Conda Environment

The process below is an example. Please be aware that you’ll need to replace <project-name> with your project, and that <username> represents your username.

 Create conda environment
[<username>@blog2 <project-name>]$ module load miniconda3/23.11.0
[<username>@blog2 <project-name>]$ pwd
/project/<project-name>

[<username>@blog2 <project-name>]$ mkdir tf
[<username>@blog2 <project-name>]$ cd tf

# The '-p' option will create the conda environment in your current location.
[<username>@blog2 tf]$ conda create -p tf_env

# Use the activation path printed at the end of the 'conda create' step above.
[<username>@blog2 tf]$ conda activate /pfs/tc1/project/<project-name>/tf/tf_env

# Force conda to install the python packages within the conda environment location,
# rather than your home folder.
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ export PYTHONUSERBASE=$CONDA_PREFIX
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ echo $PYTHONUSERBASE

# CPU ONLY VERSION
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ pip install tensorflow==2.15.1

# GPU VERSION
(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ pip install tensorflow[and-cuda]==2.15.1

(/pfs/tc1/project/<project-name>/tf/tf_env) [<username>@blog2 tf]$ conda deactivate
[<username>@blog2 tf]$
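
Before submitting anything through Slurm, you can sanity-check the install from the login node while the environment is active. Below is an optional sketch - the helper name and file name (e.g. check_install.py) are purely illustrative, and the import guard prints a clear message rather than crashing if TensorFlow is missing:

```python
def tensorflow_available():
    """Return (ok, detail): whether TensorFlow imports, plus a detail string.

    Illustrative helper, not part of the install steps above. An empty GPU
    list on a login node is expected - GPUs only appear inside a GPU job.
    """
    try:
        import tensorflow as tf
        return True, "TensorFlow " + tf.__version__
    except ImportError:
        return False, "TensorFlow is not installed in this environment."


if __name__ == "__main__":
    ok, detail = tensorflow_available()
    print(detail)
    if ok:
        import tensorflow as tf
        print("Visible GPUs:", tf.config.list_physical_devices('GPU'))
```

Run it with python check_install.py inside the activated environment; a clean import confirms the pip step worked before you spend time in the queue.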

Bash Script to use with sbatch (with GPU)

Below is a basic template to use that you’ll need to insert your account, email and path to the conda environment you created:

 Slurm Template
#!/bin/bash
#SBATCH --account=<your_arcc_project>
#SBATCH --time=0:10:00
#SBATCH --job-name=tensorflow_test
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your_email>
#SBATCH --output=tensorflow_%A.log
#SBATCH --mem=8G
#SBATCH --partition=<select_appropriate_gpu_partition>
#SBATCH --gres=gpu:1

echo "Load Modules:"
module load miniconda3/23.11.0

echo "Check GPU Allocation:"
echo "Running nvidia-smi:"
nvidia-smi -L

echo "Activate Conda Environment"
conda activate /pfs/tc1/project/<project-name>/tf/tf_env

python --version
echo "- - - - - - - - - - - - - - - - - - - - -"
python tf_test.py
echo "- - - - - - - - - - - - - - - - - - - - -"

echo "Deactivate Conda:"
conda deactivate
echo "Done"
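
If you installed the CPU-only build of TensorFlow, the same template applies with the GPU-specific lines dropped. A hedged sketch of the differences (the partition placeholder is illustrative - choose one appropriate to your allocation):

```shell
#!/bin/bash
#SBATCH --account=<your_arcc_project>
#SBATCH --time=0:10:00
#SBATCH --job-name=tensorflow_cpu_test
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --partition=<select_appropriate_cpu_partition>
# Note: no --gres=gpu:1 and no GPU partition - TensorFlow falls back to CPU.

echo "Load Modules:"
module load miniconda3/23.11.0

echo "Activate Conda Environment"
conda activate /pfs/tc1/project/<project-name>/tf/tf_env

python tf_test.py

conda deactivate
echo "Done"
```

With this variant the test script below will simply report an empty GPU list, which is the expected result for the CPU-only install.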

Simple Source Code Example

Below is some very simple source code that will test that your environment and GPU request are functioning properly.

It simply imports the tensorflow package, and then uses it to check that it can identify the allocated GPU(s). To work with the bash script above, save this file as tf_test.py

 Example: tf_test.py
import tensorflow as tf
print("TensorFlow Version: " + str(tf.__version__))
print(tf.config.list_physical_devices('GPU'))
print(tf.test.gpu_device_name())
print(tf.reduce_sum(tf.random.normal([1000, 1000])))
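
If you want the test to go one step further than listing devices, a small matrix multiply exercises the allocated GPU (TensorFlow places ops on a visible GPU by default). Below is a hedged sketch - the helper name and matrix size are illustrative, and the import guard keeps it runnable even where TensorFlow is absent:

```python
def small_matmul_check(n=256):
    """Run an n x n matrix multiply and report which device class handled it.

    Illustrative helper, not part of tf_test.py above. Returns 'unavailable'
    if TensorFlow is not installed, so the sketch degrades gracefully.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return "unavailable"
    a = tf.random.normal([n, n])
    b = tf.random.normal([n, n])
    c = tf.linalg.matmul(a, b)  # placed on the GPU when one is visible
    gpus = tf.config.list_physical_devices('GPU')
    return "ran on GPU" if gpus else "ran on CPU"


if __name__ == "__main__":
    print(small_matmul_check())
```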

Example Output

 Example run and output:
[<username>@blog2 tf]$ sbatch run.sh
Submitted batch job 13515489

[<username>@blog2 tf]$ cat tensorflow_13515489.log
Load Modules:
Check GPU Allocation:
Running nvidia-smi:
GPU 0: NVIDIA A30 (UUID: GPU-6727f67b-4e78-6947-4889-d9742499866b)
Activate Conda Environment
Python 3.9.18
- - - - - - - - - - - - - - - - - - - - -
2024-03-13 09:54:39.425766: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-13 09:54:39.570846: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-13 09:54:39.571111: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-13 09:54:39.579691: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-13 09:54:39.599788: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-13 09:54:44.632792: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-13 09:54:54.158738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /device:GPU:0 with 22109 MB memory:  -> device: 0, name: NVIDIA A30, pci bus id: 0000:4b:00.0, compute capability: 8.0
2024-03-13 09:54:54.167637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22109 MB memory:  -> device: 0, name: NVIDIA A30, pci bus id: 0000:4b:00.0, compute capability: 8.0
TensorFlow Version: 2.15.1
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
/device:GPU:0
tf.Tensor(-786.1937, shape=(), dtype=float32)
- - - - - - - - - - - - - - - - - - - - -
Deactivate Conda:
Done

Issues

As mentioned, TensorFlow is under constant development, and errors and bugs can creep in.

For example, when installing version 2.16.1, GPUs were not being detected: TF 2.16.1 Fails to work with GPUs #63362 - this is why in the example above we explicitly installed version 2.15.1.
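
One practical defence against this class of bug is to make your training script fail fast when TensorFlow silently loses sight of the GPU, rather than discovering it after hours of unintended CPU-bound training. A hedged sketch (the helper name and exit message are illustrative):

```python
import sys


def require_gpu(exit_on_failure=True):
    """Return the number of GPUs TensorFlow can see, exiting if there are none.

    Illustrative helper: call it at the top of a training script so a broken
    install (such as the 2.16.1 issue above) makes the Slurm job fail
    immediately with a non-zero exit status. Returns -1 instead of exiting
    if TensorFlow itself is not installed, so the sketch runs anywhere.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return -1
    n = len(tf.config.list_physical_devices('GPU'))
    if n < 1 and exit_on_failure:
        sys.exit("No usable GPU detected by TensorFlow - aborting.")
    return n
```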
