Using the DGX Nodes

Overview

This page details how to use the DGX nodes, specifically how to run the Python TensorFlow package from within a Conda environment, with example submission templates and output.

DGX nodes: The DGX nodes were purchased as part of the EvolvingAI project group's investment (so those users get priority usage), but they are available to other users via our condo model. To request them, you must explicitly define --partition=dgx within your sbatch submission script (as detailed in Step 5 below).

There are two specific nodes:

  • mdgx01 with eight V100-SXM2-16GB GPUs.

  • tdgx01 with eight V100-SXM2-32GB GPUs.
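Before targeting either node, you can check the current state of the dgx partition with the standard Slurm queries (a sketch; run from a login node):

```shell
# Show the dgx partition and the state of its two nodes (idle/alloc/down).
sinfo --partition=dgx

# Show jobs currently queued or running on the dgx partition.
squeue --partition=dgx
```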

 

The DGX images, based on Ubuntu 18.04.5 LTS (rather than the RedHat 7.9 used across the other Teton nodes), were last updated in January 2020 and provide the following capabilities:

Software Releases in DGX OS Desktop Release 4.0.7: https://docs.nvidia.com/dgx/dgx-os-desktop-release-notes/index.html#release-4-0-7

  • Linux kernel: 4.15.0-47-generic (matches the output of uname -r)

  • NVIDIA Graphics Drivers for Linux 410.129

  • NVIDIA CUDA Toolkit: 10.0.130

  • NVIDIA CUDA Deep Neural Network (cuDNN) Library: 7.5.0

  • NVIDIA Collective Communication Library (NCCL): 2.4.2

  • OpenGL: 4.6
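You can confirm these versions on the node itself, for example from an interactive session (standard commands; the output should match the table above):

```shell
uname -r              # kernel version, e.g. 4.15.0-47-generic
cat /etc/os-release   # OS release, e.g. Ubuntu 18.04.5 LTS
nvidia-smi            # driver version and supported CUDA version
nvcc --version        # CUDA Toolkit version, if nvcc is on the PATH
```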

 

Currently, only a limited set of modules is available, due to the specific nature of the investment's research:

salexan5@tdgx01:/$ module spider
--------------------------------------------------------------------------------
The following is a list of the modules currently available:
--------------------------------------------------------------------------------
  cuda: cuda/10.1.243, cuda/11.0.3
  cudnn: cudnn/8.0.5.39
  miniconda3: miniconda3/4.9.2, miniconda3/4.10.3
  singularity: singularity/3.8.1

Usage Approaches:

Given the limited number of modules currently installed, and the type of research and development approaches used, researchers have typically used Anaconda to provide the functionality they need.

In most cases the approach has been to install Anaconda locally (since it is not available as a module) and then access the Anaconda binaries directly.

Example 1: Using Miniconda

This example demonstrates using miniconda to create a conda environment that is then run on the DGX nodes.

Miniconda: a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, such as pip and zlib.

 

Step 1: Download Linux version of Miniconda

[]$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

 

Step 2: Install Miniconda

By default, this will install into your /home/<username>/miniconda3/ folder. Since your /home/ space is limited, you can install into your /project/ space as an alternative.
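For example, a non-interactive install into /project/ space might look like the following (a sketch: the -b flag accepts the license and skips the prompts, -p sets the install prefix; the path is a placeholder you would adjust):

```shell
bash Miniconda3-latest-Linux-x86_64.sh -b -p /project/<your-project>/<username>/miniconda3
```

Note that with -b the installer does not run conda init, so you would initialise your shell manually afterwards.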

The interactive installation will proceed something like this:

[]$ bash Miniconda3-latest-Linux-x86_64.sh

Welcome to Miniconda3 py38_4.9.2

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
===================================
End User License Agreement - Anaconda Individual Edition
===================================

Copyright 2015-2020, Anaconda, Inc.
...
Last updated September 28, 2020

Do you accept the license terms? [yes|no]
[no] >>> yes

Miniconda3 will now be installed into this location:
/home/salexan5/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/salexan5/miniconda3] >>>
PREFIX=/home/salexan5/miniconda3
Unpacking payload ...
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/salexan5/miniconda3

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - brotlipy==0.7.0=py38h27cfd23_1003
    ...
    - zlib==1.2.11=h7b6447c_3

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  brotlipy           pkgs/main/linux-64::brotlipy-0.7.0-py38h27cfd23_1003
  ..
  zlib               pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3

Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
no change     /home/salexan5/miniconda3/condabin/conda
...
no change     /home/salexan5/miniconda3/etc/profile.d/conda.csh
modified      /home/salexan5/.bashrc

==> For changes to take effect, close and re-open your current shell. <==

If you'd prefer that conda's base environment not be activated on startup,
set the auto_activate_base parameter to false:

conda config --set auto_activate_base false

Thank you for installing Miniconda3!

 

For the changes to take effect, you need to reset your session:

[]$ exec bash
...
Resetting modules to system default
(base) []$

Notice that your command line prompt has changed to indicate that conda is now active.

This is activated by the installation process modifying the .bashrc file in your /home/ folder. If you view this file, you'll see something like the following:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/salexan5/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/salexan5/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/salexan5/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/salexan5/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

If you do not want conda to be automatically initialized when you start a session, you can remove or comment out this block in your .bashrc. Depending on how comfortable you are pointing to executables directly and your level of Linux experience, you might want to leave it in place while you create your conda environments.
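Alternatively, rather than editing .bashrc by hand, you can keep the block but stop conda from auto-activating its base environment, as the installer's own closing message suggests:

```shell
conda config --set auto_activate_base false
```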

Step 3: Build a Conda Environment

There are a number of ways to build a conda environment with respect to where it is located. The following example builds it in the specific folder that the command is run from.

When building a conda environment you'll need to understand the configuration you're building for. In this example I'm looking to build an environment that runs TensorFlow on the DGX nodes. Since these nodes currently only support up to CUDA Toolkit 10.0.130, I'm not going to try to build anything with capabilities that exceed this.

Looking at the Tested build configurations for TensorFlow I can see which version of TensorFlow can be used with respect to CUDA capabilities, as well as the version of Python it requires.

This example will build an environment based on CUDA 9 and, looking at the tested configuration for that toolkit, I need to use Python 3.6.
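The compatibility reasoning above can be sketched as a small lookup over the relevant rows of that table (version strings transcribed from TensorFlow's published "Tested build configurations"; treat them as illustrative and verify against the current page):

```python
# Subset of TensorFlow's published tested build configurations (Linux GPU).
# Values transcribed from tensorflow.org; double-check against the live table.
TESTED_CONFIGS = {
    "tensorflow_gpu-1.12.0": {"python": "2.7, 3.3-3.6", "cudnn": "7",   "cuda": "9"},
    "tensorflow_gpu-1.15.0": {"python": "2.7, 3.3-3.7", "cudnn": "7.4", "cuda": "10.0"},
}


def compatible_with_cuda(installed_cuda):
    """Return TF releases whose tested CUDA version does not exceed the installed toolkit."""
    major_minor = tuple(int(p) for p in installed_cuda.split(".")[:2])
    ok = []
    for name, cfg in TESTED_CONFIGS.items():
        tested = tuple(int(p) for p in cfg["cuda"].split("."))
        tested = tested + (0,) * (2 - len(tested))  # pad "9" -> (9, 0)
        if tested <= major_minor:
            ok.append(name)
    return sorted(ok)


print(compatible_with_cuda("10.0.130"))
# ['tensorflow_gpu-1.12.0', 'tensorflow_gpu-1.15.0']
```

On a node limited to CUDA 10.0.130, both tensorflow-gpu 1.12 (CUDA 9) and 1.15 (CUDA 10.0) qualify; anything built against CUDA 10.1 or newer does not.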

 

(base) []$ cd /project/arcc/salexan5/conda/dgx
(base) []$ conda create --prefix=tf_dgx python=3.6
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 4.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

## Package Plan ##

  environment location: /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx

  added / updated specs:
    - python=3.6

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  ...
  zlib               conda-forge/linux-64::zlib-1.2.11-h516909a_1010

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#     $ conda activate /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx
#
# To deactivate an active environment, use
#     $ conda deactivate

 

If we look in the folder in which we created this environment, we'll see it contains a folder with the same name that we gave our conda environment.

(base) []$ ls
tf_dgx

 

At this stage, when we activate this conda environment, we essentially have a blank environment with only Python 3.6 installed. Notice that on activating the environment, the command line prompt changes to indicate this.
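A quick way to confirm this (prompt abbreviated; the path matches the environment created above):

```shell
conda activate /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx
python --version   # Python 3.6.x
conda list         # only Python and its base dependencies at this point
```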

The next step is to install the appropriately versioned packages we want to use, i.e. tensorflow-gpu.

(base) []$ conda activate /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx
(/pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx) [salexan5@tlog2 dgx]$ conda install tensorflow-gpu==1.12
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 4.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

## Package Plan ##

  environment location: /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx

  added / updated specs:
    - tensorflow-gpu==1.12

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _tflow_select-2.1.0        |              gpu           2 KB
    ...
    zipp-3.4.1                 |   pyhd8ed1ab_0          11 KB  conda-forge
    ------------------------------------------------------------
                                           Total:       641.9 MB

The following NEW packages will be INSTALLED:

  _tflow_select      pkgs/main/linux-64::_tflow_select-2.1.0-gpu
  ...
  zipp               conda-forge/noarch::zipp-3.4.1-pyhd8ed1ab_0

Proceed ([y]/n)? y

Downloading and Extracting Packages
gast-0.4.0           | 12 KB     | ################################ | 100%
tensorflow-gpu-1.12. | 3 KB      | ################################ | 100%
...
protobuf-3.15.7      | 333 KB    | ################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: \
    By downloading and using the CUDA Toolkit conda packages, you accept the
    terms and conditions of the CUDA End User License Agreement (EULA):
    https://docs.nvidia.com/cuda/eula/index.html

    By downloading and using the cuDNN conda packages, you accept the terms
    and conditions of the NVIDIA cuDNN EULA -
    https://docs.nvidia.com/deeplearning/cudnn/sla/index.html
done
(/pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx) [salexan5@tlog2 dgx]$

 

Step 4: Create your Code

This part is over to you.

Here is a very simple script called tf.py used to interrogate the GPUs available:

import tensorflow as tf
from tensorflow.python.client import device_lib


def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']


def main():
    print(tf.test.is_gpu_available())
    print(get_available_gpus())


if __name__ == '__main__':
    main()

 

Step 5: Submit Job to run on DGX node and Test Code

To test your code you can either create an interactive session to access the DGX nodes, or submit a job. This step demonstrates an example submission script called run.sh.
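For the interactive route, a request along the following lines can be used (a sketch, assuming interactive allocations are permitted on this partition; the account name is a placeholder):

```shell
salloc --account=<your-project> --partition=dgx --gres=gpu:1 --time=01:00:00
```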

Notice the use of the srun --export=ALL command. It is required to:

  • Actually release the allocated GPUs to the code being run.

  • Export all the environment variables from the submission environment so that they are propagated to the launched application.

#!/bin/bash
#SBATCH --job-name=tfc9
#SBATCH --time=00:05:00
#SBATCH --account=<your-project>
#SBATCH --partition=dgx
#SBATCH --nodelist=mdgx01
#SBATCH --gres=gpu:4
#SBATCH --output=tfc9_%A.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your-email>

echo "SLURM_JOB_PARTITION" $SLURM_JOB_PARTITION
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
echo "- - - - - - - - -"
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
echo "- - - - - - - - -"
srun nvidia-smi -L
echo "- - - - - - - - -"
srun nvidia-smi
echo "- - - - - - - - -"

. /pfs/tsfs1/project/arcc/salexan5/conda/miniconda3/bin/activate
conda activate /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf_dgx
python --version
srun --export=ALL python tf.py
conda deactivate

echo "Finished Successfully:"

In the template above, we are explicitly requesting the mdgx01 node, and requesting four GPUs.
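Slurm exposes the allocated GPUs to the job through the CUDA_VISIBLE_DEVICES environment variable, which is what the echo in the script prints. A minimal sketch of how a program can inspect that variable (the helper name is our own, and this assumes Slurm sets integer indices rather than GPU UUIDs):

```python
import os


def visible_gpu_ids():
    """Parse CUDA_VISIBLE_DEVICES into a list of integer GPU indices."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(part) for part in value.split(",") if part.strip()]


# Simulate a four-GPU allocation like the one requested above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
print(visible_gpu_ids())  # [0, 1, 2, 3]
```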

Details on NVIDIA's nvidia-smi can be found here: NVIDIA System Management Interface.

Submit

Since we are submitting a job to the DGX nodes, we have to use the modified sbatch_dgx command, which is a tailored version of the typical sbatch.

(base) []$ sbatch_dgx run.sh
Submitted batch job 12714733

 

Output

(base) []$ cat tfc9_12714733.out
SLURM_JOB_PARTITION dgx
SLURM_JOB_NODELIST: mdgx01
- - - - - - - - -
CUDA_VISIBLE_DEVICES: 0,1,2,3
- - - - - - - - -
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-64dc6369-4c36-824d-182c-8e8f9c33f587)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-d4adb339-0dba-47db-e766-96b9cbc302b4)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-f7637d82-f0c0-15e6-da23-21216b9b8f33)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-5afe7813-6b87-b667-e0a3-8e04662357e8)
- - - - - - - - -
Tue Apr  6 17:01:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.129      Driver Version: 410.129      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   31C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- - - - - - - - -
Python 3.6.13
2021-04-06 17:01:14.852650: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-04-06 17:01:15.411565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:06:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2021-04-06 17:01:15.771357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:07:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2021-04-06 17:01:16.158389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:0a:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2021-04-06 17:01:16.535766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:0b:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2021-04-06 17:01:16.535884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2021-04-06 17:01:18.455694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-06 17:01:18.455741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3
2021-04-06 17:01:18.455750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y Y
2021-04-06 17:01:18.455756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y Y
2021-04-06 17:01:18.455761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N Y
2021-04-06 17:01:18.455767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y Y Y N
2021-04-06 17:01:18.456708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2021-04-06 17:01:18.457189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14846 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:07:00.0, compute capability: 7.0)
2021-04-06 17:01:18.457439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:0a:00.0, compute capability: 7.0)
2021-04-06 17:01:18.457725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14846 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:0b:00.0, compute capability: 7.0)
2021-04-06 17:01:18.460753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2021-04-06 17:01:18.460801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-06 17:01:18.460817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3
2021-04-06 17:01:18.460825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y Y
2021-04-06 17:01:18.460831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y Y
2021-04-06 17:01:18.460837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N Y
2021-04-06 17:01:18.460843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y Y Y N
...
True
['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
Finished Successfully:

 

Example 2: Conda Environment using TensorFlow based on CUDA Toolkit 10


Now that Miniconda is set up and usable, we can create alternative conda environments. This example demonstrates creating an environment that uses CUDA Toolkit 10.

Step 1: Build a Conda Environment

The only significant difference compared to Example 1 is the actual conda environment that is created. Looking at the tested build configuration that uses CUDA Toolkit 10, we can identify the appropriate TensorFlow version as well as the Python version it requires.

(base) []$ conda create --prefix=tf10b_dgx python=3.7
(base) []$ conda activate /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf10b_dgx
(/pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf10b_dgx) []$ conda install tensorflow-gpu==1.15
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done
...

## Package Plan ##

  environment location: /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf10b_dgx

  added / updated specs:
    - tensorflow-gpu==1.15

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    cudatoolkit-10.0.130       |       hf841e97_8       336.5 MB  conda-forge
    cudnn-7.6.5.32             |       ha8d7eb6_1       226.2 MB  conda-forge
    cupti-10.0.130             |                0         1.5 MB
    tensorboard-1.15.0         |           py37_0         3.8 MB  conda-forge
    tensorflow-1.15.0          |gpu_py37h0f0df58_0           4 KB
    tensorflow-base-1.15.0     |gpu_py37h9dcbed7_0       156.5 MB
    tensorflow-estimator-1.15.1|     pyh2649769_0         271 KB
    tensorflow-gpu-1.15.0      |       h0d30ee6_0           3 KB
    ------------------------------------------------------------
                                           Total:       724.8 MB

The following NEW packages will be INSTALLED:
...

Downloading and Extracting Packages
...
done

 

Step 2: Create your Code

We use a modified test script, tf2.py:

import tensorflow as tf
from tensorflow.python.client import device_lib


def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']


def main():
    print("- - - - - - - - - - - - - - - - - - - -")
    print("TF Version:" + str(tf.__version__))
    print("Is Built With Cuda: " + str(tf.test.is_built_with_cuda()))
    print("Is GPU Available: " + str(tf.test.is_gpu_available()))
    print(get_available_gpus())


if __name__ == '__main__':
    main()

 

Step 3: Submit Job to run on DGX node and Test Code

In the following template we only define the dgx partition (i.e. we do not request a specific node), and so the job could be allocated to either of the two DGX nodes. We have also requested eight GPUs.

#!/bin/bash
#SBATCH --job-name=tfc10
#SBATCH --time=00:05:00
#SBATCH --account=<your-project>
#SBATCH --partition=dgx
#SBATCH --gres=gpu:8
#SBATCH --output=slurms/tfc10_%A.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your-email>

echo "SLURM_JOB_PARTITION" $SLURM_JOB_PARTITION
echo "SLURM_JOB_NODELIST:" $SLURM_JOB_NODELIST
echo "- - - - - - - - -"
echo "CUDA_VISIBLE_DEVICES:" $CUDA_VISIBLE_DEVICES
echo "- - - - - - - - -"
srun nvidia-smi -L
echo "- - - - - - - - -"

. /pfs/tsfs1/project/arcc/salexan5/conda/miniconda3/bin/activate
conda activate /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf10b_dgx
python --version
srun --export=ALL python tf2.py
conda deactivate

echo "Finished Successfully:"

 

Submit and Output:

(base) []$ sbatch_dgx tfc10.sh
Submitted batch job 12778974
(base) []$ cat tfc10_12778974.out
SLURM_JOB_PARTITION dgx
SLURM_JOB_NODELIST: mdgx01
- - - - - - - - -
CUDA_VISIBLE_DEVICES: 0,1,2,3,4,5,6,7
- - - - - - - - -
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-64dc6369-4c36-824d-182c-8e8f9c33f587)
...
GPU 7: Tesla V100-SXM2-16GB (UUID: GPU-2fb2d543-b85b-9088-e779-21d8976fc19d)
- - - - - - - - -
Python 3.7.10
2021-04-14 12:02:56.409021: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-04-14 12:02:56.435606: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2195050000 Hz
2021-04-14 12:02:56.435796: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557a490682b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-14 12:02:56.435817: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-04-14 12:02:56.438689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-14 12:02:59.990361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:06:00.0
2021-04-14 12:02:59.991727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:07:00.0
...
2021-04-14 12:02:59.999219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 7 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
2021-04-14 12:03:00.002633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-14 12:03:00.004907: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-14 12:03:00.019135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-14 12:03:00.030408: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-14 12:03:00.053932: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-14 12:03:00.068450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-14 12:03:00.114602: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-14 12:03:00.130937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2021-04-14 12:03:00.130977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-14 12:03:00.140814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-14 12:03:00.140833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1 2 3 4 5 6 7
2021-04-14 12:03:00.140849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N Y Y Y Y N N N
2021-04-14 12:03:00.140857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   Y N Y Y N Y N N
2021-04-14 12:03:00.140864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 2:   Y Y N Y N N Y N
2021-04-14 12:03:00.140870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 3:   Y Y Y N N N N Y
2021-04-14 12:03:00.140877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 4:   Y N N N N Y Y Y
2021-04-14 12:03:00.140884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 5:   N Y N N Y N Y Y
2021-04-14 12:03:00.140891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 6:   N N Y N Y Y N Y
2021-04-14 12:03:00.140897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 7:   N N N Y Y Y Y N
2021-04-14 12:03:00.152965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 14926 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
...
2021-04-14 12:03:00.169486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:7 with 14926 MB memory) -> physical GPU (device: 7, name: Tesla V100-SXM2-16GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2021-04-14 12:03:00.172310: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557a4fb87080 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-04-14 12:03:00.172327: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
...
2021-04-14 12:03:00.172372: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (7): Tesla V100-SXM2-16GB, Compute Capability 7.0
2021-04-14 12:03:00.174701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:06:00.0
...
2021-04-14 12:03:00.181945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 7 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
...
- - - - - - - - - - - - - - - - - - - -
TF Version:1.15.0
Is Built With Cuda: True
Is GPU Available: True
['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3', '/device:GPU:4', '/device:GPU:5', '/device:GPU:6', '/device:GPU:7']
Finished Successfully:

 

What we've demonstrated with this second example is that we can reuse the same Miniconda setup but create an independent conda environment, which can be submitted and run in the same way.
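Finally, environments built this way can be listed and, when no longer needed, removed with the usual conda commands (the prefix path matches the environment from this example):

```shell
conda env list
conda env remove --prefix /pfs/tsfs1/project/arcc/salexan5/conda/dgx/tf10b_dgx
```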