As research becomes more compute-intensive, ARCC has made high performance computing a core service. This service is currently provided by the Teton Compute Environment, which allows researchers to perform computation-intensive analysis on large datasets. Newcomers to research computing should also consider reading the Research Computing Quick Reference. Using Teton, researchers have control over their data, projects, and collaborators. Built-in tools help users get up and running quickly, and the ability to request custom tools allows users to fine-tune their research procedures.
See Citing Teton.
Info |
---|
This page uses words and phrases that are common in research computing. If you are unsure of any of the terms, please visit the Glossary page to learn more. |
Overview
The Teton Compute Environment, or Teton for short, is a high performance computing (HPC) cluster with over 500 compute nodes and a high performance data storage system. Teton was preceded by UWyo’s first community HPC cluster, Mount Moran, which went into service in 2012. Teton, the second generation of HPC at UWyo, went into service in 2018 and is available to all research interests at UWyo. With over 1.2 PB of storage, Teton can accommodate some of the largest datasets. Isolated filesystems ensure that researchers control where their data reside and who can access them. Teton can be accessed securely from anywhere, at any time, over SSH using UWyo two-factor authentication, with 98% expected uptime. No matter where you go, your research can go with you.
Teton is an Intel x86_64 cluster interconnected with Mellanox FDR/EDR InfiniBand and has a 1.3 PB IBM Spectrum Scale global parallel filesystem available to all nodes. The system requires UWyo two-factor authentication (2FA) for login via SSH. The default shell is Bash, and the Lmod module system provides dynamic user environments that make it quick and easy to switch software stacks. The Slurm workload manager schedules jobs, enforces submission limits, implements fair share, and provides Quality of Service (QoS) levels for research groups who have invested in the cluster.
Teton has a Digital Object Identifier (DOI) (https://doi.org/10.15786/M2FY47) and we request that all use of Teton appropriately acknowledges the system. Please see Citing Teton for more information.
Available Nodes
See Partitions for information regarding Slurm Partitions on Teton.
Available Nodes
Type | Series | Arch | Count | Sockets | Cores | Threads / Core | Clock (GHz) | RAM (GB) | GPU Type | GPU Count | Local Disk Type | Local Disk Capacity (GB) | IB Network | Operating System | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Teton Regular | Intel Broadwell | x86_64 | 180 | 2 | 32 | 1 | 2.1 | 128 | N/A | N/A | SSD | 240 | EDR | RHEL 7.6 | No longer available for purchase |
Teton Regular | Intel Cascade Lake | x86_64 | 15 | 2 | 40 | 1 | 2.1 | 128 | N/A | N/A | SSD | 240 | EDR | RHEL 7.6 | Current Model |
Teton BigMem GPU | Intel Broadwell | x86_64 | 8 | 2 | 32 | 1 | 2.1 | 512 | NVIDIA P100 16G | 2 | SSD | 240 | EDR | RHEL 7.6 | No longer available for purchase |
Teton HugeMem | Intel Broadwell | x86_64 | 10 | 2 | 32 | 1 | 2.1 | 1024 | N/A | N/A | SSD | 240 | EDR | RHEL 7.6 | No longer available for purchase |
Teton KNL | Intel Knights Landing | x86_64 | 12 | 1 | 72 | 4 | 1.5 | 384 + 16 | N/A | N/A | SSD | 240 | EDR | RHEL 7.6 | No longer available for purchase |
Teton DGX | Intel Broadwell | x86_64 | 1 | 2 | 40 | 2 | 2.2 | 512 | NVIDIA V100 32G | 8 | SSD | 7 TB | EDR | Ubuntu 18.04.2 LTS | Available as special order |
Moran Regular | Intel Sandy Bridge/Ivy Bridge | x86_64 | 283 | 2 | 16 | 1 | 2.6 | 64 or 128 | K20 on some | 2 | HD | 1 TB | FDR | RHEL 7.6 | No longer available for purchase |
Moran BigMem | Intel Sandy Bridge/Ivy Bridge | x86_64 | 2 | 2 | 16 | 1 | 2.6 | 512 | K80 | 8 | HD | 1 TB | FDR | RHEL 7.6 | No longer available for purchase |
Moran Debug | Intel Sandy Bridge/Ivy Bridge | x86_64 | 2 | 2 | 16 | 1 | 2.6 | 64 | K20m | 2 | HD | 1 TB | FDR | RHEL 7.6 | No longer available for purchase |
Moran HugeMem | Intel Sandy Bridge/Ivy Bridge | x86_64 | 2 | 2 | 16 | 1 | 2.6 | 1024 | K20 | 2 | HD | 1 TB | FDR | RHEL 7.6 | No longer available for purchase |
Moran DGX | Intel Broadwell | x86_64 | 1 | 2 | 40 | 2 | 2.2 | 512 | NVIDIA V100 16G | 8 | SSD | 7 TB | EDR | Ubuntu 18.04.2 LTS | Available as special order |
TOTAL Nodes | | | 516 | | | | | | | | | | | | |
Global Filesystems
The Teton global parallel filesystem is configured with a 160 TB SSD tier for active data and a 1.2 PB HDD capacity tier for less-used data. The system policy engine moves data automatically between pools (disks and tiers), migrating data to HDD when the SSD tier reaches 70% used capacity. Teton has several spaces available to users, described below.
home - /home/username ($HOME)
Space for configuration files and software installations. This file space is intended to be small and always resides on SSDs. The /home file space is snapshotted so that accidental deletions can be recovered.
project - /project/project_name/[username]
Space to collaborate among project members. Data here is persistent and is exempt from the purge policy. No snapshots.
gscratch - /gscratch/username ($SCRATCH)
Space for individual users to perform computing. Data here is subject to the purge policy described below. Warning emails will be sent before files become subject to deletion. No snapshots.
Global Filesystems
Filesystem | Quota (GB) | Snapshots | Backups | Purge Policy | Additional Info |
---|---|---|---|---|---|
home | 25 | Yes | No | No | Always on SSD |
project | 1024 | No | No | No | Aging Data will move to HDD |
gscratch | 5120 | No | No | Yes | Aging Data will move to HDD |
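To see how close a directory is to one of the quotas in the table above, a quick check with standard tools can help. The following is only a sketch under stated assumptions: the helper name `usage_pct` is hypothetical (it is not an ARCC tool), usage is measured with `du`, which may differ from the filesystem's own quota accounting, and the quota value is typed in by hand from the table.

```shell
#!/bin/bash
# Hypothetical helper (not an ARCC tool): report a directory's disk usage
# as a percentage of a quota given in GB. Usage is measured with du, which
# may differ from the filesystem's own quota accounting.
usage_pct() {
    local dir=$1 quota_gb=$2
    local used_kb
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    # 1 GB = 1024 * 1024 KB
    awk -v u="$used_kb" -v q="$quota_gb" \
        'BEGIN { printf "%.1f\n", 100 * u / (q * 1024 * 1024) }'
}

# Example (commented out so that sourcing the snippet scans nothing):
# echo "home: $(usage_pct "$HOME" 25)% of the 25 GB quota used"
```

Running `du` over a large tree can take a while, so a check like this is best run occasionally rather than from a login script.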
Purge Policy
File spaces within the Teton cluster filesystem may be subject to a purge policy. The policy has not yet been defined; however, ARCC reserves the right to purge data in this area after 30 to 90 days without access or since creation. Before any purge takes place, the owners of the affected files will be notified by email several times.
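Since the purge window is expressed in days without access, it can be useful to list files that have gone untouched for a while. The sketch below is an illustration only: the helper name `purge_candidates` is hypothetical, it relies on access times reported by `find`, and it merely approximates what a purge might consider. The actual selection is up to ARCC.

```shell
#!/bin/bash
# Hypothetical helper: list regular files under a directory that have not
# been accessed for more than N days (N defaults to 30, the low end of the
# 30 to 90 day window mentioned above).
purge_candidates() {
    local dir=$1 days=${2:-30}
    find "$dir" -type f -atime +"$days" -print
}

# Example (commented out so that sourcing the snippet scans nothing):
# purge_candidates "/gscratch/$USER" 30
```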
Storage Increases on Teton
Project PIs can purchase additional scratch and/or project space at a cost of $100 / TB / year.
Additionally, PIs can request allocation increases for scratch and/or project space at no cost by submitting a proposal. Proposals must be renewed when substantial cluster or storage changes occur and should describe:
the scientific gain and insights that will be or have been obtained by using the system,
how data is organized and accessed in efforts to maximize performance and usage.
Projects are limited to 1 no-cost increase.
To request more information, please contact ARCC.
Special Filesystems
Certain filesystems exist on different nodes of the cluster where specialized requirements exist. The table below summarizes these specialized filesystems.
Specialty Filesystems
Filesystem | Mount Location | Notes |
---|---|---|
petaLibrary | /petalibrary/homes | Only on login nodes |
petaLibrary | /petalibrary/Commons | Only on login nodes |
node local scratch | /lscratch | Only on compute nodes, Moran is 1 TB HDD, Teton is 240 GB SSD |
memory filesystem | /dev/shm | RAM-based tmpfs available as part of RAM for very rapid I/O operations; small capacity |
The node-local scratch or lscratch filesystem is purged at the end of each job.
The memory filesystem can substantially improve the performance of small I/O operations. If you have single-node jobs with very intensive random-access I/O patterns, this filesystem may improve the performance of your compute job.
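Because node-local scratch is purged at the end of each job, anything written there has to be copied back to persistent storage before the job finishes. The following is a minimal sketch of that staging pattern, not a prescribed workflow. The helper names `make_workdir` and `run_staged` are hypothetical, and the fallback to a temporary directory is an assumption made so the snippet also runs on machines without /lscratch.

```shell
#!/bin/bash
# Sketch of staging work through fast node-local storage inside a job.
# Assumption: /lscratch exists only on compute nodes, so fall back to
# $TMPDIR or /tmp elsewhere. Helper names are hypothetical.
make_workdir() {
    local base=/lscratch
    [ -d "$base" ] && [ -w "$base" ] || base=${TMPDIR:-/tmp}
    mktemp -d "$base/job.XXXXXX"
}

run_staged() {
    # $1 = input file, $2 = destination directory on persistent storage
    local input=$1 dest=$2 work
    work=$(make_workdir) || return 1
    cp "$input" "$work/" || return 1
    # ... run the I/O-heavy step here, reading and writing inside $work ...
    # Copy results back before the job ends; here the staged file stands in
    # for the real output.
    cp "$work/$(basename "$input")" "$dest/" || return 1
    rm -rf "$work"
}
```

The same pattern applies to /dev/shm, with the caveat from the table that its capacity is small and comes out of the node's RAM.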
The petaLibrary filesystems are only available from the login nodes, not on the compute nodes. Storage space on the Teton global filesystems does not imply storage space on the ARCC petaLibrary, or vice versa. For more information, see the petaLibrary page.
The Bighorn filesystems will be provided for a limited amount of time in order for researchers to move data to either the petaLibrary, Teton storage or to some other storage media. The actual date that these mounts will be removed is still TBD.
Project and Account Requests
For research projects, UWyo faculty members (Principal Investigators) can request that a project be created on Teton. PIs can then grant access to the project to UWyo students, faculty, and external collaborators. User accounts on Teton require a valid UWyo e-mail address and a UWyo-affiliated PI sponsor. UWyo faculty members can sponsor their own accounts, while students, post-doctoral researchers, and research associates must use their PI as their sponsor. Non-UWyo external collaborators must be sponsored by a current UWyo faculty member.
See the Account Policy page for additional information and policy statements on account usage. Use the link under "Account Requests" to request that a project or user(s) be created or added. From the same page, you can request that users be added to an existing project.
...
To request access for instructional use, send an email to arcc-info@uwyo.edu with the course number, section, and student list. If the PI prefers, generic accounts can be created instead of providing a student list. Instructional accounts are usually valid for a single semester, and access to the project is terminated at the beginning of the next semester.
System Access
SSH Access
Teton has login nodes for users to access the cluster. The login nodes are publicly available at the hostname teton.arcc.uwyo.edu or teton.uwyo.edu. On macOS and Linux, SSH is available natively from the terminal through the ssh command. X11 forwarding is supported, but if you need graphical support we recommend using FastX if at all possible. If you require multiple terminal sessions, you may also want to configure your OpenSSH client to support connection multiplexing. If your network connectivity is unreliable, consider running tmux or screen after you log in; either keeps sessions alive during disconnects so that you can reconnect to them later.
Code Block | ||
---|---|---|
| ||
ssh USERNAME@teton.arcc.uwyo.edu
ssh -l USERNAME teton.arcc.uwyo.edu
ssh -Y -l USERNAME teton.arcc.uwyo.edu # Trusted X11 forwarding (not subject to the X11 SECURITY extension controls)
ssh -X -l USERNAME teton.arcc.uwyo.edu # X11 forwarding |
OpenSSH Configuration File (BSD,Linux,macOS)
By default, the OpenSSH user configuration file is $HOME/.ssh/config, which can be edited to streamline your workflow. Because Teton uses round-robin DNS to provide access to two login nodes and requires two-factor authentication, it can be advantageous to add SSH multiplexing to your local environment so that subsequent connections go to the same login node. It also shortens the hostname used for SSH, SCP, SFTP, and rsync. An example entry looks like the following, where USERNAME is replaced by your actual UWyo username:
Code Block |
---|
Host teton
    Hostname teton.arcc.uwyo.edu
    User USERNAME
    ControlMaster auto
    ControlPath ~/.ssh/ssh-%r@%h:%p |
Note |
---|
WARNING: While ARCC allows SSH multiplexing, other research computing sites may not. Do not assume this will always work on systems not administered by ARCC. |
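For those who prefer to script the change, an entry like the one above can be appended to the configuration file only when it is not already there. This is a sketch rather than an ARCC-supplied tool: the function name `add_teton_host` is hypothetical, and it assumes the control socket should live under ~/.ssh.

```shell
#!/bin/bash
# Hypothetical helper: append a "Host teton" entry to an OpenSSH client
# config file, but only if no such entry exists yet.
# $1 = config file (normally ~/.ssh/config), $2 = your UWyo username.
add_teton_host() {
    local cfg=$1 user=$2
    mkdir -p "$(dirname "$cfg")"
    touch "$cfg"
    chmod 600 "$cfg"
    if grep -q '^Host teton$' "$cfg"; then
        return 0    # entry already exists; leave the file untouched
    fi
    cat >> "$cfg" <<EOF
Host teton
    Hostname teton.arcc.uwyo.edu
    User $user
    ControlMaster auto
    ControlPath ~/.ssh/ssh-%r@%h:%p
EOF
}

# Example (commented out so that sourcing the snippet changes nothing):
# add_teton_host "$HOME/.ssh/config" USERNAME
```

With the entry in place, `ssh teton` and `scp somefile teton:` use the short host name, and later connections reuse the first authenticated session.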
Access from Microsoft Windows
ARCC currently recommends that users install MobaXterm to access the Teton cluster. It provides appropriate access to the system with SSH and SFTP capability, and supports X11 if required. The home version of MobaXterm should be sufficient. PuTTY is also available if a more minimal application is desired.
Additional options include a Cygwin installation with SSH installed, the Windows Subsystem for Linux with an OpenSSH client installed, or, on very recent versions of Windows, the built-in OpenSSH client once it is enabled. Finally, a great alternative is to use our FastX capability.
FastX Access
If you are currently on the UW campus, you can also use FastX, which provides a more robust remote graphics capability through a web browser or through an installable client. To use the browser, navigate to https://fastx.arcc.uwyo.edu and log in with your 2FA credentials. Native FastX clients for Windows, macOS, and Linux are available for download. For more information, see the documentation on using FastX.
Available Shells
Teton has several shells available for use. The default is bash. To change your default shell, please submit a request through the standard ARCC request methods.
Shell | Path | Version | Notes |
---|---|---|---|
bash | /bin/bash | 4.2.46 | Recommended |
zsh | /bin/zsh | 5.0.2 | |
csh | /bin/csh | 6.18.01 | Implemented by TCSH |
tcsh | /bin/tcsh | 6.18.01 | |
GPUs and Accelerators
The ARCC Teton cluster has a number of compute nodes that contain GPUs. This section describes the hardware, as well as access and usage of the GPU nodes.
Teton GPU Hardware
The following tables list each node that has GPUs and the type of GPU installed.
Table #1
Node | Partition | GPU Type | Number of Devices | GPU Memory Size (GB) | Compute Capability | GRES Flag | Teton Partition | Notes |
---|---|---|---|---|---|---|---|---|
m025 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m026 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m027 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m028 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m029 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m030 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m031 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m032 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m075 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m076 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m077 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m078 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m079 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m080 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m081 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m082 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m083 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | Disabled due to memory ECC errors |
m084 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | Disabled due to memory ECC errors |
m085 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | Disabled due to memory ECC errors |
m086 | moran | K20m | 2 | 4 | 3.5 | gpu:k20m:{1-2} | Yes | |
m219 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m220 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m227 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m228 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m235 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m236 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m243 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m244 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m251 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m252 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m259 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m260 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m267 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
m268 | moran | K20Xm | 2 | 5 | 3.5 | gpu:k20xm:{1-2} | Yes | |
mdbg01 | moran | GTX Titan X | 1 | 12 | 5.2 | gpu:TitanX:{1-1} | Yes | |
mdbg01 | moran | GTX Titan | 2 | 6 | 6.0 | gpu:Titan:{1-2} | Yes | |
mdbg02 | moran | K40c | 2 | 11 | 3.5 | gpu:k40c:{1-2} | Yes | |
mdbg02 | moran | GTX Titan X | 2 | 12 | 5.2 | gpu:TitanX:{1-2} | Yes | |
mbm01 | moran-gpu | K80 | 8 | 11 | 3.7 | gpu:k80:{1-8} | No | |
mbm02 | moran-gpu | K80 | 8 | 11 | 3.7 | gpu:k80:{1-8} | No | |
tbm03 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm04 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm05 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm06 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm07 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm08 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm09 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
tbm10 | teton-gpu | Tesla P100 | 2 | 16 | 6.0 | gpu:P100:{1-2} | No | |
The following two GPU nodes are reserved for AI use.
Table #2
Node | Partition | GPU Type | Number of Devices | GPU Memory Size (GB) | Compute Capability | GRES Flag | Teton Partition | Notes |
---|---|---|---|---|---|---|---|---|
mdgx01 | dgx | Tesla V100 | 8 | 16 | 7.0 | gpu:V100-16g:{1-8} | No | |
tdgx01 | dgx | Tesla V100 | 8 | 32 | 7.0 | gpu:V100-32g:{1-8} | No | |
For additional information about CUDA programming, visit NVIDIA's CUDA C Programming Guide.
Access and Running Jobs
There are three different types of GPU nodes in the Teton cluster, and they are requested in somewhat different ways.
Public Nodes: Public nodes are available to the general user and are in the "teton" partition. These nodes are marked "Yes" in the Teton Partition column of Table #1. Use the following partition request to access these nodes.
Code Block |
---|
--partition=teton |
Reserved Nodes: Reserved nodes are available to the general user but must be specifically requested via a partition request, e.g. "teton-gpu". These nodes are marked "No" in the Teton Partition column of Table #1. Use the following partition request to access these nodes.
Code Block |
---|
--partition=teton-gpu |
Specialty Nodes: These nodes are available to special users and are requested via a partition request, e.g. "dgx"; see Table #2 above. Use the following partition request to access these nodes.
Code Block |
---|
--partition=dgx |
If one wants to access the GPU devices on a node one MUST explicitly specify the generic consumable resources flag ("gres" flag). The "gres" flag has the following syntax:
Code Block |
---|
--gres=<resource_type>:<resource_name>:<resource_count> |
where:
resource_type is always the string gpu for GPU devices.
resource_name is a string that describes the type of the requested GPU(s), e.g. k80, titanx, k20m, ....
resource_count is the number of GPU devices requested of the type resource_name. Its value is an integer in the closed interval [1, maximum number of devices on a node].
The "gres" flag for each type of node is listed in the GRES Flag column of Table #1. For example, the flag --gres=gpu:titanx:1 requests one (1) GTX Titan X device, which can only be satisfied by nodes that contain a GTX Titan X.
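When job submissions are generated by a script, the flag can be assembled from its parts instead of being typed by hand. The helper below is a hypothetical illustration (the name `make_gres` is not part of Slurm or of any ARCC tooling). It only checks that the count is a positive integer and does not know the per-node maximum.

```shell
#!/bin/bash
# Hypothetical helper: assemble a --gres option from a GPU type name and a
# device count, rejecting counts that are not positive integers.
make_gres() {
    local name=$1 count=$2
    case $count in
        ''|*[!0-9]*|0)
            echo "count must be a positive integer" >&2
            return 1
            ;;
    esac
    printf -- '--gres=gpu:%s:%s\n' "$name" "$count"
}

# Example: one GTX Titan X device
# make_gres titanx 1    # prints --gres=gpu:titanx:1
```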
If you run a job that requires GPUs and fail to specify the "gres" flag, your job will be assigned any node in the requested partition and may therefore not have access to GPUs.
To verify that your job has access to GPUs within a node, execute the following command:
Code Block |
---|
echo $CUDA_VISIBLE_DEVICES |
An empty output string implies NO access to the node's GPU devices.
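The same check can be placed at the top of a job script so that a job without GPU access stops early instead of running on the CPU. This is a sketch under the assumption that CUDA_VISIBLE_DEVICES holds a comma-separated list of devices. The function name `gpu_count` is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard for a GPU job script: print the number of devices
# listed in CUDA_VISIBLE_DEVICES and fail when the variable is empty.
gpu_count() {
    local devs=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$devs" ]; then
        echo 0
        return 1
    fi
    # Assumed format: a comma-separated list such as "0,1".
    echo "$devs" | awk -F, '{print NF}'
}

# Example use inside a job script:
# n=$(gpu_count) || { echo "no GPUs visible; check the gres request" >&2; exit 1; }
```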
Some programs are serial, or able to run only on a single GPU; other jobs perform better on one or a small number of GPUs and therefore cannot efficiently use all of the GPUs on a single node. To better utilize our GPU nodes, node sharing has been enabled. This allows multiple jobs to run on the same node, with each job assigned specific resources (number of cores, amount of memory, number of accelerators). The node resources are managed by the Slurm scheduler up to the maximum available on each node. Note that while efforts are made to isolate jobs running on the same node, many components of the system are still shared. Therefore a job's performance can be affected by other jobs running on the node at the same time.
Node sharing is used by requesting less than the full number of GPUs, CPUs, or memory on a node; it can be based on any one of these resources or on all three. By default, each job gets 3.5 GB of memory per core requested (the lowest common denominator among our cluster nodes), so to request a different amount of memory you must use the "--mem" flag. To request exclusive use of the node, use "--mem=0", which requests all of the node's memory.
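As a worked example of the default, the memory a job receives when "--mem" is omitted is simply 3.5 GB times the number of cores requested. The small helper below (a hypothetical name, shown only to make the arithmetic concrete) computes that figure.

```shell
#!/bin/bash
# Default memory for a job, assuming the 3.5 GB-per-core default stated above.
default_mem_gb() {
    local cores=$1
    awk -v c="$cores" 'BEGIN { printf "%.1f\n", 3.5 * c }'
}

# default_mem_gb 4     # prints 14.0
# default_mem_gb 32    # prints 112.0
```

A four-core job therefore gets 14 GB by default; a job that needs more per core has to say so with "--mem".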
Example #1
An example script that requests two Teton nodes with two K20m GPUs each, including all cores and all memory, running one GPU per MPI task, would look like this:
Code Block |
---|
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --mem=0
#SBATCH --partition=teton
#SBATCH --account=<account>
#SBATCH --gres=gpu:k20m:2
#SBATCH --time=1:00:00
# ... other job prep
srun myprogram.exe |
...
To request all eight K80 GPUs on a Teton node, again using one GPU per MPI task, we would use the following. Per Table #1, the K80 nodes are in the "moran-gpu" partition:
Code Block |
---|
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --partition=moran-gpu
#SBATCH --account=<account>
#SBATCH --gres=gpu:k80:8
#SBATCH --time=1:00:00
# ... other job prep
srun myprogram.exe |
...
As another example, the job script below requests four GPUs, four CPU cores, and 8 GB of memory. The remaining GPUs, CPUs, and memory remain available to other jobs. Per Table #1, the K80 nodes are in the "moran-gpu" partition.
Code Block |
---|
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=8G
#SBATCH --partition=moran-gpu
#SBATCH --account=<account>
#SBATCH --gres=gpu:k80:4
#SBATCH --time=00:30:00
# ... other job prep
srun myprogram.exe |
...
To run a parallel interactive job with MPI, do not use the usual "srun" command, as this does not work properly with the "gres" request. Instead, use the "salloc" command, e.g.
Code Block |
---|
salloc -n 1 -N 1 -t 1:00:00 -A <account> -p teton-gpu --gres=gpu:p100:1 |
This allocates the resources to the job but keeps the prompt on the login node. You can then use the "srun" or "mpirun" commands to launch the calculation on the allocated compute node resources.
For serial jobs, utilizing one or more GPUs, "srun" works properly, e.g.
Code Block |
---|
srun -n 1 -N 1 -t 1:00:00 -A <account> -p teton-gpu --gres=gpu:p100:1 --pty /bin/bash -l |
GPU Programming Environment
NVIDIA CUDA, PGI CUDA Fortran, and the OpenACC compilers are installed on Teton. The default CUDA version is 9.2.88, which at the time of writing is the most recent. You can access it by loading the CUDA module with "module load cuda". The PGI compilers come with their own, quite recent CUDA, which can be accessed by loading the PGI module with "module load pgi".
Because the CUDA tools are available on the login nodes, any login node can be used to compile CUDA code. The PGI compilers come with their own CUDA, so compiling should work anywhere you can load the PGI module.
To compile CUDA code using the CUDA compiler "nvcc" so that it runs on all types of GPUs that ARCC has, use the following compiler flags:
Code Block | ||
---|---|---|
| ||
-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 |
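Typing the flags above by hand is error-prone, so a build script may prefer to generate them from a list of compute capabilities. The helper below is a hypothetical sketch (the name `gencode_flags` and the file names in the usage comment are illustrative only). It reproduces the flag pattern shown above for whatever capabilities it is given.

```shell
#!/bin/bash
# Hypothetical helper: build the -gencode flags for a list of compute
# capabilities, given without the dot (e.g. 35 for 3.5).
gencode_flags() {
    local cc flags=""
    for cc in "$@"; do
        flags+=" -gencode arch=compute_${cc},code=sm_${cc}"
    done
    echo "${flags# }"
}

# Example with the five capabilities listed above (file names are placeholders):
# nvcc $(gencode_flags 35 37 52 60 70) -o myprogram.exe myprogram.cu
```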
For more info on the CUDA compilation and linking flags, please have a look at this guide.
The PGI compilers specify the GPU architecture with the -ta=tesla flag. If no further option is specified, the flag generates code for all available compute capabilities (at the time of writing cc35, cc37, cc50, cc60, and cc70). To be specific for each GPU:
GPU Type | Compiler Flag |
---|---|
K20m | |
K20Xm | |
Titan | |
Titan X | |
K40c | |
K80 | |
P100 | |
V100 | |
To invoke OpenACC, use the "-acc" flag. More information on OpenACC can be obtained at http://www.openacc.org.
A good tutorial on GPU programming is available at the CUDA Education and Training site from Nvidia.
When running GPU code, it is worth checking the resources that the program is using to ensure that the GPU is well utilized. To do so, run the nvidia-smi command and watch the memory and GPU utilization. nvidia-smi can also query and set various features of the GPU; see "nvidia-smi --help" for all the options that the command accepts.
For example, "nvidia-smi -L" lists the GPUs in a node. On Teton node m025 you should see:
Code Block |
---|
userX@m025:~# nvidia-smi -L
GPU 0: Tesla K20m (UUID: GPU-2e23ddef-1d96-7894-102a-0458da3faaa4)
GPU 1: Tesla K20m (UUID: GPU-458a86ec-09cd-64d1-475a-d36dc0a73b4f) |
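These nvidia-smi checks can also be scripted. The sketch below shows two hypothetical helpers (the names are not part of any NVIDIA or ARCC tooling): one counts devices from `nvidia-smi -L` style output, and one polls utilization and memory use at a fixed interval using nvidia-smi's CSV query mode. The second only works on a node where nvidia-smi is installed.

```shell
#!/bin/bash
# Hypothetical helper: count GPUs from `nvidia-smi -L` style output on stdin.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Hypothetical helper: print GPU utilization and memory use every N seconds
# (default 5) until interrupted. Requires nvidia-smi on the current node.
watch_gpus() {
    local interval=${1:-5}
    command -v nvidia-smi >/dev/null 2>&1 || {
        echo "nvidia-smi not found on this node" >&2
        return 1
    }
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv -l "$interval"
}

# On a GPU node:
# nvidia-smi -L | count_gpus
# watch_gpus 10
```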
Debugging
NVIDIA's CUDA distribution includes a terminal debugger named cuda-gdb. Its operation is similar to that of the GNU gdb debugger. For details, see the cuda-gdb documentation.
For out-of-bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.
The Allinea DDT debugger that we currently license also supports CUDA and OpenACC debugging. Because of its user-friendly graphical interface, we recommend it for GPU debugging. For information on how to use DDT, see our debugging page.
Profiling
Profiling can be very useful in finding GPU code performance problems, for example inefficient GPU utilization or poor use of shared memory. NVIDIA CUDA provides both a command line profiler (nvprof) and a visual profiler (nvvp). More information is in the CUDA profilers' documentation.
...