Goal: Introduce some further features, such as job efficiency and cluster utilization.

Job Efficiency

You can view the CPU and memory efficiency of a job by running the seff command with the job's <job-id>.

[]$ seff 13515489
Job ID: 13515489
Cluster: <cluster-name>
User/Group: <username>/<username>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 27.78% of 00:00:18 core-walltime
Job Wall-clock time: 00:00:18
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 8.00 GB (8.00 GB/node)

Note:

  • The figures are only accurate if the job completes successfully.

  • If the job fails, for example with an Out-Of-Memory (OOM) error, the details will be inaccurate.

  • This report is also emailed to you if you have Slurm email notifications turned on.
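
As a rough illustration of how the CPU Efficiency figure relates to the other fields in the seff output, the sketch below recomputes it from the sample values shown (5 seconds of CPU time, an 18 second wall-clock time, 1 core). The function names are hypothetical, and the formula is inferred from the "of 00:00:18 core-walltime" wording rather than taken from the seff source.

```python
def to_seconds(stamp: str) -> int:
    """Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in stamp:
        day_part, stamp = stamp.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """CPU time used, as a percentage of core-walltime (cores x wall-clock).

    Assumption: this mirrors how the sample output relates its fields.
    """
    core_walltime = to_seconds(wall_clock) * cores
    if core_walltime == 0:
        return 0.0
    return 100.0 * to_seconds(cpu_utilized) / core_walltime


# Values copied from the sample seff output.
print(f"{cpu_efficiency('00:00:05', '00:00:18', 1):.2f}%")  # prints 27.78%
```

A low percentage like this one suggests the job spent most of its wall-clock time not using its allocated core.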

...

What’s the Current Cluster Utilization?

There are several ways to see the current status of the cluster:

  • arccjobs: Prints a table showing active projects and jobs.

  • pestat: Prints a list of nodes with their allocated jobs; it can also query individual nodes.

  • sinfo: Shows the status of the Slurm partitions or nodes. Use the -R flag to list nodes that are down or drained, along with the reason.

  • OnDemand’s MedicineBow System Status page.

arccjobs example:
[]$ arccjobs
===============================================================================
Account                         Running                      Pending
  User                   jobs    cpus         cpuh    jobs    cpus         cpuh
===============================================================================
eap-amadson               500     500        30.42       3       3         2.00
  amadson                 500     500        30.42       3       3         2.00

eap-larsko                  1      32      2262.31       0       0         0.00
  fghorban                  1      32      2262.31       0       0         0.00

pcg-llps                    2      64      1794.41       0       0         0.00
  hbalantr                  1      32       587.68       0       0         0.00
  vvarenth                  1      32      1206.73       0       0         0.00

===============================================================================
TOTALS:                   503     596      4087.14       3       3         2.00
===============================================================================
Nodes                       9/51      (17.65%)
Cores                     596/4632    (12.87%)
Memory (GB)              2626/46952   ( 5.59%)
CPU Load                803.43        (17.35%)
===============================================================================
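
The percentages in the arccjobs summary are the used amount divided by the total. The sketch below recomputes them from the Nodes, Cores and Memory lines of the sample output. The function name and the regular expression are illustrative assumptions about the layout shown here, not part of arccjobs.

```python
import re

# Summary lines copied from the sample arccjobs output.
SUMMARY = """\
Nodes                       9/51      (17.65%)
Cores                     596/4632    (12.87%)
Memory (GB)              2626/46952   ( 5.59%)
"""


def utilization(summary: str) -> dict:
    """Return {label: percent used} for every 'label  used/total' line."""
    result = {}
    for line in summary.splitlines():
        match = re.match(r"^(.*?)\s+(\d+)/(\d+)\s", line)
        if match:
            label = match.group(1)
            used, total = int(match.group(2)), int(match.group(3))
            result[label] = round(100.0 * used / total, 2)
    return result


for label, percent in utilization(SUMMARY).items():
    print(f"{label}: {percent:.2f}%")
```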

pestat example:
[]$ pestat
Hostname          Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                               State Use/Tot  (15min)     (MB)     (MB)  JobID(JobArrayID) User ...
mba30-001            mb-a30    idle    0  96    0.00    765525   749441
mba30-002            mb-a30    idle    0  96    0.00    765525   761311
mba30-003            mb-a30    idle    0  96    0.00    765525   761189
...
mbl40s-004          mb-l40s    idle    0  96    0.00    765525   761030
mbl40s-005          mb-l40s    idle    0  96    0.00    765525   760728
mbl40s-007          mb-l40s    idle    0  96    0.00    765525   761452
wi001          inv-wildiris    idle    0  48    0.00    506997   505745
wi002          inv-wildiris    idle    0  48    0.00    506997   505726
wi003          inv-wildiris    idle    0  48    0.00    506997   505746
wi004          inv-wildiris    idle    0  48    0.00    506997   505729
wi005          inv-wildiris    idle    0  56    0.00   1031000  1020610
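
One way to read the Memsize and Freemem columns is to subtract them for a rough per-node figure of memory in use. The sketch below does this for two rows copied from the sample output. It assumes the column order shown above, and the function name is hypothetical. The difference also includes operating system and cache usage, so treat it as an estimate only.

```python
# Two rows copied from the sample pestat output.
PESTAT_ROWS = """\
mba30-001            mb-a30    idle    0  96    0.00    765525   749441
wi005          inv-wildiris    idle    0  56    0.00   1031000  1020610
"""


def memory_in_use(rows: str) -> dict:
    """Return {hostname: Memsize - Freemem in MB} for each node row.

    Assumes columns: hostname, partition, state, CPUs used, CPUs total,
    load, Memsize (MB), Freemem (MB), followed by an optional job list.
    """
    usage = {}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip headers, blank lines, and truncated rows
        try:
            # Strip a trailing '*' in case a value is flagged (assumption).
            memsize_mb = int(fields[6].rstrip("*"))
            freemem_mb = int(fields[7].rstrip("*"))
        except ValueError:
            continue  # not a node row
        usage[fields[0]] = memsize_mb - freemem_mb
    return usage


for host, used_mb in memory_in_use(PESTAT_ROWS).items():
    print(f"{host}: about {used_mb} MB in use")
```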

sinfo examples:
# View overall cluster:
[]$ sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"
CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
96      1023575  (null)        6/19/0/25       mbcpu-[001-025]
96      765525   gpu:a30:8     0/8/0/8         mba30-[001-008]
96      765525   gpu:l40s:8    1/4/0/5         mbl40s-[001-005]
96      765525   gpu:l40s:4    0/1/0/1         mbl40s-007
64      1023575  gpu:a6000:4   0/1/0/1         mba6000-001
48      506997   (null)        0/4/0/4         wi[001-004]
56      1031000  gpu:a30:2     0/1/0/1         wi005
96      1281554  gpu:h100:8    1/3/2/6         mbh100-[001-006]

# View a particular (investment) partition:
[]$ sinfo -p inv-wildiris
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
inv-wildiris    up   infinite      5   idle wi[001-005]

# View compute nodes that are currently drained or down, with the reason:
[]$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
HW Status: Unknown - slurm     2024-07-19T12:02:04 mbh100-001
Not responding       slurm     2024-07-30T13:49:06 mbh100-006
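
The NODES(A/I/O/T) column in the first sinfo example reports allocated, idle, other and total node counts for each row. As a small sketch, the code below sums that column over the sample rows to get cluster-wide totals. The function name is hypothetical, and the parsing assumes the five-column layout requested by the -O format string above.

```python
# Rows copied from the sample sinfo output.
SINFO_ROWS = """\
96      1023575  (null)        6/19/0/25       mbcpu-[001-025]
96      765525   gpu:a30:8     0/8/0/8         mba30-[001-008]
96      765525   gpu:l40s:8    1/4/0/5         mbl40s-[001-005]
96      765525   gpu:l40s:4    0/1/0/1         mbl40s-007
64      1023575  gpu:a6000:4   0/1/0/1         mba6000-001
48      506997   (null)        0/4/0/4         wi[001-004]
56      1031000  gpu:a30:2     0/1/0/1         wi005
96      1281554  gpu:h100:8    1/3/2/6         mbh100-[001-006]
"""

STATES = ("allocated", "idle", "other", "total")


def node_totals(rows: str) -> dict:
    """Sum the A/I/O/T counts (fourth column) across all rows."""
    totals = {state: 0 for state in STATES}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        counts = fields[3].split("/")
        if len(counts) != 4 or not all(c.isdigit() for c in counts):
            continue  # skip the header row
        for state, count in zip(STATES, counts):
            totals[state] += int(count)
    return totals


print(node_totals(SINFO_ROWS))
```

For the sample rows this gives 8 allocated, 41 idle and 2 other nodes out of 51 in total.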

...