MPI

Disclaimer: This is NOT a course on learning MPI. This is a very basic introduction to what MPI is and how to use it across our clusters.

What is MPI?

The Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on parallel computing architectures. Versions of the standard and their specifications can be found at the MPI Forum.

This message-passing library interface specification comes in the form of an API, with a number of implementations that follow it.

Available Implementations

Currently on Beartooth we provide:

  • The OpenMPI and Intel oneAPI library implementations of the MPI standard.

  • The installed version of NVidia’s HPC SDK also comes with an MPI implementation, but this has not been tested.

Other implementations, not installed, include MPICH.

Basic Concept

MPI is not a language; it comes in the form of an API with programming language bindings for C and Fortran. Bindings for other languages might be available through specific implementations of the MPI specification - for example, OpenMPI provides bindings for Java - but this is not standard.

Wrapper packages/libraries also exist for other languages, such as MPI for Python and Rmpi, with these running on top of the installed system-level MPI libraries.

Many domain-specific applications (GROMACS, LAMMPS, OpenFOAM, Quantum-Espresso, VASP…) come with MPI implemented under the hood, allowing their specific computations to be run across multiple compute nodes – which removes the need for end users to learn/implement MPI themselves.

When to Use?

In a nutshell, MPI provides you with the ability to break your program up into dependent/independent tasks that can be run on a single and/or multiple compute nodes (connected over a network), allowing parallel/distributed implementations. It provides the mechanisms for dividing up your program, initializing tasks across multiple compute nodes, passing data/values back and forth between these compute nodes over the network, and performing I/O in parallel.

Using MPI you can extend your applications (where you have access to the source code) to implement parallel functionality.
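
To give a feel for what this looks like in practice, below is a minimal sketch (not tied to any particular cluster) of the skeleton that essentially every MPI C program follows: start the MPI runtime, find out how many processes are running and which one you are, do some work, then shut down.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    // Start the MPI runtime for this process.
    MPI_Init(&argc, &argv);

    int num_of_procs, task_id;

    // How many processes make up this job, and which one is this?
    MPI_Comm_size(MPI_COMM_WORLD, &num_of_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

    printf("Hello from task %d of %d\n", task_id, num_of_procs);

    // Shut the MPI runtime down cleanly.
    MPI_Finalize();
    return 0;
}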

Example 01:

For example, consider having 10 billion numbers within an array that you want to sum together. You can write a program in C, using an MPI library implementation, that breaks this array into ten separate arrays of one billion numbers each, passes each of these arrays to its own compute node where its one billion numbers are summed, then retrieves these ten sums and adds them together to calculate the final result.

So, rather than summing 10 billion numbers on a single machine, we can concurrently sum ten sets of one billion numbers, each on their own machine, and then bring the results together - which in theory will be a lot quicker.

Example 02:

In the example above, each of the compute nodes performed the same task - summing numbers. This doesn't have to be the case. If your overall task can be divided up into a number of unique, independent sub-tasks that don't have to run in serial (one after the other), then using MPI you could distribute these sub-tasks each to their own compute node (where they run independently and concurrently) and then wait for them all to finish.
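
As a rough sketch of this pattern (the sub-task names and bodies below are purely illustrative placeholders), each process can pick a different piece of work based on its rank, and everyone then waits at a barrier until all sub-tasks have finished:

#include <mpi.h>
#include <stdio.h>

// Hypothetical independent sub-tasks; the names and bodies are placeholders only.
static void preprocess_inputs(void)  { printf("pre-processing inputs\n"); }
static void build_lookup_table(void) { printf("building lookup table\n"); }
static void generate_report(void)    { printf("generating report\n"); }

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each rank (process) picks up a different, independent sub-task.
    switch (rank) {
        case 0: preprocess_inputs();  break;
        case 1: build_lookup_table(); break;
        case 2: generate_report();    break;
        default: break;               // any extra ranks have nothing to do
    }

    // Wait here until every sub-task has finished.
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("all sub-tasks complete\n");

    MPI_Finalize();
    return 0;
}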

Example 03:

Workflows can use MPI at various stages to break up and parallelize the computation. For example, you could have a workflow where the first stage is to take a surface and initialize it by breaking the surface up into subsections. The second stage could then perform some form of computation on each subsection, in parallel, with each subsection being processed on a separate compute node.
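
A hedged sketch of the second stage of such a workflow, using MPI's collective operations: the master describes each subsection with a single placeholder value, MPI_Scatter hands one subsection to each process, each process computes on its piece, and MPI_Gather collects the results back on the master. The "computation" here is a stand-in only.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Stage 1: the master describes each subsection of the surface with a
    // single value (a stand-in for real surface data).
    double *subsections = NULL;
    if (rank == 0) {
        subsections = malloc(nprocs * sizeof(double));
        for (int i = 0; i < nprocs; i++) subsections[i] = (double)(i + 1);
    }

    // Stage 2: hand one subsection to each process ...
    double my_section;
    MPI_Scatter(subsections, 1, MPI_DOUBLE, &my_section, 1, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    // ... each process works on its subsection independently and concurrently.
    double my_result = my_section * my_section;   // placeholder "computation"

    // Collect the per-subsection results back on the master.
    double *results = NULL;
    if (rank == 0) results = malloc(nprocs * sizeof(double));
    MPI_Gather(&my_result, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            printf("subsection %d -> result %.1f\n", i, results[i]);
        free(results);
        free(subsections);
    }

    MPI_Finalize();
    return 0;
}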

Typical Workflow

A typical simplistic workflow, based on using the C programming language, would look something like the following:

  1. Decide on how to use MPI to parallelize your computation.

  2. Load the MPI library implementation you’re using: See the Available Implementations above and the Module System: LMOD.

  3. Code up your implementation including the mpi.h header file.

  4. Compile your source code linking in the implementation specific MPI library.

  5. Decide on the number of nodes.

  6. Run an interactive job using salloc or submit to the cluster job queue using sbatch: See Slurm Workload Manager.

  7. Since this is an MPI application you’ll need to run your application using srun:
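
For example, from within a job allocation (the executable name below is just a placeholder):

[]$ srun ./my_mpi_app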

If you are re-running your executable (i.e. not having to re-compile) then you only need to follow steps 2, 6 and 7.

Example Code: MPI for Parallel Summation

This example:

  1. Fills an array with the numbers from 1 to 100.

  2. This array is copied over the network to all the compute nodes requested within the job.

  3. Each compute node will sum a sub-section of the array and return its result back to the master node.

  4. The master node will then sum these sub-section totals and provide the final total.

01: Design and Implement Source Code

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
#define ARRAY_SIZE 100

    int array[ARRAY_SIZE];
    int i, ierr;
    int master_task = 0;
    int task_id;
    int num_of_procs;
    int sum_task, sum_all;

    // Initialize MPI.
    ierr = MPI_Init(&argc, &argv);
    if (ierr != 0) {
        printf("ERROR: MPI_Init returns nonzero:%d\n", ierr);
        exit(1);
    }

    // Get the number of processes.
    MPI_Comm_size(MPI_COMM_WORLD, &num_of_procs);

    // Determine the rank of this task/process.
    MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

    if (task_id == master_task) {
        printf("Master: Number of tasks (processes) available is %d.\n", num_of_procs);
    }

    // The master process initializes the array.
    if (task_id == master_task) {
        for (i = 0; i < ARRAY_SIZE; i++) {
            array[i] = i + 1;
        }
    }

    // The master process broadcasts the computed initial values
    // to all the other processes.
    MPI_Bcast(array, ARRAY_SIZE, MPI_INT, master_task, MPI_COMM_WORLD);

    // Each process adds up its sub section of the array.
    int batch_size = ARRAY_SIZE / num_of_procs;
    int start_idx = task_id * batch_size;
    int end_idx = (task_id + 1) * batch_size;
    printf("ID:%d : batch_size=%d : start_idx=%d : end_idx=%d\n",
           task_id, batch_size, start_idx, end_idx - 1);

    sum_task = 0;
    for (i = start_idx; i < end_idx; i++) {
        sum_task = sum_task + array[i];
    }
    printf("ID:%d : sum:%d\n", task_id, sum_task);

    // Each worker process sends its sum back to the master process.
    // The master process sums all the results together.
    MPI_Status status;
    if (task_id != master_task) {
        MPI_Send(&sum_task, 1, MPI_INT, master_task, 1, MPI_COMM_WORLD);
    } else {
        sum_all = sum_task;
        for (i = 1; i < num_of_procs; i++) {
            MPI_Recv(&sum_task, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
            sum_all = sum_all + sum_task;
        }
    }

    if (task_id == master_task) {
        printf("Master: ID:%d : Total sum:%d\n", task_id, sum_all);
    }

    // Terminate MPI.
    MPI_Finalize();
    if (task_id == master_task) {
        printf("Master: ID:%d : Terminating Program.\n", task_id);
    }

    return 0;
}

This is a modified version of an example that can be found here: sum_mpi.c

02: Compile Source Code

Implementing this on the Beartooth cluster, the following GNU compiler and OpenMPI library were used:

[]$ module load gcc/12.2.0 openmpi/4.1.4

# mpicc is a wrapper around the gcc compiler that automatically links the openmpi libraries for you.
[]$ mpicc -O3 sum_ex.c -o sum_ex

03: Allocate an Interactive Session

Due to the nature of this simple example, we need to request X nodes, where X divides exactly into the size of the array - which is set to 100 elements.

# Replace <your-account> with the name of the project you are attached to on the cluster.
[]$ salloc --account=<your-account> --time=5:00 --nodes=5
salloc: Granted job allocation 12906560
salloc: Nodes mtest2,ttest[01-04] are ready for job

# The actual nodes allocated might be different - the above simply shows the type of output to expect.

04: Run the Executable
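
From within the interactive allocation (and with the gcc and openmpi modules from step 02 still loaded), launch the executable with srun:

[]$ srun ./sum_ex

Each task prints its batch size, index range and partial sum, and the master task reports the final total - which, for the numbers 1 to 100, should be 5050. The ordering of the output lines from the different tasks will vary from run to run.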

03/04: Alternative: Submission Script

If you want to submit your job to the Slurm queue, your submission script would take the form:
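
A minimal sketch of such a script - saved here under the illustrative name run_sum.sh, with <your-account> replaced by your project - might be:

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=5:00
#SBATCH --nodes=5
#SBATCH --job-name=sum_ex

# Load the same modules that were used to compile the executable.
module load gcc/12.2.0 openmpi/4.1.4

# Launch the MPI application across the allocated nodes.
srun ./sum_ex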

Submit your job using:
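
# Assuming the illustrative script name from the sketch above.
[]$ sbatch run_sum.sh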