MPI jobs

For MPI parallel jobs you must specify in your resource request how many CPUs you need.
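A minimal PBS job-script sketch (the node counts, walltime, and program name are placeholders, not values from this page):

```shell
#!/bin/bash
# Sketch of a PBS script requesting CPUs for an MPI job:
# 2 nodes with 4 CPUs each = 8 CPUs total (placeholder values).
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
mpirun -np $PBS_NP ./my_program
```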

Running MPI jobs with OpenMPI and MVAPICH2

Mercer has two implementations of MPI: OpenMPI and MVAPICH2. Their performance and behavior are broadly similar, but each has some particulars to consider:

  • With OpenMPI, processes may change which CPU (within the node) they use during execution. This can reduce performance, so OpenMPI jobs should be started with the --bind-to-core flag:

    mpirun --bind-to-core -np $PBS_NP ./my_program

  • With either MPI implementation, the default behavior is to launch each process on the next available CPU. For hybrid MPI/OpenMP programs, this causes incorrect behavior, as each MPI process needs multiple CPUs for its OpenMP threads. 

    For example, if you need two MPI processes with 2 OpenMP threads each, you would request -l nodes=2:ppn=2. Your job might then be allocated 2 CPUs on node 14-0 and 2 CPUs on node 14-1. Your nodefile ($PBS_NODEFILE) will then look like:

    compute-14-0.local
    compute-14-0.local
    compute-14-1.local
    compute-14-1.local

    MPI uses this file to place its processes, so by default it will place the first process on compute-14-0.local and the second process also on compute-14-0.local (the next free slot in the nodefile). But for a hybrid program, each MPI process has multiple threads (in this case, 2). Threads cannot cross host boundaries, so you will have 4 threads sharing the 2 CPUs of compute-14-0.local, while compute-14-1.local sits idle.

     To get the correct behavior with OpenMPI, add --bynode to the mpirun command:

    mpirun --bind-to-core --bynode -np $PBS_NP ./my_program
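Note that $PBS_NP counts every allocated CPU (4 in the example above), while -np should be the number of MPI processes. A hybrid launch sketch (the arithmetic and OMP_NUM_THREADS export are an assumption about intent, not taken from this page; ./my_program is a placeholder):

```shell
# Hybrid MPI/OpenMP launch with OpenMPI (sketch).
# With -l nodes=2:ppn=2, $PBS_NP = 4 allocated CPUs.
export OMP_NUM_THREADS=2

# Number of MPI ranks = total CPUs / threads per rank (4 / 2 = 2 here).
NP=$((PBS_NP / OMP_NUM_THREADS))

mpirun --bind-to-core --bynode -np $NP ./my_program
```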

    With MVAPICH2, you must use a modified nodefile to reserve CPUs for OpenMP threads. The following idiom is effective:
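One common approach (a hedged sketch, not necessarily the exact idiom this page originally showed) is to filter $PBS_NODEFILE so that only every OMP_NUM_THREADS-th slot remains, leaving one entry per MPI process and reserving the intervening CPUs for threads:

```shell
# Sketch: thin out the nodefile so each MPI rank's node slot
# leaves room for its OpenMP threads (assumed idiom, not verbatim).
export OMP_NUM_THREADS=2

# Keep lines 1, 1+OMP_NUM_THREADS, 1+2*OMP_NUM_THREADS, ... of the nodefile.
awk -v n=$OMP_NUM_THREADS 'NR % n == 1' $PBS_NODEFILE > $PBS_JOBID.nodefile

# Launch with MVAPICH2's mpirun_rsh, one rank per remaining nodefile entry.
mpirun_rsh -np $(wc -l < $PBS_JOBID.nodefile) \
    -hostfile $PBS_JOBID.nodefile ./my_program
```

With the example nodefile above (two entries each for compute-14-0.local and compute-14-1.local) and OMP_NUM_THREADS=2, the filtered file holds one entry per node, so the two MPI processes land on different nodes with a spare CPU each for their second threads.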
