Process Placement Hands-On Session (with Open MPI)

Getting started

  1. Create a working directory and cd to it.
  2. Choose two nodes among fourmi047, fourmi048, fourmi050, and fourmi051 (they will be referred to as fourmi0XX and fourmi0YY in this hands-on session)
  3. You need a special version of MPI: copy /tmp/jeannot/mpi_profile.module in your working directory
  4. Load the modules:
    > module load hardware/hwloc/1.10 compiler/gcc ./mpi_profile.module
  5. Write script.pbs (replace XX, YY, and <my working directory> with the correct values)
    ##PBS options
    #PBS -q formation
    #PBS -l nodes=fourmi0XX:ppn=8+fourmi0YY:ppn=8
    #PBS -o stdout.out
    #PBS -e stderr.out
    # Working directory
    cd <my working directory>
    # modules to load
    module load hardware/hwloc/1.10 compiler/gcc ./mpi_profile.module
    # Command to run
    lstopo -
  6. Test your script (to block until it has executed, you can use a bash until loop).
    > rm stdout.out;qsub script.pbs;until [ -e ./stdout.out ] ; do sleep 0.5; done; cat stdout.out
  7. Q0: write down the topology tree with core numbering (physical and logical) of the selected nodes.

Binding and mapping MPI programs

In the following, a line starting with > must be executed on the front-end node, while a line starting with $ must be added to the end of your PBS script.

  1. Look at /tmp/jeannot/check_bindings.c
  2. Q1: What does /tmp/jeannot/check_bindings.c do?
  3. Compile it:
    > mpicc /tmp/jeannot/check_bindings.c -o check_bindings -lhwloc
  4. Execute it by adding this line to the end of your PBS script (as indicated by the leading $):
    $ mpiexec -bind-to core -np 16 ./check_bindings
  5. Save your nodes in nodefile (keep it for the whole session)
    $ cat $PBS_NODEFILE > nodefile
  6. Create a rankfile from the nodefile
    rank 0=fourmi0XX slot=0
    rank 1=fourmi0XX slot=1
    rank 2=fourmi0XX slot=2
    rank 3=fourmi0XX slot=3
    rank 4=fourmi0XX slot=4
    rank 5=fourmi0XX slot=5
    rank 6=fourmi0XX slot=6
    rank 7=fourmi0XX slot=7
    rank 8=fourmi0YY slot=0
    rank 9=fourmi0YY slot=1
    rank 10=fourmi0YY slot=2
    rank 11=fourmi0YY slot=3
    rank 12=fourmi0YY slot=4
    rank 13=fourmi0YY slot=5
    rank 14=fourmi0YY slot=6
    rank 15=fourmi0YY slot=7
  7. Test the rankfile
    $ mpiexec -bind-to core -np 16 -rf rankfile ./check_bindings
  8. Q2: What is the difference between the two executions? (Hint: think of logical vs. physical core numbering)
  9. Shuffle rankfile
    rank 0=fourmi0YY slot=0
    rank 1=fourmi0XX slot=7
    rank 2=fourmi0XX slot=6
    rank 3=fourmi0XX slot=5
    rank 4=fourmi0XX slot=4
    rank 5=fourmi0XX slot=3
    rank 6=fourmi0XX slot=2
    rank 7=fourmi0XX slot=1
    rank 8=fourmi0XX slot=0
    rank 9=fourmi0YY slot=1
    rank 10=fourmi0YY slot=3
    rank 11=fourmi0YY slot=5
    rank 12=fourmi0YY slot=7
    rank 13=fourmi0YY slot=2
    rank 14=fourmi0YY slot=4
    rank 15=fourmi0YY slot=6
  10. Test it:
    $ mpiexec -bind-to core -np 16 -rf rankfile ./check_bindings
  11. Q3: Is the output what you expected?
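The rankfile built in step 6 can also be generated automatically instead of typed by hand. Here is a minimal awk sketch, assuming the nodefile lists one hostname per slot (as $PBS_NODEFILE does) with each node's slots contiguous; the demo nodefile below is illustrative:

```shell
# Sketch: generate an identity rankfile (rank i on local slot i of its node)
# from a PBS-style nodefile that lists one hostname per slot.
# Demo nodefile standing in for $PBS_NODEFILE (2 nodes x 2 slots here):
printf 'fourmi0XX\nfourmi0XX\nfourmi0YY\nfourmi0YY\n' > nodefile
# count[$1] tracks how many slots of each host have been emitted so far
awk '{ print "rank " NR-1 "=" $1 " slot=" count[$1]++ }' nodefile > rankfile
```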

Manage permutations

  1. You can create a script to generate a rankfile by permuting the nodes of the nodefile. Let σ be a permutation (e.g. 0,7,4,…): process i is executed on core σ(i) (e.g. process 0 on core 0, process 1 on core 7, process 2 on core 4, etc.). To apply a permutation to your nodefile you can use the /tmp/jeannot/bin/ script:
    > /tmp/jeannot/bin/ nodefile 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    > /tmp/jeannot/bin/ nodefile Identity
    > /tmp/jeannot/bin/ nodefile 0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15
  2. Look at the MPI program /tmp/jeannot/shuffle.c.
  3. Q4: what does this program do?
  4. Compile it:
    > mpicc /tmp/jeannot/shuffle.c -o shuffle
  5. Run it:
    $ mpiexec -bind-to core -np 16 ./shuffle
  6. Q5: based on your understanding of the code, what could be a good permutation of the processes to speed up the execution?
  7. Build the corresponding rankfile and test your solution. You should be able to divide the execution time by 10.
    > /tmp/jeannot/bin/ nodefile My_great_permutation > rankfile
    $ mpiexec -bind-to core -np 16 -rf rankfile ./shuffle
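If you prefer to roll your own generator instead of the provided script, here is a shell sketch of what a permutation-to-rankfile tool could do (the function name is illustrative; it assumes the nodefile lists one hostname per slot, with each node's slots contiguous):

```shell
#!/bin/sh
# Hypothetical sketch of a rankfile generator: rank i is placed on global
# core sigma(i), where sigma is given as a comma-separated list of cores.
gen_rankfile() {
  nodefile=$1; perm=$2; rank=0
  for core in $(printf '%s' "$perm" | tr ',' ' '); do
    # hostname owning global core number $core (nodefile is 1-indexed)
    host=$(sed -n "$((core + 1))p" "$nodefile")
    # local slot index: number of earlier nodefile lines for the same host
    slot=$(head -n "$core" "$nodefile" | grep -c "^${host}\$")
    echo "rank ${rank}=${host} slot=${slot}"
    rank=$((rank + 1))
  done
}
# Example: gen_rankfile nodefile 0,7,4,... > rankfile
```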

Play with TreeMatch

  1. TreeMatch is a tool that computes permutations of processes based on their affinity. Affinity is a measure of how close two processes should be placed; a possible metric is a communication matrix. Look at the communication matrix of shuffle.c:
    > cat /tmp/jeannot/shuffle_size_internal.mat
  2. A tleaf is a (Scotch) file format describing a tree topology. The syntax is:
    tleaf <nb of levels> <arity level1> <cost level1> ... <arity level n> <cost level n>
  3. Look at the tleaf describing the topology:
    > cat /tmp/jeannot/plafrim16.tgt
  4. Use TreeMatch to compute the permutation of the shuffle.c communication matrix on the plafrim topology:
    > /tmp/jeannot/bin/mapping -t /tmp/jeannot/plafrim16.tgt -c /tmp/jeannot/shuffle_size_internal.mat
  5. Use the verbose level to display timing information:
    > /tmp/jeannot/bin/mapping -t /tmp/jeannot/plafrim16.tgt -c /tmp/jeannot/shuffle_size_internal.mat -v 4
  6. Display internal algorithms secrets:
    > /tmp/jeannot/bin/mapping -t /tmp/jeannot/plafrim16.tgt -c /tmp/jeannot/shuffle_size_internal.mat -v 5
  7. You can also use hwloc to generate the topology in hwloc XML format:
    > lstopo --of xml -i "node:2  socket:2 pu:4" > plafrim16.xml
  8. TreeMatch can work on this format as well:
    > /tmp/jeannot/bin/mapping -x plafrim16.xml -c /tmp/jeannot/shuffle_size_internal.mat
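To make the tleaf syntax above concrete: a 16-core machine made of 2 nodes of 2 sockets of 4 cores, with communication costs decreasing as you go down the tree, could be described as follows (illustrative values only; the actual costs in plafrim16.tgt may differ):

```
tleaf 3 2 500 2 100 4 10
```

Read it as: 3 levels; arity 2 with cost 500 between nodes, arity 2 with cost 100 between sockets, and arity 4 with cost 10 between cores of the same socket.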

Extract communication pattern

  1. Open MPI offers fine runtime tuning of the execution (see the Open MPI documentation for details). It is based on the Modular Component Architecture (MCA) and provides several frameworks such as pml: the point-to-point management layer (fragmenting, reassembly, top-layer protocols, etc.).
  2. We have developed an experimental PML component to extract the communication pattern of an MPI application (to be released soon in an official version ;-)). Read /tmp/jeannot/README.monitoring for more details.
  3. Test it:
    $ mpiexec -bind-to core -np 16  --mca pml_monitoring_enable 2 ./shuffle
  4. The application profile is output on the stderr of the job (e.g. stderr.out)
    > cat stderr.out
  5. Q6: Explain this output in the light of /tmp/jeannot/README.monitoring.
  6. You can convert this profile to a matrix thanks to the /tmp/jeannot/bin/ script. It first requires that the input file have a .prof extension:
    > cp stderr.out
  7. You can then create the different communication matrices:
    > /tmp/jeannot/bin/ 
  8. Q7: what are the different generated matrices?
  9. Check that the generated matrix is the same as the provided one:
    > diff shuffle_size_internal.mat /tmp/jeannot/shuffle_size_internal.mat

Putting everything together

We will use ZeusMP2, a CFD application. We will run a 3D case and try to optimise its process placement.

If you do not have the time to compile ZeusMP2, you can use my version /home/jeannot/ZeusMP2/zeusmp2/exe90/zeusmp.x and copy the configuration file /tmp/jeannot/zmp_inp into your working directory.

Then go to step 7.

  1. Untar the source
    > tar xvfz /tmp/jeannot/ZeusMP2.tgz
  2. Compile the dependencies
    1. JPEG7
      > cd ZeusMP2/zeusmp2_dep/jpeg-7
      > ./configure --prefix=<ZeusMP install dir>/zeusmp2_dep/jpeg-7_install/
      > make -j 8
      > make install
    2. HDF4
      > module load compiler/gcc/4.8.3
      > cd ../HDF4.2r4
      > export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:<ZeusMP install dir>/zeusmp2_dep/jpeg-7_install/lib
      > ./configure --prefix=<ZeusMP install dir>/zeusmp2_dep/HDF4.2r4_install/ --with-jpeg=<ZeusMP install dir>/zeusmp2_dep/jpeg-7_install
      > make -j 8
      > make install
  3. ZeusMP
    > cd ../../zeusmp2/src90
    1. Edit Makefile and change the 3 lines after
      ZMP_LIB   = ${HDF} ${MGMPI_LIB} \
    2. Compile
      > make -j 8
  4. Go to exe90
    > cd ../exe90
  5. You should have an executable (zeusmp.x) and a configuration file (zmp_inp)
  6. Copy zmp_inp into your working directory. It is the configuration file that describes the 3D CFD simulation.
  7. Test ZeusMP from your working directory. It takes around 1 min 15 s (the duration is given on the "Wall Clock" line):
    $ mpiexec -bind-to core -np 16   <path to exe90>/zeusmp.x
  8. Q8: based on the tools used previously, try to optimise the process placement for this tool.
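A possible workflow for Q8, chaining the tools from the previous sections (a sketch, not the only answer; a file name such as zeusmp.mat is a placeholder for whatever the matrix-generation step produces):

```shell
# 1. Record ZeusMP's communication pattern with the monitoring PML
#    (add to the PBS script):
#      mpiexec -bind-to core -np 16 --mca pml_monitoring_enable 2 \
#              <path to exe90>/zeusmp.x
# 2. Convert the stderr profile into communication matrices, as in the
#    "Extract communication pattern" section.
# 3. Let TreeMatch compute a permutation from the matrix and the topology:
#      /tmp/jeannot/bin/mapping -t /tmp/jeannot/plafrim16.tgt -c zeusmp.mat
# 4. Turn that permutation into a rankfile and rerun, comparing the
#    "Wall Clock" line against the baseline:
#      mpiexec -bind-to core -np 16 -rf rankfile <path to exe90>/zeusmp.x
```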
process_placement_hands-on_session.txt · Last modified: 2015/06/04 15:43 by ejeannot