perfcatcher Overview
--------------------

This profiling library contains wrappers for a number of MPI functions
and can be used to find communication bottlenecks in your code.

To use it:

        1) Copy this directory and all of its contents to a work
           directory owned by you.

        2) Use the build_dso script to build the DSO library
           libproftest.so (or simply type "make").  Verify that
           everything is OK with "cd test; make"; you should get an
           MPI_PROFILING_STATS file containing three test runs.
           Note: to use mpich, add "CC=mpicc" to each make command.

        3) Set _RLD64_LIST to the absolute pathname of the DSO,
           followed by ":DEFAULT".  For example:
                setenv _RLD64_LIST /usr/people/joe/libproftest.so:DEFAULT

        4) Either
           a) set the environment variable MPI_PROFILE_AT_INIT to "1", or
           b) modify your source code to insert a call to
              MPI_SGI_profiling_enable() at the point where you want
              profiling to begin (e.g., after your program has completed
              any initialization you are not interested in profiling);
              see the sketch after this list.

        5) Optional: either
           a) set the environment variable MPI_SCOPY_CHECK_AT_INIT to
              "1", or
           b) modify your source code to insert a call to
              MPI_SGI_scopy_check_enable() at the point where you want
              checking for single-copy usage to begin.

        6) Run the MPI program as usual.  The MPI calls go through the
           profiling library wrappers, which collect statistics and
           append them to a file called MPI_PROFILING_STATS when the
           program finishes.  Consider using the -p option on mpirun to
           tag each output line with the rank of the process that
           printed it; see the example session below.
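
For steps 4b and 5b, the enable calls are simply inserted into an
ordinary MPI program.  Below is a minimal C sketch of where they might
go.  MPI_SGI_profiling_enable() and MPI_SGI_scopy_check_enable() are
the entry points named above, but their signatures (no arguments, no
return value) and the hand-written extern declarations are assumptions
made for illustration; adjust them to match the library as built.

        #include <mpi.h>

        /* Entry points provided by libproftest.so.  Declared by hand
           here on the assumption that no header is installed for them;
           the void/void signatures are a guess. */
        extern void MPI_SGI_profiling_enable(void);
        extern void MPI_SGI_scopy_check_enable(void);

        int main(int argc, char **argv)
        {
            int rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* ... initialization you do not want profiled ... */

            MPI_SGI_profiling_enable();    /* step 4b: profiling starts here  */
            MPI_SGI_scopy_check_enable();  /* step 5b: single-copy checking   */

            /* ... the communication you want measured ... */

            MPI_Finalize();  /* stats are appended to MPI_PROFILING_STATS */
            return 0;
        }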

Beware that MPI on Origin will sometimes print stdout lines out of order.
Also beware that this profiling library writes many lines to stdout.
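
Putting steps 3, 4a, and 6 together, a typical csh session might look
like the following.  The library pathname, program name, rank count,
and the "%g:" prefix string passed to -p are placeholders; check the
mpirun(1) man page for the exact prefix syntax on your system.

        setenv _RLD64_LIST /usr/people/joe/libproftest.so:DEFAULT
        setenv MPI_PROFILE_AT_INIT 1
        mpirun -p "%g:" -np 2 a.out
        cat MPI_PROFILING_STATS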


Example output
--------------

Summary counts and timings  Wed Oct 23 13:57:45 2002

Total MPI processes                                  2

Total MPI job time, avg per rank                    0.396172 sec
Profiled job time, avg per rank                     0.396172 sec
Percent job time profiled, avg per rank             100%

Time in all profiled MPI routines, avg per rank     0.135594 sec
Percent time in profiled MPI routines, avg per rank 34.2259%

Rank:Percent in profiled MPI routines
	0:67.8%	1:0.613297%	
Best:   Rank 1      0.613297%
Worst:  Rank 0      67.8%
Load Imbalance:  51.058835%

Wtime resolution is                   2.1e-08 sec

activity on process rank 0
 comm_rank calls 2	time 8.50495e-06
    ibsend calls 1	time 0.00151979
   barrier calls 1	time 0.267231
    gather calls 1	time 0.0001176

activity on process rank 1
 comm_rank calls 2	time 8.44197e-06
      recv calls 1	time 0.0021718   datacnt 40000  waits 0  wait time 0
            Average data size 40000   (min 40000, max 40000)   size:count(peer)
            40000:   1x(0)

            unique peers:     0

   barrier calls 1	time 0.000248073
    gather calls 1	time 7.6335e-05


------------------------------------------------

recv profile

             cnt/sec for all remote ranks
local   ANY_SOURCE        0            1    
 rank
    1     /           1/0.002     /        


------------------------------------------------

recv wait for data profile

             cnt/sec for all remote ranks
local        0            1    
 rank


------------------------------------------------

send profile

             cnt/sec for all destination ranks
  src        0            1    
 rank


------------------------------------------------

ssend profile

             cnt/sec for all destination ranks
  src        0            1    
 rank


------------------------------------------------

ibsend profile

             cnt/sec for all destination ranks
  src        0            1    
 rank
    0     /           1/1.5e-03 
MPI_SGI_loadbarrier() was never called.


Revision history of perf.c
--------------------------

                o version 6 - from Cheng Liao.  Includes more Fortran
                  wrappers.

                o version 7 - more Fortran wrappers; added bcast and
                  gather.

                o version 8 - from Frank Kampe; added MPI_Isend,
                  MPI_Irecv, and MPI_Waitall.

                o version 9 - fixes the MPI_Scatter wrapper.

                o version 9b - fixes the mpi_gatherv_ wrapper.

                o version 9c - fixes a bug in the mpi_gatherv_ wrapper.

                o version 10 - added size-distribution collection,
                  MPI_SGI_profiling_enable()/MPI_PROFILE_AT_INIT,
                  communication-time percentage measurement, and
                  type-size determination.

                o version 11 - added peer-to-peer communication size
                  measurement and calculation of load imbalance.

                o version 12 - added MPI_SGI_loadbarrier() code.  Uses
                  -lshm, which is not available on Linux until shm is
                  ported; undefine LOADBARRIER to build on Linux.

                o version 13 - (reiner) added MPI_Sendrecv code and a
                  trace capability to find code regions where
                  non-single-copy point-to-point communications are
                  performed.  This feature can help detect hidden
                  stack-allocated temporary buffers in advanced F90
                  codes.

                o version 14 - (pags) added code to allow this library
                  to be used with mpich.

