CFHHM BENCHMARK "bm_cfhh1"
                          Updated: 99/11/12
                              R. Krivec


PURPOSE

(1) Find the most suitable program topology on different machines, and possibly
    a topology efficient on several machines.
(2) Find optimal f77 options on separate machines.
(3) Find the CPU time model of the CFHHM code.
(4) Define several CFHHM benchmarks.


DESCRIPTION

CFHHM (Correlation Function Hyperspherical Harmonic Method) originated and
developed by V.B. Mandelzweig (Hebrew University, Jerusalem), M.I. Haftel
(Naval Research Laboratory, Washington) and R. Krivec (J. Stefan Institute,
Ljubljana) is a method to calculate very precise values of observables in
the quantum nonrelativistic three body problem, especially in atomic physics.

Selected results:
    - The best value of annihilation rate of positronium negative ion [1]
    - Sticking probability and fusion rate for the muon-deuteron-triton
      mesomolecule in the muon-catalyzed fusion process; the results pointed
      out several inconsistencies in the literature [2]
    - Resonances in the doubly excited He atom [3], showing that in CFHHM,
      in contrast to the variational calculations, no stabilization is 
      needed [3]
    - The best calculation of hyperfine splitting in the muonic Helium atom,
      and discovery of wrong convergence of variational calculations [4,5,6]

The main programs in CFHHM are M2MK and EWF2P. M2MK calculates matrix elements
and EWF2P solves a system of coupled ordinary differential equations in one
variable (hyperradius), which represent the three-body Schroedinger equation.
At the origin, the exact analytic form including logarithmic terms is
represented as matrix recursion relations. On the rest of the interval, 
the solution is expressed as matrix Taylor series on small subintervals,
calculated via matrix recursion relations, and joined at the boundaries [7].

M2MK
    Calculates matrix elements (runs once, has negligible CPU time in 
    the complete CFHHM calculation).
    m2mk is a mixed integer/REAL*8 program, written for clarity, and not
    optimized. MFLOPS is meaningless for it. Its speed is presented as an
    example of how catastrophically low efficiency such "nice" programs
    can have, especially on superscalar machines.

    Dimension declarations:
        NDMU = 33, NDP0 = 104, NDP = 102,
        NDGO = 32, NNQSI = 8.  
        Km = 2*MUU (input), where Km is the maximum global angular momentum.

EWF2P
    Solves the coupled system of differential equations in one variable [7]
    by prescribing the energy E, propagating the solution from two sides
    towards a matching hyperradius (ZMATCH), calculating a 
    "joining determinant" D(E), and iterating E until D(E) = 0. Typically
    7 iterations with fixed values of E are necessary to find the zero to
    more than 10 significant digits, which assures the wave function accuracy
    of at least 6-7 significant digits.

    The CPU times quoted are for one iteration (complete integration with fixed
    energy parameter).

    ewf2p is a "vector" floating-point (REAL*8) program.

    Dimension declarations:
        NDMU variable, ISYM = 0, NDP = 102, NDP0 = 104, NDCN = 22 (size is 
	790 MB at NDMU = 33).
        NSU is the number of coupled equations (matrix size) in the recursive
        calculation of a matrix Taylor expansion of the solution on 
	sub-intervals.
        NDMU is the main dimension and is varied to see its effect at small 
	NSU.
        The relations are NDMU >= Km / 2, NSU = const * Km**2.

    CPU time model:
        in the first approximation we count only the main part of RRP* 
        (recursion relations, see below):
    
            T = C(IPWUU) * NSU**3 * Nint * (JMAX**2)/2 + T0
	
        where C is a constant, NSU is the number of coupled equations, 
        Nint is the number of hyperradial intervals, JMAX is the number 
        of powers in the matrix Taylor expansion of the solution, and T0 is
        the time spent on the first interval (small for He atom). C depends
        on IPWUU because the power expansion of the effective potential 
	introduces additional summations whose CPU time is proportional 
	to NSU**2 * IPWUU * JMAX, and changes the geometry of the most 
	important loops.
    
        Nint can be approximated as follows. Interval boundaries are 
	determined by Z(n) = min((1 + TOL)**n * ZMATCH, DZU), whence we get 
	that the lengths Z(n+1) - Z(n) increase up to Z(n1) = DZU / TOL, and 
        n1 = (1 / TOL) * ln [DZU / (TOL * ZMATCH)]. Then if Z(n1) < ZTOP,
        Nint = n1 + (ZTOP - Z(n1)) / DZU. 
        The actual Nint values were calculated by a trivial program (aux700.f),
        because the CFHHM program does not enumerate them.

    Recursion formulas evaluation:
        ewf2p subroutine ddca calls several variants of rrp* subroutines which
        evaluate the recursion formulas with different loop topologies, and 
	with and without calls to DAXPY and DGEMM. Some rrp* routines are also
        completely reordered for parallelization. The list of the most 
	important versions is as follows (index M1 =< JMAX runs over Taylor 
	terms in the solution, index IPX =< IPWUU runs over powers in the 
	effective potential):
    
        ---------------------------------------------------------------------
        Name    Originating    IRRPV    Short description
                rrpsm version  (ddca)      
        ---------------------------------------------------------------------
        rrp11   1.1               1     Single M1 loop outside, IPX inside 
        rrp30   3.0              30     as rrp11, eliminated power evaluation
        rrp30a  3.0V            301     as rrp30, but calls DGEMM
        rrp30b  3.0V1           302     as rrp30a, but calls DAXPY
        rrp31c  3.1.1V1         314     M1 loop distributed and moved inside
        rrp31e                  316     as rrp31c, but DO-loops only
        ---------------------------------------------------------------------

References
    
[1]  R. Krivec and V.B. Mandelzweig, Phys. Rev. A47, 911 (1993).
[2]  R. Krivec and V.B. Mandelzweig, Phys. Rev. A52, 221 (1995).
[3]  S. Berkovic, R. Krivec, V.B. Mandelzweig and L. Stotland, 
     Phys. Rev. A55, 988 (1997).
[4]  R. Krivec and V.B. Mandelzweig, Phys. Rev. A56, 3614 (1997).
[5]  R. Krivec and V.B. Mandelzweig, Phys. Rev. A57, 4976 (1998).
[6]  R. Krivec and V.B. Mandelzweig, Proc. XVI-th European Few-Body
     Conference, Autrans, June 1-6, 1998.
[7]  R. Krivec, M.I. Haftel and V.B. Mandelzweig,
     Journal of Computational Physics 123, 149 (1996).


FILES

Directory: saturn:~krivec/BENCH/bm_cfhh1


MACHINE LABELS

PCL       SGI Power Challenge L, R8000, 90 MHz (saturn)
O2000     SGI Origin 2000, R10000, 250 MHz
Sun       Sun (Enterprise 4500?) 336 MHz (8 MB L2?) (Sun Geneva, run by R. K.)
C3        Convex C3860 (120 MFLOPS) (Marvin)


MEASUREMENTS

CPU times measured by:
    SGI: DTIME, pixie, prof
    Sun: DTIME
and double-checked using "time" etc.
These programs may differ in what they return in the case of multiprocessing.
Measurements on the SGI are done on a 90 MHz R8000.  Note that prof (SGI)
may predict too small CPU times based on the number of floating point
operations required. The efficiencies reported below are based on execution
times on processors where only the tested job was running, i.e., the real 
elapsed times.


ORGANIZATION

In Part I precisely defined runs are tested. In Part II some additional 
examples are given, from other runs, which may illustrate the results
in Part I, and may later serve as additional benchmarks.


                                PART I
				 
				 
1.  He atom ground state (the simplest CFHHM example)
    -------------------------------------------------
    

FILES

ewf2p1aX40010*cfeci555b*:
    NM, JMAX              20  16
    EPSP, TOL, DZU            0.1000D-13    0.100000        5.00
    ZM, ZMATCH, ZTOP          0.5000        5.00      120.00
    Nint (aux700) = 64 +/- 1

SUMMARY

Linear correlation function implying an effective potential with 2 terms:
due to the this, most of the CPU time is spent for linear algebra in ewf2p 
(no sums over powers in the potential).

RESULTS


Table 1.I.  M2MK. The speed of this program is not very important, but it seems
to be more sensitive to clock rate than EWF2P. Operation counts ar taken from
the PCL pixie output.
-------------------------------------------------------------------------------
Machine         Km     N    CPU time   FP op. cnt      I op. cnt.      R0 *2
                                       (pixie, M) *1   (pixie, M)
-------------------------------------------------------------------------------
PCL *3          56   225     56            705            2447         0.16
                64   289    113           1266            4509         0.14
-------------------------------------------------------------------------------
O2000           64   289     54                                        0.21
-------------------------------------------------------------------------------
Sun *4          64   289     45                                        0.19
-------------------------------------------------------------------------------
*1  Operation count is taken from "prof -pixie" output. There are more
    integer operations than FP operations, MFLOPS are not sensible.
*2  R0 is an approximate efficiency measure: number of ALL operations
    divided by theoretical MFLOPS: small values indicate floating
    efficiency is even smaller.
*3  Compilation: f77 -lcomplib.sgimath-r8000-mips4-64-O3-LNO:ou=4.
*4  f77 -xtarget=native -fast.


Table 1.II.  EWF2P.
R is the efficiency (MFLOPS achieved divided by theoretical MFLOPS). 
CPU time model (T0 approximately 0): T = C(0) * NSU**3 * 64 * 128.
Some old measurements are reported for comparison (optimization history).
-------------------------------------------------------------------------------
Machine    Km    NSU      CPU time   FP op. count   MFLOPS *1     R     C(0) X
                                          (pixie, M) *1                 10**9
-------------------------------------------------------------------------------
PCL *2     24     49      8.3           1903          230        0.64    8.6
                          8.3 *6
                          8.3 *7
                          8.3 *8
           48    169    285            74305          261        0.72    7.3
           56    225    640           174450          273        0.76    6.9
           60    256    933           236326          253        0.70    6.8
           64    289   1320            (operation count integer overflow)

PCL 75 MHz 24            32  *9
*3                       12  *10
                         10  *14
-------------------------------------------------------------------------------
O2000      24     49      5.8 *13                     328        0.66
-------------------------------------------------------------------------------
Sun        24     49      6.5                         292        0.44
*4         48    169    192                           387        0.58
           56    225    460                           379        0.56
-------------------------------------------------------------------------------
C3 *5      24     49     26  *11                       73        0.61 *12
-------------------------------------------------------------------------------
*1  Operation count is taken from "prof -pixie" output; MFLOPS are calculated
    from this count and the CPU time, although prof returns MFLOPS also; but
    the MFLOPS returned by prof turned out to be misleading in the matrix 
    multiplication test.
*2  f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4.
    Only slightly faster that without -LNO:ou=4.
*3  f77 -lcomplib.sgimath -O3 (old compiler may have had correct defaults).
*4  f77 -xO3 -xchip=ultra2 -xarch=v8plusa -xlic_lib=sunperf (correct library).
    Measured at Sun Geneva by R. K.
*5  fc -lveclib.
*6  RRP31E, DO loops instead of DAXPY.
*7  RRP31E, compiled with ... -pfa -WK,-p=1.
*8  RRP31C, compiled with ... -pfa -WK,-p=1.
*9  RRPSM 1.0, cf. cfhhoptpcl.txt.
*10  RRPSM 3.0VS - 3.1.1V1, cf. cfhhoptpcl.txt, opt_rrpsm.txt.
*11  RRPSM 1.0 - 3.1.1V1 (same), cf. cfhhoptpcl.txt.
*12  Efficiency estimated using 1903 as operation count.
*13  f77 -lcomplib.sgimath -r10000 -mips4 -64 -O3 -LNO:ou=4.
*14  /usr/bin/f77 -r8000 -mips4 -64 ... -lcomplib.sgimath, RRP31C, 75 MHz
     (the new compiler is slightly faster, or some other effect, compared
     with *10. 10 sec scaled from 75 to 90 MHz yield exactly 8.3 sec (see
     the first entry).)


2.  mu-d-t mesomolecule ground state
    --------------------------------


FILES

cde8gizzcc2eci1bbb*_<n>* (<n> = IRRPV, rrp* versions selector)
    IPWUU, NSU, INUSEL       100     325   0
    NM, JMAX              20  12
    EPSP, TOL, DZU            0.1000D-13    0.100000        1.00
    ZM, ZMATCH, ZTOP          1.2500       15.00      120.00
    Nint (aux700) = 133 +/- 2, T0 is large (of the order of 
    15% of total time is used by VICDF)

SUMMARY

Nonlinear correlation function is used. It generates a power series for the 
effective potential; 100 terms are used (IPWUU = 100).

Initially, a rather small efficiency of RRP31C is observed (DAXPY using most 
of the CPU time). Removing DAXPY from RRP31E (replacing it with DO loops) 
produced a speedup of about 25% at Km = 16, but no significant speedup at 
Km = 24.

Using options similar to those most efficient for matrix multiplication,
the overall program efficiency R, using RRP31E (DO-loops), could be increased 
to 0.30 at Km = 24. This is OK because the matrix multiplication efficiency 
could not exceed 0.53 using hand coded loops (0.83 using DGEMM).

CPU time model is probably roughly correct: C(100) should be larger than C(0)
by a small factor (about 2, according to some old estimates) because the time 
spent summing the effective potential goes like NSU**2, not NSU**3; but due 
to the smaller efficiency in Table 2.I compared to Table 1.II, the constant 
C(100) is larger than C(0) by a factor of 3 - 4.

As expected EWF2P exhibits slightly larger efficiencies on R10000 than on 
R8000 because of narrower pipeline and less linear-algebraic character.

RESULTS

Remark: O2000 m2mk times (NGO* = 4) are: Km = 32: 140 sec, Km = 56: 1101 sec.

Table 2.I.  EWF2P for different Km, NDMU, NSU, IRRPV, and different 
optimizations (as noted in footnotes). If not noted otherwise, the command line
(SGI) "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4" and RRP31C are 
used. Dimensions: NDMU >= Km / 2 + 1.
R is the efficiency (MFLOPS achieved divided by theoretical MFLOPS). 
CPU time model: T = C(100) * NSU**3 * 133 * 72 + T0; neglecting T0 although
not very small (of the order of 10% of total CPU time.) [NOTE: prof claims 
the scheduling characteristics to be about 2 times better than actually 
observed, i.e., than times measured by dtime (see footnote 8).]
-------------------------------------------------------------------------------
Machine   Km NDMU NSU  RRP*   CPU    FP op. count   MFLOPS *1     R    C(100) *
                       vers.  time   (pixie, M) *1                      10**9
-------------------------------------------------------------------------------
PCL *2    16  33   45   31c    46         3505        76       0.21      53
                               46 *4
                               46 *5
                              301 *3
                        31e    38 *6
                               38 *7      3509        92       0.26      44
          ---------------------------------------------------------------------
          24  33   91   1     391 *7     23209        59       0.16      
                        30    390 *7     23183        83
		        30a   401 *7     23344        58
		        30b   421 *7     23321        55
                        31c   273 *7     23308        85       0.24      
                        31e   258 *7 *8  23332        90       0.25      36
              13   91   1     307 *7
	                30    332 *7
			30a   334 *7
			30b   360 *7
			31c   249 *7
			31e   224 *7                 104       0.29
                              214 *10                109       0.30      30
                              247 *11
                              246 *12
                              216 *13     
                              226 *14
                              219 *15
                              226 *16
          ---------------------------------------------------------------------
          32  33  153   31c  1201 *9
                        31e  1136 *7 *8 100041        88       0.24      
              17             1027 *13                 97       0.27      
          ---------------------------------------------------------------------
          48  33  325   31c 11013       224778        20       0.06
                        31e 10782 *7    225046        21
              25            11223 *13                 20       0.06      34
          ---------------------------------------------------------------------
          56  29  435   31c 23696 *13                 43       0.12      
                        31e 25390 *13  1028349        41       0.11
-------------------------------------------------------------------------------
O2000     24            31c   152 *17                153       0.31 *18
          32            31c   644 *17                155       0.31 *18
          48            31c  5578 *17                 40       0.08 *18
-------------------------------------------------------------------------------
*1  See Footnote 1 of Table 1.II.
*2  f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4. RRP31C.
*3  f77 -lblas -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4 is MUCH 
    slower (RRP31C).
*4  RRP31C compiled with ...ou=4 -pfa -WK,-p=1.
*5  RRP31C; all compiled with ...ou=4 -pfa -WK,-p=1.
*6  RRP31E compiled with ...ou=4 -pfa -WK,-p=1. 
*7  All compiled with 
    "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1".
    At Km = 16, RRP31E takes 62% of cycles; VICDF takes 11% of cycles.
    Example of using DGEMM NOT being an advantage. PP and QQ update DO-loops
    in RRP31E use up 26-28% of total cycles each.
    At Km = 24, RRP31E takes 51% of cycles; vicdf takes 16% of cycles;
    PP and QQ update DO-loops in RRP31E use up 23-25% of total cycles each
    (RRP31E share decreasing).
    At Km = 32: RRP31E 40% cycles, COMPLIB_DGEMM_HOISTC 39%, VICDF 19%.
*8  At Km = 24, prof claims 133 MFLOPS on 75 MHz, or R = 0.44.
    At Km = 32, prof claims 613 sec execution time and 163 MFLOPS (Km = 32) 
    for 75 MHz, i.e., R about 0.54 for both 75 MHz and 90 MHz.
    At Km = 48, prof claims R to be about 0.3.
*9  "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1",
    using RRP31C (calling DAXPY).
*10  RRP31E compiled with "f77 -r8000 -mips4 -64 -O3 -LNO:ou=4:cs1=16k:cs2=4m"
     "-TENV:X=4 -pfa -WK,-so=3,-ro=3,-o=5 -WK,-p=1", the program compiled with
     "f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1", the program linked
     with "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -pfa -WK,-p=1".
*11  As *10, but rrp31e compiled with the additional -WK,ur=4 (inner loop
     unroll).
*12  As *11, but -o=4 instead of -o=5.
*13  The whole program compiled (as in *10) with "f77 -r8000 -mips4 -64 -O3"
     "-LNO:ou=4:cs1=16k:cs2=4m -TENV:X=4 -pfa -WK,-so=3,-ro=3,-o=5 -WK,-p=1".
*14  As *13, but "-Ofast=ip21" instead of "-r8000 -mips4 -64 -O3".
*15  As *13, but "-ou=8" instead of "-ou=4".
*16  The whole program compiled as in *13, but without ":ou=4" (no outer loop
     unrolling).
*17  f77 -lcomplib.sgimath -r10000 -mips4 -64 -O3 -LNO:ou=4.
*18  Estimate obtained by using the PCL (R8000) operation counts.
     A proper estimate would be obtained by running pixie.


                                PART II
				
				
1. Positronium negative ion (Ps-)
-- ------------------------------


Two runs using RRP31C, compiled differently, with parameters IPWUU = 60, 
JMAX = 20, DZU = 5, TOL = .05, ZTOP = 120. 99/02/23. As of today, this 
comparison is not made in Table 2.I. Speedup is small.


1.1. f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3.
-------------------------------------------------------------------------------
ewf2plfg43cppaccec5555b23a_r8000_cs_314.2ff
 IRRPV (RRP--- VERSION)                   314
 LTDD/R,LTVICD/RRPS,LPRBIN/Z0M,LTCPU/RU     F   F   F   F   F   F   F   F
 -RRLSDE- 1.1, 96/08/02. 1.0, 93/05/18. 1.4, 91/03/13 (VICDF)        
 -RRP31C- 2.0, 96/08/05. -1.0; -RRPSM 3.1.1V1. -3.1V1S -3.0V1S -3.0VS
 DET:     -1.4265964634850457 10**120  ENERGY      0.2610000000000000
 DET:      2.5931146819654085 10**120  ENERGY      0.2640000000000000
 DET:      7.7558711507030882 10**118  ENERGY      0.2620627383190518
 DET:     -9.7808259030330582 10**114  ENERGY      0.2620061965277588
 DET:     -5.2273609064104392 10**109  ENERGY      0.2620062036452644
 DET:      3.7297727749911305 10**106  ENERGY      0.2620062036453025
 DET:     -2.6151134961832510 10**109  ENERGY      0.2620062036452835
 DET:      3.7297727749911305 10**106  ENERGY      0.2620062036453025
 DET:      3.7297727749911305 10**106  ENERGY      0.2620062036453025
    1  0.3729772774991136E-13  H   0.0000E+00      0.2620062036453025
    5 + 2 DET. EVALUATIONS OF      710.52 SEC.
   USER     6418.05     DIF     6418.05     REF        0.00     SYS       34.24


1.2. f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 \
-LNO:ou=4 -TENV:X=4 -pfa -WK,-p=1,-so=3,-ro=3,-o=5.
-------------------------------------------------------------------------------
ewf2plfg43cppaccec5555b23a_cs8000ou4pfa_314.2ff
 IRRPV (RRP--- VERSION)                   314
 LTDD/R,LTVICD/RRPS,LPRBIN/Z0M,LTCPU/RU     F   F   F   F   F   F   F   F
 -RRLSDE- 1.1, 96/08/02. 1.0, 93/05/18. 1.4, 91/03/13 (VICDF)        
 -RRP31C- 2.0, 96/08/05. -1.0; -RRPSM 3.1.1V1. -3.1V1S -3.0V1S -3.0VS
 DET:     -1.4265964634850428 10**120  ENERGY      0.2610000000000000
 DET:      2.5931146819654041 10**120  ENERGY      0.2640000000000000
 DET:      7.7558711507028519 10**118  ENERGY      0.2620627383190518
 DET:     -9.7808259032345006 10**114  ENERGY      0.2620061965277588
 DET:     -5.2273694740069976 10**109  ENERGY      0.2620062036452644
 DET:      3.7378218156157379 10**106  ENERGY      0.2620062036453025
 DET:     -2.6163059571265164 10**109  ENERGY      0.2620062036452835
 DET:      3.7378218156157379 10**106  ENERGY      0.2620062036453025
 DET:      3.7378218156157379 10**106  ENERGY      0.2620062036453025
    1  0.3737821815615743E-13  H   0.0000E+00      0.2620062036453025
    5 + 2 DET. EVALUATIONS OF      643.29 SEC.
   USER     5828.61     DIF     5828.61     REF        0.00     SYS       28.63