CFHHM BENCHMARK "bm_cfhh1" Updated: 99/11/12 R. Krivec PURPOSE (1) Find the most suitable program topology on different machines, and possibly a topology efficient on several machines. (2) Find optimal f77 options on separate machines. (3) Find the CPU time model of the CFHHM code. (4) Define several CFHHM benchmarks. DESCRIPTION CFHHM (Correlation Function Hyperspherical Harmonic Method) originated and developed by V.B. Mandelzweig (Hebrew University, Jerusalem), M.I. Haftel (Naval Research Laboratory, Washington) and R. Krivec (J. Stefan Institute, Ljubljana) is a method to calculate very precise values of observables in the quantum nonrelativistic three body problem, especially in atomic physics. Selected results: - The best value of annihilation rate of positronium negative ion [1] - Sticking probability and fusion rate for the muon-deuteron-triton mesomolecule in the muon-catalyzed fusion process; the results pointed out several inconsistencies in the literature [2] - Resonances in the doubly excited He atom [3], showing that in CFHHM, in contrast to the variational calculations, no stabilization is needed [3] - The best calculation of hyperfine splitting in the muonic Helium atom, and discovery of wrong convergence of variational calculations [4,5,6] The main programs in CFHHM are M2MK and EWF2P. M2MK calculates matrix elements and EWF2P solves a system of coupled ordinary differential equations in one variable (hyperradius), which represent the three-body Schroedinger equation. At the origin, the exact analytic form including logarithmic terms is represented as matrix recursion relations. On the rest of the interval, the solution is expressed as matrix Taylor series on small subintervals, calculated via matrix recursion relations, and joined at the boundaries [7]. M2MK Calculates matrix elements (runs once, has negligible CPU time in the complete CFHHM calculation). m2mk is a mixed integer/REAL*8 program, written for clarity, and not optimized. MFLOPS is meaningless for it. Its speed is presented as an example of how catastrophically low efficiency such "nice" programs can have, especially on superscalar machines. Dimension declarations: NDMU = 33, NDP0 = 104, NDP = 102, NDGO = 32, NNQSI = 8. Km = 2*MUU (input), where Km is the maximum global angular momentum. EWF2P Solves the coupled system of differential equations in one variable [7] by prescribing the energy E, propagating the solution from two sides towards a matching hyperradius (ZMATCH), calculating a "joining determinant" D(E), and iterating E until D(E) = 0. Typically 7 iterations with fixed values of E are necessary to find the zero to more than 10 significant digits, which assures the wave function accuracy of at least 6-7 significant digits. The CPU times quoted are for one iteration (complete integration with fixed energy parameter). ewf2p is a "vector" floating-point (REAL*8) program. Dimension declarations: NDMU variable, ISYM = 0, NDP = 102, NDP0 = 104, NDCN = 22 (size is 790 MB at NDMU = 33). NSU is the number of coupled equations (matrix size) in the recursive calculation of a matrix Taylor expansion of the solution on sub-intervals. NDMU is the main dimension and is varied to see its effect at small NSU. The relations are NDMU >= Km / 2, NSU = const * Km**2. CPU time model: in the first approximation we count only the main part of RRP* (recursion relations, see below): T = C(IPWUU) * NSU**3 * Nint * (JMAX**2)/2 + T0 where C is a constant, NSU is the number of coupled equations, Nint is the number of hyperradial intervals, JMAX is the number of powers in the matrix Taylor expansion of the solution, and T0 is the time spent on the first interval (small for He atom). C depends on IPWUU because the power expansion of the effective potential introduces additional summations whose CPU time is proportional to NSU**2 * IPWUU * JMAX, and changes the geometry of the most important loops. Nint can be approximated as follows. Interval boundaries are determined by Z(n) = min((1 + TOL)**n * ZMATCH, DZU), whence we get that the lengths Z(n+1) - Z(n) increase up to Z(n1) = DZU / TOL, and n1 = (1 / TOL) * ln [DZU / (TOL * ZMATCH)]. Then if Z(n1) < ZTOP, Nint = n1 + (ZTOP - Z(n1)) / DZU. The actual Nint values were calculated by a trivial program (aux700.f), because the CFHHM program does not enumerate them. Recursion formulas evaluation: ewf2p subroutine ddca calls several variants of rrp* subroutines which evaluate the recursion formulas with different loop topologies, and with and without calls to DAXPY and DGEMM. Some rrp* routines are also completely reordered for parallelization. The list of the most important versions is as follows (index M1 =< JMAX runs over Taylor terms in the solution, index IPX =< IPWUU runs over powers in the effective potential): --------------------------------------------------------------------- Name Originating IRRPV Short description rrpsm version (ddca) --------------------------------------------------------------------- rrp11 1.1 1 Single M1 loop outside, IPX inside rrp30 3.0 30 as rrp11, eliminated power evaluation rrp30a 3.0V 301 as rrp30, but calls DGEMM rrp30b 3.0V1 302 as rrp30a, but calls DAXPY rrp31c 3.1.1V1 314 M1 loop distributed and moved inside rrp31e 316 as rrp31c, but DO-loops only --------------------------------------------------------------------- References [1] R. Krivec and V.B. Mandelzweig, Phys. Rev. A47, 911 (1993). [2] R. Krivec and V.B. Mandelzweig, Phys. Rev. A52, 221 (1995). [3] S. Berkovic, R. Krivec, V.B. Mandelzweig and L. Stotland, Phys. Rev. A55, 988 (1997). [4] R. Krivec and V.B. Mandelzweig, Phys. Rev. A56, 3614 (1997). [5] R. Krivec and V.B. Mandelzweig, Phys. Rev. A57, 4976 (1998). [6] R. Krivec and V.B. Mandelzweig, Proc. XVI-th European Few-Body Conference, Autrans, June 1-6, 1998. [7] R. Krivec, M.I. Haftel and V.B. Mandelzweig, Journal of Computational Physics 123, 149 (1996). FILES Directory: saturn:~krivec/BENCH/bm_cfhh1 MACHINE LABELS PCL SGI Power Challenge L, R8000, 90 MHz (saturn) O2000 SGI Origin 2000, R10000, 250 MHz Sun Sun (Enterprise 4500?) 336 MHz (8 MB L2?) (Sun Geneva, run by R. K.) C3 Convex C3860 (120 MFLOPS) (Marvin) MEASUREMENTS CPU times measured by: SGI: DTIME, pixie, prof Sun: DTIME and double-checked using "time" etc. These programs may differ in what they return in the case of multiprocessing. Measurements on the SGI are done on a 90 MHz R8000. Note that prof (SGI) may predict too small CPU times based on the number of floating point operations required. The efficiencies reported below are based on execution times on processors where only the tested job was running, i.e., the real elapsed times. ORGANIZATION In Part I precisely defined runs are tested. In Part II some additional examples are given, from other runs, which may illustrate the results in Part I, and may later serve as additional benchmarks. PART I 1. He atom ground state (the simplest CFHHM example) ------------------------------------------------- FILES ewf2p1aX40010*cfeci555b*: NM, JMAX 20 16 EPSP, TOL, DZU 0.1000D-13 0.100000 5.00 ZM, ZMATCH, ZTOP 0.5000 5.00 120.00 Nint (aux700) = 64 +/- 1 SUMMARY Linear correlation function implying an effective potential with 2 terms: due to the this, most of the CPU time is spent for linear algebra in ewf2p (no sums over powers in the potential). RESULTS Table 1.I. M2MK. The speed of this program is not very important, but it seems to be more sensitive to clock rate than EWF2P. Operation counts ar taken from the PCL pixie output. ------------------------------------------------------------------------------- Machine Km N CPU time FP op. cnt I op. cnt. R0 *2 (pixie, M) *1 (pixie, M) ------------------------------------------------------------------------------- PCL *3 56 225 56 705 2447 0.16 64 289 113 1266 4509 0.14 ------------------------------------------------------------------------------- O2000 64 289 54 0.21 ------------------------------------------------------------------------------- Sun *4 64 289 45 0.19 ------------------------------------------------------------------------------- *1 Operation count is taken from "prof -pixie" output. There are more integer operations than FP operations, MFLOPS are not sensible. *2 R0 is an approximate efficiency measure: number of ALL operations divided by theoretical MFLOPS: small values indicate floating efficiency is even smaller. *3 Compilation: f77 -lcomplib.sgimath-r8000-mips4-64-O3-LNO:ou=4. *4 f77 -xtarget=native -fast. Table 1.II. EWF2P. R is the efficiency (MFLOPS achieved divided by theoretical MFLOPS). CPU time model (T0 approximately 0): T = C(0) * NSU**3 * 64 * 128. Some old measurements are reported for comparison (optimization history). ------------------------------------------------------------------------------- Machine Km NSU CPU time FP op. count MFLOPS *1 R C(0) X (pixie, M) *1 10**9 ------------------------------------------------------------------------------- PCL *2 24 49 8.3 1903 230 0.64 8.6 8.3 *6 8.3 *7 8.3 *8 48 169 285 74305 261 0.72 7.3 56 225 640 174450 273 0.76 6.9 60 256 933 236326 253 0.70 6.8 64 289 1320 (operation count integer overflow) PCL 75 MHz 24 32 *9 *3 12 *10 10 *14 ------------------------------------------------------------------------------- O2000 24 49 5.8 *13 328 0.66 ------------------------------------------------------------------------------- Sun 24 49 6.5 292 0.44 *4 48 169 192 387 0.58 56 225 460 379 0.56 ------------------------------------------------------------------------------- C3 *5 24 49 26 *11 73 0.61 *12 ------------------------------------------------------------------------------- *1 Operation count is taken from "prof -pixie" output; MFLOPS are calculated from this count and the CPU time, although prof returns MFLOPS also; but the MFLOPS returned by prof turned out to be misleading in the matrix multiplication test. *2 f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4. Only slightly faster that without -LNO:ou=4. *3 f77 -lcomplib.sgimath -O3 (old compiler may have had correct defaults). *4 f77 -xO3 -xchip=ultra2 -xarch=v8plusa -xlic_lib=sunperf (correct library). Measured at Sun Geneva by R. K. *5 fc -lveclib. *6 RRP31E, DO loops instead of DAXPY. *7 RRP31E, compiled with ... -pfa -WK,-p=1. *8 RRP31C, compiled with ... -pfa -WK,-p=1. *9 RRPSM 1.0, cf. cfhhoptpcl.txt. *10 RRPSM 3.0VS - 3.1.1V1, cf. cfhhoptpcl.txt, opt_rrpsm.txt. *11 RRPSM 1.0 - 3.1.1V1 (same), cf. cfhhoptpcl.txt. *12 Efficiency estimated using 1903 as operation count. *13 f77 -lcomplib.sgimath -r10000 -mips4 -64 -O3 -LNO:ou=4. *14 /usr/bin/f77 -r8000 -mips4 -64 ... -lcomplib.sgimath, RRP31C, 75 MHz (the new compiler is slightly faster, or some other effect, compared with *10. 10 sec scaled from 75 to 90 MHz yield exactly 8.3 sec (see the first entry).) 2. mu-d-t mesomolecule ground state -------------------------------- FILES cde8gizzcc2eci1bbb*_* ( = IRRPV, rrp* versions selector) IPWUU, NSU, INUSEL 100 325 0 NM, JMAX 20 12 EPSP, TOL, DZU 0.1000D-13 0.100000 1.00 ZM, ZMATCH, ZTOP 1.2500 15.00 120.00 Nint (aux700) = 133 +/- 2, T0 is large (of the order of 15% of total time is used by VICDF) SUMMARY Nonlinear correlation function is used. It generates a power series for the effective potential; 100 terms are used (IPWUU = 100). Initially, a rather small efficiency of RRP31C is observed (DAXPY using most of the CPU time). Removing DAXPY from RRP31E (replacing it with DO loops) produced a speedup of about 25% at Km = 16, but no significant speedup at Km = 24. Using options similar to those most efficient for matrix multiplication, the overall program efficiency R, using RRP31E (DO-loops), could be increased to 0.30 at Km = 24. This is OK because the matrix multiplication efficiency could not exceed 0.53 using hand coded loops (0.83 using DGEMM). CPU time model is probably roughly correct: C(100) should be larger than C(0) by a small factor (about 2, according to some old estimates) because the time spent summing the effective potential goes like NSU**2, not NSU**3; but due to the smaller efficiency in Table 2.I compared to Table 1.II, the constant C(100) is larger than C(0) by a factor of 3 - 4. As expected EWF2P exhibits slightly larger efficiencies on R10000 than on R8000 because of narrower pipeline and less linear-algebraic character. RESULTS Remark: O2000 m2mk times (NGO* = 4) are: Km = 32: 140 sec, Km = 56: 1101 sec. Table 2.I. EWF2P for different Km, NDMU, NSU, IRRPV, and different optimizations (as noted in footnotes). If not noted otherwise, the command line (SGI) "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4" and RRP31C are used. Dimensions: NDMU >= Km / 2 + 1. R is the efficiency (MFLOPS achieved divided by theoretical MFLOPS). CPU time model: T = C(100) * NSU**3 * 133 * 72 + T0; neglecting T0 although not very small (of the order of 10% of total CPU time.) [NOTE: prof claims the scheduling characteristics to be about 2 times better than actually observed, i.e., than times measured by dtime (see footnote 8).] ------------------------------------------------------------------------------- Machine Km NDMU NSU RRP* CPU FP op. count MFLOPS *1 R C(100) * vers. time (pixie, M) *1 10**9 ------------------------------------------------------------------------------- PCL *2 16 33 45 31c 46 3505 76 0.21 53 46 *4 46 *5 301 *3 31e 38 *6 38 *7 3509 92 0.26 44 --------------------------------------------------------------------- 24 33 91 1 391 *7 23209 59 0.16 30 390 *7 23183 83 30a 401 *7 23344 58 30b 421 *7 23321 55 31c 273 *7 23308 85 0.24 31e 258 *7 *8 23332 90 0.25 36 13 91 1 307 *7 30 332 *7 30a 334 *7 30b 360 *7 31c 249 *7 31e 224 *7 104 0.29 214 *10 109 0.30 30 247 *11 246 *12 216 *13 226 *14 219 *15 226 *16 --------------------------------------------------------------------- 32 33 153 31c 1201 *9 31e 1136 *7 *8 100041 88 0.24 17 1027 *13 97 0.27 --------------------------------------------------------------------- 48 33 325 31c 11013 224778 20 0.06 31e 10782 *7 225046 21 25 11223 *13 20 0.06 34 --------------------------------------------------------------------- 56 29 435 31c 23696 *13 43 0.12 31e 25390 *13 1028349 41 0.11 ------------------------------------------------------------------------------- O2000 24 31c 152 *17 153 0.31 *18 32 31c 644 *17 155 0.31 *18 48 31c 5578 *17 40 0.08 *18 ------------------------------------------------------------------------------- *1 See Footnote 1 of Table 1.II. *2 f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4. RRP31C. *3 f77 -lblas -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4 is MUCH slower (RRP31C). *4 RRP31C compiled with ...ou=4 -pfa -WK,-p=1. *5 RRP31C; all compiled with ...ou=4 -pfa -WK,-p=1. *6 RRP31E compiled with ...ou=4 -pfa -WK,-p=1. *7 All compiled with "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1". At Km = 16, RRP31E takes 62% of cycles; VICDF takes 11% of cycles. Example of using DGEMM NOT being an advantage. PP and QQ update DO-loops in RRP31E use up 26-28% of total cycles each. At Km = 24, RRP31E takes 51% of cycles; vicdf takes 16% of cycles; PP and QQ update DO-loops in RRP31E use up 23-25% of total cycles each (RRP31E share decreasing). At Km = 32: RRP31E 40% cycles, COMPLIB_DGEMM_HOISTC 39%, VICDF 19%. *8 At Km = 24, prof claims 133 MFLOPS on 75 MHz, or R = 0.44. At Km = 32, prof claims 613 sec execution time and 163 MFLOPS (Km = 32) for 75 MHz, i.e., R about 0.54 for both 75 MHz and 90 MHz. At Km = 48, prof claims R to be about 0.3. *9 "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1", using RRP31C (calling DAXPY). *10 RRP31E compiled with "f77 -r8000 -mips4 -64 -O3 -LNO:ou=4:cs1=16k:cs2=4m" "-TENV:X=4 -pfa -WK,-so=3,-ro=3,-o=5 -WK,-p=1", the program compiled with "f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1", the program linked with "f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 -pfa -WK,-p=1". *11 As *10, but rrp31e compiled with the additional -WK,ur=4 (inner loop unroll). *12 As *11, but -o=4 instead of -o=5. *13 The whole program compiled (as in *10) with "f77 -r8000 -mips4 -64 -O3" "-LNO:ou=4:cs1=16k:cs2=4m -TENV:X=4 -pfa -WK,-so=3,-ro=3,-o=5 -WK,-p=1". *14 As *13, but "-Ofast=ip21" instead of "-r8000 -mips4 -64 -O3". *15 As *13, but "-ou=8" instead of "-ou=4". *16 The whole program compiled as in *13, but without ":ou=4" (no outer loop unrolling). *17 f77 -lcomplib.sgimath -r10000 -mips4 -64 -O3 -LNO:ou=4. *18 Estimate obtained by using the PCL (R8000) operation counts. A proper estimate would be obtained by running pixie. PART II 1. Positronium negative ion (Ps-) -- ------------------------------ Two runs using RRP31C, compiled differently, with parameters IPWUU = 60, JMAX = 20, DZU = 5, TOL = .05, ZTOP = 120. 99/02/23. As of today, this comparison is not made in Table 2.I. Speedup is small. 1.1. f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3. ------------------------------------------------------------------------------- ewf2plfg43cppaccec5555b23a_r8000_cs_314.2ff IRRPV (RRP--- VERSION) 314 LTDD/R,LTVICD/RRPS,LPRBIN/Z0M,LTCPU/RU F F F F F F F F -RRLSDE- 1.1, 96/08/02. 1.0, 93/05/18. 1.4, 91/03/13 (VICDF) -RRP31C- 2.0, 96/08/05. -1.0; -RRPSM 3.1.1V1. -3.1V1S -3.0V1S -3.0VS DET: -1.4265964634850457 10**120 ENERGY 0.2610000000000000 DET: 2.5931146819654085 10**120 ENERGY 0.2640000000000000 DET: 7.7558711507030882 10**118 ENERGY 0.2620627383190518 DET: -9.7808259030330582 10**114 ENERGY 0.2620061965277588 DET: -5.2273609064104392 10**109 ENERGY 0.2620062036452644 DET: 3.7297727749911305 10**106 ENERGY 0.2620062036453025 DET: -2.6151134961832510 10**109 ENERGY 0.2620062036452835 DET: 3.7297727749911305 10**106 ENERGY 0.2620062036453025 DET: 3.7297727749911305 10**106 ENERGY 0.2620062036453025 1 0.3729772774991136E-13 H 0.0000E+00 0.2620062036453025 5 + 2 DET. EVALUATIONS OF 710.52 SEC. USER 6418.05 DIF 6418.05 REF 0.00 SYS 34.24 1.2. f77 -lcomplib.sgimath -r8000 -mips4 -64 -O3 \ -LNO:ou=4 -TENV:X=4 -pfa -WK,-p=1,-so=3,-ro=3,-o=5. ------------------------------------------------------------------------------- ewf2plfg43cppaccec5555b23a_cs8000ou4pfa_314.2ff IRRPV (RRP--- VERSION) 314 LTDD/R,LTVICD/RRPS,LPRBIN/Z0M,LTCPU/RU F F F F F F F F -RRLSDE- 1.1, 96/08/02. 1.0, 93/05/18. 1.4, 91/03/13 (VICDF) -RRP31C- 2.0, 96/08/05. -1.0; -RRPSM 3.1.1V1. -3.1V1S -3.0V1S -3.0VS DET: -1.4265964634850428 10**120 ENERGY 0.2610000000000000 DET: 2.5931146819654041 10**120 ENERGY 0.2640000000000000 DET: 7.7558711507028519 10**118 ENERGY 0.2620627383190518 DET: -9.7808259032345006 10**114 ENERGY 0.2620061965277588 DET: -5.2273694740069976 10**109 ENERGY 0.2620062036452644 DET: 3.7378218156157379 10**106 ENERGY 0.2620062036453025 DET: -2.6163059571265164 10**109 ENERGY 0.2620062036452835 DET: 3.7378218156157379 10**106 ENERGY 0.2620062036453025 DET: 3.7378218156157379 10**106 ENERGY 0.2620062036453025 1 0.3737821815615743E-13 H 0.0000E+00 0.2620062036453025 5 + 2 DET. EVALUATIONS OF 643.29 SEC. USER 5828.61 DIF 5828.61 REF 0.00 SYS 28.63