BENCHMARK: bm_a
===============

TYPE
    Small-memory benchmark. Measures the efficiencies of two types of
    programs on different processor types.

PROBLEM
    CFHHM - He atom ground state.

PROGRAMS
    M2MK:  not well vectorizable
    EWF2N: well vectorizable

AUTHOR
    R. Krivec, Department of Theoretical Physics,
    J. Stefan Institute, Ljubljana


Table I. Program lengths
=================================================================
Program  Parameters                        Size (MB, fc/Convex)
         NDMU NDP NDCN NDGO NNQSI ISYM     text  data  bss
-----------------------------------------------------------------
m2mk     21 2 16 8                          .5    .1     5
ewf2n    21 2 22 1                          .6    .1    13
=================================================================


Table II. CPU times and relative efficiencies R1 and R2, normalized
to 1 for the C3860:

    R1 = (120 x 69) / (MFLOPS x CPUtime),
    R2 = (120 x 42) / (MFLOPS x CPUtime),

where CPUtime is the value in the m2mk/total column for R1 and in the
ewf2n/per iteration column for R2. On the SPP-1000, with -O3, WALL time
is reported instead of CPU time. The reference run involves a call to
DGEMM in the critical loop.
========================================================================
Machine              MFLOPS Opt.                 CPU time
                   (theor.) level     m2mk       ewf2n
                                      total  R1  total per iter.   R2
------------------------------------------------------------------------
C3860 (1 proc.)         120 -O2          69 1.0    291        48  0.9
                                                   252        42  1.0 *8
SPP-1000 (1 proc.)      200 -O2          31 1.3    554        92 0.27
                            -nore -O3    40        544        91 0.28 *3
HP 715/75                31 +O3          54 4.9   1943       324 0.50 *1
                            +O2          53 5.0   1932       322 0.50 *2
                            -O           53 5.0   1934       322 0.50 *2
HP 710                   12 -O           83 8.3   3561       593 0.71
SGI Power Ch. L         300 -O3          13 2.1    369        62 0.27 *4
                            -O2          22 1.3   2373       396      *5
                            -O3          13 2.1    348        58 0.29 *6
                            -O3          13 2.1    132        22 0.76 *7
========================================================================

*1 Two jobs were running for part of the time, with a combined CPU
   usage of almost 100%; when running alone, CPU usage was 97%.
   Resident size: 4.4 MB. etime evidently returns CPU time, because
   the true time logged by *00*.csh is larger by 30 - 50%. The fact
   that this machine serves as a cluster server may explain its low R2
   compared with the HP 710 (a cluster client); more cache misses
   could be responsible.
*2 Run on an empty machine.
*3 16-processor subcomplex, empty. Preliminary.
*4 f77 -O3 (1 proc.); no additional OPT parameters. The C3860 gets a
   lot of speedup from vector load/store; this has to be OPT-imized on
   the SGI. Preliminary.
*5 -O2 is obviously not something to be used for production.
*6 f90 -O3 (1 proc.); otherwise as *4. Preliminary.
*7 f90 -O3 -lblas. Using rrpsm 3.0VS, which uses DGEMM (blas) for the
   most critical loops, it is possible to achieve a large R2.
*8 DGEMM (-lveclib); rrpsm_3.0V.
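
The R1 and R2 definitions given with Table II can be checked with a few
lines of code. The following is a minimal sketch in Python (not part of
the benchmark itself); the function and constant names are illustrative
assumptions, and the input numbers are copied from Table II.

```python
# Sketch only: recompute the relative efficiencies R1 and R2 of Table II.
# Names (REF_*, r1, r2) are hypothetical and not part of the benchmark.

# Reference values for the C3860 (1 proc.), as used in the table caption.
REF_MFLOPS = 120.0          # theoretical MFLOPS of the C3860
REF_M2MK_TOTAL = 69.0       # m2mk, total CPU time
REF_EWF2N_PER_ITER = 42.0   # ewf2n, CPU time per iteration (DGEMM run, *8)


def r1(mflops, m2mk_total):
    """R1 = (120 x 69) / (MFLOPS x CPUtime), CPUtime from m2mk/total."""
    return (REF_MFLOPS * REF_M2MK_TOTAL) / (mflops * m2mk_total)


def r2(mflops, ewf2n_per_iter):
    """R2 = (120 x 42) / (MFLOPS x CPUtime), CPUtime from ewf2n/per iter."""
    return (REF_MFLOPS * REF_EWF2N_PER_ITER) / (mflops * ewf2n_per_iter)


if __name__ == "__main__":
    # (machine, theoretical MFLOPS, m2mk total, ewf2n per iteration)
    rows = [
        ("C3860 (1 proc.), *8", 120, 69, 42),
        ("HP 710", 12, 83, 593),
        ("SGI Power Ch. L, *7", 300, 13, 22),
    ]
    for name, mflops, t_m2mk, t_ewf2n in rows:
        print(f"{name:22s} R1 = {r1(mflops, t_m2mk):.1f}"
              f"  R2 = {r2(mflops, t_ewf2n):.2f}")
```

Both ratios equal 1 for the C3860 reference run by construction. A
value above 1 means the machine needs less CPU time than its
theoretical MFLOPS rating would predict from the C3860 result, and a
value below 1 means it needs more.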