BENCHMARK: bm_a
===============

TYPE
    Small-memory benchmark.
    Measures the efficiencies of two types of programs on different
    processor types.

PROBLEM
    CFHHM - He atom ground state.

PROGRAMS
    M2MK: not well vectorizable
    EWF2N: well vectorizable
    
AUTHOR
    R. Krivec, Department of Theoretical Physics, 
    J. Stefan Institute, Ljubljana


Table I. Program lengths
=================================================================
Program   Parameters                         Size (MB, fc/Convex)
          NDMU  NDP  NDCN  NDGO NNQSI ISYM      text data   bss
-----------------------------------------------------------------
m2mk        21    2         16    8               .5   .1     5
ewf2n       21    2   22                1         .6   .1    13
=================================================================


Table II. CPU times and relative efficiencies R1 and R2,
normalized to 1 for the C3860: R1 = (120 x 69) / (MFLOPS x CPUtime),
R2 = (120 x 42) / (MFLOPS x CPUtime), where CPUtime is the value
in the m2mk/total column for R1 and in the ewf2n/per iter. column
for R2. On the SPP-1000 with -O3, wall time is reported instead of
CPU time. The reference run for R2 (*8) involves a call to DGEMM
in the critical loop.
====================================================================
Machine            MFLOPS   Opt.           CPU time        
                   (theor.) level   m2mk            ewf2n  
                                  total R1   total  per iter. R2
--------------------------------------------------------------------
C3860 (1 proc.)    120      -O2    69  1.0     291     48    0.9
                                               252     42    1.0  *8

SPP-1000 (1 proc.) 200      -O2    31  1.3     554     92    0.27
                      -nore -O3    40          544     91    0.28 *3

HP 715/75           31      +O3    54  4.9    1943    324    0.50 *1
                            +O2    53  5.0    1932    322    0.50 *2
                            -O     53  5.0    1934    322    0.50 *2

HP 710              12      -O     83  8.3    3561    593    0.71

SGI Power Ch. L    300      -O3    13  2.1     369     62    0.27 *4
                            -O2    22  1.3    2373    396         *5
                            -O3    13  2.1     348     58    0.29 *6
                            -O3    13  2.1     132     22    0.76 *7
====================================================================
*1  Two jobs were running for part of the time, with total CPU
    usage near 100%; when running alone, CPU usage was 97%.
    Resident size: 4.4 MB. etime appears to return CPU time, since
    the elapsed time logged by *00*.csh is 30 - 50% larger.
    This machine serves as a cluster server, which may explain
    its low R2 compared with the HP 710 (a cluster client),
    possibly through more cache misses.
*2  Run on an otherwise empty machine.
*3  16-processor subcomplex, empty. Preliminary. 
*4  f77 -O3 (1 proc.); no additional OPT parameters. The C3860 gets
    a large speedup from vector load/store; on the SGI this still
    has to be optimized via the OPT parameters. Preliminary.
*5  -O2 is clearly unsuitable for production: the ewf2n total is
    more than six times that at -O3 (*4).
*6  f90 -O3 (1 proc.); otherwise as *4. Preliminary. 
*7  f90 -O3 -lblas. Using rrpsm 3.0VS, which uses DGEMM (blas) for the
    most critical loops, it is possible to achieve a large R2.
*8  DGEMM (-lveclib); rrpsm_3.0V.
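

The R1 and R2 columns of Table II can be rechecked from the MFLOPS
and CPU time columns. The following is a minimal sketch in Python,
not part of the benchmark itself: the function name
relative_efficiency and the layout of the rows list are chosen here
for illustration only, and the numbers are copied from Table II.

```python
# Relative efficiency as defined in the Table II caption:
#   R = (120 x reference time) / (MFLOPS x CPUtime)
# Reference machine: C3860, 120 theoretical MFLOPS.
REF_MFLOPS = 120
REF_M2MK_TOTAL = 69      # m2mk total on the C3860 (R1 = 1.0)
REF_EWF2N_PER_ITER = 42  # ewf2n per iteration on the C3860, run *8 (R2 = 1.0)


def relative_efficiency(mflops, cpu_time, ref_time, ref_mflops=REF_MFLOPS):
    """Return (ref_mflops x ref_time) / (mflops x cpu_time)."""
    return (ref_mflops * ref_time) / (mflops * cpu_time)


# (label, theoretical MFLOPS, m2mk total, ewf2n per iteration),
# a subset of the rows of Table II.
rows = [
    ("C3860 -O2", 120, 69, 48),
    ("SPP-1000 -O2", 200, 31, 92),
    ("HP 715/75 +O3", 31, 54, 324),
    ("HP 710 -O", 12, 83, 593),
    ("SGI Power Ch. L -O3", 300, 13, 62),
]

for label, mflops, t_m2mk, t_iter in rows:
    r1 = relative_efficiency(mflops, t_m2mk, REF_M2MK_TOTAL)
    r2 = relative_efficiency(mflops, t_iter, REF_EWF2N_PER_ITER)
    print(f"{label:22s} R1 = {r1:4.1f}   R2 = {r2:4.2f}")
```

A value above 1 means the machine delivers a larger fraction of its
theoretical MFLOPS on that program than the C3860 does; a value
below 1 means a smaller fraction.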