SUN F77 COMPILER BASIC TEST
                              Updated 99/06/18


OBJECTIVE

To measure the speed of a simple vector/parallel REAL*8 program compiled
with basic compiler optimization options.


DESCRIPTION

Machine: dune.ijs.si (Sun Enterprise 4500, 8 x 336 MHz, 8 GB RAM).
         SunOS dune 5.7 Generic sun4u sparc SUNW,Ultra-Enterprise
Program: Matrix multiplication using nested DO loops.
Matrix dimensions: ND (declared array size); the order N is read at run time.
Timing: dtime (tempd.f)
Program details:
    P25:    I-loop inside (vector style), no directives, N is read:

      PROGRAM P25
      ...
      IMPLICIT    REAL*8  (A-H,O-Z)
      PARAMETER   (ND = 800, NIN = 5, NOUT = 6)
      DIMENSION   A(ND,ND), B(ND,ND), C(ND,ND)
      ...
      DO  26  J = 1,N
          DO  24  K = 1,N
              DO  22  I = 1,N
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
   22         CONTINUE
   24     CONTINUE
   26 CONTINUE
      ...


    P25V:    calls DGEMM
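
The P25V source is not listed here. As a rough sketch only, a DGEMM
call equivalent to the P25 loop nest might look as follows, assuming
the standard BLAS interface DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A,
LDA, B, LDB, BETA, C, LDC). The program name P25VSK, the array R, the
small ND and the test values are made up for illustration. BETA = 1 is
an assumption, chosen so that the call accumulates into C as the loops
do. A BLAS library must be linked.

```fortran
      PROGRAM P25VSK
C     Hypothetical sketch of a P25V-style call; not the original.
      IMPLICIT    REAL*8  (A-H,O-Z)
      PARAMETER   (ND = 4)
      DIMENSION   A(ND,ND), B(ND,ND), C(ND,ND), R(ND,ND)
      EXTERNAL    DGEMM
      N = 3
C     Fill A and B with simple values; clear C and the reference R.
      DO  12  J = 1,N
          DO  10  I = 1,N
              A(I,J) = DBLE(I + J)
              B(I,J) = DBLE(I - J)
              C(I,J) = 0.0D0
              R(I,J) = 0.0D0
   10     CONTINUE
   12 CONTINUE
C     Reference result from the P25 loop nest.
      DO  26  J = 1,N
          DO  24  K = 1,N
              DO  22  I = 1,N
                  R(I,J) = R(I,J) + A(I,K) * B(K,J)
   22         CONTINUE
   24     CONTINUE
   26 CONTINUE
C     C = 1.0*A*B + 1.0*C; the leading dimension is ND, not N.
      CALL DGEMM('N', 'N', N, N, N, 1.0D0, A, ND, B, ND,
     &           1.0D0, C, ND)
C     Largest absolute difference between the two results.
      DMAX = 0.0D0
      DO  32  J = 1,N
          DO  30  I = 1,N
              DMAX = MAX(DMAX, ABS(C(I,J) - R(I,J)))
   30     CONTINUE
   32 CONTINUE
      WRITE (6,*) 'MAX DIFF = ', DMAX
      END
```

Because N may be smaller than ND, the leading dimensions passed to
DGEMM are ND (the declared size), while the problem size is N.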


RESULTS

There are two sets of results. Section 1 was kindly supplied by Ruud van
der Pas of the Sun European HPC Team on 99/06/08. Section 2 was measured
on the local machine and may not be conclusive: the hand-coded results
are probably as good as possible, but the DGEMM version may have linked
the wrong library because the -xarch option was absent.


1. E6500 400MHz UltraSPARC-II, L2 cache: 8 MByte, compiler version 5.0.
-- --------------------------------------------------------------------


Versions of the program (R. van der Pas):

P25          Compiled Fortran
P25-perflib  Performance library version
P25-ruud     Uses my version of DGEMM - only suitable for C=A*B, but
             otherwise usable for arbitrary matrices, including
             non-square. Coded entirely in Fortran 77, i.e. no assembly.


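Table 1.I.  ND = N = 800, 1600. Operation count is 2*N**3 = 1.0*10**9
and 8.2*10**9, therefore the lower bound on CPU time on a single 800
MFLOPS processor is 1.3 and 10.2 seconds, respectively. Percentages are
fractions of peak (800 Mflop/s here, 672 Mflop/s on the E4500);
Speed-up is the ratio of the two Mflop/s columns. (R. van der Pas.)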
Compile/link options: "-fast -fsimple=2 -xtarget=ultra2 -xarch=v8plusa".
------------------------------------------------------------------------
Version          ND      N    Time(s)   Mflop/s   E4500@336MHz  Speed-up
------------------------------------------------------------------------
P25             800    800      6.7       153 (19%)   79 (12%)    1.94
P25-perflib     800    800      1.7       602 (75%)  260 (38%)    2.32
                                          690 (86%)                    *1)
P25-ruud        800    800      1.9       539 (67%)    n.a.

P25            1600   1600     203.3       40 ( 5%)   41 (6%)     0.98
P25-perflib    1600   1600      13.7      598 (75%)  228 (34%)    2.62
                                          710 (89%)                    *1)
P25-ruud       1600   1600      19.8      414 (52%)    n.a.
------------------------------------------------------------------------
*1)  Tuned for 8 MB cache, supplied by R. van der Pas 99/06/16.
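
The Mflop/s column can be rechecked from the operation count and the
measured times. A minimal sketch, using the two P25 times from the
table above; the program and variable names are illustrative only, and
PEAK is the 800 Mflop/s single-processor figure quoted in the caption.

```fortran
      PROGRAM MFLOPS
C     Illustrative only: recompute Mflop/s and % of peak for P25.
      IMPLICIT    REAL*8  (A-H,O-Z)
      PARAMETER   (PEAK = 800.0D0)
      DIMENSION   T(2)
      INTEGER     NN(2)
C     Matrix orders and measured P25 times in seconds (Table 1.I).
      DATA        NN / 800, 1600 /
      DATA        T  / 6.7D0, 203.3D0 /
      DO  10  M = 1,2
C         2*N**3 operations: one multiply and one add per inner step.
          OPS   = 2.0D0 * DBLE(NN(M))**3
C         Achieved rate in Mflop/s and lower bound on time at peak.
          RATE  = OPS / (T(M) * 1.0D6)
          TMIN  = OPS / (PEAK * 1.0D6)
          WRITE (6,100) NN(M), RATE, 100.0D0*RATE/PEAK, TMIN
   10 CONTINUE
  100 FORMAT (' N =', I5, F8.1, ' Mflop/s', F6.1, ' % peak',
     &        F6.2, ' s min')
      END
```

Replacing T and PEAK with the values for another row or machine gives
the corresponding entries.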

Comments by R. van der Pas:

1. The performance library version runs much faster than on the E4500.
2. The compiled version for the 800x800 problem is about two times faster.
3. The compiled version for the 1600x1600 problem runs at the same speed.
4. The all-Fortran rewritten version does okay.
5. We reach about 75% of peak (800 Mflop/s) using the performance library.

AD 1 and 5. I think the huge increase in performance (400/336=1.19 ...)
is because of the -xarch=v8plusa link option to get the tuned version 
linked in and of course because of the larger cache.
We don't get the desired 80% of peak, but with 75% it is quite close.

AD 4. I would expect the new release of the compiler to get close to
these numbers as well.

AD 3. This may come as a surprise, but actually it isn't. Currently the
compiler does not interchange the loops, so the inner loop performs
two loads and a store. All floating-point work can be hidden under these
memory operations. However, it means we're looking at the speed of the
memory system and not the speed of the CPU. As the memory system on
both systems runs at the same speed, performance is equal.
For the 800x800 problem we need ~15MB of data and the 8MB cache helps.
The 1600x1600 problem requires about 60MB of data and we lose the
advantage of the larger cache.
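
The data sizes quoted above follow from three REAL*8 matrices of N*N
elements each. A minimal sketch of that arithmetic; the program and
variable names are illustrative, and MB is taken as 10**6 bytes, which
is an assumption of this sketch.

```fortran
      PROGRAM FOOTPR
C     Illustrative only: data footprint of A, B and C versus L2 size.
      IMPLICIT    REAL*8  (A-H,O-Z)
      INTEGER     NN(2)
      DATA        NN / 800, 1600 /
C     8 MByte L2 cache of the E6500 in section 1 (MB = 10**6 bytes
C     is an assumption of this sketch).
      CACHE = 8.0D0
      DO  10  M = 1,2
C         Three REAL*8 matrices of N*N elements, 8 bytes each.
          BYTES = 3.0D0 * 8.0D0 * DBLE(NN(M))**2
          WRITE (6,100) NN(M), BYTES / 1.0D6, CACHE
   10 CONTINUE
  100 FORMAT (' N =', I5, F7.1, ' MB of data,', F5.1, ' MB cache')
      END
```

Both cases exceed the cache size, but the 800x800 data set is roughly
twice the cache while the 1600x1600 data set is several times larger.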

As the numbers for the performance library version and the rewritten
version demonstrate, one can do better than this!


2. E4500, 336 MHz, compiler version 5.0, SunOS 5.7.
-- ------------------------------------------------


f77 -V: f77: WorkShop Compilers 5.0 98/12/15 FORTRAN 77 5.0.


Table 2.I.  ND = 800, N = 800. Operation count is 2*N**3 = 1*10**9, 
therefore the lower bound on CPU time on a single 672 MFLOPS processor 
(at 336 MHz) is 1.5 seconds. (R. Krivec).
--------------------------------------------------------------------------
Program   Compiler call                               Threads  CPU time
--------------------------------------------------------------------------
p25       f77 -fast                                             12.8
              -fast -O5                                         13.0
              -fast -O5 -xarch=v9                               13.5
              -fast -O5 -xarch=v9a                              13.5

          f77 -fast -O5
              -xtarget=ultra2
              -xcache=16/32/1:4096/64/1 -xarch=v9               13.4  *1
          f77 -fast -O5 -unroll=4
              -xtarget=ultra2
              -xcache=16/32/1:4096/64/1 -xarch=v9               13.5  *1

p25v      f77 -fast -xlic_lib=sunperf                            4.1  *2
--------------------------------------------------------------------------
*1  "-xtarget=ultra2 -xcache=16/32/1:4096/64/1" advised by "fpversion".
*2  The DGEMM version may have linked the wrong library because the -xarch
    option was absent from the command line.