SGI F77 COMPILER BASIC TEST                              Updated 99/02/25

OBJECTIVE

To measure the speed of a simple vector/parallel REAL*8 program compiled
with the basic compiler optimization options, and to compare it with the
DGEMM library routine.

DESCRIPTION

Machine:  saturn.ijs.si (Power Challenge R8000, IP21: 2 x 75 MHz,
          4 x 90 MHz; 2 GB RAM).
Program:  matrix multiplication using nested DO loops.
Matrix dimensions: ND = 800 (reserved/declared array dimension).
Timing:   dtime (wrapper routine tempd.f).

Program details:

P25: I-loop inside (vector style), no directives:

      PROGRAM P25
      ...
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (ND = 800, NIN = 5, NOUT = 6)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND)
      ...
      DO 26 J = 1,N
        DO 24 K = 1,N
          DO 22 I = 1,N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE
      ...

P25PJK: As P25 (J-loop outermost, I-loop, the vector loop, innermost), with
        the C*$*CONCURRENTIZE parallelization directive on the J-loop.
        (Prepared for -pfa; f77 -mp does not recognize the
        C*$*CONCURRENTIZE directive and produces a single-threaded
        executable.)

P25UJK: As P25PJK, but with a C*$*UNROLL(4) directive inserted before the
        J-loop.

P25MJK: As P25PJK, except that the C$DOACROSS parallelization directive is
        placed on the J-loop (prepared for -mp).

P25V:   calls DGEMM.

P25VT:  calls DGEMM with matrix B transposed (no other changes).

RESULTS

-pfa produces slower single-threaded code than a compilation without -pfa,
both for p25 and for p25pjk.  If no unrolling is used, the single-threaded
code has the same speed in both cases, which indicates that both programs
are rearranged in the same way (see the Appendices): the J loop gets
parallelized in both cases.

Table I by itself is not very informative.  Table II shows that the
slowdown of single-threaded execution under -pfa is caused by the
multiprocessing overhead; -pfa -WK,-p=1 eliminates this overhead.
"-LNO:ou=" (outer loop unrolling) brings the program to within a factor
of 2 (6 sec) of the theoretical speed (3.4 sec).  "-ur=4" (inner loop
unrolling) is harmful.  -pfa -WK,-p=1 without -LNO:ou= is 50% slower
(9 sec).

DGEMM: transposing matrix B does not affect the CPU time up to N = 1600.

NOTE

For N = 800 the best hand-coded loops run in 6.4 sec while DGEMM needs
4 sec, the theoretical time being 3.4 sec.  prof -pixie nevertheless
reports the SAME execution time in both cases (about 3.5 sec).  See
Appendix D for a comparison of the prof output files of the two cases.
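
The timing routine tempd.f itself is not listed in this report.  The
following is a minimal sketch of a dtime-based wrapper with the same
calling sequence as the calls TEMPD (T0,T1,TD,6,.true.) used above and in
the appendices; the argument roles and the body are assumptions, not the
original source:

      SUBROUTINE TEMPD (TPREV, TNOW, TDELT, NOUT, LPRINT)
C     Sketch only (assumed implementation of tempd.f).
C     TPREV  - accumulated CPU time at the previous call (assumed input)
C     TNOW   - accumulated CPU time now (assumed output)
C     TDELT  - interval TNOW - TPREV (assumed output)
C     NOUT   - Fortran unit for the report (the calls pass 6)
C     LPRINT - print the interval if .TRUE. (assumed meaning)
      IMPLICIT REAL*8 (A-H,O-Z)
      LOGICAL LPRINT
      REAL*4 TARRAY(2), DTIME
C     DTIME (an f77 library extension) returns the user+system CPU
C     seconds elapsed since the previous DTIME call.
      TNOW = TPREV + DBLE(DTIME(TARRAY))
      TDELT = TNOW - TPREV
      IF (LPRINT) WRITE (NOUT, 100) TDELT, TNOW
  100 FORMAT (1H ,'CPU interval, total (s):',2F12.3)
      RETURN
      END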
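
The directive variants differ from P25 only in comment directives placed
immediately before the J loop; the appendices list only the processed
forms.  The following sketch shows the presumed source-level placement for
p25mjk; the exact clause list is an assumption (footnote *7 of Table I
below quotes LOCAL(K), and the processed code in Appendix C privatizes I
as well):

C     Sketch of the presumed p25mjk kernel.  p25pjk uses
C     C*$*CONCURRENTIZE at this point instead, and p25ujk additionally
C     inserts C*$*UNROLL(4) before the J loop.
C$DOACROSS LOCAL(K,I)
      DO 26 J = 1,N
        DO 24 K = 1,N
          DO 22 I = 1,N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE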

Table I.  Basic optimization options (-On, -pfa, -mp).  ND = 800, N = 800.
          The operation count is 2*N**3 = 1*10**9, therefore the lower
          bound on the CPU time on a single 300 MFLOPS processor (at
          75 MHz) is 3.4 seconds.  Mixed 75/90 MHz runs were done if there
          was more than one thread, therefore the parallel run timings are
          uncertain by 20 percent.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p25       f77 -r8000 -mips4 -64 -O3                                 9  *1
              -pfa                                       1         15  *2
                                                         2          8
                                                         4          4
                                                         6          3
              -n32 -O3                                             10
              -Ofast=ip21                                          10
                  -pfa                                   1         15  *3
                                                         2          8
                                                         6          3
              -r10000 -mips4 -64 -O3                               15
--------------------------------------------------------------------------
p25pjk    f77 -r8000 -mips4 -64 -O3 -pfa                 1         15  *2
                                                         2          9
                                                         4          4
                                                         6          3
          f77 -Ofast=ip21 -pfa                           1         14  *4
                                                         2          8
                                                         6          3
          f77 -r8000 -mips4 -64 -O3 -mp                 any         9  *5
--------------------------------------------------------------------------
p25ujk    f77 -r8000 -mips4 -64 -O3 -pfa                 1         24  *6
                                                         2         13
                                                         4          7
                                                         6          5
          f77 -Ofast=ip21 -pfa                           1         24
                                                         6          5
--------------------------------------------------------------------------
p25mjk    f77 -r8000 -mips4 -64 -O3 -mp                  1         15  *7
                                                         2          8
                                                         4          4
                                                         6          3
--------------------------------------------------------------------------

*1  Note that the CPU time with ND = 1600, N = 800 is 17 seconds (an
    increase in ND slows down the program).
*2  f77 -pfa slows down p25 and p25pjk.  This indicates that -pfa
    generates the same optimizations with and without directives.
    (The loop order has NOT been changed.)
*3  See Appendix A for the transformed code.
*4  The transformed code is equivalent to that of p25 (Appendix A).
*5  f77 -mp does not recognize the C*$*CONCURRENTIZE directive - it always
    produces a single-threaded executable, but the code is as fast as
    f77 -pfa on a single thread.
*6  Unroll factor 4 on the inner loop - very slow code.  See Appendix B.
*7  The C$DOACROSS LOCAL(K) directive is recognized by -mp.  For the
    processed source see Appendix C.

Table II.  Additional optimization options.  ND = 800, N = 800.
           -WK is ignored if -pfa is not specified; if -pfa is specified,
           the -WK options are cumulative.  The -LNO options are executed
           by the back end (be) and are not visible in the pfa phase
           output (e.g., parallelization is done before the outer loop
           unrolling).  IPC: number of cycles per instruction (total
           program, average).

--------------------------------------------------------------------------
Program   Compiler call                          IPC   Threads   CPU time
--------------------------------------------------------------------------
p25       f77 -O3                                                   15
          f77 -r8000 -mips4 -64 -O3                                  9.4
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=2                       11
                                    -LNO:ou=4                        6.7 *1
                                    -LNO:ou=6                       17
          ................................................................
          f77 -r8000 -mips4 -64 -O3 -pfa -WK,-p=1                    9
              -TENV:X=4                                              9
              -TENV:X=4 -LNO:ou=4                                    6.4 *2
          ................................................................
          f77 -r8000 -mips4 -64 -O3
              -TENV:X=4 -LNO:ou=4 -WK,-ur=12,-ur2=200
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                23
              -TENV:X=4 -LNO:ou=4 -WK,-ur=6,-ur2=200
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                23
              -TENV:X=4 -LNO:ou=4 -WK,-ur=2,-ur2=200
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                16
              -TENV:X=4 -LNO:ou=4
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                 6.4 *3
          ................................................................
          f77 -r8000 -mips4 -64 -O3
              -TENV:X=4 -LNO:ou=4
                  -WK,-so=3,-ro=3,-o=5 -pfa                         16
                                                         2           9
                                                         4           4
                                                         6           3
              -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -pfa                         15
                                                         2           8
                                                         4           4
                                                         6           3
p25v      f77 -lcomplib.sgimath                                      4.1
p25vt     f77 -lcomplib.sgimath                                      4.1
--------------------------------------------------------------------------

*1  The optimal unroll factor is 4.
*2  -TENV:X=4 brings a very small improvement.
*3  Inner loop unrolling should not be used.
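
Footnotes *1 and *3 of Table II distinguish outer loop unrolling
(-LNO:ou=) from inner loop unrolling (-ur=).  As an illustration of why
only the former helps, outer unrolling of the J loop by 4 (unroll-and-jam)
has roughly the following effect on the kernel.  This is a hand-written
sketch of the transformation, not the code the compiler actually emits:

C     Illustration of -LNO:ou=4-style outer unrolling (unroll-and-jam of
C     the J loop).  Each A(I,K) load now feeds four columns of C, so the
C     ratio of flops to loads improves; unrolling the inner I loop adds
C     no such reuse, which is consistent with -ur being harmful.
      DO 26 J = 1,N-3,4
        DO 24 K = 1,N
          DO 22 I = 1,N
            C(I,J  ) = C(I,J  ) + A(I,K) * B(K,J  )
            C(I,J+1) = C(I,J+1) + A(I,K) * B(K,J+1)
            C(I,J+2) = C(I,J+2) + A(I,K) * B(K,J+2)
            C(I,J+3) = C(I,J+3) + A(I,K) * B(K,J+3)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE
C     (remainder loop for the last MOD(N,4) columns omitted)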
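
p25v and p25vt link against -lcomplib.sgimath and call DGEMM; the call
itself is not reproduced in this report.  A minimal sketch using the
standard Level 3 BLAS interface (the transpose flags and leading
dimensions shown here are assumptions consistent with the program
descriptions):

C     Sketch only: C := 1.0*A*B + 0.0*C via the standard BLAS DGEMM.
C     p25vt presumably stores B transposed and passes 'T' as the second
C     transpose argument instead of 'N'.
      CALL DGEMM ('N','N',N,N,N,1.0D0,A,ND,B,ND,0.0D0,C,ND)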

Table III.  As Table II, but ND = 1600, N = 1600.  The theoretical lower
            limit on the CPU time is 27 seconds (75 MHz R8000).  The data
            cache is 16 KB, the secondary instruction/data cache is 4 MB.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p25       f77 -r8000 -mips4 -64 -O3                                212
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4                      131
              -LNO:ou=6                                            105  *1
              -LNO:ou=6 -OPT:IEEE_arithmetic=3                     105
              -LNO:ou=6 -TARG:fp_precise=OFF                       105
              -LNO:ou=6:opt=1                                      118
              -LNO:ou=6:opt=1:cs1=16K:cs2=4M                       117
              -LNO:ou=8                                            147
              -LNO:ou=8 -OPT:IEEE_arithmetic=3                     148
          f77 -Ofast=ip21 -LNO:ou=6                                111
              -LNO:ou=6:opt=1:cs1=16K:cs2=4M                       116
          ................................................................
          f77 -r8000 -mips4 -64 -O3 -pfa
              -LNO:ou=4:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          112
              -LNO:ou=6
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          106
              -LNO:ou=6:cs1=16k:cs2=4m
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          102
              -LNO:ou=6:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          102
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096           71
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=16             71
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1                     71  *2
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1
                  -OPT:IEEE_arithmetic=3                            72
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1 -WK,-ur=2          258  *3
              -LNO:ou=10:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1                     82
              -LNO:ou=12:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096           91
          f77 -Ofast=ip21 -pfa
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1                     71
          ................................................................
          f77 -r8000 -mips4 -64 -O3 -pfa -mp
              -LNO:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2         xxx
                                                         4
              -LNO:ou=6:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2
              -LNO:ou=10:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2
--------------------------------------------------------------------------
p25v      f77 -r8000 -mips4 -64 -O3 -lcomplib.sgimath              34
p25v      f77 -lcomplib.sgimath                                    34
p25vt     f77 -lcomplib.sgimath                                    34  *4
--------------------------------------------------------------------------

*1  -ou=6 is optimal without -pfa.
*2  -ou=8 is optimal with -pfa.
*3  -ur > 1 is harmful.
*4  Transposing one of the matrices apparently does not affect the CPU
    time.
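
Tables IV-a and IV-b below repeat the measurement with an integer program,
p2i, whose source is not listed here.  It is presumably the integer
analogue of the P25 kernel, roughly as in the following sketch (the array
names and declarations are assumptions):

C     Presumed p2i kernel (sketch): same loop order as P25, but with
C     INTEGER arrays (ND = 400 or 800 as in Tables IV-a and IV-b).
      INTEGER IA(ND,ND), IB(ND,ND), IC(ND,ND)
      ...
      DO 26 J = 1,N
        DO 24 K = 1,N
          DO 22 I = 1,N
            IC(I,J) = IC(I,J) + IA(I,K) * IB(K,J)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE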

Table IV-a.  Integer matrix multiplication, ND = 400, N = 400.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p2i       f77 -r8000 -mips4 -64 -O3                                 4.3
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4                       4.3
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1         4.3
          f77 -Ofast=ip21                                           4.4
--------------------------------------------------------------------------

Table IV-b.  Integer matrix multiplication, ND = 800, N = 800.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p2i       f77 -r8000 -mips4 -64 -O3                                 34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4                       34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1         34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -TENV:X=4             34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -TENV:X=4
              -pfa -WK,-p=1                                         34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -TENV:X=4
              -pfa -WK,-p=1,-so=3,-ro=3,-o=5                        34
          f77 -Ofast=ip21                                           36
--------------------------------------------------------------------------

APPENDIX A

C KAP/SGI 6.0.2 Jan2695 o5r2so3                       25-Jan-1999 21:01:36
# 1 "p25.f"
# 1 "p25.f"
      program P25
C
C     EACH STEP TIMED SEPARATELY.
C     ONLY ONE LOOP CONFIGURATION TESTED (BECAUSE RRPSK 1.0 PROVED
C     STABILITY ON LOOP REARRANGEMENTS).
C     I-LOOP INSIDE (FOR COMPILERS WHICH DO NOT EXCHANGE LOOPS).
C     V1, 92/11/26.
C
      implicit real*8 (A-H,O-Z)
      parameter (ND = 800, NIN = 5, NOUT = 6)
      dimension A(800,800), B(800,800), C(800,800)
# 12 "p25.f"
      integer II1, II2
C
      write (6, 200)
# 14 "p25.f"
      read (5, 100) N
      write (6, 201) 800, N
      call TEMPD (T0,T1,TD,6,.true.)
C$PAR parallel if (N .gt. 12) shared (N,A,B) local (II1,II2)
# 17 "p25.f"
CSGI$ startloop 9
# 17 "p25.f"
C$PAR pdo
# 17 "p25.f"
      do 3 II1=1,N
CSGI$ startloop 10
# 18 "p25.f"
        do 2 II2=1,N
          A(II2,II1) = II1 * II2
          B(II2,II1) = II1 + II2
    2   continue
# 21 "p25.f"
CSGI$ endloop 10
    3 continue
# 22 "p25.f"
C$PAR end pdo nowait
# 22 "p25.f"
CSGI$ endloop 9
# 22 "p25.f"
C$PAR end parallel
      call TEMPD (T1,T2,TD,6,.true.)
C$PAR parallel if (N .gt. 22) shared (N,C) local (II1,II2)
# 24 "p25.f"
CSGI$ startloop 2
# 24 "p25.f"
C$PAR pdo
# 24 "p25.f"
      do 5 II1=1,N
CSGI$ startloop 3
# 25 "p25.f"
        do 4 II2=1,N
          C(II2,II1) = 0.d0
    4   continue
# 27 "p25.f"
CSGI$ endloop 3
    5 continue
# 28 "p25.f"
C$PAR end pdo nowait
# 28 "p25.f"
CSGI$ endloop 2
# 28 "p25.f"
C$PAR end parallel
      call TEMPD (T2,T3,TD,6,.true.)
C$PAR parallel if (N * N * N .gt. 142) shared (N,C,A,B) local (II1,K,II2
C$PAR& )
# 30 "p25.f"
CSGI$ startloop 4
# 30 "p25.f"
C$PAR pdo
# 30 "p25.f"
      do 8 II1=1,N
CSGI$ startloop 5
# 31 "p25.f"
        do 7 K=1,N
CSGI$ startloop 11
# 32 "p25.f"
          do 6 II2=1,N
            C(II2,II1) = C(II2,II1) + A(II2,K) * B(K,II1)
    6     continue
# 34 "p25.f"
CSGI$ endloop 11
    7   continue
# 35 "p25.f"
CSGI$ endloop 5
    8 continue
# 36 "p25.f"
C$PAR end pdo nowait
# 36 "p25.f"
CSGI$ endloop 4
# 36 "p25.f"
C$PAR end parallel
# 30 "p25.f"
      J = max0 (N, 0) + 1
# 37 "p25.f"
      call TEMPD (T3,T4,TD,6,.true.)
      write (6, 202) N, 800, ((C(I,J), J=1,4), I=1,4)
C
  100 format(i4)
  200 format(1H ,8HP25 V1  )
  201 format(1H ,8HND, N   ,2i8)
  202 format(1H ,2i8,/,(1H ,4e16.8))
      end

APPENDIX B

C KAP/SGI 6.0.2 Jan2695 o5r2so3                       26-Jan-1999 10:57:07
# 1 "p25ujk.f"
# 1 "p25ujk.f"
      program P25UJK
C
C     MATRIX MULTIPLICATION TEST.
C     LOOP ORDER AS IN P25, BUT WITH PFA DIRECTIVES.
C     V1.0, 99/01/27.
C
      implicit real*8 (A-H,O-Z)
      parameter (ND = 800, NIN = 5, NOUT = 6)
      dimension A(800,800), B(800,800), C(800,800)
# 10 "p25ujk.f"
      integer II1, II2
C
      write (6, 200)
# 12 "p25ujk.f"
      read (5, 100) N
      write (6, 201) 800, N
      call TEMPD (T0,T1,TD,6,.true.)
C*$* noconcurrentize
CSGI$ startloop 2
# 16 "p25ujk.f"
      do 12 II1=1,N
CSGI$ startloop 3
# 17 "p25ujk.f"
        do 10 II2=1,N
          A(II2,II1) = II1 * II2
          B(II2,II1) = II1 + II2
   10   continue
# 20 "p25ujk.f"
CSGI$ endloop 3
   12 continue
# 21 "p25ujk.f"
CSGI$ endloop 2
      call TEMPD (T1,T2,TD,6,.true.)
C*$* noconcurrentize
CSGI$ startloop 4
# 24 "p25ujk.f"
      do 20 II1=1,N
CSGI$ startloop 5
# 25 "p25ujk.f"
        do 18 II2=1,N
          C(II2,II1) = 0.d0
   18   continue
# 27 "p25ujk.f"
CSGI$ endloop 5
   20 continue
# 28 "p25ujk.f"
CSGI$ endloop 4
      call TEMPD (T2,T3,TD,6,.true.)
C*$* concurrentize
C$PAR parallel if (N * N * N .gt. 142) shared (N,C,A,B) local (II1,K,II2
C$PAR& )
# 31 "p25ujk.f"
CSGI$ startloop 9
# 31 "p25ujk.f"
C$PAR pdo
# 31 "p25ujk.f"
      do 5 II1=1,N
CSGI$ startloop 10
# 32 "p25ujk.f"
        do 4 K=1,N
# 34 "p25ujk.f"
CSGI$ startloop 11
# 34 "p25ujk.f"
          do 2 II2=1,N-3,4
            C(II2,II1) = C(II2,II1) + A(II2,K) * B(K,II1)
# 35 "p25ujk.f"
            C(II2+1,II1) = C(II2+1,II1) + A(II2+1,K) * B(K,II1)
# 35 "p25ujk.f"
            C(II2+2,II1) = C(II2+2,II1) + A(II2+2,K) * B(K,II1)
# 35 "p25ujk.f"
            C(II2+3,II1) = C(II2+3,II1) + A(II2+3,K) * B(K,II1)
    2     continue
# 36 "p25ujk.f"
CSGI$ endloop 11
# 34 "p25ujk.f"
CSGI$ startloop 6
# 34 "p25ujk.f"
          do 3 II2=II2,N,1
            C(II2,II1) = C(II2,II1) + A(II2,K) * B(K,II1)
    3     continue
# 36 "p25ujk.f"
CSGI$ endloop 6
    4   continue
# 37 "p25ujk.f"
CSGI$ endloop 10
    5 continue
# 38 "p25ujk.f"
C$PAR end pdo nowait
# 38 "p25ujk.f"
CSGI$ endloop 9
# 38 "p25ujk.f"
C$PAR end parallel
# 31 "p25ujk.f"
      J = max0 (N, 0) + 1
# 39 "p25ujk.f"
      call TEMPD (T3,T4,TD,6,.true.)
      write (6, 202) N, 800, ((C(I,J), J=1,4), I=1,4)
C
  100 format(i4)
  200 format(1H ,12HP25UJK V1   )
  201 format(1H ,8HND, N   ,2i8)
  202 format(1H ,2i8,/,(1H ,4e16.8))
      end

APPENDIX C

p25mjk processed code.

C ***********************************************************
C Fortran file translated from WHIRL  Tue Jan 26 11:17:34 1999
C ***********************************************************

        PROGRAM MAIN
        IMPLICIT NONE
C
C       **** Variables and functions ****
C
        REAL*8 A(800_8, 800_8)
        REAL*8 B(800_8, 800_8)
        REAL*8 C(800_8, 800_8)
        INTEGER*4 N
        REAL*8 T0
        REAL*8 T1
        REAL*8 TD
        INTEGER*4 J
        INTEGER*4 I
        REAL*8 T2
        REAL*8 T3
        INTEGER*4 k
        REAL*8 T4
        EXTERNAL tempd
C
C       **** Temporary variables ****
C
        INTEGER*4 J1
        INTEGER*4 I0
        INTEGER*4 J0
        INTEGER*4 I1
        INTEGER*4 wd_J
        INTEGER*4 I2
        INTEGER*4 k2
        INTEGER*4 k1
        INTEGER*4 k0
        INTEGER*4 I3
        REAL*8 mi0
        REAL*8 mi1
        REAL*8 mi2
        REAL*8 mi3
        REAL*8 mi4
        REAL*8 mi5
        REAL*8 mi6
        REAL*8 mi7
        REAL*8 mi8
        REAL*8 mi9
        INTEGER*4 wd_I
        REAL*8 mi10
        REAL*8 mi11
        INTEGER*4 wd_$wd_I
        REAL*8 mi12
C
C       **** statements ****
C
        WRITE(6, '(1H ,12HP25MJK V1   )')
        READ(5, '(I4)') N
        WRITE(6, '(1H ,8HND, N   ,2I8)') 800, N
        CALL tempd(T0, T1, TD, 6, .TRUE.)
        DO J1 = 1, N, 1
          DO I0 = 1, N, 1
            A(I0, J1) = DBLE((J1 * I0))
            B(I0, J1) = DBLE((J1 + I0))
          END DO
          I = MAX((N + 1), 1)
        END DO
        CALL tempd(T1, T2, TD, 6, .TRUE.)
        DO J0 = 1, (N + -2), 3
          DO I1 = 1, N, 1
            C(I1, J0) = 0.0D00
            C(I1, J0 + 1) = 0.0D00
            C(I1, J0 + 2) = 0.0D00
          END DO
          I = MAX((N + 1), 1)
          I = MAX((N + 1), 1)
          I = MAX((N + 1), 1)
        END DO
        DO wd_J = J0, N, 1
          DO I2 = 1, N, 1
            C(I2, wd_J) = 0.0D00
          END DO
          I = MAX((N + 1), 1)
        END DO
        CALL tempd(T2, T3, TD, 6, .TRUE.)
C$DOACROSS lastlocal(J), local(k, k2, k1, k0, I, I3), local(mi0, mi1,
C$& mi2, mi3, mi4, mi5, mi6, mi7, mi8, mi9, wd_I, mi10, mi11, wd_$wd_I
C$& , mi12)
        DO J = 1, N, 1
          DO I3 = 1, (N + -9), 10
            mi0 = C(I3, J)
            mi1 = C(I3 + 1, J)
            mi2 = C(I3 + 9, J)
            mi3 = C(I3 + 2, J)
            mi4 = C(I3 + 8, J)
            mi5 = C(I3 + 3, J)
            mi6 = C(I3 + 7, J)
            mi7 = C(I3 + 4, J)
            mi8 = C(I3 + 6, J)
            mi9 = C(I3 + 5, J)
            DO k = 1, N, 1
              mi0 = (mi0 +(A(I3, k) * B(k, J)))
              mi1 = (mi1 +(A(I3 + 1, k) * B(k, J)))
              mi3 = (mi3 +(A(I3 + 2, k) * B(k, J)))
              mi5 = (mi5 +(A(I3 + 3, k) * B(k, J)))
              mi7 = (mi7 +(A(I3 + 4, k) * B(k, J)))
              mi9 = (mi9 +(A(I3 + 5, k) * B(k, J)))
              mi8 = (mi8 +(A(I3 + 6, k) * B(k, J)))
              mi6 = (mi6 +(A(I3 + 7, k) * B(k, J)))
              mi4 = (mi4 +(A(I3 + 8, k) * B(k, J)))
              mi2 = (mi2 +(A(I3 + 9, k) * B(k, J)))
            END DO
            C(I3 + 5, J) = mi9
            C(I3 + 6, J) = mi8
            C(I3 + 4, J) = mi7
            C(I3 + 7, J) = mi6
            C(I3 + 3, J) = mi5
            C(I3 + 8, J) = mi4
            C(I3 + 2, J) = mi3
            C(I3 + 9, J) = mi2
            C(I3 + 1, J) = mi1
            C(I3, J) = mi0
          END DO
          DO wd_I = I3, (N + -1), 2
            mi10 = C(wd_I, J)
            mi11 = C(wd_I + 1, J)
            DO k0 = 1, N, 1
              mi10 = (mi10 +(A(wd_I, k0) * B(k0, J)))
              mi11 = (mi11 +(A(wd_I + 1, k0) * B(k0, J)))
            END DO
            C(wd_I + 1, J) = mi11
            C(wd_I, J) = mi10
          END DO
          DO wd_$wd_I = wd_I, N, 1
            mi12 = C(wd_$wd_I, J)
            DO k1 = 1, N, 1
              mi12 = (mi12 +(A(wd_$wd_I, k1) * B(k1, J)))
            END DO
            C(wd_$wd_I, J) = mi12
          END DO
          DO k2 = 1, N, 1
            I = MAX((N + 1), 1)
          END DO
        END DO
        CALL tempd(T3, T4, TD, 6, .TRUE.)
        WRITE(6, '(1H ,2I8,/,(1H ,4E16.8))') N, 800,
     &    ((C(I, J), J = 1, 4, 1), I = 1, 4, 1)
        STOP
        END ! MAIN

APPENDIX D

diff of the prof output files without (<, p25) and with (>, p25v) the
DGEMM call:

2,3c2,3
< Prof run at: Wed Feb  3 15:32:12 1999
< Command line: prof -r8000 -pixie -gprof -op -archinfo -h -q 4 p25.e_f77_r8000_mips4_64_O3 -pixie
---
> Prof run at: Wed Feb  3 15:30:28 1999
> Command line: prof -r8000 -pixie -gprof -op -archinfo -h -q 4 p25v.e_f77_r8000_mips4_64_O3_lcomplib.sgimath -pixie
6,9c6,9
<   260190702: Total number of cycles
<    3.46921s: Total execution time
<  1035469993: Total number of instructions executed
<       0.251: Ratio of cycles / instruction
---
>   285309264: Total number of cycles
>    3.80412s: Total execution time
>   998949432: Total number of instructions executed
>       0.286: Ratio of cycles / instruction
17,18c17,18
<  1025668746: Floating point operations (295.649 Mflops @ 75 MHz)
<   106563413: Integer operations (30.7169 M intops @ 75 MHz)
---
>  1026031421: Floating point operations (269.716 Mflops @ 75 MHz)
>    28495088: Integer operations (7.49058 M intops @ 75 MHz)
21,29c21,43
<   207956083: Number of scheduling stalls
<     1445736: Number of stalls waiting for operands
<       10479: Number of nop's in branch delay slot
<   359070816: Number of load instructions executed
<  2872534400: Number of bytes loaded
<     2648313: Number of store instructions executed
<    20914843: Number of bytes stored
<    52031417: Number of conditional branch instructions executed
<    26046744: Number of conditional branches taken
---
>   262488610: Number of scheduling stalls
>   130403663: Number of stalls waiting for operands
>       13359: Number of nop's in branch delay slot
>   301026875: Number of load instructions executed
>  2408179114: Number of bytes loaded
>   131127408: Number of store instructions executed
>  1049002899: Number of bytes stored
>    22323303: Number of conditional branch instructions executed
>     5858711: Number of conditional branches taken
>      214812: Number of branch likely instructions executed
>       53503: Number of branch likely branches taken