SGI F77 COMPILER BASIC TEST                              Updated 99/02/25

OBJECTIVE

To measure the speed of a simple vector/parallel REAL*8 program compiled
with the basic compiler optimization options, and to compare it with the
DGEMM library routine.

DESCRIPTION

Machine:  saturn.ijs.si (Power Challenge R8000, IP21: 2 x 75 MHz,
          4 x 90 MHz; 2 GB RAM).
Program:  matrix multiplication using nested DO loops.
Matrix dimensions: ND = 800 (reserved/declared array dimension).
Timing:   dtime (wrapper routine tempd.f).

Program details:

P25: I-loop inside (vector style), no directives:

      PROGRAM P25
      ...
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (ND = 800, NIN = 5, NOUT = 6)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND)
      ...
      DO 26 J = 1,N
        DO 24 K = 1,N
          DO 22 I = 1,N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE
      ...

P25PJK: As P25 (J-loop outermost, I-loop, the vector loop, innermost), with
        the C*$*CONCURRENTIZE parallelization directive on the J-loop.
        (Prepared for -pfa; f77 -mp does not recognize the
        C*$*CONCURRENTIZE directive and produces a single-threaded
        executable.)

P25UJK: As P25PJK, but with a C*$*UNROLL(4) directive inserted before the
        J-loop.

P25MJK: As P25PJK, except that the C$DOACROSS parallelization directive is
        placed on the J-loop (prepared for -mp).

P25V:   calls DGEMM.

P25VT:  calls DGEMM with matrix B transposed (no other changes).

RESULTS

-pfa produces slower single-threaded code than a compilation without -pfa,
both for p25 and for p25pjk.  If no unrolling is used, the single-threaded
code has the same speed in both cases, which indicates that both programs
are rearranged in the same way (see the Appendices): the J loop gets
parallelized in both cases.

Table I by itself is not very informative.  Table II shows that the
slowdown of single-threaded execution under -pfa is caused by the
multiprocessing overhead; -pfa -WK,-p=1 eliminates this overhead.
"-LNO:ou=" (outer loop unrolling) brings the program to within a factor
of 2 (6 sec) of the theoretical speed (3.4 sec).  "-ur=4" (inner loop
unrolling) is harmful.  -pfa -WK,-p=1 without -LNO:ou= is 50% slower
(9 sec).

DGEMM: transposing matrix B does not affect the CPU time up to N = 1600.

NOTE

For N = 800 the best hand-coded loops run in 6.4 sec while DGEMM needs
4 sec, the theoretical time being 3.4 sec.  prof -pixie nevertheless
reports the SAME execution time in both cases (about 3.5 sec).  See
Appendix D for a comparison of the prof output files of the two cases.
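
The timing routine tempd.f itself is not listed in this report.  The
following is a minimal sketch of a dtime-based wrapper with the same
calling sequence as the calls TEMPD (T0,T1,TD,6,.true.) used above and in
the appendices; the argument roles and the body are assumptions, not the
original source:

      SUBROUTINE TEMPD (TPREV, TNOW, TDELT, NOUT, LPRINT)
C     Sketch only (assumed implementation of tempd.f).
C     TPREV  - accumulated CPU time at the previous call (assumed input)
C     TNOW   - accumulated CPU time now (assumed output)
C     TDELT  - interval TNOW - TPREV (assumed output)
C     NOUT   - Fortran unit for the report (the calls pass 6)
C     LPRINT - print the interval if .TRUE. (assumed meaning)
      IMPLICIT REAL*8 (A-H,O-Z)
      LOGICAL LPRINT
      REAL*4 TARRAY(2), DTIME
C     DTIME (an f77 library extension) returns the user+system CPU
C     seconds elapsed since the previous DTIME call.
      TNOW = TPREV + DBLE(DTIME(TARRAY))
      TDELT = TNOW - TPREV
      IF (LPRINT) WRITE (NOUT, 100) TDELT, TNOW
  100 FORMAT (1H ,'CPU interval, total (s):',2F12.3)
      RETURN
      END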
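
The directive variants differ from P25 only in comment directives placed
immediately before the J loop; the appendices list only the processed
forms.  The following sketch shows the presumed source-level placement for
p25mjk; the exact clause list is an assumption (footnote *7 of Table I
below quotes LOCAL(K), and the processed code in Appendix C privatizes I
as well):

C     Sketch of the presumed p25mjk kernel.  p25pjk uses
C     C*$*CONCURRENTIZE at this point instead, and p25ujk additionally
C     inserts C*$*UNROLL(4) before the J loop.
C$DOACROSS LOCAL(K,I)
      DO 26 J = 1,N
        DO 24 K = 1,N
          DO 22 I = 1,N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE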

Table I.  Basic optimization options (-On, -pfa, -mp).  ND = 800, N = 800.
          The operation count is 2*N**3 = 1*10**9, therefore the lower
          bound on the CPU time on a single 300 MFLOPS processor (at
          75 MHz) is 3.4 seconds.  Mixed 75/90 MHz runs were done if there
          was more than one thread, therefore the parallel run timings are
          uncertain by 20 percent.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p25       f77 -r8000 -mips4 -64 -O3                                 9  *1
              -pfa                                       1         15  *2
                                                         2          8
                                                         4          4
                                                         6          3
              -n32 -O3                                             10
              -Ofast=ip21                                          10
                  -pfa                                   1         15  *3
                                                         2          8
                                                         6          3
              -r10000 -mips4 -64 -O3                               15
--------------------------------------------------------------------------
p25pjk    f77 -r8000 -mips4 -64 -O3 -pfa                 1         15  *2
                                                         2          9
                                                         4          4
                                                         6          3
          f77 -Ofast=ip21 -pfa                           1         14  *4
                                                         2          8
                                                         6          3
          f77 -r8000 -mips4 -64 -O3 -mp                 any         9  *5
--------------------------------------------------------------------------
p25ujk    f77 -r8000 -mips4 -64 -O3 -pfa                 1         24  *6
                                                         2         13
                                                         4          7
                                                         6          5
          f77 -Ofast=ip21 -pfa                           1         24
                                                         6          5
--------------------------------------------------------------------------
p25mjk    f77 -r8000 -mips4 -64 -O3 -mp                  1         15  *7
                                                         2          8
                                                         4          4
                                                         6          3
--------------------------------------------------------------------------

*1  Note that the CPU time with ND = 1600, N = 800 is 17 seconds (an
    increase in ND slows down the program).
*2  f77 -pfa slows down p25 and p25pjk.  This indicates that -pfa
    generates the same optimizations with and without directives.
    (The loop order has NOT been changed.)
*3  See Appendix A for the transformed code.
*4  The transformed code is equivalent to that of p25 (Appendix A).
*5  f77 -mp does not recognize the C*$*CONCURRENTIZE directive - it always
    produces a single-threaded executable, but the code is as fast as
    f77 -pfa on a single thread.
*6  Unroll factor 4 on the inner loop - very slow code.  See Appendix B.
*7  The C$DOACROSS LOCAL(K) directive is recognized by -mp.  For the
    processed source see Appendix C.

Table II.  Additional optimization options.  ND = 800, N = 800.
           -WK is ignored if -pfa is not specified; if -pfa is specified,
           the -WK options are cumulative.  The -LNO options are executed
           by the back end (be) and are not visible in the pfa phase
           output (e.g., parallelization is done before the outer loop
           unrolling).  IPC: number of cycles per instruction (total
           program, average).

--------------------------------------------------------------------------
Program   Compiler call                          IPC   Threads   CPU time
--------------------------------------------------------------------------
p25       f77 -O3                                                   15
          f77 -r8000 -mips4 -64 -O3                                  9.4
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=2                       11
                                    -LNO:ou=4                        6.7 *1
                                    -LNO:ou=6                       17
          ................................................................
          f77 -r8000 -mips4 -64 -O3 -pfa -WK,-p=1                    9
              -TENV:X=4                                              9
              -TENV:X=4 -LNO:ou=4                                    6.4 *2
          ................................................................
          f77 -r8000 -mips4 -64 -O3
              -TENV:X=4 -LNO:ou=4 -WK,-ur=12,-ur2=200
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                23
              -TENV:X=4 -LNO:ou=4 -WK,-ur=6,-ur2=200
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                23
              -TENV:X=4 -LNO:ou=4 -WK,-ur=2,-ur2=200
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                16
              -TENV:X=4 -LNO:ou=4
                  -WK,-so=3,-ro=3,-o=5 -pfa -WK,-p=1                 6.4 *3
          ................................................................
          f77 -r8000 -mips4 -64 -O3
              -TENV:X=4 -LNO:ou=4
                  -WK,-so=3,-ro=3,-o=5 -pfa                         16
                                                         2           9
                                                         4           4
                                                         6           3
              -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -pfa                         15
                                                         2           8
                                                         4           4
                                                         6           3
p25v      f77 -lcomplib.sgimath                                      4.1
p25vt     f77 -lcomplib.sgimath                                      4.1
--------------------------------------------------------------------------

*1  The optimal unroll factor is 4.
*2  -TENV:X=4 brings a very small improvement.
*3  Inner loop unrolling should not be used.
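
Footnotes *1 and *3 of Table II distinguish outer loop unrolling
(-LNO:ou=) from inner loop unrolling (-ur=).  As an illustration of why
only the former helps, outer unrolling of the J loop by 4 (unroll-and-jam)
has roughly the following effect on the kernel.  This is a hand-written
sketch of the transformation, not the code the compiler actually emits:

C     Illustration of -LNO:ou=4-style outer unrolling (unroll-and-jam of
C     the J loop).  Each A(I,K) load now feeds four columns of C, so the
C     ratio of flops to loads improves; unrolling the inner I loop adds
C     no such reuse, which is consistent with -ur being harmful.
      DO 26 J = 1,N-3,4
        DO 24 K = 1,N
          DO 22 I = 1,N
            C(I,J  ) = C(I,J  ) + A(I,K) * B(K,J  )
            C(I,J+1) = C(I,J+1) + A(I,K) * B(K,J+1)
            C(I,J+2) = C(I,J+2) + A(I,K) * B(K,J+2)
            C(I,J+3) = C(I,J+3) + A(I,K) * B(K,J+3)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE
C     (remainder loop for the last MOD(N,4) columns omitted)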
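
p25v and p25vt link against -lcomplib.sgimath and call DGEMM; the call
itself is not reproduced in this report.  A minimal sketch using the
standard Level 3 BLAS interface (the transpose flags and leading
dimensions shown here are assumptions consistent with the program
descriptions):

C     Sketch only: C := 1.0*A*B + 0.0*C via the standard BLAS DGEMM.
C     p25vt presumably stores B transposed and passes 'T' as the second
C     transpose argument instead of 'N'.
      CALL DGEMM ('N','N',N,N,N,1.0D0,A,ND,B,ND,0.0D0,C,ND)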

Table III.  As Table II, but ND = 1600, N = 1600.  The theoretical lower
            limit on the CPU time is 27 seconds (75 MHz R8000).  The data
            cache is 16 KB, the secondary instruction/data cache is 4 MB.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p25       f77 -r8000 -mips4 -64 -O3                                212
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4                      131
              -LNO:ou=6                                            105  *1
              -LNO:ou=6 -OPT:IEEE_arithmetic=3                     105
              -LNO:ou=6 -TARG:fp_precise=OFF                       105
              -LNO:ou=6:opt=1                                      118
              -LNO:ou=6:opt=1:cs1=16K:cs2=4M                       117
              -LNO:ou=8                                            147
              -LNO:ou=8 -OPT:IEEE_arithmetic=3                     148
          f77 -Ofast=ip21 -LNO:ou=6                                111
              -LNO:ou=6:opt=1:cs1=16K:cs2=4M                       116
          ................................................................
          f77 -r8000 -mips4 -64 -O3 -pfa
              -LNO:ou=4:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          112
              -LNO:ou=6
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          106
              -LNO:ou=6:cs1=16k:cs2=4m
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          102
              -LNO:ou=6:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096          102
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096           71
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=16             71
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1                     71  *2
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1
                  -OPT:IEEE_arithmetic=3                            72
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1 -WK,-ur=2          258  *3
              -LNO:ou=10:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1                     82
              -LNO:ou=12:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1,-chs=4096           91
          f77 -Ofast=ip21 -pfa
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5 -WK,-p=1                     71
          ................................................................
          f77 -r8000 -mips4 -64 -O3 -pfa -mp
              -LNO:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2         xxx
                                                         4
              -LNO:ou=6:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2
              -LNO:ou=8:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2
              -LNO:ou=10:cs1=16k:cs2=4m -TENV:X=4
                  -WK,-so=3,-ro=3,-o=5                   1
                                                         2
--------------------------------------------------------------------------
p25v      f77 -r8000 -mips4 -64 -O3 -lcomplib.sgimath              34
p25v      f77 -lcomplib.sgimath                                    34
p25vt     f77 -lcomplib.sgimath                                    34  *4
--------------------------------------------------------------------------

*1  -ou=6 is optimal without -pfa.
*2  -ou=8 is optimal with -pfa.
*3  -ur > 1 is harmful.
*4  Transposing one of the matrices apparently does not affect the CPU
    time.
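
Tables IV-a and IV-b below repeat the measurement with an integer program,
p2i, whose source is not listed here.  It is presumably the integer
analogue of the P25 kernel, roughly as in the following sketch (the array
names and declarations are assumptions):

C     Presumed p2i kernel (sketch): same loop order as P25, but with
C     INTEGER arrays (ND = 400 or 800 as in Tables IV-a and IV-b).
      INTEGER IA(ND,ND), IB(ND,ND), IC(ND,ND)
      ...
      DO 26 J = 1,N
        DO 24 K = 1,N
          DO 22 I = 1,N
            IC(I,J) = IC(I,J) + IA(I,K) * IB(K,J)
   22     CONTINUE
   24   CONTINUE
   26 CONTINUE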

Table IV-a.  Integer matrix multiplication, ND = 400, N = 400.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p2i       f77 -r8000 -mips4 -64 -O3                                 4.3
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4                       4.3
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1         4.3
          f77 -Ofast=ip21                                           4.4
--------------------------------------------------------------------------

Table IV-b.  Integer matrix multiplication, ND = 800, N = 800.

--------------------------------------------------------------------------
Program   Compiler call                               Threads   CPU time
--------------------------------------------------------------------------
p2i       f77 -r8000 -mips4 -64 -O3                                 34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4                       34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -pfa -WK,-p=1         34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -TENV:X=4             34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -TENV:X=4
              -pfa -WK,-p=1                                         34
          f77 -r8000 -mips4 -64 -O3 -LNO:ou=4 -TENV:X=4
              -pfa -WK,-p=1,-so=3,-ro=3,-o=5                        34
          f77 -Ofast=ip21                                           36
--------------------------------------------------------------------------

APPENDIX A

C KAP/SGI 6.0.2 Jan2695 o5r2so3                       25-Jan-1999 21:01:36
# 1 "p25.f"
# 1 "p25.f"
      program P25
C
C     EACH STEP TIMED SEPARATELY.
C     ONLY ONE LOOP CONFIGURATION TESTED (BECAUSE RRPSK 1.0 PROVED
C     STABILITY ON LOOP REARRANGEMENTS).
C     I-LOOP INSIDE (FOR COMPILERS WHICH DO NOT EXCHANGE LOOPS).
C     V1, 92/11/26.
C
      implicit real*8 (A-H,O-Z)
      parameter (ND = 800, NIN = 5, NOUT = 6)
      dimension A(800,800), B(800,800), C(800,800)
# 12 "p25.f"
      integer II1, II2
C
      write (6, 200)
# 14 "p25.f"
      read (5, 100) N
      write (6, 201) 800, N
      call TEMPD (T0,T1,TD,6,.true.)
C$PAR parallel if (N .gt. 12) shared (N,A,B) local (II1,II2)
# 17 "p25.f"
CSGI$ startloop 9
# 17 "p25.f"
C$PAR pdo
# 17 "p25.f"
      do 3 II1=1,N
CSGI$ startloop 10
# 18 "p25.f"
        do 2 II2=1,N
          A(II2,II1) = II1 * II2
          B(II2,II1) = II1 + II2
    2   continue
# 21 "p25.f"
CSGI$ endloop 10
    3 continue
# 22 "p25.f"
C$PAR end pdo nowait
# 22 "p25.f"
CSGI$ endloop 9
# 22 "p25.f"
C$PAR end parallel
      call TEMPD (T1,T2,TD,6,.true.)
C$PAR parallel if (N .gt. 22) shared (N,C) local (II1,II2)
# 24 "p25.f"
CSGI$ startloop 2
# 24 "p25.f"
C$PAR pdo
# 24 "p25.f"
      do 5 II1=1,N
CSGI$ startloop 3
# 25 "p25.f"
        do 4 II2=1,N
          C(II2,II1) = 0.d0
    4   continue
# 27 "p25.f"
CSGI$ endloop 3
    5 continue
# 28 "p25.f"
C$PAR end pdo nowait
# 28 "p25.f"
CSGI$ endloop 2
# 28 "p25.f"
C$PAR end parallel
      call TEMPD (T2,T3,TD,6,.true.)
C$PAR parallel if (N * N * N .gt. 142) shared (N,C,A,B) local (II1,K,II2
C$PAR& )
# 30 "p25.f"
CSGI$ startloop 4
# 30 "p25.f"
C$PAR pdo
# 30 "p25.f"
      do 8 II1=1,N
CSGI$ startloop 5
# 31 "p25.f"
        do 7 K=1,N
CSGI$ startloop 11
# 32 "p25.f"
          do 6 II2=1,N
            C(II2,II1) = C(II2,II1) + A(II2,K) * B(K,II1)
    6     continue
# 34 "p25.f"
CSGI$ endloop 11
    7   continue
# 35 "p25.f"
CSGI$ endloop 5
    8 continue
# 36 "p25.f"
C$PAR end pdo nowait
# 36 "p25.f"
CSGI$ endloop 4
# 36 "p25.f"
C$PAR end parallel
# 30 "p25.f"
      J = max0 (N, 0) + 1
# 37 "p25.f"
      call TEMPD (T3,T4,TD,6,.true.)
      write (6, 202) N, 800, ((C(I,J), J=1,4), I=1,4)
C
  100 format(i4)
  200 format(1H ,8HP25 V1  )
  201 format(1H ,8HND, N   ,2i8)
  202 format(1H ,2i8,/,(1H ,4e16.8))
      end

APPENDIX B

C KAP/SGI 6.0.2 Jan2695 o5r2so3                       26-Jan-1999 10:57:07
# 1 "p25ujk.f"
# 1 "p25ujk.f"
      program P25UJK
C
C     MATRIX MULTIPLICATION TEST.
C     LOOP ORDER AS IN P25, BUT WITH PFA DIRECTIVES.
C     V1.0, 99/01/27.
C
      implicit real*8 (A-H,O-Z)
      parameter (ND = 800, NIN = 5, NOUT = 6)
      dimension A(800,800), B(800,800), C(800,800)
# 10 "p25ujk.f"
      integer II1, II2
C
      write (6, 200)
# 12 "p25ujk.f"
      read (5, 100) N
      write (6, 201) 800, N
      call TEMPD (T0,T1,TD,6,.true.)
C*$* noconcurrentize
CSGI$ startloop 2
# 16 "p25ujk.f"
      do 12 II1=1,N
CSGI$ startloop 3
# 17 "p25ujk.f"
        do 10 II2=1,N
          A(II2,II1) = II1 * II2
          B(II2,II1) = II1 + II2
   10   continue
# 20 "p25ujk.f"
CSGI$ endloop 3
   12 continue
# 21 "p25ujk.f"
CSGI$ endloop 2
      call TEMPD (T1,T2,TD,6,.true.)
C*$* noconcurrentize
CSGI$ startloop 4
# 24 "p25ujk.f"
      do 20 II1=1,N
CSGI$ startloop 5
# 25 "p25ujk.f"
        do 18 II2=1,N
          C(II2,II1) = 0.d0
   18   continue
# 27 "p25ujk.f"
CSGI$ endloop 5
   20 continue
# 28 "p25ujk.f"
CSGI$ endloop 4
      call TEMPD (T2,T3,TD,6,.true.)
C*$* concurrentize
C$PAR parallel if (N * N * N .gt. 142) shared (N,C,A,B) local (II1,K,II2
C$PAR& )
# 31 "p25ujk.f"
CSGI$ startloop 9
# 31 "p25ujk.f"
C$PAR pdo
# 31 "p25ujk.f"
      do 5 II1=1,N
CSGI$ startloop 10
# 32 "p25ujk.f"
        do 4 K=1,N
# 34 "p25ujk.f"
CSGI$ startloop 11
# 34 "p25ujk.f"
          do 2 II2=1,N-3,4
            C(II2,II1) = C(II2,II1) + A(II2,K) * B(K,II1)
# 35 "p25ujk.f"
            C(II2+1,II1) = C(II2+1,II1) + A(II2+1,K) * B(K,II1)
# 35 "p25ujk.f"
            C(II2+2,II1) = C(II2+2,II1) + A(II2+2,K) * B(K,II1)
# 35 "p25ujk.f"
            C(II2+3,II1) = C(II2+3,II1) + A(II2+3,K) * B(K,II1)
    2     continue
# 36 "p25ujk.f"
CSGI$ endloop 11
# 34 "p25ujk.f"
CSGI$ startloop 6
# 34 "p25ujk.f"
          do 3 II2=II2,N,1
            C(II2,II1) = C(II2,II1) + A(II2,K) * B(K,II1)
    3     continue
# 36 "p25ujk.f"
CSGI$ endloop 6
    4   continue
# 37 "p25ujk.f"
CSGI$ endloop 10
    5 continue
# 38 "p25ujk.f"
C$PAR end pdo nowait
# 38 "p25ujk.f"
CSGI$ endloop 9
# 38 "p25ujk.f"
C$PAR end parallel
# 31 "p25ujk.f"
      J = max0 (N, 0) + 1
# 39 "p25ujk.f"
      call TEMPD (T3,T4,TD,6,.true.)
      write (6, 202) N, 800, ((C(I,J), J=1,4), I=1,4)
C
  100 format(i4)
  200 format(1H ,12HP25UJK V1   )
  201 format(1H ,8HND, N   ,2i8)
  202 format(1H ,2i8,/,(1H ,4e16.8))
      end

APPENDIX C

p25mjk processed code.

C ***********************************************************
C Fortran file translated from WHIRL  Tue Jan 26 11:17:34 1999
C ***********************************************************

        PROGRAM MAIN
        IMPLICIT NONE
C
C       **** Variables and functions ****
C
        REAL*8 A(800_8, 800_8)
        REAL*8 B(800_8, 800_8)
        REAL*8 C(800_8, 800_8)
        INTEGER*4 N
        REAL*8 T0
        REAL*8 T1
        REAL*8 TD
        INTEGER*4 J
        INTEGER*4 I
        REAL*8 T2
        REAL*8 T3
        INTEGER*4 k
        REAL*8 T4
        EXTERNAL tempd
C
C       **** Temporary variables ****
C
        INTEGER*4 J1
        INTEGER*4 I0
        INTEGER*4 J0
        INTEGER*4 I1
        INTEGER*4 wd_J
        INTEGER*4 I2
        INTEGER*4 k2
        INTEGER*4 k1
        INTEGER*4 k0
        INTEGER*4 I3
        REAL*8 mi0
        REAL*8 mi1
        REAL*8 mi2
        REAL*8 mi3
        REAL*8 mi4
        REAL*8 mi5
        REAL*8 mi6
        REAL*8 mi7
        REAL*8 mi8
        REAL*8 mi9
        INTEGER*4 wd_I
        REAL*8 mi10
        REAL*8 mi11
        INTEGER*4 wd_$wd_I
        REAL*8 mi12
C
C       **** statements ****
C
        WRITE(6, '(1H ,12HP25MJK V1   )')
        READ(5, '(I4)') N
        WRITE(6, '(1H ,8HND, N   ,2I8)') 800, N
        CALL tempd(T0, T1, TD, 6, .TRUE.)
        DO J1 = 1, N, 1
          DO I0 = 1, N, 1
            A(I0, J1) = DBLE((J1 * I0))
            B(I0, J1) = DBLE((J1 + I0))
          END DO
          I = MAX((N + 1), 1)
        END DO
        CALL tempd(T1, T2, TD, 6, .TRUE.)
        DO J0 = 1, (N + -2), 3
          DO I1 = 1, N, 1
            C(I1, J0) = 0.0D00
            C(I1, J0 + 1) = 0.0D00
            C(I1, J0 + 2) = 0.0D00
          END DO
          I = MAX((N + 1), 1)
          I = MAX((N + 1), 1)
          I = MAX((N + 1), 1)
        END DO
        DO wd_J = J0, N, 1
          DO I2 = 1, N, 1
            C(I2, wd_J) = 0.0D00
          END DO
          I = MAX((N + 1), 1)
        END DO
        CALL tempd(T2, T3, TD, 6, .TRUE.)
C$DOACROSS lastlocal(J), local(k, k2, k1, k0, I, I3), local(mi0, mi1,
C$& mi2, mi3, mi4, mi5, mi6, mi7, mi8, mi9, wd_I, mi10, mi11, wd_$wd_I
C$& , mi12)
        DO J = 1, N, 1
          DO I3 = 1, (N + -9), 10
            mi0 = C(I3, J)
            mi1 = C(I3 + 1, J)
            mi2 = C(I3 + 9, J)
            mi3 = C(I3 + 2, J)
            mi4 = C(I3 + 8, J)
            mi5 = C(I3 + 3, J)
            mi6 = C(I3 + 7, J)
            mi7 = C(I3 + 4, J)
            mi8 = C(I3 + 6, J)
            mi9 = C(I3 + 5, J)
            DO k = 1, N, 1
              mi0 = (mi0 +(A(I3, k) * B(k, J)))
              mi1 = (mi1 +(A(I3 + 1, k) * B(k, J)))
              mi3 = (mi3 +(A(I3 + 2, k) * B(k, J)))
              mi5 = (mi5 +(A(I3 + 3, k) * B(k, J)))
              mi7 = (mi7 +(A(I3 + 4, k) * B(k, J)))
              mi9 = (mi9 +(A(I3 + 5, k) * B(k, J)))
              mi8 = (mi8 +(A(I3 + 6, k) * B(k, J)))
              mi6 = (mi6 +(A(I3 + 7, k) * B(k, J)))
              mi4 = (mi4 +(A(I3 + 8, k) * B(k, J)))
              mi2 = (mi2 +(A(I3 + 9, k) * B(k, J)))
            END DO
            C(I3 + 5, J) = mi9
            C(I3 + 6, J) = mi8
            C(I3 + 4, J) = mi7
            C(I3 + 7, J) = mi6
            C(I3 + 3, J) = mi5
            C(I3 + 8, J) = mi4
            C(I3 + 2, J) = mi3
            C(I3 + 9, J) = mi2
            C(I3 + 1, J) = mi1
            C(I3, J) = mi0
          END DO
          DO wd_I = I3, (N + -1), 2
            mi10 = C(wd_I, J)
            mi11 = C(wd_I + 1, J)
            DO k0 = 1, N, 1
              mi10 = (mi10 +(A(wd_I, k0) * B(k0, J)))
              mi11 = (mi11 +(A(wd_I + 1, k0) * B(k0, J)))
            END DO
            C(wd_I + 1, J) = mi11
            C(wd_I, J) = mi10
          END DO
          DO wd_$wd_I = wd_I, N, 1
            mi12 = C(wd_$wd_I, J)
            DO k1 = 1, N, 1
              mi12 = (mi12 +(A(wd_$wd_I, k1) * B(k1, J)))
            END DO
            C(wd_$wd_I, J) = mi12
          END DO
          DO k2 = 1, N, 1
            I = MAX((N + 1), 1)
          END DO
        END DO
        CALL tempd(T3, T4, TD, 6, .TRUE.)
        WRITE(6, '(1H ,2I8,/,(1H ,4E16.8))') N, 800,
     &    ((C(I, J), J = 1, 4, 1), I = 1, 4, 1)
        STOP
        END ! MAIN

APPENDIX D

diff of the prof output files without (<, p25) and with (>, p25v) the
DGEMM call:

2,3c2,3
< Prof run at: Wed Feb  3 15:32:12 1999
< Command line: prof -r8000 -pixie -gprof -op -archinfo -h -q 4 p25.e_f77_r8000_mips4_64_O3 -pixie
---
> Prof run at: Wed Feb  3 15:30:28 1999
> Command line: prof -r8000 -pixie -gprof -op -archinfo -h -q 4 p25v.e_f77_r8000_mips4_64_O3_lcomplib.sgimath -pixie
6,9c6,9
<   260190702: Total number of cycles
<    3.46921s: Total execution time
<  1035469993: Total number of instructions executed
<       0.251: Ratio of cycles / instruction
---
>   285309264: Total number of cycles
>    3.80412s: Total execution time
>   998949432: Total number of instructions executed
>       0.286: Ratio of cycles / instruction
17,18c17,18
<  1025668746: Floating point operations (295.649 Mflops @ 75 MHz)
<   106563413: Integer operations (30.7169 M intops @ 75 MHz)
---
>  1026031421: Floating point operations (269.716 Mflops @ 75 MHz)
>    28495088: Integer operations (7.49058 M intops @ 75 MHz)
21,29c21,43
<   207956083: Number of scheduling stalls
<     1445736: Number of stalls waiting for operands
<       10479: Number of nop's in branch delay slot
<   359070816: Number of load instructions executed
<  2872534400: Number of bytes loaded
<     2648313: Number of store instructions executed
<    20914843: Number of bytes stored
<    52031417: Number of conditional branch instructions executed
<    26046744: Number of conditional branches taken
---
>   262488610: Number of scheduling stalls
>   130403663: Number of stalls waiting for operands
>       13359: Number of nop's in branch delay slot
>   301026875: Number of load instructions executed
>  2408179114: Number of bytes loaded
>   131127408: Number of store instructions executed
>  1049002899: Number of bytes stored
>    22323303: Number of conditional branch instructions executed
>     5858711: Number of conditional branches taken
>      214812: Number of branch likely instructions executed
>       53503: Number of branch likely branches taken