This is an optimization report (work in progress). OBJECTIVE To optimize the program EWF2N (see the "bm_a" benchmark) to run fast on SGI PC L. Starting point: RRPSM 1.0. COMPILER OPTIONS Convex: fc -O2. SGI: f77 -O3. No additional optimization parameters used; this may explain the speedup when using precompiled libraries on SGI (more tests pending). SUMMARY The optimization is done for two cases (physical applications) in which the program behaves differently. Profiling is not done; however, the routine RRPSM uses up at least 2/3 of the total time. Case A: IPXU = 0. In this case the index IPX assumes only two values and the IPX loops do not affect the performance. Most of the time is spent doing pure linear algebra. Case B: IPXU = 100. In this case the positions of the IPX loops severely affect performance. In the case A (Table 1-I), the speedup on SGI is 2 compared to C-3860 (theoretical MFLOPS ratio: 2.5). Because I only used "f77 -O3", and no other optimization options in addition to the defaults, the speedup on SGI comes from the use of -lblas (or -lcomplib.sgimath). In the case B (Tables 2-I, 3-I), the use of BLAS and LINPACK makes little difference on both SGI and Convex. However, the best routine on C-3860 (3.0VS) has quite different loop topology from the best one on SGI (3.1.1V1). (In 3.1.1V1, the IPX loop has been moved to the outer level. Then the WD(.,.,IPX), WS(.,.,IPX) memory references are no longer being repeated for each iteration of the M1 loop.) The speedup in this case is 1.15. Also, the variations in speed among different versions are very large on both SGI and Convex. This last example clearly shows the pitfalls of speed measurements: version 3.1.1V1S itself is 2.7 times faster on SGI than on Convex, but the overall speed gain across all versions is only 1.15.) NOTE. A "V" in the RRPSM version number means a call to DGEMM (Veclib). "V1" means calls to DGEMM and DAXPY. In addition to these calls, there are calls to BLAS or LINPACK from other routines in the program; these are either supplied as source files, or called from precompiled libraries; this is explained in the Table captions. Table 1-I. CPU times for ewf2n (total running time). rrpsm uses most of the CPU time, but the exact percentage was not measured (in progress). C-3860 (marvin): Only the RRPSM calls of DGEMM, etc. are resolved by -lveclib, others by Linpack source files. SGI PC L (saturn): Resolving blas, veclib calls: x: -lblas used for DGEMM in rrpsm, other Linpack routines as sources. cs: all Linpack routines from -lcomplib.sgimath. b: Linpack routines from -lblas -complib.sgimath, in this order. The critical part is the calculation of CNT(,,) (region of rrpsm with the DGEMM call, see Appendix). MM = 49, IPX = 1,2. -------------------------------------------------------------------------- Version CPU time (SGI) CPU time (Convex) x cs b -------------------------------------------------------------------------- 1.0 27.9 1.1 32.2 1.1S 32.0 3.0S 31.9 3.0VS 12.4 12.5 25.0 3.0V1S 12.5 25.5 3.1V1S 12.5 12.6 26.4 3.1.1V1S 12.5 12.6 26.1 4.0S 101.6 4.1S 98.3 4.1PV1S 28.1 4.1.1V1S 28.7 29.0 4.2S 95.5 -------------------------------------------------------------------------- Table 2-I. The critical parts are the calculation of CNT(,,) and PP(,,), QQ(,,) (see Appendix). Notation as in Table 1-I. MM = 66, IPX = 1,... 102. This is the case with the unvectorizable summation. -------------------------------------------------------------------------- Version CPU time (SGI) CPU time (Convex) x cs b -------------------------------------------------------------------------- 1.0 470 293 1.1 471 296 1.1S 3.0S 467 291 3.0VS 358 356 278 3.0V1S 373 694 3.1V1S 333 421 643 3.1.1V1S 241 362 643 3.1.2V1S 248 4.1.1V1S 313 -------------------------------------------------------------------------- Remark. 3.1V1S and 3.1.1V1S differ only in the exchange of M1 and IPX loops for the PP, QQ calculation. This speeds up the program by a factor of 1.4 on the SGI, which means rrpsm is sped up even more. Table 3-I. As Table 2-I, but MM = 231. Notation as in Table 1-I. -------------------------------------------------------------------------- Version CPU time (saturn) CPU time (marvin) x cs b -------------------------------------------------------------------------- 1.0 7134 3.0VS 7365 7369 7550 3.0V1S 3.1V1S 7286 8174 3.1.1V1S 6888 8062 -------------------------------------------------------------------------- APPENDIX rrpsm version 3.0V (comments deleted) ------------------ SUBROUTINE RRPSM(NCDS, NCDP, NCDCN, NDBIN1, NDBIN2, 1 WS, WD, MM, MU, IPXU, IPOFF, Z0M, JMAX, AA, 2 RHOUW, PP, QQ, RR, BIN, 3 LTEST, NOUT, CNT) IMPLICIT REAL*8 (A - H, O - Z) LOGICAL LTEST DIMENSION WS(NCDS,NCDS,NCDP), WD(NCDS,NCDS,NCDP), 1 MU(NCDS), 2 PP(NCDS,NCDS,NCDCN), QQ(NCDS,NCDS,NCDCN), 3 RR(NCDS,NCDS), BIN(NDBIN1,NDBIN2), 4 CNT(NCDS,NCDS,NCDCN) CHARACTER*6 NAME DATA NAME / 'RRPSM ' / CHARACTER*60 VERS DATA VERS */ ' 3.0VS, 95/12/29. 3.0S, 95/12/28. 3.0, 95/10/08 -M_1.1 ' / PARAMETER (ZERO = 0.D0, ONE = 1.D0, TWO = 2.D0) LOGICAL LCALL0 DATA LCALL0 / .TRUE. / IF (LCALL0) WRITE (NOUT,4000) NAME, VERS IF (MM .GT. NCDS) CALL ERR2(NAME, 1, 1, NOUT) IF (IPXU .GT. NCDP) CALL ERR2(NAME, 2, 1, NOUT) IF (JMAX+1 .GT. NCDCN) CALL ERR2(NAME, 3, 1, NOUT) IF (ABS(AA) .LT. CDXTOL()) CALL ERR2(NAME, 4, 1, NOUT) IF (LCALL0 .AND. IPOFF .GT. 2) CALL ERR2(NAME, 11, 0, NOUT) IF (LCALL0 .AND. JMAX .LT. 2) CALL ERR2(NAME, 21, 0, NOUT) IF (LCALL0) LCALL0 = .FALSE. IF (JMAX .LT. 2) RETURN IF (RHOUW .GE. ZERO) ZUW = TWO * AA * RHOUW IF (RHOUW .LT. ZERO) ZUW = -RHOUW JMAX1 = JMAX + 1 RHO0 = Z0M / (TWO * AA) DO 68 M1 = 1,JMAX JR = M1 - 2 Z0MIR2 = ONE / (Z0M ** (JR + 2)) ZZZA = BIN(-1+3,JR+1+1) * Z0MIR2 DO 22 I = 1,MM DO 20 J = 1,MM PP(J,I,M1) = ZERO 20 CONTINUE 22 CONTINUE DO 24 I = 1,MM PP(I,I,M1) = ZZZA 24 CONTINUE IF (Z0M .LE. ZUW) THEN IPXL = JR + 1 + IPOFF IF (IPXL .LE. IPXU) THEN ZZZB0 = Z0MIR2 * RHO0 ** (IPXL - IPOFF + 1) DO 34 IPX = IPXL,IPXU IP = IPX - IPOFF ZZZB = - BIN(IP+3,JR+1+1) * ZZZB0 DO 32 I = 1,MM DO 30 J = 1,MM PP(J,I,M1) = PP(J,I,M1) 1 + ZZZB * WD(J,I,IPX) 30 CONTINUE 32 CONTINUE ZZZB0 = ZZZB0 * RHO0 34 CONTINUE ENDIF ENDIF DO 42 I = 1,MM DO 40 J = 1,MM QQ(J,I,M1) = ZERO 40 CONTINUE 42 CONTINUE IF (JR .LT. 0) GO TO 66 IF (JR .EQ. 0) THEN ZZZC = -0.25D0 DO 50 I = 1,MM QQ(I,I,M1) = ZZZC 50 CONTINUE ENDIF IF (JR .GE. 0) THEN ZZZD = -BIN(-2+3,JR+1) * Z0MIR2 DO 52 I = 1,MM QQ(I,I,M1) = QQ(I,I,M1) + ZZZD * MU(I) ** 2 52 CONTINUE ENDIF IF (Z0M .GT. ZUW) GO TO 66 IPXL = -1 + IPOFF IF (IPXL .GT. IPXU) CALL ERR2(NAME, 31, 1, NOUT) ZZZE0 = Z0MIR2 * RHO0 ** (IPXL - IPOFF + 2) DO 65 IPX = IPXL,IPXU IP = IPX - IPOFF IF (IP .GE. 0 .AND. JR .GT. IP) GO TO 64 ZZZE = - BIN(IP+3,JR+1) * ZZZE0 DO 62 I = 1,MM DO 60 J = 1,MM QQ(J,I,M1) = QQ(J,I,M1) & + ZZZE * WS(J,I,IPX) 60 CONTINUE 62 CONTINUE 64 CONTINUE ZZZE0 = ZZZE0 * RHO0 65 CONTINUE 66 CONTINUE 68 CONTINUE DO 124 N1 = 3,JMAX1 DO 72 J = 1,MM DO 70 I = 1,MM CNT(I,J,N1) = ZERO 70 CONTINUE 72 CONTINUE JP = N1 - 1 JP1 = JP - 1 DO 90 M1 = 1,JP1 JR = M1 - 2 ZZZ72 = JP - JR - 2 DO 82 K = 1,MM DO 80 J = 1,MM RR(J,K) = ZZZ72 * PP(J,K,M1) + QQ(J,K,M1) 80 CONTINUE 82 CONTINUE CALL DGEMM('N', 'T', MM, MM, MM, & ONE, CNT(1,1,N1-JR-2), NCDS, & RR, NCDS, & ONE, CNT(1,1,N1), NCDS) 90 CONTINUE M1 = JP DO 108 J = 1,MM DO 106 I = 1,MM CNT(I,J,N1) = CNT(I,J,N1) + QQ(J,I,M1) 106 CONTINUE 108 CONTINUE ZZZ71 = - ONE / DBLE(JP * (JP - 1)) DO 122 J = 1,MM DO 120 I = 1,MM CNT(I,J,N1) = ZZZ71 * CNT(I,J,N1) 120 CONTINUE 122 CONTINUE 124 CONTINUE IF (.NOT.LTEST) RETURN WRITE (NOUT,4100) NAME, Z0M DO 132 K = 1,JMAX1 WRITE (NOUT,4110) K DO 130 I = 1,MM WRITE (NOUT,4120) (CNT(I,J,K), J = 1,MM) 130 CONTINUE 132 CONTINUE RETURN 4000 FORMAT ( A 2H -, A6, 1H-, A60) 4100 FORMAT (//, 1H , 80(1H-), /, A 1H , 1H-, 2A4, 1H-, 4HZ0M , F12.6) 4110 FORMAT (/, A 1H , 4H K , I4) 4120 FORMAT ( A (1H , 8X, 4F16.8)) END rrpsm version 3.1.1V1 (comments deleted) --------------------- SUBROUTINE RRPSM(NCDS, NCDP, NCDCN, NDBIN1, NDBIN2, 1 WS, WD, MM, MU, IPXU, IPOFF, Z0M, JMAX, AA, 2 RHOUW, PP, QQ, RR, BIN, 3 LTEST, NOUT, CNT) IMPLICIT REAL*8 (A - H, O - Z) LOGICAL LTEST DIMENSION WS(NCDS,NCDS,NCDP), WD(NCDS,NCDS,NCDP), 1 MU(NCDS), 2 PP(NCDS,NCDS,NCDCN), QQ(NCDS,NCDS,NCDCN), 3 RR(NCDS,NCDS), BIN(NDBIN1,NDBIN2), 4 CNT(NCDS,NCDS,NCDCN) CHARACTER*6 NAME DATA NAME / 'RRPSM ' / CHARACTER*60 VERS DATA VERS */ ' 3.1.1V1, 96/01/12. -3.1V1S. -3.0V1S. -3.0VS -3.0S -M 1.1 ' / PARAMETER (ZERO = 0.D0, ONE = 1.D0, TWO = 2.D0) LOGICAL LCALL0 DATA LCALL0 / .TRUE. / IF (LCALL0) WRITE (NOUT,4000) NAME, VERS IF (MM .GT. NCDS) CALL ERR2(NAME, 1, 1, NOUT) IF (IPXU .GT. NCDP) CALL ERR2(NAME, 2, 1, NOUT) IF (JMAX+1 .GT. NCDCN) CALL ERR2(NAME, 3, 1, NOUT) IF (ABS(AA) .LT. CDXTOL()) CALL ERR2(NAME, 4, 1, NOUT) IF (LCALL0 .AND. IPOFF .GT. 2) CALL ERR2(NAME, 11, 0, NOUT) IF (JMAX .LT. 2) THEN IF (LCALL0) THEN CALL ERR2(NAME, 21, 0, NOUT) LCALL0 = .FALSE. RETURN ENDIF ENDIF IF (RHOUW .GE. ZERO) ZUW = TWO * AA * RHOUW IF (RHOUW .LT. ZERO) ZUW = -RHOUW JMAX1 = JMAX + 1 RHO0 = Z0M / (TWO * AA) Z0MI = ONE / Z0M DO 14 M1 = 1,JMAX DO 12 I = 1,MM DO 10 J = 1,MM PP(J,I,M1) = ZERO 10 CONTINUE 12 CONTINUE 14 CONTINUE DO 18 M1 = 1,JMAX JR = M1 - 2 Z0MIR2 = ONE / (Z0M ** (JR + 2)) ZZZA = BIN(-1+3,JR+1+1) * Z0MIR2 DO 16 I = 1,MM PP(I,I,M1) = ZZZA 16 CONTINUE 18 CONTINUE IF (Z0M .LE. ZUW) THEN IPXL = - 1 + 1 + IPOFF IF (IPXL .LE. IPXU) THEN ZZZB0 = RHO0 ** (IPXL - IPOFF + 1) DO 26 IPX = IPXL,IPXU IP = IPX - IPOFF ZZZB1 = ONE / (Z0M ** (- 1 + 2)) * ZZZB0 M1U = MIN(JMAX, IP - 1 + 2) DO 24 M1 = 1,M1U JR = M1 - 2 ZZZB = - BIN(IP+3,JR+1+1) * ZZZB1 DO 22 I = 1,MM CALL DAXPY(MM, ZZZB, WD(1,I,IPX), 1, & PP(1,I,M1), 1) 22 CONTINUE ZZZB1 = ZZZB1 * Z0MI 24 CONTINUE ZZZB0 = ZZZB0 * RHO0 26 CONTINUE ENDIF ENDIF DO 34 M1 = 1,JMAX DO 32 I = 1,MM DO 30 J = 1,MM QQ(J,I,M1) = ZERO 30 CONTINUE 32 CONTINUE 34 CONTINUE ZZZC = -0.25D0 DO 36 I = 1,MM QQ(I,I,2) = ZZZC 36 CONTINUE DO 42 M1 = 2,JMAX JR = M1 - 2 Z0MIR2 = ONE / (Z0M ** (JR + 2)) ZZZD = -BIN(-2+3,JR+1) * Z0MIR2 DO 40 I = 1,MM QQ(I,I,M1) = QQ(I,I,M1) + ZZZD * MU(I) ** 2 40 CONTINUE 42 CONTINUE IF (Z0M .LE. ZUW) THEN DO 46 M1 = 2,JMAX JR = M1 - 2 Z0MIR2 = ONE / (Z0M ** (JR + 2)) IPXL = - 1 + IPOFF IF (IPXL .GT. IPXU) CALL ERR2(NAME, 31, 1, NOUT) ZZZE0 = Z0MIR2 * RHO0 ** (IPXL - IPOFF + 2) IPX = IPXL IP = IPX - IPOFF ZZZE = - BIN(IP+3,JR+1) * ZZZE0 DO 44 I = 1,MM CALL DAXPY(MM, ZZZE, WS(1,I,IPX), 1, & QQ(1,I,M1), 1) 44 CONTINUE 46 CONTINUE IPXL = 0 + IPOFF IF (IPXL .GT. IPXU) CALL ERR2(NAME, 32, 1, NOUT) ZZZE0 = RHO0 ** (IPXL - IPOFF + 2) DO 54 IPX = IPXL,IPXU IP = IPX - IPOFF ZZZE1 = ONE / (Z0M ** (0 + 2)) * ZZZE0 M1U = MIN(JMAX, IP + 2) DO 52 M1 = 2,M1U JR = M1 - 2 ZZZE = - BIN(IP+3,JR+1) * ZZZE1 DO 50 I = 1,MM CALL DAXPY(MM, ZZZE, WS(1,I,IPX), 1, & QQ(1,I,M1), 1) 50 CONTINUE ZZZE1 = ZZZE1 * Z0MI 52 CONTINUE ZZZE0 = ZZZE0 * RHO0 54 CONTINUE ENDIF DO 100 N1 = 3,JMAX1 DO 62 J = 1,MM DO 60 I = 1,MM CNT(I,J,N1) = ZERO 60 CONTINUE 62 CONTINUE JP = N1 - 1 JP1 = JP - 1 DO 74 M1 = 1,JP1 JR = M1 - 2 ZZZ72 = JP - JR - 2 DO 72 K = 1,MM DO 70 J = 1,MM RR(J,K) = ZZZ72 * PP(J,K,M1) + QQ(J,K,M1) 70 CONTINUE 72 CONTINUE CALL DGEMM('N', 'T', MM, MM, MM, & ONE, CNT(1,1,N1-JR-2), NCDS, & RR, NCDS, & ONE, CNT(1,1,N1), NCDS) 74 CONTINUE M1 = JP DO 82 J = 1,MM DO 80 I = 1,MM CNT(I,J,N1) = CNT(I,J,N1) + QQ(J,I,M1) 80 CONTINUE 82 CONTINUE ZZZ71 = - ONE / DBLE(JP * (JP - 1)) DO 92 J = 1,MM DO 90 I = 1,MM CNT(I,J,N1) = ZZZ71 * CNT(I,J,N1) 90 CONTINUE 92 CONTINUE 100 CONTINUE IF (.NOT. LTEST) THEN IF (LCALL0) LCALL0 = .FALSE. RETURN ENDIF WRITE (NOUT,4100) NAME, Z0M DO 112 K = 1,JMAX1 WRITE (NOUT,4110) K DO 110 I = 1,MM WRITE (NOUT,4120) (CNT(I,J,K), J = 1,MM) 110 CONTINUE 112 CONTINUE IF (LCALL0) LCALL0 = .FALSE. RETURN 4000 FORMAT ( A 2H -, A6, 1H-, A60) 4100 FORMAT (//, 1H , 80(1H-), /, A 1H , 1H-, 2A4, 1H-, 4HZ0M , F12.6) 4110 FORMAT (/, A 1H , 4H K , I4) 4120 FORMAT ( A (1H , 8X, 4F16.8)) END