BENCHMARK: bm_mm (matrix multiplication)

Single-threaded p25, p25v: updated 99/01/22.
Summary: see bench_bm_mm_all.txt

Purpose

This benchmark tests basic operations with different program sizes.
Number of floating-point operations used to calculate speed: 2 N**3.
Precision: 8-byte (REAL*8).

Programs

p25:        multiplication loops: the innermost loop is the product-column
            loop.
p25v:       multiplication loops replaced by a call to DGEMM
            (C: -lveclib, SGI PCL: -lcomplib.sgimath (much faster than
            -lblas)).
p25_matmul: multiplication loops replaced by a call to matmul.

Summary

The speed of p25 increased with the new compilers. The speed when using
"complib.sgimath" on the SGI Power Challenge (R8000) has stayed the same
since 1995.

Note that the best 1600x1600 result on the SGI Power Challenge is 34 seconds,
which corresponds to 80% of theoretical performance. The best result known to
me on a 400 MHz Pentium is 37 seconds, which means 221 MFLOPS and 55% of
Pentium 400 theoretical performance (g77 + ppro blas), i.e., a 4-year-old SGI
machine still beats a Pentium.

The outer loop unrolling speeds up the program by a factor of 3 on the Power
Challenge. This comes within a factor of 2 of the precompiled library.

The C-series vector processor is almost as efficient with an explicit
multiplication loop as with the Veclib (BLAS) routine DGEMM. The performance
of the SGI (a pipelined processor) is drastically improved by the DGEMM
routine (shipped with the machine).

Results

TABLE I. CPU times and MFLOPS performance for SGI Power Challenge L and
Convex C3860.
Notation:
  ND:     array dimension;
  N:      actual array size;
  Length: 3 ND**2 * 8 / 1MB (given in Mbytes);
  MF:     MFLOPS performance (2N**3 divided by CPU time);
  R:      MF divided by theoretical peak performance.
===============================================================================
    ND     N SGI (R8000, 75 MHz, 300 MFLOPS,    C (120 MFLOPS)
             4 MB secondary cache, 2 GB RAM;
(Length)     (f77 -r8000 -mips4 -64 -O3 -pfa    (fc -O2
              -LNO:ou=:cs1=16k:cs2=4m -TENV:X=4
              -WK,-so=3,-ro=3,-o=5
              -WK,-p=1,-chs=4096
              [ -lcomplib.sgimath]) *1           [ -lveclib])
             ---------------------------------  -------------------------------
                   p25              p25v             p25             p25v
             ---------------  ----------------  --------------  ---------------
               CPU   MF    R    CPU   MF    R     CPU  MF    R    CPU   MF    R
-------------------------------------------------------------------------------
   800   100  0.01             0.01              0.02            0.02
  (15)   200                   0.06              0.16            0.16
         400   0.5             0.51  250 0.83     1.4             1.2  104 0.87
         800   6.4  160 0.53    4.1  250 0.83    10.7  96  0.8    9.8  104 0.87
-------------------------------------------------------------------------------
  1600   800                    4.2  244 0.81                    10.1  101 0.84
  (61)  1600    71  115 0.38   33.8  241 0.80                    81.4  101 0.84
-------------------------------------------------------------------------------
  4000  1600                   38.0  215 0.72 *                  82.9   99 0.82
 (384)  4000                    607  210 0.70 *                  1276  100 0.84
===============================================================================
*  f90 (1996 version). f90 may be slightly faster than f77 (SGI).
*1 Optimal values of n: N = 800: n = 4, N = 1600: n = 8.

TABLE I-1. As Table I, but for the 90 MHz R8000.
===============================================================================
    ND     N SGI (R8000, 90 MHz, 360 MFLOPS)
(Length      (f77 -O3)
 (MB))
             ---------------------------------
                   p25              p25v
             ---------------  ----------------
               CPU   MF    R    CPU   MF    R
-------------------------------------------------------------------------------
   800   800                    3.4  301 0.84
  1600  1600                   28.8  284 0.79
===============================================================================

TABLE I-2. As Table I, but integer arithmetic. MI is an artificial measure
(million integer operations per second). This Table is only for illustration.
===============================================================================
    ND     N SGI (R8000, 75 MHz, 300 MFLOPS)
(Length      /usr/bin/f77 -r8000 -mips4 -64 -O3
 (MB))
             ---------------------------------
                   p2i
             ---------------
               CPU   MI
-------------------------------------------------------------------------------
   800   200   0.5   32
         400   4.3   30
         800    34   30
  1600   800    36   28
        1600   329   25
===============================================================================

TABLE II. As in Table I, but using C = matmul(A, B) (p25_matmul.f).
===============================================================================
    ND     N SGI (R8000, 75 MHz, 300 MFLOPS)
(Length      (f90 -O3)
 (MB))
             ---------------------------------
               p25_matmul
             ---------------
               CPU   MF    R
-------------------------------------------------------------------------------
   800   100     9    0
  (15)   200     9    2
         400     9   14
         800     9  113 0.38
  1600   800   150    7 0.02
  (61)  1600   150   55 0.18
===============================================================================

APPENDIX: Program sources
-------------------------------------------------------------------------------
      PROGRAM P25
C
C     EACH STEP TIMED SEPARATELY.
C     V1, 92/11/26.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (ND = 800, NIN = 5, NOUT = 6)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND)
C
      WRITE (NOUT,200)
      READ (NIN,100) N
      WRITE (NOUT,201) ND, N
      CALL TEMPD(T0, T1, TD, NOUT, .TRUE.)
      DO 12 J = 1,N
         DO 10 I = 1,N
            A(I,J) = (J * I)
            B(I,J) = (J + I)
   10    CONTINUE
   12 CONTINUE
      CALL TEMPD(T1, T2, TD, NOUT, .TRUE.)
      DO 24 J = 1,N
         DO 18 I = 1,N
            C(I,J) = 0.D0
   18    CONTINUE
   24 CONTINUE
      CALL TEMPD(T2, T3, TD, NOUT, .TRUE.)
      DO 26 J = 1,N
         DO 22 K = 1,N
            DO 20 I = 1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
   20       CONTINUE
   22    CONTINUE
   26 CONTINUE
      CALL TEMPD(T3, T4, TD, NOUT, .TRUE.)
      WRITE (NOUT,202) N, ND, ((C(I,J), J = 1,4), I = 1,4)
C
  100 FORMAT (I4)
  200 FORMAT (1H , 8HP25 V1  )
  201 FORMAT (1H , 8HND, N   , 2I8)
  202 FORMAT (1H , 2I8, /, (1H , 4E16.8))
      END
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
      PROGRAM P25V
      ...
   24 CONTINUE
      CALL TEMPD(T2, T3, TD, NOUT, .TRUE.)
      CALL DGEMM('N', 'N', N, N, N, 1.0D0,
     &           A, ND, B, ND, 0.D0, C, ND)
      CALL TEMPD(T3, T4, TD, NOUT, .TRUE.)
      ...
      END
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
      SUBROUTINE TEMPD(TOLD, TNEW, TDIF, NOUT, LPR)
      INTEGER NOUT
      REAL*8 TOLD, TNEW, TDIF
      LOGICAL LPR
C
C     SGI PCL TIMING ROUTINE, USING DTIME.
C     DTIME REPORTS ELAPSED EXECUTION TIME (USER, SYSTEM) SINCE THE
C     LAST CALL TO ITSELF; THEREFORE THIS PROGRAM WORKS OK ONLY IF THERE
C     ARE NO OTHER CALLS TO DTIME EXCEPT FROM THIS PROGRAM.
C     (DTIME MAY NOT MEASURE THE CPU TIME ITSELF.)
C
C     INPUT: TOLD, NOUT, LPR (ALL UNCHANGED).
C     OUTPUT: TNEW, TDIF.
C
      LOGICAL INIT, TEST
      REAL*4 DTIME, TARRAY
      DIMENSION TARRAY(2)
      REAL*8 TSUM, SDIF
      DATA INIT / .TRUE. /, TEST / .FALSE. /
C
      IF (INIT) THEN
         WRITE (NOUT,2000)
         TSUM = DTIME(TARRAY)
         TDIF = TARRAY(1)
         TNEW = TOLD + TDIF
         SDIF = TARRAY(2)
         INIT = .FALSE.
         IF (TEST) WRITE (NOUT,2200) TSUM, TNEW, TDIF, TOLD, SDIF
         RETURN
      ENDIF
      TSUM = DTIME(TARRAY)
      TDIF = TARRAY(1)
      TNEW = TOLD + TDIF
      SDIF = TARRAY(2)
      IF (TEST) WRITE (NOUT,2200) TSUM, TNEW, TDIF, TOLD, SDIF
      IF (LPR) WRITE (NOUT,2100) TNEW, TDIF, TOLD, SDIF
      RETURN
C
 2000 FORMAT (' -TEMPD- 3.0, 95/12/20. SGI PC TIMING ROUTINE.')
 2100 FORMAT (' TIME', F12.2,
     A        ' DIF', F12.2,
     B        ' REF', F12.2,
     C        ' DIF SYS', F12.2)
 2200 FORMAT (' SUM USER DIF OLD SYS', 5F8.2)
      END
-------------------------------------------------------------------------------
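
The outer loop unrolling mentioned in the Summary can also be written by hand.
The program below is a minimal sketch and not one of the benchmark programs:
the name P25U, the unrolling depth of 4, and the small sizes ND = 50, N = 47
are assumptions chosen for illustration only. It unrolls the J loop of p25 so
that each A(I,K) is used for four columns of C in one pass, handles the
leftover columns separately, and compares the result with the plain p25 loops.

```fortran
      PROGRAM P25U
C
C     HYPOTHETICAL SKETCH, NOT ONE OF THE BENCHMARK PROGRAMS.
C     THE P25 MULTIPLICATION LOOPS WITH THE OUTER (J) LOOP UNROLLED
C     BY HAND TO DEPTH 4, CHECKED AGAINST THE PLAIN P25 LOOPS.
C     THE DEPTH AND THE SIZES ND, N ARE ASSUMPTIONS.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (ND = 50, N = 47, NOUT = 6)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND), D(ND,ND)
C
C     SAME TEST DATA AS P25; C AND D START AT ZERO.
C
      DO 12 J = 1,N
         DO 10 I = 1,N
            A(I,J) = (J * I)
            B(I,J) = (J + I)
            C(I,J) = 0.D0
            D(I,J) = 0.D0
   10    CONTINUE
   12 CONTINUE
C
C     REFERENCE RESULT D: THE PLAIN P25 LOOPS.
C
      DO 26 J = 1,N
         DO 22 K = 1,N
            DO 20 I = 1,N
               D(I,J) = D(I,J) + A(I,K) * B(K,J)
   20       CONTINUE
   22    CONTINUE
   26 CONTINUE
C
C     UNROLLED RESULT C: FOUR COLUMNS OF C PER PASS OVER A(.,K).
C     NU IS THE LARGEST MULTIPLE OF 4 NOT EXCEEDING N.
C
      NU = N - MOD(N, 4)
      DO 36 J = 1,NU,4
         DO 32 K = 1,N
            DO 30 I = 1,N
               C(I,J)   = C(I,J)   + A(I,K) * B(K,J)
               C(I,J+1) = C(I,J+1) + A(I,K) * B(K,J+1)
               C(I,J+2) = C(I,J+2) + A(I,K) * B(K,J+2)
               C(I,J+3) = C(I,J+3) + A(I,K) * B(K,J+3)
   30       CONTINUE
   32    CONTINUE
   36 CONTINUE
C
C     REMAINDER COLUMNS NU+1..N (EMPTY IF N IS A MULTIPLE OF 4).
C
      DO 46 J = NU+1,N
         DO 42 K = 1,N
            DO 40 I = 1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
   40       CONTINUE
   42    CONTINUE
   46 CONTINUE
C
C     LARGEST ABSOLUTE DIFFERENCE BETWEEN THE TWO RESULTS.
C
      ERR = 0.D0
      DO 52 J = 1,N
         DO 50 I = 1,N
            ERR = MAX(ERR, ABS(C(I,J) - D(I,J)))
   50    CONTINUE
   52 CONTINUE
      WRITE (NOUT,300) N, ERR
  300 FORMAT (' P25U N, MAX DIFF', I6, E12.4)
      END
```

Because all matrix entries here are small integers, every sum is exact in
REAL*8 and the printed maximum difference should be zero. Whether hand
unrolling helps, and at which depth, depends on the machine and compiler and
can only be settled by timing, for example with TEMPD as in p25.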