SUN F77 COMPILER BASIC TEST

Updated 99/06/18

OBJECTIVE

To measure the speed of a simple vector/parallel REAL*8 program using
basic compiler optimization options.

DESCRIPTION

Machine: dune.ijs.si (Sun Enterprise 4500, 8 x 336 MHz, 8 GB RAM).
         SunOS dune 5.7 Generic sun4u sparc SUNW,Ultra-Enterprise
Program: Matrix multiplication using nested DO loops.
         Matrix dimensions: ND (reservation).
Timing:  dtime (tempd.f)

Program details:

P25: I-loop inside (vector style), no directives, N is read:

      PROGRAM P25
      ...
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (ND = 800, NIN = 5, NOUT = 6)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND)
      ...
      DO 26 J = 1,N
         DO 24 K = 1,N
            DO 22 I = 1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
   22       CONTINUE
   24    CONTINUE
   26 CONTINUE
      ...

P25V: calls DGEMM

RESULTS

There are two tests. Section 1 was kindly supplied by Ruud van der Pas
from the Sun European HPC Team on 99/06/08. Section 2 was measured on a
local machine; its results may not be conclusive: the hand-coded results
are probably as good as possible, but the DGEMM version may have linked
the incorrect library because the -xarch option was absent.

1. E6500 400MHz UltraSPARC-II, L2 cache: 8 MByte, compiler version 5.0.
-- --------------------------------------------------------------------

Versions of the program (R. van der Pas):

P25          Compiled Fortran
P25-perflib  Performance library version
P25-ruud     Uses my version of DGEMM - only suitable for C=A*B, but
             otherwise usable for arbitrary matrices, also non-square

All Fortran 77 coded, i.e. no assembly.

Table 1.I. ND = N = 800, 1600. Operation count is 2*N**3 = 1*10**9 and
8*10**9, respectively; therefore the lower bound on CPU time on a single
800 MFLOPS processor is 1.3 and 10.2 seconds, respectively.
(R. van der Pas.)
Compile/link options: "-fast -fsimple=2 -xtarget=ultra2 -xarch=v8plusa".
------------------------------------------------------------------------
Version         ND     N Time(s)  Mflop/s        E4500@336MHz  Speed-up
------------------------------------------------------------------------
P25            800   800     6.7  153 (19%)       79 (12%)     1.94
P25-perflib    800   800     1.7  602 (75%)      260 (38%)     2.32
                                  690 (86%) *1)
P25-ruud       800   800     1.9  539 (67%)      n.a.
P25           1600  1600   203.3   40 ( 5%)       41 ( 6%)     0.98
P25-perflib   1600  1600    13.7  598 (75%)      228 (34%)     2.62
                                  710 (89%) *1)
P25-ruud      1600  1600    19.8  414 (52%)      n.a.
------------------------------------------------------------------------
*1) Tuned for 8 MB cache, supplied by R. van der Pas 99/06/16.

Comments by R. van der Pas:

1. The performance library version runs much faster
2. The compiled version for the 800x800 problem is about two times faster
3. The compiled version for the 1600x1600 runs at the same speed ....
4. The all Fortran rewritten version does okay
5. We reach about 75% of peak (800 Mflop/s) using the performance library

AD 1 and 5. I think the huge increase in performance (400/336=1.19 ...)
is because of the -xarch=v8plusa link option to get the tuned version
linked in and of course because of the larger cache. We don't get the
desired 80% of peak, but with 75% it is quite close.

AD 4. I would expect the new release of the compiler to get close to
these numbers as well.

AD 3. This may come as a surprise, but actually it isn't. Currently the
compiler does not interchange the loops and so the inner loop performs
two loads and a store. All floating point can be hidden under these
memory operations. However, it means we're looking at the speed of the
memory system and not the speed of the CPU. As the memory system on both
systems runs at the same speed, performance is equal. On the 800x800
system, we need ~15MB of data and the 8MB cache helps. The 1600x1600
system requires about 60MB of data and we lose the advantage of the
larger cache.
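
The remark under AD 3 about "two loads and a store" in the inner loop can
be illustrated by a hand-interchanged variant of the P25 loop nest. This is
only a sketch under assumptions: the subroutine name MMKIN and the scalar S
are hypothetical, and the source does not say which loop order the compiler
or the P25-ruud version would use. With K innermost the inner loop loads
A(I,K) and B(K,J) and stores nothing, but A is then read with stride ND, so
no speed claim is made here.

```fortran
      SUBROUTINE MMKIN(A, B, C, ND, N)
C     Hypothetical K-innermost variant of the P25 loop nest.
C     Computes C = C + A*B, the same update as P25.
      IMPLICIT REAL*8 (A-H,O-Z)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND)
      DO 26 J = 1,N
         DO 24 I = 1,N
C           Accumulate in a scalar: the inner loop has two loads
C           (A(I,K), B(K,J)) and no store.
            S = C(I,J)
            DO 22 K = 1,N
               S = S + A(I,K) * B(K,J)
   22       CONTINUE
C           One store per (I,J) pair, outside the inner loop.
            C(I,J) = S
   24    CONTINUE
   26 CONTINUE
      RETURN
      END
```

In the original P25 order (I innermost) every iteration loads C(I,J) and
A(I,K) and stores C(I,J), which matches the memory-bound behaviour
described above.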
As the numbers for the performance library version and the rewritten
version demonstrate, one can do better than this!

2. E4500, 336 MHz, compiler version 5.0, SunOS 5.7.
-- ------------------------------------------------

f77 -V:
f77: WorkShop Compilers 5.0 98/12/15 FORTRAN 77 5.0.

Table 2.I. ND = 800, N = 800. Operation count is 2*N**3 = 1*10**9,
therefore the lower bound on CPU time on a single 672 MFLOPS processor
(at 336 MHz) is 1.5 seconds. (R. Krivec).
--------------------------------------------------------------------------
Program  Compiler call                              Threads  CPU time (s)
--------------------------------------------------------------------------
p25      f77 -fast                                             12.8
             -fast -O5                                         13.0
             -fast -O5 -xarch=v9                               13.5
             -fast -O5 -xarch=v9a                              13.5
         f77 -fast -O5 -xtarget=ultra2
             -xcache=16/32/1:4096/64/1 -xarch=v9               13.4 *1
         f77 -fast -O5 -unroll=4 -xtarget=ultra2
             -xcache=16/32/1:4096/64/1 -xarch=v9               13.5 *1
p25v     f77 -fast -xlic_lib=sunperf                            4.1 *2
--------------------------------------------------------------------------
*1 "-xtarget=ultra2 -xcache=16/32/1:4096/64/1" advised by "fpversion".
*2 The DGEMM version may have linked the incorrect library because of the
   absence of the -xarch option on the command line.
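
The arithmetic behind the lower bound and the Mflop/s figures can be
checked with a few lines of Fortran. The sketch below is illustrative only:
the program name RATE and all variable names are hypothetical, while N, the
672 MFLOPS peak and the two CPU times are copied from Table 2.I. The rates
it prints are derived from those times, not separately measured.

```fortran
      PROGRAM RATE
C     Illustrative only: reproduces the arithmetic of Table 2.I.
C     All names here are hypothetical, not part of P25.
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (N = 800, NOUT = 6)
C     Peak rate in MFLOPS (672 at 336 MHz) and CPU times in seconds
C     (p25 with -fast, p25v with -xlic_lib=sunperf).
      PEAK = 672.0D0
      TP25 = 12.8D0
      TP25V = 4.1D0
C     Operation count 2*N**3, formed in REAL*8 so that larger N
C     (e.g. 1600) does not overflow a default INTEGER.
      FLOP = 2.0D0 * DBLE(N)**3
C     Lower bound on CPU time at peak rate, in seconds.
      TMIN = FLOP / (PEAK * 1.0D6)
      WRITE (NOUT,100) 'lower bound (s) ', TMIN
      WRITE (NOUT,100) 'p25  Mflop/s    ', FLOP / (TP25 * 1.0D6)
      WRITE (NOUT,100) 'p25v Mflop/s    ', FLOP / (TP25V * 1.0D6)
  100 FORMAT (1X, A, F8.2)
      STOP
      END
```

With the values above the lower bound comes out at about 1.5 seconds, and
the two rates at roughly 80 and 250 Mflop/s, in line with the E4500 column
of Table 1.I.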