Trying to understand memory throughput numbers from architecture

R. Krivec 01/05/08

Abstract. An attempt to explain the STREAM numbers from the network part of
the machines (buses and crossbars) is found to fail; instead, the memory
throughput benchmark is found to be proportional to processor MHz (!). This
could be a coincidence, because the role of cache coherency is unclear.

The architectural calculations attempt to reproduce the Veberic numbers (DV).
These are correlated with 1/2 of the STREAM COPY values. (Veberic numbers are
approximate values read from graphs
(http://www-f1.ijs.si/~krivec/bench/dv_mem_a.ps) where memory chunks are
larger than the L2 cache.) STREAM COPY results are for 1 processor or, if
this datum is not available, approximately extrapolated to 1 processor.

DV and STREAM values are in MBps.

----------------------------------------------------------------------------------------------------------------
MACHINE                 DV      STREAM          STREAM-1 P./NODE  Attempted explanation (no contention)
                        (1 p.)  No. p.  Value   No. p.  Value
----------------------------------------------------------------------------------------------------------------
GS160/731                       16      9796                      aggregate BW = 6.4 GBps
                                4       2455                      (bidirectional?, 4 processors);
                        400     1       970                       1 processor: 6.4 GBps/4/2 = 800 MBps
                                                                  unidirectional; cache coherency --> 400 MBps

ES40/833                        4       2504                      aggregate BW = 5.2 GBps
                                2       1762                      (bidirectional?, 4 processors);
                        600     1       1342                      1 processor: 5.2 GBps/4/2 = 650 MBps
                                                                  unidirectional; (NO cache coherency overhead?)

O200/225                        4       632                       2 processors share a 0.720 GBps
                                2       330                       (peak) bus; crossbar;
                        150     1       303                       1 processor: 0.72 GBps/1/2 = 360 MBps
                                                                  (total bus, unidirectional);
                                                                  cache coherency --> 180 MBps (?)

O2000/250                       16      2910    16      5560      2 processors share a 0.780 GBps
                                8       1430    8       2570      (peak) bus; crossbar;
                                4       747     4       1280      1 processor: 0.78 GBps/1/2 = 390 MBps
                                2       361     2       664       (total bus, unidirectional);
                                1       332                       cache coherency --> 195 MBps (?)

O3800/400                       16      5534                      2 processors share a 1.6 GBps bus;
                                8       2855                      crossbar; 1 processor: 1.6 GBps/1/2 = 800 MBps
                                4       1400                      (total bus, unidirectional);
                                2       *1 700                    cache coherency --> 400 MBps (?)
                                1       *1 600?

SGI PowerChallenge/10k          4       537
(bus)                           2       351
                                1       172

Sun_UE_6001                     4       921
(bus)                           3       694
                                2       460
                                1       281

HP_V2600_Enterprise             8       1539
(crossbar?)                     1       390

HP_N4000
(bus; 440 MHz?)                 8       1759
                                1       760
----------------------------------------------------------------------------------------------------------------
*1 Extrapolated

Remarks:

1. About STREAM (http://www.cs.virginia.edu/stream/)

   ES40: fast on 1 processor, scales badly, reaches 1/2 of the aggregate QBB
   BW at 4 processors.

   GS160: slower than ES40 on 1 processor, scales better, reaches 1/2.5 of
   the aggregate QBB BW at 4 processors (still less than ES40).

   STREAM for the Origins is approximately proportional to processor MHz.
   This explains why white papers from 1997 list the same network speeds.

2. About attempted explanations:

   The explanations should explain the DV (unidirectional) values. They seem
   to work for ES40 and GS160 and also somewhat for Origin 3800, probably
   because the aggregate BW can be utilized at 400 MHz but not at lower MHz.
   (Older Origins are slower because of their lower MHz.) The only sensible
   comparison is between GS160 and Origin 3800. Here we have proportionality
   with MHz again!

CONCLUSION

In the above class of machines, STREAM is just roughly proportional to MHz!
In addition, ALL machines suffer from contention when going from 1 to 2
processors, so 4 processors only get approximately double the speed of
1 processor. However, the bus machines (Power Challenge and UE) do scale
linearly with the number of processors. (There are no data for newer Sun
machines.) The values are lower than expected, though.

On the interesting set of machines, this benchmark is linearly dependent on
MHz or on the Linpack benchmark and is therefore unusable.
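
The arithmetic behind the attempted explanations, and the rough
proportionality with MHz, can be checked with a short script. The following
is only a minimal sketch in Python: the numbers are copied from the table
above, the clock frequency is assumed to be the figure after the slash in
each machine name, and the names MACHINES, estimate_mbps and summarize are
hypothetical helpers introduced here for illustration.

```python
# Numbers copied from the table above (MBps). Assumption: "mhz" is the
# figure after the slash in the machine name.
MACHINES = {
    # name: (mhz, aggregate_mbps, divisor, coherency_halving,
    #        dv, stream_1p, stream_4p)
    "GS160/731": (731, 6400, 4, True, 400, 970, 2455),
    "ES40/833": (833, 5200, 4, False, 600, 1342, 2504),
    "O200/225": (225, 720, 1, True, 150, 303, 632),
    "O2000/250": (250, 780, 1, True, None, 332, 747),
    # The 1-processor STREAM value for O3800 is the extrapolated "600?".
    "O3800/400": (400, 1600, 1, True, None, 600, 1400),
}


def estimate_mbps(aggregate_mbps, divisor, coherency_halving):
    """Per-processor estimate without contention.

    aggregate / divisor / 2 gives the unidirectional figure; it is halved
    once more when a cache-coherency overhead is assumed, as in the table.
    """
    unidirectional = aggregate_mbps / divisor / 2
    return unidirectional / 2 if coherency_halving else unidirectional


def summarize(name):
    """Collect the quantities discussed in the remarks for one machine."""
    mhz, agg, div, coherency, dv, stream_1p, stream_4p = MACHINES[name]
    return {
        "estimate": estimate_mbps(agg, div, coherency),
        "dv": dv,                              # None where no DV is listed
        "half_stream": stream_1p / 2,          # DV is compared with this
        "stream_per_mhz": stream_1p / mhz,     # roughly constant if ~ MHz
        "scaling_1_to_4": stream_4p / stream_1p,
    }


for name in MACHINES:
    row = summarize(name)
    print(f"{name:10s} est={row['estimate']:6.0f} dv={row['dv']} "
          f"stream/2={row['half_stream']:6.1f} "
          f"stream/MHz={row['stream_per_mhz']:.2f} "
          f"4p/1p={row['scaling_1_to_4']:.2f}")
```

The stream/MHz column is the quantity behind the conclusion: if it stays
within a narrow band across machines, STREAM carries little information
beyond the clock frequency. The 4p/1p column shows the scaling from 1 to 4
processors for the same machines.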