Note how a speedup of about 3 (212 sec down to 71 sec) for the 1600x1600 case is achieved on the SGI Power Challenge solely by using some basic compiler options:
"f77 -r8000 -mips4 -64 -O3" ............................ 212 sec
"f77 -r8000 -mips4 -64 -O3 -LNO:ou=6" .................. 105 sec
"f77 -r8000 -mips4 -64 -O3 -LNO:ou=8 -pfa -WK,-p=1" ....  71 sec
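To give an idea of what the compiler may be doing here: assuming that "ou" in -LNO:ou=n stands for outer-loop unrolling (an assumption, not stated above), the transformation can be sketched by hand as follows. The sketch is in Python rather than Fortran, it illustrates only the loop structure and not the performance, and the function names and the unroll factor u are hypothetical.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_outer_unrolled(a, b, n, u=2):
    """Same product with the outer i loop unrolled by u and the copies
    fused into one inner loop, so each b[k][j] is loaded once and
    reused for u rows of the result."""
    c = [[0.0] * n for _ in range(n)]
    i_main = n - n % u  # rows covered by full blocks of u
    for i0 in range(0, i_main, u):
        for j in range(n):
            acc = [0.0] * u  # one running sum per unrolled row
            for k in range(n):
                bkj = b[k][j]
                for r in range(u):
                    acc[r] += a[i0 + r][k] * bkj
            for r in range(u):
                c[i0 + r][j] = acc[r]
    # Remainder rows when n is not a multiple of u.
    for i in range(i_main, n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c
```

Both functions add the terms of each element in the same order over k, so they return identical results. On real hardware the benefit comes from the reuse of the loaded value, which an optimizing compiler would keep in a register.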
Beyond that, linking the built-in library (f77 ... -lcomplib.sgimath) and calling its DGEMM routine provides an additional speedup of about 2: 34 sec.
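The speedup figures quoted above follow directly from the timings. A small sketch that redoes the arithmetic is given below; the timings are the ones reported here, while the labels and the helper name speedups are only illustrative.

```python
# (label, wall time in seconds) in the order the builds are reported.
timings = [
    ("-O3", 212.0),
    ("-O3 -LNO:ou=6", 105.0),
    ("-O3 -LNO:ou=8 -pfa -WK,-p=1", 71.0),
    ("DGEMM via -lcomplib.sgimath", 34.0),
]


def speedups(rows):
    """Return (label, speedup vs. first row, speedup vs. previous row)."""
    base = rows[0][1]
    result = []
    prev = None
    for label, seconds in rows:
        vs_prev = prev / seconds if prev is not None else 1.0
        result.append((label, base / seconds, vs_prev))
        prev = seconds
    return result


for label, vs_base, vs_prev in speedups(timings):
    print(f"{label:30s} x{vs_base:4.1f} overall, x{vs_prev:4.1f} vs. previous")
```

The compiler options alone give 212/71, roughly 3.0; DGEMM adds another 71/34, roughly 2.1, for about 6.2 overall relative to plain -O3.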