A few years ago I ran across a colleague who was trying to measure matrix multiplication speed on a RISC workstation and compare it to the mighty Convex computer. I said, "Why don't you exchange the loops so the column loop becomes innermost?" The answer was, "But this is a SCALAR machine, so the innermost loop should be the SCALAR PRODUCT loop!"
If you didn't get the point, or if you think compilers are smart anyway, you probably shouldn't bother reading on. If you are unsure, try this.