Preface
YouTube often recommends MIT OCW lectures to me. I am ashamed to say that, although I am interested in the courses, I have never finished a whole semester. Still, I want to note down the ones I clicked on, since I gained some knowledge after all.
Reference
https://www.youtube.com/watch?v=o7h_sYMk_oc&list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&index=1
Software Properties
What software properties are more important than performance?
Many things matter more than performance, but that does not mean performance is unimportant.
Performance is like a currency (or a water supply): you spend it to buy the other properties you want.
Computer Programming in the Early Days
Software performance engineering was common, because machine resources were limited.
Many programs strained the machine’s resources.
- Programs had to be planned around the machine
- Many programs would not "fit" without intense performance engineering.
Clock rates and memory sizes were low back in the 1970s.
Core Speed = (Bus/Base Speed) * (Multiplier)
Lessons learned from the 70’s and 80’s
Premature optimization is the root of all evil. - Donald Knuth
More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity. - William Wulf
The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only): Don't do it yet. - Michael Jackson
Making Code Readable and Fast
Vendor Solution: Multicore
Clock rates stopped growing around 2004.
- To scale performance, processor manufacturers put many processing cores on the microprocessor chip
- Each generation of Moore’s Law potentially doubles the number of cores.
Performance Is No Longer Free
Moore’s Law continues to increase computer performance.
But that performance now takes the form of big multicore processors with complex cache hierarchies, wide vector units, GPUs, FPGAs, etc.
Generally, software must be adapted to utilize this hardware efficiently.
The same algorithm written in different languages - Python, Java, and C, for example - can differ dramatically in speed.
Interpreters are versatile, but slow
- The interpreter reads, interprets, and performs each program statement and updates the machine state.
- Interpreters can easily support high-level programming features - such as dynamic code alteration - at the cost of performance
JIT compilation
- JIT compilers can recover some of the performance lost by interpretation.
- When code is first executed, it is interpreted.
- The runtime system keeps track of how often the various pieces of code are executed.
- Whenever some piece of code executes sufficiently frequently, it gets compiled to machine code in real time.
- Future executions of that code use the more-efficient compiled version.
Performance of Different Loop Orders
Loop Order
```c
for (int i = 0; i < n; ++i) {
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
```
What if we change the order of the loops - say, put k outermost?
```c
for (int k = 0; k < n; ++k) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < n; ++i) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
```
Does this cause any performance difference? Yes - the loop order changes the memory access pattern, and with it the cache locality.
Hardware Caches
Each processor reads and writes main memory in contiguous blocks, called cache lines.
- Previously accessed cache lines are stored in a small memory, called a cache, that sits near the processor
- Cache hits: accesses to data in cache -> are fast
- Cache misses: accesses to data not in cache -> are slow
We can measure the effect of different access patterns using the Cachegrind cache simulator:
```shell
valgrind --tool=cachegrind ./mm
```
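A typical workflow looks like the following (`mm.c`/`mm` are placeholder names; Cachegrind writes its results to a `cachegrind.out.<pid>` file):

```shell
# compile with debug info so cg_annotate can map misses to source lines
clang -O2 -g mm.c -o mm

# run under the Cachegrind cache simulator (much slower than a native run)
valgrind --tool=cachegrind ./mm

# summarize cache-miss counts per function / per source line
cg_annotate cachegrind.out.<pid>
```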
Compiler Optimization
Clang provides a collection of optimization switches; you specify one to ask the compiler to optimize your code.
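For example (these are the standard Clang/GCC optimization levels; `mm.c` is a placeholder source file):

```shell
clang -O0 mm.c -o mm   # no optimization (easiest to debug)
clang -O1 mm.c -o mm   # basic optimizations
clang -O2 mm.c -o mm   # moderate optimizations, a common default
clang -O3 mm.c -o mm   # aggressive optimizations, incl. more vectorization
```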
Parallel Loops
The cilk_for loop allows all iterations of the loop to execute in parallel.
Experimenting with Parallel Loops
Parallel i loop:

```c
cilk_for (int i = 0; i < n; ++i) {
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
```

Running time: 3.18s
Parallel j loop:

```c
for (int i = 0; i < n; ++i) {
  cilk_for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
```

Running time: 531.71s
Parallel i and j loops:

```c
cilk_for (int i = 0; i < n; ++i) {
  cilk_for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
```

Running time: 10.64s
We cannot parallelize the k loop and still get the correct answer: every k iteration updates the same entry C[i][j], so running them in parallel creates a determinacy race.
Rule of Thumb: Parallelize outer loops rather than inner loops.
Using parallel loops gets us almost 18x speedup on 18 cores! (Disclaimer: not all code is so easy to parallelize effectively.)
Vector Hardware
Modern microprocessors incorporate vector hardware to process data in single-instruction stream, multiple-data stream (SIMD) fashion.
Compiler Vectorization
Clang/LLVM uses vector instructions automatically when compiling at optimization level -O2 or higher.
Many machines don’t support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default.
Vectorization Flags
Programmers can direct the compiler to use modern vector instructions using compiler flags such as the following:
- -mavx: Use Intel AVX vector instructions
- -mavx2: Use Intel AVX2 vector instructions
Version Implementation
- Python
- Java
- C
- +interchange loops
- +optimization flags
- Parallel loops
- +tiling
- Parallel divide-and-conquer
- +compiler vectorization
- +AVX intrinsics
Plus More Optimizations
We can apply several more insights and performance-engineering tricks to make this code run faster, including:
- Preprocessing
- Matrix transposition
- Data alignment
- Memory-management optimizations
- A clever algorithm for the base case that uses AVX intrinsic instructions explicitly