JimYuan's Blog

Sharing the things I learned


Performance Engineering: Introduction & Matrix Multiplication

Preface

YouTube often recommends MIT OCW lecture videos to me. Shameless to say, although I am interested in the courses, I have never finished a whole semester. Still, I want to note down the ones I clicked on and the knowledge I gained from them.

Reference

https://www.youtube.com/watch?v=o7h_sYMk_oc&list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&index=1

Software Properties

What software properties are more important than performance?

Many things matter more than performance, but that does not mean performance is unimportant.

Performance is like a currency (or a limited resource, like water): we spend it to buy the other properties we care about.

Computer Programming in the Early Days

Software performance engineering was common, because machine resources were limited.

Many programs strained the machine’s resources.

  • Programs had to be planned around the machine
  • Many programs would not “fit” without intense performance engineering.

Clock rates and memory sizes were low back in the 70s.

Clockrate

Core Speed = (Bus/Base Speed) * (Multiplier)
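For example, with hypothetical numbers: a 100 MHz base (bus) clock and a 36× multiplier give a core speed of 100 MHz × 36 = 3.6 GHz.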

Lessons learned from the 70’s and 80’s

Premature optimization is the root of all evil. - Donald Knuth

More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity. - William Wulf

The First Rule of Program Optimization: Don’t do it. The Second Rule of Program Optimization - For experts only: Don’t do it yet. - Michael Jackson

Making Code Readable and Fast

Vendor Solution: Multicore

Clock rates seem to have stopped growing around 2004.

  • To scale performance, processor manufacturers put many processing cores on the microprocessor chip
  • Each generation of Moore’s Law potentially doubles the number of cores.

Performance Is No Longer Free

Moore’s Law continues to increase computer performance.
But now that performance looks like big multicore processors with complex cache hierarchies, wide vector units, GPUs, FPGAs, etc.

Generally, software must be adapted to utilize this hardware efficiently.

Consider the same code written in different languages, for example Python, Java, and C: the running times can differ by orders of magnitude.

Interpreters are versatile, but slow

  • The interpreter reads, interprets, and performs each program statement and updates the machine state (see the sketch after this list).
  • Interpreters can easily support high-level programming features - such as dynamic code alteration - at the cost of performance.
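To make the read-interpret-perform cycle concrete, here is a toy bytecode interpreter in C (the instruction set is invented here, purely for illustration):

#include <stdio.h>

// A tiny stack machine: read an opcode, perform it, update state, repeat.
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void interpret(const int *code) {
    int stack[64], sp = 0;
    for (int pc = 0; ; ) {
        switch (code[pc++]) {                          // read and decode
        case OP_PUSH:  stack[sp++] = code[pc++]; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    interpret(program);   // prints 5
    return 0;
}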

JIT compilation

  • JIT compilers can recover some of the performance lost by interpretation.
  • When code is first executed, it is interpreted.
  • The runtime system keeps track of how often the various pieces of code are executed.
  • Whenever some piece of code executes sufficiently frequently, it gets compiled to machine code in real time.
  • Future executions of that code use the more-efficient compiled version.
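A minimal C sketch of the hot-code idea (hypothetical throughout: both versions here are precompiled, whereas a real JIT emits machine code at runtime):

#include <stdio.h>

// Dispatch through a function pointer; after enough calls, "promote"
// the hot routine to its faster version.
static long square_slow(long x) { return x * x; }  // stand-in for interpreted code
static long square_fast(long x) { return x * x; }  // stand-in for compiled code

static long (*square_impl)(long) = square_slow;
static int hot_count = 0;
enum { HOT_THRESHOLD = 1000 };

long square(long x) {
    if (square_impl == square_slow && ++hot_count >= HOT_THRESHOLD)
        square_impl = square_fast;   // the "compile when hot" step
    return square_impl(x);
}

int main(void) {
    long sum = 0;
    for (long i = 0; i < 1000000; ++i)
        sum += square(i % 10);
    printf("%ld\n", sum);
    return 0;
}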

Performance of Different Loop Orders

Loop Order

for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

what if we change the order of the loops, let’s say,

for (int k = 0; k < n; ++k) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Does this cause any performance difference?

Yes. The answer comes down to cache locality.

Hardware Caches

Each processor reads and writes main memory in contiguous blocks, called cache lines.

  • Previously accessed cache lines are stored in a small memory, called a cache, that sits near the processor
  • Cache hits: accesses to data in cache -> are fast
  • Cache misses: accesses to data not in cache -> are slow
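In C, a matrix is laid out in row-major order, so the loop order determines whether the inner loop walks memory with stride 1 (staying within cache lines) or with stride n (missing on nearly every access). A sketch of the cache-friendly ordering for this matrix multiply, assuming row-major doubles:

// Row-major layout: element (i, j) is adjacent in memory to (i, j+1).
// With the (i, k, j) order, the inner loop scans C[i][*] and B[k][*]
// sequentially, so most accesses hit an already-cached line.
for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            C[i][j] += A[i][k] * B[k][j];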

We can measure the effect of different access patterns using the Cachegrind cache simulator

valgrind --tool=cachegrind ./mm

Compiler Optimization

Clang provides a collection of optimization switches. You can specify a switch to the compiler to ask it to optimize.
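For example, assuming the matrix-multiply source is in mm.c (the file name is just for illustration):

clang -O0 mm.c -o mm    # no optimization
clang -O2 mm.c -o mm    # standard optimizations
clang -O3 mm.c -o mm    # aggressive optimizations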

Parallel Loops

The cilk_for loop allows all iterations of the loop to execute in parallel.
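A minimal sketch, assuming the OpenCilk toolchain (cilk_for is declared in <cilk/cilk.h>, and the code is compiled with clang -fopencilk):

#include <cilk/cilk.h>

// Iterations of a cilk_for must be independent; the runtime is then
// free to schedule them across all available worker threads.
void scale(double *x, int n, double a) {
    cilk_for (int i = 0; i < n; ++i)
        x[i] *= a;
}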

Experimenting with Parallel Loops

Parallel i Loop

cilk_for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
        for (int j = 0; j < n; ++j) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Running time: 3.18s

Parallel j Loop

for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
        cilk_for (int j = 0; j < n; ++j) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Running time: 531.71s

Parallel i and j loops

cilk_for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
        cilk_for (int j = 0; j < n; ++j) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Running time: 10.64s

We cannot parallelize the k loop and still get a correct answer: every iteration of k updates the same entry C[i][j], so running them in parallel creates a race on that entry.

Rule of Thumb: Parallelize outer loops rather than inner loops.

Using parallel loops gets us almost 18x speedup on 18 cores! (Disclaimer: Not all code is so easy to parallelize effectively.)

Vector Hardware

Modern microprocessors incorporate vector hardware to process data in single-instruction, multiple-data (SIMD) fashion.

Compiler Vectorization

Clang/LLVM uses vector instructions automatically when compiling at optimization level -O2 or higher.

Many machines don’t support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default.

Vectorization Flags

Programmers can direct the compiler to use modern vector instructions using compiler flags such as the following:

  • -mavx: Use Intel AVX vector instructions
  • -mavx2: Use Intel AVX2 vector instructions
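For example, again assuming the source is in mm.c (-march=native is another common choice, enabling whatever instructions the build machine supports):

clang -O3 -mavx2 mm.c -o mm
clang -O3 -march=native mm.c -o mm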

Implementation Versions

  1. Python
  2. Java
  3. C
  4. +interchange loops
  5. +optimization flags
  6. Parallel loops
  7. +tiling
  8. Parallel divide-and-conquer
  9. +compiler vectorization
  10. +AVX intrinsics

Plus More Optimizations

We can apply several more insights and performance-engineering tricks to make this code run faster, including:

  • Preprocessing
  • Matrix transposition
  • Data alignment
  • Memory-management optimizations
  • A clever algorithm for the base case that uses AVX intrinsic instructions explicitly (see the sketch below)
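As an illustration of what explicit AVX intrinsics look like, here is a hypothetical base case (not the lecture's actual code; it assumes doubles, row-major storage, n divisible by 4, and B already transposed into Bt; compile with -mavx):

#include <immintrin.h>

// Hypothetical AVX base case: accumulate 4 doubles per instruction.
// Bt = transpose(B), so both A's row and Bt's row are scanned with stride 1.
void mm_base_avx(double *C, const double *A, const double *Bt, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            __m256d acc = _mm256_setzero_pd();
            for (int k = 0; k < n; k += 4) {
                __m256d a = _mm256_loadu_pd(&A[i * n + k]);
                __m256d b = _mm256_loadu_pd(&Bt[j * n + k]);
                acc = _mm256_add_pd(acc, _mm256_mul_pd(a, b));
            }
            double lanes[4];                  // horizontal sum of the
            _mm256_storeu_pd(lanes, acc);     // four accumulator lanes
            C[i * n + j] += lanes[0] + lanes[1] + lanes[2] + lanes[3];
        }
    }
}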