JimYuan's Blog

Sharing the things I learned


Preface

YouTube often recommends MIT OCW course videos to me. Shameless to say, although I am interested in the courses, I have never finished a whole semester. Still, I just want to note down the lectures I clicked on and gain some knowledge after all.

Reference

https://www.youtube.com/watch?v=o7h_sYMk_oc&list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&index=1

Software Properties

What software properties are more important than performance?

There are many things more important than performance, but that doesn't mean performance is unimportant.

Performance is like a kind of currency (a resource, like water): you spend it to buy other software properties.

Computer Programming in the Early Days

Software performance engineering was common, because machine resources were limited.

Many programs strained the machine's resources.

  • Programs had to be planned around the machine.
  • Many programs would not "fit" without intense performance engineering.

Clock rates and memory sizes were low back in the 70s.

Clockrate

Core Speed = (Bus/Base Speed) * (Multiplier)
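For example (illustrative numbers of my own, not from the lecture): a CPU with a 100 MHz bus/base clock and a multiplier of 36 runs at a core speed of 3.6 GHz.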

Lessons learned from the 70’s and 80’s

Premature optimization is the root of all evil. - Donald Knuth

More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity. - William Wulf

The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization - For experts only: Don't do it yet. - Michael Jackson

Making code readable and Fast

Vendor Solution: Multicore

Clock rates seemed to stop growing around 2004.

  • To scale performance, processor manufacturers put many processing cores on the microprocessor chip
  • Each generation of Moore’s Law potentially doubles the number of cores.

Performance Is No Longer Free

Moore’s Law continues to increase computer performance.
But now that performance looks like big multicore processors with complex cache hierarchies, wide vector units, GPUs, FPGAs, etc.

Generally, software must be adapted to utilize this hardware efficiently.

Consider the same code written in different languages: Python, Java, and C, for example.

Interpreters are versatile, but slow

  • The interpreter reads, interprets, and performs each program statement and updates the machine state.
  • Interpreters can easily support high-level programming features - such as dynamic code alteration - at the cost of performance

JIT compilation

  • JIT compilers can recover some of the performance lost by interpretation.
  • When code is first executed, it is interpreted.
  • The runtime system keeps track of how often the various pieces of code are executed.
  • Whenever some piece of code executes sufficiently frequently, it gets compiled to machine code in real time.
  • Future executions of that code use the more-efficient compiled version.

Performance of Different Loop Orders

Loop Order

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

What if we change the order of the loops? Let's say:

for (int k = 0; k < n; k++) {
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Does this cause any performance difference?

The difference comes from cache locality: in C, arrays are stored in row-major order, so the loop order determines whether successive iterations touch adjacent memory (good spatial locality) or jump across rows (poor locality).

Hardware Caches

Each processor reads and writes main memory in contiguous blocks, called cache lines.

  • Previously accessed cache lines are stored in a small memory, called a cache, that sits near the processor
  • Cache hits: accesses to data in cache -> are fast
  • Cache misses: accesses to data not in cache -> are slow

We can measure the effect of different access patterns using the Cachegrind cache simulator

valgrind --tool=cachegrind ./mm

Compiler Optimization

Clang provides a collection of optimization switches (-O0 for no optimization up through -O3 for aggressive optimization). You can specify one of these switches to ask the compiler to optimize.

Parallel Loops

The cilk_for loop allows all iterations of the loop to execute in parallel.

Experimenting with Parallel Loops

Parallel i Loop

cilk_for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
        for (int j = 0; j < n; ++j) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Running time: 3.18s

Parallel j Loop

for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
        cilk_for (int j = 0; j < n; ++j) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Running time: 531.71s

Parallel i and j loops

cilk_for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
        cilk_for (int j = 0; j < n; ++j) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Running time: 10.64s

We cannot parallelize the k loop and still get the correct answer: different k iterations update the same entry C[i][j], so running them in parallel creates a data race.

Rule of Thumb: Parallelize outer loops rather than inner loops.

Using parallel loops gets us almost 18x speedup on 18 cores! (Disclaimer: Not all code is so easy to parallelize effectively.)

Vector Hardware

Modern microprocessors incorporate vector hardware to process data in single-instruction, multiple-data (SIMD) fashion.

Compiler Vectorization

Clang/LLVM uses vector instructions automatically when compiling at optimization level -O2 or higher.

Many machines don’t support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default.
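As a minimal illustrative sketch (my own example, not from the lecture), a loop with independent iterations like the one below is the kind of code the auto-vectorizer can turn into SIMD instructions:

// Each iteration is independent, so the compiler is free to process
// several elements per vector instruction.
void add_arrays(float* a, const float* b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}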

Vectorization Flags

Programmers can direct the compiler to use modern vector instructions using compiler flags such as the following:

  • -mavx: Use Intel AVX vector instructions
  • -mavx2: Use Intel AVX2 vector instructions

Version Implementation

  1. Python
  2. Java
  3. C
  4. +interchange loops
  5. +optimization flags
  6. Parallel loops
  8. +tiling
  8. Parallel divide-and-conquer
  9. +compiler vectorization
  10. +AVX intrinsics

Plus More Optimizations

We can apply several more insights and performance-engineering tricks to make this code run faster, including:

  • Preprocessing
  • Matrix transposition
  • Data alignment
  • Memory-management optimizations
  • A clever algorithm for the base case that uses AVX intrinsic instructions explicitly

Preface

Just like any other regular work day, I note down some new things I learn along with the task I am facing. Recently I have been crazy about the Voronoi diagram. As badly as I want to implement the sweepline algorithm, I found the best practice might be to be patient and learn things little by little. Also, I am kind of into "Study with me" series on YouTube. I feel calm with someone's company alongside.

Reference

https://discourse.mcneel.com/t/c-select-objects-by-layer/84420/6
Appreciate the help from username Mahdiyar.

Two kinds of selection are different:

  • selecting objects in the Rhino DocView
  • selecting objects in GH (Grasshopper)
// Find the layer with the given name; optionally include its sub-layers.
var ghobjs = new List<object>();
var selectedLayer = RhinoDoc.ActiveDoc.Layers.FindName(layerName);
var layers = new List<Rhino.DocObjects.Layer>(){ selectedLayer };
if (subLayer)
{
    foreach (var layer in RhinoDoc.ActiveDoc.Layers)
        if (layer.IsChildOf(selectedLayer))
            layers.Add(layer);
}
B = layers;
// Collect the geometry of every object on the selected layers.
foreach (var layer in layers)
{
    var rhobjs = doc.Objects.FindByLayer(layer);
    foreach (var rhObj in rhobjs)
        ghobjs.Add(rhObj.Geometry);
}
A = ghobjs;

SelectInSameLayer

Lecture Link

https://www.youtube.com/watch?v=Pn19deUBsM8&t=13s

The Post-Office Problem

https://link.springer.com/chapter/10.1007%2F3-540-10291-4_1

http://blog.ivank.net/voronoi-diagram-in-javascript.html

http://paperjs.org/examples/voronoi/

https://shaneosullivan.wordpress.com/2011/03/22/sweep-line-voronoi-algorithm-in-javascript/

https://jacquesheunis.com/post/fortunes-algorithm/

http://blog.ivank.net/fortunes-algorithm-and-implementation.html

https://github.com/pvigier/FortuneAlgorithm

http://www.cs.uu.nl/geobook/

https://pvigier.github.io/2018/11/18/fortune-algorithm-details.html

Takeaways from Pvigier's Blog

https://pvigier.github.io/2018/11/18/fortune-algorithm-details.html

pseudo-code from Pvigier’s Blog

add a site event in the event queue for each site
while the event queue is not empty
    pop the top event
    if the event is a site event
        insert a new arc in the beachline
        check for new circle events
    else
        create a vertex in the diagram
        remove the shrunk arc from the beachline
        delete invalidated events
        check for new circle events

There are two kinds of events:

  • site event
  • circle event

Diagram’s Data Structure

Doubly connected edge list (DCEL).

class VoronoiDiagram
{
public:
    // ...

private:
    std::vector<Site> mSites;
    std::vector<Face> mFaces;
    std::list<Vertex> mVertices;
    std::list<HalfEdge> mHalfEdges;
};

The Site class represents an input point. Each site has an index (useful to query it), coordinates, and a pointer to its cell (face):

struct Site
{
    std::size_t index;
    Vector2 point;
    Face* face;
};

The vertices of the cells are represented by the Vertex class; they just have a coordinates field:

struct Vertex
{
    Vector2 point;
};

Here is the implementation of half-edges:

struct HalfEdge
{
    Vertex* origin;
    Vertex* destination;
    HalfEdge* twin;
    Face* incidentFace;
    HalfEdge* prev;
    HalfEdge* next;
};

An edge in a Voronoi diagram is shared by two adjacent cells. In the DCEL data structure, we split these edges into two half-edges, one for each cell, and they are linked by the twin pointer.

Moreover, a half-edge has an origin vertex and a destination vertex.
The incidentFace field points to the face to which the half-edge belongs.

In a DCEL, cells are implemented as circular doubly linked lists of half-edges, where adjacent half-edges are linked together. Thus the prev and next fields point to the previous and next half-edges in the cell.

Finally, the Face class represents a cell, and it just contains a pointer to its site and another to one of its half-edges. It does not matter which half-edge it is, because a cell is a closed polygon: we can access all of its half-edges by traversing the circular linked list.

struct Face
{
    Site* site;
    HalfEdge* outerComponent;
};
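As a small illustration (my own sketch, not code from the original post), traversing a cell's boundary just means following the next pointers until we come back to the starting half-edge:

// Visit every half-edge of a cell, starting from its outerComponent.
void forEachHalfEdge(const Face* face)
{
    HalfEdge* start = face->outerComponent;
    HalfEdge* halfEdge = start;
    do
    {
        // ... use halfEdge->origin and halfEdge->destination here ...
        halfEdge = halfEdge->next;
    } while (halfEdge != start);
}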

Event Queue

The standard way to implement the event queue is to use a priority queue.

During the processing of site and circle events we may need to remove circle events from the queue because they are not valid anymore.

But most of the standard implementations of priority queues do not allow removing an element that is not the top one. In particular, that is the case for std::priority_queue.

There are two ways to tackle this problem.

  1. The first and simplest one is to add a valid flag to each event, set to true initially. Then, instead of removing a circle event from the queue, we just set its flag to false. Finally, when we process the events in the main loop, if the valid flag of an event is false, we simply discard it and process the next one (see the sketch after this list).
  2. The second method, which I adopted, is not to use std::priority_queue. Instead, I implemented my own priority queue which supports removal of any element it contains. The implementation of such a queue is pretty straightforward. I chose this method because I find it makes the code for the algorithm clearer.
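Here is a minimal sketch of the first method (lazy deletion) with std::priority_queue. The Event type and its fields below are simplified assumptions for illustration, not the types from the actual implementation:

#include <memory>
#include <queue>
#include <vector>

struct Event
{
    double y;           // priority: position of the sweep line
    bool valid = true;  // set to false instead of removing the event
};

struct EventCompare
{
    bool operator()(const std::shared_ptr<Event>& a,
                    const std::shared_ptr<Event>& b) const
    {
        return a->y < b->y;  // events with the highest y come out first
    }
};

using EventQueue = std::priority_queue<std::shared_ptr<Event>,
                                       std::vector<std::shared_ptr<Event>>,
                                       EventCompare>;

// In the main loop, invalidated events are popped and simply skipped:
//   auto event = queue.top(); queue.pop();
//   if (!event->valid) continue;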

Beachline

If not implemented correctly, there is no guarantee that the algorithm will run in O(n log n). The key to reaching this time complexity is to use a self-balancing tree.

In my implementation the beachline is still a tree, but all nodes represent an arc.

struct Arc
{
    enum class Color{RED, BLACK};

    // Hierarchy
    Arc* parent;
    Arc* left;
    Arc* right;

    // Diagram
    VoronoiDiagram::Site* site;
    VoronoiDiagram::HalfEdge* leftHalfEdge;
    VoronoiDiagram::HalfEdge* rightHalfEdge;
    Event* event;

    // Optimizations
    Arc* prev;
    Arc* next;

    // Only for balancing
    Color color;
};

In Fortune's algorithm we have no key; we only know that we want to insert an arc before or after another one in the beachline. To do that, I created the methods insertBefore and insertAfter:

void Beachline::insertBefore(Arc* x, Arc* y)
{
    // Find the right place
    if (isNil(x->left))
    {
        x->left = y;
        y->parent = x;
    }
    else
    {
        x->prev->right = y;
        y->parent = x->prev;
    }

    // Set the pointers
    y->prev = x->prev;
    if (!isNil(y->prev))
        y->prev->next = y;
    y->next = x;
    x->prev = y;

    // Balance the tree
    insertFixup(y);
}
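The matching insertAfter is not copied in these notes; below is my own symmetric sketch of what it would look like, mirroring insertBefore (an assumption, not code taken from Pvigier's post):

void Beachline::insertAfter(Arc* x, Arc* y)
{
    // Find the right place (mirror image of insertBefore)
    if (isNil(x->right))
    {
        x->right = y;
        y->parent = x;
    }
    else
    {
        x->next->left = y;
        y->parent = x->next;
    }

    // Set the pointers
    y->next = x->next;
    if (!isNil(y->next))
        y->next->prev = y;
    y->prev = x;
    x->next = y;

    // Balance the tree
    insertFixup(y);
}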

Priority Queues

https://www.youtube.com/watch?v=wptevk0bshY

What is a Priority Queue

A priority queue is an Abstract Data Type (ADT) that operates like a normal queue, except that each element has a certain priority. The priority of the elements in the priority queue determines the order in which elements are removed from the PQ.

NOTE: Priority queues only support comparable data.

What is a Heap?

A heap is a tree-based data structure that satisfies the heap invariant (also called the heap property):
If A is a parent node of B, then A is ordered with respect to B, for all nodes A and B in the heap.

  • Max Heap
  • Min Heap

Heaps must be trees.

When and where is a PQ used?

  • Used in certain implementations of Dijkstra's Shortest Path algorithm.
  • Anytime you need to dynamically fetch the next best or next worst element.
  • Used in Huffman coding (which is often used for lossless data compression).
  • Best-First Search (BFS) algorithms such as A* use PQs to continuously grab the next most promising node.
  • Used by Minimum Spanning Tree (MST) algorithms.

Complexity PQ with binary heap

  • Binary Heap construction: O(n)
  • Polling: O(log(n))
  • Peeking: O(1)
  • Adding: O(log(n))
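As a quick illustration (my own example, not from the video), the C++ standard library's std::priority_queue is backed by a binary heap and exposes exactly these operations:

#include <iostream>
#include <queue>
#include <vector>

int main()
{
    // std::priority_queue is a max-heap by default; std::greater turns it into a min-heap.
    std::priority_queue<int, std::vector<int>, std::greater<int>> minHeap;
    for (int value : {5, 1, 4, 2, 3})
        minHeap.push(value);            // adding: O(log n)
    std::cout << minHeap.top() << "\n"; // peeking: O(1), prints 1
    minHeap.pop();                      // polling: O(log n)
    return 0;
}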

Self-Balancing Tree (AVL Tree)

https://www.youtube.com/watch?v=vRwi_UcZGjU

  • Balance
  • Rotation
  • Implementation
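To make the rotation idea concrete, here is a minimal sketch of a left rotation on a plain binary-tree node (a hypothetical Node type of my own, not tied to the beachline code):

struct Node
{
    Node* parent = nullptr;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Left rotation: x's right child y becomes the new root of the subtree.
Node* rotateLeft(Node* x)
{
    Node* y = x->right;
    x->right = y->left;        // y's left subtree becomes x's right subtree
    if (y->left != nullptr)
        y->left->parent = x;
    y->parent = x->parent;     // y takes x's place
    y->left = x;
    x->parent = y;
    return y;                  // caller re-links y to x's former parent and updates heights
}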

Rebalancing threshold

pseudo-code from Jacquesheunis’ blog

https://jacquesheunis.com/post/fortunes-algorithm/

Fill the event queue with site events for each input site.
While the event queue still has items in it:
    If the next event on the queue is a site event:
        Add the new site to the beachline
    Otherwise it must be an edge-intersection event:
        Remove the squeezed cell from the beachline
Cleanup any remaining intermediate state

Voronoi Diagram - Computational Geometry

Fortune’s Sweep

VoronoiDiagram(P: set of points in R2)
    Q : new PriorityQueue(P)           // site events sorted by y-coord.
    T : new BalancedBinarySearchTree() // sweep status (beachline B)
    D : new DCEL()                     // to-be Vor(P)

    while not Q.empty() do
        p <- Q.ExtractMax()
        if p is a site event then
            HandleSiteEvent(p)
        else
            a <- arc on B that will disappear
            HandleCircleEvent(a)

    treat remaining internal nodes of T (unbounded edges of Vor(P))
    return D

Handling Events

  • HandleSiteEvent(point p)
    • Search in T for the arc a vertically above p. If a has a pointer to a circle event in Q, delete this event.
    • Split a into a0 and a2. Let a1 be the new arc of p.
    • Add Vor-edges <q, p> and <p, q> to the DCEL.
    • Check <., a0, a1> and <a1, a2, .> for circle events.
  • HandleCircleEvent(arc a)
    • T.delete(a); update breakpoints.
    • Delete all circle events involving a from Q.
    • Add a Vor-vertex connecting a_left and a_right.