Code Modernization – The Importance of Cache Awareness


Thirty years ago, in a simpler 8-bit world, I spent a few years developing games in 6502 and Z80 assembler. Life was simpler then, with processors running at 1 MHz and instructions 2-3 bytes long. These instructions typically took 2-3 clock cycles to execute but could be slower. That gives a maximum of 500,000 simple instructions per second, though realistically more like 250,000 to 330,000.

Now, though, my Intel i7-5930K is 400-450x as fast and also processes 64 bits at a time instead of 8. And with all the clever tricks that CPU manufacturers throw in, such as pre-fetching and separate instruction and data caches, determining how many instructions per second it executes is no easy task. That's not even counting its six hyper-threaded cores: flat out, it's a V12 compared to the single-cylinder 6502.

This gist from developer jboner is five years old but gives rough estimates of how long common operations take. An L1 cache reference is 0.5 nanoseconds, an L2 reference is 7 nanoseconds (14x slower than L1), and reading main memory is 100 ns. These are incredibly small intervals; you get 10 million 100 ns periods each second.

C++ Code

To time the examples, which I ran from the command line on Windows, I used two files, hr_time.h and hr_time.cpp, which you can find and download from the PokerEval2 project on GitHub.
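If you'd rather not download them, a minimal stand-in built on std::chrono will do. This is a sketch: only getElapsedTime() returning milliseconds is described below; the class name and the startTimer/stopTimer methods are my assumptions about the interface.

#include <chrono>

// Minimal stopwatch; assumed interface, returning elapsed time in milliseconds.
class stopWatch {
    std::chrono::high_resolution_clock::time_point begin, end;
public:
    void startTimer() { begin = std::chrono::high_resolution_clock::now(); }
    void stopTimer()  { end = std::chrono::high_resolution_clock::now(); }
    double getElapsedTime() const {
        return std::chrono::duration<double, std::milli>(end - begin).count();
    }
};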

The first example shows a problem with data size and cache lines. The examples were run on my PC, which has an Intel i7-5930K CPU, part of the Haswell family. It has three levels of cache: L1 is a 32 KB instruction cache plus a 32 KB data cache for each of the six cores, L2 is 256 KB per core, and a 15 MB L3 cache is shared between all the cores. L1 and L2 are eight-way set associative, which means that each block of main memory can be held in any of eight cache blocks; the higher the associativity, the greater the chance of a cache hit.

Cache lines are 64 bytes long on the i7-5930K, and this can have an impact on processing time. Here's the example. Note that because of the array size I built this as 64-bit; to run it as 32-bit you need to lower the MaxData constant by at least a factor of 10.
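(The listing below is a sketch reconstructed from the description that follows; the member names a to d and the exact loop structure are assumptions.)

#include <iostream>
#include <cstring>
#include "hr_time.h"

const int MaxData = 100000000;  // 100 million structs; 1.6 GB, hence the 64-bit build

struct Data {
    int a;
    int b;
    int c;
    int d;   // 4 x 4 bytes = 16 bytes per struct
};

Data* pData = new Data[MaxData];

int main() {
    stopWatch sw;
    std::memset(pData, 0, sizeof(Data) * MaxData);  // zero-fill all the structs
    sw.startTimer();
    for (int loop = 0; loop < 10; loop++) {         // 10 runs for a better average
        for (int i = 0; i < MaxData; i++) {
            pData[i].a = pData[i].b;                // copy b to a in each struct
        }
    }
    sw.stopTimer();
    std::cout << "Time = " << sw.getElapsedTime() / 10000.0 << " secs" << std::endl;
    delete[] pData;
    return 0;
}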

This allocates an array of 100 million structs, each 16 bytes long. It zero-fills them, then within the inner loop it copies the value b to a in each struct. This is done 10 times to get a better average. The sw.getElapsedTime() call returns milliseconds, so dividing by 10,000 gives the average time in seconds over the 10 loops. On my PC, built with the Intel C++ compiler, it outputs the time taken.

Now comment out c and d in the struct, recompile, and rerun; you'd expect the times to be the same.
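That is, with the struct reduced to this:

struct Data {
    int a;
    int b;
    //int c;
    //int d;
};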

Instead, it's almost twice as fast. Yet the code was unchanged, only the struct layout. Why is this? It's because of the cache line size.

When it reads pData[i].b, the CPU fetches a whole 64-byte cache line, not just the 4 bytes of b. At 16 bytes per struct, four structs fit into each cache line; with c and d commented out, the struct shrinks to 8 bytes and eight fit. To confirm the sizes, add a line like this after the std::cout << "Time ..." line:
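std::cout << "Size of Data = " << sizeof(Data) << std::endl;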

Rebuild both versions and you'll see Size of Data = 8 and Size of Data = 16 for the two cases.

So with the 8-byte struct, fetching one cache line retrieves eight structs at once instead of four, since the structs are laid out one after another in RAM. Each fetch delivers twice as many structs, the loop touches half as many cache lines, and the data is already there without another trip to main memory. That's why it's nearly twice as fast.

Alignment

Understanding the size and alignment of variables is important. If we take the example above and change the struct to this:
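A sketch consistent with the sizes reported below (the member names are assumptions; the original listing isn't reproduced):

struct Data {
    char a;
    char c;
    short b;   // 1 + 1 + 2 bytes; the short is already 2-byte aligned, so size 4
};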

Then when we run it, we get Size of Data = 4 and Time = 0.000054 secs. But now, if we move the line short b; to sit between the two chars:
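Again assuming the names from the sketch above:

struct Data {
    char a;
    short b;   // needs 2-byte alignment, so a padding byte is inserted after a
    char c;    // a trailing pad byte rounds the size up to a multiple of 2
};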

Then we get Size of Data = 6 and Time = 0.000072 secs. Chars occupy one byte and need no alignment, but shorts occupy two bytes and are usually aligned on a 2-byte boundary. So the compiler inserts a padding byte between a and b, and another after c so that the struct's size stays a multiple of its alignment. Sixteen of the 4-byte structs fit into a cache line, while only about ten of the 6-byte structs do, so the loop takes longer.
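You can see exactly where the padding goes with offsetof. A quick check (the struct names here are just for illustration):

#include <cstddef>   // offsetof
#include <iostream>

struct Tight { char a; char c; short b; };  // 4 bytes, no padding
struct Loose { char a; short b; char c; };  // 6 bytes, two padding bytes

int main() {
    std::cout << "b is at offset " << offsetof(Loose, b) << std::endl;  // 2: pad byte after a
    std::cout << "sizeof(Loose) = " << sizeof(Loose) << std::endl;      // 6: pad byte after c
    std::cout << "sizeof(Tight) = " << sizeof(Tight) << std::endl;      // 4
    return 0;
}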

Conclusions

When there are many small structs in memory, the internal layout with regard to padding and alignment can have an impact on performance. This is perhaps one of the exceptions to the “don’t prematurely optimize” rule.

Posted on March 6, 2017 by David Bolton, Slashdot Media Contributing Editor