I bumped into this older piece of code from Patterson's book and I'm wondering how the cache size and latency are derived from it. The book shows a plot of loop stride vs. read latency. But wouldn't the cache size, instruction set, compiler, etc. be different on every architecture?
http://www.hpl.hp.com/research/cacti/aca_ch2_cs2.c
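For reference, here is roughly what I understand the benchmark to be doing: sweep the array size and the stride, time strided accesses, and report nanoseconds per access. This is only my own sketch; the array sizes, iteration count, and timing details are my guesses and are not copied from aca_ch2_cs2.c:

```c
#include <stdio.h>
#include <time.h>

#define MAX_BYTES       (8 * 1024 * 1024)   /* largest array swept; my choice, not from the original */
#define TARGET_ACCESSES (16 * 1024 * 1024)  /* roughly how many accesses to time per data point */

static int x[MAX_BYTES / sizeof(int)];

int main(void)
{
    /* Outer loop: array ("cache") size doubles from 4 KB up to 8 MB. */
    for (long csize = (long)(4 * 1024 / sizeof(int));
         csize <= (long)(MAX_BYTES / sizeof(int)); csize *= 2) {
        /* Inner loop: stride doubles from 1 int (4 B) up to half the array. */
        for (long stride = 1; stride <= csize / 2; stride *= 2) {
            long touched = (csize + stride - 1) / stride;   /* elements hit per pass over the array */
            long passes  = TARGET_ACCESSES / touched + 1;   /* keep total work comparable across points */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long p = 0; p < passes; p++)
                for (long i = 0; i < csize; i += stride)
                    x[i]++;                                 /* one read + one write per touched element */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("size %8ld B  stride %6ld B  %7.2f ns/access\n",
                   csize * (long)sizeof(int), stride * (long)sizeof(int),
                   ns / (double)(passes * touched));
        }
    }
    printf("checksum %d\n", x[0]);   /* keep the compiler from optimizing the loops away */
    return 0;
}
```

(I compiled with something like `gcc -O1`; the book's version, as far as I recall, also subtracts loop/timer overhead, which I skipped here.) Plotting ns/access against array size, one curve per stride, is what should produce the knees in the plot the question is about.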
There is a similar post on memory benchmarking, but its code link no longer works, so I can't compare; it may well be the same code. Since I can't attach the original plot from the book, I'll refer to the plot in that post instead:
Memory benchmark plot: understanding cache behaviour
What can be seen in both plots is that read time starts ramping up once the array size goes above 64KB, presumably because of increasing misses. Is that 64KB the page size or something else? According to the post, the stride shows how data locality and TLB usage affect read time, and the 4K stride had the worst performance. Can someone elaborate a bit more on the impact of the L1 TLB vs. the L2 TLB? I don't fully understand the explanation there.
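Here is the back-of-envelope arithmetic I've been using to think about the TLB part; the page size is the usual 4 KB, but the TLB entry counts are typical values I assumed, not something stated in the post or the book:

```c
#include <stdio.h>

int main(void)
{
    const long page   = 4 * 1024;   /* 4 KB pages */
    const long l1_tlb = 64;         /* assumed L1 DTLB entries */
    const long l2_tlb = 1024;       /* assumed L2 (unified) TLB entries */
    const long stride = 4 * 1024;   /* the worst-case 4K stride from the plot */

    /* With a 4 KB stride every access lands on a different page, so the
     * TLB "reach" (entries x page size) limits how big the array can get
     * before translations start missing. */
    printf("L1 TLB reach: %ld KB\n", l1_tlb * page / 1024);   /* 256 KB  */
    printf("L2 TLB reach: %ld KB\n", l2_tlb * page / 1024);   /* 4096 KB */
    printf("pages touched per MB of array at this stride: %ld\n",
           (1024 * 1024) / stride);                           /* 256 */
    return 0;
}
```

Is this the right way to think about why the 4K stride curve degrades earlier than the others, i.e. the array outgrows the L1 TLB reach first, then the L2 TLB reach?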