1

I always have thought and known that multidimensional arrays to which indexing is done only once by multiplication is faster than arrays of arrays to which indexing is done by two pointer dereferencing, due to better locality and space saving.

I ran a small test a while ago, and the result was quite surprising. At least my callgrind profiler reported that the same function using array of arrays run slightly faster.

I wonder whether I should change the definition of my matrix class to use an array of arrays internally. This class is used virtually everywhere in my simulation engine (? not exactly sure how to call..), and I do want to find the best way to save a few seconds.

test_matrix has the cost of 350 200 020 and test_array_array has the cost of 325 200 016. The code was compiled with -O3 by clang++. All member functions are inlined according to the profiler.

#include <iostream>
#include <memory>

template<class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
    BasicArray() = default;
    BasicArray(std::size_t);
};

template<class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size]) {}

template<class T>
class Matrix : public BasicArray<T> {
public:
    Matrix() = default;
    Matrix(std::size_t, std::size_t);
    T &operator()(std::size_t, std::size_t) const;
    std::size_t get_index(std::size_t, std::size_t) const;
    std::size_t get_size(std::size_t) const;

private:
    std::size_t sizes[2];
};

template<class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes {i, j} {}

template<class T>
T &Matrix<T>::operator()(std::size_t i, std::size_t j) const {
    return (*this)[get_index(i, j)];
}

template<class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const {
    return i * get_size(2) + j;
}

template<class T>
std::size_t Matrix<T>::get_size(std::size_t d) const {
    return sizes[d - 1];
}

template<class T>
class Array : public BasicArray<T> {
public:
    Array() = default;
    Array(std::size_t);
    std::size_t get_size() const;

private:
    std::size_t size;
};

template<class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size) {}

template<class T>
std::size_t Array<T>::get_size() const {
    return size;
}

static void __attribute__((noinline)) test_matrix(const Matrix<int> &m) {
    for (std::size_t i = 0; i < m.get_size(1); ++i) {
        for (std::size_t j = 0; j < m.get_size(2); ++j) {
            static_cast<volatile void>(m(i, j) = i + j);
        }
    }
}

static void __attribute__((noinline))
test_array_array(const Array<Array<int>> &aa) {
    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
            static_cast<volatile void>(aa[i][j] = i + j);
        }
    }
}

int main() {
    constexpr int N = 1000;
    Matrix<int> m(N, N);
    Array<Array<int>> aa(N);
    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        aa[i] = Array<int>(N);
    }
    test_matrix(m);
    test_array_array(aa);
}
  • 2
    You're doing what many people do: guessing what you should worry about. Rather, let the program tell you what you should focus on - it is probably something else entirely. [*Here's an example.*](http://stackoverflow.com/a/927773/23771) – Mike Dunlavey Aug 07 '15 at 20:06
  • Have you considered converting to raw C? If you really want to save time, my experience is that removing class- and method-resolution can be a huge time saver. –  Aug 07 '15 at 20:06
  • Maybe you want to benchmark that with an access pattern other than increasing indices in storage order, which allows an optimizing compiler to remove both the multiplications and the majority of indirections. (And what @MikeDunlavey said.) – rici Aug 07 '15 at 20:08
  • @MikeDunlavey Maybe I should explain why I'm doing this. What I meant by a simulation engine is a Go (board game) engine which produces a move by doing millions of Monte-Carlo-like simulations. It is a very heavy computation and if can produce a quality move in 9 seconds where before it used to take 10 seconds, I see it as a good improvement. –  Aug 07 '15 at 20:10
  • @JohnPirie It was originally C, and I still like C much better. I just decided to move to C++ for better flexibility in changing the program structure when needed after some massive re-writings. –  Aug 07 '15 at 20:12
  • @xiver77: Regardless, no amount of guessing what you should optimize can stand up to good diagnosis. If you are right, it will tell you. If it is something else, it will tell you. If it is something else, and you fix it, then maybe array indexing is the next biggest problem, maybe later. To expect a mere 10% in a freshly-written program is selling yourself short, and there a huge difference between guessing what the problem is, and *knowing* what it really is. – Mike Dunlavey Aug 07 '15 at 20:23
  • @rici Thank you for your suggestion, I've just ran a test with `rand` for array indexing paired with `srand(0)` at function entry to get the same indexing sequence. The result is not much different. `test_array_array` still runs slightly faster. –  Aug 07 '15 at 20:27
  • Direct measurements with `clock_gettime` show that the array version is actually about two times slower than the matrix version. – n. 1.8e9-where's-my-share m. Aug 07 '15 at 21:11
  • If you swap the last two lines of main do you get different results? It may just be an issue with caching and memory fragmentation or something. Each test should be called multiple times within one program, alternating between each of the two types. – Aaron McDaid Aug 07 '15 at 21:25
  • @n.m. I did the same, but the results doesn't differ much from that of callgrind. –  Aug 07 '15 at 22:35
  • @AaronMcDaid Yes! I ran the same test 10 times in a for loop, and one extremely interesting fact is that in the first iteration the matrix version is always slow, but after that the matrix version does run faster as expected. I never thought the order can result in speed difference. Very interesting to know. Thanks! –  Aug 07 '15 at 22:37
  • Which timer did you use? – n. 1.8e9-where's-my-share m. Aug 08 '15 at 10:16
  • @n.m. `std::chrono::high_resolution_clock` and also `clock_gettime` with `CLOCK_MONOTONIC` out of curiosity. No difference though. –  Aug 08 '15 at 10:25
  • What about CLOCK_PROCESS_CPUTIME_ID? – n. 1.8e9-where's-my-share m. Aug 08 '15 at 10:28
  • @n.m. I'll have to fix my timer class again for that. Is that likely to make a big difference? I made `N` big enough to let the program run for several seconds. –  Aug 08 '15 at 10:31

0 Answers0