
I'm trying to use 256-bit SIMD registers as a byte array, without going through the L1 cache, on my AMD FX (Piledriver) CPU.

Do the AVX/AVX2 intrinsics available in C++ include any instruction that indexes a 256-bit register as if it were a plain char array?

For example, I have a simple loop like this:

// sz: up to 20 billion
// ptr: arbitrary pointer from a vector or string
unsigned char tmp[64];
for(size_t i=0;i<sz;i+=64)
{
    for(int j=0;j<64;j++)
        tmp[j]=ptr[i+j];

    // a serial algorithm doing only CPU-intensive work on the tmp elements, each of which depends on the previous one
    // compute tmp[0] .. do work .. compute tmp[1] .. do work ..

    for(int j=0;j<64;j++)
        result += tmp[j];

}

All of the work on tmp[0], tmp[1], etc. depends on the result computed for the previous element, so it is inherently serial.
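For reference, the loop above can be made self-contained roughly as follows. This is only a sketch: `serial_step` is a hypothetical placeholder (a running prefix sum modulo 256) standing in for the real dependent computation, `process` is a name made up for illustration, and the tail handling with `std::min` is an addition so that a `sz` that is not a multiple of 64 does not read past the end of the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical stand-in for the real serial algorithm: every element
// depends on the element computed just before it (prefix sum mod 256).
inline void serial_step(unsigned char* tmp, std::size_t n) {
    for (std::size_t j = 1; j < n; ++j)
        tmp[j] = static_cast<unsigned char>(tmp[j] + tmp[j - 1]);
}

// Processes the input in blocks of up to 64 bytes and returns the sum
// of all bytes after the serial step has been applied to each block.
std::uint64_t process(const unsigned char* ptr, std::size_t sz) {
    assert(ptr != nullptr || sz == 0);
    std::uint64_t result = 0;
    unsigned char tmp[64];
    for (std::size_t i = 0; i < sz; i += 64) {
        // The last block may be shorter than 64 bytes.
        const std::size_t n = std::min<std::size_t>(64, sz - i);
        std::copy_n(ptr + i, n, tmp);                    // load block
        serial_step(tmp, n);                             // dependent work
        result = std::accumulate(tmp, tmp + n, result);  // reduce block
    }
    return result;
}
```

Whether `tmp` ends up in registers or on the stack (and therefore in L1) is decided by the compiler; this version only states the intent with standard algorithms.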

I could not vectorize the inner loop, so I need to extract the bytes manually. After looking through the avx2intrin header, I don't see any function that returns an int8_t element.

When I simply use size_t instead of unsigned char, the load/store performance is much better (nearly 2x). For that test I used size_t vectors, one eighth the length of the unsigned char vectors, as both the input and the result. So I guess load/store performance would get even better with SIMD load/store instructions. But how can I extract (and insert) bytes one by one from/into a 256-bit register without leaving the ALU (not even going to L1)?

Can std::transform on a char array do the loop unrolling needed to hide access latency, or make other improvements, or is a plain for loop without extra modifications enough (for both portability and performance)?
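To make the size_t variant concrete, here is a minimal sketch of treating a 64-bit word as an 8-byte array using shifts and masks. The helper names `get_byte`, `set_byte`, and `sum_bytes` are made up for illustration, and it assumes the input holds a whole number of 8-byte words. Byte 0 is the least significant byte, which corresponds to the first byte in memory only on a little-endian CPU such as x86.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read byte j (0 = least significant) of a 64-bit word.
inline std::uint8_t get_byte(std::uint64_t w, unsigned j) {
    assert(j < 8);
    return static_cast<std::uint8_t>(w >> (8 * j));
}

// Return a copy of w with byte j replaced by v.
inline std::uint64_t set_byte(std::uint64_t w, unsigned j, std::uint8_t v) {
    assert(j < 8);
    const std::uint64_t mask = std::uint64_t{0xFF} << (8 * j);
    return (w & ~mask) | (std::uint64_t{v} << (8 * j));
}

// Sum all bytes of n_words 64-bit words starting at ptr.
std::uint64_t sum_bytes(const unsigned char* ptr, std::size_t n_words) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < n_words; ++i) {
        std::uint64_t w;
        // memcpy is the alignment-safe way to express one 8-byte load.
        std::memcpy(&w, ptr + 8 * i, sizeof w);
        for (unsigned j = 0; j < 8; ++j)
            result += get_byte(w, j);
    }
    return result;
}
```

With a compile-time-constant `j` the shift and mask can stay entirely in general-purpose registers, but that is up to the optimizer; this shows the idea and is not a measured result.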

How does the latency of an L1 access compare with the latency of indexing a byte inside an AVX register? Is it worth optimizing for this, or is it pointless?

huseyin tugrul buyukisik
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/242681/discussion-on-question-by-huseyin-tugrul-buyukisik-is-there-an-instruction-for-a). – Machavity Mar 07 '22 at 13:41
