
I recently decided to play around with Java's new incubating Vector API, to see how fast it can get. I implemented two fairly simple methods: one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.

Here's my code:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

import java.nio.charset.StandardCharsets;

public class SIMDParse {

    // Place value for each of the 16 int lanes; the first 6 lanes hold the
    // left-padding and contribute nothing to the sum.
    private static IntVector mul = IntVector.fromArray(
            IntVector.SPECIES_512,
            new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
            0
    );
    private static byte zeroChar = (byte) '0';
    private static int width = IntVector.SPECIES_512.length();
    private static byte[] filler;

    static {
        filler = new byte[16];
        for (int i = 0; i < 16; i++) {
            filler[i] = zeroChar;
        }
    }

    public static int parseInt(String str) {
        boolean negative = str.charAt(0) == '-';
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (negative) {
            bytes[0] = zeroChar;
        }
        // Left-pad with '0' bytes so the digits line up with the place values.
        bytes = ensureSize(bytes, width);
        ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
        vec = vec.sub(zeroChar);                 // ASCII digits -> numeric values
        IntVector ints = (IntVector) vec.castShape(IntVector.SPECIES_512, 0);
        ints = ints.mul(mul);                    // digit * place value, per lane
        return ints.reduceLanes(VectorOperators.ADD) * (negative ? -1 : 1);
    }

    // Pads the array on the left with '0' bytes up to a multiple of per.
    public static byte[] ensureSize(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - mod + per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, per - mod, arr.length);
        System.arraycopy(filler, 0, newArr, 0, per - mod);
        return newArr;
    }

    // Pads the array on the right with zero bytes up to a multiple of per.
    public static byte[] ensureSize2(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - mod + per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, 0, arr.length);
        return newArr;
    }

    public static int indexOf(String s, char c) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        int width = ByteVector.SPECIES_MAX.length();
        byte bChar = (byte) c;
        b = ensureSize2(b, width);
        for (int i = 0; i < b.length; i += width) {
            ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_MAX, b, i);
            // firstTrue() returns the vector length when no lane matched.
            int pos = vec.compare(VectorOperators.EQ, bChar).firstTrue();
            if (pos != width) {
                return pos + i;
            }
        }
        return -1;
    }

}

I fully expected my int parsing to be slower, since it won't ever be handling more than the vector size can hold (an int can never be more than 10 digits long).

By my benchmarks, parsing 123 as an int 10k times took 3081 microseconds with Integer.parseInt, versus 80601 microseconds with my implementation. Searching for 'a' in a very long string ("____".repeat(4000) + "a" + "----".repeat(193)) took 7709 microseconds with mine, versus 7 with String#indexOf.

Why is it so unbelievably slow? I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these.

Redempt
  • What hardware (and JVM version) did you test on? Also, you should update this with info from your comments on my answer, that apparently the long-string test was only one repeat. – Peter Cordes Jun 21 '21 at 03:35

2 Answers


You picked something SIMD is not great at (string->int), and something that JVMs are very good at optimizing out of loops. And you made an implementation with a bunch of extra copying work if the inputs aren't exact multiples of the vector width.


I'm assuming your times are totals (for 10k repeats each), not a per-call average.

7 us is impossibly fast for that.

"____".repeat(4000) is 16k bytes before the 'a', which I assume is what you're searching for. Even a well-tuned / unrolled memchr (aka indexOf) running at 2x 32-byte vectors per clock cycle, on a 4GHz CPU, would take 625 us for 10k reps. (16000B / (64B/c) * 10000 reps / 4000 MHz). And yes, I'd expect a JVM to either call the native memchr or use something equally efficient for a commonly-used core library function like String#indexOf. For example, glibc's avx2 memchr is pretty well-tuned with loop unrolling; if you're on Linux, your JVM might be calling it.

Built-in String indexOf is also something the JIT "knows about". It's apparently able to hoist it out of loops when it can see that you're using the same string repeatedly as input. (But then what's it doing for the rest of those 7 us? I guess doing a not-quite-so-great memchr and then doing an empty 10k iteration loop at 1/clock could take about 7 microseconds, especially if your CPU isn't as fast as 4GHz.)

See Idiomatic way of performance evaluation? - if doubling the repeat-count to 20k doesn't double the time, your benchmark is broken and not measuring what you think it does.
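As a concrete illustration of that advice (a hypothetical repeat-loop harness, not the OP's actual benchmark; the class and variable names are made up), consuming every result and comparing two repeat counts makes hoisting or dead-code elimination show up as a non-doubling time:

public class IndexOfBench {
    public static void main(String[] args) {
        // Same haystack shape as the question; 'a' sits ~16k bytes in.
        String haystack = "____".repeat(4000) + "a" + "----".repeat(193);
        long sink = 0;
        // Warm-up so the JIT compiles the hot path before timing.
        for (int i = 0; i < 50_000; i++) {
            sink += haystack.indexOf('a');
        }
        for (int reps : new int[] {10_000, 20_000}) {
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) {
                sink += haystack.indexOf('a');
            }
            long us = (System.nanoTime() - start) / 1_000;
            // If 20k reps isn't roughly 2x the 10k time, the loop isn't
            // measuring the work you think it is.
            System.out.println(reps + " reps: " + us + " us (sink=" + sink + ")");
        }
    }
}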

Your manual SIMD indexOf is very unlikely to get optimized out of a loop, and it copies the whole array every call (in ensureSize2) whenever the size isn't an exact multiple of the vector width! The normal technique is to fall back to scalar for the last size % width elements, which is obviously much better for large arrays; a sketch follows. Or, even better, do an unaligned load that ends at the end of the array (if the total size is >= the vector width), for cases where re-checking some overlapping bytes is not a problem.
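A minimal sketch of that scalar-tail idea (my code, not the OP's; it assumes the target fits in one byte, i.e. ASCII, and uses SPECIES_PREFERRED rather than SPECIES_MAX, so treat it as illustrative):

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorIndexOf {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Vectorize the body of the array, then fall back to scalar for the
    // last (length % width) bytes instead of copying into a padded array.
    static int indexOf(byte[] b, byte target) {
        int i = 0;
        int upper = SPECIES.loopBound(b.length);   // largest multiple of the width <= length
        for (; i < upper; i += SPECIES.length()) {
            ByteVector vec = ByteVector.fromArray(SPECIES, b, i);
            int pos = vec.compare(VectorOperators.EQ, target).firstTrue();
            if (pos != SPECIES.length()) {
                return i + pos;
            }
        }
        for (; i < b.length; i++) {                // scalar tail, no copying
            if (b[i] == target) {
                return i;
            }
        }
        return -1;
    }
}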

A decent memchr on modern x86 (using an algorithm like your indexOf, without unrolling) should go at about 1 vector (16/32/64 bytes) per maybe 1.5 clock cycles with data hot in L1d cache. (Checking both the vector compare and the pointer bound as possible loop exit conditions takes extra asm instructions vs. a simple strlen, but see this answer for some microbenchmarks of a simple hand-written strlen that assumes aligned buffers.) Your indexOf loop probably bottlenecks on front-end throughput on a CPU like Skylake, with its pipeline width of 4 uops/clock.

So let's guess that your implementation takes 1.5 cycles per 16-byte vector, assuming you're on a CPU without AVX2 (you didn't say what hardware you tested on).

16kB / 16B = 1000 vectors. At 1 vector per 1.5 clocks, that's 1500 cycles. On a 3GHz machine, 1500 cycles takes 500 ns = 0.5 us per call, or 5000 us per 10k reps. But since the string (16773 bytes) isn't a multiple of 16 bytes long, you're also copying the whole thing every call, which costs some more time and could plausibly account for your 7709 us total.


What SIMD is good for

"I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these."

No, "horizontal" stuff like ints.reduceLanes is something SIMD is generally slow at. And even with something like How to implement atoi using SIMD? using x86 pmaddwd to multiply and add pairs horizontally, it's still a lot of work.

Note that to make the elements wide enough to multiply by place-values without overflow, you have to unpack, which costs some shuffling. ints.reduceLanes takes about log2(elements) shuffle/add steps, and if you're starting with 512-bit AVX-512 vectors of int, the first 2 of those shuffles are lane-crossing, 3 cycle latency (https://agner.org/optimize/). (Or if your machine doesn't even have AVX2, then a 512-bit integer vector is actually 4x 128-bit vectors. And you had to do separate work to unpack each part. But at least the reduction will be cheap, just vertical adds until you get down to a single 128-bit vector.)
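To get a feel for what that per-part unpacking looks like at the Vector API level, here's a rough sketch (my illustration, not benchmarked) of widening 16 digit bytes to four 128-bit int vectors with convertShape, accumulating with vertical adds, and doing one final 4-lane reduction:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

public class UnpackSketch {
    // With only 128-bit vectors, widening 16 bytes to 16 ints takes four
    // separate byte->int conversions (parts 0..3). Accumulating vertically
    // leaves just one cheap reduction over a single 128-bit vector.
    static int dotPlaceValues(byte[] digits, int[] placeValues) {
        ByteVector bytes = ByteVector.fromArray(ByteVector.SPECIES_128, digits, 0);
        IntVector acc = IntVector.zero(IntVector.SPECIES_128);
        for (int part = 0; part < 4; part++) {
            IntVector quarter = (IntVector) bytes.convertShape(
                    VectorOperators.B2I, IntVector.SPECIES_128, part);
            IntVector weights = IntVector.fromArray(IntVector.SPECIES_128, placeValues, part * 4);
            acc = acc.add(quarter.mul(weights));   // vertical add, no shuffle
        }
        return acc.reduceLanes(VectorOperators.ADD);  // one 4-lane reduction
    }
}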

Peter Cordes
  • I should have clarified - it was only 10k calls for the int parsing. For indexOf, each was timed for a single call. – Redempt Jun 21 '21 at 03:08
  • Just wrote a basic test with a method for scalar multiply and vector multiply on an array - it seems the runtime is nearly identical for both small and large arrays, both taking about 9ms on an array of size 2560000. https://hastebin.com/pexakenihe.java – Redempt Jun 21 '21 at 03:21
  • @Redempt: Oh, then 7 us for *one* call is pathetic. Maybe HotSpot doesn't handle indexOf specially? But IDK why / how one call to your manual intrinsics version could take so long either, unless your benchmark methodology is totally flawed and you're missing a warm-up run. I'd suggest using a repeat loop. – Peter Cordes Jun 21 '21 at 03:30
  • @Redempt: `out[i] = arr[i] * scalar;` - That's *so* easy to auto-vectorize that I think modern HotSpot will actually do it for you while JITting. ([Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?](https://stackoverflow.com/a/20267515) says it's been in since Java 7u40) At least that's good confirmation that the SIMD API isn't making things worse when used for normal vertical-SIMD things over arrays. (Or at least that both ways bottleneck on memory bandwidth.) – Peter Cordes Jun 21 '21 at 03:32

Hmm. I found this post because I've hit something strange with the Vector performance for something it should ostensibly be ideal for - multiplying two double arrays.

  private static void doVector(int iteration, double[] input1, double[] input2, double[] output) {
    Instant start = Instant.now();
    for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
      DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
      DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
      // Multiply lane-wise, then copy the result out through a temporary array.
      System.arraycopy(va.mul(vb).toArray(), 0, output, i, SPECIES.length());
    }
    Instant finish = Instant.now();
    System.out.println("vector duration " + iteration + ": " + Duration.between(start, finish).getNano());
  }

The species length comes out at 4 on my machine (CPU is Intel i7-7700HQ at 2.8 GHz).

On my first attempt the execution took more than 15 milliseconds (compared with 0 for the scalar equivalent), even with a tiny array length (8 elements). On a hunch I added the iteration parameter to see whether something had to warm up - and indeed, the first iteration still ALWAYS takes ages (44 ms for 65536 elements). While most of the other iterations report zero time, a few take around 15 ms, and they are randomly distributed (i.e. not always the same iteration index on each run). I sort of expect that, since I'm measuring wall-clock time and other stuff will be going on.

However, overall for an array size of 65536 elements, and 32 iterations, the total duration for the vector approach is 2-3 times longer than that for the scalar one.
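One possible contributor, separate from warm-up: va.mul(vb).toArray() allocates a fresh double[] on every iteration, and System.arraycopy then copies it a second time. DoubleVector has intoArray for writing lanes directly into the destination; a sketch of the same loop using it (illustrative, not benchmarked):

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

class MulIntoArray {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Same element-wise multiply, but the product is written straight into
    // 'output' with intoArray, so no temporary array is allocated per step.
    static void mul(double[] input1, double[] input2, double[] output) {
        int upper = SPECIES.loopBound(input1.length);
        for (int i = 0; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
            va.mul(vb).intoArray(output, i);
        }
    }
}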

Tim V