Could anyone explain to me the performance behaviour of ForkJoinPool in Java 17?
I wrote this very simple benchmark and experimented with the number of threads and the divider. Each leaf task just loops numOfOps/divider times, adding 1 to the result on every iteration (divide-and-conquer technique).
The benchmark source code:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class Benchmark {
    static long numOfOps = 90_000_000_000L;
    static int numOfThreads = 10;
    static int divider = 8;

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            ForkJoinPool forkJoinPool = new ForkJoinPool(numOfThreads);
            System.out.println(forkJoinPool.invoke(new MyFork(0, numOfOps)));
            long finish = System.currentTimeMillis();
            System.out.println("Execution time " + i + ": " + (finish - start));
        }
    }

    static class MyFork extends RecursiveTask<Long> {
        long from, to;

        public MyFork(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= numOfOps / divider) {
                // Leaf task: count the iterations sequentially
                long j = 0;
                for (long i = from; i < to; i++) {
                    j++;
                }
                return j;
            } else {
                // Split the range in half and compute both halves in parallel
                long middle = Math.round((double) (from + to) / 2);
                MyFork firstVal = new MyFork(from, middle);
                MyFork secondVal = new MyFork(middle, to);
                firstVal.fork();
                secondVal.fork();
                return firstVal.join() + secondVal.join();
            }
        }
    }
}
The results are a bit puzzling to me.
After the second iteration of the benchmark, performance increases dramatically. Where does this boost come from? Is it the CPU cache or a JIT compiler optimization?
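One way I thought of to test the JIT hypothesis would be to pull the counting loop into its own method and warm it up before timing it (running with -XX:+PrintCompilation should also show when the method gets compiled). This is just a sketch; the class name, warm-up count, and iteration counts are my own choices, not part of the benchmark above:

```java
public class WarmupCheck {
    // Same counting loop as the benchmark's leaf task
    static long count(long from, long to) {
        long j = 0;
        for (long i = from; i < to; i++) {
            j++;
        }
        return j;
    }

    public static void main(String[] args) {
        // Warm up: call the method many times so the JIT gets a chance to compile it
        for (int w = 0; w < 20_000; w++) {
            count(0, 1_000);
        }
        // Now time a large run with the (presumably) compiled method
        long start = System.currentTimeMillis();
        long result = count(0, 1_000_000_000L);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(result + " took " + elapsed + " ms");
    }
}
```

If the warmed-up run is as fast as iterations 2-9 of the benchmark, that would point at the JIT rather than the CPU cache.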
I also found it interesting that the best and most consistent results are achieved when the pool has at least two more threads than the CPU has hardware threads.
I have a PC with a Core i7 860 (8 threads) overclocked to 3.5 GHz and a laptop with an i5 4200U at 2.3 GHz (4 threads).
Core i7 860 (Threads: 10; divider 8):
90000000000 Execution time 0: 12359
90000000000 Execution time 1: 18250
90000000000 Execution time 2: 813
90000000000 Execution time 3: 828
90000000000 Execution time 4: 812
90000000000 Execution time 5: 828
90000000000 Execution time 6: 813
90000000000 Execution time 7: 828
90000000000 Execution time 8: 828
90000000000 Execution time 9: 813
Core i7 860 (Threads: 1; divider 1):
90000000000 Execution time 0: 52031
90000000000 Execution time 1: 51959
90000000000 Execution time 2: 6140
90000000000 Execution time 3: 6157
90000000000 Execution time 4: 6157
90000000000 Execution time 5: 6109
90000000000 Execution time 6: 6141
90000000000 Execution time 7: 6125
90000000000 Execution time 8: 6125
90000000000 Execution time 9: 6140
Now let's see the results from the laptop:
Core i5 4200U (Threads: 6, divider 4):
90000000000 Execution time 0: 18432
90000000000 Execution time 1: 28317
90000000000 Execution time 2: 1844
90000000000 Execution time 3: 1844
90000000000 Execution time 4: 1828
90000000000 Execution time 5: 1828
90000000000 Execution time 6: 1843
90000000000 Execution time 7: 1828
90000000000 Execution time 8: 1844
90000000000 Execution time 9: 1843
Considering the clock-speed and core-count difference, nothing surprising here. But now let's look at the single-threaded result:
Core i5 4200U (Threads: 1, divider 1):
90000000000 Execution time 0: 51153
90000000000 Execution time 1: 50170
90000000000 Execution time 2: 3031
90000000000 Execution time 3: 3000
90000000000 Execution time 4: 3031
90000000000 Execution time 5: 3031
90000000000 Execution time 6: 2984
90000000000 Execution time 7: 3000
90000000000 Execution time 8: 2999
90000000000 Execution time 9: 3016
Now the laptop CPU is twice as fast! My assumption is that the JVM started using AVX2 instructions, which are missing in the old desktop CPU. Am I right?
I also found that these optimizations only appear when using ForkJoinPool. Without it, the main thread keeps executing the loop consistently slowly (~50000 ms in this example).
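For completeness, the single-threaded baseline without ForkJoinPool looks roughly like this (a sketch, not the exact code I ran; the class name and loop() helper are just for illustration):

```java
public class PlainLoop {
    // Same counting work as the benchmark, but run directly on the main thread
    static long loop(long n) {
        long j = 0;
        for (long k = 0; k < n; k++) {
            j++;
        }
        return j;
    }

    public static void main(String[] args) {
        long numOfOps = 90_000_000_000L;
        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            long result = loop(numOfOps);
            long finish = System.currentTimeMillis();
            System.out.println(result + " Execution time " + i + ": " + (finish - start));
        }
    }
}
```

With this version every iteration stays at roughly the same (slow) speed for me, unlike the ForkJoinPool version, which speeds up dramatically after the second iteration.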
Maybe someone knows what's happening here?
This is my first question here, sorry for any mistakes.