Could anyone explain to me the performance behaviour of ForkJoinPool in Java 17?
I wrote this very simple benchmark and experimented with the number of threads and the divider. Each leaf task just loops numOfOps/divider times, adding 1 to the result on every iteration (divide-and-conquer technique).
The benchmark source code:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class Benchmark {
    static long numOfOps = 90_000_000_000L;
    static int numOfThreads = 10;
    static int divider = 8;

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            ForkJoinPool forkJoinPool = new ForkJoinPool(numOfThreads);
            System.out.println(forkJoinPool.invoke(new MyFork(0, numOfOps)));
            long finish = System.currentTimeMillis();
            System.out.println("Execution time " + i + ": " + (finish - start));
        }
    }

    static class MyFork extends RecursiveTask<Long> {
        long from, to;

        public MyFork(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= numOfOps / divider) {
                // Leaf task: count the iterations sequentially
                long j = 0;
                for (long i = from; i < to; i++) {
                    j++;
                }
                return j;
            } else {
                // Split the range in half and compute both halves in parallel
                long middle = Math.round((double) (from + to) / 2);
                MyFork firstVal = new MyFork(from, middle);
                MyFork secondVal = new MyFork(middle, to);
                firstVal.fork();
                secondVal.fork();
                return firstVal.join() + secondVal.join();
            }
        }
    }
}
The results are a bit puzzling to me.
After the second iteration of the benchmark, performance increases dramatically. Where does this boost come from? Is it the CPU cache or a JIT compiler optimization?
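One way I thought of to test the JIT hypothesis would be to pull the counting loop into its own method and warm it up before timing it (running with -XX:+PrintCompilation should also show when the method gets compiled). This is just a sketch; the class name, warm-up count, and iteration counts are my own choices, not part of the benchmark above:

```java
public class WarmupCheck {
    // Same counting loop as the benchmark's leaf task
    static long count(long from, long to) {
        long j = 0;
        for (long i = from; i < to; i++) {
            j++;
        }
        return j;
    }

    public static void main(String[] args) {
        // Warm up: call the method many times so the JIT gets a chance to compile it
        for (int w = 0; w < 20_000; w++) {
            count(0, 1_000);
        }
        // Now time a large run with the (presumably) compiled method
        long start = System.currentTimeMillis();
        long result = count(0, 1_000_000_000L);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(result + " took " + elapsed + " ms");
    }
}
```

If the warmed-up run is as fast as iterations 2-9 of the benchmark, that would point at the JIT rather than the CPU cache.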
I also found it interesting that the best and most consistent results are achieved when the pool has at least two more threads than the CPU has hardware threads.
I have a PC with a Core i7 860 (8 threads) overclocked to 3.5 GHz and a laptop with an i5 4200U at 2.3 GHz (4 threads).
Core i7 860 (Threads: 10; divider 8):
90000000000 Execution time 0: 12359
90000000000 Execution time 1: 18250
90000000000 Execution time 2: 813
90000000000 Execution time 3: 828
90000000000 Execution time 4: 812
90000000000 Execution time 5: 828
90000000000 Execution time 6: 813
90000000000 Execution time 7: 828
90000000000 Execution time 8: 828
90000000000 Execution time 9: 813
Core i7 860 (Threads: 1; divider 1):
90000000000 Execution time 0: 52031
90000000000 Execution time 1: 51959
90000000000 Execution time 2: 6140
90000000000 Execution time 3: 6157
90000000000 Execution time 4: 6157
90000000000 Execution time 5: 6109
90000000000 Execution time 6: 6141
90000000000 Execution time 7: 6125
90000000000 Execution time 8: 6125
90000000000 Execution time 9: 6140
Now let's see the results from the laptop:
Core i5 4200U (Threads: 6, divider 4):
90000000000 Execution time 0: 18432
90000000000 Execution time 1: 28317
90000000000 Execution time 2: 1844
90000000000 Execution time 3: 1844
90000000000 Execution time 4: 1828
90000000000 Execution time 5: 1828
90000000000 Execution time 6: 1843
90000000000 Execution time 7: 1828
90000000000 Execution time 8: 1844
90000000000 Execution time 9: 1843
Considering the clock-speed and core-count difference, nothing surprising here. But now let's look at the single-threaded result:
Core i5 4200U (Threads: 1, divider 1):
90000000000 Execution time 0: 51153
90000000000 Execution time 1: 50170
90000000000 Execution time 2: 3031
90000000000 Execution time 3: 3000
90000000000 Execution time 4: 3031
90000000000 Execution time 5: 3031
90000000000 Execution time 6: 2984
90000000000 Execution time 7: 3000
90000000000 Execution time 8: 2999
90000000000 Execution time 9: 3016
Now the laptop CPU is twice as fast! My assumption is that the JVM started using AVX2 instructions, which are missing in the old desktop CPU. Am I right?
I also found that these optimizations only appear when using ForkJoinPool. Without it, the main thread keeps executing the loop consistently slowly (~50000 ms in this example).
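For completeness, the single-threaded baseline without ForkJoinPool looks roughly like this (a sketch, not the exact code I ran; the class name and loop() helper are just for illustration):

```java
public class PlainLoop {
    // Same counting work as the benchmark, but run directly on the main thread
    static long loop(long n) {
        long j = 0;
        for (long k = 0; k < n; k++) {
            j++;
        }
        return j;
    }

    public static void main(String[] args) {
        long numOfOps = 90_000_000_000L;
        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            long result = loop(numOfOps);
            long finish = System.currentTimeMillis();
            System.out.println(result + " Execution time " + i + ": " + (finish - start));
        }
    }
}
```

With this version every iteration stays at roughly the same (slow) speed for me, unlike the ForkJoinPool version, which speeds up dramatically after the second iteration.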
Maybe someone knows what's happening here?
This is my first question here, sorry for any mistakes.