0

I have a database record of around 1000000 paragraphs with around ~500 characters each. By reading all the records, I need to get the list of alphabet ordered by most to least used.

I mock the database reading by creating stream up to 1000000 then process the stream in parallel

final Map<Character, Long> charCountMap = new ConcurrentHashMap<>();
for (char c = 'a'; c <= 'z'; c++) {
    charCountMap.put(c, 0l);
}

System.out.println("Parallel Stream");
long start = System.currentTimeMillis();
Stream.iterate(0, i -> i).limit(1000000).parallel() //mock database stream
    .forEach(i-> RandomStringUtils.randomAlphanumeric(500)
    .toLowerCase().chars().mapToObj(c -> Character.valueOf((char) c)).filter(c -> c >= 97 && c <= 122)
    .forEach(c -> charCountMap.compute(c, (k, v) -> v + 1))); //update ConcurrentHashMap

long end = System.currentTimeMillis();
System.out.println("Parallel Stream time spent :" + (end - start));

System.out.println("Serial Stream"); start = System.currentTimeMillis();
Stream.iterate(0, i -> i).limit(1000000) //mock database stream
    .forEach(i-> RandomStringUtils.randomAlphanumeric(500)
    .toLowerCase().chars().mapToObj(c -> Character.valueOf((char) c)).filter(c -> c >= 97 && c <= 122)
    .forEach(c -> charCountMap.compute(c, (k, v) -> v + 1)));
end = System.currentTimeMillis();
System.out.println("Serial Stream time spent :" + (end - start));

I initially thought that parallel stream would be faster even with expected overhead for stream larger than 100,000. However, test shows that serial stream is ~5X faster than parallel even for 1,000,000 records.

I suspected it was because of updating the ConcurrentHashMap. But when I removed it and change with empty function, there is still significant performance gap.

Is there something wrong in my database mock up call or the way I use parallel stream?

NothingBox
  • 252
  • 3
  • 13
  • Going by memory here, so take it with a grain of salt, but when calculating elapsed time, you should use `System.nanoTime()`, not `System.currentTimeMillis()`. If you have to do benchmarking on you own, this [question](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) is a great source. – Chaosfire May 21 '22 at 11:45
  • You can't make any conclusions without warm-up runs and proper benchmarks. Just try swapping over the order of serial and parallel code sections, and you may come to the opposite conclusion because the first iteration test is slowest. Also the random generator may not be threadsafe (cannot tell as you don't show the code) which may mean that the parallel version can never be quickest. – DuncG May 21 '22 at 15:37
  • I have swapped the order and got the same result. The RandomStringUtils I am using is from Apache commons-lang library – NothingBox May 21 '22 at 15:56

0 Answers0