5

OK, so I'm trying to write a Haskell program which counts prime numbers extremely fast. Presumably I am not the first person to try to do this. (In particular, I'm damned sure I saw some prior art, but I can't find it now...)

Initially, I want to count the number of primes less than 10^11. Currently I've left my program running for about 15 minutes and it's not even half way there yet. Some rabid C++ programmer claims his program only takes 8 seconds. So clearly I'm doing something horrifyingly wrong.

(In case it matters, my current implementation uses IOUArray Integer Bool and multiple threads to process independent subranges of the search space. Currently it takes several seconds to remove all the multiples of 2 from a 10MB array chunk...)

Note that 10^11 is too big for 32-bit arithmetic. Also, 10^11 bits = 12.5 GB, far too much data to fit into Haskell's 32-bit address space. So you can't have the entire bitmap in memory at once. Finally, note that the number of primes less than 10^11 is just a shade less than 2^32, so there's no way you can store the actual integers all at once either.


Edit: Apparently I misread the timing information. What the C++ guy actually claimed was:

  • Counting primes < 10^11 takes 8 minutes using just one core, and 56 seconds using 4 cores. (CPU type not specified.)

  • Counting primes < 10^10 takes 5 seconds. (Not sure how many cores that's using.)

Sorry about the mistake...

Edit: My source code can be found here: http://hpaste.org/72898

MathematicalOrchid
  • Has anybody else tried to do this? Is there some code I can look at? – MathematicalOrchid Aug 09 '12 at 14:55
  • How does your solution compare to http://stackoverflow.com/questions/2221378/the-genuine-sieve-of-eratosthenes-algorithm-used-to-generate-prime-numbers? – Volker Stolz Aug 09 '12 at 14:58
  • What algorithm are you using anyway? If it's just the sieve of Eratosthenes, it's clear that you won't beat a solution with better complexity for such large ranges... – leftaroundabout Aug 09 '12 at 15:01
  • http://hackage.haskell.org/package/primes – Louis Wasserman Aug 09 '12 at 15:15
  • @leftaroundabout Both my program and the one I'm competing against use a trivial sieve of Eratosthenes implementation. – MathematicalOrchid Aug 09 '12 at 16:10
  • @LouisWasserman This package returns a _list_ of prime numbers, which won't fit in a 32-bit address space for this size of number. It also appears to be single-threaded, so it can't use multiple processor cores. – MathematicalOrchid Aug 09 '12 at 16:11
  • What do you mean by "Haskell's 32-bit address space"? That's news to me! – Thomas M. DuBuisson Aug 09 '12 at 16:16
  • @ThomasM.DuBuisson As far as I'm aware, the Haskell Platform is only available for Windows in 32-bit form. Not that it's impossible to make a 64-bit version, just that nobody has done. – MathematicalOrchid Aug 09 '12 at 16:41
  • That GHC isn't available for 64-bit Windows certainly has quite a bit to do with the fact that anyone who wants high performance will normally use a Unixoid operating system... _why don't you_? – leftaroundabout Aug 10 '12 at 00:50
  • How does memory size affect _performance_? I can see how it would affect how much data you can handle, but why would it affect how fast? Last I checked, running in 64-bit mode is only a few percent quicker. – MathematicalOrchid Aug 10 '12 at 08:24
  • @MathematicalOrchid As far as GHC is concerned, the big effect of 64-bit vs. 32-bit is not due to the larger address space or memory (though that occasionally is beneficial too), it's a) more registers in 64-bit architectures, that's very important since GHC produces register-heavy code and often makes a difference of 2× or more, and b) 64-bit `Int#`s, which means much fewer C-calls to GMP in integer computations, and of course `Int` is often sufficient where it wouldn't be on 32-bit systems. That also often makes a huge difference. – Daniel Fischer Aug 27 '12 at 19:50
  • As to the 'C' programmer's claim of 8 minutes for 10^11, that isn't all that fast for multi-core, as the extremely optimized ['C' primesieve](https://code.google.com/p/primesieve/) takes 0.61 seconds to 10^10 and 7.98 seconds to 10^11 on an i7 (3.5 GHz, using eight threads including Hyper Threading, set to a 16 Kilobyte buffer size). I don't know that one can quite reach this speed in Haskell, but one can likely get close using mutable arrays and the same wheel factorization, page segmentation, multi-processing, and composite number culling optimizations. I might take a crack at this sometime. – GordonBGood Jan 02 '14 at 02:40

3 Answers

13

Using the package arithmoi by the excellent StackOverflow teacher Daniel Fischer:

import Math.NumberTheory.Primes.Counting

main = print $ primeCount (10^11)

-- $ time ./arith
-- 4118054813
-- 
-- real 0m0.209s
-- user 0m0.198s
-- sys  0m0.008s

This is 40 times faster than whatever your 'rabid' C++ friend has written; maybe he can learn a thing or two by looking at the Haskell source. Here are the haddocks.

applicative
  • Comparing the runtime of this program with the C++ version is pointless if you don't know at least the hardware specifications of each system. – Frerich Raabe Aug 09 '12 at 16:51
  • My 'hardware' is a semi-antique MacBook. If `MathematicalOrchid` had had the sense to post any code, as I did, I would of course have given a more scientific comparison. That the difference across machines well exceeded an order of magnitude -- and many, many orders of magnitude in the case of `MathematicalOrchid`'s Haskell program -- suggested that `arithmoi` might be onto something. – applicative Aug 09 '12 at 17:10
  • @applicative, of course you actually read "the haddocks" link and realize that the 'primeCount' function you use from the 'arithmoi' package is not a sieve for ranges larger than 30,000, and shouldn't be compared in execution times to a sieve such as the one asked about here; it is a special prime-counting function as developed by D. H. Lehmer and implemented in the 'arithmoi' package. – GordonBGood Jan 07 '14 at 02:24
  • If the "rabid C++" friend used Meissel-Lehmer method, it would still be faster than Haskell. Apples to oranges. The better comparison would be using the nthPrime function to count how many primes, instead of using primeCount. – starflyer Nov 13 '14 at 16:22
  • @starflyer, although you are correct that we should be comparing like with like, why do you think that the same algorithm (in this case the "Meissel-Lehmer method") would be faster if written in C++ than in Haskell? In my experience the Haskell compiler is very efficient and, when combined with the LLVM backend, which is also very efficient, Haskell can often generate machine code which is at least as fast if not faster than the equivalent code in C **as long as the correct Haskell features are selected to optimize for speed and not program conciseness**. – GordonBGood Nov 16 '15 at 01:57
  • As long as the correct C++ features are selected and coded, there's no way Haskell can be faster. So you see I can also apply the circular reasoning to C++. This is the difference though: The distance from standard Haskell to "correct Haskell features to optimize for speed" is pretty big. – starflyer Nov 16 '15 at 05:07
  • @starflyer, why can't Haskell be the same speed as C/C++? "Optimize for speed" in Haskell generally means just not using lists inside tight inner loops and forcing evaluation of otherwise lazy values immediately, to avoid the loss of speed in evaluating "thunks" (deferred execution code); that is still standard Haskell. While it's true that using un-boxed types and in particular un-boxed arrays isn't standard Haskell 98 or 2010, standard Haskell really doesn't matter when GHC is the only actively developed compiler. So let's say optimized GHC can be as fast as optimized C/C++. – GordonBGood Nov 25 '15 at 13:44
  • @applicative, calling an external package seems to not teach anyone anything except that it exists. If one is going to call an external program, one may as well call the state-of-the-art [primecount](https://github.com/kimwalisch/primecount) by the same author as "primesieve" and count the primes to 10^11 in 10's of milliseconds, and the primes to the 64-bit number range (about 1.8 * 10^19) in a few 10's of minutes. If one insists on calling it from Haskell, one writes a wrapper and calls it through the Foreign Function Interface (FFI). But I say again, what has one learned? – GordonBGood Jun 04 '16 at 05:04
  • Further on the algorithms used as per the above comment: Kim Walisch's "primecount" uses a very advanced algorithm and is multi-threaded, whereas Daniel Fischer's Lehmer algorithm as used in "arithmoi" is practically many orders of magnitude slower. KW's "primecount" can practically be used on a modern desktop computer up to about 10^23 in less than a day, whereas DF's "primeCount" function is limited to 10^18, and even that will take something approaching a half day. – GordonBGood Jun 04 '16 at 06:19
12

Some rabid C++ programmer claims his program only takes 8 seconds.

Is that wall-clock time or CPU time?

If wall-clock, and the task is split across 100 CPUs, say, it's not very impressive (it's decent); if split across 1000, it's pitiful.

If it's CPU time:

I'm pretty sure that time is not reached by actually sieving up to 10^11.

With a few more than 4×10^9 primes up to that limit, and assuming a somewhat normal 2-3GHz CPU, 8 seconds is only 1.6-2.4×10^10 cycles, so you'd have 4-6 cycles per prime.

You cannot achieve that with a sieve of Eratosthenes, nor with a sieve of Atkin. Each prime has to be inspected and counted, each composite marked as such and inspected. That gives a theoretical lower bound of two cycles per number in the sieve, not counting e.g. array initialisation, loop bound checking, loop variable updates, redundant markings. You're not going to come near that theoretical bound.

A few data points:

Daniel Bernstein's primegen (Sieve of Atkin), with the sieving blocks adjusted to take full advantage of my 32KB L1-cache, takes 90 seconds to sieve the primes to 10^11 and count them (234 seconds with the default sieve-block size of 8K words) on my Core i5 2410M (2.3GHz). (It's much optimised for the range up to 2^32, but above that, it becomes noticeably slower; for the limit 10^9 the times are 0.49 resp. 0.64 seconds.)

My segmented Sieve of Eratosthenes, using some not exposed internals to avoid list creation, sieves and counts to 10^11 in 340 seconds (sniff :-/, but hey, for 10^9 it took 2.92 seconds - it's getting closer, and somewhere between 10^12 and 10^13 it overtakes primegen :). Using the exposed interface creating a list of primes roughly doubles the time taken, as does compiling it with a 32-bit GHC.

So I'd wager that the reported time of 8 seconds - if CPU time - is, if correct, for an algorithm counting the number of primes without actually sieving the whole way. As indicated by applicative's answer, that can be done much faster.

dafis@schwartz:~/Haskell/Repos/arithmoi> time tests/primeCount 100000000000
4118054813

real    0m0.145s
user    0m0.139s
sys     0m0.006s

Note that 10^11 is too big for 32-bit arithmetic. Also, 10^11 bits = 12.5 GB, far too much data to fit into Haskell's 32-bit address space. So you can't have the entire bitmap in memory at once.

To sieve that range, you have to use a segmented sieve. Even if you're not restricted by a 32-bit address space, using such a large array will yield abysmal performance due to frequent cache misses. Your programme will spend most of its time waiting for data to be transferred from main memory. Sieve in chunks that fit in your L2-cache. (I haven't succeeded in making it faster by having the sieve fit in L1; I guess the overhead of the GHC runtime is too large to make it work.)
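
For concreteness, here is a minimal single-threaded sketch of that chunk-by-chunk structure (deliberately unoptimised: bounds-checked writes and a naive base-prime sieve, just to show the shape - not the code discussed below):

{-# LANGUAGE BangPatterns #-}
import Data.Array.IO
import Control.Monad

-- the base primes up to sqrt limit, from a naive (slow) sieve;
-- good enough to illustrate the segmented structure
smallPrimes :: Int -> [Int]
smallPrimes n = sieve [2 .. n] where
  sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]
  sieve []     = []

-- count primes up to limit, sieving one cache-sized chunk at a time;
-- True marks a composite
countTo :: Int -> Int -> IO Int
countTo limit chunkSize = go (smallPrimes sqrtLimit) 2 0 where
  sqrtLimit = floor (sqrt (fromIntegral limit :: Double))
  go bps lo !acc
    | lo > limit = return acc
    | otherwise  = do
        let hi = min limit (lo + chunkSize - 1)
        arr <- newArray (lo, hi) False :: IO (IOUArray Int Bool)
        forM_ bps $ \p -> do
          -- first multiple of p in this chunk, but no less than p*p
          let s = max (p * p) (((lo + p - 1) `div` p) * p)
          forM_ [s, s + p .. hi] $ \m -> writeArray arr m True
        cs <- getElems arr
        go bps (hi + 1) (acc + length (filter not cs))

main :: IO ()
main = countTo (10 ^ 6) 65536 >>= print  -- 78498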

Also, eliminate the multiples of some small primes from the sieve; that reduces the needed work and additionally improves performance by making the sieve smaller. Eliminating even numbers is trivial, multiples of 3 easy, multiples of 5 not very difficult.

Finally, note that the number of primes less than 10^11 is just a shade less than 2^32, so there's no way you can store the actual integers all at once either.

If you store the sieve as a list of bit-arrays with the multiples of 2, 3 and 5 removed, you need about 3.3GB to store the chunks (a 2-3-5 wheel leaves 8 of every 30 numbers, and 10^11 × 8/30 bits ≈ 3.3GB), so if you really can have up to 4GB, it would fit. But you should rather let the chunks you don't need anymore be garbage-collected immediately.

(In case it matters, my current implementation uses IOUArray Integer Bool and multiple threads to process independent subranges of the search space. Currently it takes several seconds to remove all the multiples of 2 from a 10MB array chunk...)

It does matter.

  • Use Int for the indices and unsafeRead/unsafeWrite to read and modify the array. Integer computations are much slower than Int computations, and the bounds-checking you get with readArray/writeArray really hurts.
  • 10MB chunks are far too large, you lose cache-locality. Use chunks of a few hundred KB at most (L2 cache minus some space for other things needed).
  • Still, it shouldn't take several seconds to remove multiples of 2 even with Integer indices, bounds-checking and 10MB chunks. Can we see your code?

Post-vacation update:

Eight minutes to sieve the primes up to 10^11 is possible without deep wizardry. I don't see how going from one to four cores could yield an eightfold speedup, since there should be no cache-effects here; but be that as it may, without seeing the code, I can't investigate.

So let's take a look at your code.

First, an incorrectness:

vs <-
  mapM
    (\ start -> do
      let block = (start, start + block_size)
      v <- newEmptyMVar
      writeChan queue $ do
        say $ "New chunk " ++ show block
        chunk <- chunk_new block
        sieve_top base chunk
        c <- chunk_count chunk
        putMVar v c
      return v
    )
    (takeWhile (< target) $ iterate (+block_size) base_max)

The numbers base_max + k*block_size appear in two chunks each; if any of them is prime, that prime is counted twice. Also, you should cap the upper bound at target.
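
One way to fix both problems (a sketch against the excerpt above, assuming target is meant to be an exclusive bound):

      -- blocks no longer share an endpoint, and the last one is capped
      let block = (start, min (target - 1) (start + block_size - 1))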

Now to the performance aspect:

One thing that jumps out is that it's real chatty, so chatty that it's measurable once you have adjusted the block_size to the cache (I took 256KB blocks for a 512KB L2 cache) - then the threads are slowed down by fighting for stdout for the if prime < 100 then say $ "Sieve " ++ show prime else return () message.

Let's look at your (silenced) sieving loop:

chunk_sieve :: Chunk -> Integer -> IO ()
chunk_sieve array prime = do
  (min, max) <- getBounds array
  let n0 = min `mod` prime
  let n1 = if n0 == 0 then min else min - n0 + prime
  mapM_
    (\ n -> writeArray array n (n == prime))
    (takeWhile (<= max) $ iterate (+prime) n1)

One thing that costs time is that each index is compared to the prime whose multiples are marked off. Each single comparison is cheap (though considerably more expensive than an Int comparison), but the huge number of comparisons, only one of which may yield True, adds up. Unconditionally writing False and if necessary writing True at the prime's index after the loop yields a considerable speedup.

For timing purposes I've reduced the target to 10^9 and run it on two cores. The original code took 155s (elapsed; 292s user), with the reduced block_size 148s, silenced 143s. Omitting the comparison,

mapM_
  (\ n -> writeArray array n False)
  (takeWhile (<= max) $ iterate (+prime) n1)
when (min <= prime && prime <= max) $ writeArray array prime True

it runs in 131s.

Now it's time for some bigger speedups. Did I already mention that bounds-checking costs a lot of time? Since the loop condition guarantees that no out-of-bounds access is attempted (and the primes are small enough that no Int-overflow can happen), we should really use the unchecked access:

chunk_sieve :: Chunk -> Integer -> IO ()
chunk_sieve array prime = do
  (min, max) <- getBounds array
  let n0 = min `mod` prime
      n1 = if n0 == 0 then min else min - n0 + prime
      n2 = fromInteger (n1 - min)
      mx = fromInteger (max - min)
      pr = fromInteger prime
  mapM_
    (\ n -> unsafeWrite array n False)
    (takeWhile (<= mx) $ iterate (+pr) n2)
  when (min <= prime && prime <= max) $ writeArray array prime True

which reduces the running time to 96s. Much better, but still abysmal. The culprit is

takeWhile (<= mx) $ iterate (+pr) n2

GHC can't fuse that composition well, and you get a list of boxed Ints that is traversed. Replace that with an arithmetic sequence, [n2, n2+pr .. mx], and GHC happily creates a loop using unboxed Int#s: 37 seconds.
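
That is, the marking loop becomes:

mapM_
  (\ n -> unsafeWrite array n False)
  [n2, n2 + pr .. mx]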

Much much better, but still bad. The biggest time-consumer now is

chunk_count :: Chunk -> IO Integer
chunk_count array = do
    (min, max) <- getBounds array
    work min max 0
  where
    work i max count = do
      b <- readArray array i
      let count' = count + if b then 1 else 0
      evaluate count'
      let i' = i+1
      if i' > max
        then return count'
        else work i' max count'

Again, the bounds-checking costs a lot of time. With

chunk_count :: Chunk -> IO Integer
chunk_count array = do
    (min, max) <- getBounds array
    work 0 (fromInteger (max-min)) 0
  where
    work i max count = do
      b <- unsafeRead array i
      let count' = count + if b then 1 else 0
      evaluate count'
      let i' = i+1
      if i' > max
        then return count'
        else work i' max count'

we're down to 15 seconds. Now, evaluate count' is a somewhat expensive way to make work strict in count. Using else work i' max $! count' in the last line instead of evaluate reduces the running time to 13 seconds; that is, the tail of the loop becomes:
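
      if i' > max
        then return count'
        else work i' max $! count'

Defining work in a more suitable (for GHC, at least) way,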

chunk_count :: Chunk -> IO Integer
chunk_count array = do
    (min, max) <- getBounds array
    let mx = fromInteger (max-min)
        work i !ct
            | mx < i    = return ct
            | otherwise = do
                b <- unsafeRead array i
                work (i+1) (if b then ct+1 else ct)
    work 0 0

brings the time down to 6.55 seconds. Now we're in a situation where say $ "New chunk " ++ show block makes a measurable difference, disabling that gets us down to 6.18 seconds.

However, counting set bits by reading a byte from the array, masking off the undesired bits and comparing to 0 for each individual bit is not the most efficient way. It's faster to read entire Words from the array (via castIOUArray) and use popCount, if "you know what you're doing..."; that gets us down to 4.25 seconds. Stopping the marking when the square of the prime becomes larger than the upper bound of the chunk

sieve_top :: Chunk -> Chunk -> IO ()
sieve_top base chunk = work 2
  where
    work prime = do
      chunk_sieve chunk prime
      mp <- chunk_next_prime base prime
      case mp of
        Nothing -> return ()
        Just p' -> do
            (_,mx) <- getBounds chunk
            when (p'*p' <= mx) $ work p'

brings the time down to 3.9 seconds. Still not spectacular, but considering where we started, not bad. Just to illustrate the importance of cache locality once other bad behaviour has been reduced: the same code with the original 10MB block size takes 8.5 seconds.
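
For illustration, a minimal sketch of such a word-level count, assuming Int-indexed chunks as after the earlier change, and a chunk holding a whole number of 64-bit words (a real implementation must mask the bits of a partial last word; the cast keeps the original bounds, hence the hand-derived word count):

{-# LANGUAGE BangPatterns #-}
import Data.Array.IO (IOUArray, getBounds, castIOUArray)
import Data.Array.Base (unsafeRead)
import Data.Bits (popCount)
import Data.Word (Word64)

-- count the set (True) bits a whole Word64 at a time
chunkCount :: IOUArray Int Bool -> IO Integer
chunkCount chunk = do
  (mn, mx) <- getBounds chunk
  ws <- castIOUArray chunk :: IO (IOUArray Int Word64)
  let nwords = (mx - mn + 1) `quot` 64
      work i !ct
        | i >= nwords = return ct
        | otherwise   = do
            w <- unsafeRead ws i
            work (i + 1) (ct + fromIntegral (popCount w))
  work 0 0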

Another small problem in your code is that all threads use the same mutable array of small primes for sieving. Since it is mutable, access to that must be synchronised, which adds a bit of overhead. With only two threads, the overhead isn't too big, using an immutable copy to do the sieving only reduces the time to 3.75 seconds here, but I expect that the effect would be larger for more threads. (I have only two physical cores - with hyperthreading - so using more than two threads doing the same kind of work introduces a slowdown that may invalidate conclusions drawn from that, but using four threads, I get 4.55 seconds with the immutable array versus 5.3 seconds with the mutable array. That seems to corroborate the growing synchronisation overhead.)

There's still a bit to be gained by eliminating more Integer calculations and writing code for GHC's optimiser (more worker/wrapper transformations, some static argument transformations), but not very much, maybe 10-15%.

The next big improvement is to be obtained by eliminating even numbers from the sieve. That reduces the work, allocation and running time by more than half. No prime sieve should ever consider even numbers, really, that's just a pointless waste of work.
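
For illustration, a minimal unsegmented sketch of the odds-only representation, where bit i stands for the odd number 2*i+3, so a prime p steps through the array with stride p rather than 2*p, halving both the memory and the work:

import Data.Array.Unboxed (UArray, assocs)
import Data.Array.ST (runSTUArray, newArray, writeArray)
import Control.Monad (forM_, when)

-- bit i represents the odd number 2*i+3; True marks a composite
primesTo :: Int -> [Int]
primesTo n
  | n < 3     = [2 | n >= 2]
  | otherwise = 2 : [2*i + 3 | (i, False) <- assocs arr]
  where
    lim = (n - 3) `div` 2              -- index of the last odd <= n
    arr :: UArray Int Bool
    arr = runSTUArray $ do
      a <- newArray (0, lim) False
      forM_ [0 .. lim] $ \i -> do
        let p = 2*i + 3                -- the odd number at index i
            s = (p*p - 3) `div` 2      -- index of p*p, where marking starts
        -- culling from composite i's is redundant but harmless in a sketch
        when (p*p <= n) $
          forM_ [s, s + p .. lim] $ \j -> writeArray a j True
      return a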

Daniel Fischer
  • Thanks for an extensive and thorough answer. It's kinda late at night in my timezone, but hopefully tomorrow I can try some of your suggestions. – MathematicalOrchid Aug 09 '12 at 20:59
  • @MathematicalOrchid Have you tried any of the suggestions and if so, what did you get from them? I got quite a bit from them ;) – Daniel Fischer Aug 27 '12 at 19:42
  • @DanielFischer, thanks for the detailed description of the programming approach used by the Eratosthenes Sieve implementation of your arithmoi package. In the 'TODO' list of that package you mention wanting to try the Sieve of Atkin algorithm. I would suggest that your time would be better spent further extending the wheel factorization and multi-processing used, for the reasons laid out in [my answer to a C# question](http://stackoverflow.com/a/20572948/549617); maximum wheel factorization makes more of a difference for the SoE than the small empirical run time advantage of the SoA. – GordonBGood Jan 22 '14 at 07:26
  • @GordonBGood Yes, I don't expect to be able to write an Atkin Sieve that beats a decent Eratosthenes sieve. Still, in principle I'm curious how it would perform, so it remains as a "when I have time and inclination, I'm going to try it out". But before that, when I can get myself to do it, there are improvements to the Eratosthenes sieve in the line. – Daniel Fischer Jan 22 '14 at 11:06
  • @DanielFischer, a fully optimized Sieve of Atkin (SoA) will likely beat your current implementation of the Sieve of Eratosthenes (SoE) as in arithmoi with its only 2,3,5 wheel factorization, but the SoE is not limited to only that - a 2,3,5,7 wheel is quite easy to implement and further wheels are possible. One easy optimization is to pre-initialize the mutable array with a large wheel pattern such as a 2,3,5,7,11,13,17 wheel as well as using the smaller wheel for further culling, for a considerable reduction in the total number of culling operations. The SoA has the 2,3,5 wheel baked into the quadratics. – GordonBGood Jan 23 '14 at 02:30
  • @GordonBGood But to write a "fully optimized" implementation, one needs to know more about the workings of the machine than I do ;) – Daniel Fischer Jan 23 '14 at 09:14
  • @DanielFischer, ah so. By "fully optimized" I was referring to similar optimizations as used by the Sieve of Eratosthenes [primesieve in C](https://code.google.com/p/primesieve/) (at least up to the bucket sieve), but applied to the Sieve of Atkin with segmented mutable arrays. I'm not sure that the extreme loop unrolling used there would work in Haskell, but I suppose it might. In fact, the SoA isn't really all that much different than the SoE with a different segment array culling engine. – GordonBGood Jan 23 '14 at 10:19
  • @DanielFischer, regarding making tight-loop algorithms such as this "as fast as highly optimized C/C++ code", given an efficient enough "back end" (LLVM is better than the NCG), using GHC Haskell one can implement all of the tricks as used in primesieve (C) including extreme wheel factorization, loop unrolling (TemplateHaskell helps), choice of buffer size, and even use array pointers in the inner loop (using primitive operations), but at a terrible cost in code complexity. Other than as an interesting exercise, why would one do this rather than writing it in C and calling it from Haskell? – GordonBGood Feb 10 '14 at 00:57
  • "Counting primes < 10^11 takes 8 minutes using just one core" is slow using a slightly different way of analysis: A tight culling loop inside a L! cache sieve buffer should take about 3.5 CPU machine cycles per culling loop for a modern Intel CPU,, should take about 0.6 to 1.0 nanoseconds per composite number representation cull for modern CPU speeds. For this range, there are about 11.3/27.6 billion culls (odds-only/all), so the the job should take well under 480 seconds even with all worst case. To be even worse than this, an inefficient counting method or inefficient memory use may be used. – GordonBGood Nov 16 '15 at 01:38
  • @DanielFischer, I think I have solved the "eight-fold speedup" problem: 4 cores with Hyper-Threading (HT) appear as eight cores to the OS, and the "rabid C programmer" is likely using too large a sieve buffer, many times larger than the cache sizes, just as the question does, meaning 10's of CPU cycles per composite cull operation, with most of the time spent waiting on memory access. Under these conditions, HT threads are quite effective, as they just share the wait times per thread for the memory controller's main-memory accesses. – GordonBGood Nov 16 '15 at 02:12
  • In fact, < 32 seconds CPU time to 10^11 is quite possible: 1) first, use maximum wheel factorization with 2/3/5/7 and pre-culling for 11/13/17/19 to reduce total composite culls to about 3e10; 2) use very tight inner culling loops with extreme loop unrolling as per "primesieve" to reduce the average clocks per cull to about 2.2 for this range; 3) minimize all other overheads, such as counting the remaining primes as you've described above, to negligible; and 4) we get about 28.7 seconds CPU time even for your 2.3 GHz CPU. Divide that by the number of true cores (say four) and... < 8 seconds. – GordonBGood Nov 06 '16 at 03:07
1

This is a strange question/answer pair, in that the accepted answer doesn't match the question: the question asks for help improving the speed of a sieve (correctly choosing a Page-Segmented Sieve of Eratosthenes implementation), but the accepted answer doesn't use a sieve at all; it uses a numerical analysis technique, and is just a library. Although that is fine for finding the total number of primes up to a large range very quickly (and there are faster and broader versions for doing that in other languages, such as Kim Walisch's primecount in C++, including versions for quickly calculating the sums of primes over a range), a sieve is useful for doing particular types of analysis, such as finding prime gaps, the existence of prime doubles, triples, etc. (generally K-Tuple primes). In fact, general numerical analysis techniques such as the Meissel–Lehmer algorithm, upon which most of these are based, require a source of "seed primes" to start, which is best produced by an optimized Sieve of Eratosthenes.

In fact, Kim Walisch's primecount as per the above link has a GHC/Haskell API already built for it and can easily be called through the Foreign Function Interface (FFI), so it would be a better answer than the arithmoi library, as it is faster. It is so fast that it is currently the record holder, having calculated the number of primes up to 1e28! If one must make such a value available to a Haskell program and doesn't care to understand how it is obtained, it calculates the number of primes to 1e11 in tens of milliseconds.
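
For instance, a minimal sketch of such an FFI call (assuming the C entry point int64_t primecount_pi(int64_t x) from primecount's C API, and linking against the library, e.g. with -lprimecount):

{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Int (Int64)

-- int64_t primecount_pi(int64_t x) from primecount.h (assumed C API)
foreign import ccall safe "primecount_pi"
  c_primecount_pi :: Int64 -> IO Int64

main :: IO ()
main = c_primecount_pi (10 ^ 11) >>= print  -- expect 4118054813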

In a similar fashion, if a sieve is what is really required, then Kim Walisch's primesieve likewise has a GHC Haskell FFI and could be called directly.

While using libraries gets the job done, one doesn't learn anything about how to implement them by just using them; thus the reason for Daniel Fischer's (DF's) very good tutorial answer and this follow-on series to what he started. DF's answer shows how to improve the question's code, but nowhere is there a summary showing what the code should look like after working through all of his suggestions. This is especially important as the original question code in the OP's hpaste has disappeared (mercifully, as it was only a good example of how not to do it, though perhaps the code should have been embedded in the question for reference), and we can only reconstruct what it did through DF's comments in his answer. This series of answers seeks to rectify that, in case someone has a need for such a sieve in pure GHC Haskell, starting with a summary of the code to which DF's teaching leads and following that up with further staged improvements.

TLDR; Jump to the end of the last of my answers for posted Haskell code that is almost as fast as, or as fast as, Kim Walisch's primesieve, which up to now is probably the fastest in the world, at least up to smaller ranges of ten to a hundred billion or so, and isn't beaten by much above that by anything other than an extremely optimized version of YAFU, which may be up to about 5% faster for large ranges. DF's final code before wheel factorization is stated to be about 40 times faster than the original question code; I extend that to where my code is 30 to 32 times faster yet again, for a total of about 1200 to 1280 times faster than the original question code!

The original question code

This will be the only time I reference it, since it is no longer available (and in my opinion isn't worth modifying anyway). The only thing I liked about that code was the implementation of a thread pool, but even that was flawed in that it used mapM to feed the entire job queue to a channel for processing by the threads; mapM is a non-deferred function, so it could potentially push a huge amount of work onto the job channel, consuming a lot of memory, rather than just pushing enough work to keep all the threads busy and then feeding one more job for every one returned from the results channel. I correct that in my code at the bottom of this answer and in the follow-on codes. In fact, only a results MVar pool is necessary, as the GHC runtime forks new threads faster than it takes to pipe new work to a waiting pool.

One problem with both the original code and DF's improved code is that neither uses the "-fllvm" compiler flag to select the LLVM back-end code generator. For tight loops such as we are trying to write here, LLVM can reduce the time per loop by as much as a factor of about two. It didn't matter for the original code, whose loops were so non-tight that LLVM couldn't help, but DF's code does have tight loops and would have benefited, with the loop time reduced to about 60 per cent.

The other problem is the use of MVar's (and thus Chan's), which aren't particularly fast, with an overhead of about three milliseconds per set activation. We had evidence of this problem in DF's answer, in his final analysis where he said "using four threads, I get 4.55 seconds with the immutable array versus 5.3 seconds with the mutable array" as compared to the 3.75 seconds using two threads. Now, his machine had only two cores and the extra two threads were Hyper-Threaded, sharing most of the same resources as the other two, so one doesn't expect much better performance from using them, but one doesn't expect worse performance, as here. The reason is that there is so much overhead that adding the inefficient cores actually adds extra work and slows the final result. I also see this on my four "real" thread/core machine after increasing efficiency by using the LLVM back end: I only get a reduction of the time to about 55% when using all four cores, and in line with that, when I use only two cores the total execution time actually increases. Since MVar's are the only way we can implement "wait for result" using GHC, the solution is to make the work slices much larger (more coarse-grained multi-threading) so this overhead becomes a negligible fraction; that is part of my algorithmic improvements in my second answer.

There is also no need to use channels as infinite-depth places to receive work and return results once the overloading problem is fixed, so I eliminated them in favor of just a "round robin" array which has as many elements as there are processes in the pool.

The test environments

I'm not sure if DF still has his Sandy Bridge laptop, and although I have had a Sandy Bridge CPU, it is currently down for maintenance. However, the online IDE website Wandbox uses a Broadwell CPU of almost exactly the same rating as DF's machine used in his answer, at 2.3 GHz with a turbo boost to 2.9 GHz for single-threaded use, with two cores/four threads (Hyper-Threaded). This has the same performance as DF's machine, as proven by my taking his referenced internals of the "arithmoi" library and forcing it to run for a range of a billion: this Wandbox link shows it running at almost exactly the same 2.92 seconds as he mentioned in his answer. I did not bother to count the result (which takes only about 0.01 milliseconds using pop count) as that would not change the comparison, but only forced it to run over the range with a 128 Kilobyte buffer size, as is the default.

So, in Wandbox we have a readily referenced comparable machine; however, it has the limitation that it does not support the use of the LLVM back end, which is important for optimization of the tight culling loops we will be using. Accordingly, I will be doing comparisons with and without LLVM on my own machine, which is an Intel Skylake i5-2500 at 3.2 GHz with a boost to 3.6 GHz for single-threaded use. There is a slight limitation in this, in that the results will not scale directly by the clock speeds used, because Skylake has a further architectural improvement giving better branch prediction and elision of branches, down to as little as zero time when they are correctly predicted; as the loops we are developing spend almost all of their time in tight loops, this can make a ten to fifteen percent reduction in execution time for the sieve implementations.

The principles behind a fast Sieve algorithm

These principles are just two, as follows:

  1. It is important to keep the total number of operations low for a given sieving range.
  2. Each operation must take a small number of CPU clock cycles on average.

The end execution efficiency is then the product of these two.

The performance target

DF seems to think that Atkin and Bernstein's "primegen" is a "gold standard" in sieve performance. It is not, for the major reason that it does not and cannot take a small number of CPU clock cycles per operation on average (principle 2), and the number of cycles it consumes increases with range faster than what I regard to be the "gold standard": Kim Walisch's primesieve, as referenced in the TLDR. While it is absolutely true that this implementation of the Sieve of Atkin (SoA) passes principle 1) above, as it quickly converges to a constant number of operations of 0.2587 times the range for ranges as low as about 100, and at a range of about 1e11 this is less than the best practical implementations of a Maximally Wheel Factorized Sieve of Eratosthenes as per the combo sieve here, as estimated by the formulas above that point on the web page (0.2984 for 1e11, higher with increasing range), it does not live up to expectations as to efficiency. The comparison made by the SoA paper against the Sieve of Eratosthenes (SoE) is flawed: in the same download as the code for "primegen" is the code for "eratspeed", which is a reference version of the SoE that sieves over about a billion. However, they crippled that reference version by limiting it to the same 2/3/5 wheel factorization as is baked into the SoA (which can't be increased), instead of using the Maximally Wheel Factorized combo sieve (for which there is evidence in the file that they knew of it). This made the number of operations over this range a little over 400 million, as compared to the SoA's about 258.7 million. Next, it appears that they further crippled the reference SoE by making its sieve buffer smaller than that used for the SoA, so as to increase the time per operation of the SoE to about that of the SoA, in spite of those operations being simpler than those of the SoA. In this way they claimed that the SoA was about 40% faster than the SoE.

Bernstein did some hand optimizations to the tight inner operation loop for both of these in a similar way, perhaps due to the C compilers of the day not being able to fully optimize these loops; in order for those compilers not to undo these hand optimizations, he states in the notes that the compiler should be run with only the first level of optimization. That is no longer true for the "gcc" version of today, as the performance of both is increased with "-O3" high-level optimizations. If both are set to 8192 32-bit words (32 Kilobytes) as above and compiled with those optimizations, they both sieve to a billion in about 0.49/0.50 seconds on the DF-type machine, indicating that the number of CPU cycles per cull is about 36 per cent less for "eratspeed". If the Maximum Wheel Factorization combo principles were applied, it should be about yet another 40% faster, and this is before doing optimizations on the tight inner culling loop; these optimizations to the culling loop are not possible with the SoA, because it must use a variable-span-per-operation loop, as compared to the fixed span per loop of the SoE. This reference "eratspeed" SoE implementation will be discussed further down the answer(s), as this is the approach that leads to my improved-algorithm answer.

As a final note about the SoA "primegen", Bernstein seems to believe that the sieve buffer needs to be limited to less than the CPU L1 cache size. While that may have been true for the CPU's on which he developed this work, it is no longer true for modern CPU's, whose L2 cache performance is much faster than the SoA requires and close to the reference SoE tight inner loop time. Thus, if one makes the sieve buffer equal to the CPU L2 cache size (256 Kilobytes in the case of these CPU's), the time to sieve to a billion changes almost not at all, at about 0.50 seconds for "primegen", but the time to sieve to 100 billion almost scales linearly, to about 52 seconds (as it should). This works because the buffer has been increased so that the SoA isn't plagued so quickly by operation span overflows, but it doesn't fix the problem; it just moves it further up the range, and the SoA still won't be faster than a maximally optimized SoE at even the highest practical ranges.

For reference, a "primesieve" type of algorithm sieves to a billion in about 0.18 seconds and to 100 billion in about 25 seconds when single-threaded, with both times reduced by a factor of about two using two threads on the Wandbox/DF class of CPU.

State of DF's final results of his answer and proposed further work

DF stated that his final answer code would sieve to a billion in about 7.5/3.75 seconds when run single-threaded/on two threads, respectively. This represents about 2.5514 billion operations, and at a CPU clock of 2.9 GHz about 9 CPU clocks per cull (CpC). This is not good, as a basic culling loop should take about 3.5 CpC; it is the result of not using LLVM, as discussed above.

He suggested that the first further improvement would be wheel factorization by odds-only, for an improvement in speed of about 2.5 times, and that further improvements using more extended wheel factorization are reasonably easily possible. This is true. However, his attempt to do extended wheel factorization in the primes function of the "arithmoi" library is very much a failure: 2.92 seconds to do 404.8 million culls at 2.9 GHz is about 21 CpC, just as the 340 seconds to do the 46.07 billion culls to sieve to 100 billion is also abysmal at about the same clocks per cull. This is so slow that there is no reason for this extended 2/3/5 wheel factorization, as the result will be the same or slower than if one just used odds-only, even at 9 CpC. The reason for this terrible efficiency is that he uses some complex, and thus slow, mathematics to do the reduction in culls for wheel factorization, and those computations take a lot of machine time. There are Look-Up-Table ways of doing this that are about twice as fast, at about 12 CPU clock cycles per cull, but they are still too slow for use over such small ranges; their use should be limited to augmenting the efficient range for very large ranges, where the percentage of the time they take is a small part of the overall time.

To show how bad these results are, here is a reference Wandbox JavaScript odds-only version sieving to a billion in about 2.14 seconds, or about 6.15 CpC; this runs on my Skylake machine in 1.54 seconds (5.4 CpC), with the time reduced beyond the ratio of clock rates due to the improvements in architecture mentioned above.

Further, in Haskell, here is an odds-only version from my submission to RosettaCode, the second, faster version, which runs on the reference Wandbox CPU in 2.26 seconds sieving to a billion, compiled without LLVM, for about 6.4 CpC. On my Skylake machine, this runs in 1.83 seconds and 1.023 seconds without/with LLVM respectively (6.4/3.6 CpC, respectively), and sieves to 100 billion in about 210/127 seconds no-LLVM/LLVM, respectively (6.7/4.0 CpC, respectively). Note that these are faster than the "arithmoi" library's wheel-factorized version. This will form the basis for further algorithmic improvements in my second answer.

So, the following code is an odds-only algorithm which performs as per DF mentions as his final answer:

-- Multi-threaded Page-Segmented Bit-Packed Odds-Only Sieve of Eratosthenes... 
-- "Running a modern CPU single threaded is like
--  running a race car on one cylinder" me ...

-- compile with "-threaded" to use maximum available cores and threads...
-- compile with "-fllvm" for highest speed by a factor of up to two times.

{-# LANGUAGE FlexibleContexts, ScopedTypeVariables #-} -- , BangPatterns, MagicHash, UnboxedTuples, Strict
{-# OPTIONS_GHC -O2 -fllvm #-} -- or -O3 -keep-s-files -fno-cse -rtsopts

import Data.Int ( Int32, Int64 )
import Data.Word ( Word32, Word64 )
import Data.Bits ( (.&.), (.|.), shiftL, shiftR, popCount )
import Data.Array.Base (
         UArray(..), listArray, assocs, unsafeAt, elems,
         STUArray(..), newArray,
         unsafeRead, unsafeWrite,
         unsafeThaw, unsafeFreezeSTUArray, castSTUArray )
import Data.Array.ST ( runSTUArray )
import Control.Monad.ST ( ST, runST )
import Data.Time.Clock.POSIX ( getPOSIXTime )

-- imports to do with multi-threading...
import Data.Array (Array)
import Control.Monad ( forever, when )
import GHC.Conc ( getNumProcessors )
import Control.Monad.Cont ( join )
import Control.Concurrent
    ( ThreadId,
      forkIO,
      getNumCapabilities,
      myThreadId,
      setNumCapabilities )
import Control.Concurrent.MVar ( MVar, newEmptyMVar, putMVar, takeMVar )
import System.IO.Unsafe ( unsafePerformIO )

type Prime = Word64
type PrimeNdx = Int64
type StartAddr = Int32
type StartAddrArr = UArray Int StartAddr
type BasePrimeRep = Word32
type BasePrimeRepArr = UArray Int BasePrimeRep
type SieveBuffer = UArray Int Bool -- no point to artificial index!
 
-- constants related to odds-only...
cWHLPRMS :: [Prime]
cWHLPRMS = [2] -- excludes even numbers other than 2
cFRSTSVPRM :: Prime
cFRSTSVPRM = 3 -- start at first prime past the wheel prime(s)
 
makeSieveBuffer :: Int -> SieveBuffer
{-# INLINE makeSieveBuffer #-}
makeSieveBuffer szbts = runSTUArray $ do
  newArray (0, szbts - 1) False

-- count the remaining un-marked (prime) bits using very fast popcount...
{-# INLINE countSieveBuffer #-}
countSieveBuffer :: Int -> SieveBuffer -> Int
countSieveBuffer lstndx sb = runST $ do
  cmpsts <- unsafeThaw sb -- :: ST s (STUArray s PrimeNdx Bool)
  wrdcmpsts <-
    (castSTUArray :: STUArray s Int Bool ->
                      ST s (STUArray s Int Word64)) cmpsts
  let lstwrd = lstndx `shiftR` 6
  let lstmsk = 0xFFFFFFFFFFFFFFFE `shiftL` (lstndx .&. 63)
  let loopwi wi cnt =
        if wi < lstwrd then do
          v <- unsafeRead wrdcmpsts wi
          case cnt - popCount v of
            ncnt -> ncnt `seq` loopwi (wi + 1) ncnt
        else do
          v <- unsafeRead wrdcmpsts lstwrd
          return $ fromIntegral (cnt - popCount (v .|. lstmsk))
  loopwi 0 (lstwrd * 64 + 64)

cWHLPTRNLEN64 :: Int
cWHLPTRNLEN64 = 2048

cWHLPTRN :: SieveBuffer -- twice as big to allow for overflow...
cWHLPTRN = makeSieveBuffer (131072 + 131072)

-- could be faster using primitive copyByteArray#...
-- in preparation for filling with pre-cull pattern...
fillSieveBuffer :: PrimeNdx -> SieveBuffer -> SieveBuffer
fillSieveBuffer lwi sb@(UArray _ _ rng _) = runSTUArray $ do
  ptrn <- unsafeThaw cWHLPTRN :: ST s (STUArray s Int Bool)
  ptrnu64 <- (castSTUArray :: STUArray s Int Bool ->
                                  ST s (STUArray s Int Word64)) ptrn
  cmpsts <- unsafeThaw sb :: ST s (STUArray s Int Bool)
  cmpstsu64 <- (castSTUArray :: STUArray s Int Bool ->
                                  ST s (STUArray s Int Word64)) cmpsts
  let lmt = rng `shiftR` 6
      lwi64 = lwi `shiftR` 6
      loop i | i >= lmt = return cmpsts
             | otherwise =
                 let mdlo = fromIntegral $ lwi64 `mod` fromIntegral cWHLPTRNLEN64
                     sloop j
                       | j >= cWHLPTRNLEN64 = loop  (i + cWHLPTRNLEN64)
                       | otherwise = do
                          v <- unsafeRead ptrnu64 (mdlo + j)
                          unsafeWrite cmpstsu64 (i + j) v; sloop (j + 1) in sloop 0
  loop 0

cullSieveBuffer :: PrimeNdx -> [BasePrimeRepArr] -> SieveBuffer -> SieveBuffer
cullSieveBuffer lwi bpras sb@(UArray _ _ rng _) = runSTUArray $ do
  cmpsts <- unsafeThaw sb :: ST s (STUArray s Int Bool)
  let limi = lwi + fromIntegral rng - 1
      loopbpras [] = return cmpsts -- stop warning incomplete pattern match!
      loopbpras (bpra@(UArray _ _ bprrng _) : bprastl) =
        let loopbpi bpi
              | bpi >= bprrng = loopbpras bprastl
              | otherwise =
                let bp = unsafeAt bpra bpi
                    bpndx = (fromIntegral bp - cFRSTSVPRM) `shiftR` 1
                    rsqri = fromIntegral ((bpndx + bpndx) * (bpndx + cFRSTSVPRM)
                                             + cFRSTSVPRM) - lwi in
                if rsqri >= fromIntegral rng then return cmpsts else
                let bpint = fromIntegral bp
                    bppn = fromIntegral bp
                    cullbits c | c >= rng = loopbpi (bpi + 1)
                               | otherwise = do unsafeWrite cmpsts c True
                                                cullbits (c + bpint)
                    s = if rsqri >= 0 then fromIntegral rsqri else
                        let r = fromIntegral (-rsqri `rem` bppn)
                        in if r == 0 then 0 else fromIntegral (bppn - r)
                in cullbits s in loopbpi 0
  loopbpras bpras

-- multithreading goes here...

{-# NOINLINE cNUMPROCS #-}
cNUMPROCS :: Int -- force to the maximum number of threads available
cNUMPROCS = -- 1
-- {-
  unsafePerformIO $ do -- no side effects because global!
  np <- getNumProcessors; setNumCapabilities np
  getNumCapabilities
--}

-- list of culled sieve buffers from the start index with the given bit size...
makePrimePagesFrom :: forall r. PrimeNdx -> Int ->
                                (PrimeNdx -> SieveBuffer -> r) -> Bool -> [r]
makePrimePagesFrom stwi szbts cnvrtrf thrdd =
  -- great, we can make an extra thread pool whenever we might need more, and
  -- it should die and be collected whenever this goes out of scope!
  let bpras = makeBasePrimeRepArrs thrdd
      jbparms() =
        let loop lwi szb =
              (lwi, szb) : loop (lwi + fromIntegral szb) szb
        in loop stwi szbts in
  if thrdd then
    let

      {-# NOINLINE strttsk #-}
      strttsk lwi szbts bpras mvr = -- do some strict work but define it non-strictly,
        forkIO $ do -- else it will run in forground before threading!
          -- and return it using a MVar; force strict execution in thread...
          putMVar mvr $! cnvrtrf lwi $ cullSieveBuffer lwi bpras
                           $ fillSieveBuffer lwi $ makeSieveBuffer szbts

      -- start a result pool, initialized to start with the first tasks...
      {-# NOINLINE rsltpool #-}
      rsltpool :: Array Int (MVar r) = unsafePerformIO $! do
        mvlst <- mapM (const newEmptyMVar) [ 1 .. cNUMPROCS ] -- unique copies
        mapM_ (\ (mvr, (lwi, szb)) -> strttsk lwi szb bpras mvr)
                                          $ zip mvlst $ jbparms()
        return $! listArray (0, cNUMPROCS - 1) mvlst

      -- lazily loop over the entire job list...
      loop (fdhd : fdtl) =
        let {-# NOINLINE getnxt #-}
            getnxt ((lwi, szb), i) = unsafePerformIO $! do -- wait for and get result of next page
              let mvr = unsafeAt rsltpool i
              r <- takeMVar mvr -- recycle mvr for next
              strttsk lwi szb bpras mvr; return $! r
        in getnxt fdhd : loop fdtl

  -- lazily cycle over the rest of the jobs forever...
  in rsltpool `seq` loop $ zip (drop cNUMPROCS $ jbparms())
                               (cycle [ 0 .. cNUMPROCS - 1 ]) else 

-- back to non multi-threaded functions...

  let loop ((lwi, szb) : jbpmstl) =
        (cnvrtrf lwi . cullSieveBuffer lwi bpras . fillSieveBuffer lwi .
           makeSieveBuffer) szb : loop jbpmstl
  in loop $ jbparms()

makeBasePrimeRepArrs :: Bool -> [BasePrimeRepArr]
makeBasePrimeRepArrs thrdd = 
  let sb2bpra :: PrimeNdx -> SieveBuffer -> BasePrimeRepArr
      sb2bpra lwi sb@(UArray _ _ rng _) =
        let len = countSieveBuffer (rng - 1) sb
            bpbs = fromIntegral cFRSTSVPRM + fromIntegral (lwi + lwi) in
        listArray (0, len - 1) [ bpbs + fromIntegral (i + i) |
                                          (i, False) <- assocs sb ]                
      fkbpras = [ sb2bpra 0 $ makeSieveBuffer 512 ]
      bpra0 = sb2bpra 0 $ cullSieveBuffer 0 fkbpras $ makeSieveBuffer 131072
  in bpra0 : makePrimePagesFrom 131072 131072 sb2bpra thrdd

-- result functions are here...

-- prepends the wheel factorized initial primes to the sieved primes output...
-- somewhat faster not using higher-order functions, but still slow, so who cares?
primes :: Int -> Bool -> [Prime]
primes szbts thrdd = cWHLPRMS ++ concat prmslsts where
  -- convert a list of sieve buffers to a UArray of primes...
  sb2prmsa :: PrimeNdx -> SieveBuffer -> UArray Int Prime
  sb2prmsa lwi sb@(UArray _ _ rng _) = -- bsprm `seq` loop 0 where
    let bsprm = cFRSTSVPRM + fromIntegral (lwi + lwi)
        len = countSieveBuffer (rng - 1) sb in
    bsprm `seq` len `seq`
      listArray (0, len - 1)
                [ bsprm + fromIntegral (i + i) | (i, False) <- assocs sb ]
  prmslsts = map elems $ makePrimePagesFrom 0 szbts sb2prmsa thrdd

-- count the primes from the sieved page list to the limit...
countPrimesTo :: Prime -> Int -> Bool -> Int64
countPrimesTo limit szbts thrdd =
  let lmtndx = fromIntegral $ (limit - cFRSTSVPRM) `shiftR` 1 :: PrimeNdx
      sb2cnt lwi sb@(UArray _ _ rng _) =
        let nlwi = lwi + fromIntegral rng in
        if nlwi < lmtndx then (countSieveBuffer (rng - 1) sb, nlwi)
        else (countSieveBuffer (fromIntegral (lmtndx - lwi)) sb, nlwi)
      loop [] cnt = cnt
      loop ((cnt, nxtlwi) : cntstl) ocnt =
        if nxtlwi > lmtndx then ocnt + fromIntegral cnt
        else loop cntstl $ ocnt + fromIntegral cnt
  in if limit < cFRSTSVPRM then
       if limit < 2 then 0 else 1
     else loop (makePrimePagesFrom 0 szbts sb2cnt thrdd) 1
 
-- test it...
main :: IO ()
main = do
  let limit = 10^9 :: Prime
  -- page segmentation sized for most efficiency;
  -- fastest with CPU L1 cache size but more address calculation overhead;
  -- a little slower with CPU L2 cache size but just about enough to
-- cancel out the gain from reduced page start address calculations...
  let cSIEVEPGSZ = (2^18) * 8 :: Int -- CPU L2 cache size in bits
  let threaded = True

  putStrLn $ "There are " ++ show cNUMPROCS ++ " threads available."
 
  strt <- getPOSIXTime
--  let answr = length $ takeWhile (<= limit) $ primes cSIEVEPGSZ threaded -- slow way
  let answr = countPrimesTo limit cSIEVEPGSZ threaded -- fast way
  stop <- answr `seq` getPOSIXTime -- force evaluation of answr b4 stop time!
  let elpsd = round $ 1e3 * (stop - strt) :: Int64
 
  putStr $ "Found " ++ show answr
  putStr $ " primes up to " ++ show limit
  putStrLn $ " in " ++ show elpsd ++ " milliseconds."

This has been refactored from my RosettaCode submission referenced above by making it possible to have different sieve buffer sizes for the main sieve loop and the secondary base-prime feed loop, as well as by adding multi-threading (improved over DF's, as discussed above). It runs at about the same speed in CpC as DF mentions for his final answer on a machine equivalent to his, without LLVM (about 3/1.5 seconds, 9 CpC, to one billion), and runs at one second to a billion/125 seconds to 100 billion (3.7/4.1 CpC, respectively) on my Skylake machine with LLVM, single-threaded, and at about half of those times when multi-threaded, due to the problem of not being "coarse-grained" enough, as explained above.

This answer is only a factor of less than two faster than DF's code, mostly due to the recommended use of the LLVM back end.

GordonBGood