
I am working on a fixed-point reciprocal algorithm for the purpose of computing integer divides as fast as hardware multiplies. It turns out that for 1/x, one may repeatedly double (0.1)_x (the representation of 1/x in base x) to compute the reciprocal of x to arbitrary precision in O(n) time: each bit is determined by checking whether doubling the fractional portion once more produces a value greater than or equal to one.
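For concreteness, here is a minimal sketch of that doubling loop in C (illustrative only; it assumes x < 2^63 so the shift cannot overflow). The fractional portion is kept as a scaled remainder r, where the true fraction is r/x, so "the doubled fraction reached one" becomes the comparison r >= x:

```c
#include <stdint.h>
#include <stdio.h>

/* Emit the first `bits` binary digits of 1/x (x > 1) by repeatedly
 * doubling the fractional portion. The fraction is stored as the
 * scaled remainder r (true fraction = r/x), so no division is needed.
 * Assumes x < 2^63 so that r << 1 cannot overflow. */
uint64_t reciprocal_bits(uint64_t x, int bits)
{
    uint64_t r = 1;              /* fraction starts at 1/x          */
    uint64_t result = 0;
    for (int i = 0; i < bits; i++) {
        r <<= 1;                 /* double the fractional portion   */
        result <<= 1;
        if (r >= x) {            /* doubled fraction reached one?   */
            result |= 1;         /* next bit of 1/x is 1            */
            r -= x;              /* keep only the fraction          */
        }
    }
    return result;               /* 1/x with `bits` fractional bits */
}

int main(void)
{
    /* 1/10 = 0.0001100110011001... in binary; 16 bits -> 0x1999 */
    printf("%llx\n", (unsigned long long)reciprocal_bits(10, 16));
    return 0;
}
```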

Without even needing to create a sophisticated arbitrary-base encoding, one can work entirely in base two: compute the smallest power of two greater than x as 2**(floor(log_2(x)) + 1), whose remainder is 2**(floor(log_2(x)) + 1) mod x = 2**(floor(log_2(x)) + 1) - x; then determine the largest power-of-two coefficient n for which this remainder m satisfies m * n <= x, and repeat until the fractional portion is zero or we have run out of space. However, I have not found a way to parallelize this, so it does not appear to be too useful, though it does have a worst case of O(n) time and a best case of O(1) time.
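A hedged sketch of those building blocks (the helper names are mine, and `__builtin_clzll` is a GCC/Clang intrinsic): the identity p mod x = p - x holds because x < p < 2x whenever x is not itself a power of two; the power-of-two case is handled separately.

```c
#include <stdint.h>

/* floor(log_2(x)) for x > 0, via count-leading-zeros. */
static inline int floor_log2(uint64_t x)
{
    return 63 - __builtin_clzll(x);    /* GCC/Clang builtin */
}

/* 2**(floor(log_2(x)) + 1) mod x, i.e. the remainder of the smallest
 * power of two greater than x. For x not a power of two, x < p < 2x,
 * so the remainder is simply p - x. Assumes x < 2^63. */
static inline uint64_t next_pow2_mod(uint64_t x)
{
    uint64_t p = 1ULL << (floor_log2(x) + 1);
    return (x & (x - 1)) ? p - x : 0;  /* 0 if x is a power of two */
}
```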

The issue, as a matter of parallelization, is increasing the number of independent, concurrent terms to operate on. Since one only needs to convert 2**n to base x, or otherwise multiply (0.1)_x by 2**n for arbitrary n corresponding to n bits of precision, this method is much more promising implementation-wise, because each term can be computed independently. However, even with addition-chain exponentiation and the ability to at least multiply by two, the cost is too great outside of base two unless one can create an even tighter multiplication implementation that manages to be as cheap as addition or subtraction--something that is by no means trivial given how well-optimized multiply instructions are in contemporary hardware. The best alternative I have thus far is exp(x) using Taylor series and argument reduction, which reaches machine precision for 64-bit arguments in around 18 cycles using SIMD multiplies on Ryzen Family 17h; but for my requirements it needs to cost no more than twice a mul, or 6 cycles, and it would have to be base-agnostic in implementation, which adds yet more overhead. Using exp(x) here reduces the number of operations to a constant times the number of bits of precision desired, but the overhead of this implementation in software is unacceptable--perhaps an FPGA or hardware implementation would make it more practical.
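To make the "independent terms" point concrete, here is a hedged sketch (structure and names are my own): bit k of 1/x is 1 exactly when 2 * (2**(k - 1) mod x) >= x, and 2**(k - 1) mod x can be computed by square-and-multiply in O(log k) multiplies, so every bit position can be evaluated concurrently; the cost per bit is precisely the multiplies discussed above. `unsigned __int128` is a GCC/Clang extension, used here to keep the squaring exact:

```c
#include <stdint.h>

/* 2**k mod x by square-and-multiply: O(log k) multiplies.
 * 128-bit intermediates keep the squaring exact for 64-bit x. */
static uint64_t pow2_mod(uint64_t k, uint64_t x)
{
    unsigned __int128 acc = 1 % x, base = 2 % x;
    while (k) {
        if (k & 1)
            acc = (acc * base) % x;
        base = (base * base) % x;
        k >>= 1;
    }
    return (uint64_t)acc;
}

/* Bit k (k >= 1) of the binary expansion of 1/x, computable
 * independently of every other bit -- the parallelizable form.
 * r >= x - r is the overflow-safe spelling of 2*r >= x, since r < x. */
static int reciprocal_bit(uint64_t k, uint64_t x)
{
    uint64_t r = pow2_mod(k - 1, x);
    return r >= x - r;
}
```

For example, reciprocal_bit(4, 10) and reciprocal_bit(5, 10) are 1 while bits 1 through 3 are 0, matching 1/10 = 0.000110011... in binary.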

The circumstances make it evident that, to my knowledge, no existing built-ins or instructions can perform divides or compute reciprocals quickly enough for the purpose of implementing divides to the quality I desire; conversion from one base to another is trivialized by either a modulus or a floored-division algorithm. Computing 2**n mod x is much easier than computing y mod x, given that the graph can be interpreted as several lines whose slopes vary by 1/floor(x) and are bounded by the lines y = x and y = 0 for all integers x. I have not yet found a way to implement this efficiently, but of note here is that one can compute 2**n mod x recursively using the algorithm mentioned in the second paragraph.
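For reference, a multiplication-free sketch of that recursive/iterative route (my own formulation, assuming x < 2^63 so the shift cannot overflow): 2**n mod x by n doublings with a conditional subtraction, the O(n) counterpart of the square-and-multiply version above.

```c
#include <stdint.h>

/* 2**n mod x with no multiplies: double the running remainder n times,
 * subtracting x whenever it reaches x. O(n) steps, each just a shift,
 * a compare, and a conditional subtract. Assumes 0 < x < 2^63. */
static uint64_t pow2_mod_iter(uint64_t n, uint64_t x)
{
    uint64_t r = 1 % x;
    for (uint64_t i = 0; i < n; i++) {
        r <<= 1;                /* r < x < 2^63, so no overflow */
        if (r >= x)
            r -= x;
    }
    return r;
}
```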

– AMDG
  • 1. What do you mean by arbitrary base? A power of 2 always has `base = 2`... 2. Iteration/recursion is usually very hard or impossible to parallelize, so even if you optimize your approach, it is highly likely it will not be a good match for you. 3. IIRC, for computing reciprocals there is [Montgomery reduction](https://en.wikipedia.org/wiki/Montgomery_modular_multiplication), but I have never truly used it. You can use it to compute `2^n mod p`, where 2^n is a bitshift and `mod p` is a multiplication; however, I am not sure how much the reduction itself costs – Spektre Oct 08 '21 at 07:24
  • However, IIRC it depends on `p` and not on `2^n`, so you might have precomputed values in a LUT, unless `n` has too many bits for a LUT or the data cannot be compressed to a storable amount. 4. An iterative process does not need multiplication at all: you simply bitshift, and if the result is bigger than `p`, subtract `p`; but again, this is not parallelizable. See the modular arithmetic in [Modular arithmetics and NTT (finite field DFT) optimizations](https://stackoverflow.com/q/18577076/2521214), which uses this trick a lot. – Spektre Oct 08 '21 at 07:31
  • @Spektre Arbitrary-base arithmetic is what I intend to implement (both a specialized implementation for division, if necessary, and one for general purposes), i.e. to perform arithmetic in the native base of one's choosing; in the case of binary computers, the digits are encoded in a binary string. – AMDG Oct 08 '21 at 14:09
  • You seem to fail to understand where I'm coming from. Multiplying (0.1)_x in base x by two produces the fractional portion regardless. If you can't do modulus at all, then iteratively computing reciprocal bits in base two is the only option; otherwise, one converts 2^n to base x and multiplies (0.1)_x by 2^n in that base. Using addition-chain exponentiation, one can in fact multiply in the native base relatively efficiently if the encoding is made conducive to efficient multiplication. This can be parallelized. I'm seeking a means to compute faster than addition-chain exponentiation here. – AMDG Oct 08 '21 at 14:14
  • The only thing that matters in base x is the least significant digit value, which is in fact 2^n mod x if we multiply (0.1)_x by 2^n, hence why I'm looking into that as a lead. 1 mod x for rational x is actually much simpler than I originally realized: 2^n mod x = 2^n (1 mod 2^(-n) x). The trouble is determining symbolically, without evaluating the reciprocal of x, which "reciprocal group" x belongs to after converting to a fraction. If we can do that one thing, then we have an efficient modulus. – AMDG Oct 08 '21 at 14:40
  • Hm, it would appear you don't need to determine which "reciprocal group" x belongs to for `1 mod x` at all (for all x). It looks like you can just scale `1 mod x` itself as a function, using powers of two entirely, to effectively create an identity for the modulus, `2^(-n) - x`, then multiply by some power of two to get an integer result. This stems from the fact that the first "reciprocal group" is represented by `2^n - x` for `2^(n-1) < x <= 2^n`. I could be wrong--I haven't thoroughly examined this possibility--but if it holds, then I will answer my question here. – AMDG Oct 09 '21 at 14:16

0 Answers