I am working on a fixed-point reciprocal algorithm for the purpose of computing integer divides as fast as hardware multiplies. It turns out that one may compute 1/x to arbitrary precision in O(n) time by repeatedly doubling (0.1)_x, the representation of 1/x in base x, and checking whether doubling the fractional portion once more produces a value greater than or equal to one.
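For concreteness, here is a minimal C sketch of that doubling scheme (the function name and fixed integer widths are my own): the fraction (0.1)_x = 1/x is tracked by its numerator, so "doubling the fractional portion produces a value greater than or equal to one" becomes a plain compare against x.

    #include <stdint.h>

    /* Produce the first `prec` binary fraction bits of 1/x by repeated
     * doubling.  The current fraction k/x is tracked by its numerator k,
     * so "doubled fraction >= 1" is simply "2*k >= x".
     * Assumes x > 1 and prec <= 64.                                     */
    static uint64_t recip_bits(uint32_t x, int prec)
    {
        uint64_t bits = 0;   /* fraction bits of 1/x, most significant first */
        uint64_t k = 1;      /* numerator of the current fractional part k/x */

        for (int i = 0; i < prec; i++) {
            k <<= 1;                 /* double the fractional part            */
            bits <<= 1;
            if (k >= x) {            /* doubled fraction reached one or more? */
                bits |= 1;           /* emit a 1 bit                          */
                k -= x;              /* keep only the fractional part         */
            }
        }
        return bits;         /* prec bits immediately after the binary point */
    }

For x = 3 and prec = 8 this yields 01010101, i.e. 0.01010101 in binary, one bit per iteration.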
Without even needing to create a sophisticated arbitrary-base encoding on top of base two, one can work entirely in base two: compute the smallest power of two greater than x as 2**(floor(log_2(x)) + 1), compute its residue 2**(floor(log_2(x)) + 1) mod x = 2**(floor(log_2(x)) + 1) - x (when x is not itself a power of two), then determine the largest power-of-two coefficient n for which this residue m satisfies m * n <= x, and repeat until the fractional portion is zero or we've run out of space. I have not found a way to parallelize this, so it does not appear to be too useful, but it does have a worst case of O(n) time and a best case of O(1) time.
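The way I picture implementing that zero-skipping step, as a sketch only (my own naming; in a real implementation the inner shift search would be a count-leading-zeros rather than a loop):

    #include <stdint.h>

    /* Same digit extraction as above, but runs of zero quotient bits are
     * skipped in one step: shift the remainder up by the smallest amount
     * that reaches x, emit a single 1 bit at that position, subtract x,
     * and repeat.  If x is a power of two the remainder hits zero after
     * the first subtraction (the O(1) best case); otherwise each pass
     * emits one 1 bit (the O(n) worst case).
     * Assumes 1 < x < 2^32 and prec <= 64.                               */
    static uint64_t recip_bits_skip(uint32_t x, int prec)
    {
        uint64_t q = 0;      /* accumulated fraction bits of 1/x          */
        uint64_t r = 1;      /* current remainder, always < x             */
        int pos = 0;         /* number of fraction bits produced so far   */

        while (r != 0 && pos < prec) {
            int j = 0;
            while ((r << j) < x)     /* smallest shift with (r << j) >= x */
                j++;
            pos += j;
            if (pos > prec)
                break;
            q |= 1ull << (prec - pos);   /* one quotient bit here         */
            r = (r << j) - x;            /* restoring subtraction         */
        }
        return q;
    }

For example, recip_bits_skip(5, 8) returns 0x33, i.e. 0.00110011 in binary, which is 1/5 to 8 bits.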
The issue, as a matter of parallelization, is increasing the number of independent and concurrent terms to operate on. Since one only needs to convert 2**n to base x, or otherwise multiply (0.1)_x by 2**n for arbitrary n corresponding to n bits of precision, this method is much more promising implementation-wise because each term can be computed independently. However, even with addition-chain exponentiation and the ability to at least multiply by two, the cost is too great outside of base two unless one can create an even tighter multiplication implementation that manages to be as cheap as addition or subtraction, which is by no means trivial given how well-optimized multiply instructions are in contemporary hardware. The best alternative I have so far is exp(x) via a Taylor series with argument reduction, which reaches machine precision for 64-bit arguments in around 18 cycles using SIMD multiplies on Ryzen Family 17h; for my requirements it must cost no more than twice a mul, about 6 cycles, and it would have to be base-agnostic in implementation, which adds yet more overhead. Using exp(x) here reduces the number of operations to a constant times the number of bits of precision desired, but the overhead of a software implementation is unacceptable; perhaps an FPGA or hardware implementation would make this more practical.
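To make the independent-terms point concrete: the n-th fraction bit of 1/x is floor(2**n / x) mod 2, which is 1 exactly when 2 * (2**(n-1) mod x) >= x, so each bit depends only on its own 2**(n-1) mod x and can be computed in its own lane or thread by square-and-multiply (an addition chain on the exponent). A sketch, with names of my own choosing:

    #include <stdint.h>

    /* 2^n mod x by square-and-multiply.  Assumes 1 < x < 2^32 so the
     * products below cannot overflow a uint64_t.                        */
    static uint64_t pow2_mod(uint64_t n, uint64_t x)
    {
        uint64_t result = 1 % x, base = 2 % x;
        while (n) {
            if (n & 1)
                result = (result * base) % x;
            base = (base * base) % x;
            n >>= 1;
        }
        return result;
    }

    /* n-th fraction bit of 1/x (n >= 1), computable independently per n. */
    static int recip_bit(uint64_t n, uint64_t x)
    {
        return (2 * pow2_mod(n - 1, x)) >= x;
    }

This is also exactly where the cost objection bites: every step of the chain is a full multiply plus a reduction, which is what would have to shrink to the cost of an add or subtract for this to beat the serial loop.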
The upshot is that, to my knowledge, no existing built-ins or instructions perform divides or compute reciprocals quickly enough for the quality of divide implementation I am after; conversion from one base to another is trivialized by either a modulus or a floored-division algorithm. Computing 2**n mod x is much easier than computing y mod x for arbitrary y, since the graph can be interpreted as several lines whose slopes vary by 1/floor(x) and are bounded by the lines x = y and y = 0 for all integers x. I have not yet found a way to implement this efficiently, but of note here is that one can compute 2**n mod x recursively using the base-two digit-extraction algorithm described above.
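A sketch of that last remark (my own naming again): 2**n mod x falls out of the same double-and-fold step used for the digit extraction above.

    #include <stdint.h>

    /* 2^n mod x by iterated doubling.  Start from 2^0 mod x and double n
     * times, folding back under x whenever the running value reaches it;
     * since the value stays below x before each doubling, one subtraction
     * suffices.  Assumes 1 < x < 2^63 so the doubling cannot overflow.   */
    static uint64_t pow2_mod_doubling(uint64_t n, uint64_t x)
    {
        uint64_t r = 1 % x;          /* 2^0 mod x */
        for (uint64_t i = 0; i < n; i++) {
            r <<= 1;                 /* now 2^(i+1) mod x, possibly >= x */
            if (r >= x)
                r -= x;              /* fold back into [0, x)            */
        }
        return r;
    }

The square-and-multiply version above reaches the same value in O(log n) modular multiplies; this one needs only shifts, compares, and subtractions, but takes all n steps.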