
The square root function for an old mainframe (pre-IEEE floating point, no hidden bit), the Soviet BESM-6, works as follows:

  1. Making sure the argument $X$ is non-negative
  2. Splitting $X$ into the exponent $E$ and the 40-bit mantissa $M$, the latter being in $[0.5, 1.0)$.
  3. Computing $T := -0.204405847 M^2 + 0.890194099 M + 0.313567057$
  4. Computing $Y := 2^{E/2} T$ (by setting the exponent to $[E/2]$ and optionally multiplying by $\sqrt 2$ if $E$ was odd).
  5. Performing two iterations of $Y := (X/Y+Y)/2$
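
In modern terms, steps 1-5 can be sketched like this (an illustration only, not the original code: IEEE doubles and Python's math.frexp/math.ldexp stand in for the 48-bit BESM-6 format):

    import math

    def besm6_style_sqrt(x):
        # Sketch of steps 1-5 with the coefficients quoted above
        if x < 0:
            raise ValueError("negative argument")               # step 1
        if x == 0.0:
            return 0.0
        m, e = math.frexp(x)                                    # step 2: x = m * 2**e, m in [0.5, 1)
        t = (-0.204405847 * m + 0.890194099) * m + 0.313567057  # step 3: quadratic estimate of sqrt(m)
        y = math.ldexp(t, e // 2)                               # step 4: exponent set to [e/2] ...
        if e % 2:
            y *= math.sqrt(2.0)                                 # ... times sqrt(2) if e was odd
        for _ in range(2):                                      # step 5: two Newton (Heron) iterations
            y = (x / y + y) / 2.0
        return y

    for x in (2.0, 3.0, 10.0, 12345.678, 1e-3):
        print(x, besm6_style_sqrt(x), math.sqrt(x))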

This is claimed to produce results within one ulp of the exact value, which looks reasonable, because $T$ is within less than $0.001$ of the exact value of $\sqrt M$ (that is, 10 bits of the mantissa are correct) and each iteration doubles the precision; but it prompts a question:

Is it possible to make an educated guess as to how they came up with those particular coefficients for the quadratic fit?

For the sake of historical accuracy, here is the assembly pseudocode of the relevant part of the function, with machine specifics omitted:

   LOAD arg
   ADDEXP 0     ; forces normalization
   STORE X
   MODE boolean
   JZ (linkreg) ; arg is exactly zero, return zero
   MODE sign
   JNZ errsqrt  ; arg was negative 
   SHR 1        ; the LSB of the exponent moves to the sign
                ; and the exponent which was offset by 64, is halved
   STORE Z
   LOAD X
   SUBEXP X     ; zeroes out the exponent value; acc is now in [0.5, 1)
   STORE Y
   MUL S2       ; Horner evaluation of the quadratic:
   ADD S1       ;   acc := S2*M + S1
   MUL Y        ;   acc := (S2*M + S1)*M
   ADD S0       ;   acc := (S2*M + S1)*M + S0 = T
   ADDEXP 32
   ADDEXP Z     ; now the exponent field of the accumulator is 64+[E/2]
   STORE Y
   LOAD Z
   MODE sign,noround ; I don't know why rounding is suppressed here
   JZ L1        ; the exponent of arg was even
   LOAD Y
   MUL R2       ; R2 = sqrt(2): compensate for the odd exponent
   STORE Y
L1 LOAD X       ; first Newton step: Y := (X/Y + Y)/2
   DIV Y
   ADD Y
   ADDEXP -1    ; the division by 2 is an exponent decrement
   STORE Y
   LOAD X       ; second Newton step; result is returned in the accumulator
   DIV Y
   ADD Y
   ADDEXP -1
   J (linkreg)
* number format: [7-bit exponent, biased by 64], 1 sign bit, 40-bit mantissa
S2 CONST oct'3722726016750030'   ; -0.204405847
S1 CONST oct'4016174360527114'   ;  0.890194099
S0 CONST oct'3752021367076726'   ;  0.313567057
R2 CONST oct'4053240474631772'   ;  sqrt(2)
X  BSS 1
Y  BSS 1
Z  BSS 1
Leo B.
  • For comparison: My own polynomial minimax approximation generator based on the Remez algorithm, using a relative error criterion, spits out this approximation: -0.204405832x^2 + 0.890194073x + 0.313567067. That's almost identical. – njuffa Nov 16 '21 at 09:48
  • @njuffa This nails it down and solves OP's question completely it seems. If you write it as an answer you have my upvote! :) – Federico Poloni Nov 16 '21 at 10:43
  • David G. Moursund, "Optimal Starting Values for Newton-Raphson Calculation of √x", Communications of the ACM, Vol. 10, No. 7, July 1967, pp. 430-432 gives some minimax approximations optimized for relative error. It lists, among other approximations, the following for sqrt(x) in [0.5, 1]: -0.204405874000x^2 + 0.890194258460x + 0.313567132598 – njuffa Nov 16 '21 at 18:15
  • J. Eve, "Starting approximations for the iterative calculation of square roots." The Computer Journal, Vol. 6, No. 3, Nov. 1963, pp. 274-276 gives this quadratic minimax approximation to sqrt(x) on [0.5, 1] optimized for relative error: -0.204445x^2 + 0.890245x + 0.313553 – njuffa Nov 16 '21 at 18:41
  • @njuffa I've added the actual assembly code in a readable form. – Leo B. Nov 16 '21 at 21:28
  • Interesting that the floating point format apparently wastes one bit. The sign is both explicit in the sign bit and implicit in the msb of the mantissa. – WimC Nov 17 '21 at 19:56
  • @WimC That is true only for the normalized representation. You see, there was no separation between the "integer" ALU and the FPU, as today. It was possible, for example, to use integers represented by non-normalized numbers (with the lsb having value 1 by virtue of the exponent being set to value +40), allowing faster arithmetic by turning normalization off when evaluating integer expressions, as well as avoiding the need for the explicit integer-to-float conversion when mixing integers and floats in an expression. – Leo B. Nov 17 '21 at 20:31

3 Answers


My first guess was that they computed a minimax best-approximation polynomial to $\sqrt{x}$ on $[0.5,1]$ with something like the Remez algorithm. However, plotting the difference $p(x)-\sqrt{x}$, I do not see 4 points of equioscillation, so this is not the polynomial that minimizes $$\max_{x\in [0.5,1]} |p(x)-\sqrt{x}|.$$

Maybe that 0.001 that they tried to minimize is a relative error and not an absolute one -- I don't know the theory for that case. Or maybe someone didn't know the Remez algorithm and just computed a suboptimal solution optimizing directly with gradient descent.
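
One way to check is to compare the extrema of the two error measures, $p(x)-\sqrt{x}$ and $p(x)/\sqrt{x}-1$, numerically; here is a quick numpy sketch using the coefficients from the question:

    import numpy as np

    x = np.linspace(0.5, 1.0, 2001)
    p = -0.204405847 * x**2 + 0.890194099 * x + 0.313567057
    abs_err = p - np.sqrt(x)          # absolute error
    rel_err = p / np.sqrt(x) - 1.0    # relative error

    def extrema(err):
        # endpoints plus interior points where the slope changes sign
        interior = [i for i in range(1, len(err) - 1)
                    if (err[i] - err[i - 1]) * (err[i + 1] - err[i]) < 0]
        return [0] + interior + [len(err) - 1]

    for name, err in (("absolute", abs_err), ("relative", rel_err)):
        print(name, [(round(float(x[i]), 3), float(err[i])) for i in extrema(err)])
    # The relative error shows four alternating extrema of nearly equal magnitude
    # (about 6.4e-4); the absolute-error extrema have visibly different magnitudes.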

(Step 5 is Newton's method for $Y^2-X=0$, but probably you already know that.)

Federico Poloni
  • From personal design experience: Yes, one would optimize for relative error for the starting approximation. And using that criterion, the equioscillation of the error function is there (Wolfram Alpha plot). – njuffa Nov 16 '21 at 09:22
  • @njuffa Thanks for stepping in with useful information! I wasn't sure if the equioscillation theorem still holds in the case of the relative error, but I guess that's the case. – Federico Poloni Nov 16 '21 at 09:58

If my computer history is not wrong, no one would have used the initial guess obtained from the quadratic polynomial you provided together with the algorithm you suggested. There are two reasons for this: floating-point division was much more expensive than multiplication (up to 32x, or more if implemented in software, compared to about 4x today), and obtaining the initial guess from the quadratic polynomial is more expensive than one step of the Newton iteration.

In fact, when I was teaching, I had my students analyze an algorithm which computes the reciprocal of a number without using division. Here is the excerpt:

For computers, it is easy to add, subtract and multiply. However, it is harder to divide, exponentiate, take logarithms, etc. With all of their ingenuity, many smart people have offered solutions to make these operations more efficient. For example, the SRT division algorithm uses lookup tables, and it is commonly used in microprocessor design (see the Pentium FDIV bug for a historical issue regarding Intel's implementation of the SRT algorithm). We can also use Newton's method to reduce division to addition, subtraction and multiplication.

a. Find a function $f(x)$ such that the root $x^*$ of $f$ is the reciprocal of a given non-zero number $d$. Newton's method for the function $f$ MUST only contain addition, subtraction and multiplication as operations.

b. Define the error $\varepsilon_i = 1 - dx_i$ and show that $\varepsilon_i = \varepsilon_{i-1}^2$. Hence, the convergence of Newton's method for finding the reciprocal is quadratic.

c. Assume that $0.5\leq d\leq 1$. If we choose our initial guess $x_0 = p(d) = \frac{48}{17}-\frac{32}{17}d$, what is the maximum value of the error $\varepsilon(d) = 1 - dp(d)$ in absolute value? Show your work.

d. Assuming $0.5\leq d\leq 1$ and $x_0 = p(d) = \frac{48}{17}-\frac{32}{17}d$, what is the maximum number of steps $k$ required to ensure $\varepsilon_k<10^{-17}$?

e. Explain why we would want to avoid other stopping criteria and exclusively use the number of iterations as the stopping condition in this case.

And here is a solution:

a. $f(x) = 1/x - d$ works for this purpose, since $f(1/d) = 1/(1/d) - d = d - d = 0$ and $$x_{i+1} = x_i - f(x_i)/f^{\prime}(x_i) = x_i - \frac{1/x_i - d}{-1/x_i^2} = x_i + x_i(1-dx_i)= x_i(2-dx_i) $$

b. \begin{align*} \varepsilon_{i+1} &= 1 - dx_{i+1} = 1 - dx_i(2-dx_i) \\ &= d^2x_i^2 - 2dx_i + 1 = (1 - dx_i)^2 \\ &= \varepsilon_i^2. \end{align*}

c. Consider $\varepsilon(d) = 1 - dp(d)$. Since $\varepsilon(d)$ is continuous on the interval $[0.5,1]$, by the Extreme Value Theorem it attains its extrema on this interval. We know that these extrema are achieved either at the end points or at the critical points. Note that $\varepsilon(0.5)=1/17$ and $\varepsilon(1)=1/17$. The derivative $\varepsilon^{\prime}(d) = -p(d) - dp^{\prime}(d) = -\frac{48}{17}+\frac{64}{17}d$ is zero only if $d=3/4$, and $\varepsilon(3/4)=-1/17$. So we can conclude that the maximum value of the error in absolute value is $1/17$.

d. From part b, we know that $\varepsilon_i = \dots = \varepsilon_0^{2^i}$, and from part c, we know that $|\varepsilon_0|\leq 1/17$ for any initial guess $x_0 = p(d)$. So $|\varepsilon_i|\leq (1/17)^{2^i}$ and we want $(1/17)^{2^i}<10^{-17}$. Applying $\log_{10}$ to both sides, we obtain $2^i\log_{10}(1/17)<-17$ or, equivalently, $2^i>17/\log_{10}(17)$. Applying $\log_2$, we get $i>\log_2(17/\log_{10}(17))\approx 3.79$. Therefore, if we do $4$ iterations of Newton's method, our solution is guaranteed to satisfy the error tolerance.

e. In some applications (for example this one), the cost of checking for convergence may be higher than the cost of an iteration, and it may be fine to do 1 or 2 extra iterations. Since Newton's method for this problem converges quadratically and we reach the desired tolerance in just 4 iterations, it is probably not worth checking any other stopping criterion.
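
For completeness, here is a short sketch of parts b-d in Python (exact rational arithmetic via the fractions module, so that the $10^{-17}$ bound is not masked by double-precision rounding; the names are illustrative):

    from fractions import Fraction

    def recip_newton(d, steps=4):
        x = Fraction(48, 17) - Fraction(32, 17) * d   # initial guess x_0 = p(d)
        for _ in range(steps):
            x = x * (2 - d * x)                       # x_{i+1} = x_i (2 - d x_i)
        return x

    for d in (Fraction(1, 2), Fraction(3, 4), Fraction(17, 20), Fraction(1, 1)):
        eps = 1 - d * recip_newton(d)                 # eps_4 = eps_0 ** 16
        print(d, float(eps))                          # magnitudes at most (1/17)^16 < 1e-17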

The analysis in part c is relevant to your question: you want to find an initial guess which is going to minimize the maximum error, similar to Federico Poloni's answer. However, the polynomial you find will depend on your error function.

Also, in your question you provide the coefficients as floating-point numbers, but afaik they would have been computed exactly (by hand) and presented as rational numbers, so that the result is robust independently of the floating-point system. That might be another reason why we don't see the points of equioscillation, assuming Remez's algorithm was used to compute the coefficients.

(For the division algorithm, a quadratic polynomial which gives a better initial guess is, for example, $p(d):=\frac{140}{33}-\frac{64}{11}d+\frac{256}{99}d^2$. The error in this initial guess is, in absolute value, less than or equal to about $0.01$, and this is the best quadratic fit in the sense of minimizing the $L_{\infty}$ error over the interval $[0.5,1)$.

Again, if I remember correctly, this polynomial is obtained by running Remez's algorithm on $\max_{d\in [0.5,1]} |1-p(d)d|$.)
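
A quick numerical check of that bound, as a numpy sketch:

    import numpy as np

    d = np.linspace(0.5, 1.0, 10001)
    p = 140/33 - (64/11) * d + (256/99) * d**2
    print(np.max(np.abs(1 - d * p)))   # ~0.0101; the equioscillation level is 1/99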

Abdullah Ali Sivas
  • For the mainframe in question (the Soviet BESM-6), division was about 3x as slow as multiplication. The docs give 18 ALU clock cycles for MUL, and 50 for DIV on average. Computing the initial guess to 20-bit precision so that only one Newton iteration would suffice would be much more expensive. I've converted the coefficients from the binary representation in the code. It would be an interesting question whether the algorithm could be improved given our knowledge today, but I doubt it. – Leo B. Nov 16 '21 at 19:43
  • @njuffa I think division-free Newton-Raphson is an old algorithm. I don't know if it is as old as the BESM-6, but I know that it is at least from 1986 (due to Kahan), and Kahan himself may have been aware of the algorithm earlier than that. The techniques involved in the development of the algorithm are a hundred years old, so probably there were smart programmers around whose code worked much faster than everyone else's but who didn't publish it (or maybe they did, but nobody cared). – Abdullah Ali Sivas Nov 16 '21 at 20:34
  • @njuffa I didn't mean to imply that you made such a claim. I meant that probably there were people using the same (division-free) algorithm even back then. My apologies if my intentions came through wrong. – Abdullah Ali Sivas Nov 16 '21 at 20:51
  • @njuffa The quadratically-convergent iteration for the reciprocal square root ($x_{n+1} = \tfrac{x_n}{2}(3 - S x_n^2)$) uses 3 multiplications, one subtraction and one exponent correction, which, in case of a division operation 3x as expensive as a multiplication, is no faster than a Newton iteration but takes more instructions. Perhaps, even if they had considered using the reciprocal square root, they might have decided it's not worth the bother. – Leo B. Nov 16 '21 at 20:55
  • @AbdullahAliSivas No offense taken. The earliest proposal for division-free square root computation I could find in a quick search is: James R. Goodman, "Two square root algorithms utilizing multiplication as the iterative operator," M.S. Thesis, UT Austin, August 1969 – njuffa Nov 16 '21 at 21:26
  • @njuffa One cubically convergent iteration would need more than 13 bits of precision in the initial guess. I've found a comment that the values are from Computer Approximations by J. F. Hart et al., published by John Wiley & Sons, NY, 1968. The system software must have been ready by 1967 for the govt acceptance test, so this is not the initial version of the function, but then there is enough extra precision that even a simple 3-point fit at 0.5, 0.75 and 1 would do. However, if all current calculations don't yield the same numbers as given, how did Hart arrive at them? – Leo B. Nov 16 '21 at 22:14
  • @njuffa FWIW, the comment near the sqrt coefficients says "table 130". Nearby, for sin(x) - "table 3343", for atan(x) - "table 5096". Here, around the second of the lines numbered 010722 (there are multiple concatenated files, I presume). Initially I looked at the source of another OS which did not have the comments. – Leo B. Nov 16 '21 at 22:46
  • @LeoB. My apologies. Yes, table 130 in Hart et al. does list the exact coefficients given in the question. I overlooked this table in the index, because it is confusingly labelled as providing an approximation for $[\frac{1}{\sqrt{4}}, 1]$. The root symbol caused me to skip over it. – njuffa Nov 16 '21 at 22:51
  • @LeoB. One thing Hart may have done is to run a test on all the combinations of the numbers in a lattice neighbourhood of the theoretically optimal coefficients to find an approximating polynomial which gives the most accuracy after one or two steps of Newton's method under floating-point arithmetic. Surprisingly, such coefficients may be different from the theoretically optimal values. This was a trick I learned back in my undergrad, and I later learned that it is very common practice in applications. – Abdullah Ali Sivas Nov 17 '21 at 03:08
  • @AbdullahAliSivas That's an interesting possibility, but in the $[0.5, 1] \to [\sqrt{0.5}, 1]$ range, the arithmetic is effectively fixed point, isn't it? Judging by the values computed by njuffa, and listed in Moursund, quoted in a subsequent comment, which match Hart to 6 decimals, the differences could be precision/conversion artifacts. – Leo B. Nov 17 '21 at 03:34
  • I think because of the multiplications between two numbers less than 1, you still have roundoff issues in the intermediate steps, since $0.890194099\,M$ can drop below $0.5$ for $M<1$.
  • @LeoB. All the evidence available, including the methodology section of Hart et al., indicates that the coefficients in the book are purely a result of executing the Remez algorithm for generating minimax approximations. But this is a numerical process, and small differences in results will occur depending on details of the Remez algorithm implementation. The techniques Abdullah Ali Sivas alludes to (taking into account number representation by specific FP formats and incorporating the details of subsequent FP computation) did not (in my recollection) come into play until about 20-25 years ago. – njuffa Nov 17 '21 at 06:07

This polynomial $p(x)$ solves the minimax optimization for $$\frac{p(x)}{\sqrt{x}}-1$$ over the interval $[\tfrac12,1]$.

WimC