17

The 8087 has the instructions FPTAN and FPATAN, which are Partial Tangent and Partial Arctangent. The "partial" presumably has to do with the range of the operands: for FPTAN the operand must be less than pi/4 in absolute value.

Why was there this restriction, and why is the limit pi/4 rather than pi/2?

With the 80387, the FPTAN operand has to have an absolute value less than 2^63; what is the reason for this number?

Finally, was there a reason that the 8087 did not include FCOS and FSIN?

Single Malt
  • 1,839
  • 1
  • 7
  • 28
  • 8
    Guess: The restriction is there because whatever approximation is used isn't good enough outside of pi/4, and they don't have enough gates to reduce the pi/2 range to pi/4 in hardware, so they leave that to software. So to answer the question, one must find out what approximation is used in the 8087... – dirkt Mar 28 '20 at 15:30
  • 3
    Your guess is borne out by this paper -- see for example the FSIN error graph on page 5. – dave Mar 28 '20 at 16:06
  • 4
    It is a range of pi/2 that just happens to go from -pi/4 to pi/4 which covers the full period of the function. – Brian Mar 28 '20 at 16:37
  • The accuracy issue described in @another-dave's link applies to any trig functions and any precision, because you can't represent the value of pi exactly as a floating point number. Modern algorithms use clever ways to avoid this, by storing the "error in the approximate value of pi" as another floating point constant (and therefore making "pi" effectively double-precision), but the 8087 would have been too small for that sort of refinement. – alephzero Mar 28 '20 at 18:06
  • 4
    … and the improved algorithm produces results which may surprise people, for example FSIN(MATH_PI) is not exactly zero because the floating point constant MATH_PI is not exactly equal to pi! – alephzero Mar 28 '20 at 18:08
  • @Brian The tangent function has a period of π. The instruction only covers half of a period, which is exactly what the OP is wondering about. It's easy to extend it to a full period because tan(π/2 – x) = 1/tan(x), but that's not what was asked. – Sven Marnach Mar 28 '20 at 21:01
  • 3
    @another-dave good reference. In the 1990s Intel replaced the 8087’s CORDIC-based approximations with polynomial-based approximations, which have greater overall accuracy and speed. So the 80387 FSIN would have had even larger errors! It also says “FPTAN(x) should not be relied on as an accurate approximation of tan() near multiples of π/2, odd or even”. – Single Malt Mar 29 '20 at 10:48
  • 2
    @alephzero: What irks me is that there isn't a standard function to compute sin(2*pi*x) which would have a period of precisely 1. There are very few cases where computation of sin(x) with a "mathematically correct" period would be more accurate than one using a period of precisely 2 times Math.PI. – supercat Mar 29 '20 at 23:35
  • @another-dave: related Bruce Dawson wrote a series of nice FP / x86 details articles, including https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/. That might have been what prompted that Intel paper, along with Intel updating their x86 PDF manuals. – Peter Cordes Mar 30 '20 at 08:52
  • @supercat: sin(2 pi x) is easy to achieve with CORDIC, less so with numeric suites. – Grabul Mar 31 '20 at 00:45
  • @TEMLIB: What is the difficulty of computing it with any suite? The argument reduction is way cheaper than with sin(x), and performing argument reduction, multiplying by 2pi, and then computing +/- sin(y) or +/- cos(y) for an angle in the range 0..pi/4 would in most cases yield better accuracy than trying to compute sin(x) directly. – supercat Mar 31 '20 at 04:34
  • 1
    The paper @another-dave mentions is now a dead link. An archive copy can be found here: https://web.archive.org/web/20200329172138/https://software.intel.com/sites/default/files/managed/f8/9c/x87TrigonometricInstructionsVsMathFunctions.pdf – Adam Hyland Nov 09 '23 at 23:42

3 Answers

17

The tan(x) function has a period of π radians, with asymptotes at ±π/2; it effectively calculates sin(x)/cos(x), and the latter goes to zero at those points. So a function which evaluates properly over that interval can be used for any angle, by first reducing the angle to the range supported.

However, accurately calculating the tangent function near the asymptotes is more difficult than near the origin. It can still be done by further reducing the input angle to the ±π/4 range and taking the negated reciprocal of the result. This works because sin(x) = cos(x − π/2): shifting the angle by π/2 swaps the numerator and denominator of sin(x)/cos(x) (up to sign), so tan(x) = −1/tan(x − π/2).
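
To make the reduction concrete, here is a minimal C sketch of that idea. The helper tan_core is hypothetical, standing in for a core approximation valid only on ±π/4 (here it simply calls the library tan()), and the naive reduction uses a rounded π/2, which is exactly the accuracy trap discussed in the comments above:

#include <math.h>

/* Hypothetical stand-in for a core approximation valid only for |x| <= pi/4. */
static double tan_core(double x) { return tan(x); }

double tan_any(double x)
{
    /* Write x = k*(pi/2) + r with r in [-pi/4, pi/4]. */
    double k = nearbyint(x / (M_PI / 2));
    double r = x - k * (M_PI / 2);
    /* tan has period pi: even k leaves the value unchanged, while odd k
       shifts the angle by pi/2, mapping tan(x) to -1/tan(r). */
    if (fmod(k, 2.0) != 0.0)
        return -1.0 / tan_core(r);
    return tan_core(r);
}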

The 8087 FPTAN instruction was evidently designed to cover only the half-period immediately surrounding the origin, making the assumption that the programmer would reduce angles to that range before executing it. This is common with polynomial approximations, which are much easier to implement in an FPU than a fully accurate calculation.

The 80387 runs into a different limitation, as it clearly implements a reduction step to permit an extended range of inputs. But the double-extended format has a 63-bit significand, so reducing a value larger than 2^63 would leave no significant precision behind. Even as you approach that limit, the available precision reduces substantially in absolute terms.
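
One quick way to see how coarse the format becomes near that limit is the following C check (assuming an x86 target, where long double is the 80-bit double-extended format; nextafterl is standard C99):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* At 2^63, adjacent representable long double values are a full
       unit apart, so after subtracting multiples of pi/2 essentially
       no fractional information about the angle would remain. */
    long double x = 0x1p63L;
    long double ulp = nextafterl(x, INFINITY) - x;
    printf("ulp(2^63) = %Lg\n", ulp);   /* prints 1 on x86 */
    return 0;
}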

While the 8087 did not include sine and cosine calculations, it is possible to derive them using the following identities:

1 + tan^2(a) = 1/cos^2(a)
1 + cot^2(a) = 1/sin^2(a)

Hence something like the following sequence would apply:

FPTAN               ; 8087: pushes two values with tan(a) = ST(1)/ST(0)
FDIVP               ; form the tangent from the two partial results
FMUL ST(0), ST(0)   ; square the tangent
FLD1
FADDP               ; 1 + tan^2(a)
FSQRT               ; sqrt(1 + tan^2(a)) = 1/|cos(a)|
FLD1
FDIVRP              ; reciprocal: |cos(a)|

The above would produce the magnitude of the cosine, but not the correct sign. That would have to be inferred from the quadrant of the original input (before reduction).
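
For illustration, the same derivation can be sketched in C with a quadrant-based sign correction; the library tan() stands in for FPTAN, and the crude reduction by a rounded π/2 is only a sketch:

#include <math.h>

/* Sketch: cos(x) from the tangent via 1 + tan^2(r) = 1/cos^2(r),
   with the sign recovered from the quadrant.  Illustrative only. */
double cos_from_tan(double x)
{
    /* Reduce: x = k*(pi/2) + r, with r in [-pi/4, pi/4]. */
    double k = nearbyint(x / (M_PI / 2));
    double r = x - k * (M_PI / 2);
    double t = tan(r);                  /* stand-in for FPTAN */
    double c = 1.0 / sqrt(1.0 + t * t); /* cos(r), positive on +/-pi/4 */
    int q = (int)fmod(k, 4.0);
    if (q < 0) q += 4;
    switch (q) {
    case 0:  return c;          /* cos(x) = cos(r)          */
    case 1:  return -t * c;     /* cos(r + pi/2) = -sin(r)  */
    case 2:  return -c;         /* cos(r + pi)   = -cos(r)  */
    default: return t * c;      /* cos(r + 3pi/2) = sin(r)  */
    }
}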

Obviously, obtaining the correct result from a single instruction is much faster and more convenient, which is why FSIN and FCOS were added to the 80387. In the earlier 8087, the space for microcode was decidedly limited, so functions that weren't strictly necessary were simply left out.

Full implementations of trigonometric functions usually start with a reduction step anyway, to improve the precision of the result by avoiding large values internal to the computation. These can still be implemented on the 8087 using the five elementary operations: addition, subtraction, multiplication, division, and square root.

Chromatix
  • 16,791
  • 1
  • 49
  • 69
  • The tangent function has a period of π, with poles at ±π/2. The OP is specifically asking why the tangent function on the 8087 only covers the interval [-π/4, π/4], rather than the full interval between the poles. The Taylor series around 0 converges for the whole interval. – Sven Marnach Mar 28 '20 at 20:57
  • @SvenMarnach Good point, and I've expanded the answer to explain why that's still usable. – Chromatix Mar 28 '20 at 21:36
  • Double-extended has 64-bit significand, not 63. Moreover, the most significant bit is even explicitly stored in the format, unlike single and double precision formats. – Ruslan Mar 29 '20 at 08:36
  • @Ruslan I excluded that most-significant bit because, for the range of numbers in question, it is fixed at 1 and thus not significant. – Chromatix Mar 29 '20 at 09:50
  • Erm... I don't quite get why being fixed at 1 makes it not significant. Its value changes with exponent, so it's not like a useless constant digit. Besides, for denormals (which are inside the range (-2⁶³,2⁶³)) it's zero. – Ruslan Mar 29 '20 at 10:17
  • @Ruslan Denormals occur for very small absolute values; the OP asked why the 80387 imposed a limit of 2^63, which is nowhere near the range represented as denormals. In that range the msb is completely redundant, as its value is implied by that of the exponent field. So there are only 63 bits of precision represented in the significand field, except for those tiny values that are denormalised. – Chromatix Mar 29 '20 at 13:49
  • The fact that we can exactly represent each of the unsigned 64-bit integers in the double-extended precision format makes all of the 64 bits of the significand meaningful: the whole 64-bit precision is there, even if the MSB of the normalized significand is fixed to 1. The redundancy of the MSB is a mere consequence of normalization, which is a feature of the format, not of floating-point arithmetic with a 64-bit significand. Yes, the bit could be thrown away as in double and single precision, but the precision of numbers is still 64 bits. – Ruslan Mar 29 '20 at 14:36
  • x87 operates on double extended-precision floating-point format and this has 64 bits in the significand, and the most significant bit is present. So the value is a one or zero, the radix point, and then 63 ones and zeroes. Thus multiplying this by 2^64 would create a zero in the LSb position, when it would be unknown what that bit value actually is (leaving aside that floating-points are usually an approximation of a real). So 2^63 is the maximum you can multiply the significand without introducing any new bits? – Single Malt Mar 29 '20 at 16:12
  • @SingleMalt More precisely, above 2^63 the Unit in Last Place would be larger than π/4, the limit of valid support of the underlying function used to approximate tan(x). – Chromatix Mar 29 '20 at 18:35
  • @Ruslan: Having the MSB of the significand stored explicitly could improve performance if an implementation were to use a "not-necessarily-normalized" storage format. When performing a computation like a=b-c+d; on a processor without an FPU, if all values are of similar magnitude, normalizing after b-c could be not only expensive but counter-productive, since the normalization would have to be undone to re-align the significand with that of d. – supercat Mar 30 '20 at 15:53
  • @supercat there was a use for the not-always-zero MSB before 387: namely, unnormal results could be generated when an operation on denormal didn't restore the significance. (See Table 1-9 and the paragraphs before it here.) But it's not relevant to performance, only to tracking of significance. – Ruslan Mar 30 '20 at 16:09
  • @Ruslan: If an implementation keeps track of temporary values that might not be normalized, then a sequence like "a=b-c+d", with all values positive, can be simplified from "align b and c; add/subtract significands; normalize result of subtract; align that result and d; add/subtract significands; shift result of addition if needed" to eliminate the first normalization step, likely also improving the performance of aligning that result and d. – supercat Mar 30 '20 at 16:54
  • @supercat I'm just pointing out that no real-life implementation of x87 does this. – Ruslan Mar 30 '20 at 16:55
  • @Ruslan: Hardware implementations, no, but I would expect that software emulations probably did. IEEE-754 was designed to allow efficient software emulation as well as hardware implementations, and the design of the extended type substantially facilitates the former. – supercat Mar 30 '20 at 16:56
14

The restrictions on the range of arguments the transcendental instructions are able to handle are a direct result of hardware resource limitations in these early floating-point units. The primary source for the implementation details of the transcendental instructions in the 8087 is:

Rafi Nave, "Implementation of transcendental functions on a numerics processor". Microprocessing and Microprogramming, Vol. 11, No. 3-4, March-April 1983, pp. 221-225

It states that the microcode size was limited to about 500 lines for all of the transcendentals combined. Because of limited hardware, the algorithms used are based on CORDIC for the initial steps, followed by rational approximation once the partial remainder has become sufficiently small.
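
For flavour, here is a toy floating-point CORDIC in rotation mode, written in C; the real 8087 microcode used a fixed-point variant with a stored table of atan(2^-i) constants, handing off to a rational approximation once the residual angle was small:

#include <math.h>

/* Toy CORDIC in rotation mode: drives the angle to zero by fixed
   rotations of atan(2^-i), leaving (x, y) ~ gain * (cos, sin).
   Illustrative only; converges for |theta| up to about 1.74 rad. */
void cordic_sincos(double theta, int steps, double *s, double *c)
{
    double x = 1.0, y = 0.0, pow2 = 1.0, gain = 1.0;
    for (int i = 0; i < steps; i++) {
        double d = (theta >= 0.0) ? 1.0 : -1.0;
        double nx = x - d * y * pow2;
        double ny = y + d * x * pow2;
        x = nx;
        y = ny;
        theta -= d * atan(pow2);   /* hardware keeps these in a table */
        gain  *= sqrt(1.0 + pow2 * pow2);
        pow2  *= 0.5;
    }
    *c = x / gain;
    *s = y / gain;
}

The tangent then falls out as the ratio of the two accumulated components, which matches the two-result form that FPTAN exposes (see the last answer below).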

To fit the ~30kbit microcode ROM into the chip at all with the transistor densities and die sizes available at the time (the 8087 contained more transistors than the 8086), Intel had to resort to a special four-state ROM, as described in

Rafi Nave and John Palmer, "A Numeric Data Processor". In 1980 IEEE International Solid-State Circuits Conference, Philadelphia, PA, USA, February 13-15, 1980, pp. 108-109

A second source, which pertains to the 80387 (the part that relaxed several range restrictions on the transcendental instructions), describes the same combination of CORDIC and rational approximation as was used in the 8087:

Alan K. Yuen, "Intel's Floating-Point Processors". Electro/88 Conference Record, Boston, MA, USA, May 10-12, 1988, pp. 48/5/1-7.

Palmer and Morse, who were architects of the 8086/8087, published a book on the math coprocessor:

John F. Palmer and Stephen P. Morse, "The 8087 Primer", John Wiley & Sons, 1984

In the chapter on the transcendental functions, they likewise cite the severe restrictions on microcode size as the reason for the range limitations of the built-in transcendental instructions of the 8087. The fundamental design idea was to support only the most time-consuming part of these computations in hardware and let programmers handle the argument reduction in software.

All trigonometric functions can be computed via FPTAN, and all inverse trigonometric functions via FPATAN, as shown in "The 8087 Primer", obviating the need for direct hardware support given the severe hardware resource limitations of the 8087. For the 80387, more hardware resources were available, which allowed support for the FSINCOS instruction.
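
As a sketch of that approach in C (using the standard tangent half-angle identities; whether the Primer derives them in exactly this form is a detail of the book, and the library tan() stands in for FPTAN):

#include <math.h>

/* With t = tan(x/2): sin(x) = 2t/(1 + t^2), cos(x) = (1 - t^2)/(1 + t^2).
   Assumes x/2 has already been reduced into FPTAN's valid range. */
void sincos_via_fptan(double x, double *s, double *c)
{
    double t = tan(x * 0.5);    /* stand-in for FPTAN */
    double d = 1.0 + t * t;
    *s = 2.0 * t / d;
    *c = (1.0 - t * t) / d;
}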

njuffa
  • 1,466
  • 12
  • 11
  • From the Software Developer's Manual: "An important use of the FPREM instruction is to reduce the arguments of periodic functions. When reduction is complete, the instruction stores the three least-significant bits of the quotient in the C3, C1, and C0 flags of the FPU status word. This information is important in argument reduction for the tangent function (using a modulus of π/4), because it locates the original angle in the correct one of eight sectors of the unit circle". Is this step done first, then FPTAN? – Single Malt Mar 30 '20 at 20:04
  • 1
    @SingleMalt FPREM (better even FPREM1 introduced with the 80387) can be used for argument reduction. However, this means that the value of π/2 used in the process is only accurate to extended precision, which causes accuracy issues in the trig functions for arguments that are large in magnitude, or close to integer multiples of π/2. A numerically superior alternative would be to use the argument reduction algorithm by Payne and Hanek published in 1983. – njuffa Mar 30 '20 at 20:51
  • Had not heard of the Payne and Hanek algorithm. Interesting, and great answer with the C code so can step through. The quantification above of ~30kbit microcode limit with 500 lines for all of the transcendentals similarly excellent. – Single Malt Mar 31 '20 at 17:17
  • @Single Malt: The literature reference is: M. Payne and R. Hanek. “Radian reduction for trigonometric functions,” SIGNUM Newsletter, 18:19–24, 1983. Most people, including me, find that article very hard to understand. That's why I pointed you at my C implementation instead. The Nave/Palmer paper says "microcode (including constants) utilizes over 30,000 bits of ROM". Based on a high-resolution die photo, Ken Shirriff counted 26368 bits for the microcode alone. – njuffa Mar 31 '20 at 19:38
1

I remember that FPTAN didn't give the actual tan(x), but two results, and you had to divide one by the other to get tan(x) (and they were not sin(x) and cos(x), as one would hope).

Probably that's why it was "Partial".

fraxinus
  • 1,324
  • 5
  • 9