33

My teacher, who teaches "Logic" at the university, told us a story about Intel processors. It goes: "In the 90s, Intel had a bug in the calculation of mathematical functions like sine or cosine encoded in the processor. This bug created inconsistencies in some bank accounts, leading Intel to hire logicians in order to prove the correctness of the code."

I tried searching for this story on the web but did not find anything. Does anyone know anything about it, or can anyone point me to some sources?

Sep Roland
gomd
  • 3
    It's true. I got a CPU change out of it. I'll see if I can find a reference. – dave Oct 12 '20 at 14:30
  • 2
    There was a problem with the floating point multiplication. I did not hear about a sin/cos bug. – peterh Oct 12 '20 at 15:30
  • 40
    Such bugs do happen (see infamous FDIV bug mentioned by others), but this particular story appears to be a bit distorted. It's hard to imagine what use could a bank have for trigonometric functions, and the values of trigonometric functions have been tabulated for decades if not centuries to high precision - any discrepancy could be easily verified without hiring someone to examine algorithms. The FDIV bug is not a close match because Intel was indeed wrong on that one (and the error was not algorithmic but a missing column in a lookup table). Looking forward to a closer match. – Euro Micelli Oct 12 '20 at 16:55
  • There were also bugs in the 80386 with 32-bit integer multiplication in the A1 and A2 stepping, but that was in the mid 80's. Not sure how sin/cos would be used in banking... – mannaggia Oct 12 '20 at 18:04
  • @mannaggia off the top of my head, since markets react to numbers they're in part big feedback systems; wherever there is feedback there is a chance for harmonic motion. So a classic case of if f''(x) = -f(x) then f(x) = cos(x)? I'm not a quant, I'm speculating wildly. – Tommy Oct 12 '20 at 20:08
  • FYI: fsin is a different bug that came much later, caused by a hard-coded pi value that is not precise enough to handle very large inputs: http://www.cpushack.com/2014/10/15/has-the-fdiv-bug-met-its-match-enter-the-intel-fsin-bug/ – user3528438 Oct 12 '20 at 21:38
  • 17
    Also, every CPU has had a tremendous number of bugs, historically and now; google for "xxxx processor errata", e.g. the ARM A77 errata document is 59 pages: https://developer.arm.com/documentation/101992/0009/ – user3528438 Oct 12 '20 at 21:43
  • 10
    Unfortunately, Intel has recently gotten rid of its validation team: https://danluu.com/cpu-bugs/#update – forest Oct 13 '20 at 02:53
  • 20
    @user3528438 So true about the errata. These can be really annoying when you run into them. A digital signal processor that we used on a project back over a decade ago - a product which we still manufacture to this day - had a couple of rather unfortunate silicon anomalies. One of them sometimes resulted in corruption of the DMA control registers when using the USB controller's DMA mode 0. The only listed workaround was to use mode 1. The other resulted in occasional corruption of the DMA control registers when using mode 1... with a listed workaround of using mode 0... – reirab Oct 13 '20 at 15:33
  • 4
    @forest: More recently, former Intel principal engineer François Piednoël has said Skylake was more buggy than most previous designs. That matches up with the 2014 timeline that those anonymous sources report for Intel making changes to reduce QA / validation time: Skylake launched in mid 2015. It has bugs that required microcode updates to disable the loop buffer (LSD), promote mfence to serializing OoO exec, and disable the uop cache for JCC at a 32-byte boundary – Peter Cordes Oct 14 '20 at 03:08
  • 3
    @forest: and also disabled the hardware lock elision part of TSX, and sometimes RTM. That might partly be due to a new class of vulnerabilities being discovered (Spectre and then MDS) that weren't even anticipated when Skylake was being validated. But regardless, TSX has been a real Charlie Brown football situation: present in Haswell, disabled due to bugs. Present in Broadwell, disabled due to a different bug I think? Present in Skylake, then disabled due to security(?) bugs. – Peter Cordes Oct 14 '20 at 03:13
  • 1
    So basically Skylake now (with those microcode mitigations) has some performance glass-jaws that can limit throughput for small loops, which didn't exist in previous CPUs like Haswell. Ice Lake fixes some of that, (e.g. re-enable the loop buffer), but unfortunately many of the skylake-derived uarches like Coffee Lake don't. – Peter Cordes Oct 14 '20 at 03:26
  • 10
    A joke I heard at the time was: "At Intel, quality is job 0.999999999." – Doug Warren Oct 14 '20 at 13:22
  • 1
    More recently, AMD Ryzen 3000 had an issue where random number generation would always return -1. https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend/ – Mooing Duck Oct 14 '20 at 17:36
  • 1
    @DougWarren: What's the difference between Windows 3.1 and Windows 3.11? The calculator knows. Go to the Windows Calculator and type 3.11-3.1=. On Windows 3.10, the calculator will display "0.00"; on Windows 3.11, it will display "0.01". This issue is somewhat similar to a bug that was in the printf of Turbo C 2.0 but fixed in 2.1, where printf("%1.1f", 999.96); would output 000.0 [it would determine that the value was at least 100 but less than 1,000 and thus needed three digits to the left of the decimal... – supercat Oct 15 '20 at 17:25
  • ...but rounding would bump the value up to 1000.0, which actually has four digits to the left of the decimal]. – supercat Oct 15 '20 at 17:26
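
A quick way to see the two effects supercat describes: in binary floating point 3.11 - 3.1 is not exactly 0.01, and 999.96 sits right at a rounding boundary. Here is a minimal C sketch (using a modern, correct printf; the Turbo C 2.0 bug itself printed 000.0 for the second case):

    #include <stdio.h>

    int main(void)
    {
        /* Neither 3.11 nor 3.1 is exactly representable in binary, so the
           difference is slightly below 0.01; a careless two-digit display
           can show 0.00 or 0.01 depending on how it rounds. */
        printf("%.17f\n", 3.11 - 3.1);   /* prints roughly 0.00999999999999979 */

        /* 999.96 has three digits before the decimal point, but rounding to
           one fractional digit bumps it up to 1000.0, which has four.
           Turbo C 2.0's printf sized the output before rounding. */
        printf("%1.1f\n", 999.96);       /* a correct printf prints 1000.0 */
        return 0;
    }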

4 Answers

84

I suspect your teacher was referring to the FDIV Pentium bug, which led to a large outcry in the media at the time and for which Intel issued a recall.

This bug caused floating-point division to return incorrect results in some cases. It didn't affect only FDIV; some related instructions were affected too: the other division and remainder instructions, as well as FPTAN and FPATAN. Other trigonometric instructions, including FSIN and FCOS, were treated with suspicion but ultimately cleared.
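
For concreteness, the operand pair most often quoted as triggering the bug is 4195835 / 3145727: a correct FPU returns about 1.333820449136241, while an affected Pentium returns about 1.333739068902037, already wrong in the fifth significant digit. A minimal C sketch of that classic check (the tolerance is simply chosen far below the known error but far above ordinary double rounding):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        volatile double x = 4195835.0, y = 3145727.0;  /* classic FDIV test values */
        double q = x / y;

        /* ~1.333820449136241 on a correct FPU, ~1.333739068902037 on a flawed one */
        if (fabs(q - 1.333820449136241) > 1e-6)
            printf("FDIV bug detected (q = %.15f)\n", q);
        else
            printf("division looks correct (q = %.15f)\n", q);
        return 0;
    }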

It does, however, seem unlikely that this would cause problems in banks: financial applications typically avoid floating point representations, so errors in a floating-point instruction would be unlikely to affect them.
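
The standard illustration of why: amounts such as 0.01 or 0.10 have no exact binary representation, so repeated floating-point arithmetic drifts, whereas scaled integers (counting cents) stay exact. A minimal C sketch of the contrast, not any particular bank's practice:

    #include <stdio.h>

    int main(void)
    {
        double    balance_fp    = 0.0;
        long long balance_cents = 0;        /* scaled integer: 1 unit = 1 cent */

        /* Add ten cents, one thousand times. */
        for (int i = 0; i < 1000; i++) {
            balance_fp    += 0.10;          /* 0.10 is not exact in binary */
            balance_cents += 10;            /* exact */
        }

        printf("double:         %.17f\n", balance_fp);    /* not exactly 100 */
        printf("scaled integer: %lld.%02lld\n",
               balance_cents / 100, balance_cents % 100); /* exactly 100.00 */
        return 0;
    }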

See also the Wikipedia entry on this bug. Another famous Pentium bug was the F00F bug. It didn’t cause calculation errors but it could lead to lock-ups, and was worked around by specific handling in operating systems.

Sep Roland
Stephen Kitt
  • 24
    To this day, the Linux kernel has code to detect if your processor has this bug: https://github.com/torvalds/linux/blob/v5.9/arch/x86/kernel/fpu/bugs.c – IMSoP Oct 12 '20 at 18:40
  • 20
    FWIW, the FSIN and FCOS instructions are rather useless except perhaps for size optimization in code where you don't care about the correct result. They're slower than library implementations (and always have been) and also have serious accuracy problems. – R.. GitHub STOP HELPING ICE Oct 13 '20 at 00:34
  • I think there are some embedded Intel processors, based on older Pentiums, still made today that are affected by F00F. – forest Oct 13 '20 at 02:59
  • 3
    It might be worth adding that the Wikipedia page does not list any bank-related problems. – Stig Hemmer Oct 13 '20 at 07:56
  • 1
    Banking would use scaled integers for most things, but for calculating interest payments (fractional powers of numbers) wouldn't you use floating point calculations? – Chris H Oct 13 '20 at 09:29
  • 2
    @ChrisH No. Just go read the fine print for your account. The exponentials for compound interest are in practice always discretized (interest is calculated every day and compounded that way, with the per-day interest rate often determined in surprising ways, such as assuming there's exactly 30 days in every month or so). – TooTea Oct 13 '20 at 10:48
  • There is a reference to the Pentium FDIV bug in the game Mass Effect. – Andrew Morton Oct 13 '20 at 12:52
  • 3
    It depends which part of the bank. Retail banking, e.g. current and savings accounts, probably doesn't use floating point, but the treasury departments trading in options, swaps, FX, and bonds do use a lot of floating point in price calculations. Also, the bank's overall risk calculations use floating point. – mmmmmm Oct 13 '20 at 16:08
  • 1
    Basic credit/debit processing wouldn't need to be floating point, though in some implementations it might end up that way. But when you get into things like compound interest calculations, such as calculating the principal vs. interest component of loan payments, it starts to make a lot more sense. And weakly typed languages are either implicitly floating point, or go there the minute they need to. – Chris Stratton Oct 13 '20 at 16:46
  • 4
    Note that the floating point referred to here is binary (that is, base-2) floating point, as opposed to base-10 floating point. When or if banks do any floating point calculations, it's much more likely to be done in base-10 floating point, not base-2. Note, too, that some computers had hardware instructions for base-10 calculations before this, specifically for financial calculations. – Clockwork-Muse Oct 13 '20 at 18:58
  • 11
    I was working on the (Silicon Graphics) IRIX kernel at the time, and had the privilege of writing code in the OS loader to detect the use of potentially buggy instructions and, if found, overwrite them with an invalid machine language instruction. This would invoke the fault handler at runtime, where another bit of code would determine if the arguments were likely to get invalid results and if so implement the division in a software "longhand." Pretty expensive fix! – Myk Willis Oct 13 '20 at 19:56
  • 3
    In addition to the fact that most banking applications would use integer representations, in most applications using division, a small amount of inaccuracy in the result is not troublesome. The concept of an "average", for example, is fictitious, and only a few decimal places of answer are usually relevant. Banks absolutely would want their Pentiums replaced to avoid any possible risk of trouble, but any direct impact of the FDIV bug would have been limited to astronomers and physicists. – Russell Borogove Oct 13 '20 at 23:51
  • 12
    @ChrisH: In financial calculations, "never use (binary) floating-point" is a pretty well-known rule. (At least as an example for general audiences of where not to use FP, but it's probably actually true). Extended precision / fixed-point integer to make sure the fractional part (cents) isn't rounded, or possibly decimal floating-point (which x86 doesn't support in hardware, but PowerPC does). Note that binary (normal) floating point can't exactly represent a value such as 0.01 because the fractional part's denominator isn't a power of 2. – Peter Cordes Oct 14 '20 at 02:26
  • 1
    @PeterCordes yes I'm well aware of the reason and the general rule, though I've only implemented it myself in minor tasks. The only bit I wasn't aware of was how certain calculations (all involving powers) would be implemented or avoided – Chris H Oct 14 '20 at 05:51
  • The only thing I can think of is the use of some FFT approach for prediction/filtering/statistics leading to wrong planning, leading to inconsistencies and monetary loss... but it's just my wild guess. – Spektre Oct 14 '20 at 11:29
  • 12
    A lot of times it's stated as "financial applications should never use floating point," but what is really meant is "accounting applications should never use floating point." Many tasks in the realm of finance don't involve counting pennies, and many tasks outside of finance have similar problems with accurate number representation. That is why the libraries are described as "arbitrary precision math" and not "finance math." – fluffysheap Oct 14 '20 at 16:25
  • 1
    I was involved in implementing a small piece of software that had some accounting functionality. If I recall correctly, local law specified quite precisely how the calculations were to be performed: 4 decimal digits of precision for intermediate results, rounding to 2 decimal digits (the local equivalent of cents) when actually executing a transaction. The rounding mode was also explicitly given. I would expect other countries to have similar requirements. – Martin Modrák Oct 15 '20 at 10:48
  • 2
    @PeterCordes It's a well-known rule and, as with all generalizations, not true in all cases. IEEE-754 doubles have 15–17 significant decimal digits. There are many, many situations in which it is perfectly accurate to do the computations with doubles when working with currencies. You simply do the computations in doubles and then round correctly to your 2 or 3 significant digits for display. You really only run into problems if you do many, many, many calculations on the same numbers. I know of high performance trading software that uses doubles just fine for speed reasons. – Voo Oct 15 '20 at 17:24
  • 1
    @fluffysheap: Thank you, I knew "financial" seemed too broad a term; "accounting" is the word I wanted but couldn't think of at the time. – Peter Cordes Oct 15 '20 at 22:13
  • 1
    @Voo: A fundamental problem with floating-point math, which is shared ironically enough by the .NET Decimal type, is that computations may silently produce imprecise results. A program that uses 16-bit integers to represent numbers of pennies may fail if called upon to represent an amount of more than $327.67, but if it traps for overflow one will be able to guarantee that it will always either produce exact results or fail. A program that uses double-precision math may, by contrast, silently produce results which should be rounded one way, but instead end up rounded the other way. – supercat Feb 20 '23 at 15:58
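
A small sketch of the contrast supercat draws: a checked 16-bit penny counter fails loudly the moment a balance would exceed $327.67, whereas a double silently keeps going and may round an intermediate result the "wrong" way. The helper below is purely illustrative, not taken from any real accounting code:

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Add a (positive) amount in cents to a 16-bit balance, aborting
       instead of silently wrapping on overflow. */
    static short add_cents_checked(short balance, short amount)
    {
        if (balance > SHRT_MAX - amount) {
            fprintf(stderr, "overflow: balance would exceed $327.67\n");
            abort();
        }
        return (short)(balance + amount);
    }

    int main(void)
    {
        short balance = 32000;                        /* $320.00 */
        balance = add_cents_checked(balance, 500);    /* fine: $325.00 */
        printf("balance: $%d.%02d\n", balance / 100, balance % 100);
        balance = add_cents_checked(balance, 500);    /* $330.00 > $327.67: aborts */
        return 0;
    }
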
31

Stephen Kitt has already provided a good answer regarding the FDIV bug. I'll fill in some details about Intel employing logicians:

Because of this bug, Intel had to replace a lot of processors, which was very expensive. Not wanting to repeat this, they hired a number of computer scientists with a background in formal logic to prove the correctness of algorithms to be implemented in successors of the Pentium. If you want to know more about their research, check out the publications of two of these scientists: https://www.cl.cam.ac.uk/~jrh13/papers/index.html, https://scholar.google.com/citations?user=MACCA0cAAAAJ&hl=en

Bagnus
  • 14
    The FDIV bug wasn't an algorithm bug, it was a wrong implementation of a lookup table such that 5 cells that should have been +2 read as 0. Possibly an algorithm bug in whatever generated that table, but the Wikipedia description makes it sound like the storage cells were actually missing. This bug is also one of the major reasons why later Intel CPUs have a microcode-update interface: future problems discovered in more complicated CPUs can be mitigated, sometimes with a performance cost. – Peter Cordes Oct 14 '20 at 02:48
23

Intel had a rather complex bunch of hardware to compute a floating-point quotient in a way that yielded two bits per iteration. This required a rather large table listing all the combinations of bit patterns where part of the quotient should be 11 [rather than listing all patterns individually, the table had entries where each bit may be 0, 1, or X, such that e.g. a bit pattern of 100X01X would match 1000010, 1000011, 1001010, or 1001011, so the table didn't need an impossibly huge number of entries]. Unfortunately, part of the table got corrupted when it was being transferred from whatever tool was used to generate it into the chip design.
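
To make the "X" (don't-care) idea concrete: an entry like 100X01X can be stored as a mask/value pair, where the mask clears the don't-care positions before the comparison. A tiny C sketch of that matching scheme (the encoding is purely illustrative, not Intel's actual table format):

    #include <stdio.h>

    /* One table entry: where mask has a 1, the input bit must equal the
       corresponding bit of value; where mask has a 0, the bit is "X". */
    struct entry { unsigned mask, value; };

    static int matches(struct entry e, unsigned bits)
    {
        return (bits & e.mask) == e.value;
    }

    int main(void)
    {
        /* Pattern 100X01X: mask 1110110 (0x76), value 1000010 (0x42). */
        struct entry e = { 0x76, 0x42 };

        /* 1000010, 1000011, 1001010, 1001011 match; 1000000 does not. */
        unsigned tests[] = { 0x42, 0x43, 0x4A, 0x4B, 0x40 };
        for (int i = 0; i < 5; i++)
            printf("0x%02X -> %s\n", tests[i],
                   matches(e, tests[i]) ? "match" : "no match");
        return 0;
    }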

I find this approach to division somewhat curious, since it would have been quick to examine the divisor and produce a value which, when multiplied by both the divisor (rounding up) and dividend (rounding down), would force the new divisor to have its upper bits equal to 0.1111 or 0.11111111, which would make it easy to extract 4 or 8 bits per iteration. The final quotient would likely be slightly less than the correct value [never greater, given the directions of rounding earlier], but it would be close enough that only two or three successive-approximation steps should be needed at the end to clean things up.

In any case, the ultimate irony with the Intel FDIV bug is that, earlier, during the 386/387 era, there was a competing product by Weitek which could perform single-precision floating-point math much faster than Intel's chips, but didn't do double-precision math at all. Some programs which would normally have used double-precision math shipped versions for the Weitek which used single-precision math and thus produced less accurate results. Intel's marketing team decided to exploit this (by design, and regarded as acceptable) lack of precision by producing an ad which showed a motherboard with a dime-store calculator decorated with clown graphics where the CPU should have been, and the caption "Ask for genuine Intel Math CoProcessors, or who knows what math you’ll have to count on".

Cody Gray - on strike
supercat
3

I think this is probably referring to the Pentium FDIV bug (floating-point divide bug).

I don't recall any specific problems with trigonometry instructions.

dave
  • Trigonometry instructions are dependent on division. In early x87 FPUs it was explicit: there was a single trigonometric instruction and it gave two results. You had to divide one by the other to get tan/cotan (no, they were not sin and cos as one would initially think; the FSINCOS instruction appeared later) and do even more (including divisions) in order to get sin/cos. – fraxinus Oct 13 '20 at 07:04
  • 1
    Intel wasn't the only manufacturer to have trouble with floating point arithmetic. The DEC PDP-6 also gave unfortunate results for some of its floating point ops. I'm saying unfortunate instead of erroneous, because a floating point operation sometimes requires roundoff to provide a representable result. – Walter Mitty Oct 13 '20 at 10:52
  • 1
    There is/was a documentation bug for the error bounds of fsin: https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/. Intel used to claim 1ulp precision, but that's extremely far from true for some worst-case inputs near pi (e.g. less than 4 of 64 mantissa bits correct), and even worse for large-magnitude inputs. Because x87 uses its 64-bit-mantissa Pi value for range-reduction, not a higher precision value. But this doc bug was only discovered / corrected in 2014. – Peter Cordes Oct 14 '20 at 02:57
  • 1
    @PeterCordes: A better spec for intrinsics would be to specify that the result will be within a certain tolerance of the exact value for some value of input within a certain tolerance of the given input. There are very few non-contrived situations where a function that returns a value within one ulp of the sine of exactly x would be more useful than a function that was 1% faster and returned a value within 0.5ulp of the sine of some number within 0.5ulp of the specified value. – supercat Oct 15 '20 at 17:33
  • 1
    @PeterCordes: Actually, I suspect that for many purposes a function that returns the sine of 1.00000000000000003898x would be more useful than a function that returns the sine of exactly x, since accurately multiplying a floating-point number by 3.141592653589793115998 is a lot easier than accurately multiplying by π. – supercat Oct 15 '20 at 17:40
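
The range-reduction issue Peter Cordes describes above is easy to reproduce in plain C: reduce a large argument modulo 2π using only the 53-bit constant M_PI and compare against the library's sin(), which may use a much more precise value of π internally. A minimal sketch; how far the two results diverge depends on the C library:

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    int main(void)
    {
        double x = 1.0e16;   /* about 1.6e15 full periods of the sine */

        /* Naive range reduction: the tiny error in the double constant
           2*M_PI is multiplied by ~1.6e15 periods, so the reduced angle
           can be off by a sizable fraction of a radian. */
        double naive = sin(fmod(x, 2.0 * M_PI));

        /* A good library sin() performs range reduction with extra
           precision, so the two results can disagree badly. */
        double lib = sin(x);

        printf("naive reduction: %.17f\n", naive);
        printf("library sin:     %.17f\n", lib);
        return 0;
    }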