8

In many resources/videos I see comments being made along the lines of "and we can see here that we have a logarithm/exponential so this will be an expensive computation to make." (such as with the sigmoid activation function for a NN). What is behind these statements?

A more specific example would be: what makes $e^{-x}$ so different from something like $x^4$ or $4^x$ in a computer calculation?

Federico Poloni
  • 11,344
  • 1
  • 31
  • 59
DChaps
  • 183
  • 1
  • 4
  • 1
    Do it on paper yourself - with a pencil and no calculator. How long would it take you to do 123+456? Or 123x456? Long division - 123/456? How about a Log(123)? Or e^3, even? Computers need to perform the same number of steps humans do, more or less, to solve a given problem. – J... May 03 '21 at 14:46
  • @J. A rather dangerous simplification. By that logic dividing two doubles with 15 places should take much longer than dividing two shorter doubles. Computers use very different algorithms for virtually any arithmetic than what you'd do in your head (if you do mental arithmetic using a carry-lookahead adder kind of approach, hats off) and the intuition for how long something takes really doesn't work. – Voo May 03 '21 at 15:09
  • 3
    @Voo The comparison here is between operations, so let's not extrapolate wildly. Obviously it's a rough approximation, but I think by the time someone develops a cursory understanding of the required steps for the given algorithms then it becomes highly unsurprising that computing a logarithm requires much more work than a simple addition or multiplication. The argument about precision is specious because the computer always computes the same precision regardless of the ostensible "size" of the operands. I said "on paper" to enforce the notion that we're not using fast mental tricks. – J... May 03 '21 at 15:22
  • @J But the point is that computers don't do the same "on paper" algorithms humans use. So yes arguing that "computers work totally different than humans that's why size doesn't matter" is exactly my point. Or to take another example: Multiplication and division are on paper pretty similar, but a modern CPU can take up to 30 times longer to do integer division than multiplication. Then there are tricks computers can do that vastly improve performance of otherwise expensive operations (hi lookup tables in hardware for computing sine) and so on. – Voo May 03 '21 at 21:20
  • @Voo No, they don't, but it's a reasonably close first approximation that is useful for an order of magnitude estimate that is readily accessible to the intuitions of a trained mathematician. I'm not debating that the exact algorithms are different - multiplication can be easily parallelized on the die, division not. A 64-bit divide can be 4-7x slower than a 32-bit divide (on an x64 chip) where multiplication is the same speed for both, for example - sure, I get it. Getting 19 decimal places of a logarithm is still a beast of a problem, no matter how you slice it. – J... May 03 '21 at 23:10

2 Answers

23

To add to Lutz Lehmann's answer, you can look up the latency for the CPU instructions in this comprehensive table by Agner Fog. For example, on the Intel Ivy Bridge processors:

  • FADD / FSUB (floating point add and subtract) both take 3 cycles
  • FMUL (multiply) takes 5 cycles
  • FDIV (divide) takes 10-24 cycles
  • FYL2X ($y \cdot \log_2(x)$) takes 90-106 cycles
  • F2XM1 ($2^x - 1$) takes about 68 cycles

To calculate an exponential with a base other than 2, you have to use a combination of the FYL2X and F2XM1 instructions, via $a^x = 2^{x \log_2 a}$.

In addition to the reasons that Lutz stated, the IEEE-754 standard has pretty stringent requirements for how accurately transcendental functions have to be evaluated -- correct to the last bit and correct rounding. For many applications, this level of accuracy is completely unnecessary, and people will use approximations in software that take far fewer cycles than the built-in hardware instructions. For example, this blog post shows how to quickly calculate an approximation to the exponential function if you're willing to trade off some accuracy. I gather that in ML it's also quite common to use lower precision floating point numbers -- just 32, 16, or even 8 bits -- instead of the full 64.
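To get a concrete feel for what such a trade-off looks like, here is a small sketch in C of a Schraudolph-style bit trick (this is not the code from the linked post; the function name fast_exp and the fudge constant are illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Cheap exp() approximation in the spirit of the fast approximations
   mentioned above (Schraudolph-style; not taken from any particular library).
   It builds the IEEE-754 bit pattern of roughly 2^(x / ln 2) directly, so the
   whole thing is one multiply, one add, and a store.  The price: only ~2-3%
   relative error, and it assumes a little-endian machine. */
static double fast_exp(double x)
{
    const double a = 1048576.0 / 0.6931471805599453;  /* 2^20 / ln 2           */
    const double b = 1072693248.0 - 60801.0;          /* exponent bias - fudge */
    union { double d; int32_t i[2]; } u = { 0.0 };
    u.i[1] = (int32_t)(a * x + b);  /* high word: sign | exponent | top mantissa bits */
    return u.d;
}

int main(void)
{
    for (double x = -4.0; x <= 4.0; x += 2.0)
        printf("x = %4.1f   exp = %10.5f   fast_exp = %10.5f\n", x, exp(x), fast_exp(x));
    return 0;
}
```

A few percent of relative error is useless for a numerical library, but for something like a neural network activation it can be an acceptable price for replacing a transcendental instruction or library call with two floating point operations.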

Daniel Shapero
  • 10,263
  • 1
  • 28
  • 59
  • One could also add the other direction of IEEE-754 compliance: many math libraries have software implementations for these functions, as (some of) the FPU versions are deemed too unreliable. So one could also get a speed-up, with potentially less accuracy, by using direct calls to the FPU. – Lutz Lehmann May 02 '21 at 07:12
  • 8
    Agner's instruction tables definitely give you the right order-of-magnitude intuition. However, these x87 instructions won't be used directly on modern architectures. In fact, I would be curious to know if you've ever seen a compiler emit an x87 instruction like FYL2X or F2XM1. I haven't. About the only time you're going to get this is if you have hand-written, carefully optimized assembly. Compilers use SSE and later instruction sets, not the legacy x87, in most cases. And complex math operations like this are going to result in calls to library functions anyway, not inline code. – Cody Gray - on strike May 02 '21 at 10:23
  • 1
    @CodyGray Yes, I sort of assumed that the timings would be taken as a heuristic rather than as exact values, but I didn't say so. The Compiler Explorer nicely illustrates your point. – Daniel Shapero May 02 '21 at 17:53
  • 1
    Re "the IEEE-754 standard has pretty stringent requirements for how accurately transcendental functions have to be evaluated -- correct to the last bit and correct rounding." IEEE-754 recommends a set of correctly-rounded math functions but does not require them. – njuffa May 02 '21 at 19:58
  • IEEE754 only requires bit-exact results for +-*/ and sqrt, IIRC. For some of the functions defined in IEEE754, I believe that there aren't even known, proven algorithms of bit-exact results over their whole range. – MSalters May 03 '21 at 09:03
  • 1
    @MSalters "For some of the functions defined in IEEE754, I believe that there aren't even known, proven algorithms of bit-exact results over their whole range." That's nonsense. There might not be acceptably fast proven bit-exact algorithms, but there certainly are algorithms that give you bit-exact results for all of them. – orlp May 03 '21 at 10:11
  • @orlp: "Nonsense" ?! I don't mean elementary functions such as sin and tan, but functions like the Gaussian error function erf and the various Bessel functions. And note that I specifically say "proven". – MSalters May 03 '21 at 10:21
  • @MSalters See e.g. erf. But you don't even have to be explicit. erf and the Bessel functions are finite integrals of locally Lipschitz continuous functions, and thus well-behaved. This means there exist some constant precision with which you can do a Riemann summation and get a correct answer up to some specified amount of digits. – orlp May 03 '21 at 10:34
  • @orlp: I don't follow the subject in great detail, so it's good to know that we now have proof for erf. But note that when IEEE754:2008 was published, this 2011 paper did not yet exist. Are you really suggesting that there are similar papers published for all of the functions in the current IEEE754 standard? – MSalters May 03 '21 at 10:42
  • Computing things in arbitrary precision isn't exactly new tech. Mathematica has done that forever. – Federico Poloni May 03 '21 at 18:31
11

$\exp$, $\sin$, $\tan$ and their inverse and otherwise related functions are transcendental, defined by infinite power series. This means it takes some effort to evaluate uniformly good approximations. This is done via argument reduction using the properties of these functions, followed by polynomial approximations or quotients of polynomials à la Padé approximation (but again shifted toward some kind of uniform convergence).

So indeed a sigmoid $\frac1{1+e^{-x}}$ should take somewhat more cycles than, for instance, $\frac{0.5+\max(0,x)}{1+|x|}$.
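To make that comparison concrete, here are the two expressions written out in plain C (the names sigmoid and soft_sigmoid are ad hoc; njuffa's comment below times essentially the same pair on a GPU):

```c
#include <math.h>
#include <stdio.h>

/* The two candidate activations from above.  The first needs a full exp()
   call; the second gets by with max, abs, one add and one divide. */
static double sigmoid(double x)      { return 1.0 / (1.0 + exp(-x)); }
static double soft_sigmoid(double x) { return (0.5 + fmax(0.0, x)) / (1.0 + fabs(x)); }

int main(void)
{
    for (double x = -3.0; x <= 3.0; x += 1.5)
        printf("x = %5.2f   sigmoid = %.4f   soft = %.4f\n", x, sigmoid(x), soft_sigmoid(x));
    return 0;
}
```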

Power functions are evaluated via $x^y=\exp(y\log(x))$, or some wrapper around it that strips off parts that are simpler to compute; especially for non-integer $y$ they are thus more expensive than the exponential function alone.
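The overall structure can be sketched like this (a toy illustration, not any libm's actual code; exp_sketch and pow_sketch are made-up names, and a real implementation would use carefully fitted minimax coefficients and an extra-precision reduction step):

```c
#include <math.h>
#include <stdio.h>

/* Argument reduction for exp: write e^x = 2^m * e^r with r = x - m*ln(2),
   so |r| <= ln(2)/2, approximate e^r by a short polynomial on that small
   interval, and scale back with ldexp. */
static double exp_sketch(double x)
{
    const double ln2 = 0.6931471805599453;
    int m = (int)round(x / ln2);           /* integer multiple of ln 2 to split off */
    double r = x - m * ln2;                /* reduced argument, |r| <= ln(2)/2      */
    /* degree-5 Taylor polynomial for e^r in Horner form */
    double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6 + r * (1.0/24 + r * (1.0/120)))));
    return ldexp(p, m);                    /* multiply by 2^m */
}

/* General powers then reduce to exp and log: x^y = exp(y*log(x)) for x > 0,
   which is why a non-integer power costs at least as much as an exponential. */
static double pow_sketch(double x, double y)
{
    return exp_sketch(y * log(x));
}

int main(void)
{
    printf("exp_sketch(1.0)      = %.10f  (exp: %.10f)\n", exp_sketch(1.0), exp(1.0));
    printf("pow_sketch(2.0, 0.5) = %.10f  (pow: %.10f)\n", pow_sketch(2.0, 0.5), pow(2.0, 0.5));
    return 0;
}
```

Even this toy version involves a reduction step, a handful of multiply-adds, and a rescaling, which is already an order of magnitude more work than a single addition or multiplication.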

Lutz Lehmann
  • 6,044
  • 1
  • 17
  • 25
  • Depending on the hardware platform, the performance difference can be fairly small. I compared 1.0f / (1.0f + expf (-a)); and (0.5f + fmaxf (0.0f, a)) / (1.0f + fabsf (a)); using CUDA 11.1 on a Quadro P2000. Compiled with nvcc -o sigmoid_test.exe -arch=sm_61 -use_fast_math sigmoid_test.cu. The throughput of the former is $136 \cdot 10^{9}$ function calls per second, the throughput of the latter is $183 \cdot 10^{9}$ function calls per second. – njuffa May 02 '21 at 20:25
  • $\tan$ can not be defined by a power series at all, because of the discontinuities. And arguably $\exp$ and $\sin$ aren't, at least conceptually, defined by power series either, but by differential equations. (Which is why directly computing $e^x$ for random given $x$ is expensive, but then approximating it also for nearby $x'$ actually becomes much easier. Unfortunately, computers hardly make use of this fact.) – leftaroundabout May 03 '21 at 09:34
  • @njuffa this is probably a result of pipelining of some sorts. The exponential version is still much more expensive, but most of that computation is done while something else happens (memory fetches, probably), therefore you don't see it. In the max / abs version, you still have the same memory issues, during which the actual processors just sit idle. I think if you tried an expression with more work on a single input field, you'd see more drastic differences. – leftaroundabout May 03 '21 at 09:38
  • 1
    @leftaroundabout The main reason is that GPUs have dedicated hardware to compute exp2 and 1/x. This dedicated hardware is fast but not quite as fast as a regular FMA. Both implementations compile to five instructions for the sm_61 architecture used by the Quadro P2000, but the sigmoid is slower because it requires the special hardware twice (MUFU.EX2 and MUFU.RCP), while the alternative needs it only once (MUFU.RCP). – njuffa May 03 '21 at 09:57
  • @leftaroundabout sigmoid: FMUL32I.FTZ R4, R2, -1.4426950216293334961 ; RRO.EX2 R8, R4 ; MUFU.EX2 R5, R8 ; FADD.FTZ R6, R5, 1 ; MUFU.RCP R6, R6 ; alternate: FADD.FTZ R7, |R2|, 1 ; FMNMX.FTZ R6, RZ, R2, !PT ; MUFU.RCP R7, R7 ; FADD.FTZ R6, R6, 0.5 ; FMUL.FTZ R6, R6, R7 ; – njuffa May 03 '21 at 10:00
  • @leftaroundabout : $e^x$ can be reduced as $2^m·e^{x-m\ln(2)}$, where the exponent reduction is done in extended precision to reduce cancellation effects. Then the remaining exponential is indeed computed as a rational function, something like $e^x=\frac{g(x^2)+x}{g(x^2)-x}$ with some suitable polynomial $g$ giving a somewhat uniform error on $[0,1]$. Implementations can differ in the choice of uniformity, giving different coefficients for $g$. – Lutz Lehmann May 03 '21 at 10:29
  • cont. @leftaroundabout : $\tan$ has the argument reduction $\tan(\frac\pi2-x)=1/\tan x$, of course periodicity and symmetry, and the double angle formulas. So the interval where the actual function has to be evaluated can be reduced to, for instance, $[0,\frac\pi8]$. In such a small interval around $0$ it can then be approximated by some polynomial, or rational function. – Lutz Lehmann May 03 '21 at 10:34
  • @LutzLehmann Implementing $\exp$ with a rational approximation at the core would be unusual these days. Usually a minimax polynomial approximation is used, possibly in combination with a table. Making full use of FMA (fused multiply-add) where available. – njuffa May 03 '21 at 19:28
  • @njuffa : I had seen this form relatively recently (in the last 5 years) in one libm implementation. I noticed it because I would have expected something close to a balanced Padé fraction. I have not done further research into if this form allows for less terms in $g$ than in a purely polynomial approximation, so that the higher cost of division would be justified. The advantage is that $e^x·e^{-x}=1$ is automatic in such an approximation. – Lutz Lehmann May 03 '21 at 19:46
  • 1
    @LutzLehmann The problem with rational approximations is not just the cost of division, but also increased error (rounding error from evaluation two polynomials and from the division). There are library source bases that carry forward design styles from the 1990s, while the field has progressed in the past 25 years. The last time I used a rational approximation for $2^{x}-1$ (F2XM1 instruction) was in the AMD Athlon processor ca. 1997. – njuffa May 03 '21 at 19:56