Why are dual numbers needed only in forward-mode autodiff?

Question

I'm trying to understand autodiff better, and specifically the connection between autodiff and dual numbers, and why dual numbers are needed in the first place.

The PyTorch help pages about autodiff [1][2], for example, do not mention dual numbers at all. The Wikipedia page, as well as other sources, suggests that they are only used in forward-mode automatic differentiation.

My question is - why do we need dual numbers at all?

My intuition is that it is just an elegant way to store both the function value and its derivative. But I think this can be done with any data structure where, for each operation, you store the function evaluation and its derivative (based on elementary rules like the product rule, quotient rule, etc., and on the derivatives of primitives like polynomials, exponentials, etc.).

I'm failing to see the actual benefit of the dual-number representation.

I don't see how you're coming to the conclusion that they're not used in the reverse-mode differentiation, they definitely are. — EMP, Jun 14 '22 at 13:30
"Forward mode automatic differentiation is accomplished by augmenting the algebra of real numbers and obtaining a new arithmetic" - Wikipedia. There were also other sources that stated it, emphasizing the forward mode and not the reverse mode. — Maverick Meerkat, Jun 14 '22 at 14:42
I don't know why this question was downvoted or there was a close vote. Even if the question is predicated on a false assumption (e.g. that dual numbers aren't used in reverse-mode AD when in fact they are), the question itself is perfectly clear. — Daniel Shapero, Jun 15 '22 at 16:11
@DanielShapero the "false assumption" is what Wikipedia, the PyTorch manuals, and every other source I could find on the web in 3 days of searching (including the lecture notes by Chris Rackauckas cited in the current highest-voted answer) state. It's also the only thing that makes sense after thinking about it. I hope your "even" does not bear judgment :-) — Maverick Meerkat, Jun 16 '22 at 10:19
No judgment at all -- I don't know enough about how AD is implemented to say one way or the other. All I'm trying to say is that a question that might be based on false premises isn't necessarily a bad question. — Daniel Shapero, Jun 16 '22 at 16:42

score 5 · Answer 1 · answered Jun 15 '22 at 19:53

Dual numbers are one way of implementing forward-mode automatic differentiation. But any implementation is mathematically equivalent to dual numbers, so in some sense any implementation is just an implementation of the dual number algebra on some tuple of numbers for the primal and derivative. It's all just semantics.
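As a minimal sketch of that "tuple of primal and derivative" view (the names `Dual`, `_lift` and `dsin` are illustrative assumptions, not taken from any library):

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Hypothetical dual number a + b*eps with eps**2 == 0."""
    primal: float   # function value
    tangent: float  # derivative carried alongside the value

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps**2 = 0
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, i.e. duals with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def dsin(x: Dual) -> Dual:
    """sin extended to duals: sin(a + b eps) = sin(a) + cos(a) * b eps."""
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)


def f(x):
    return x * x * 3 + dsin(x)


# Seed the tangent with 1.0 to obtain df/dx alongside f(x) in one pass.
y = f(Dual(2.0, 1.0))
```

Here `y.primal` holds $f(2)$ and `y.tangent` holds $f'(2) = 6 \cdot 2 + \cos 2$. Whether you call the pair a "dual number" or just a struct with overloaded operators, the arithmetic is the same.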

In lecture 8 of the MIT 18.337 lecture notes, you can see that dual numbers are a good pedagogical tool because they highlight the analogy to complex-step arithmetic, making the motivation extremely clear (i.e. instead of storing the derivative in the lower end of your previous number, store it in another 64-bit slot according to the Taylor expansion).
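A rough sketch of that analogy, assuming a toy function `f` and step sizes chosen only for illustration: a finite difference has to dig the derivative out of the low-order digits of the function value, while the complex step carries it in a separate (imaginary) slot.

```python
import cmath
import math


def f(x):
    # Written with cmath so the same code accepts real or complex input.
    return x * x * x + cmath.exp(x)


x0 = 1.5
exact = 3 * x0**2 + math.exp(x0)

# Finite difference: the derivative lives in the low-order digits of
# f(x0 + h) - f(x0), so subtractive cancellation limits the accuracy.
h_fd = 1e-8
finite_diff = (f(x0 + h_fd) - f(x0)).real / h_fd

# Complex step: the derivative is carried in the imaginary part instead,
# so no subtraction is needed and h can be made tiny.
h_cs = 1e-20
complex_step = f(complex(x0, h_cs)).imag / h_cs

print(abs(finite_diff - exact), abs(complex_step - exact))
```

A dual number takes the last step: the extra slot holds the derivative itself rather than the derivative scaled by a small `h`, so there is no step size to choose at all.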

And note it's not correct to think of dual numbers as "just" an analogy. They are an implementation of the nilpotent infinitesimals of smooth infinitesimal analysis. This is probably not the motivation most people would cite, but there is an entire sect of non-standard analysis which formalizes the use of nilpotent epsilons (i.e. $\epsilon^2 = 0$), so it's a truly consistent branch of analysis. Bell's book is a nice in-depth treatment. That said, it's an odd area of NSA where the law of the excluded middle must be dropped, so no proofs by contradiction, but it does lead to a very powerful algebraic method equivalent to calculus.
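To make the algebra concrete (a standard worked step, written here for a smooth $f$ and real $a, b, c, d$): because $\epsilon^2 = 0$, the Taylor expansion terminates exactly after the linear term,

$$f(a + b\epsilon) = f(a) + f'(a)\,b\,\epsilon,$$

and multiplying two such numbers gives

$$(a + b\epsilon)(c + d\epsilon) = ac + (ad + bc)\,\epsilon + bd\,\epsilon^2 = ac + (ad + bc)\,\epsilon,$$

which is the product rule read off from the $\epsilon$ coefficient.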

I suggest to specify in your answer that those MIT notes are your own work. — Federico Poloni, Jun 15 '22 at 22:07
Yes, after thinking some more, I agree with you that it's more than an analogy. It's quite a complete implementation. — Maverick Meerkat, Jun 16 '22 at 10:07

score 3 · Answer 2 · answered Jun 14 '22 at 15:41

As stated on the Wikipedia page, by the definition of a dual number, the part multiplying epsilon is the derivative. As you said, this is useful in automatic differentiation because it provides a structure that stores both the function and its derivative. A proper dual number setup will work for essentially any function you would want to use (trigonometric functions, exponential functions, etc.), and therefore allows for easy Fréchet differentiation. This is useful because the goal of most automatic differentiation isn't the full Jacobian matrix of the function (as this requires a lot of memory), but the Jacobian-vector product, and dual numbers are great at computing it.
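A small sketch of a Jacobian-vector product computed with dual numbers, assuming a toy map `f` and a hand-rolled `Dual` type (both illustrative, not from any library): seeding the tangent slots with the direction vector yields $J v$ in one forward pass, without ever forming $J$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Illustrative dual number: value plus directional derivative."""
    primal: float
    tangent: float

    def __add__(self, other):
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    def __mul__(self, other):
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)


def f(x, y):
    """Toy map R^2 -> R^2 with Jacobian [[y, x], [1, 2y]]."""
    return x * y, x + y * y


def jvp(func, point, direction):
    """Return (func(point), J(point) @ direction) from one forward pass."""
    inputs = [Dual(p, v) for p, v in zip(point, direction)]
    outputs = func(*inputs)
    return [o.primal for o in outputs], [o.tangent for o in outputs]


# At (2, 3) the Jacobian is [[3, 2], [1, 6]]; apply it to v = (1, -1).
value, jv = jvp(f, (2.0, 3.0), (1.0, -1.0))
```

Recovering the full Jacobian this way would take one pass per input direction, which is why the product with a single vector is the natural output of forward mode.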

You can see an example of the reverse mode here:

https://ntrs.nasa.gov/api/citations/20160000689/downloads/20160000689.pdf

EMP


The dual in the paper you linked is for the dual solution, not dual numbers. Jacobian vector product means you don't have to calculate all the jacobians in the computation graph and store them in memory, but simply take the product of them with the vector of interest - so this is matrix linear algebra, and has nothing to do with dual numbers. – Maverick Meerkat Jun 15 '22 at 13:16
It's a similar idea for the forward and the reverse, so if you see the forward as using dual numbers then you see the reverse mode as doing so. If instead you spend your time insisting there is a meaningful difference between referring to dual numbers as an analogy to automatic differentiation rather than the mathematical description of what we are doing, then yes you see there as being a difference. This specific work has everything to do with dual numbers because they use dual numbers/reverse mode automatic differentiation to calculate the transpose vector product. – EMP Jun 15 '22 at 18:07
I agree it's not just an analogy. It's a complete representation. I changed my answer accordingly. Still - dual numbers $\neq$ reverse mode autodiff. – Maverick Meerkat Jun 16 '22 at 10:18

Maverick Meerkat · Answer 3 · 2022-06-16T10:06:46.313

I strongly disagree with EMP. After looking at just about everything I could find on the topic over the past few days, I have come to the following conclusions:

1. "Dual numbers" as data structures are simply an elegant way to implement automatic differentiation: you store both the function value and its derivative in the same data structure, and you define how elementary operations ($+,-,\times,\div,x^n,\sqrt x, \sin x, \cos x, e^x, \ln x$, etc.) work on both elements in the structure.
2. This is useful in forward mode, as you compute both the function value and its derivative in one pass. In reverse mode / backprop you first compute the function value all the way to the end of the computational graph, and then traverse back for the derivative. Since the operations for the function value and the function derivative are separate, there's no point in creating this dual structure and dual operations.
3. Dual numbers as a mathematical concept are equivalent to the data structure. If math symbols help you, go ahead and learn the math notation.
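A minimal sketch of the two-pass structure described above for reverse mode, assuming a hypothetical `Var` node type (not a library class). In this particular sketch each node records its inputs together with the local partial derivatives during the forward pass; that is one possible design, and other implementations defer those partials to the backward sweep.

```python
import math


class Var:
    """Hypothetical node in a computational graph (not a library type)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input node, local partial)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Reverse sweep: visit nodes in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule: accumulate d(output)/d(parent).
            parent.grad += node.grad * local


x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)   # forward pass: builds the graph, stores local partials
backward(z)          # backward pass: one sweep fills x.grad and y.grad
```

After the sweep, `x.grad` is $y + \cos x$ and `y.grad` is $x$. Both partials come from a single backward pass, which is the many-inputs, one-output case where reverse mode pays off.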

edited Jun 16 '22 at 10:06

answered Jun 15 '22 at 13:14


On 2., in reverse mode one usually also computes the local or node derivatives in the forward pass. This is trivial for the binary operations, but requires appropriate evaluations for the unary elementary functions. // On 3., this structure can also be interpreted as Taylor polynomials of degree 1 and their operations, with truncation to the same degree. – Lutz Lehmann Jun 15 '22 at 18:19
@LutzLehmann if you already compute the node derivatives on the forward pass, why go backward? This doesn't make sense. It totally eliminates the benefits of backprop (reverse-mode autodiff) for many-to-one functions (many weights, one loss). – Maverick Meerkat Jun 16 '22 at 10:09
I said the node derivatives, from node input to node output. In the backward pass the gradients from graph output to node output (up to the output of the graph input nodes) are assembled via chain rule. It is not that important when the node derivatives get computed, the largest (conceptual) difference is probably with sine and cosine, as one logical implementation is as a combined node. – Lutz Lehmann Jun 16 '22 at 11:10
