What is the fastest way to compute sin and cos together?

Question

I would like to compute both the sine and co-sine of a value together (for example to create a rotation matrix). Of course I could compute them separately one after another like a = cos(x); b = sin(x);, but I wonder if there is a faster way when needing both values.

Edit: To summarize the answers so far:

Vlad said, that there is the asm command FSINCOS computing both of them (in almost the same time as a call to FSIN alone)
Like Chi noticed, this optimization is sometimes already done by the compiler (when using optimization flags).
caf pointed out, that functions sincos and sincosf are probably available and can be called directly by just including math.h
tanascius approach of using a look-up table is discussed controversial. (However on my computer and in a benchmark scenario it runs 3x faster than sincos with almost the same accuracy for 32-bit floating points.)
Joel Goodwin linked to an interesting approach of an extremly fast approximation technique with quite good accuray (for me, this is even faster then the table look-up)

See also this question about native implementation of sin/cos: http://stackoverflow.com/questions/1640595 — Joel Goodwin, Apr 21 '10 at 15:29
try `sinx ~ x-x^3/6` and `cosx~1-x^2/4` as approximations if you care about speed more than accuracy. You can add on terms in either series as you put more weight on accuracy (http://en.wikipedia.org/wiki/Taylor_series scroll down to trig taylor series.) Note this is a general way to approximate any function you want that is differntiable `n` times. So if you have some bigger function that that sine's and cosine's belong to you will get a much bigger speed up if you approximate it instead of the sin,cos's independently. — ldog, Apr 23 '10 at 03:43
This is poor technique with very poor accuracy. See post by Joel Goodwin. Taylor series have been posted below. Please post it as an answer. — Danvil, Apr 23 '10 at 09:50
Well it depends on your requirements, if you want accuracy Taylor series will be a **good** approximation only if you need values of `x` close to some point `x_0`, then expand your Taylor series around `x_0` instead of 0. This will give you excellent accuracy near `x_0` but the farther you go the worse the results. You probably thought the accuracy sucks cause as you looked at the given asnwer and tried it for values far from `0`. That answer is with sin,cos expanded around 0. — ldog, Apr 23 '10 at 16:45

Vlad · Accepted Answer · 2011-12-19T23:01:35.340

54

Modern Intel/AMD processors have instruction FSINCOS for calculating sine and cosine functions simultaneously. If you need strong optimization, perhaps you should use it.

Here is a small example: http://home.broadpark.no/~alein/fsincos.html

Here is another example (for MSVC): http://www.codeguru.com/forum/showthread.php?t=328669

Here is yet another example (with gcc): http://www.allegro.cc/forums/thread/588470

Hope one of them helps. (I didn't use this instruction myself, sorry.)

As they are supported on processor level, I expect them to be way much faster than table lookups.

Edit:
Wikipedia suggests that FSINCOS was added at 387 processors, so you can hardly find a processor which doesn't support it.

Edit:
Intel's documentation states that FSINCOS is just about 5 times slower than FDIV (i.e., floating point division).

Edit:
Please note that not all modern compilers optimize calculation of sine and cosine into a call to FSINCOS. In particular, my VS 2008 didn't do it that way.

Edit:
The first example link is dead, but there is still a version at the Wayback Machine.

edited Dec 19 '11 at 23:01

answered Apr 21 '10 at 14:17

Vlad

34,303
6
78
192

Do you have any reference for this or a snippet how to call it in C or C#? – Danvil Apr 21 '10 at 14:20
Are compilers smart enough to use this instruction in case you write: x=cos(a);y=sin(a); That would be great. – phkahler Apr 21 '10 at 14:41
1

@phkahler: That would be great. Don't know if such optimization is used by the modern compilers. – Vlad Apr 21 '10 at 14:59
Do note that if you don't need a full 80 bit double extended result (just double or single precision, for example), it is possible to compute `cos` and `sin` much faster in software than with the `fsincos` hardware instruction. On the other hand, it's also (much) more work to implement that way. – Stephen Canon Apr 21 '10 at 15:08
@Stephen: which way is it faster to do in software? As Intel's manual states, the FSINCOS instruction itself is quite fast. – Vlad Apr 21 '10 at 15:11
13

The `fsincos` instruction is *not* "quite fast". Intel's own optimization manual quotes it as requiring between 119 and 250 cycles on recent micro-architectures. Intel's math library (distributed with ICC), by comparison, can *separately* compute `sin` and `cos` in less than 100 cycles, using a software implementation that uses SSE instead of the x87 unit. A similar software implementation that computed both simultaneously could be faster still. – Stephen Canon Apr 21 '10 at 15:50
@Stephen: could you please post the assembler code? I suppose the code uses some version of built-in sin computation (perhaps SSE-specific), and that sincos is available as well. – Vlad Apr 21 '10 at 16:08
@Stephen: and ~200 cycles is really fast: `FDIV` requires 40. – Vlad Apr 21 '10 at 16:10
3

@Vlad: The ICC math libraries are not open-source, and I don't have a license to redistribute them, so I can't post the assembly. I can tell you that there is no built-in `sin` computation for them to take advantage of, however; they use the same SSE instructions as everyone else. To your second comment, the speed relative to `fdiv` is immaterial; if there are two ways to do something and one is twice as fast as the other, it doesn't make sense to call the slower one "fast", regardless of how long it takes relative to some completely unrelated task. – Stephen Canon Apr 21 '10 at 16:23
@Stephen: I've looked through the Intel's manual; they mention the precision of the functions separately. Without the assembly code I would assume that they achieve speed by sacrificing the accuracy of computation. Could you confirm or disprove this assumption? – Vlad Apr 21 '10 at 16:29
1

The software `sin` function in their library delivers full double-precision accuracy. The `fsincos` instruction delivers somewhat more accuracy (double extended), but that extra accuracy gets thrown away in most programs that call the `sin` function, as its result is usually rounded to double precision by later arithmetic operations or a store to memory. In most situations, they deliver the same accuracy for practical use. – Stephen Canon Apr 21 '10 at 17:19
4

Note also that `fsincos` is not a complete implementation on its own; you need an additional range reduction step to put the argument into the valid input range for the `fsincos` instruction. The library `sin` and `cos` functions include this reduction as well as the core computation, so they are even faster (by comparison) than the cycle timings that I listed might indicate. – Stephen Canon Apr 21 '10 at 17:22

score 41 · Answer 2 · answered Apr 21 '10 at 14:46

41

Modern x86 processors have a fsincos instruction which will do exactly what you're asking - calculate sin and cos at the same time. A good optimizing compiler should detect code which calculates sin and cos for the same value and use the fsincos command to execute this.

It took some twiddling of compiler flags for this to work, but:

$ gcc --version
i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5488)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat main.c
#include <math.h> 

struct Sin_cos {double sin; double cos;};

struct Sin_cos fsincos(double val) {
  struct Sin_cos r;
  r.sin = sin(val);
  r.cos = cos(val);
  return r;
}

$ gcc -c -S -O3 -ffast-math -mfpmath=387 main.c -o main.s

$ cat main.s
    .text
    .align 4,0x90
.globl _fsincos
_fsincos:
    pushl   %ebp
    movl    %esp, %ebp
    fldl    12(%ebp)
    fsincos
    movl    8(%ebp), %eax
    fstpl   8(%eax)
    fstpl   (%eax)
    leave
    ret $4
    .subsections_via_symbols

Tada, it uses the fsincos instruction!

answered Apr 21 '10 at 14:46

Chi

21,886
6
35
37

This is cool! Could you explain what -mfpmath=387 is doing? And does it also work with MSVC? – Danvil Apr 21 '10 at 14:51
1

Note that `-ffast-math` and `-mfpmath` lead to different results in some cases. – Debilski Apr 21 '10 at 15:06
3

mfpmath=387 will force gcc to use x87 instructions instead of SSE instructions. I suspect MSVC has similar optimizations and flags, but i don't have MSVC handy to be sure. Using x87 instructions will likely be a detriment to performance in other code though, you should also look at my other answer, to use Intel's MKL. – Chi Apr 21 '10 at 15:18
My old gcc 3.4.4 from cygwin produces 2 separate calls to `fsin` and `fcos`. :-( – Vlad Apr 21 '10 at 15:19
Tried with Visual Studio 2008 with highest optimizations enabled. It calls 2 library functions `__CIsin` and `__CIcos`. – Vlad Apr 21 '10 at 15:24

score 13 · Answer 3 · answered Apr 21 '10 at 14:08

13

When you need performance, you could use a precalculated sin/cos table (one table will do, stored as a Dictionary). Well, it depends on the accuracy you need (maybe the table would be to big), but it should be really fast.

answered Apr 21 '10 at 14:08

tanascius

51,705
21
111
133

Then the input value needs to be mapped to [0,2*pi] (or smaller with additional checks) and this call to fmod eats away performance. In my (propably suboptimal) implementation I could not gain performance with the look-up table. Would you have any advice here? – Danvil Apr 21 '10 at 14:25
11

A precomputed table will almost certainly be slower than just calling `sin` because the precomputed table will trash the cache. – Andreas Brinck Apr 21 '10 at 14:25
1

It depends how big the table is. A 256-entry table is often quite accurate enough and uses only 1Kb... if you use it a lot wouldn't it get stuck in the cache without adversely affecting the rest of the app's performance? – Mr. Boy Apr 21 '10 at 15:16
@Danvil: Here is an example of a sine lookup table http://en.wikipedia.org/wiki/Lookup_table#Computing_sines. However it assumes that you already mapped your input to [0;2pi], too. – tanascius Apr 21 '10 at 15:48
@AndreasBrinck I wouldn't go that far. It Depends(TM). Modern caches are huge and lookup tables are small. Quite often if you take a bit of care in memory layout your lookup table need not make any difference to the cache utilisation of the rest of your computation. The fact that the lookup table fits inside the cache is one of the reasons it is so fast. Even in Java where it's difficult to control mem layout precisely, I've had massive performance wins with lookup tables. – Jarrod Smith Oct 23 '12 at 07:25

score 13 · Answer 4 · edited May 23 '17 at 12:34

Technically, you’d achieve this by using complex numbers and Euler’s Formula. Thus, something like (C++)

complex<double> res = exp(complex<double>(0, x));
// or equivalent
complex<double> res = polar<double>(1, x);
double sin_x = res.imag();
double cos_x = res.real();

should give you sine and cosine in one step. How this is done internally is a question of the compiler and library being used. It could (and might) well take longer to do it this way (just because Euler’s Formula is mostly used to compute the complex exp using sin and cos – and not the other way round) but there might be some theoretical optimisation possible.

Edit

The headers in <complex> for GNU C++ 4.2 are using explicit calculations of sin and cos inside polar, so it doesn’t look too good for optimisations there unless the compiler does some magic (see the -ffast-math and -mfpmath switches as written in Chi’s answer).

sorry, but Euler's Formula doesn't actually tell you *how* to compute something, it's just an identity (albeit a very useful one) that relates complex exponentials to real trigonometric functions. There are benefits of calculating sine and cosine together, but they involve common subexpressions and your answer does not discuss this. — Jason S, Jun 29 '17 at 02:42

score 12 · Answer 5 · edited Apr 21 '10 at 15:49

12

You could compute either and then use the identity:

cos(x)² = 1 - sin(x)²

but as @tanascius says, a precomputed table is the way to go.

edited Apr 21 '10 at 15:49

Michael Myers

184,092
45
284
291

answered Apr 21 '10 at 14:08

Mitch Wheat

288,400
42
452
532

8

And be aware that using this method involves computing a power and a square root, so if performance is important, make sure to verify that this is actually faster than computing the other trig function directly. – Tyler McHenry Apr 21 '10 at 14:11
4

`sqrt()` is often optimized in hardware, so it may very well be faster then `sin()` or `cos()`. The power is just self multiplication, so don't use `pow()`. There are some tricks to get reasonably accurate square-roots very quickly without hardware support. Lastly, be sure to profile before doing any of this. – deft_code Apr 21 '10 at 14:23
This is an interesting approach, though the additional square root and checks for the right sign, could reduce the performance. Any concrete advice here? – Danvil Apr 21 '10 at 14:26
12

Note that √(1 - cos^2 x) is less accurate than computing sin x directly, in particular when x ~ 0. – kennytm Apr 21 '10 at 14:37
1

For small x, the Taylor series for y=sqrt(1-x*x) is very nice. You can get good accuracy with the first 3 terms and it only requires a few multiplies and one shift. I've used it in fixed point code. – phkahler Apr 21 '10 at 14:46
1

@phkahler: Your Taylor series doesn't apply because when x ~ 0, cos x ~ 1. – kennytm Apr 21 '10 at 15:38
@Kenny - I use it for sin values up to about 0.4 where cos is not really 1. – phkahler Apr 21 '10 at 15:44
"compute either and then use the identity" --> better to compute `sin()` first to retain precision. – chux - Reinstate Monica Jul 02 '19 at 13:29
Would it be easier to compute tan(x/2), then find cos(x) = (1-tan^2(x/2))/(1+tan^2(x/2)) and sin(x) = 2tan(x/2)/(1+tan^2(x/2)) ? – Math Machine Apr 08 '22 at 00:42
easier where? ..... – Mitch Wheat Apr 08 '22 at 01:48

score 10 · Answer 6 · answered Apr 21 '10 at 23:54

If you use the GNU C library, then you can do:

#define _GNU_SOURCE
#include <math.h>

and you will get declarations of the sincos(), sincosf() and sincosl() functions that calculate both values together - presumably in the fastest way for your target architecture.

Joel Goodwin · Answer 7 · 2018-02-22T09:16:45.490

8

There is very interesting stuff on this forum page, which is focused on finding good approximations that are fast: http://www.devmaster.net/forums/showthread.php?t=5784

Disclaimer: Not used any of this stuff myself.

Update 22 Feb 2018: Wayback Machine is the only way to visit the original page now: https://web.archive.org/web/20130927121234/http://devmaster.net/posts/9648/fast-and-accurate-sine-cosine

edited Feb 22 '18 at 09:16

answered Apr 21 '10 at 14:33

Joel Goodwin

4,886
25
30

I tried this one as well, and it gave me quite good performance. But sin and cos are computed independently. – Danvil Apr 21 '10 at 14:44
My feeling is this sine/cosine calculation will be faster than getting sine and using a square root approximation to get cosine, but a test will verify that. The primary relationship between sine and cosine is one of phase; is it possible to code so you could re-use the sine values you calculate for the phase-shifted cosine calls by taking this into account? (This may be a stretch, but had to ask) – Joel Goodwin Apr 21 '10 at 14:52
Not directly (despite the question asking exactly this). I need sin and cos of a value x and there is no way to know if at some other place I coincidentally computed x+pi/2 ... – Danvil Apr 21 '10 at 15:06
I used it in my game to draw a circle of particles. Since it's just a visual effect, the result is close enough, and perfomance is really impressive. – Maxim Kamalov Jan 09 '15 at 14:02
I'm not impressed; [Chebyshev approximations](https://www.embeddedrelated.com/showarticle/152.php) usually give you the most accuracy for a given performance. – Jason S Jun 29 '17 at 02:45
@EvanBenn looks indeed like the whole forum has been scrubbed so I've added a Wayback Machine link to the original – Joel Goodwin Feb 22 '18 at 09:17

Joseph Quinsey · Answer 8 · 2010-05-03T05:49:43.617

Many C math libraries, as caf indicates, already have sincos(). The notable exception is MSVC.

Sun has had sincos() since at least 1987 (twenty-three years; I have a hard-copy man page)
HPUX 11 had it in 1997 (but isn't in HPUX 10.20)
Added to glibc in version 2.1 (Feb 1999)
Became a built-in in gcc 3.4 (2004), __builtin_sincos().

And regarding look-up, Eric S. Raymond in the Art of Unix Programming (2004) (Chapter 12) says explicitly this a Bad Idea (at the present moment in time):

"Another example is precomputing small tables--for example, a table of sin(x) by degree for optimizing rotations in a 3D graphics engine will take 365 × 4 bytes on a modern machine. Before processors got enough faster than memory to demand caching, this was an obvious speed optimization. Nowadays it may be faster to recompute each time rather than pay for the percentage of additional cache misses caused by the table.

"But in the future, this might turn around again as caches grow larger. More generally, many optimizations are temporary and can easily turn into pessimizations as cost ratios change. The only way to know is to measure and see." (from the Art of Unix Programming)

But, judging from the discussion above, not everyone agrees.

"365 x 4 bytes". You need to account for leap years, so that should actually be 365.25 x 4 bytes. Or maybe he meant to use the number of degrees in a circle instead of the number of days in an earth year. — Ponkadoodle, May 15 '12 at 03:51
@Wallacoloo: Nice observation. I missed it. But the error is in the [original](http://books.google.ca/books?id=H4q1t-jAcBIC&lpg=PA291&ots=-d1AHuBh3u&dq=degree%20for%20optimizing%20rotations%20in%20a%203D%20graphics%20engine%20will%20take&pg=PA291#v=onepage&q=degree%20for%20optimizing%20rotations%20in%20a%203D%20graphics%20engine%20will%20take&f=false). — Joseph Quinsey, May 16 '12 at 14:03
LOL. Plus, he **neglects the fact** that in many of the computer games of that area, you will only need a finite number of angles. There are no cache misses then, if you know the possible angles. I'd use tables exactly in this case, and give `fsincos` (CPU instruction!) a try for the others. It's often as fast as interpolating sin and cos from a large table. — Erich Schubert, Jan 05 '13 at 15:32

score 5 · Answer 9 · answered Apr 21 '10 at 14:27

I don't believe that lookup tables are necessarily a good idea for this problem. Unless your accuracy requirements are very low the table needs to be very large. And modern CPUs can do a lot of computation while a value is fetched from main memory. This is not one of those questions which can be properly answered by argument (not even mine), test and measure and consider the data.

But I'd look to the fast implementations of SinCos that you find in libraries such as AMD's ACML and Intel's MKL.

score 4 · Answer 10 · answered Apr 21 '10 at 15:10

4

If you are willing to use a commercial product, and are calculating a number of sin/cos calculations at the same time (so you can use vectored functions), you should check out Intel's Math Kernel Library.

It has a sincos function

According to that documentation, it averages 13.08 clocks/element on core 2 duo in high accuracy mode, which i think will be even faster than fsincos.

answered Apr 21 '10 at 15:10

Chi

21,886
6
35
37

1

Similarly, on OSX one can use `vvsincos` or `vvsincosf` from the Accelerate.framework. I believe that AMD has similar functions in their vector library as well. – Stephen Canon Apr 21 '10 at 21:28

score 3 · Answer 11 · answered May 13 '10 at 09:35

3

This article shows how to construct a parabolic algorithm that generates both the sine and the cosine:

DSP Trick: Simultaneous Parabolic Approximation of Sin and Cos

http://www.dspguru.com/dsp/tricks/parabolic-approximation-of-sin-and-cos

answered May 13 '10 at 09:35

Probes

41
2

1

hmmm... I need to do a shootout between this and [Chebyshev approximation](https://www.embeddedrelated.com/showarticle/152.php) which I think would win. – Jason S Jun 29 '17 at 02:47

score 2 · Answer 12 · answered May 24 '14 at 00:40

You may want to have a look at http://gruntthepeon.free.fr/ssemath/, which offers an SSE vectorized implementation inspired from CEPHES library. It has good accuracy (maximum deviation from sin/cos on the order of 5e-8) and speed (slightly outperforms fsincos on a single call basis, and a clear winner over multiple values).

score 2 · Answer 13 · answered Apr 21 '10 at 14:09

2

When performance is critical for this kind of thing it is not unusual to introduce a lookup table.

answered Apr 21 '10 at 14:09

Tom Cabanski

7,608
2
21
25

score 2 · Answer 14 · answered Apr 21 '10 at 14:44

2

For a creative approach, how about expanding the Taylor series? Since they have similar terms, you could do something like the following pseudo:

numerator = x
denominator = 1
sine = x
cosine = 1
op = -1
fact = 1

while (not enough precision) {
    fact++
    denominator *= fact
    numerator *= x

    cosine += op * numerator / denominator

    fact++
    denominator *= fact
    numerator *= x

    sine += op * numerator / denominator

    op *= -1
}

This means you do something like this: starting at x and 1 for sin and cosine, follow the pattern - subtract x^2 / 2! from cosine, subtract x^3 / 3! from sine, add x^4 / 4! to cosine, add x^5 / 5! to sine...

I have no idea whether this would be performant. If you need less precision than the built in sin() and cos() give you, it may be an option.

answered Apr 21 '10 at 14:44

Tesserex

16,806
5
64
105

Actually the i-the sine extension factor is x/i times the i-the cosine extension factor. But I would doubt that using the Taylor series is really fast ... – Danvil Apr 21 '10 at 14:48
1

Chebyshev is much better than Taylor for polynomial function approximation. Don't use Taylor approximation. – Timmmm Mar 28 '17 at 08:28
There are a bunch of numerical faux pas here; numerator and denominator both quickly become large and that leads to floating-point errors. Not to mention how do you decide what "not enough precision" is and how to calculate it? Taylor approximation is good in the neighborhood around a single point; away from that point they quickly become inaccurate and require a large number of terms, which is why Timmmm's suggestion about Chebyshev approximation (which creates good approximations over a given interval) is a good one. – Jason S Jun 29 '17 at 02:52

score 2 · Answer 15 · answered Nov 04 '11 at 15:07

There is a nice solution in the CEPHES library which can be pretty fast and you can add/remove accuracy quite flexibly for a bit more/less CPU time.

Remember that cos(x) and sin(x) are the real and imaginary parts of exp(ix). So we want to calculate exp(ix) to get both. We precalculate exp(iy) for some discrete values of y between 0 and 2pi. We shift x to the interval [0, 2pi). Then we select the y that is closest to x and write
exp(ix)=exp(iy+(ix-iy))=exp(iy)exp(i(x-y)).

We get exp(iy) from the lookup table. And since |x-y| is small (at most half the distance between the y-values), the Taylor series will converge nicely in just a few terms, so we use that for exp(i(x-y)). And then we just need a complex multiplication to get exp(ix).

Another nice property of this is that you can vectorize it using SSE.

score 1 · Answer 16 · answered Sep 15 '13 at 20:14

1

An accurate yet fast approximation of sin and cos function simultaneously, in javascript, can be found here: http://danisraelmalta.github.io/Fmath/ (easily imported to c/c++)

answered Sep 15 '13 at 20:14

user2781980

11
1

score 1 · Answer 17 · answered May 09 '10 at 13:14

1

I have posted a solution involving inline ARM assembly capable of computing both the sine and cosine of two angles at a time here: Fast sine/cosine for ARMv7+NEON

answered May 09 '10 at 13:14

jcayzac

1,402
1
13
23

score 0 · Answer 18 · answered Apr 21 '10 at 14:11

0

Have you thought of declaring lookup tables for the two functions? You'd still have to "calculate" sin(x) and cos(x), but it'd be decidedly faster, if you don't need a high degree of accuracy.

answered Apr 21 '10 at 14:11

Frank Shearar

16,809
8
66
92

x y · Answer 19 · 2019-09-01T07:15:44.350

The MSVC compiler may use the (internal) SSE2 functions

 ___libm_sse2_sincos_ (for x86)
 __libm_sse2_sincos_  (for x64)

in optimized builds if appropriate compiler flags are specified (at minimum /O2 /arch:SSE2 /fp:fast). The names of these functions seem to imply that they do not compute separate sin and cos, but both "in one step".

For example:

void sincos(double const x, double & s, double & c)
{
  s = std::sin(x);
  c = std::cos(x);
}

Assembly (for x86) with /fp:fast:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    ___libm_sse2_sincos_
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
mov     eax, DWORD PTR _c$[esp-4]
shufpd  xmm0, xmm0, 1
movsd   QWORD PTR [eax], xmm0
ret     0

Assembly (for x86) without /fp:fast but with /fp:precise instead (which is the default) calls separate sin and cos:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_sin_precise
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_cos_precise
mov     eax, DWORD PTR _c$[esp-4]
movsd   QWORD PTR [eax], xmm0
ret     0

So /fp:fast is mandatory for the sincos optimization.

But please note that

___libm_sse2_sincos_

is maybe not as precise as

__libm_sse2_sin_precise
__libm_sse2_cos_precise

due to the missing "precise" at the end of its name.

On my "slightly" older system (Intel Core 2 Duo E6750) with the latest MSVC 2019 compiler and appropriate optimizations, my benchmark shows that the sincos call is about 2.4 times faster than separate sin and cos calls.

aenigma · Answer 20 · 2022-04-09T14:24:30.173

The fastest way is the approximation of functions with single precision by Chebyshev polynomials. It is better to set the argument in degrees, so less resources are spent on reducing the argument. The following code is proposed for Visual Studio. This is an example in C++, but the assembly code can be adapted to other environments.

void _declspec(naked) _vectorcall SinCosD(float x, float &s, float &c)
{
  _declspec(align(16)) static const float ct[8] = // Таблица констант
  {                            
    -1/180.0f,                 // Множитель для приведения x
    -0.0f,                     // 80000000h
    1.74532924E-2f,            // b0/90 = c0
    90.0f,                     // Константа для перехода от cos к sin
    1.34955580E-11f,           // b2/90^5 = c2
    3.91499677E-22f,           // b4/90^9 = c4
    -8.86095677E-7f,           // b1/90^3 = c1
    -9.77249307E-17f           // b3/90^7 = c3
  };
  _asm
  {                            
    mov eax,offset ct          // В eax - адрес таблицы констант
    vmovaps xmm1,[eax]         // xmm1 = 90 # c0 : 80000000h # -1/180
    vmovddup xmm4,[eax+16]     // xmm4 = c4 # c2 : c4 # c2
    vmulss xmm1,xmm1,xmm0      // xmm1 = 90 # c0 : 80000000h # -x/180
    vmovddup xmm5,[eax+24]     // xmm5 = c3 # c1 : c3 # c1
    vcvtss2si eax,xmm1         // eax = -k, где k - округлённое до целых значение x/180
    vshufps xmm2,xmm1,xmm1,93  // xmm2 = 90 # 80000000h
    imul eax,180               // eax = -180*k; of=1, если переполнение
    jno sc_cont                // В случае слишком большого |x| считать, как при x=0
    sub eax,eax                // Для этого обнулить eax
    vxorps xmm0,xmm0,xmm0      // и обнулить xmm0
      sc_cont:                 // Продолжаем для корректного значения x
    vcvtsi2ss xmm1,xmm1,eax    // xmm1 = -180*k в позиции 0
    vaddss xmm1,xmm1,xmm0      // xmm1 = x-k*180 = 90*t - число в диапазоне [-90; 90]
    shl eax,29                 // При нечётном k установить знаковый бит eax
    vmovd xmm0,eax             // В xmm0 - знаковая маска результата
    vorps xmm2,xmm2,xmm1       // xmm2 = -90 # -|90*t|
    vmovlhps xmm0,xmm0,xmm0    // Знаковую маску скопировать в старшую половину xmm0
    vhsubps xmm2,xmm2,xmm1     // xmm2 = 90*t : 90-|90*t| - приведённые аргументы
    vxorps xmm0,xmm0,xmm2      // В xmm0 - приведённые аргументы с учётом знака
    vmovsldup xmm2,xmm2        // xmm2 = 90*t # 90*t : 90-|90*t| # 90-|90*t|
    vmulps xmm2,xmm2,xmm2      // xmm2 = p # p : q # q - аргументы многочлена
    vmovhlps xmm1,xmm1,xmm1    // xmm1 = c0 : с0 (свободный член)
    vfmadd231ps xmm5,xmm4,xmm2 // xmm5 = c3+c4*p # c1+c2*p : c3+c4*q # c1+c2*q
    vmulps xmm3,xmm2,xmm2      // xmm3 = p^2 : q^2
    vmovshdup xmm4,xmm5        // xmm4 = c3+c4*p : c3+c4*q
    vfmadd231ps xmm5,xmm4,xmm3 // xmm5 = c1+c2*p+c3*p^2+c4*p^3 : c1+c2*q+с3*q^2+с4*q^3
    vfmadd231ps xmm1,xmm5,xmm2 // xmm1 = сумма для синуса : сумма для косинуса
    vmulps xmm0,xmm0,xmm1      // xmm0 = sin x : cos x - готовый результат (-1)^k*t*f(t)
    vmovss [edx],xmm0          // Сохранить косинус в переменной c
    vextractps [ecx],xmm0,2    // Сохранить синус в переменной s
    ret                        // Вернуться
  }
}

What is the fastest way to compute sin and cos together?

20 Answers20

Linked