Useless Snippet #4: Basic trigonometry (sin/cos)

Goal: Calculate the sin/cos of an arbitrary angle.
Restrictions:

  1. Should be faster than CRT’s sin/cos and faster than FPU’s fsin/fcos
  2. The error should be kept to a minimum compared to the above functions
  3. Double precision is required
  4. Function should be in the form: double func(double), ie. no xmm regs passed to the function, result should be returned on the FPU stack and it should calculate only 1 angle at a time

For the rest of the post we’ll be talking about sin(). Calculating cos(), or even sincos() should be easily derived from the code below. The required changes will be described at the end of the post.

Where do we start?
From an approximation of course. Mr. Taylor is our friend here. Let’s take a look at the general form of the Taylor series for sin(x):

where x=a is the point about which we are expanding our function.

Now let’s see how the first 9 terms look like compared to the original function, when the series is expanded around zero (a = 0).

sin(x) vs Taylor series of 9th degree

Visually, they seem really close in the [0, PI] range, but the error starts getting larger in the [PI, 2PI] range, untils it explodes. For the record, the maximum absolute error in the range [0, PI] is ~6.9e-3, which isn’t good for a double precision function. In order to reduce the error, one option is to add more terms to the series. But that would hurt performance.

Another option, which we will use here, is to subdivide the [0, PI] range into N subranges (e.g. 16, 32, 64, etc.) and use a different constant (a) for each range. This way the overall error will be kept to a minimum and we won’t need more terms.

By the way, the same idea is implemented in msvcr90.dll::__libm_sse2_sin(). Calling this function directly is a bit complicated, because it relies on having the angle stored into xmm0 register. There are two ways to call this function. Either enable SSE2 instruction set (/arch:SSE2 compiler flag) or GetProcAddress() the function from msvcr90.dll and make a wrapper in order to put the angle into xmm0 and put the result on the FPU stack, in order to fullfil restriction #4.

Reordering equation (1) gives:

So, for any angle x, we have to do the following:

  1. Calculate the index of the subrange in which ‘x’ belongs. ‘a’ would be the range’s start
  2. Calculate d=x-a, sin(a) and cos(a)
  3. Evaluate function (2)

For step 2 we can use a look-up table, like the one below.

__declspec(align(16))
static const double g_SinCosConstantTable[] = 
{
	sin( 0.0 * (_PI / 32.0)), cos( 0.0 * (_PI / 32.0)),
	sin( 1.0 * (_PI / 32.0)), cos( 1.0 * (_PI / 32.0)),
	sin( 2.0 * (_PI / 32.0)), cos( 2.0 * (_PI / 32.0)),
	sin( 3.0 * (_PI / 32.0)), cos( 3.0 * (_PI / 32.0)),
	sin( 4.0 * (_PI / 32.0)), cos( 4.0 * (_PI / 32.0)),
	sin( 5.0 * (_PI / 32.0)), cos( 5.0 * (_PI / 32.0)),
	sin( 6.0 * (_PI / 32.0)), cos( 6.0 * (_PI / 32.0)),
	sin( 7.0 * (_PI / 32.0)), cos( 7.0 * (_PI / 32.0)),
	sin( 8.0 * (_PI / 32.0)), cos( 8.0 * (_PI / 32.0)),
	sin( 9.0 * (_PI / 32.0)), cos( 9.0 * (_PI / 32.0)),
	sin(10.0 * (_PI / 32.0)), cos(10.0 * (_PI / 32.0)),
	sin(11.0 * (_PI / 32.0)), cos(11.0 * (_PI / 32.0)),
	sin(12.0 * (_PI / 32.0)), cos(12.0 * (_PI / 32.0)),
	sin(13.0 * (_PI / 32.0)), cos(13.0 * (_PI / 32.0)),
	sin(14.0 * (_PI / 32.0)), cos(14.0 * (_PI / 32.0)),
	sin(15.0 * (_PI / 32.0)), cos(15.0 * (_PI / 32.0)),
	sin(16.0 * (_PI / 32.0)), cos(16.0 * (_PI / 32.0)),
	sin(17.0 * (_PI / 32.0)), cos(17.0 * (_PI / 32.0)),
	sin(18.0 * (_PI / 32.0)), cos(18.0 * (_PI / 32.0)),
	sin(19.0 * (_PI / 32.0)), cos(19.0 * (_PI / 32.0)),
	sin(20.0 * (_PI / 32.0)), cos(20.0 * (_PI / 32.0)),
	sin(21.0 * (_PI / 32.0)), cos(21.0 * (_PI / 32.0)),
	sin(22.0 * (_PI / 32.0)), cos(22.0 * (_PI / 32.0)),
	sin(23.0 * (_PI / 32.0)), cos(23.0 * (_PI / 32.0)),
	sin(24.0 * (_PI / 32.0)), cos(24.0 * (_PI / 32.0)),
	sin(25.0 * (_PI / 32.0)), cos(25.0 * (_PI / 32.0)),
	sin(26.0 * (_PI / 32.0)), cos(26.0 * (_PI / 32.0)),
	sin(27.0 * (_PI / 32.0)), cos(27.0 * (_PI / 32.0)),
	sin(28.0 * (_PI / 32.0)), cos(28.0 * (_PI / 32.0)),
	sin(29.0 * (_PI / 32.0)), cos(29.0 * (_PI / 32.0)),
	sin(30.0 * (_PI / 32.0)), cos(30.0 * (_PI / 32.0)),
	sin(31.0 * (_PI / 32.0)), cos(31.0 * (_PI / 32.0)),
	sin(32.0 * (_PI / 32.0)), cos(32.0 * (_PI / 32.0)),
	sin(33.0 * (_PI / 32.0)), cos(33.0 * (_PI / 32.0)),
	sin(34.0 * (_PI / 32.0)), cos(34.0 * (_PI / 32.0)),
	sin(35.0 * (_PI / 32.0)), cos(35.0 * (_PI / 32.0)),
	sin(36.0 * (_PI / 32.0)), cos(36.0 * (_PI / 32.0)),
	sin(37.0 * (_PI / 32.0)), cos(37.0 * (_PI / 32.0)),
	sin(38.0 * (_PI / 32.0)), cos(38.0 * (_PI / 32.0)),
	sin(39.0 * (_PI / 32.0)), cos(39.0 * (_PI / 32.0)),
	sin(40.0 * (_PI / 32.0)), cos(40.0 * (_PI / 32.0)),
	sin(41.0 * (_PI / 32.0)), cos(41.0 * (_PI / 32.0)),
	sin(42.0 * (_PI / 32.0)), cos(42.0 * (_PI / 32.0)),
	sin(43.0 * (_PI / 32.0)), cos(43.0 * (_PI / 32.0)),
	sin(44.0 * (_PI / 32.0)), cos(44.0 * (_PI / 32.0)),
	sin(45.0 * (_PI / 32.0)), cos(45.0 * (_PI / 32.0)),
	sin(46.0 * (_PI / 32.0)), cos(46.0 * (_PI / 32.0)),
	sin(47.0 * (_PI / 32.0)), cos(47.0 * (_PI / 32.0)),
	sin(48.0 * (_PI / 32.0)), cos(48.0 * (_PI / 32.0)),
	sin(49.0 * (_PI / 32.0)), cos(49.0 * (_PI / 32.0)),
	sin(50.0 * (_PI / 32.0)), cos(50.0 * (_PI / 32.0)),
	sin(51.0 * (_PI / 32.0)), cos(51.0 * (_PI / 32.0)),
	sin(52.0 * (_PI / 32.0)), cos(52.0 * (_PI / 32.0)),
	sin(53.0 * (_PI / 32.0)), cos(53.0 * (_PI / 32.0)),
	sin(54.0 * (_PI / 32.0)), cos(54.0 * (_PI / 32.0)),
	sin(55.0 * (_PI / 32.0)), cos(55.0 * (_PI / 32.0)),
	sin(56.0 * (_PI / 32.0)), cos(56.0 * (_PI / 32.0)),
	sin(57.0 * (_PI / 32.0)), cos(57.0 * (_PI / 32.0)),
	sin(58.0 * (_PI / 32.0)), cos(58.0 * (_PI / 32.0)),
	sin(59.0 * (_PI / 32.0)), cos(59.0 * (_PI / 32.0)),
	sin(60.0 * (_PI / 32.0)), cos(60.0 * (_PI / 32.0)),
	sin(61.0 * (_PI / 32.0)), cos(61.0 * (_PI / 32.0)),
	sin(62.0 * (_PI / 32.0)), cos(62.0 * (_PI / 32.0)),
	sin(63.0 * (_PI / 32.0)), cos(63.0 * (_PI / 32.0)),
};

Now let’s take a look at the implementation.

FPU version

double fast_sin_fpu(double x)
{
	static const double _PI = 3.1415926535897932384626433832795; // Digits taken from the Window's calculator :)
	static const double _32_OVER_PI = 32.0 / _PI;
	static const double _PI_OVER_32 = _PI / 32.0;
	static const double _INV_FACT_2 = 1.0 / 2.0;
	static const double _INV_FACT_3 = 1.0 / 6.0;
	static const double _INV_FACT_4 = 1.0 / 24.0;
	static const double _INV_FACT_5 = 1.0 / 120.0;
	static const double _INV_FACT_6 = 1.0 / 720.0;
	static const double _INV_FACT_7 = 1.0 / 5040.0;
	static const double _INV_FACT_8 = 1.0 / 40320.0;
	static const double _INV_FACT_9 = 1.0 / 362880.0;
 
	// Calculate alpha (the Taylor constant) by dividing the [0, PI] range in 32 subranges
	// and selecting the lower part of each range as alpha.
	double scaledAngle = x * _32_OVER_PI;
 
	// This is the same as floor(scaledAngle + 0.5) but it avoids the function call.
	int arrayIndex = _mm_cvtsd_si32(_mm_load_sd(&scaledAngle));
	double a = arrayIndex * _PI_OVER_32; // NOTE: At this point 'arrayIndex' will be negative for a negative angle.
 
	arrayIndex &= 0x0000003F; // Bring back to [0, 63] range. Note that this removes the sign from the array index.
	arrayIndex <<= 1; // 2 constants for each subrange
 
	double d = x - a;
	double d2 = d * d;
 
	double t1 =     (1.0 + d2 * (d2 * (_INV_FACT_4 + d2 * (d2 * _INV_FACT_8 - _INV_FACT_6)) - _INV_FACT_2));
	double t2 = d * (1.0 + d2 * (d2 * (_INV_FACT_5 + d2 * (d2 * _INV_FACT_9 - _INV_FACT_7)) - _INV_FACT_3));
 
	return  g_SinCosConstantTable[arrayIndex + 0] * t1 + 
		g_SinCosConstantTable[arrayIndex + 1] * t2;
}

The code first calculates the index into the constant table (arrayIndex) holding the correct sin(a)/cos(a). It then calculates t1 and t2, which are the sin(a) and cos(a) factors respectively, by factoring out (x-a)^2. Finally, it calculates sin(x) using t1, t2, sin(a) and cos(a) using equation (2).

Maximum absolute error [0, 2PI] = 2.498e-016
Performance = ~61 cycles

Error-wise it seems pretty good. Performance-wise it’s better that the standard sin(x) function (~108 cycles). By taking a look at the code, it becomes obvious that there are a lot of similarities between t1 and t2. This means that we can calculate both of them at the same time using SSE2. Let’s try it.

SSE2 version

double fast_sin_sse2(double x)
{
	static const double _PI = 3.1415926535897932384626433832795; // Digits taken from the Window's calculator :)
	static const double _32_OVER_PI = 32.0 / _PI;
	static const double _PI_OVER_32 = _PI / 32.0;
	__declspec(align(16)) static const double _INV_FACT_2_3[2] = {1.0 / 2.0, 1.0 / 6.0};
	__declspec(align(16)) static const double _INV_FACT_4_5[2] = {1.0 / 24.0, 1.0 / 120.0};
	__declspec(align(16)) static const double _INV_FACT_6_7[2] = {1.0 / 720.0, 1.0 / 5040.0};
	__declspec(align(16)) static const double _INV_FACT_8_9[2] = {1.0 / 40320.0, 1.0 / 362880.0};
	__declspec(align(16)) static const double _ONE[2] = {1.0, 1.0};
 
	// Calculate alpha (the Taylor constant) by dividing the [0, PI] range in 32 subranges
	// and selecting the lower part of each range as alpha.
	__m128d xmm_x = _mm_load_sd(&x);
	__m128d xmm_scaledAngle = _mm_mul_sd(xmm_x, _mm_load_sd(&_32_OVER_PI));
 
	int arrayIndex = _mm_cvtsd_si32(xmm_scaledAngle);
 
	__m128d xmm_a = _mm_cvtsi32_sd(_mm_setzero_pd(), arrayIndex);
	xmm_a = _mm_mul_sd(xmm_a, _mm_load_sd(&_PI_OVER_32));
 
	arrayIndex &= 0x0000003F;
	arrayIndex <<= 1; // 2 constants for subrange
 
	__m128d xmm_d = _mm_sub_sd(xmm_x, xmm_a);
	xmm_d = _mm_unpacklo_pd(xmm_d, xmm_d);
	__m128d xmm_d2 = _mm_mul_pd(xmm_d, xmm_d);
 
	__m128d xmm_t = _mm_mul_pd(xmm_d2, _mm_load_pd(&_INV_FACT_8_9[0]));
	xmm_t = _mm_sub_pd(xmm_t, _mm_load_pd(&_INV_FACT_6_7[0]));
	xmm_t = _mm_mul_pd(xmm_t, xmm_d2);
	xmm_t = _mm_add_pd(xmm_t, _mm_load_pd(&_INV_FACT_4_5[0]));
	xmm_t = _mm_mul_pd(xmm_t, xmm_d2);
	xmm_t = _mm_sub_pd(xmm_t, _mm_load_pd(&_INV_FACT_2_3[0]));
	xmm_t = _mm_mul_pd(xmm_t, xmm_d2);
	xmm_t = _mm_add_pd(xmm_t, _mm_load_pd(&_ONE[0]));
 
	__m128d xmm_1_d = _mm_unpacklo_pd(_mm_load_pd(&_ONE[0]), xmm_d);
	xmm_t = _mm_mul_pd(xmm_t, xmm_1_d);
 
	xmm_t = _mm_mul_pd(xmm_t, _mm_load_pd(&g_SinCosConstantTable[arrayIndex]));
 
	__m128d xmm_t_inv = _mm_unpackhi_pd(xmm_t, xmm_t);
	xmm_t = _mm_add_pd(xmm_t, xmm_t_inv);
 
	double sin_angle;
	_mm_store_sd(&sin_angle, xmm_t);
	return sin_angle;
}

This function implements the same algorithm as the FPU version, except from the fact that it calculates both t1 and t2 at the same time using packed SSE2 instructions.

Maximum absolute error [0, 2PI] = 2.498e-016
Performance = ~54 cycles

That’s better. Maximum error is still the same and the performance increased by approx. 7 cycles per function evaluation. The problem with this version is that there is a long dependency chain, when calculating xmm_t, which hurts performance. A little reordering of equation (2) gives us the final version.

2nd SSE2 version

double fast_sin_sse2_v2(double x)
{
	__declspec(align(16)) static const double _INV_FACT_2_3[2] = {1.0 / 2.0, 1.0 / 6.0};
	__declspec(align(16)) static const double _INV_FACT_4_5[2] = {1.0 / 24.0, 1.0 / 120.0};
	__declspec(align(16)) static const double _INV_FACT_6_7[2] = {1.0 / 720.0, 1.0 / 5040.0};
	__declspec(align(16)) static const double _INV_FACT_8_9[2] = {1.0 / 40320.0, 1.0 / 362880.0};
	__declspec(align(16)) static const double _ONE[2] = {1.0, 1.0};
 
	// Calculate alpha (the Taylor constant) by dividing the [0, PI] range in 32 subranges
	// and selecting the lower part of each range as alpha.
	__m128d xmm_x = _mm_load_sd(&x);
	__m128d xmm_scaledAngle = _mm_mul_sd(xmm_x, _mm_load_sd(&_32_OVER_PI));
 
	int arrayIndex = _mm_cvtsd_si32(xmm_scaledAngle);
 
	__m128d xmm_a = _mm_cvtsi32_sd(_mm_setzero_pd(), arrayIndex);
	xmm_a = _mm_mul_sd(xmm_a, _mm_load_sd(&_PI_OVER_32));
 
	arrayIndex &= 0x0000003F;
	arrayIndex <<= 1; // 2 constants for subrange
 
	__m128d xmm_d = _mm_sub_sd(xmm_x, xmm_a);
	xmm_d = _mm_unpacklo_pd(xmm_d, xmm_d);
	__m128d xmm_d2 = _mm_mul_pd(xmm_d, xmm_d);
	__m128d xmm_d4 = _mm_mul_pd(xmm_d2, xmm_d2);
 
	__m128d xmm_t1 = _mm_mul_pd(xmm_d2, _mm_load_pd(&_INV_FACT_4_5[0]));
	__m128d xmm_t2 = _mm_mul_pd(xmm_d2, _mm_load_pd(&_INV_FACT_8_9[0]));
	xmm_t1 = _mm_sub_pd(xmm_t1, _mm_load_pd(&_INV_FACT_2_3[0]));
	xmm_t2 = _mm_sub_pd(xmm_t2, _mm_load_pd(&_INV_FACT_6_7[0]));
	xmm_t1 = _mm_mul_pd(xmm_t1, xmm_d2);
	xmm_t2 = _mm_mul_pd(xmm_t2, xmm_d2);
	xmm_t1 = _mm_add_pd(xmm_t1, _mm_load_pd(&_ONE[0]));
	xmm_t2 = _mm_mul_pd(xmm_t2, xmm_d4);
 
	__m128d xmm_t = _mm_add_pd(xmm_t1, xmm_t2);
 
	__m128d xmm_1_d = _mm_unpacklo_pd(_mm_load_pd(&_ONE[0]), xmm_d);
	xmm_t = _mm_mul_pd(xmm_t, xmm_1_d);
 
	xmm_t = _mm_mul_pd(xmm_t, _mm_load_pd(&g_SinCosConstantTable[arrayIndex]));
 
	__m128d xmm_t_inv = _mm_unpackhi_pd(xmm_t, xmm_t);
	xmm_t = _mm_add_pd(xmm_t, xmm_t_inv);
 
	double sin_angle;
	_mm_store_sd(&sin_angle, xmm_t);
	return sin_angle;
}

What we did here is to break the calculation of t into two independent parts in order to be able to calculate them in parallel. This way the compiler is free to reschedule instruction differently in order to achieve better performance.

Maximum absolute error [0, 2PI] = 2.498e-16
Performance = ~50 cycles

Can we do any better with inline asm? Let’s see… (I promise this is the last one)

SSE2 inline assembly version

double fast_sin_sse2_asm(double x)
{
	__declspec(align(16)) static const double _INV_FACT_2_3[2] = {1.0 / 2.0, 1.0 / 6.0};
	__declspec(align(16)) static const double _INV_FACT_4_5[2] = {1.0 / 24.0, 1.0 / 120.0};
	__declspec(align(16)) static const double _INV_FACT_6_7[2] = {1.0 / 720.0, 1.0 / 5040.0};
	__declspec(align(16)) static const double _INV_FACT_8_9[2] = {1.0 / 40320.0, 1.0 / 362880.0};
	__declspec(align(16)) static const double _ONE[2] = {1.0, 1.0};
 
	double sin_angle;
	__asm
	{
		movsd xmm0, x;
		movsd xmm1, _32_OVER_PI;
		movapd xmm2, _INV_FACT_4_5;
		mulsd xmm1, xmm0;
		movapd xmm3, _INV_FACT_8_9;
		cvtsd2si eax, xmm1;
		movapd xmm4, _INV_FACT_2_3;
		cvtsi2sd xmm1, eax;
		mulsd xmm1, _PI_OVER_32;
		and eax, 0x0000003F;
		subsd xmm0, xmm1;
		movapd xmm5, _INV_FACT_6_7;
		unpcklpd xmm0, xmm0;
		movapd xmm6, _ONE;
		movapd xmm1, xmm0;
		shl eax, 4;
		mulpd xmm1, xmm0;
		mulpd xmm2, xmm1;
		mulpd xmm3, xmm1;
		movapd xmm7, xmm1;
		subpd xmm2, xmm4;
		subpd xmm3, xmm5;
		mulpd xmm2, xmm1;
		mulpd xmm7, xmm7;
		mulpd xmm3, xmm1;
		addpd xmm2, xmm6;
		mulpd xmm3, xmm7;
		unpcklpd xmm6, xmm0;
		addpd xmm2, xmm3;
		movapd xmm4, g_SinCosConstantTable[eax];
		mulpd xmm2, xmm6;
		mulpd xmm2, xmm4;
		movapd xmm5, xmm2;
		unpckhpd xmm2, xmm2;
		addsd xmm5, xmm2;
		movsd sin_angle, xmm5;
		fld sin_angle;
	}
}

Maximum absolute error [0, 2PI] = 2.498e-16
Performance = ~47 cycles

There’s probably some room for improvement here, but I’m a bit lazy.

Summary
The table below compares the various versions against CRT’s sin(), FPU’s fsin and __libm_sse2_sin() in terms of accuracy and performance. Accuracy is measured in the [0, 2PI] range, with an interval of PI/128.0, using the CRT’s sin() function as base. Performance is measured using 1M random angles in the (0.0, 90112.0) range, using sin() as base (relative performance is shown). The range has been selected this way in order to force __libm_sse2_sin() to follow the Taylor approximation path. For values near 0.0, it returns the value itself, and for values larger than 90112.0 it falls back to the standard sin() function.

Function Max Abs Error Performance (rel)
sin() 0 1.00
fsin 0 0.98
__libm_sse2_sin() 0 1.80
fast_sin_fpu() 2.498e-016 1.77
fast_sin_sse2() 2.498e-016 2.00
fast_sin_sse2_v2() 2.498e-016 2.16
fast_sin_sse2_asm() 2.498e-016 2.30

Note that __libm_sse2_sin() is called through a wrapper function, as described earlier, instead of switching on SSE2 instruction set for the whole project. This means that the timings include 2 function calls, although the wrapper can be inlined.

Calculating cos()
Using the above algorithm as base we can calculate cos() easily. Let’s take a look at the Taylor series of cos(x).

After reordering it gives:

Taking into account that cos(a) = sin(PI/2 + a) and -sin(a) = cos(PI/2 + a) we end up with:

Equation (5) is the same as equation (2) with ‘a’ replaced by ‘PI/2 + a’. This means that by altering the ‘arrayIndex’ we can calculate cos(x) using the exact same implementation. Adding 16 (which is the index for the range starting at PI/2), before ANDing with 0x0000003F, does the trick.

Calculating sin(x) and cos(x) at the same time should now be straightforward and it doesn’t cost too much.

That’s it. Thanks for reading.

Leave a Reply