Goal: Multiply a batch of Vector3f’s with the same 4×4 matrix.

Restrictions:

1. ‘src’ and ‘dst’ arrays shouldn’t point to the same memory location.

2. All pointers should be 16-byte aligned (see below for details on array sizes).

3. Treat Vector3f’s as positions (w = 1.0).

4. Matrix is column major.
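To make the column-major convention concrete: the element at (row r, column c) of the 4×4 matrix lives at index c * 4 + r, so each column is contiguous in memory and the translation occupies indices 12 to 14. A minimal sketch (the helper name is mine, not part of the original code):

```c
/* Index of element (row, col) in a column-major 4x4 matrix
   stored as a flat array of 16 floats. */
static inline int MatIndex(int row, int col)
{
    return col * 4 + row;
}
```

This matches the `matrix4x4[12]`..`matrix4x4[14]` translation terms that the C version below adds for w = 1.0.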

__Structs and initialization__

```cpp
struct _Vector3f { float x, y, z; };
struct _Vector4f { float x, y, z, w; };

_Vector3f* AllocateSrcArray(unsigned int numVertices)
{
	// Update 2011-07-27: The size of the src array was incorrect.
	// We are loading 3 xmm regs per loop. So we need numLoops * 12 floats, where
	// numLoops is (numVertices / 4) + 1 for the SSE version
	unsigned int memSize = sizeof(float) * ((numVertices >> 2) + 1) * 12;
	return (_Vector3f*)_aligned_malloc(memSize, 16);
}

_Vector4f* AllocateDstArray(unsigned int numVertices)
{
	unsigned int memSize = sizeof(_Vector4f) * (((numVertices >> 2) + 1) << 2);
	return (_Vector4f*)_aligned_malloc(memSize, 16);
}
```

As you can see, we are wasting a bit of memory in order to avoid dealing with special cases in the SSE version. The C version isn’t affected by the extra padding, so there is no overhead. The worst-case scenario (numVertices = 4 * n + 1) allocates an extra 48 bytes for the ‘dst’ array and an extra 36 bytes (9 floats) for the ‘src’ array (total = 84 bytes). Nothing extraordinary when dealing with large batches of vertices. Code-side, the worst case is to perform 3 extra stores to the ‘dst’ array and 3 extra loads from the ‘src’ array.
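A quick sanity check of the padding arithmetic, mirroring the size formulas in AllocateSrcArray / AllocateDstArray above (the function names here are mine):

```c
/* Bytes of padding each allocator reserves beyond the raw vertex data. */
unsigned int SrcPaddingBytes(unsigned int numVertices)
{
    /* Same formula as AllocateSrcArray: ((numVertices / 4) + 1) loops * 12 floats. */
    unsigned int alloc  = (unsigned int)sizeof(float) * ((numVertices >> 2) + 1) * 12;
    unsigned int needed = numVertices * 3 * (unsigned int)sizeof(float); /* _Vector3f */
    return alloc - needed;
}

unsigned int DstPaddingBytes(unsigned int numVertices)
{
    /* Same formula as AllocateDstArray; sizeof(_Vector4f) == 16. */
    unsigned int alloc  = 16 * (((numVertices >> 2) + 1) << 2);
    unsigned int needed = numVertices * 16;
    return alloc - needed;
}
```

For numVertices = 9 (n = 2), the allocators reserve 36 padding bytes in ‘src’ and 48 in ‘dst’.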

__C code:__

```cpp
void TransformVec3Pos_C(_Vector3f* __restrict src, _Vector4f* __restrict dst,
                        unsigned int numVertices, float* __restrict matrix4x4)
{
	for(unsigned int iVertex = 0; iVertex < numVertices; ++iVertex)
	{
		_Vector3f* s = &src[iVertex];
		_Vector4f* d = &dst[iVertex];
		d->x = matrix4x4[0] * s->x + matrix4x4[4] * s->y + matrix4x4[ 8] * s->z + matrix4x4[12];
		d->y = matrix4x4[1] * s->x + matrix4x4[5] * s->y + matrix4x4[ 9] * s->z + matrix4x4[13];
		d->z = matrix4x4[2] * s->x + matrix4x4[6] * s->y + matrix4x4[10] * s->z + matrix4x4[14];
		d->w = matrix4x4[3] * s->x + matrix4x4[7] * s->y + matrix4x4[11] * s->z + matrix4x4[15];
	}
}
```

There is nothing special about this code: a straight Vec3/Matrix4x4 multiply, plain old FPU code generated by the compiler. (Note-to-self: I have to check the SSE code the compiler generates when the corresponding switches are enabled.)

Performance: ~32 Cycles per vertex (batch size = 8k vertices)

__SSE Optimized:__

The code below processes 4 vertices at a time.

4 vertices * 3 floats/vertex = 12 floats = 3 loads of 4 floats per loop

First load = (x0, y0, z0, x1)

Second load = (y1, z1, x2, y2)

Third load = (z2, x3, y3, z3)

We don’t keep the matrix in xmm regs in order to avoid intermediate stores (profiling says this is good).

Keep the matrix loads to a minimum (see comment in code). Hopefully it should be in cache most of the time.

Written in asm because I couldn’t find a way to force the compiler to generate the code below using intrinsics. The compiler insisted on using the stack for intermediate stores (matrix columns). As far as I remember, the intrinsics version was 1 to 2 cycles/vertex slower on average. Not much of an overhead; I just wanted to see how far this could be taken.

```cpp
void TransformVec3Pos_SSE_Asm(_Vector3f* __restrict src, _Vector4f* __restrict dst,
                              unsigned int numVertices, float* __restrict matrix)
{
	__asm
	{
		// Calculate the total number of iterations...
		mov ecx, [numVertices];
		shr ecx, 0x02;
		inc ecx;                    // ecx = totalIterations = (numVertices >> 2) + 1;
		xor esi, esi;               // esi = iIter = 0;
		mov eax, [src];             // eax = &src[0];
		mov edx, [dst];             // edx = &dst[0];
		mov edi, [matrix];          // edi = &matrix[0];

_LoopStart:
		// Matrix column 0 is being loaded 2 times.
		// Matrix column 1 is being loaded 2 times.
		// Matrix column 2 is being loaded 1 time.
		// Matrix column 3 is being loaded 1 time.
		// Other than the above extra memory loads, each vector is loaded once,
		// and there aren't any intermediate stores.
		prefetcht0 [eax + 0x100];
		movaps xmm0, [eax];         // xmm0 = (x0, y0, z0, x1)
		movaps xmm1, [edi];         // xmm1 = (m0, m1, m2, m3) << Matrix.Column[0]
		movaps xmm2, xmm0;          // xmm2 = (x0, y0, z0, x1)
		movaps xmm3, xmm0;          // xmm3 = (x0, y0, z0, x1)
		shufps xmm2, xmm2, 0x00;    // xmm2 = (x0, x0, x0, x0)
		prefetcht0 [edx + 0x80];
		shufps xmm3, xmm3, 0xFF;    // xmm3 = (x1, x1, x1, x1)
		mulps xmm2, xmm1;           // xmm2 = (x0 * m0, x0 * m1, x0 * m2, x0 * m3)
		mulps xmm3, xmm1;           // xmm3 = (x1 * m0, x1 * m1, x1 * m2, x1 * m3)
		movaps xmm4, xmm0;          // xmm4 = (x0, y0, z0, x1)
		movaps xmm1, [edi + 0x10];  // xmm1 = (m4, m5, m6, m7) << Matrix.Column[1]
		shufps xmm4, xmm4, 0x55;    // xmm4 = (y0, y0, y0, y0)
		shufps xmm0, xmm0, 0xAA;    // xmm0 = (z0, z0, z0, z0)
		mulps xmm4, xmm1;           // xmm4 = (y0 * m4, y0 * m5, y0 * m6, y0 * m7)
		movaps xmm5, [eax + 0x10];  // xmm5 = (y1, z1, x2, y2)
		addps xmm2, xmm4;           // xmm2 = (x0 * m0 + y0 * m4, x0 * m1 + y0 * m5, x0 * m2 + y0 * m6, x0 * m3 + y0 * m7)
		movaps xmm6, xmm5;          // xmm6 = (y1, z1, x2, y2)
		movaps xmm4, [edi + 0x20];  // xmm4 = (m8, m9, m10, m11) << Matrix.Column[2]
		shufps xmm6, xmm6, 0x00;    // xmm6 = (y1, y1, y1, y1)
		mulps xmm0, xmm4;           // xmm0 = (z0 * m8, z0 * m9, z0 * m10, z0 * m11)
		mulps xmm6, xmm1;           // xmm6 = (y1 * m4, y1 * m5, y1 * m6, y1 * m7)
		addps xmm0, xmm2;           // xmm0 = (x0 * m0 + y0 * m4 + z0 * m8, x0 * m1 + y0 * m5 + z0 * m9, x0 * m2 + y0 * m6 + z0 * m10, x0 * m3 + y0 * m7 + z0 * m11)
		addps xmm3, xmm6;           // xmm3 = (x1 * m0 + y1 * m4, x1 * m1 + y1 * m5, x1 * m2 + y1 * m6, x1 * m3 + y1 * m7)
		movaps xmm2, xmm5;          // xmm2 = (y1, z1, x2, y2)
		movaps xmm6, xmm5;          // xmm6 = (y1, z1, x2, y2)
		shufps xmm2, xmm2, 0x55;    // xmm2 = (z1, z1, z1, z1)
		shufps xmm6, xmm6, 0xFF;    // xmm6 = (y2, y2, y2, y2)
		mulps xmm2, xmm4;           // xmm2 = (z1 * m8, z1 * m9, z1 * m10, z1 * m11)
		mulps xmm6, xmm1;           // xmm6 = (y2 * m4, y2 * m5, y2 * m6, y2 * m7)
		addps xmm2, xmm3;           // xmm2 = (x1 * m0 + y1 * m4 + z1 * m8, x1 * m1 + y1 * m5 + z1 * m9, x1 * m2 + y1 * m6 + z1 * m10, x1 * m3 + y1 * m7 + z1 * m11)
		movaps xmm1, [edi];         // xmm1 = (m0, m1, m2, m3) << Matrix.Column[0]
		shufps xmm5, xmm5, 0xAA;    // xmm5 = (x2, x2, x2, x2)
		movaps xmm3, [eax + 0x20];  // xmm3 = (z2, x3, y3, z3)
		mulps xmm5, xmm1;           // xmm5 = (x2 * m0, x2 * m1, x2 * m2, x2 * m3)
		movaps xmm7, xmm3;          // xmm7 = (z2, x3, y3, z3)
		addps xmm6, xmm5;           // xmm6 = (x2 * m0 + y2 * m4, x2 * m1 + y2 * m5, x2 * m2 + y2 * m6, x2 * m3 + y2 * m7)
		shufps xmm7, xmm7, 0x00;    // xmm7 = (z2, z2, z2, z2)
		movaps xmm5, xmm3;          // xmm5 = (z2, x3, y3, z3)
		mulps xmm7, xmm4;           // xmm7 = (z2 * m8, z2 * m9, z2 * m10, z2 * m11)
		shufps xmm5, xmm5, 0x55;    // xmm5 = (x3, x3, x3, x3)
		addps xmm6, xmm7;           // xmm6 = (x2 * m0 + y2 * m4 + z2 * m8, x2 * m1 + y2 * m5 + z2 * m9, x2 * m2 + y2 * m6 + z2 * m10, x2 * m3 + y2 * m7 + z2 * m11)
		mulps xmm5, xmm1;           // xmm5 = (x3 * m0, x3 * m1, x3 * m2, x3 * m3)
		movaps xmm7, [edi + 0x30];  // xmm7 = (m12, m13, m14, m15) << Matrix.Column[3]
		movaps xmm1, [edi + 0x10];  // xmm1 = (m4, m5, m6, m7) << Matrix.Column[1]
		add eax, 0x30;
		addps xmm0, xmm7;           // xmm0 = d0
		addps xmm2, xmm7;           // xmm2 = d1
		addps xmm6, xmm7;           // xmm6 = d2
		addps xmm5, xmm7;           // xmm5 = (x3 * m0 + m12, x3 * m1 + m13, x3 * m2 + m14, x3 * m3 + m15)
		movaps [edx], xmm0;         // dst[i] = xmm0
		movaps xmm7, xmm3;          // xmm7 = (z2, x3, y3, z3)
		movaps [edx + 0x10], xmm2;  // dst[i + 1] = xmm2
		shufps xmm7, xmm7, 0xAA;    // xmm7 = (y3, y3, y3, y3)
		shufps xmm3, xmm3, 0xFF;    // xmm3 = (z3, z3, z3, z3)
		mulps xmm7, xmm1;           // xmm7 = (y3 * m4, y3 * m5, y3 * m6, y3 * m7)
		mulps xmm3, xmm4;           // xmm3 = (z3 * m8, z3 * m9, z3 * m10, z3 * m11)
		addps xmm5, xmm7;           // xmm5 = (x3 * m0 + y3 * m4 + m12, x3 * m1 + y3 * m5 + m13, x3 * m2 + y3 * m6 + m14, x3 * m3 + y3 * m7 + m15)
		movaps [edx + 0x20], xmm6;  // dst[i + 2] = xmm6
		addps xmm5, xmm3;           // xmm5 = d3
		add edx, 0x40;
		inc esi;
		cmp esi, ecx;
		movaps [edx - 0x10], xmm5;  // dst[i + 3] = xmm5
		jb _LoopStart;
	}
}
```

Performance: ~10 Cycles per vertex (batch size = 8k vertices)

| Batch size | C (cycles/vertex) | SSE (cycles/vertex) | Speed up (%) |
|-----------:|------------------:|--------------------:|-------------:|
| 128 | 202 | 180 | 12 |
| 256 | 118 | 96 | 22 |
| 512 | 70 | 51 | 37 |
| 1024 | 52 | 31 | 67 |
| 2048 | 40 | 20 | 100 |
| 4096 | 35 | 13 | 169 |
| 8192 | 32 | 10 | 220 |
| 16384 | 30 | 9 | 233 |
| 32768 | 30 | 8 | 252 |
| 65536 | 30 | 8 | 275 |

Table: Comparison between the two methods, with the speedup computed as (C / SSE − 1) × 100 from the unrounded cycle counts (values are averages over several independent runs).

The stages of optimization aren’t presented because I don’t have the code anymore. Next time, I’ll keep it around. One thing I did observe: even the most naive SSE implementation (loading individual vertex components with movss and processing only one vertex per loop) gives a significant speedup over the FPU (C) implementation.
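For illustration only, since the original naive code is gone, here is a rough sketch of what such a one-vertex-per-loop intrinsics version could look like. The function name and the use of unaligned loads/stores are my own choices (the post’s real buffers are 16-byte aligned, so `_mm_load_ps`/`_mm_store_ps` would apply there); `_mm_set1_ps` typically compiles to the movss + shufps pattern mentioned above.

```c
#include <xmmintrin.h>

struct _Vector3f { float x, y, z; };
struct _Vector4f { float x, y, z, w; };

/* Hypothetical naive version: one vertex per iteration. Each vertex
   component is splatted across a register and multiplied with the
   matching column of the column-major 4x4 matrix; column 3 supplies
   the translation (w = 1.0). */
void TransformVec3Pos_SSE_Naive(const struct _Vector3f* src,
                                struct _Vector4f* dst,
                                unsigned int numVertices,
                                const float* matrix4x4)
{
    __m128 c0 = _mm_loadu_ps(matrix4x4);      /* column 0 */
    __m128 c1 = _mm_loadu_ps(matrix4x4 + 4);  /* column 1 */
    __m128 c2 = _mm_loadu_ps(matrix4x4 + 8);  /* column 2 */
    __m128 c3 = _mm_loadu_ps(matrix4x4 + 12); /* column 3 */

    for (unsigned int i = 0; i < numVertices; ++i) {
        __m128 r = _mm_mul_ps(_mm_set1_ps(src[i].x), c0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(src[i].y), c1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(src[i].z), c2));
        r = _mm_add_ps(r, c3); /* + translation, i.e. w = 1.0 */
        _mm_storeu_ps(&dst[i].x, r);
    }
}
```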

Note: All timings have been measured using Intel’s Performance Counter Monitor library v1.6 (msr.sys driver). All tests have been executed on a Core i7 740QM using the Microsoft Visual C++ 2008 compiler. The process and thread affinity was set to 0x01 (the thread runs on the 1st core only) and the thread priority was set to highest.

If you are the one person who’s going to test the above code for correctness, please share your findings in the comments. I’m not responsible for any damage the above code might do on your hardware ™.

Tags: optimization, SSE

## Sean Barrett Says:

July 24th, 2011 at 2:08 am

How much of the speedup is really due to SSE and how much is due to the prefetching?

I’m unclear why you would see such wide variation in timings with batch size in either version. Even a batch size of 128 should run the C version enough to minimize any non-iteration overhead, so it seems highly implausible that the C code should get 6 times faster as the batch size goes from 128 to 16384 (although it’s possible that’s the hardware prefetcher coming into play). This seems more like the kind of variation you see in Java benchmarks (which have huge startup overheads). What exactly are you measuring?

## JD Says:

July 24th, 2011 at 9:10 am

Hi Sean,

I just reran the SSE tests without prefetching. The difference is 1 to 2 cycles depending on batch size. Prefetching seems to be useful for batch sizes larger than 1024 vertices; smaller batches give identical results with or without it.

Here is my profile function.

TransformVec3Pos is a macro which points to one of the two functions mentioned in the post. Do you see something strange in this code? I’ve verified that MSVC doesn’t inline either one of the functions. Any ideas on what might affect the results?

(~~I hope code blocks appear correctly in comments~~ Apparently it messed up the code a bit, but I’ve corrected it.)

## JD Says:

July 24th, 2011 at 5:59 pm

I think I found the mistake! I don’t take into account the library’s overhead (monitor->getCoreCounterState(0)), which is about 20k cycles. A 20k-cycle overhead on measurements of less than 1k cycles is a big thing.

Profiling using RDTSC gives the result you expected. The C version stabilizes at 20 cycles/vertex for a batch of 128 vertices (22 cycles/vertex for 16 vertices). There is a strange drop in the SSE version, from ~11 cycles/vertex for a batch of 256 vertices to ~5 cycles/vertex for a batch of 512 vertices. Also, prefetching seems to harm performance for really small batches (which I think is expected). For large batches, prefetching gives a 0.5 to 1 cycle/vertex speedup on average (I assume it could be tuned further by adjusting the offsets in the prefetch instructions).

I know that averages without stdev don’t say much, so I’ll try to calculate it next time.

The numbers have been calculated by running the following snippet 100000 times, sorting the results in ascending order, and calculating the mean from the middle 60000 iterations:
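The sort-then-average-the-middle computation can be sketched like this (a hypothetical helper, not the original snippet):

```c
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparator for uint64_t cycle counts. */
static int CmpU64(const void* a, const void* b)
{
    uint64_t x = *(const uint64_t*)a, y = *(const uint64_t*)b;
    return (x > y) - (x < y);
}

/* Sort n timing samples ascending and return the mean of the `keep`
   samples centered in the sorted array (e.g. n = 100000, keep = 60000),
   discarding the fastest and slowest outliers. */
double TrimmedMean(uint64_t* samples, size_t n, size_t keep)
{
    qsort(samples, n, sizeof(uint64_t), CmpU64);
    size_t lo = (n - keep) / 2;
    double sum = 0.0;
    for (size_t i = lo; i < lo + keep; ++i)
        sum += (double)samples[i];
    return sum / (double)keep;
}
```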

RDTSC isn’t the best way to profile code on multi-core systems, from what I’ve read. I was wondering if there’s an alternative way to count cycles with minimal overhead. Do you have any ideas?

## Sean Barrett Says:

July 26th, 2011 at 1:53 am

Thanks for investigating further, that makes more sense!

RDTSC should actually be fine on multicore systems (it should stay in sync), but it’s still kind of problematic. In particular RDTSC is bad for measuring wall-clock time since it may vary over time e.g. due to SpeedStep. Also, I’m not sure what effect Turbo Boost might have on RDTSC — it might not affect RDTSC rate measurements, which would mean the processor is actually processing more cycles than RDTSC reports. For best results at trying to measure cycles you probably want to use RDTSC and turn off Turbo Boost or fully load all your cores. But with all these varying clock rates and hyperthreads and things, it’s becoming less meaningful to talk about measuring actual clock cycles, and you might want to just stick to wall-clock microseconds on a specific processor, or such.

For measuring moderately-small things with wall-clock time, QueryPerformanceCounter is preferred. It’s supposed to be based on the hardware bus clock rate, which won’t ever vary. (However, the Frequency isn’t 100.0000% accurate, so you can’t actually use it for long-term wall-clock times.)