Goal: Classify whether a batch of AABBs are completely inside, completely outside or intersecting a frustum (6 planes).

Restrictions:

1. AABBs are defined as (Center, Extent) pairs.

2. All vectors are Vector3f’s.

__Structs and initialization__

^{?}[Copy to clipboard]View Code CPP

#define OUTSIDE 0 #define INSIDE 1 #define INTERSECT 2 // or 3 depending on the algorithm used (see the discussion on the 2nd SSE version) struct _Vector3f { float x, y, z; }; struct _AABB { _Vector3f m_Center; _Vector3f m_Extent; }; struct _Plane { float nx, ny, nz, d; }; |

All arrays are 16-byte aligned (AABB list, AABB state array and the 6 planes). The code is based on method 4c from an excellent post by Fabian Giesen on AABB culling: View frustum culling. Now let’s see the code.

__C (reference implementation)__

^{?}[Copy to clipboard]View Code CPP

void CullAABBList_C(_AABB* aabbList, unsigned int numAABBs, _Plane* frustumPlanes, unsigned int* aabbState) { for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB) { const _Vector3f& aabbCenter = aabbList[iAABB].m_Center; const _Vector3f& aabbSize = aabbList[iAABB].m_Extent; unsigned int result = 1; // Assume that the aabb will be Inside the frustum for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { const _Plane& frustumPlane = frustumPlanes[iPlane]; float d = aabbCenter.x * frustumPlane.nx + aabbCenter.y * frustumPlane.ny + aabbCenter.z * frustumPlane.nz; float r = aabbSize.x * fast_fabsf(frustumPlane.nx) + aabbSize.y * fast_fabsf(frustumPlane.ny) + aabbSize.z * fast_fabsf(frustumPlane.nz); float d_p_r = d + r; float d_m_r = d - r; if(d_p_r < -frustumPlane.d) { result = 0; // Outside break; } else if(d_m_r < -frustumPlane.d) result = 2; // Intersect } aabbState[iAABB] = result; } } |

Performance (cycles/AABB): Average = 102.5 (stdev = 12.0)

Nothing special about the code, except from the fact that it breaks out of the inner loop as soon as the box is classified as completely outside of a plane. No need to check the others.

But we can do better. First thing, precalculate the absolute of the plane normals. Second thing, replace the floating point comparisons with ANDs. Lets see the code.

__C Optimized__

^{?}[Copy to clipboard]View Code CPP

void CullAABBList_C_Opt(_AABB* __restrict aabbList, unsigned int numAABBs, _Plane* __restrict frustumPlanes, unsigned int* __restrict aabbState) { _Plane absFrustumPlanes[6]; for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { absFrustumPlanes[iPlane].nx = fast_fabsf(frustumPlanes[iPlane].nx); absFrustumPlanes[iPlane].ny = fast_fabsf(frustumPlanes[iPlane].ny); absFrustumPlanes[iPlane].nz = fast_fabsf(frustumPlanes[iPlane].nz); } for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB) { const _Vector3f& aabbCenter = aabbList[iAABB].m_Center; const _Vector3f& aabbSize = aabbList[iAABB].m_Extent; unsigned int result = 1; // Assume that the aabb will be Inside the frustum for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { const _Plane& frustumPlane = frustumPlanes[iPlane]; const _Plane& absFrustumPlane = absFrustumPlanes[iPlane]; float d = aabbCenter.x * frustumPlane.nx + aabbCenter.y * frustumPlane.ny + aabbCenter.z * frustumPlane.nz; float r = aabbSize.x * absFrustumPlane.nx + aabbSize.y * absFrustumPlane.ny + aabbSize.z * absFrustumPlane.nz; float d_p_r = d + r + frustumPlane.d; if(IsNegativeFloat(d_p_r)) { result = 0; // Outside break; } float d_m_r = d - r + frustumPlane.d; if(IsNegativeFloat(d_m_r)) result = 2; // Intersect } aabbState[iAABB] = result; } } |

Performance (cycles/AABB): Average = 84.9 (stdev = 12.3)

fast_fabsf() and IsNegativeFloat() are simple functions which treat the float as int and remove/check the MSB.

Now let’s see what we can do using SSE.

__SSE (1 AABB at a time)__

^{?}[Copy to clipboard]View Code CPP

void CullAABBList_SSE_1(_AABB* aabbList, unsigned int numAABBs, _Plane* frustumPlanes, unsigned int* aabbState) { __declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF}; __declspec(align(16)) _Plane absFrustumPlanes[6]; __m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]); for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { __m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx); __m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask); _mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane); } for(unsigned int iAABB = 0;iAABB < numAABBs;++iAABB) { __m128 xmm_aabbCenter_x = _mm_load_ss(&aabbList[iAABB].m_Center.x); __m128 xmm_aabbCenter_y = _mm_load_ss(&aabbList[iAABB].m_Center.y); __m128 xmm_aabbCenter_z = _mm_load_ss(&aabbList[iAABB].m_Center.z); __m128 xmm_aabbExtent_x = _mm_load_ss(&aabbList[iAABB].m_Extent.x); __m128 xmm_aabbExtent_y = _mm_load_ss(&aabbList[iAABB].m_Extent.y); __m128 xmm_aabbExtent_z = _mm_load_ss(&aabbList[iAABB].m_Extent.z); unsigned int result = 1; // Assume that the aabb will be Inside the frustum for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { __m128 xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nx); __m128 xmm_d = _mm_mul_ss(xmm_aabbCenter_x, xmm_frustumPlane_Component); xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].ny); xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_y, xmm_frustumPlane_Component)); xmm_frustumPlane_Component = _mm_load_ss(&frustumPlanes[iPlane].nz); xmm_d = _mm_add_ss(xmm_d, _mm_mul_ss(xmm_aabbCenter_z, xmm_frustumPlane_Component)); __m128 xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nx); __m128 xmm_r = _mm_mul_ss(xmm_aabbExtent_x, xmm_absFrustumPlane_Component); xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].ny); xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_y, xmm_absFrustumPlane_Component)); xmm_absFrustumPlane_Component = _mm_load_ss(&absFrustumPlanes[iPlane].nz); xmm_r = _mm_add_ss(xmm_r, _mm_mul_ss(xmm_aabbExtent_z, xmm_absFrustumPlane_Component)); __m128 xmm_frustumPlane_d = _mm_load_ss(&frustumPlanes[iPlane].d); __m128 xmm_d_p_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d); __m128 xmm_d_m_r = _mm_add_ss(_mm_add_ss(xmm_d, xmm_r), xmm_frustumPlane_d); // Shuffle d_p_r and d_m_r in order to perform only one _mm_movmask_ps __m128 xmm_d_p_r__d_m_r = _mm_shuffle_ps(xmm_d_p_r, xmm_d_m_r, _MM_SHUFFLE(0, 0, 0, 0)); int negativeMask = _mm_movemask_ps(xmm_d_p_r__d_m_r); // Bit 0 holds the sign of d + r and bit 2 holds the sign of d - r if(negativeMask & 0x01) { result = 0; // Outside break; } else if(negativeMask & 0x04) result = 2; // Intersect } aabbState[iAABB] = result; } } |

Performance (cycles/AABB): Average = 63.9 (stdev = 10.8)

Again, since we are processing one AABB at a time, we break out of the inner (plane) loop as soon as the AABB is classified as completely outside one of the 6 planes. The SSE code should be straightforward. All arithmetic operations are scalar and we use _mm_movemask_ps in order to extract the signs of d + r and d – r. Depending on the signs, we classify the AABB as completely outside or intersecting the frustum.

The “problem” with the above snippet is that all SSE operations are scalar. We can do better by swizzling the data from 4 AABBs in such a way that it will be possible to calculate 4 d + r and 4 d – r simultaneously. Let’s do that.

__SSE (4 AABBs at a time)__

^{?}[Copy to clipboard]View Code CPP

void CullAABBList_SSE_4(_AABB* aabbList, unsigned int numAABBs, _Plane* frustumPlanes, unsigned int* aabbState) { __declspec(align(16)) static const unsigned int absPlaneMask[4] = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF}; __declspec(align(16)) _Plane absFrustumPlanes[6]; __m128 xmm_absPlaneMask = _mm_load_ps((float*)&absPlaneMask[0]); for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { __m128 xmm_frustumPlane = _mm_load_ps(&frustumPlanes[iPlane].nx); __m128 xmm_absFrustumPlane = _mm_and_ps(xmm_frustumPlane, xmm_absPlaneMask); _mm_store_ps(&absFrustumPlanes[iPlane].nx, xmm_absFrustumPlane); } // Process 4 AABBs in each iteration... unsigned int numIterations = numAABBs >> 2; for(unsigned int iIter = 0;iIter < numIterations;++iIter) { // NOTE: Since the aabbList is 16-byte aligned, we can use aligned moves. // Load the 4 Center/Extents pairs for the 4 AABBs. __m128 xmm_cx0_cy0_cz0_ex0 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Center.x); __m128 xmm_ey0_ez0_cx1_cy1 = _mm_load_ps(&aabbList[(iIter << 2) + 0].m_Extent.y); __m128 xmm_cz1_ex1_ey1_ez1 = _mm_load_ps(&aabbList[(iIter << 2) + 1].m_Center.z); __m128 xmm_cx2_cy2_cz2_ex2 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Center.x); __m128 xmm_ey2_ez2_cx3_cy3 = _mm_load_ps(&aabbList[(iIter << 2) + 2].m_Extent.y); __m128 xmm_cz3_ex3_ey3_ez3 = _mm_load_ps(&aabbList[(iIter << 2) + 3].m_Center.z); // Shuffle the data in order to get all Xs, Ys, etc. in the same register. __m128 xmm_cx0_cy0_cx1_cy1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_ey0_ez0_cx1_cy1, _MM_SHUFFLE(3, 2, 1, 0)); __m128 xmm_cx2_cy2_cx3_cy3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_ey2_ez2_cx3_cy3, _MM_SHUFFLE(3, 2, 1, 0)); __m128 xmm_aabbCenter0123_x = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(2, 0, 2, 0)); __m128 xmm_aabbCenter0123_y = _mm_shuffle_ps(xmm_cx0_cy0_cx1_cy1, xmm_cx2_cy2_cx3_cy3, _MM_SHUFFLE(3, 1, 3, 1)); __m128 xmm_cz0_ex0_cz1_ex1 = _mm_shuffle_ps(xmm_cx0_cy0_cz0_ex0, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(1, 0, 3, 2)); __m128 xmm_cz2_ex2_cz3_ex3 = _mm_shuffle_ps(xmm_cx2_cy2_cz2_ex2, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(1, 0, 3, 2)); __m128 xmm_aabbCenter0123_z = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(2, 0, 2, 0)); __m128 xmm_aabbExtent0123_x = _mm_shuffle_ps(xmm_cz0_ex0_cz1_ex1, xmm_cz2_ex2_cz3_ex3, _MM_SHUFFLE(3, 1, 3, 1)); __m128 xmm_ey0_ez0_ey1_ez1 = _mm_shuffle_ps(xmm_ey0_ez0_cx1_cy1, xmm_cz1_ex1_ey1_ez1, _MM_SHUFFLE(3, 2, 1, 0)); __m128 xmm_ey2_ez2_ey3_ez3 = _mm_shuffle_ps(xmm_ey2_ez2_cx3_cy3, xmm_cz3_ex3_ey3_ez3, _MM_SHUFFLE(3, 2, 1, 0)); __m128 xmm_aabbExtent0123_y = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(2, 0, 2, 0)); __m128 xmm_aabbExtent0123_z = _mm_shuffle_ps(xmm_ey0_ez0_ey1_ez1, xmm_ey2_ez2_ey3_ez3, _MM_SHUFFLE(3, 1, 3, 1)); unsigned int in_out_flag = 0x0F; // = 01111b Assume that all 4 boxes are inside the frustum. unsigned int intersect_flag = 0x00; // = 00000b if intersect_flag[i] == 1 then this box intersects the frustum. for(unsigned int iPlane = 0;iPlane < 6;++iPlane) { // Calculate d... __m128 xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nx); __m128 xmm_d = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_x); xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].ny); xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_y); xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component); xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].nz); xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbCenter0123_z); xmm_d = _mm_add_ps(xmm_d, xmm_frustumPlane_Component); // Calculate r... xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nx); __m128 xmm_r = _mm_mul_ps(xmm_aabbExtent0123_x, xmm_frustumPlane_Component); xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].ny); xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_y); xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component); xmm_frustumPlane_Component = _mm_load_ps1(&absFrustumPlanes[iPlane].nz); xmm_frustumPlane_Component = _mm_mul_ps(xmm_frustumPlane_Component, xmm_aabbExtent0123_z); xmm_r = _mm_add_ps(xmm_r, xmm_frustumPlane_Component); // Calculate d + r + frustumPlane.d __m128 xmm_d_p_r = _mm_add_ps(xmm_d, xmm_r); xmm_frustumPlane_Component = _mm_load_ps1(&frustumPlanes[iPlane].d); xmm_d_p_r = _mm_add_ps(xmm_d_p_r, xmm_frustumPlane_Component); // Check which boxes are outside this plane (if any)... // NOTE: At this point whichever component of the xmm_d_p_r reg is negative, the corresponding // box is outside the frustum. unsigned int in_out_flag_curPlane = _mm_movemask_ps(xmm_d_p_r); in_out_flag &= ~in_out_flag_curPlane; // NOTed the mask because it's 1 for each box which is outside the frustum, and in_out_flag holds the opposite. // If all boxes have been marked as outside the frustum, stop checking the rest of the planes. if(!in_out_flag) break; // Calculate d - r + frustumPlane.d __m128 xmm_d_m_r = _mm_sub_ps(xmm_d, xmm_r); xmm_d_m_r = _mm_add_ps(xmm_d_m_r, xmm_frustumPlane_Component); // Check which boxes intersect the frustum... unsigned int intersect_flag_curPlane = _mm_movemask_ps(xmm_d_m_r); intersect_flag |= intersect_flag_curPlane; } // Calculate the state of the AABB from the 2 flags. // If the i-th bit from in_out_flag is 0, then the result will be 0 independent of value intersect_flag // If the i-th bit from in_out_flag is 1, then the result will be either 1 or 2 depending on the intersect_flag. aabbState[(iIter << 2) + 0] = ((in_out_flag & 0x00000001) >> 0) << ((intersect_flag & 0x00000001) >> 0); aabbState[(iIter << 2) + 1] = ((in_out_flag & 0x00000002) >> 1) << ((intersect_flag & 0x00000002) >> 1); aabbState[(iIter << 2) + 2] = ((in_out_flag & 0x00000004) >> 2) << ((intersect_flag & 0x00000004) >> 2); aabbState[(iIter << 2) + 3] = ((in_out_flag & 0x00000008) >> 3) << ((intersect_flag & 0x00000008) >> 3); } // Process the rest of the AABBs one by one... for(unsigned int iAABB = numIterations << 2; iAABB < numAABBs;++iAABB) { // NOTE: This loop is identical to the CullAABBList_SSE_1() loop. Not shown in order to keep this snippet small. } } |

Performance (cycles/AABB): Average = 24.1 (stdev = 4.2)

Compared to the original (unoptimized) C version, it’s 4x faster. But that’s unfair. Compared to the scalar SSE version it’s about 2.5x faster. Not bad, don’t you thing? Can it get any better? Probably yes. The problem with the 2 SSE versions is the lack of enough XMM registers to keep all the AABB data in them and not use the stack for storing intermediate results. Unfortunately, we need 6 registers for the 4 Center/Extent pairs, and the rest (2) aren’t enough for the inner loop. Compiling under 64-bits will give better results because of that.

Another way to optimize this algorithm further is to have the data already laid out the way the code expects them (in SoA form). This will get rid of the shuffles (the moves will still be 6).

Finally, the code used to calculate the state of each AABB from the two bitfields can change. One other way of doing it is the following:

^{?}[Copy to clipboard]View Code CPP

intersect_flag &= in_out_flag; intersect_flag <<= 1; aabbState[i] = ((in_out_flag & (1 << i)) | (intersect_flag & (1 << (i + 1)))) >> i |

This way, we can get rid of the variable shifts (all shifts and ANDs are compile-time constants). The difference from the previous version is that the intersection state is 3 instead of 2. 2 isn’t a valid state, that’s why we AND the intersection_flag with the in_out_flag. 2 means that the box is outside *and* intersecting the frustum, which is invalid.

Unfortunately, in practice this doesn’t make a big difference in perf. So, it’s a matter of taste.

__Results__

Below is a table with more perf data from all the above snippets. Two cases have been tested. For both of them, the frustum is the [0, 1]^3 box. The first one is a random case, where all the boxes are randomly placed in the [-1.0, 2.0] box, and have random sizes in [0.1, 0.2] range. The other case is the worst case, where all boxes are inside the frustum.

Method |
Random (32) |
Worst (32) |
Random (1024) * |
Worst (1024) |

C (ref) | 105.2 (14.0) | 181.5 (21.0) | 102.5 (12.0) | 159.3 (20.3) |

C (opt) | 96.1 (10.3) | 138.0 (17.2) | 84.9 (12.3) | 119.8 (16.3) |

SSE_1 | 73.7 (9.8) | 93.2 (13.8) | 63.9 (10.8) | 72.8 (9.8) |

SSE_4 | 26.5 (4.0) | 27.6 (4.3) | 24.1 (4.2) | 22.9 (4.0) |

Table: Performance comparison between the 4 snippets. 2 batch sizes, 32 and 1024 AABBs. The column marked with * contains the data shown in the post.

That’s all folks. Thanks for reading. Any corrections/suggestions are welcome.

PS. I got my hands on a Sandy Bridge CPU recently (Core i5) so I’ll experiment with AVX in the future. Maybe the next post includes some AVX code. Who knows?

Tags: optimization, SSE

## Eric Haines Says:

August 26th, 2011 at 9:44 pmOne quick comment: there’s a nice optimization if you’re using AABB/frustum testing in a hierarchy, which is to pass on the knowledge of the parents to the children. For example, say you find your parent AABB is inside 4 of 6 frustum planes, overlapping the remaining 2. Typically this overlap condition means you test the child AABBs inside the parent. You then signal the children to test only these two planes (for fully inside and fully outside), since you know the children are fully inside the other 4. I do this by a simple bitfield for the 6 planes (e.g. 0x001010 would mean test the third and fifth frustum plane) which is used to index a table saying how many and which planes to test.

Given the speediness of SSE testing, this optimization may not be worth the trouble, but I thought I’d mention it. This idea is from Bishop, L., D. Eberly, T. Whitted, M. Finch, and M. Shantz, “Designing a PC Game Engine,” IEEE Computer Graphics and Applications, pp. 46-53, Jan./Feb. 1998.

## JD Says:

August 27th, 2011 at 12:05 pmHi and thanks for the comment.

I’m aware of the algorithm you mention. It was probably my fault. I should have mentioned the potential applications of the above algorithm. The snippets presented in the post is suited more to situations where you have a lot of AABBs you want to test at once. E.g. all the objects in a leaf of a hierarchy. The algorithm you describe is still very useful in order to reach the leaves. And it’s not that much of a trouble to keep the bitfields around, while traversing the hierarchy, and test only the planes needed.

On the other hand, integrating that idea to the SSE version of the above snippet, wouldn’t do any good, imho. I haven’t tested it, but I assume that inserting a hardly predicted conditional branch in the inner loop of the last snippet would hurt performance. In the SSE case which process 4 AABBs at once, you have to OR the bitfields of the 4 AABBs into one, and only check those planes which are relevant to at least one of the four AABBs.

I might check it in the future just for the fun of it.

Thanks for stopping by.

## JD Says:

August 27th, 2011 at 12:09 pmForgive my misunderstanding. There is no need for a conditional branch in the inner loop because you said:

That would be just fine for the SSE version. I’ll definitely try it and post the results as soon as I can.