Shadow masks and stencil buffer optimization

We’ve recently added shadow mask support to one of the render paths used in Lightning Engine, and I wanted to test how much stencil masking helps performance.

Shadow masks are screen-sized textures containing, for each pixel, the shadow term (and possibly the attenuation) of a light. Since they don’t contain color information, one can use low-precision integer formats (e.g. RGB8 or RGBA8) for this purpose, and pack multiple lights into one texture (one in each channel). For more information on shadow masks you can read Crytek’s presentation and accompanying paper from SIGGRAPH ’07.
Paper : Finding Next Gen: CryEngine 2
Presentation : Finding Next Gen: CryEngine 2
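
As a rough illustration of the "one light per channel" packing, here is a CPU-side sketch of what ends up in a single RGBA8 texel. The function names are hypothetical and not part of Lightning Engine; in a real renderer the GPU does this quantization for you when a shader writes to an 8-bit channel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: quantize a shadow/attenuation term in [0, 1] to 8 bits,
// rounding to the nearest representable value.
inline std::uint8_t quantizeShadowTerm(float term) {
    assert(term >= 0.0f && term <= 1.0f);
    return static_cast<std::uint8_t>(term * 255.0f + 0.5f);
}

// Pack the terms of up to four lights into one RGBA8 texel
// (light 0 in R / the low byte, light 3 in A / the high byte).
inline std::uint32_t packShadowMask(float l0, float l1, float l2, float l3) {
    return  static_cast<std::uint32_t>(quantizeShadowTerm(l0))
         | (static_cast<std::uint32_t>(quantizeShadowTerm(l1)) << 8)
         | (static_cast<std::uint32_t>(quantizeShadowTerm(l2)) << 16)
         | (static_cast<std::uint32_t>(quantizeShadowTerm(l3)) << 24);
}

// Read back the term of one light (channel 0..3) as a float in [0, 1].
inline float unpackShadowTerm(std::uint32_t texel, int channel) {
    assert(channel >= 0 && channel < 4);
    return static_cast<float>((texel >> (8 * channel)) & 0xFFu) / 255.0f;
}
```

With 8 bits per channel each light gets 256 levels, which is usually plenty for a term that only scales the lighting result.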

The procedure for creating a shadow mask texture is as follows:

Render scene's linear depth to a texture (LinearDepthTex)
Render a full screen quad, and in the fragment shader do:
{
  Reconstruct 3D position from linear depth
  Calculate shadow term (i.e. project shadowmap(s)) and attenuation for this position
  Store the value to the appropriate channel
}

The problem with this procedure is that, for small lights at high screen resolutions, a lot of time is spent calculating shadows and attenuation for pixels the light doesn’t affect. Because we reconstruct the 3D pixel position from a linear depth buffer, we need the vectors from the camera to the 4 corners of the far plane interpolated across the full-screen quad. Alternatively, we could render the light’s volume instead of the full-screen quad and somehow calculate the required information at the vertices of the light volume. I don’t know if this actually works; since there is another way to minimize the required work for each light while keeping 3D position reconstruction as simple as possible, we haven’t tested it.
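
The far-plane reconstruction can be sketched on the CPU as follows. This is only an illustration under a few assumptions that are not stated in the post: the depth texture stores view-space distance divided by the far distance (so 1.0 is on the far plane), the camera looks down -Z with a symmetric perspective frustum, and all type and function names are made up for the example. On the GPU the bilinear interpolation is done by the rasterizer across the full-screen quad.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Vectors from the camera to the four corners of the far plane (view space).
struct FarPlaneCorners { Vec3 bottomLeft, bottomRight, topLeft, topRight; };

FarPlaneCorners makeFarPlaneCorners(float fovYRadians, float aspect, float farDist) {
    const float halfH = farDist * std::tan(fovYRadians * 0.5f);
    const float halfW = halfH * aspect;
    return { {-halfW, -halfH, -farDist}, { halfW, -halfH, -farDist},
             {-halfW,  halfH, -farDist}, { halfW,  halfH, -farDist} };
}

inline Vec3 lerp(const Vec3& a, const Vec3& b, float t) {
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

// What the rasterizer would hand to the fragment shader at screen
// coordinates (u, v) in [0, 1], with (0, 0) at the bottom-left corner.
inline Vec3 interpolateFarRay(const FarPlaneCorners& c, float u, float v) {
    return lerp(lerp(c.bottomLeft, c.bottomRight, u),
                lerp(c.topLeft,    c.topRight,    u), v);
}

// View-space position = far-plane ray scaled by the normalized linear depth.
inline Vec3 reconstructViewPosition(const FarPlaneCorners& c,
                                    float u, float v, float linearDepth) {
    assert(linearDepth >= 0.0f && linearDepth <= 1.0f);
    const Vec3 ray = interpolateFarRay(c, u, v);
    return { ray.x * linearDepth, ray.y * linearDepth, ray.z * linearDepth };
}
```

The appeal of this scheme is that the per-pixel cost is a single multiply, which is why rendering a full-screen quad (rather than the light volume) keeps the reconstruction so simple.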

The idea (as mentioned in Crytek’s presentation) is to use the stencil buffer for masking out unaffected pixels. By marking all the pixels affected by the light in the stencil buffer, we effectively limit the required work to only those pixels, despite the fact that we are rendering a full-screen quad. The question is, “Is this actually true?”. Ideally we would expect the GPU to perform the stencil test before entering the fragment shader.

One thing worth noting here is that the renderer currently used in Lightning Engine is based on OpenGL, so what follows may apply only to OpenGL.

In order to test the validity of the above assumption, I’ve created a little demo, which does the following:

  1. Set the appropriate render target (either the window or a RGBA8 + DEPTH24_STENCIL8 FBO)
  2. Clear the color, depth and stencil buffers. Stencil is cleared to the value ‘1’
  3. Disable color writes and render a cube with inward-facing faces to the depth buffer (DepthFunc = LESS)
  4. Disable depth writes and enable stencil writes and testing (either two-sided or single-sided). Configure the stencil ops in order to get a value of ‘0’ for all pixels affected by the light volume.
  5. Render the light volume (a small cube intersecting one of the 8 corners of the previous cube)
  6. Disable stencil writes, enable color writes and set the depth func to EQUAL.
  7. Render the first cube again with a heavy fragment shader. Leave stencil testing enabled, and set the stencil func to pass when the stencil value is EQUAL to ‘0’
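
The post does not say which stencil ops were used in step 4, so the following is only one possible configuration, modelled on the CPU for a single pixel. The assumptions: two-sided stencil, the depth func still LESS, front faces of the light volume decrement on depth pass, back faces increment on depth pass, and the camera is outside the light volume. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// CPU model of steps 4-5 for one pixel. sceneDepth is the value left in the
// depth buffer by step 3; the other two are the depths of the light volume's
// front and back faces at this pixel. The stencil value starts at the clear
// value 1 (step 2).
std::uint8_t markLightVolumePixel(float sceneDepth,
                                  float volumeFrontDepth,
                                  float volumeBackDepth,
                                  std::uint8_t stencil = 1) {
    assert(volumeFrontDepth <= volumeBackDepth);
    // Front face passes the LESS depth test: decrement (assumed stencil op).
    if (volumeFrontDepth < sceneDepth) --stencil;
    // Back face passes the LESS depth test: increment (assumed stencil op).
    if (volumeBackDepth < sceneDepth) ++stencil;
    return stencil;
}

// Step 7: stencil func EQUAL with reference value 0.
inline bool passesStencilTest(std::uint8_t stencil) {
    return stencil == 0;
}
```

Only a pixel whose scene depth lies between the two faces ends up at 0: in front of the volume neither face passes the depth test (the value stays 1), and behind it both pass (1 - 1 + 1 = 1).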

One note here: the “heavy” fragment shader in step 7 doesn’t actually do anything meaningful, but it compiles to around 500 asm instructions (it’s written in Cg and compiled with the FP40 profile), so it should be fine for our needs. It doesn’t discard any pixels, and it doesn’t write to the depth output. Also, alpha testing is disabled the whole time.

The results, for a fixed camera configuration in all cases, are:

  1. Render to the window, with stencil masking : 256 fps
  2. Render to the window, without stencil masking : 52 fps
  3. Render to the FBO, with stencil masking : 51 fps
  4. Render to the FBO, without stencil masking : 52 fps

The GPU used was a GeForce 7950GT (Forceware 175.16) on a Q6600.

In order to measure only the actual rendering speed in the FBO case, the final texture isn’t displayed on screen afterwards (no full-screen quad is drawn with it).

From the above results, there seems to be something affecting the place of the stencil test in the pipeline, depending on whether you are rendering to an offscreen render target or the main framebuffer. I’ve tried various depth and stencil functions for the last pass, but none of them seems to make a difference. I’m not going to conclude that this is the driver’s fault, but I’ll leave the case open for further investigation.

One thing I haven’t tested is using multiple FBOs (or different textures) on successive frames, in order to reduce the current frame’s dependence on the previous frames’ results. Cycling through 2 or 3 textures should be enough. Since I don’t actually use the resulting texture, there shouldn’t be a performance difference in this demo. But in a real application, where the texture is actually used (probably immediately after rendering to it), it should make a difference. I’ll post updates once I check this.

Before closing, I wanted to point out that in the case of shadow masks, there are a couple of other things you can do to speed up the generation pass, such as scissoring and depth bounds testing based on the light’s screen-space bounding rectangle/box. Even without stencil masking, this can help a lot in the case of relatively small lights.
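
A minimal sketch of how such a scissor rectangle could be computed for a point light, assuming the light is given as a bounding sphere in view space, the camera looks down -Z with a symmetric perspective projection, and the whole sphere is in front of the camera. The struct and function names are hypothetical; the result is laid out the way glScissor expects it (x, y, width, height, origin at the bottom-left).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct ScissorRect { int x, y, width, height; };

// Conservative screen-space rectangle for a light's bounding sphere.
// It projects the 8 corners of the sphere's view-space bounding box, so the
// rectangle is somewhat larger than the sphere's true projection.
ScissorRect lightScissorRect(float cx, float cy, float cz, float radius,
                             float fovYRadians, float aspect,
                             int viewportWidth, int viewportHeight) {
    // Assumption: the whole bounding box lies in front of the camera.
    assert(cz + radius < 0.0f);
    const float tanHalfY = std::tan(fovYRadians * 0.5f);
    const float tanHalfX = tanHalfY * aspect;

    float minX = 1e30f, maxX = -1e30f, minY = 1e30f, maxY = -1e30f;
    for (int i = 0; i < 8; ++i) {
        const float x = cx + ((i & 1) ? radius : -radius);
        const float y = cy + ((i & 2) ? radius : -radius);
        const float z = cz + ((i & 4) ? radius : -radius);
        const float ndcX = x / (-z * tanHalfX);  // perspective divide
        const float ndcY = y / (-z * tanHalfY);
        minX = std::min(minX, ndcX); maxX = std::max(maxX, ndcX);
        minY = std::min(minY, ndcY); maxY = std::max(maxY, ndcY);
    }

    // Clamp to the visible part of normalized device coordinates.
    minX = std::max(minX, -1.0f); maxX = std::min(maxX, 1.0f);
    minY = std::max(minY, -1.0f); maxY = std::min(maxY, 1.0f);

    // Map [-1, 1] to pixels, rounding outwards so no affected pixel is cut.
    const int x0 = static_cast<int>(std::floor((minX * 0.5f + 0.5f) * viewportWidth));
    const int x1 = static_cast<int>(std::ceil ((maxX * 0.5f + 0.5f) * viewportWidth));
    const int y0 = static_cast<int>(std::floor((minY * 0.5f + 0.5f) * viewportHeight));
    const int y1 = static_cast<int>(std::ceil ((maxY * 0.5f + 0.5f) * viewportHeight));
    return { x0, y0, std::max(x1 - x0, 0), std::max(y1 - y0, 0) };
}
```

The same view-space z range, cz - radius to cz + radius, is what a depth bounds test would be derived from, after converting it to window-space depth with the projection in use.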

If you have any thoughts/insights on the subject, I’d like to hear them.

Thanks for reading.

JD


8 Responses to “Shadow masks and stencil buffer optimization”

  1. Brian Richardson Says:

    Is it possible that the FBO you’re rendering to doesn’t have the depth/stencil buffer attached to it correctly? I’ve run into an occasional situation where I wouldn’t always notice if there was a depth/stencil size/format mismatch and the API would just not use the buffer. With DirectX, you can just use the DX debug runtime and it’ll complain about that.

  2. JD Says:

    Hi Brian,

    I’m not really sure if I could get the setup wrong, and here is why:

    1) I’m using a renderbuffer for the depth/stencil surface, so there is only one way to create and set it up.
    2) Dimensions are the same as the color texture, primarily because they are both created in the same function which takes the dimensions as parameters, but most importantly, differently sized attachments aren’t supported by my 7950, so I would expect the FBO to be incomplete (I check for that).
    3) There is only one format supported for rendering to a stencil buffer with FBOs, and this is DEPTH24_STENCIL8 (GL_EXT_packed_depth_stencil) so I don’t have a lot of options here.

    I haven’t checked GLexpert, so I don’t know if it complains about my setup; I’ll have to do that sometime. AFAIK this is the only way to check if something is wrong under GL. Hopefully the debug profile under GL 3.0 will make tracking down such problems easier, but I’m not really sure about that until complete GL 3.0 drivers are out (and I’m able to get a GL 3.0 capable card, but that is another story).

    One thing I’ve tested since I posted this is what happens if I don’t render anything to the stencil buffer. I just clear it to the correct value and let the fragments pass if the value is different from the clear value. Again, when rendering to the window the FPS is pretty high (because no fragments pass the test), but when I render to the FBO the FPS is the same with or without the test, as if the test happens after the fragment shader (the output is correct in both cases).

    If you have any suggestions on how to test this further I’d be glad to hear them.

    JD

  3. Brian Richardson Says:

    Looks like you’ve covered all the bases with the last test to me. I just know it’s good to ask the dumb questions when getting a “funny result.” heh

  4. Kay Chang Says:

    I’ve performed similar tests on my GeForce 7900GT a while ago.
    - early depth rejection always works, on the window framebuffer, or offscreen FBO
    - early stencil rejection only works on the window framebuffer, never on the offscreen FBO

    While running on a Geforce 8800GTX, early depth rejection and early stencil rejection always work, either on the window framebuffer or offscreen FBO.

    It’s also worth noticing that early depth rejection and early stencil rejection are triggered only with specific render-state combinations, which AREN’T the same for Nvidia and ATI graphics cards.

    AFAIK: with Nvidia, stencil-write must be disabled for early stencil rejection to kick in.
    On ATI, you can write the stencil while testing it, and early rejection keeps working in theory (hmm, haven’t tried that myself).
    For early depth rejection, it might only work with specific depth-compare functions (not all; it depends on the vendor). Trying to share an offscreen depth texture between different FBOs is likely to trash the early depth cull memory, etc…

    Definitely a lot of fun to get all that working xD

  5. JD Says:

    Thanks a lot for the info, Kay. I suspected that it may have something to do with the GPU itself and not the driver, but I didn’t have a chance to test it on anything but my card. Writing to stencil with early stencil rejection isn’t required in the case of shadow masks, since the only object written to stencil is the light volume, but it may be relevant for other, more complex algorithms.

    I’ll have to run some tests for early depth rejection at some point. Fortunately I’m not sharing the depth renderbuffer between FBOs, so there should be no problems with that. I’m currently changing the color attachments of an FBO and keeping the depth buffer attached all the time. Are there any problems with that?

    Thanks again for the valuable information.

    JD

  6. Kay Chang Says:

    I am doing the same thing (same FBO with depth-stencil attachment, several color attachments RGBA8-FP16 swapped over the frame). It has been working for a while, although the latest drivers give me a staggered framerate (not sure why though).

    If you are programming on an nvidia card, you should have 2 useful extensions available to you:
    - GL_DEPTH_BOUNDS_TEST_EXT: I use it as a work-around to render light volumes in my deferred renderer if early-stencil rejection isn’t working ( => nvidia).
    - GL_EXT_timer_query: very accurate for timing GPU calls and for use as a GPU profiler. It has been very valuable to me, as sometimes the driver screws things up when you switch FBOs or change to -funky- FBO configurations, etc… some operations are not very well supported by the driver (or maybe not meant to be?).

    Hope this helps.

    Cheers,
    Kc

  7. JD Says:

    Using the same FBO and switching attachments may not be the optimal solution, but I’m doing it that way mostly because I was reading everywhere that this should be the fastest approach (in other words, avoid excessive glBindFramebuffer calls). On the other hand, changing FBO attachments all the time may add extra validation overhead in the driver. The only way to find the best approach is to profile your code, but again it may be driver specific as you said.

    I’ve already mentioned depth bounds testing in combination with scissoring, in order to speed up shadow mask calculations, at the end of the post. I’m also aware of the timer query extension, but I haven’t had the chance to work with it yet. I didn’t know that those two are nvidia-only extensions (especially depth bounds testing), so I’ll keep that in mind.

    Thanks again for your comments.

    JD

  8. Caffeinated Guy Says:

    - early stencil rejection only works on the window framebuffer, never on the offscreen FBO

    ARG!! This explains why it works great on my 9400M laptop but makes it run even slower on my 6600 :( (from 24fps before stenciling to 19fps)

    -Greg

