Parallel Split Shadow Maps

Update 2012-06-24: Links to the original page/paper seems to be dead. Search for Parallel-Split Shadow Maps for Large-scale Virtual Environments on google and you’ll find it.

Parallel Split Shadow Maps (PSSM) is a shadow mapping technique which tries to minimize shadow aliasing by splitting the view frustum into smaller regions, each one covered by a separate small shadow buffer, effectively distributing the shadow buffer resolution more evenly along the view frustum. For more details on how the algorithm works see the original paper.

What we will present here is the way we integrated PSSM into the engine. First we’ll take a look at how different shadow algorithms can be implemented as plugins to the engine, based on one common interface, and then we will jump into the details of our implementation.

High-level overview of Shadows Methods

Shadow Methods are regular render paths (they are based on the same interface), which means that they get all the functionality provided to the other render paths. More specifically:

  • They can be updated every frame, as long as they appear in the render queue
  • They can traverse the scene and collect visible objects from their own camera
  • They can have multiple passes
  • The user can specify a different shader for each pass, for each different effect

The main differences from the other render paths are:

  1. They can be attached to light sources
  2. They must implement a specific Cg interface

Attaching shadow methods to light sources, instead of specifying the shadowing algorithm per-object, results in a unified shadowing system where all objects in a scene can cast and receive shadows in the same way. Self-shadowing is a by-product of this system, and it cannot be turned off, but the user is free to choose whether an object (a piece of geometry) will cast shadows on the scene (and as a result on itself) or not, through the material / effect it uses.

The required Cg interface, every shadow method must implement, is responsible for generating the shadow term when rendering an illumination pass. It is very simple, as it requires only one function to be implemented, and it is presented in the following code snippet:

interface IShadow
{
    float CalcShadow(float4 positionWS);
};

The only thing required by the interface is the current (pixel) position in world space. The shadow method is responsible for (e.g.) transforming this position to light space, reading a shadow map and returning the shadow term to the caller (which is usually a light interface). This way, the shadowing algorithm is decoupled from the actual lighting algorithm. I.e. one shadow method which is designed for spot lights can be used with a spot light with or without attenuation, with or without a projected texture, etc. Even the main surface shader doesn’t need to know the shadowing algorithm.

One thing worth noting here is that all required shader permutations are generated on the fly, automatically, when and if they are needed. In other words a (e.g.) dot3 diffuse shader with a spot light and a regular depth-based shadow map is generated only if an object using this shader gets into the region affected by such a light.

Parallel Split Shadow Maps implementation

A high level view of the PSSM algorithm is presented in the following pseudocode:

Split the view frustum into different parts using planes parallel to the near and far clipping planes
for each frustum part FPi
{
    Calculate a focused projection matrix (ProjMat), which will include all objects casting shadows on FPi
    Create a new light frustum using ProjMat and the light's modelview matrix (FocusedLightFrustum)
    Collect all objects intersecting FocusedLightFrustum and render them to a depth map (DMi)
}
for each object in the scene O
{
    Render O with its surface shader, using the correct depth map for calculating the shadow term
}

For simplicity lets assume that the maximum number of different splits is 4 (which is the maximum allowable value in our current implementation). This should be enough in most cases, especially if you constraint the camera’s far plane to a relatively small value, which can be different than the actual maximum view distance.
The algorithms for splitting the view frustum into smaller parts, as well as calculating the focused projection matrix, are described in the paper, and sample source code can be found in the varius demos showcasing the technique.

Now lets see how we can implement the last part of the algorithm, which is the actual rendering of the scene with the previously generated shadow maps. Most demos render the objects included in each frustum part, immediately after creating the shadow map, preferably using clipping planes in order to limit the projection only to the affected objects/pixels. Unfortunately such an algorithm may require rendering some of the objects in the scene multiple times, and may result in multiple unneccessary state changes (render target swithes, shader switches, etc.). It has the advantage of using only one shadow map for all splits, which can be beneficial in memory limited situations or in the case where there are few free texture units available in the illumination shader. Because this approach seemed impractical in our case, we decided to go with a single pass approach.

So lets look at simplest possible implementation, which renders all splits in a one pass.

Vertex shader (Cg):

struct AppToVertex
{
    float4 PositionOS : POSITION; // object space position
};
 
struct VertexToFragment
{
    float4 PositionCS : POSITION; // clip space position
    float4 PositionWS : TEXCOORD0; // world space position used for selecting the correct shadow map
    float4 ShadowMapCoord[4] : TEXCOORD1; // vertex position in light space (1 for each frustum split)
};
 
VertexToFragment main_vs(AppToVertex input, 
                          uniform float4x4 mvp, 
                          uniform float4x4 modelToWorldMatrix,
                          uniform float4x4 shadowMatrix[4])
{
    VertexToFragment output;
 
    output.PositionCS = mul(mvp, input.PositionOS);
    output.PositionWS = mul(modelToWorldMatrix, input.PositionOS);
    output.ShadowMapCoord[0] = mul(shadowMatrix[0], input.PositionOS);
    output.ShadowMapCoord[1] = mul(shadowMatrix[1], input.PositionOS);
    output.ShadowMapCoord[2] = mul(shadowMatrix[2], input.PositionOS);
    output.ShadowMapCoord[3] = mul(shadowMatrix[3], input.PositionOS);
 
    return output;
}

Fragment shader (Cg):

float4 main_fs(VertexToFragment input, 
                    uniform sampler2D shadowMap[4], 
                    uniform float4 splitPositions, // the split positions in camera space; distance from near plane
                    uniform float4 cameraMatZ  // the z axis of the camera's view matrix in order to calculate current pixel's distance from the near plane
                   ) : COLOR0
{
    float eyeSpaceZ = -dot(cameraMatZ, input.PositionWS);
 
    float shadowTerm = 1;
    if(eyeSpaceZ < splitPositions.x)
        shadowTerm = tex2Dproj(shadowMap[0], input.ShadowMapCoord[0]).x;
    else if(eyeSpaceZ < splitPositions.y)
        shadowTerm = tex2Dproj(shadowMap[1], input.ShadowMapCoord[1]).x;
    else if(eyeSpaceZ < splitPositions.z)
        shadowTerm = tex2Dproj(shadowMap[2], input.ShadowMapCoord[2]).x;
    else if(eyeSpaceZ < splitPositions.w)
        shadowTerm = tex2Dproj(shadowMap[3], input.ShadowMapCoord[3]).x;
 
    return shadowTerm.xxxx;
}

This fragment shader compiles to 25 instructions in arbfp1 profile (21 arithmetic and 4 texture), or 31 instructions in ps_2_0 (27 arithmetic and 4 texture). Unfortunatelly the use of 5 interpolators doesn’t allow the compilation with earlier profiles. The above shader combination requires 4 texture units, 18 uniforms (float4′s) and 5 interpolators.

A slightly optimized version of the above shader:

float4 main_fs_optimized(VertexToFragment input, 
                    uniform sampler2D shadowMap[4], 
                    uniform float4 splitPositions, // the split positions in camera space; distance from near plane
                    uniform float4 cameraMatZ  // the z axis of the camera's view matrix in order to calculate current pixel's distance from the near plane
                   ) : COLOR0
{
    float eyeSpaceZ = -dot(cameraMatZ, input.PositionWS);
 
    // Perform all comparisons at once...
    float4 comparison = eyeSpaceZ.xxxx < splitPositions;
 
    // Pack all shadow results into a single float4...
    float4 shadowTermVec;
    shadowTermVec.x = tex2Dproj(shadowMap[0], input.ShadowMapCoord[0]).x;
    shadowTermVec.y = tex2Dproj(shadowMap[1], input.ShadowMapCoord[1]).x;
    shadowTermVec.z = tex2Dproj(shadowMap[2], input.ShadowMapCoord[2]).x;
    shadowTermVec.w = tex2Dproj(shadowMap[3], input.ShadowMapCoord[3]).x;
 
    // and finally select the correct one.
    comparison.yzw -= comparison.xyz;
    float result = lerp(1.0, shadowTermVec.x, comparison.x);
    result = lerp(result, shadowTermVec.y, comparison.y);
    result = lerp(result, shadowTermVec.z, comparison.z);
    result = lerp(result, shadowTermVec.w, comparison.w);
 
    return result.xxxx;
}

After the optimizations, the shader compiles to 15 instructions in arbfp1 profile (11 arithmetic and 4 texture), or 21 instructions in ps_2_0 (17 arithmetic and 4 texture).

Unfortunately the above shader isn’t compatible with our Cg interface, so a different approach must be taken. The reason for using such an interface (as mentioned above) is to decouple shadow (and lighting) calculations from the main shader, in order to be able to generate all required permutations automatically, instead of altering each shader every time we implement a new shadowing algorithm. The main difference from the above shaders is that in our case we have the pixel’s position in world space, inside the fragment shader, instead of getting it as a result from the vertex shader. This is required in order to be able to support shadow masks, where the pixel position is generated in a full screen pass from a linear depth buffer.

So, what we have to do is to perform the world-space to light-space transformation per-pixel. This one alone requires 4 dots, so we have to find a different set of optimizations to perform. What we decided to do are the following :

  1. Pack all 4 shadowmaps into one atlas. This reduce the required texture units from 4 to 1, for the shadow maps.
  2. Instead of selecting the correct shadowing term, we select the correct matrix, and perform only one matrix-vector multiply per-pixel

Selecting the correct matrix, before performing the shadow map look up, is a bit difficult, since we can’t index into an array of uniforms, with a run-time calculated index. In order to overcome this problem, we packed all 4 matrices into a 4×4 RGBA32F rectangular texture, and read the correct matrix by shifting the texcoords.

The resulting shader looks like this:

float4 main_fs_lengine(VertexToFragment input, 
                    uniform sampler2D shadowMapAtlas, 
                    uniform samplerRECT matrixTex,
                    uniform float4 splitPositions, // the split positions in camera space; distance from near plane
                    uniform float4 cameraMatZ  // the z axis of the camera's view matrix in order to calculate current pixel's distance from the near plane
                   ) : COLOR0
{
    float eyeSpaceZ = -dot(cameraMatZ, input.PositionWS);
 
    // Perform all comparisons at once...
    float4 comparison = eyeSpaceZ.xxxx < splitPositions;
 
    // total takes the following values depending on the current split:
    // total = 4 -> 1st split
    // total = 3 -> 2nd split
    // total = 2 -> 3rd split
    // total = 1 -> 4th split
    // total = 0 -> beyond 4th split
    //
    // This is the same as dot(comparison, float4(1, 1, 1, 1))
    float total = dot(comparison, comparison);
 
    // an offset of 0.5 is used for sampling at the center of the pixel
    float2 matrixUV = float2(4.5 - total, 0.5);
 
    float4 shadowMatrix[4];
    for(int i = 0;i < 4;i++)
    {
        shadowMatrix[i] = texRECT(matrixTex, matrixUV);
 
        // We should treat matrixUV as a float2, instead of altering only the second 
        // texcoord, because otherwise the compiler doesn't do a single ADD for this.
        // It uses 2 MOVs in order to create the next texcoord
        matrixUV += float2(0.0, 1.0);
    }
 
    float4 lightSpaceVert;
    lightSpaceVert.x = dot(shadowMatrix[0], input.PositionWS);
    lightSpaceVert.y = dot(shadowMatrix[1], input.PositionWS);
    lightSpaceVert.z = dot(shadowMatrix[2], input.PositionWS);
    lightSpaceVert.w = dot(shadowMatrix[3], input.PositionWS);
 
    return lerp(1.0, tex2Dproj(shadowMapAtlas, lightSpaceVert).x, comparison.w).xxxx;
}

This shader compiles to 22 instructions at ps_2_0 profile (5 tex and 17 arithmetic), and to 19 instructions for the arbfp1 profile (5 tex and 14 alu). Not that bad after all, especially if you consider that the final shader requires only 1 interpolator, 2 uniforms and 2 texture units, indepently of the number of split used. Also, the final shader is both compatible with our Cg interface (which means that it can be easily plugged into the engine) and with shadow masks.

I know this was long, so thanks for making it this far. Comments/throughts/corrections are welcome.

Next stop, probably a combination of Variance Shadow maps with PSSM (PSVSM)…

JD

Tags: ,

4 Responses to “Parallel Split Shadow Maps”

  1. Eichi Says:

    Hey, I’ve just read your article and its a good idea! But I wonder about your shaders. They seem to be a bit costly. Five TEX’ per fragment is very heavy! Take a look at my page – I’ve been implementing PSSM for a while, but with a sligthly different approach: I render every shadow map in a different color channel (thus making VSM impossible) to spare three additional textures. Furthermore, I came up with a solution with only one (!) tex per fragment.

  2. JD Says:

    Hi Eichi,

    First of all, thanks for reading.

    The actual shader i’m using is the last one. I know that 5 TEXs should be expensive, but unfortunately in my case this seemed like the best approach. I think the 4 TEXs from the matrix texture should be very coherent so they should be fast, but I’m not really sure about that.

    Your approach of rendering all 4 splits into the 4 different channels of the same texture sounds good. May I ask what format you are using for that? I assume RGBA16f (or RGBA32f), in contrast with my implementation where I use one regular depth24 texture for all shadow maps. If this is the case, isn’t there a difference in performance from rendering to a floating point texture?

    What I can’t understand is how do you sample your shadowmap. The way i think about it, you have 2 options:

    1) Perform 4 TEXs (using the 4 different shadowmap space vectors) and select the correct one based on the calculated index (see the second fragment shader from above), or
    2) Select the correct shadowmap space vector and perform only one TEX at the correct location (the channel can be easily extracted using the split index).

    In the second case (which is the only one i can think with 1 TEX) how do you select the correct shadowmap space vector using the split index?

    JD

  3. Brian Richardson Says:

    Check out “Stable Rendering of Cascaded Shadow Maps”. It shows a really cool trick of using masks and dot products to scale the base projection matrix. That would let you get rid of the texMatrix samples by just computing the tex coords directly. You can also fold the texture atlas computation into it!

    You can also see it in the first comment of this: http://pixelstoomany.wordpress.com/2007/09/21/fast-percentage-closer-filtering-on-deferred-cascaded-shadow-maps/

  4. JD Says:

    Brian, thanks for the link. The depth bounds test sounds like a nice idea, but unfortunately i can’t think of an easy way to incorporate it in the current design due to multiple passes. I’ll definitely keep it in mind for the future though.

    About the first comment in the link you posted. The resulting shader will look like the second fragment shader i posted above, with the exception that since we have an atlas, it’s more natural to first find the correct vector and then sample the shadowmap atlas. The idea is the same, but unfortunately it can’t be applied in my case since i don’t have access to the 4 shadow vectors from a vertex shader. What i have is the pixel’s position in world space, and i have to work with that. This means calculating the 4 shadowmap vectors in the fragment shader (by passing the matrices as constants), then finding the correct one using the calculated index and finally sampling the shadowmap. This needs only one TEX but it’s math heavy so i don’t know if there is actually a speed up.

    Unfortunately, i don’t have ShaderX6 so i can’t comment on the article you suggested. It’s in my list for books to buy but i didn’t have a chance to get it yet. The good thing is that you gave me an idea to improve the shader. Here it is:

    Since all 4 matrices (1 per split) are based on the same light projection matrix, with different crop matrices, i think it’s possible to simplify the shader by passing the light projection matrix as uniforms, and store the crop matrices in the texture. The good thing is that the crop matrices have a simple representation and they can be packed in 4 floats, which requires only one texel in the matrixTex. This way we still need the transformation from world to light space (4 DOTs), but we can substitute the 4 TEXs to the matrixTex with 1 TEX, 1 MUL and 1 MAD.

    I haven’t tried it yet, but i’ll definitely do so once i have the chance.

    Thanks again for the comments.

    JD

Leave a Reply


four + 3 =