Virtual Textures
We’ve recently started searching for ways to optimize our terrain renderer. Our current implementation uses material splatting with user-specified weights per material, as well as decals for extra details (e.g. roads, footsteps, etc.). For distant terrain we use a base layer, in order to minimize the amount of work required to render the terrain. Here I’ll talk about our attempt at virtual textures, the difficulties we’ve encountered and how we are planning on integrating the technique into the engine.
Virtual textures, megatextures, clipmaps, etc.
Independently of the name you pick, the fundamental concept is the same: The ability to render a scene using textures larger than the limit (in dimensions or memory) imposed by the graphics API and/or the GPU used, by keeping in memory only the parts required to render each frame. The different names correspond to different algorithms based on this idea. Some are designed for terrain rendering only, others for arbitrary geometry. Here are some links on the subject (in no particular order):
- SGI’s paper on clipmaps
- The Clipmap: A Virtual Mipmap
- Virtual Texture: A Large Area Raster Resource for the GPU
- Hardware-Independent Clipmapping
- Unified Texture Management for Arbitrary Meshes
- Sparse Virtual Textures
- nVidia’s white paper on clipmaps for DX10
- Clipmapping on the GPU
- Advanced virtual texture topics (Chapter 2 from here)
Our implementation is based on Sean Barrett’s GDC presentation and demo, with some modifications and additions. Other than Sean’s presentation video, you can read papers no. 5 and 9 for more information on this particular algorithm. The major steps of the algorithm are:
- Find which tiles are required for the current frame
- Update the tile cache by uploading newly introduced tiles
- Update the indirection table
- Render the geometry with a special fragment shader using the tile cache instead of a regular texture
Step 1: Find required tiles
There are several different ways to find which tiles are required for each frame. For a comprehensive list of available options read paragraph 2.3.6 from paper 9. Our current implementation uses Barrett’s original idea of rendering the scene to a texture with a special shader (or a special texture) and reading back the result on the CPU for further processing. Unfortunately, this approach doesn’t allow any prediction and suffers from temporal blurriness, as mentioned in Crytek’s paper. In our little demo, this isn’t actually a big problem, so we settled on this solution for now.
The only problem with this method is that the required readback may slow things down a lot, especially if exact levels of detail are required (i.e. when rendering the special buffer at full screen resolution). In order to speed things up we used two different PBOs (GL_ARB_pixel_buffer_object) for the readback, with a one-frame delay between rendering and actual processing. Rendering of this special buffer is performed at half the screen resolution, with an appropriate bias to the selected level of detail. This way we can get almost the same output in one fourth of the required time.
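To make the "appropriate bias" concrete, here is a minimal sketch of the arithmetic. It assumes the mip level is the base-2 logarithm of the texel-to-pixel footprint, as in standard mipmapping. The post does not give the exact bias value; the -1 used here is an assumption that follows from halving the resolution on each axis. The function name and the footprint value are hypothetical.

```python
import math

def mip_level(texels_per_pixel, bias=0.0):
    """Mip level for a given texel-to-pixel footprint, clamped at level 0."""
    return max(0.0, math.log2(texels_per_pixel) + bias)

# Hypothetical footprint seen by a pixel when rendering at full resolution.
full_res_footprint = 4.0
# At half resolution each feedback pixel covers twice as many texels per axis.
half_res_footprint = full_res_footprint * 2.0

level_full = mip_level(full_res_footprint)
level_half_unbiased = mip_level(half_res_footprint)
# Assumed bias of -1 brings the half-resolution result back to the full-resolution level.
level_half_biased = mip_level(half_res_footprint, bias=-1.0)
```

Halving both dimensions also leaves one quarter of the pixels to read back and scan, which is in line with the timing mentioned above.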
After the buffer is processed, we have collected all the tiles required for rendering the scene. There might be a problem, though, which isn’t mentioned in the above pseudocode: the number of required tiles may exceed the number of tiles the cache can hold. To guard against this, after finding which tiles are required and marking them in a separate buffer, we also request all their parents, so that a lower-detail tile is always available as a fallback.
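A minimal sketch of the marking step, assuming each feedback pixel has already been decoded into a (level, x, y) tile id, with level 0 being the most detailed mipmap. The function names and the tile id layout are hypothetical, not taken from the demo.

```python
def parent(tile):
    """Tile one mip level up that covers the given tile."""
    level, x, y = tile
    return (level + 1, x // 2, y // 2)

def collect_required_tiles(feedback_tiles, coarsest_level):
    """Mark every tile seen in the feedback buffer, plus all of its ancestors."""
    required = set()
    for tile in feedback_tiles:
        # Walk up the mip chain. Stop as soon as a tile is already marked,
        # because its ancestors were marked when it was first inserted.
        while tile not in required:
            required.add(tile)
            if tile[0] == coarsest_level:
                break
            tile = parent(tile)
    return required

# Example: an 8x8 tile grid at level 0, so level 3 holds a single tile.
feedback = [(0, 5, 3), (0, 5, 3), (0, 4, 2), (1, 0, 0)]
required = collect_required_tiles(feedback, coarsest_level=3)
```

In this example four feedback entries (one of them a duplicate) expand to seven marked tiles once the parents are included.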
Step 2: Update the tile cache
Starting from the mipmap at the lowest detail level, we request all visible tiles from that mipmap, one by one. When all tiles from a mipmap level have been considered, we move on to the next mipmap level. If the cache is full, we stop asking for higher detail tiles and we continue with the next step. This works because we already have their parents in the cache, and we can use them instead. Of course the quality won’t be optimal but we can continue rendering without worrying about invalid data.
Every time a tile is inserted in the cache texture, we inform the indirection table about its location in the cache and its mipmap level.
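The cache update can be sketched as follows. Assumptions: tiles are (level, x, y) ids with higher level numbers meaning lower detail, the cache is a dictionary from tile id to slot index, and tiles that are no longer required are simply evicted. The post does not describe the eviction policy, so that part is purely illustrative, as are all the names.

```python
def update_tile_cache(required, cache, capacity):
    """Insert required tiles into the cache, lowest detail first.

    Returns the list of (tile, slot) pairs that were newly inserted, which is
    what the indirection table needs to be told about.
    """
    # Assumed eviction policy: drop everything that is not required this frame.
    for tile in [t for t in cache if t not in required]:
        del cache[tile]
    free_slots = sorted(set(range(capacity)) - set(cache.values()))

    uploaded = []
    # Highest level number (lowest detail) first, then a stable order by position.
    for tile in sorted(required, key=lambda t: (-t[0], t[1], t[2])):
        if tile in cache:
            continue
        if not free_slots:
            break  # Cache is full: stop asking for higher-detail tiles.
        slot = free_slots.pop(0)
        cache[tile] = slot
        uploaded.append((tile, slot))
    return uploaded

required = {(0, 5, 3), (0, 4, 2), (1, 2, 1), (1, 0, 0), (2, 1, 0), (2, 0, 0), (3, 0, 0)}
cache = {}
uploaded = update_tile_cache(required, cache, capacity=4)
```

With room for only four tiles, the coarsest tile and the next levels down get in, and the two level 0 tiles fall back to their parents.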
Tiles are requested asynchronously from a separate thread. This thread is responsible for “generating” the requested tile data and sending them back to the main thread. We say “generate” because the data can be either procedurally generated or read from a file. When the main thread requests a tile which isn’t available in the cache (a second level cache in system RAM; not to be confused with the tile cache texture which resides in VRAM), the corresponding pixel in the indirection table is unmarked, and its parent is used.
Step 3: Update the indirection table
The next step is to update all tiles in the indirection table that were not touched during step 2. During step 1, every tile which should be requested during step 2 has been marked. In this step we check all the tiles which haven’t been marked, and we assign them the same cache page as their parents. This procedure is performed without using recursion, by looping through each mipmap level of the indirection table, assigning parent page information to each unmarked child.
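A minimal sketch of this non-recursive pass. It assumes the table is stored as one 2D grid per mipmap level, with level 0 the most detailed, where each entry is either a (cache_page, mip_level) pair or None for an unmarked tile, and that the single coarsest tile is always resident. The layout and names are hypothetical.

```python
def fill_indirection(table):
    """Give every unmarked entry the cache page of its parent.

    table[level][y][x] is (cache_page, mip_level) for resident tiles and
    None for tiles that are not in the cache.
    """
    coarsest = len(table) - 1
    # Go from low detail to high detail so that a parent is always resolved
    # before its children look at it.
    for level in range(coarsest - 1, -1, -1):
        for y, row in enumerate(table[level]):
            for x, entry in enumerate(row):
                if entry is None:
                    row[x] = table[level + 1][y // 2][x // 2]
    return table

table = [
    [[(2, 0), None, None, None]] + [[None] * 4 for _ in range(3)],  # level 0, 4x4
    [[(1, 1), None], [None, None]],                                 # level 1, 2x2
    [[(0, 2)]],                                                     # level 2, 1x1
]
fill_indirection(table)
```

After the pass every entry points at the most detailed resident ancestor, so the shader never reads an empty entry.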
When each tile in the indirection table has a corresponding cache page assigned, we must update the indirection texture. In contrast to what Crytek suggests in their paper, we used an RGBA8 indirection texture. The main reason for this is to avoid the extra CPU time required for converting the indirection data to FP16 format, and the bandwidth required for uploading the data, especially when the indirection texture is big. As in Barrett’s demo, we use the floor() trick to convert a fixed point value in the [0, 1] range to an integer value in the shader. We haven’t seen any problems yet, so the extra gain in performance is worth it.
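The post does not spell out the exact expression, so the following is only one common form of the floor() trick, simulated on the CPU: a byte b stored in an RGBA8 texture is read in the shader as b / 255, and floor(v * 255 + 0.5) recovers the integer. The float32 rounding helper is there to mimic shader precision; all names are hypothetical.

```python
import math
import struct

def to_float32(x):
    """Round a Python float to single precision, as a shader would see it."""
    return struct.unpack("f", struct.pack("f", x))[0]

def normalized_from_byte(b):
    """Value the shader reads from an 8-bit normalized texture channel."""
    return to_float32(b / 255.0)

def byte_from_normalized(v):
    """Recover the stored integer; the +0.5 absorbs small rounding errors."""
    return int(math.floor(to_float32(v * 255.0) + 0.5))

round_trip_ok = all(
    byte_from_normalized(normalized_from_byte(b)) == b for b in range(256)
)
```

The round trip holds for all 256 byte values, which is what makes 8-bit channels usable for page indices up to 255 per channel.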
Step 4: Render the geometry
Finally, it’s time to use the virtual texture (indirection table + tile cache). The original texture coordinates are used to read the indirection texture. The indirection information is then used to calculate the correct coordinate in the tile cache texture. The real problem with this step is filtering of the tile cache texture. Currently we have only tried bilinear filtering. Bilinear filtering requires a one-texel border around each tile, but we already use a 4-pixel border for DXT compression, so other, more complicated, filters can be implemented in the shader. Bilinear filtering of the tile cache texture, with a mipmapped indirection table, gives the same output as a regular texture with bilinear filtering on the nearest mipmap level (MagFilter = LINEAR, MinFilter = LINEAR_MIPMAP_NEAREST).
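The address translation can be sketched on the CPU like this. It assumes the indirection entry holds the tile's page position in the cache and its mip level, that tiles are 256×256 with a 4-texel border on every side, and that the result is expressed in cache texels rather than normalized coordinates. The entry layout and the function name are hypothetical.

```python
import math

def virtual_to_cache(u, v, entry, tiles_level0, tile_size=256, border=4):
    """Map a virtual texture coordinate in [0, 1) to a texel position in the tile cache."""
    page_x, page_y, level = entry
    # Number of tiles per side at the mip level of the resident tile.
    tiles_per_side = tiles_level0 >> level
    # Position inside the tile, as a fraction in [0, 1).
    frac_u = math.modf(u * tiles_per_side)[0]
    frac_v = math.modf(v * tiles_per_side)[0]
    # Each cache page stores the tile plus its border on both sides.
    padded = tile_size + 2 * border
    cache_x = page_x * padded + border + frac_u * tile_size
    cache_y = page_y * padded + border + frac_v * tile_size
    return cache_x, cache_y

# A 65536x65536 virtual texture has 256 tiles per side at level 0.
cache_pos = virtual_to_cache(0.5078125, 0.25, entry=(3, 1, 2), tiles_level0=256)
```

The border offset is what keeps the bilinear footprint inside data that belongs to the tile.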
Storing and editing virtual textures
Having described the major steps of the algorithm, the only real problem remaining is how to store and edit virtual textures. Ideally we want the worker thread to be as fast as possible, in order to minimize popping. On the other hand, we need to be able to edit virtual textures, because the ultimate goal is to integrate the algorithm into our terrain editor. There are two options here:
1) Bake a virtual texture with all of its mipmaps on disk and load it directly or
2) Keep the information about layers, weights and decals and bake the requested tiles on demand.
The first option may be faster for run-time use, but the space requirements increase rapidly with texture resolution. The second option requires more work from the worker thread, so there may be a longer delay between the request and the availability of a tile, but the space requirements are the same as the current algorithm. Since baking of tiles must happen in a second thread, render-to-texture isn’t a candidate, so the baking must be done on the CPU.
For our little demo, we decided to go with the prebaked approach. The maximum virtual texture resolution we’ve tried was 65536×65536, with a tile size of 256×256 (without borders). This texture, in uncompressed RGBA8 format, requires approximately 22GB of space (including mipmaps). In order to minimize the required space, JPEG compression was used for each tile individually. The final texture was about 4GB at a quality of 75 (roughly 1/6 of the original size), but since we are using JPEG, the ratio depends on the amount of detail in the texture. For JPEG we used jpeglib, so compression ratios may vary with other libraries.
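The storage figure can be checked with a few lines of arithmetic. This sketch only assumes 4 bytes per texel and a full mip chain down to 1×1; the function name is hypothetical.

```python
def virtual_texture_bytes(size, bytes_per_texel=4):
    """Uncompressed size of a square texture including its full mip chain."""
    total = 0
    while size >= 1:
        total += size * size * bytes_per_texel
        size //= 2
    return total

total_bytes = virtual_texture_bytes(65536)
total_gib = total_bytes / 2**30   # binary gigabytes
total_gb = total_bytes / 1e9      # decimal gigabytes
```

The base level alone is 16 GiB, and the mip chain adds one third on top of that, which lands in the 21 to 23 GB range depending on the unit used.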
Generating mipmaps for a 65536×65536 texture can be a little tricky, because usually mipmap generation algorithms require the mipmap level to be present in RAM for the generation of the next mipmap level. In order to overcome this problem we used the method described by Crytek in their paper, as out-of-core mipmap generation. Mipmap level 0 is stored as separate tiles on disk. When you want to generate a tile for mipmap level 1, all you need is the tiles around the same location from the previous mipmap level. We used a 2×2 box filter for mipmap generation, so only 4 tiles from the previous mipmaps are required to be in memory. Note here that mipmap tiles, in this case, can have a different size than the virtual texture tiles.
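A minimal sketch of building one tile of the next mipmap level from the four tiles below it with a 2×2 box filter. Tiles here are single-channel lists of rows to keep the example short, and load_tile is a hypothetical callback standing in for reading a tile from disk. This illustrates the idea, not the code from Crytek's paper or from the demo.

```python
def downsample_2x2(tile):
    """Halve a square tile with a 2x2 box filter."""
    n = len(tile) // 2
    return [
        [
            (tile[2 * y][2 * x] + tile[2 * y][2 * x + 1]
             + tile[2 * y + 1][2 * x] + tile[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(n)
        ]
        for y in range(n)
    ]

def build_parent_tile(load_tile, level, tx, ty):
    """Build tile (tx, ty) of `level` from the 2x2 block of tiles at `level - 1`."""
    quads = [
        [downsample_2x2(load_tile(level - 1, 2 * tx + dx, 2 * ty + dy)) for dx in (0, 1)]
        for dy in (0, 1)
    ]
    half = len(quads[0][0])
    parent = []
    for dy in (0, 1):
        for row in range(half):
            # Left quadrant row followed by right quadrant row.
            parent.append(quads[dy][0][row] + quads[dy][1][row])
    return parent

def fake_loader(level, x, y):
    """Hypothetical loader: a 2x2 tile filled with a value that encodes its position."""
    value = float(x + 10 * y)
    return [[value, value], [value, value]]

parent_tile = build_parent_tile(fake_loader, level=1, tx=0, ty=0)
```

Only the four source tiles are ever needed at once, regardless of how large the whole mipmap level is.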
The main advantage of the second option is the ability to modify the virtual texture at run-time. Whenever something changes, we can just invalidate the corresponding tiles in the cache and let the worker thread regenerate the data. The main problem is the generation of mipmaps. Since we don’t store anything on disk, we have to bake all mipmap levels, in contrast with option 1 where the baking happens only for mipmap 0. We haven’t tried this yet, but as mentioned above the goal is to integrate the algorithm into our terrain renderer and editor, so interactive editing of textures is needed. Another advantage of this approach is that virtual texture size can be a function of available processing power and not available disk space. Depending on how fast we can bake tiles, we can adjust the size of the indirection table, effectively altering the size of the virtual texture. E.g. with a tile size of 256 pixels, we can get virtual textures with dimensions up to 1M x 1M (assuming the maximum texture size supported by the GPU is 4096, since 4096 × 256 = 1,048,576).
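Finding which tiles to invalidate after an edit could look like the sketch below. It assumes the edit is an axis-aligned rectangle given in level 0 texels, that tiles are (level, x, y) ids with level 0 the most detailed, and it ignores tile borders and filter footprints, which would widen the affected area slightly. The function name is hypothetical.

```python
def affected_tiles(x0, y0, x1, y1, tile_size, num_levels):
    """Tiles on every mip level touched by the texel rectangle [x0, x1) x [y0, y1)."""
    tiles = []
    for level in range(num_levels):
        # Number of level 0 texels covered by one tile side at this level.
        span = tile_size << level
        for ty in range(y0 // span, (y1 - 1) // span + 1):
            for tx in range(x0 // span, (x1 - 1) // span + 1):
                tiles.append((level, tx, ty))
    return tiles

# An edit that straddles the boundary between the first two level 0 tiles.
to_invalidate = affected_tiles(200, 0, 300, 10, tile_size=256, num_levels=3)
```

Each tile in the returned list would be dropped from the cache and requested again from the worker thread.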
Things to try
- Use a second, regular, texture for all mipmap levels below a threshold. E.g. for a virtual texture larger than 4096×4096 we can store mipmap levels below 4096×4096 in a regular texture. This way we can remove the tiles which correspond to those mipmaps from the tile cache, making up more space for higher detail mipmaps. Potential problems include increase in the required texture memory for keeping this extra texture, and an increase in the number of required texture units. Handling of those mipmaps inside the shader should be easy and should require only a TEX and a lerp.
- Currently the bottlenecks in our demo are step 1 (scanning the PBO data in order to collect required tiles) and step 3 (indirection texture updates). An alternative method for finding the tiles required at each frame is presented in paper 5 from above. This is definitely something we should look into.
- Real-time DXT compression of tiles. We already have the required border around each tile, so we have to use it in order to save some texture memory.
There may be some things worth mentioning that I’m forgetting right now. The subject is big and it can’t fit in a single post. In order to keep it at a reasonable size, I’ll stop here. There will probably be at least one more post on the subject once we manage to test more things. Any questions/suggestions/comments are always welcome.
JD
Tags: rendering, virtual_textures
May 22nd, 2009 at 9:32 pm
For my SVT demo, I used an algorithm that could bake mipmap levels on demand, which is relevant to dynamic updating even if you’re mostly prebaking.
This is mostly interesting for decals. The idea is that you just mipmap the decal, and then apply the mipmaps of the decal to the mipmapped underlying image, rather than recompute the mipmap. This is obviously inaccurate (features that get covered up by the decal at high-res mipmap levels will bleed through at lower mipmap levels as the decal’s alpha drops), but if you pretend you’re using trilinear, the goal of the mipmaps is to minimize the visible transition between mipmap levels, not to produce (say) the RMSE-optimized idealized filtered version. So if in general mipmapping is good enough, it shouldn’t be surprising that mipmapping of the decal independently is good enough (if you imagine what happens during an LOD transition).
One subtlety: if you mipmap the decal with a box filter, then you have an alignment issue: if your decal appears at (0,0) on the texture, it will align with the underlying mipmap and mipmap “identically” with how it would, but if it appears at (1,1) in the texture, to apply the mipmap to the next level up it now falls at a subpixel position, so you either need to snap, or bilerp it. Bilerping will introduce extra blurring, and snapping will introduce a visible movement as you transition LODs.
My solution in my demo was to produce 4 mipmap variants from the base, each shifted by (0,0), (1,0), (0,1), and (1,1). This requires 4 times as much storage, but the mipmap is 1/4 the size. If you wanted to do it “right”, you would need 16 variants of the next level up, 64 of the next level up, etc. This would require just as much storage per mipmap level as the original image. Instead, I just computed 4 shifts at each level total, allowing some translation as you went up mipmap levels (but only half as much as the naive solution would produce).
An alternative would be to use a mipmap that’s not biased to any particular shift; you could do this with just some different filter that’s more friendly, or you can just use the *average* of all the mipmaps that would be computed in the above process. This would probably be what I would do if I wanted to do this for real on the CPU. (If you can GPU accelerate it, then maybe the bilerp of the naive decal mipmaps is good enough.)
GPU decal application to baked mipmaps has a drawback–once your decal mip level reaches 1×1, if it has a non-zero alpha, you actually want to continue generating higher level mipmaps (e.g. you have a 1×1 decal with alpha=0.6, then you want a next-higher mipmap that is a 1×1 decal with alpha=0.15). Hardware designers missed this orthogonality.
May 24th, 2009 at 11:42 am
Hello and welcome.
First of all I would like to thank you for the excellent demo and presentation.
At first I didn’t fully understand what you were suggesting, but after thinking about it a bit more I think I got it: dynamic placement of decals on a prebaked virtual texture. The reason I didn’t get it the first time was that we haven’t come to the “game” side of things yet. Let me explain. Currently the virtual texture is generated by painting several different materials on it (see another post on the subject) inside an editor, and by manipulating decals and roads (by translating, rotating and scaling them). In those cases, a complete invalidation of all the affected tiles from all mipmap levels is required, in contrast with the game side of things, where *static* decals are generated whenever the player shoots something. For those cases, your idea sounds really good, but I guess we’ll have to start implementing it in order to fully understand the alignment problem you describe. Either way, thanks a lot for the idea.
On the subject of editing the virtual texture, we’ve mainly focused our attention on the speed of the software rasterizer (since the last post). After a lot of optimizations and by taking into account that we are generating a texture for a terrain (top-down axis aligned quad(s)), the speed has been improved a lot, and the average time for baking the diffuse color and normalmap of an arbitrary tile with 12 terrain layers and about 200 decals (quads) and roads (arbitrary quadrilaterals) is about 5 msec. This number doesn’t say anything on its own, but compared to the 18 msec for 3 layers (without normalmaps and decals) we mentioned in the other post, it’s a nice improvement. Unfortunately, I don’t know how that compares to a GPU implementation because we haven’t tried it yet.
Our main focus lately is on ways to improve the visual quality of the final output. The main problem when you use a single texture for terrain rendering is the excessive stretching on steep slopes. We implemented the indirection mapping technique, but it didn’t seem practical, especially for the terrain sizes we’ve tested (2048 and up). The result might have been a bit better (increased resolution of the texture on steep slopes), but this didn’t help a lot with stretching since it actually requires a different UV parametrization. So we decided to use extra geometry for a selected number of terrain layers with triplanar mapping. The results are a lot better than the previous method at the expense of extra rendering passes (which, unfortunately, is in contrast with the idea of virtual textures). We’ll consider the technique again in the future, but for now we’ll stick with the extra pass.
One idea we had (but didn’t have time to test) is the usage of a 2D texture array for the tile cache. This way, the tiles will be independent of each other, minimizing the artifacts when using trilinear or anisotropic filtering. You’ll still get some artifacts when the filter kernel is larger than the border used, but the data that will be fetched will actually be relevant to the tile (no worries about sampling a rock texture from a neighboring tile when you are expecting something grass-like). Have you tried something similar? Any problems we have to keep in mind when we try it?
Thanks again for dropping by.