<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for Lightning Engine</title>
	<atom:link href="http://blog.makingartstudios.com/?feed=comments-rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.makingartstudios.com</link>
	<description></description>
	<lastBuildDate>Sat, 27 Aug 2011 10:09:58 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>Comment on Useless Snippet #3: AABB/Frustum test by JD</title>
		<link>http://blog.makingartstudios.com/?p=155&#038;cpage=1#comment-88</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sat, 27 Aug 2011 10:09:58 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=155#comment-88</guid>
		<description><![CDATA[Forgive my misunderstanding. There is no need for a conditional branch in the inner loop because you said:
&lt;blockquote&gt;I do this by a simple bitfield for the 6 planes (e.g. 0x001010 would mean test the third and fifth frustum plane) &lt;strong&gt;which is used to index a table saying how many and which planes to test&lt;/strong&gt;.&lt;/blockquote&gt;

That would be just fine for the SSE version. I&#039;ll definitely try it and post the results as soon as I can.]]></description>
		<content:encoded><![CDATA[<p>Forgive my misunderstanding. There is no need for a conditional branch in the inner loop because you said:</p>
<blockquote><p>I do this by a simple bitfield for the 6 planes (e.g. 0x001010 would mean test the third and fifth frustum plane) <strong>which is used to index a table saying how many and which planes to test</strong>.</p></blockquote>
<p>That would be just fine for the SSE version. I&#8217;ll definitely try it and post the results as soon as I can.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #3: AABB/Frustum test by JD</title>
		<link>http://blog.makingartstudios.com/?p=155&#038;cpage=1#comment-87</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sat, 27 Aug 2011 10:05:35 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=155#comment-87</guid>
		<description><![CDATA[Hi and thanks for the comment. 

I&#039;m aware of the algorithm you mention. It was probably my fault: I should have mentioned the potential applications of the above algorithm. The snippets presented in the post are better suited to situations where you have a lot of AABBs to test at once, e.g. all the objects in a leaf of a hierarchy. The algorithm you describe is still very useful for reaching the leaves, and it&#039;s not much trouble to keep the bitfields around while traversing the hierarchy and test only the planes needed.

On the other hand, integrating that idea into the SSE version of the above snippet wouldn&#039;t do any good, imho. I haven&#039;t tested it, but I assume that inserting a hard-to-predict conditional branch in the inner loop of the last snippet would hurt performance. In the SSE case, which processes 4 AABBs at once, you would have to OR the bitfields of the 4 AABBs into one and check only those planes that are relevant to at least one of the four AABBs.

I might check it in the future just for the fun of it.

Thanks for stopping by.]]></description>
		<content:encoded><![CDATA[<p>Hi and thanks for the comment. </p>
<p>I&#8217;m aware of the algorithm you mention. It was probably my fault: I should have mentioned the potential applications of the above algorithm. The snippets presented in the post are better suited to situations where you have a lot of AABBs to test at once, e.g. all the objects in a leaf of a hierarchy. The algorithm you describe is still very useful for reaching the leaves, and it&#8217;s not much trouble to keep the bitfields around while traversing the hierarchy and test only the planes needed.</p>
<p>On the other hand, integrating that idea into the SSE version of the above snippet wouldn&#8217;t do any good, imho. I haven&#8217;t tested it, but I assume that inserting a hard-to-predict conditional branch in the inner loop of the last snippet would hurt performance. In the SSE case, which processes 4 AABBs at once, you would have to OR the bitfields of the 4 AABBs into one and check only those planes that are relevant to at least one of the four AABBs.</p>
<p>I might check it in the future just for the fun of it.</p>
<p>Thanks for stopping by.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #3: AABB/Frustum test by Eric Haines</title>
		<link>http://blog.makingartstudios.com/?p=155&#038;cpage=1#comment-86</link>
		<dc:creator>Eric Haines</dc:creator>
		<pubDate>Fri, 26 Aug 2011 19:44:33 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=155#comment-86</guid>
		<description><![CDATA[One quick comment: there&#039;s a nice optimization if you&#039;re using AABB/frustum testing in a hierarchy, which is to pass on the knowledge of the parents to the children. For example, say you find your parent AABB is inside 4 of 6 frustum planes, overlapping the remaining 2. Typically this overlap condition means you test the child AABBs inside the parent. You then signal the children to test only these two planes (for fully inside and fully outside), since you know the children are fully inside the other 4. I do this by a simple bitfield for the 6 planes (e.g. 0x001010 would mean test the third and fifth frustum plane) which is used to index a table saying how many and which planes to test.

Given the speediness of SSE testing, this optimization may not be worth the trouble, but I thought I&#039;d mention it. This idea is from Bishop, L., D. Eberly, T. Whitted, M. Finch, and M. Shantz, &quot;Designing a PC Game Engine,&quot; IEEE Computer Graphics and Applications, pp. 46-53, Jan./Feb. 1998.]]></description>
		<content:encoded><![CDATA[<p>One quick comment: there&#8217;s a nice optimization if you&#8217;re using AABB/frustum testing in a hierarchy, which is to pass on the knowledge of the parents to the children. For example, say you find your parent AABB is inside 4 of 6 frustum planes, overlapping the remaining 2. Typically this overlap condition means you test the child AABBs inside the parent. You then signal the children to test only these two planes (for fully inside and fully outside), since you know the children are fully inside the other 4. I do this by a simple bitfield for the 6 planes (e.g. 0&#215;001010 would mean test the third and fifth frustum plane) which is used to index a table saying how many and which planes to test.</p>
<p>Given the speediness of SSE testing, this optimization may not be worth the trouble, but I thought I&#8217;d mention it. This idea is from Bishop, L., D. Eberly, T. Whitted, M. Finch, and M. Shantz, &#8220;Designing a PC Game Engine,&#8221; IEEE Computer Graphics and Applications, pp. 46-53, Jan./Feb. 1998.</p>
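<p>A minimal sketch of the plane-mask idea described above, not the implementation from the cited article. The names (<code>Plane</code>, <code>AABB</code>, <code>PlaneList</code>, <code>buildPlaneTable</code>, <code>testAABB</code>) are hypothetical, and two conventions are assumed purely for illustration: bit i of the mask stands for plane i, and a point is inside a plane when n.p + d &gt;= 0. The 64-entry table maps each mask to the number and indices of the planes to test, so the inner loop only visits planes the parent straddled.</p>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Plane { float nx, ny, nz, d; };   // assumed: p is inside when n.p + d >= 0
struct AABB  { float lo[3], hi[3]; };

// For one 6-bit mask: how many planes to test, and which ones.
struct PlaneList { std::uint8_t count; std::uint8_t index[6]; };

// Builds the lookup table once; entry `mask` lists the set bits of `mask`.
std::array<PlaneList, 64> buildPlaneTable()
{
    std::array<PlaneList, 64> table{};   // zero-initialized
    for (unsigned mask = 0; mask < 64; ++mask)
        for (unsigned bit = 0; bit < 6; ++bit)
            if (mask & (1u << bit))
                table[mask].index[table[mask].count++] = static_cast<std::uint8_t>(bit);
    return table;
}

enum class Cull { Outside, Intersecting, Inside };

// Tests `box` only against the planes selected by `inMask`.
// `outMask` receives the planes the box straddles; pass it on to the children.
Cull testAABB(const AABB& box, const std::array<Plane, 6>& frustum,
              const std::array<PlaneList, 64>& table,
              std::uint8_t inMask, std::uint8_t& outMask)
{
    assert(inMask < 64);
    outMask = 0;
    const PlaneList& list = table[inMask];
    for (unsigned k = 0; k < list.count; ++k) {
        const unsigned i = list.index[k];
        const Plane& p = frustum[i];
        const float n[3] = { p.nx, p.ny, p.nz };
        // Signed distances of the box corners farthest along and against the normal.
        float maxDist = p.d, minDist = p.d;
        for (int a = 0; a < 3; ++a) {
            maxDist += n[a] * (n[a] >= 0 ? box.hi[a] : box.lo[a]);
            minDist += n[a] * (n[a] >= 0 ? box.lo[a] : box.hi[a]);
        }
        if (maxDist < 0) return Cull::Outside;   // every corner is behind this plane
        if (minDist < 0) outMask |= static_cast<std::uint8_t>(1u << i);
    }
    return outMask ? Cull::Intersecting : Cull::Inside;
}
```

<p>The root of the hierarchy would be tested with a mask of 63 (all six planes); each child is then tested with the <code>outMask</code> its parent produced.</p>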
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #2: AABB from a point list by JD</title>
		<link>http://blog.makingartstudios.com/?p=141&#038;cpage=1#comment-81</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Wed, 27 Jul 2011 17:37:35 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=141#comment-81</guid>
		<description><![CDATA[To be honest, I totally forgot about MINPS/MAXPS. I&#039;ll definitely try them and update the post later.

As for the 4 parallel min/max, I&#039;ve thought of that, but I think it&#039;s applicable only when the app is 64-bit, due to the greater number of XMM regs (16 if I&#039;m not mistaken). Doing it that way requires at least 8 regs for the 4 min/max pairs, so I&#039;m out of registers when compiled as 32-bit code. The goal is to minimize dependencies in the inner loop. With MINPS/MAXPS I think I can do that for 2 min/max pairs at the expense of using unaligned loads. I&#039;ll try this idea too.

Thanks for the feedback.]]></description>
		<content:encoded><![CDATA[<p>To be honest, I totally forgot about MINPS/MAXPS. I&#8217;ll definitely try them and update the post later.</p>
<p>As for the 4 parallel min/max, I&#8217;ve thought of that, but I think it&#8217;s applicable only when the app is 64-bit, due to the greater number of XMM regs (16 if I&#8217;m not mistaken). Doing it that way requires at least 8 regs for the 4 min/max pairs, so I&#8217;m out of registers when compiled as 32-bit code. The goal is to minimize dependencies in the inner loop. With MINPS/MAXPS I think I can do that for 2 min/max pairs at the expense of using unaligned loads. I&#8217;ll try this idea too.</p>
<p>Thanks for the feedback.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #2: AABB from a point list by Arseny Kapoulkine</title>
		<link>http://blog.makingartstudios.com/?p=141&#038;cpage=1#comment-80</link>
		<dc:creator>Arseny Kapoulkine</dc:creator>
		<pubDate>Wed, 27 Jul 2011 16:54:53 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=141#comment-80</guid>
		<description><![CDATA[There are _mm_min_ps and _mm_max_ps intrinsics, which get compiled to MINPS/MAXPS instructions. Using these should probably be faster than your current approach.

Another thing which may help (if there are enough registers) is to do a parallel min/max in chunks of 4 vertices - i.e. compute 4 min vectors and 4 max vectors, each for (i % 4) == k, k=0..3, and then reduce 4 results to 1.]]></description>
		<content:encoded><![CDATA[<p>There are _mm_min_ps and _mm_max_ps intrinsics, which get compiled to MINPS/MAXPS instructions. Using these should probably be faster than your current approach.</p>
<p>Another thing which may help (if there are enough registers) is to do a parallel min/max in chunks of 4 vertices &#8211; i.e. compute 4 min vectors and 4 max vectors, each for (i % 4) == k, k=0..3, and then reduce 4 results to 1.</p>
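<p>A rough sketch of the suggestion above, not the code from the post. It assumes a hypothetical padded point type <code>Vec4f</code> (four floats, w ignored) and uses two independent min/max chains rather than four, which keeps the register count at four XMM accumulators. <code>computeAABB</code> is an illustrative name; only <code>_mm_loadu_ps</code>, <code>_mm_storeu_ps</code>, <code>_mm_min_ps</code> and <code>_mm_max_ps</code> are real SSE intrinsics.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>   // SSE: _mm_min_ps, _mm_max_ps, _mm_loadu_ps, _mm_storeu_ps

struct Vec4f { float x, y, z, w; };   // hypothetical padded point; w is carried along but unused

// Computes the component-wise min and max of `count` points.
void computeAABB(const Vec4f* points, std::size_t count, Vec4f& outMin, Vec4f& outMax)
{
    assert(points != nullptr && count > 0);
    const float* data = reinterpret_cast<const float*>(points);

    // Seed both chains with the first point so no sentinel values are needed.
    __m128 lo0 = _mm_loadu_ps(data), hi0 = lo0;
    __m128 lo1 = lo0, hi1 = lo0;

    std::size_t i = 1;
    // Two points per iteration; the two chains do not depend on each other.
    for (; i + 1 < count; i += 2) {
        const __m128 a = _mm_loadu_ps(data + 4 * i);
        const __m128 b = _mm_loadu_ps(data + 4 * (i + 1));
        lo0 = _mm_min_ps(lo0, a);  hi0 = _mm_max_ps(hi0, a);
        lo1 = _mm_min_ps(lo1, b);  hi1 = _mm_max_ps(hi1, b);
    }
    // Leftover point when the remaining count is odd.
    if (i < count) {
        const __m128 a = _mm_loadu_ps(data + 4 * i);
        lo0 = _mm_min_ps(lo0, a);  hi0 = _mm_max_ps(hi0, a);
    }

    // Reduce the two partial results to one.
    _mm_storeu_ps(&outMin.x, _mm_min_ps(lo0, lo1));
    _mm_storeu_ps(&outMax.x, _mm_max_ps(hi0, hi1));
}
```

<p>With tightly packed 3-float points the loads would overlap the next point and the last load would need special handling, so the padded layout is a simplification.</p>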
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #1: Transform Vec3f by Matrix4x4f by Sean Barrett</title>
		<link>http://blog.makingartstudios.com/?p=119&#038;cpage=1#comment-79</link>
		<dc:creator>Sean Barrett</dc:creator>
		<pubDate>Mon, 25 Jul 2011 23:53:50 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=119#comment-79</guid>
		<description><![CDATA[Thanks for investigating further, that makes more sense!

RDTSC should actually be fine on multicore systems (it should stay in sync), but it&#039;s still kind of problematic. In particular RDTSC is bad for measuring wall-clock time since it may vary over time e.g. due to SpeedStep. Also, I&#039;m not sure what effect Turbo Boost might have on RDTSC -- it might not affect RDTSC rate measurements, which would mean the processor is actually processing more cycles than RDTSC reports. For best results at trying to measure cycles you probably want to use RDTSC and turn off Turbo Boost or fully load all your cores. But with all these varying clock rates and hyperthreads and things, it&#039;s becoming less meaningful to talk about measuring actual clock cycles, and you might want to just stick to wall-clock microseconds on a specific processor, or such.

For measuring moderately-small things with wall-clock time, QueryPerformanceCounter is preferred. It&#039;s supposed to be based on the hardware bus clock rate, which won&#039;t ever vary. (However, the Frequency isn&#039;t 100.0000% accurate, so you can&#039;t actually use it for long-term wall-clock times.)]]></description>
		<content:encoded><![CDATA[<p>Thanks for investigating further, that makes more sense!</p>
<p>RDTSC should actually be fine on multicore systems (it should stay in sync), but it&#8217;s still kind of problematic. In particular RDTSC is bad for measuring wall-clock time since it may vary over time e.g. due to SpeedStep. Also, I&#8217;m not sure what effect Turbo Boost might have on RDTSC &#8212; it might not affect RDTSC rate measurements, which would mean the processor is actually processing more cycles than RDTSC reports. For best results at trying to measure cycles you probably want to use RDTSC and turn off Turbo Boost or fully load all your cores. But with all these varying clock rates and hyperthreads and things, it&#8217;s becoming less meaningful to talk about measuring actual clock cycles, and you might want to just stick to wall-clock microseconds on a specific processor, or such.</p>
<p>For measuring moderately-small things with wall-clock time, QueryPerformanceCounter is preferred. It&#8217;s supposed to be based on the hardware bus clock rate, which won&#8217;t ever vary. (However, the Frequency isn&#8217;t 100.0000% accurate, so you can&#8217;t actually use it for long-term wall-clock times.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #1: Transform Vec3f by Matrix4x4f by JD</title>
		<link>http://blog.makingartstudios.com/?p=119&#038;cpage=1#comment-78</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sun, 24 Jul 2011 15:59:32 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=119#comment-78</guid>
		<description><![CDATA[I think I found the mistake! I don&#039;t take into account the library&#039;s overhead (monitor-&gt;getCoreCounterState(0)), which is about 20k cycles. 20k cycles overhead for calculations taking less than 1k is a big thing.
Profiling using RDTSC gives the result you expected. The C version gets stable to 20 cycles/vertex for a batch of 128  vertices (22 cycles/vertex for 16 vertices). There is a strange drop in the SSE version from ~11 cycles/vertex for a batch of 256 vertices, to ~5 cycles/vertex for a batch with 512 vertices. Also, prefetching seems to harm performance for really small batch (which I think is expected). For large batches, prefetching gives 0.5 to 1 cycle/vertex speedup on average (I assume it can be tuned further by adjusting the offsets in the prefetch commands). 

I know that averages without stdev don&#039;t say much, so I&#039;ll try to calculate it next time.

The numbers have been calculated by running the following snippet 100000 times, sorting the results in ascending order, and calculating the mean from the middle 60000 iterations:

&lt;pre lang=&quot;cpp&quot; colla=&quot;+&quot;&gt;
InitializeArrays();
float cyclesPerVertex = Profile();
Cleanup();
&lt;/pre&gt;

RDTSC isn&#039;t the best way to profile code on multi-core systems, from what I&#039;ve read. I was wondering if there&#039;s an alternative way to count cycles with minimal overhead. Do you have any ideas?]]></description>
		<content:encoded><![CDATA[<p>I think I found the mistake! I don&#8217;t take into account the library&#8217;s overhead (monitor-&gt;getCoreCounterState(0)), which is about 20k cycles. 20k cycles overhead for calculations taking less than 1k is a big thing.<br />
Profiling using RDTSC gives the result you expected. The C version gets stable to 20 cycles/vertex for a batch of 128  vertices (22 cycles/vertex for 16 vertices). There is a strange drop in the SSE version from ~11 cycles/vertex for a batch of 256 vertices, to ~5 cycles/vertex for a batch with 512 vertices. Also, prefetching seems to harm performance for really small batch (which I think is expected). For large batches, prefetching gives 0.5 to 1 cycle/vertex speedup on average (I assume it can be tuned further by adjusting the offsets in the prefetch commands). </p>
<p>I know that averages without stdev don&#8217;t say much, so I&#8217;ll try to calculate it next time.</p>
<p>The numbers have been calculated by running the following snippet 100000 times, sorting the results in ascending order, and calculating the mean from the middle 60000 iterations:</p>

<div class="wp_codebox"><table><tr id="p1191"><td class="code" id="p119code1"><pre class="cpp" style="font-family:monospace;">InitializeArrays<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #0000ff;">float</span> cyclesPerVertex <span style="color: #000080;">=</span> Profile<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
Cleanup<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span></pre></td></tr></table></div>

<p>RDTSC isn&#8217;t the best way to profile code on multi-core systems, from what I&#8217;ve read. I was wondering if there&#8217;s an alternative way to count cycles with minimal overhead. Do you have any ideas?</p>
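<p>The averaging procedure described above (sort all runs, then take the mean of the middle portion) could be sketched as follows. <code>trimmedMean</code> is a hypothetical helper, not part of the original benchmark; for the numbers quoted it would be called with 100000 samples and <code>keep</code> set to 60000.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the mean of the middle `keep` samples after sorting in ascending order.
// The vector is taken by value so the caller's data is left untouched.
double trimmedMean(std::vector<double> samples, std::size_t keep)
{
    assert(keep > 0 && keep <= samples.size());
    std::sort(samples.begin(), samples.end());

    // Drop roughly the same number of samples from each end.
    const std::size_t first = (samples.size() - keep) / 2;
    const auto begin = samples.begin() + static_cast<std::ptrdiff_t>(first);
    const auto end   = begin + static_cast<std::ptrdiff_t>(keep);

    return std::accumulate(begin, end, 0.0) / static_cast<double>(keep);
}
```

<p>Discarding both tails removes the slowest runs (interrupts, cache-cold first iterations) as well as any suspiciously fast outliers, which is why the result is steadier than a plain mean.</p>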
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #1: Transform Vec3f by Matrix4x4f by JD</title>
		<link>http://blog.makingartstudios.com/?p=119&#038;cpage=1#comment-77</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sun, 24 Jul 2011 07:10:13 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=119#comment-77</guid>
		<description><![CDATA[Hi Sean,
&lt;blockquote&gt;How much of the speedup is really due to SSE and how much is due to the prefetching?&lt;/blockquote&gt;
I just reran the SSE tests without prefetching. The difference is 1 - 2 cycles depending on batch size. Prefetching seems to be useful for batch sizes larger than 1024 vertices. Smaller batches give identical results with or without prefetching.

&lt;blockquote&gt;What exactly are you measuring?&lt;/blockquote&gt;
Here is my profile function.
&lt;pre lang=&quot;cpp&quot; colla=&quot;+&quot;&gt;
void Profile_TransformVec3Pos(_Vector3f* __restrict src, _Vector4f* __restrict dst, unsigned int numVertices, float* __restrict matrix)
{
	SetProcessAffinityMask(GetCurrentProcess(), 0x01);
	SetThreadAffinityMask(GetCurrentThread(), 0x01);
	SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);

	PCM* monitor = PCM::getInstance();
	printf(&quot;\n&quot;);
	if(monitor-&gt;good())
	{
		PCM::CustomCoreEventDescription events[4];
		events[0].event_number = 0x24;
		events[0].umask_value = 0x01; // 
		events[1].event_number = 0x24;
		events[1].umask_value = 0x02; // 
		events[2].event_number = 0x10;
		events[2].umask_value = 0x10; // SSE fp packed uops
		events[3].event_number = 0x10;
		events[3].umask_value = 0x20; // SSE fp scalar uops

		monitor-&gt;resetPMU();
		if(monitor-&gt;program(PCM::CUSTOM_CORE_EVENTS, &amp;events[0]) != PCM::Success)
		{
			printf(&quot;Error while trying to program PCM\n&quot;);
		}
	}

	printf(&quot;Starting measurement...\n&quot;);
	CoreCounterState before_sstate, after_sstate;

	before_sstate = monitor-&gt;getCoreCounterState(0); 
	{
		TransformVec3Pos(src, dst, numVertices, matrix);
	}
	after_sstate = monitor-&gt;getCoreCounterState(0);

	unsigned __int64 totalCycles = getCycles(before_sstate, after_sstate);
	double cyclesPerVertex = (double)totalCycles / (double)numVertices;

	printf(&quot;IPC: %g\n&quot;, getIPC(before_sstate, after_sstate));
	printf(&quot;Cycles: %I64u\n&quot;, totalCycles);
	printf(&quot;Cycles/vertex: %.2f\n&quot;, (float)cyclesPerVertex);
	printf(&quot;Event 0: %I64u\n&quot;, getNumberOfCustomEvents(0, before_sstate, after_sstate));
	printf(&quot;Event 1: %I64u\n&quot;, getNumberOfCustomEvents(1, before_sstate, after_sstate));
	printf(&quot;Event 2: %I64u\n&quot;, getNumberOfCustomEvents(2, before_sstate, after_sstate));
	printf(&quot;Event 3: %I64u\n&quot;, getNumberOfCustomEvents(3, before_sstate, after_sstate));

	monitor-&gt;cleanup();
}
&lt;/pre&gt;

TransformVec3Pos is a macro which points to one of the two functions mentioned in the post. Do you see something strange in this code? I&#039;ve verified that MSVC doesn&#039;t inline either one of the functions. Any ideas on what might affect the results?

(&lt;del datetime=&quot;2011-07-24T07:49:52+00:00&quot;&gt;I hope code blocks appear correctly in comments&lt;/del&gt; Apparently it messed up the code a bit, but I&#039;ve corrected it)]]></description>
		<content:encoded><![CDATA[<p>Hi Sean,</p>
<blockquote><p>How much of the speedup is really due to SSE and how much is due to the prefetching?</p></blockquote>
<p>I just reran the SSE tests without prefetching. The difference is 1 &#8211; 2 cycles depending on batch size. Prefetching seems to be useful for batch sizes larger than 1024 vertices. Smaller batches give identical results with or without prefetching.</p>
<blockquote><p>What exactly are you measuring?</p></blockquote>
<p>Here is my profile function.</p>

<div class="wp_codebox"><table><tr id="p1192"><td class="code" id="p119code2"><pre class="cpp" style="font-family:monospace;"><span style="color: #0000ff;">void</span> Profile_TransformVec3Pos<span style="color: #008000;">&#40;</span>_Vector3f<span style="color: #000040;">*</span> __restrict src, _Vector4f<span style="color: #000040;">*</span> __restrict dst, <span style="color: #0000ff;">unsigned</span> <span style="color: #0000ff;">int</span> numVertices, <span style="color: #0000ff;">float</span><span style="color: #000040;">*</span> __restrict matrix<span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	SetProcessAffinityMask<span style="color: #008000;">&#40;</span>GetCurrentProcess<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>, <span style="color: #208080;">0x01</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	SetThreadAffinityMask<span style="color: #008000;">&#40;</span>GetCurrentThread<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>, <span style="color: #208080;">0x01</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	SetThreadPriority<span style="color: #008000;">&#40;</span>GetCurrentThread<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span>, THREAD_PRIORITY_HIGHEST<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
	PCM<span style="color: #000040;">*</span> monitor <span style="color: #000080;">=</span> PCM<span style="color: #008080;">::</span><span style="color: #007788;">getInstance</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>monitor<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>good<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
	<span style="color: #008000;">&#123;</span>
		PCM<span style="color: #008080;">::</span><span style="color: #007788;">CustomCoreEventDescription</span> events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">4</span><span style="color: #008000;">&#93;</span><span style="color: #008080;">;</span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">event_number</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x24</span><span style="color: #008080;">;</span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">umask_value</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x01</span><span style="color: #008080;">;</span> <span style="color: #666666;">// </span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">event_number</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x24</span><span style="color: #008080;">;</span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">umask_value</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x02</span><span style="color: #008080;">;</span> <span style="color: #666666;">// </span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">event_number</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x10</span><span style="color: #008080;">;</span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">umask_value</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x10</span><span style="color: #008080;">;</span> <span style="color: #666666;">// SSE fp packed uops</span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">event_number</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x10</span><span style="color: #008080;">;</span>
		events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #008000;">&#93;</span>.<span style="color: #007788;">umask_value</span> <span style="color: #000080;">=</span> <span style="color: #208080;">0x20</span><span style="color: #008080;">;</span> <span style="color: #666666;">// SSE fp scalar uops</span>
&nbsp;
		monitor<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>resetPMU<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		<span style="color: #0000ff;">if</span><span style="color: #008000;">&#40;</span>monitor<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>program<span style="color: #008000;">&#40;</span>PCM<span style="color: #008080;">::</span><span style="color: #007788;">CUSTOM_CORE_EVENTS</span>, <span style="color: #000040;">&amp;</span>events<span style="color: #008000;">&#91;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span> <span style="color: #000040;">!</span><span style="color: #000080;">=</span> PCM<span style="color: #008080;">::</span><span style="color: #007788;">Success</span><span style="color: #008000;">&#41;</span>
		<span style="color: #008000;">&#123;</span>
			<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Error while trying to program PCM<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
		<span style="color: #008000;">&#125;</span>
	<span style="color: #008000;">&#125;</span>
&nbsp;
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Starting measurement...<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	CoreCounterState before_sstate, after_sstate<span style="color: #008080;">;</span>
&nbsp;
	before_sstate <span style="color: #000080;">=</span> monitor<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>getCoreCounterState<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span> 
	<span style="color: #008000;">&#123;</span>
		TransformVec3Pos<span style="color: #008000;">&#40;</span>src, dst, numVertices, matrix<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #008000;">&#125;</span>
	after_sstate <span style="color: #000080;">=</span> monitor<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>getCoreCounterState<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
	<span style="color: #0000ff;">unsigned</span> __int64 totalCycles <span style="color: #000080;">=</span> getCycles<span style="color: #008000;">&#40;</span>before_sstate, after_sstate<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000ff;">double</span> cyclesPerVertex <span style="color: #000080;">=</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">double</span><span style="color: #008000;">&#41;</span>totalCycles <span style="color: #000040;">/</span> <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">double</span><span style="color: #008000;">&#41;</span>numVertices<span style="color: #008080;">;</span>
&nbsp;
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;IPC: %g<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, getIPC<span style="color: #008000;">&#40;</span>before_sstate, after_sstate<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Cycles: %I64u<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, totalCycles<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Cycles/vertex: %.2f<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, <span style="color: #008000;">&#40;</span><span style="color: #0000ff;">float</span><span style="color: #008000;">&#41;</span>cyclesPerVertex<span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Event 0: %I64u<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, getNumberOfCustomEvents<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">0</span>, before_sstate, after_sstate<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Event 1: %I64u<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, getNumberOfCustomEvents<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">1</span>, before_sstate, after_sstate<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Event 2: %I64u<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, getNumberOfCustomEvents<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">2</span>, before_sstate, after_sstate<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
	<span style="color: #0000dd;">printf</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">&quot;Event 3: %I64u<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>, getNumberOfCustomEvents<span style="color: #008000;">&#40;</span><span style="color: #0000dd;">3</span>, before_sstate, after_sstate<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
&nbsp;
	monitor<span style="color: #000040;">-</span><span style="color: #000080;">&gt;</span>cleanup<span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008080;">;</span>
<span style="color: #008000;">&#125;</span></pre></td></tr></table></div>

<p>TransformVec3Pos is a macro which points to one of the two functions mentioned in the post. Do you see something strange in this code? I&#8217;ve verified that MSVC doesn&#8217;t inline either one of the functions. Any ideas on what might affect the results?</p>
<p>(<del datetime="2011-07-24T07:49:52+00:00">I hope code blocks appear correctly in comments</del> Apparently it messed up the code a bit, but I&#8217;ve corrected it)</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Useless Snippet #1: Transform Vec3f by Matrix4x4f by Sean Barrett</title>
		<link>http://blog.makingartstudios.com/?p=119&#038;cpage=1#comment-76</link>
		<dc:creator>Sean Barrett</dc:creator>
		<pubDate>Sun, 24 Jul 2011 00:08:03 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=119#comment-76</guid>
		<description><![CDATA[How much of the speedup is really due to SSE and how much is due to the prefetching?

I&#039;m unclear why you would see such wide variation in timings with batch size on either code. Even a batch size of 128 should run the C version enough to minimize any non-iteration overhead, so it seems highly implausible that the C code should get 6 times faster as the batch size goes from 128 to 16384 (although it&#039;s possible that&#039;s the hardware prefetcher coming into play). This seems more like the kind of variation you see in Java benchmarks (which have huge startup overheads). What exactly are you measuring?]]></description>
		<content:encoded><![CDATA[<p>How much of the speedup is really due to SSE and how much is due to the prefetching?</p>
<p>I&#8217;m unclear why you would see such wide variation in timings with batch size on either code. Even a batch size of 128 should run the C version enough to minimize any non-iteration overhead, so it seems highly implausible that the C code should get 6 times faster as the batch size goes from 128 to 16384 (although it&#8217;s possible that&#8217;s the hardware prefetcher coming into play). This seems more like the kind of variation you see in Java benchmarks (which have huge startup overheads). What exactly are you measuring?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by JD</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-67</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Thu, 03 Feb 2011 08:26:38 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-67</guid>
		<description><![CDATA[It&#039;s not actually flickering. It&#039;s the various tabs coming up. For some reason the tab control is visible before it&#039;s populated, and that&#039;s why it looks like flickering. We&#039;ll keep it in mind and we&#039;ll try to fix it in the next release. 

Thanks for all the feedback. I hope the flickering does not prevent you from using the application. If you have any other problems while working with Memory Analyzer, we will be glad to help.]]></description>
		<content:encoded><![CDATA[<p>It&#8217;s not actually flickering. It&#8217;s the various tabs coming up. For some reason the tab control is visible before it&#8217;s populated, and that&#8217;s why it looks like flickering. We&#8217;ll keep it in mind and we&#8217;ll try to fix it in the next release. </p>
<p>Thanks for all the feedback. I hope the flickering does not prevent you from using the application. If you have any other problems while working with Memory Analyzer, we will be glad to help.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by NA</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-66</link>
		<dc:creator>NA</dc:creator>
		<pubDate>Thu, 03 Feb 2011 08:19:04 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-66</guid>
		<description><![CDATA[now it works, but it looks strange how the app is starting, heavy flickering....]]></description>
		<content:encoded><![CDATA[<p>now it works, but it looks strange how the app is starting, heavy flickering&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by JD</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-65</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Wed, 02 Feb 2011 18:51:20 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-65</guid>
		<description><![CDATA[Sorry for the misunderstanding. 

I&#039;ve managed to reproduce the issue with the version linked in the project&#039;s page (for some strange reason it doesn&#039;t seem to appear every time). The bug doesn&#039;t seem to be present in the current version of the code, so I hope it has been fixed (couldn&#039;t find a related issue in the bug tracker, so it&#039;s either been fixed accidentally or hasn&#039;t made it to the bug tracker). 

Please try the new version (&lt;a href=&quot;http://www.makingartstudios.com/memanalyzer/memanalyzerdemo_v1.1_build20110202.zip&quot; rel=&quot;nofollow&quot;&gt;direct link&lt;/a&gt;) and report back. Thanks.]]></description>
		<content:encoded><![CDATA[<p>Sorry for the misunderstanding. </p>
<p>I&#8217;ve managed to reproduce the issue with the version linked in the project&#8217;s page (for some strange reason it doesn&#8217;t seem to appear every time). The bug doesn&#8217;t seem to be present in the current version of the code, so I hope it has been fixed (couldn&#8217;t find a related issue in the bug tracker, so it&#8217;s either been fixed accidentally or hasn&#8217;t made it to the bug tracker). </p>
<p>Please try the new version (<a href="http://www.makingartstudios.com/memanalyzer/memanalyzerdemo_v1.1_build20110202.zip" rel="nofollow">direct link</a>) and report back. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by NA</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-64</link>
		<dc:creator>NA</dc:creator>
		<pubDate>Wed, 02 Feb 2011 15:35:43 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-64</guid>
		<description><![CDATA[I said, when I start the app and THEN quit immediately, the app crashes.

c:\windows\winsxs\x86_microsoft.vc90.mfc_1fc8b3b9a1e18e3b_9.0.30729.1_x-ww_405b0943\MFC90.DLL
c:\windows\winsxs\x86_microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30729.1_x-ww_6f74963e\MSVCR90.DLL
c:\windows\winsxs\wow64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.3790.3959_x-ww_5fa17f4e\COMCTL32.DLL
c:\windows\winsxs\x86_microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30729.1_x-ww_6f74963e\MSVCP90.DLL]]></description>
		<content:encoded><![CDATA[<p>I said, when I start the app and THEN quit immediately, the app crashes.</p>
<p>c:\windows\winsxs\x86_microsoft.vc90.mfc_1fc8b3b9a1e18e3b_9.0.30729.1_x-ww_405b0943\MFC90.DLL<br />
c:\windows\winsxs\x86_microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30729.1_x-ww_6f74963e\MSVCR90.DLL<br />
c:\windows\winsxs\wow64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.3790.3959_x-ww_5fa17f4e\COMCTL32.DLL<br />
c:\windows\winsxs\x86_microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30729.1_x-ww_6f74963e\MSVCP90.DLL</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by JD</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-63</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Wed, 02 Feb 2011 14:33:58 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-63</guid>
		<description><![CDATA[Fresh installation of WinXP 64-bit with SP2 on a VM. No VS installed. Installed either one of the redistributable packages (the one you linked and the one in the project&#039;s page). Memory Analyzer starts up just fine.

Since you&#039;ve run it from a debugger, can you please tell me which DLLs are being loaded? I&#039;m more interested in the MFC&#039;s DLL folders (which contain the version number). Maybe we can reproduce the bug, if we know the exact versions. 

Thanks again for the feedback.]]></description>
		<content:encoded><![CDATA[<p>Fresh installation of WinXP 64-bit with SP2 on a VM. No VS installed. Installed either one of the redistributable packages (the one you linked and the one in the project&#8217;s page). Memory Analyzer starts up just fine.</p>
<p>Since you&#8217;ve run it from a debugger, can you please tell me which DLLs are being loaded? I&#8217;m more interested in the MFC&#8217;s DLL folders (which contain the version number). Maybe we can reproduce the bug, if we know the exact versions. </p>
<p>Thanks again for the feedback.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by JD</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-62</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Tue, 01 Feb 2011 09:58:57 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-62</guid>
		<description><![CDATA[Thanks for the feedback.

The application uses MFC (obviously) so I can&#039;t think of a reason it should be a problem. I have the same configuration (VS2008 SP1) but on Win7 64-bit and it works. We&#039;ve tested it on WinXP 32-bit as well as Vista 32-bit (on virtual machines) and it worked correctly (with the appropriate redistributable package).

The only thing I can think of is that the application doesn&#039;t use the correct version of the MFC DLLs for some reason. We&#039;ll try to reproduce the problem on a virtual machine with WinXP 64-bit. I&#039;ll be back later with more info.

And I&#039;m sorry for the captcha problem. It&#039;s WP&#039;s (or maybe the plugin&#039;s) fault. We&#039;ll look into it too.]]></description>
		<content:encoded><![CDATA[<p>Thanks for the feedback.</p>
<p>The application uses MFC (obviously) so I can&#8217;t think of a reason it should be a problem. I have the same configuration (VS2008 SP1) but on Win7 64-bit and it works. We&#8217;ve tested it on WinXP 32-bit as well as Vista 32-bit (on virtual machines) and it worked correctly (with the appropriate redistributable package).</p>
<p>The only thing I can think of is that the application doesn&#8217;t use the correct version of the MFC DLLs for some reason. We&#8217;ll try to reproduce the problem on a virtual machine with WinXP 64-bit. I&#8217;ll be back later with more info.</p>
<p>And I&#8217;m sorry for the captcha problem. It&#8217;s WP&#8217;s (or maybe the plugin&#8217;s) fault. We&#8217;ll look into it too.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by NA</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-61</link>
		<dc:creator>NA</dc:creator>
		<pubDate>Tue, 01 Feb 2011 09:36:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-61</guid>
		<description><![CDATA[I use VS 2008 and SP1 and installed:
http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a5c84275-3b97-4ab7-a40d-3802b2af5fc2&amp;displaylang=en

mfc90.dll!CControlFrameWnd::PostNcDestroy()  Line 74	C++ ctlframe.cpp
mfc90.dll!CWnd::OnNcDestroy()  Line 863 + 0xa bytes	C++

I think the GUI is the problem.

BTW: please don&#039;t delete my comment when I input the wrong captcha code, that SUCKS!]]></description>
		<content:encoded><![CDATA[<p>I use VS 2008 and SP1 and installed:<br />
<a href="http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a5c84275-3b97-4ab7-a40d-3802b2af5fc2&#038;displaylang=en" rel="nofollow">http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a5c84275-3b97-4ab7-a40d-3802b2af5fc2&#038;displaylang=en</a></p>
<p>mfc90.dll!CControlFrameWnd::PostNcDestroy()  Line 74	C++ ctlframe.cpp<br />
mfc90.dll!CWnd::OnNcDestroy()  Line 863 + 0xa bytes	C++</p>
<p>I think the GUI is the problem.</p>
<p>BTW: please don&#8217;t delete my comment when I input the wrong captcha code, that SUCKS!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by JD</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-60</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Mon, 31 Jan 2011 11:46:58 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-60</guid>
		<description><![CDATA[Can you give some more info?
1) Which version(s) of Visual Studio do you have installed?
2) Have you installed the Microsoft Visual C++ Feature Pack redistributable package? There&#039;s a link in the download page.
3) Can you tell us the address at which the exception occurs? It should be somewhere in the crash dialog. Also if you can tell us the problematic module as well as the module&#039;s base address it will be easier for us to catch the bug.

It seems that this is a common bug which doesn&#039;t appear on our machines. It doesn&#039;t seem to be related to the Windows version (also happened in Win7 Enterprise with the previous version).

Thanks for reporting it. We will try to fix it ASAP.]]></description>
		<content:encoded><![CDATA[<p>Can you give some more info?<br />
1) Which version(s) of Visual Studio do you have installed?<br />
2) Have you installed the Microsoft Visual C++ Feature Pack redistributable package? There&#8217;s a link in the download page.<br />
3) Can you tell us the address at which the exception occurs? It should be somewhere in the crash dialog. Also if you can tell us the problematic module as well as the module&#8217;s base address it will be easier for us to catch the bug.</p>
<p>It seems that this is a common bug which doesn&#8217;t appear on our machines. It doesn&#8217;t seem to be related to the Windows version (also happened in Win7 Enterprise with the previous version).</p>
<p>Thanks for reporting it. We will try to fix it ASAP.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.1 by NA</title>
		<link>http://blog.makingartstudios.com/?p=88&#038;cpage=1#comment-59</link>
		<dc:creator>NA</dc:creator>
		<pubDate>Mon, 31 Jan 2011 11:23:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=88#comment-59</guid>
		<description><![CDATA[I simply started Memory Analyzer and quit immediately, and the app crashed.
OS: WinXP 64]]></description>
		<content:encoded><![CDATA[<p>I simply started Memory Analyzer and quit immediately, and the app crashed.<br />
OS: WinXP 64</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by JD</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-58</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Tue, 04 Jan 2011 07:36:01 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-58</guid>
		<description><![CDATA[Hello Thom and thanks for downloading Memory Analyzer. We are really sorry that it doesn&#039;t work for you. We&#039;ll try to upload a new version later today which hopefully will fix the problem you are facing. 

In the meantime, can you also post the module&#039;s base address or the offset of the exception address within the module, so that we can debug it and see what&#039;s going on? The address at which the exception occurs is, unfortunately, useful only if it&#039;s relative to the module&#039;s base. 

Thanks again for trying Memory Analyzer.]]></description>
		<content:encoded><![CDATA[<p>Hello Thom and thanks for downloading Memory Analyzer. We are really sorry that it doesn&#8217;t work for you. We&#8217;ll try to upload a new version later today which hopefully will fix the problem you are facing. </p>
<p>In the meantime, can you also post the module&#8217;s base address or the offset of the exception address within the module, so that we can debug it and see what&#8217;s going on? The address at which the exception occurs is, unfortunately, useful only if it&#8217;s relative to the module&#8217;s base. </p>
<p>Thanks again for trying Memory Analyzer.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by Thom</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-57</link>
		<dc:creator>Thom</dc:creator>
		<pubDate>Mon, 03 Jan 2011 20:24:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-57</guid>
		<description><![CDATA[This looks like a great tool that would really help me out. Unfortunately, it throws an exception immediately upon startup. (Unhandled exception at 0x003ae2e0 in MemoryAnalyzer.exe: 0xC0000005: Access violation reading location 0x00000000.)

This is on Win7 Enterprise (32-bit), and an application that was built using VS2008. 

Has the MemAnalyzer had any updates since 9/2009 that might help me get it running?

Thanks!
Thom]]></description>
		<content:encoded><![CDATA[<p>This looks like a great tool that would really help me out. Unfortunately, it throws an exception immediately upon startup. (Unhandled exception at 0x003ae2e0 in MemoryAnalyzer.exe: 0xC0000005: Access violation reading location 0x00000000.)</p>
<p>This is on Win7 Enterprise (32-bit), and an application that was built using VS2008. </p>
<p>Has the MemAnalyzer had any updates since 9/2009 that might help me get it running?</p>
<p>Thanks!<br />
Thom</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tracking Memory Allocations by Jelle van der Beek</title>
		<link>http://blog.makingartstudios.com/?p=16&#038;cpage=1#comment-56</link>
		<dc:creator>Jelle van der Beek</dc:creator>
		<pubDate>Mon, 13 Dec 2010 18:11:29 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=16#comment-56</guid>
		<description><![CDATA[Alright! Let me know if there&#039;s any news on your tools. Keep up the good work!]]></description>
		<content:encoded><![CDATA[<p>Alright! Let me know if there&#8217;s any news on your tools. Keep up the good work!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tracking Memory Allocations by JD</title>
		<link>http://blog.makingartstudios.com/?p=16&#038;cpage=1#comment-55</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Mon, 13 Dec 2010 15:58:02 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=16#comment-55</guid>
		<description><![CDATA[Sorry for the late approval but Akismet blocked the comment because of the 2 links.

That seems like a great tool. It&#039;s nice to see the progress since your Gamasutra articles. We are still working on and off on our tool but we haven&#039;t had the chance to post an update since our &lt;a href=&quot;http://blog.makingartstudios.com/?p=39&quot; rel=&quot;nofollow&quot;&gt;last post&lt;/a&gt; (one year ago).

We&#039;ll be watching your YouTube channel for updates on this tool. If there is a demo somewhere I&#039;ll also be glad to give it a try.

Thanks for stopping by.]]></description>
		<content:encoded><![CDATA[<p>Sorry for the late approval but Akismet blocked the comment because of the 2 links.</p>
<p>That seems like a great tool. It&#8217;s nice to see the progress since your Gamasutra articles. We are still working on and off on our tool but we haven&#8217;t had the chance to post an update since our <a href="http://blog.makingartstudios.com/?p=39" rel="nofollow">last post</a> (one year ago).</p>
<p>We&#8217;ll be watching your YouTube channel for updates on this tool. If there is a demo somewhere I&#8217;ll also be glad to give it a try.</p>
<p>Thanks for stopping by.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tracking Memory Allocations by Jelle van der Beek</title>
		<link>http://blog.makingartstudios.com/?p=16&#038;cpage=1#comment-54</link>
		<dc:creator>Jelle van der Beek</dc:creator>
		<pubDate>Mon, 13 Dec 2010 10:41:49 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=16#comment-54</guid>
		<description><![CDATA[Hey guys... don&#039;t know if this thread is still active, but I posted a video of my new memory tool on YouTube. Check it out!

http://www.youtube.com/watch?v=T4fEijatabk
http://www.youtube.com/watch?v=JgQ38n1qhRM&amp;feature=related]]></description>
		<content:encoded><![CDATA[<p>Hey guys&#8230; don&#8217;t know if this thread is still active, but I posted a video of my new memory tool on YouTube. Check it out!</p>
<p><a href="http://www.youtube.com/watch?v=T4fEijatabk" rel="nofollow">http://www.youtube.com/watch?v=T4fEijatabk</a><br />
<span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='640' height='390' src='http://www.youtube.com/embed/JgQ38n1qhRM?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by JD</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-36</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Tue, 29 Sep 2009 16:13:37 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-36</guid>
		<description><![CDATA[You can download the new version from here: &lt;a href=&quot;http://blog.makingartstudios.com/?page_id=72&quot; rel=&quot;nofollow&quot;&gt;Memory Analyzer v1.0 (build 20090929)&lt;/a&gt;. In the Setup view there is a new edit box along with a browse button for setting the working directory of the new process. If it&#039;s left blank the working directory is the same as before (Memory Analyzer&#039;s directory). Hope it works as you expected.]]></description>
		<content:encoded><![CDATA[<p>You can download the new version from here: <a href="http://blog.makingartstudios.com/?page_id=72" rel="nofollow">Memory Analyzer v1.0 (build 20090929)</a>. In the Setup view there is a new edit box along with a browse button for setting the working directory of the new process. If it&#8217;s left blank the working directory is the same as before (Memory Analyzer&#8217;s directory). Hope it works as you expected.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by JD</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-30</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Mon, 21 Sep 2009 08:19:06 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-30</guid>
		<description><![CDATA[Hi and sorry for the delayed reply. Yes it&#039;s possible but it&#039;s not implemented yet. We&#039;ll try adding the necessary controls and we will release a new version in the following days.]]></description>
		<content:encoded><![CDATA[<p>Hi and sorry for the delayed reply. Yes it&#8217;s possible but it&#8217;s not implemented yet. We&#8217;ll try adding the necessary controls and we will release a new version in the following days.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by Anders</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-29</link>
		<dc:creator>Anders</dc:creator>
		<pubDate>Thu, 17 Sep 2009 08:09:41 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-29</guid>
		<description><![CDATA[Is it possible to set the working directory for the selected application? I need to start my application from within a specific folder. Thanks.]]></description>
		<content:encoded><![CDATA[<p>Is it possible to set the working directory for the selected application? I need to start my application from within a specific folder. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by Caffeinated Guy</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-28</link>
		<dc:creator>Caffeinated Guy</dc:creator>
		<pubDate>Wed, 12 Aug 2009 05:10:59 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-28</guid>
		<description><![CDATA[- early stencil rejection only works on the window framebuffer, never on the offscreen FBO

ARG!! This explains why it works great on my 9400M laptop but makes it run even slower on my 6600 :( (from 24fps before stenciling to 19fps)

-Greg]]></description>
		<content:encoded><![CDATA[<p>- early stencil rejection only works on the window framebuffer, never on the offscreen FBO</p>
<p>ARG!! This explains why it works great on my 9400M laptop but makes it run even slower on my 6600 <img src='http://blog.makingartstudios.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' />  (from 24fps before stenciling to 19fps)</p>
<p>-Greg</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by JD</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-27</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Thu, 11 Jun 2009 15:21:44 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-27</guid>
		<description><![CDATA[We have missed a small bug in the code which prevented Memory Analyzer from reporting allocations when &quot;Hide internal allocations&quot; was checked. It&#039;s now fixed and you can download the new version from here: &lt;a href=&quot;http://blog.makingartstudios.com/?page_id=72&quot; rel=&quot;nofollow&quot;&gt;Memory Analyzer v1.0 (build 20090611)&lt;/a&gt;. The original post was updated to point to the new version.]]></description>
		<content:encoded><![CDATA[<p>We have missed a small bug in the code which prevented Memory Analyzer from reporting allocations when &#8220;Hide internal allocations&#8221; was checked. It&#8217;s now fixed and you can download the new version from here: <a href="http://blog.makingartstudios.com/?page_id=72" rel="nofollow">Memory Analyzer v1.0 (build 20090611)</a>. The original post was updated to point to the new version.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by JD</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-26</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sun, 07 Jun 2009 10:08:24 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-26</guid>
		<description><![CDATA[Hello Kevin and welcome,

First of all, thanks for the feedback. Some of the things you mentioned (like resizing all controls when the window was resized) were in our TODO list. We have fixed the scrolling issue and we have added mouse wheel support and larger block sizes (up to 256 bytes).

You can download the new version of Memory Analyzer from here: &lt;a href=&quot;http://blog.makingartstudios.com/?page_id=72&quot; rel=&quot;nofollow&quot;&gt;Memory Analyzer v1.0 (build 20090607)&lt;/a&gt;. I&#039;ve also updated the original post in order to point to the new archive.

The new version also includes some small additions (e.g. you can now manually specify which functions you would like to suppress through an external file) and bug fixes.

Regarding the single view of the entire virtual memory: unfortunately, there isn&#039;t a way to do that right now. We plan to add one more view for that, with color-coded, fixed-size blocks, and without callstack and source code info. If you have something specific in mind about how this view should look, please share it with us.

Thanks again for testing Memory Analyzer and for the feedback.]]></description>
		<content:encoded><![CDATA[<p>Hello Kevin and welcome,</p>
<p>First of all, thanks for the feedback. Some of the things you mentioned (like resizing all controls when the window was resized) were in our TODO list. We have fixed the scrolling issue and we have added mouse wheel support and larger block sizes (up to 256 bytes).</p>
<p>You can download the new version of Memory Analyzer from here: <a href="http://blog.makingartstudios.com/?page_id=72" rel="nofollow">Memory Analyzer v1.0 (build 20090607)</a>. I&#8217;ve also updated the original post in order to point to the new archive.</p>
<p>The new version also includes some small additions (e.g. you can now manually specify which functions you would like to suppress through an external file) and bug fixes.</p>
<p>Regarding the single view of the entire virtual memory: unfortunately, there isn&#8217;t a way to do that right now. We plan to add one more view for that, with color-coded, fixed-size blocks, and without callstack and source code info. If you have something specific in mind about how this view should look, please share it with us.</p>
<p>Thanks again for testing Memory Analyzer and for the feedback.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Memory Analyzer v1.0 demo by Kevin</title>
		<link>http://blog.makingartstudios.com/?p=39&#038;cpage=1#comment-25</link>
		<dc:creator>Kevin</dc:creator>
		<pubDate>Thu, 04 Jun 2009 19:57:47 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=39#comment-25</guid>
		<description><![CDATA[It would be really helpful if the Fragmentation view was resizable.  Even at 32 bytes, the scrolling could go on for quite a while.

Dragging the scrollbar doesn&#039;t seem to work.  If I drag it, the view often stays at the top of the selected heap.

If you have a really small heap that doesn&#039;t fill the view, the view doesn&#039;t completely repaint, so you see the old data from the previously viewed heap at the end.  Clicking on the scrollbar appears to force a refresh.

Note that I saw these just by using your tool with Excel (2003).

Is there some way to get a single view of the entire virtual memory (not specific to any given heap)?  This would be helpful in seeing the overall picture of the application&#039;s memory when I don&#039;t care which heap the allocation came from.]]></description>
		<content:encoded><![CDATA[<p>It would be really helpful if the Fragmentation view was resizable.  Even at 32 bytes, the scrolling could go on for quite a while.</p>
<p>Dragging the scrollbar doesn&#8217;t seem to work.  If I drag it, the view often stays at the top of the selected heap.</p>
<p>If you have a really small heap that doesn&#8217;t fill the view, the view doesn&#8217;t completely repaint, so you see the old data from the previously viewed heap at the end.  Clicking on the scrollbar appears to force a refresh.</p>
<p>Note that I saw these just by using your tool with Excel (2003).</p>
<p>Is there some way to get a single view of the entire virtual memory (not specific to any given heap)?  This would be helpful in seeing the overall picture of the application&#8217;s memory when I don&#8217;t care which heap the allocation came from.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Virtual Textures by JD</title>
		<link>http://blog.makingartstudios.com/?p=12&#038;cpage=1#comment-24</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sun, 24 May 2009 09:42:04 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=12#comment-24</guid>
		<description><![CDATA[Hello and welcome.

First of all I would like to thank you for the excellent demo and presentation.

At first I didn&#039;t fully understand what you were suggesting, but after thinking about it a bit more I think I got it. Dynamic placement of decals on a prebaked virtual texture. The reason I didn&#039;t get it the first time was because we haven&#039;t come to the &quot;game&quot; side of things yet. Let me explain. Currently the virtual texture is generated by painting several different materials on it (see &lt;a href=&quot;http://blog.makingartstudios.com/?p=13&quot; rel=&quot;nofollow&quot;&gt;another post on the subject&lt;/a&gt;) inside an editor, and by manipulating decals and roads (by translating, rotating and scaling them). In those cases, a complete invalidation of all the affected tiles from all mipmap levels is required, in contrast with the game side of things, where *static* decals are generated whenever the player shoots something. For those cases, your idea sounds really good, but I guess we&#039;ll have to start implementing it in order to fully understand the alignment problem you describe. Either way, thanks a lot for the idea. 

On the subject of editing the virtual texture, we&#039;ve mainly focused our attention on the speed of the software rasterizer (since the last post). After a lot of optimizations and by taking into account that we are generating a texture for a terrain (top-down axis aligned quad(s)), the speed has been improved a lot, and the average time for baking the diffuse color and normalmap of an arbitrary tile with 12 terrain layers and about 200 decals (quads) and roads (arbitrary quadrilaterals) is about 5 msec. This number doesn&#039;t say anything on its own, but compared to the 18 msec for 3 layers (without normalmaps and decals) we mentioned in the other post, it&#039;s a nice improvement. Unfortunately, I don&#039;t know how that compares to a GPU implementation because we haven&#039;t tried it yet. 

Our main focus lately is on ways to improve the visual quality of the final output. The main problem when you use a single texture for terrain rendering is the excessive stretching on steep slopes. We implemented the &lt;a href=&quot;http://graphics.cs.williams.edu/papers/IndirectionI3D08/&quot; rel=&quot;nofollow&quot;&gt;indirection mapping&lt;/a&gt; technique, but it didn&#039;t seem practical, especially for the terrain sizes we&#039;ve tested (2048 and up). The result might have been a bit better (increased the resolution of the texture on steep slopes), but this didn&#039;t help a lot with stretching since it actually requires different UV parametrization. So we decided to use extra geometry for a selected number of terrain layers with triplanar mapping. The results are a lot better than the previous method at the expense of extra rendering passes (which, unfortunately, is in contrast with the idea of virtual textures). We&#039;ll consider the technique again in the future, but for now we&#039;ll stick with the extra pass.

One idea we had (but didn&#039;t have time to test) is the usage of a 2D texture array for the tile cache. This way, the tiles will be independent of each other, minimizing the artifacts when using trilinear or anisotropic filtering. You&#039;ll still get some artifacts when the filter kernel is larger than the border used, but the data that will be fetched will actually be relevant to the tile (no worries about sampling a rock texture for a neighboring tile when you are expecting something grass-like). Have you tried something similar? Any problems we have to keep in mind when we try it?

Thanks again for dropping by.]]></description>
		<content:encoded><![CDATA[<p>Hello and welcome.</p>
<p>First of all I would like to thank you for the excellent demo and presentation.</p>
<p>At first I didn&#8217;t fully understand what you were suggesting, but after thinking about it a bit more I think I got it. Dynamic placement of decals on a prebaked virtual texture. The reason I didn&#8217;t get it the first time was because we haven&#8217;t come to the &#8220;game&#8221; side of things yet. Let me explain. Currently the virtual texture is generated by painting several different materials on it (see <a href="http://blog.makingartstudios.com/?p=13" rel="nofollow">another post on the subject</a>) inside an editor, and by manipulating decals and roads (by translating, rotating and scaling them). In those cases, a complete invalidation of all the affected tiles from all mipmap levels is required, in contrast with the game side of things, where *static* decals are generated whenever the player shoots something. For those cases, your idea sounds really good, but I guess we&#8217;ll have to start implementing it in order to fully understand the alignment problem you describe. Either way, thanks a lot for the idea. </p>
<p>On the subject of editing the virtual texture, we&#8217;ve mainly focused our attention on the speed of the software rasterizer (since the last post). After a lot of optimizations and by taking into account that we are generating a texture for a terrain (top-down axis aligned quad(s)), the speed has been improved a lot, and the average time for baking the diffuse color and normalmap of an arbitrary tile with 12 terrain layers and about 200 decals (quads) and roads (arbitrary quadrilaterals) is about 5 msec. This number doesn&#8217;t say anything on its own, but compared to the 18 msec for 3 layers (without normalmaps and decals) we mentioned in the other post, it&#8217;s a nice improvement. Unfortunately, I don&#8217;t know how that compares to a GPU implementation because we haven&#8217;t tried it yet. </p>
<p>Our main focus lately is on ways to improve the visual quality of the final output. The main problem when you use a single texture for terrain rendering is the excessive stretching on steep slopes. We implemented the <a href="http://graphics.cs.williams.edu/papers/IndirectionI3D08/" rel="nofollow">indirection mapping</a> technique, but it didn&#8217;t seem practical, especially for the terrain sizes we&#8217;ve tested (2048 and up). The result might have been a bit better (increased the resolution of the texture on steep slopes), but this didn&#8217;t help a lot with stretching since it actually requires different UV parametrization. So we decided to use extra geometry for a selected number of terrain layers with triplanar mapping. The results are a lot better than the previous method at the expense of extra rendering passes (which, unfortunately, is in contrast with the idea of virtual textures). We&#8217;ll consider the technique again in the future, but for now we&#8217;ll stick with the extra pass.</p>
<p>One idea we had (but didn&#8217;t have time to test) is the usage of a 2D texture array for the tile cache. This way, the tiles will be independent of each other, minimizing the artifacts when using trilinear or anisotropic filtering. You&#8217;ll still get some artifacts when the filter kernel is larger than the border used, but the data that will be fetched will actually be relevant to the tile (no worries about sampling a rock texture for a neighboring tile when you are expecting something grass-like). Have you tried something similar? Any problems we have to keep in mind when we try it?</p>
<p>Thanks again for dropping by.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Virtual Textures by Sean Barrett</title>
		<link>http://blog.makingartstudios.com/?p=12&#038;cpage=1#comment-23</link>
		<dc:creator>Sean Barrett</dc:creator>
		<pubDate>Fri, 22 May 2009 19:32:12 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=12#comment-23</guid>
		<description><![CDATA[For my SVT demo, I used an algorithm that could bake mipmap levels on demand, which is relevant to dynamic updating even if you&#039;re mostly prebaking.

This is mostly interesting for decals. The idea is that you just mipmap the decal, and then apply the mipmaps of the decal to the mipmapped underlying image, rather than recompute the mipmap. This is obviously inaccurate (features that get covered up by the decal at high-res mipmap levels will bleed through at lower mipmap levels as the decal&#039;s alpha drops), but if you pretend you&#039;re using trilinear, the goal of the mipmaps is to minimize the visible transition between mipmap levels, not to produce (say) the RMSE-optimized idealized filtered version. So if in general mipmapping is good enough, it shouldn&#039;t be surprising that mipmapping of the decal independently is good enough (if you imagine what happens during an LOD transition).

One subtlety: if you mipmap the decal with a box filter, then you have an alignment issue: if your decal appears at (0,0) on the texture, it will align with the underlying mipmap and mipmap &quot;identically&quot; with how it would, but if it appears at (1,1) in the texture, to apply the mipmap to the next level up it now falls at a subpixel position, so you either need to snap, or bilerp it. Bilerping will introduce extra blurring, and snapping will introduce a visible movement as you transition LODs.

My solution in my demo was to produce 4 mipmap variants from the base, each shifted by (0,0), (1,0), (0,1), and (1,1). This requires 4 times as much storage, but the mipmap is 1/4 the size. If you wanted to do it &quot;right&quot;, you would need 16 variants of the next level up, 64 of the next level up, etc. This would require just as much storage per mipmap level as the original image. Instead, I just computed 4 shifts at each level total, allowing some translation as you went up mipmap levels (but only half as much as the naive solution would produce).
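
A minimal sketch of the four shifted box-filter variants, in Python. This is an illustration under assumptions and not the code from the demo: the names box_mip_shifted, mip_variants and pick_variant are hypothetical, the image is a single channel stored as a list of rows, texels outside the decal count as 0 (fully transparent), and the variant is picked from the parity of the decal position on the texture.

```python
def box_mip_shifted(img, sx, sy):
    # Downsample img (a list of rows of floats) by 2 with a 2x2 box filter.
    # The block grid starts at decal coordinate (-sx, -sy), so sx and sy
    # (each 0 or 1) select one of the four alignments.
    h = len(img)
    w = len(img[0])
    out_w = (w + sx + 1) // 2
    out_h = (h + sy + 1) // 2

    def texel(x, y):
        # Assumption: anything outside the decal is fully transparent.
        inside = x in range(w) and y in range(h)
        return img[y][x] if inside else 0.0

    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            x0 = 2 * ox - sx
            y0 = 2 * oy - sy
            total = (texel(x0, y0) + texel(x0 + 1, y0)
                     + texel(x0, y0 + 1) + texel(x0 + 1, y0 + 1))
            row.append(total / 4.0)
        out.append(row)
    return out


def mip_variants(img):
    # The four variants, keyed by their (sx, sy) shift.
    return {(sx, sy): box_mip_shifted(img, sx, sy)
            for sy in (0, 1) for sx in (0, 1)}


def pick_variant(variants, px, py):
    # Hypothetical selection rule: a decal placed at base-level texel
    # (px, py) uses the variant matching the parity of that position.
    return variants[(px % 2, py % 2)]
```

With a 2x2 decal of all ones, the (0,0) variant is a single texel of 1.0, while the (1,1) variant spreads the same coverage over four texels of 0.25 each, which is the alignment difference described above.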

An alternative would be to use a mipmap that&#039;s not biased to any particular shift; you could do this with just some different filter that&#039;s more friendly, or you can just use the *average* of all the mipmaps that would be computed in the above process. This would probably be what I would do if I wanted to do this for real on the CPU. (If you can GPU accelerate it, then maybe the bilerp of the naive decal mipmaps is good enough.)

GPU decal application to baked mipmaps has a drawback--once your decal mip level reaches 1x1, if it has a non-zero alpha, you actually want to continue generating higher level mipmaps (e.g. you have a 1x1 decal with alpha=0.6, then you want a next-higher mipmap that is a 1x1 decal with alpha=0.15). Hardware designers missed this orthogonality.]]></description>
		<content:encoded><![CDATA[<p>For my SVT demo, I used an algorithm that could bake mipmap levels on demand, which is relevant to dynamic updating even if you&#8217;re mostly prebaking.</p>
<p>This is mostly interesting for decals. The idea is that you just mipmap the decal, and then apply the mipmaps of the decal to the mipmapped underlying image, rather than recompute the mipmap. This is obviously inaccurate (features that get covered up by the decal at high-res mipmap levels will bleed through at lower mipmap levels as the decal&#8217;s alpha drops), but if you pretend you&#8217;re using trilinear, the goal of the mipmaps is to minimize the visible transition between mipmap levels, not to produce (say) the RMSE-optimized idealized filtered version. So if in general mipmapping is good enough, it shouldn&#8217;t be surprising that mipmapping of the decal independently is good enough (if you imagine what happens during an LOD transition).</p>
<p>One subtlety: if you mipmap the decal with a box filter, then you have an alignment issue: if your decal appears at (0,0) on the texture, it will align with the underlying mipmap and mipmap &#8220;identically&#8221; with how it would, but if it appears at (1,1) in the texture, to apply the mipmap to the next level up it now falls at a subpixel position, so you either need to snap, or bilerp it. Bilerping will introduce extra blurring, and snapping will introduce a visible movement as you transition LODs.</p>
<p>My solution in my demo was to produce 4 mipmap variants from the base, each shifted by (0,0), (1,0), (0,1), and (1,1). This requires 4 times as much storage, but the mipmap is 1/4 the size. If you wanted to do it &#8220;right&#8221;, you would need 16 variants of the next level up, 64 of the next level up, etc. This would require just as much storage per mipmap level as the original image. Instead, I just computed 4 shifts at each level total, allowing some translation as you went up mipmap levels (but only half as much as the naive solution would produce).</p>
<p>An alternative would be to use a mipmap that&#8217;s not biased to any particular shift; you could do this with just some different filter that&#8217;s more friendly, or you can just use the *average* of all the mipmaps that would be computed in the above process. This would probably be what I would do if I wanted to do this for real on the CPU. (If you can GPU accelerate it, then maybe the bilerp of the naive decal mipmaps is good enough.)</p>
<p>GPU decal application to baked mipmaps has a drawback&#8211;once your decal mip level reaches 1&#215;1, if it has a non-zero alpha, you actually want to continue generating higher level mipmaps (e.g. you have a 1&#215;1 decal with alpha=0.6, then you want a next-higher mipmap that is a 1&#215;1 decal with alpha=0.15). Hardware designers missed this orthogonality.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tracking Memory Allocations by Greg</title>
		<link>http://blog.makingartstudios.com/?p=16&#038;cpage=1#comment-22</link>
		<dc:creator>Greg</dc:creator>
		<pubDate>Tue, 24 Mar 2009 12:38:17 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=16#comment-22</guid>
		<description><![CDATA[do you plan to make this tool available please?]]></description>
		<content:encoded><![CDATA[<p>do you plan to make this tool available please?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by JD</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-20</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Fri, 12 Sep 2008 11:08:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-20</guid>
		<description><![CDATA[Using the same FBO and switching attachments may not be the optimal solution, but i&#039;m doing it that way mostly because i was reading everywhere that this should be the fastest approach (in other words, avoid excessive glBindFramebuffer calls). On the other hand, changing FBO attachments all the time may add extra validation overhead in the driver. The only way to find the best approach is to profile your code, but again it may be driver specific as you said.

I&#039;ve already mentioned depth bounds testing in combination with scissoring, in order to speed up shadow mask calculations, at the end of the post. I&#039;m also aware of the timer query extension, but i haven&#039;t had the chance to work with it yet. I didn&#039;t know that those two are nvidia-only extensions (especially depth bounds testing), so i&#039;ll keep that in mind. 

Thanks again for your comments.

JD]]></description>
		<content:encoded><![CDATA[<p>Using the same FBO and switching attachments may not be the optimal solution, but i&#8217;m doing it that way mostly because i was reading everywhere that this should be the fastest approach (in other words, avoid excessive glBindFramebuffer calls). On the other hand, changing FBO attachments all the time may add extra validation overhead in the driver. The only way to find the best approach is to profile your code, but again it may be driver specific as you said.</p>
<p>I&#8217;ve already mentioned depth bounds testing in combination with scissoring, in order to speed up shadow mask calculations, at the end of the post. I&#8217;m also aware of the timer query extension, but i haven&#8217;t had the chance to work with it yet. I didn&#8217;t know that those two are nvidia-only extensions (especially depth bounds testing), so i&#8217;ll keep that in mind. </p>
<p>Thanks again for your comments.</p>
<p>JD</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by Kay Chang</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-19</link>
		<dc:creator>Kay Chang</dc:creator>
		<pubDate>Wed, 10 Sep 2008 09:08:53 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-19</guid>
		<description><![CDATA[I am doing the same thing (same FBO with depth-stencil attachment - several color attachments RGBA8-FP16 swapped over the frame). It has been working for a while, although the latest drivers give me a staggered framerate (not sure why though).

If you are programming on an nvidia card, you should have 2 useful extensions available to you:
- GL_DEPTH_BOUNDS_TEST_EXT: I use it as a work-around to render light volumes in my deferred renderer if early-stencil rejection isn&#039;t working ( =&gt; nvidia).
- GL_EXT_timer_query: very accurate to time GPU calls and to use as a GPU profiler. it has been very valuable to me, as sometimes the driver screws things up when you switch FBOs or change to FBO -funky- configurations, etc... some operations are not very well supported by the driver (or maybe not meant to be?).

Hope this helps.

Cheers,
Kc]]></description>
		<content:encoded><![CDATA[<p>I am doing the same thing (same FBO with depth-stencil attachment &#8211; several color attachments RGBA8-FP16 swapped over the frame). It has been working for a while, although the latest drivers give me a staggered framerate (not sure why though).</p>
<p>If you are programming on an nvidia card, you should have 2 useful extensions available to you:<br />
- GL_DEPTH_BOUNDS_TEST_EXT: I use it as a work-around to render light volumes in my deferred renderer if early-stencil rejection isn&#8217;t working ( =&gt; nvidia).<br />
- GL_EXT_timer_query: very accurate to time GPU calls and to use as a GPU profiler. it has been very valuable to me, as sometimes the driver screws things up when you switch FBOs or change to FBO -funky- configurations, etc&#8230; some operations are not very well supported by the driver (or maybe not meant to be?).</p>
<p>Hope this helps.</p>
<p>Cheers,<br />
Kc</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by JD</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-18</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Wed, 10 Sep 2008 05:35:58 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-18</guid>
		<description><![CDATA[Thanks a lot for the info, Kay. I suspected that it may have something to do with the GPU itself and not the driver, but i didn&#039;t have a chance to test it on anything other than my card. Writing to stencil with early stencil rejection isn&#039;t required in the case of shadow masks, since the only object written to stencil is the light volume, but it may be relevant for other, more complex algorithms.

I&#039;ll have to make some tests for early depth rejection at some point. Fortunately i&#039;m not sharing the depth renderbuffer between FBOs, so there should be no problems with that. I&#039;m currently changing the color attachments of an FBO and keeping the depth buffer attached all the time. Are there any problems with that?

Thanks again for the valuable information.

JD]]></description>
		<content:encoded><![CDATA[<p>Thanks a lot for the info, Kay. I suspected that it may have something to do with the GPU itself and not the driver, but i didn&#8217;t have a chance to test it on anything other than my card. Writing to stencil with early stencil rejection isn&#8217;t required in the case of shadow masks, since the only object written to stencil is the light volume, but it may be relevant for other, more complex algorithms.</p>
<p>I&#8217;ll have to make some tests for early depth rejection at some point. Fortunately i&#8217;m not sharing the depth renderbuffer between FBOs, so there should be no problems with that. I&#8217;m currently changing the color attachments of an FBO and keeping the depth buffer attached all the time. Are there any problems with that?</p>
<p>Thanks again for the valuable information.</p>
<p>JD</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by Kay Chang</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-17</link>
		<dc:creator>Kay Chang</dc:creator>
		<pubDate>Tue, 09 Sep 2008 12:45:28 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-17</guid>
		<description><![CDATA[I&#039;ve performed similar tests on my GeForce 7900GT a while ago.
- early depth rejection always works, on the window framebuffer, or offscreen FBO
- early stencil rejection only works on the window framebuffer, never on the offscreen FBO

While running on a Geforce 8800GTX, early depth rejection and early stencil rejection always work, either on the window framebuffer or offscreen FBO.

It&#039;s also worth noticing that early depth rejection and early stencil rejection are triggered only with specific render-state combinations, which AREN&#039;T the same for Nvidia and ATI graphics cards.

AFAIK: with Nvidia, stencil-write must be disabled for early stencil rejection to kick in.
On ATI, you can write the stencil while testing it, early rejection keeps working in theory (hummm haven&#039;t tried that myself).
For early depth rejection, it might only work with specific depth-compare functions (not all - depends on vendors). Trying to share an offscreen depth texture between different FBOs is likely to trash early depth cull memory, etc...

Definitely a lot of fun to get all that working xD]]></description>
		<content:encoded><![CDATA[<p>I&#8217;ve performed similar tests on my GeForce 7900GT a while ago.<br />
- early depth rejection always works, on the window framebuffer, or offscreen FBO<br />
- early stencil rejection only works on the window framebuffer, never on the offscreen FBO</p>
<p>While running on a Geforce 8800GTX, early depth rejection and early stencil rejection always work, either on the window framebuffer or offscreen FBO.</p>
<p>It&#8217;s also worth noticing that early depth rejection and early stencil rejection are triggered only with specific render-state combinations, which AREN&#8217;T the same for Nvidia and ATI graphics cards.</p>
<p>AFAIK: with Nvidia, stencil-write must be disabled for early stencil rejection to kick in.<br />
On ATI, you can write the stencil while testing it, early rejection keeps working in theory (hummm haven&#8217;t tried that myself).<br />
For early depth rejection, it might only work with specific depth-compare functions (not all &#8211; depends on vendors). Trying to share an offscreen depth texture between different FBOs is likely to trash early depth cull memory, etc&#8230;</p>
<p>Definitely a lot of fun to get all that working xD</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Parallel Split Shadow Maps by JD</title>
		<link>http://blog.makingartstudios.com/?p=8&#038;cpage=1#comment-16</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Fri, 05 Sep 2008 12:26:32 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=8#comment-16</guid>
		<description><![CDATA[Brian, thanks for the link. The depth bounds test sounds like a nice idea, but unfortunately i can&#039;t think of an easy way to incorporate it in the current design due to multiple passes. I&#039;ll definitely keep it in mind for the future though. 

About the first comment in the link you posted. The resulting shader will look like the second fragment shader i posted above, with the exception that since we have an atlas, it&#039;s more natural to first find the correct vector and then sample the shadowmap atlas. The idea is the same, but unfortunately it can&#039;t be applied in my case since i don&#039;t have access to the 4 shadow vectors from a vertex shader. What i have is the pixel&#039;s position in world space, and i have to work with that. This means calculating the 4 shadowmap vectors in the fragment shader (by passing the matrices as constants), then finding the correct one using the calculated index and finally sampling the shadowmap. This needs only one TEX but it&#039;s math heavy so i don&#039;t know if there is actually a speed up.

Unfortunately, i don&#039;t have ShaderX6 so i can&#039;t comment on the article you suggested. It&#039;s on my list of books to buy but i haven&#039;t had a chance to get it yet. The good thing is that you gave me an idea to improve the shader. Here it is:

Since all 4 matrices (1 per split) are based on the same light projection matrix, with different crop matrices, i think it&#039;s possible to simplify the shader by passing the light projection matrix as uniforms, and storing the crop matrices in the texture. The good thing is that the crop matrices have a simple representation and they can be packed in 4 floats, which requires only one texel in the matrixTex. This way we still need the transformation from world to light space (4 DOTs), but we can replace the 4 TEXs from the matrixTex with 1 TEX, 1 MUL and 1 MAD.
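
A minimal sketch of the packing idea, written in Python rather than shader code. It assumes the crop matrix only scales and offsets x and y, so the four floats are (sx, sy, ox, oy); pack_crop, apply_packed and apply_matrix are hypothetical names used for illustration and are not part of the engine.

```python
def pack_crop(m):
    # m is a 4x4 crop matrix as row-major nested lists.
    # Assumption: only the x/y scale and x/y offset entries matter,
    # so the matrix fits in one RGBA texel as (sx, sy, ox, oy).
    return (m[0][0], m[1][1], m[0][3], m[1][3])


def apply_packed(packed, x, y):
    # One scale and one offset per component (a multiply-add each).
    sx, sy, ox, oy = packed
    return (x * sx + ox, y * sy + oy)


def apply_matrix(m, x, y, z):
    # Reference path: full 4x4 matrix times the point (x, y, z, 1).
    v = (x, y, z, 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
```

For a crop matrix with scale (2, 4) and offset (0.5, -1), both paths map the light-space point (0.25, 0.5) to (1.0, 1.0), so the packed form gives the same x and y as the full matrix under the assumption above.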

I haven&#039;t tried it yet, but i&#039;ll definitely do so once i have the chance.

Thanks again for the comments.

JD]]></description>
		<content:encoded><![CDATA[<p>Brian, thanks for the link. The depth bounds test sounds like a nice idea, but unfortunately i can&#8217;t think of an easy way to incorporate it in the current design due to multiple passes. I&#8217;ll definitely keep it in mind for the future though. </p>
<p>About the first comment in the link you posted. The resulting shader will look like the second fragment shader i posted above, with the exception that since we have an atlas, it&#8217;s more natural to first find the correct vector and then sample the shadowmap atlas. The idea is the same, but unfortunately it can&#8217;t be applied in my case since i don&#8217;t have access to the 4 shadow vectors from a vertex shader. What i have is the pixel&#8217;s position in world space, and i have to work with that. This means calculating the 4 shadowmap vectors in the fragment shader (by passing the matrices as constants), then finding the correct one using the calculated index and finally sampling the shadowmap. This needs only one TEX but it&#8217;s math heavy so i don&#8217;t know if there is actually a speed up.</p>
<p>Unfortunately, i don&#8217;t have ShaderX6 so i can&#8217;t comment on the article you suggested. It&#8217;s on my list of books to buy but i haven&#8217;t had a chance to get it yet. The good thing is that you gave me an idea to improve the shader. Here it is:</p>
<p>Since all 4 matrices (1 per split) are based on the same light projection matrix, with different crop matrices, i think it&#8217;s possible to simplify the shader by passing the light projection matrix as uniforms, and storing the crop matrices in the texture. The good thing is that the crop matrices have a simple representation and they can be packed in 4 floats, which requires only one texel in the matrixTex. This way we still need the transformation from world to light space (4 DOTs), but we can replace the 4 TEXs from the matrixTex with 1 TEX, 1 MUL and 1 MAD.</p>
<p>I haven&#8217;t tried it yet, but i&#8217;ll definitely do so once i have the chance.</p>
<p>Thanks again for the comments.</p>
<p>JD</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Parallel Split Shadow Maps by Brian Richardson</title>
		<link>http://blog.makingartstudios.com/?p=8&#038;cpage=1#comment-15</link>
		<dc:creator>Brian Richardson</dc:creator>
		<pubDate>Thu, 04 Sep 2008 21:14:36 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=8#comment-15</guid>
		<description><![CDATA[Check out &quot;Stable Rendering of Cascaded Shadow Maps&quot;.  It shows a really cool trick of using masks and dot products to scale the base projection matrix.  That would let you get rid of the texMatrix samples by just computing the tex coords directly.  You can also fold the texture atlas computation into it!  

You can also see it in the first comment of this:  http://pixelstoomany.wordpress.com/2007/09/21/fast-percentage-closer-filtering-on-deferred-cascaded-shadow-maps/]]></description>
		<content:encoded><![CDATA[<p>Check out &#8220;Stable Rendering of Cascaded Shadow Maps&#8221;.  It shows a really cool trick of using masks and dot products to scale the base projection matrix.  That would let you get rid of the texMatrix samples by just computing the tex coords directly.  You can also fold the texture atlas computation into it!  </p>
<p>You can also see it in the first comment of this:  <a href="http://pixelstoomany.wordpress.com/2007/09/21/fast-percentage-closer-filtering-on-deferred-cascaded-shadow-maps/" rel="nofollow">http://pixelstoomany.wordpress.com/2007/09/21/fast-percentage-closer-filtering-on-deferred-cascaded-shadow-maps/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by Brian Richardson</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-14</link>
		<dc:creator>Brian Richardson</dc:creator>
		<pubDate>Thu, 04 Sep 2008 21:05:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-14</guid>
		<description><![CDATA[Looks like you&#039;ve covered all the bases with the last test to me.  I just know it&#039;s good to ask the dumb questions when getting a &quot;funny result.&quot;  heh]]></description>
		<content:encoded><![CDATA[<p>Looks like you&#8217;ve covered all the bases with the last test to me.  I just know it&#8217;s good to ask the dumb questions when getting a &#8220;funny result.&#8221;  heh</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by JD</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-13</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Wed, 03 Sep 2008 06:08:38 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-13</guid>
		<description><![CDATA[Hi Brian,

I&#039;m not really sure if I could get the setup wrong, and here is why:

1) I&#039;m using a renderbuffer for the depth/stencil surface, so there is only one way to create and set it up.
2) Dimensions are the same as the color texture, primarily because they are both created in the same function which takes the dimensions as parameters, but most importantly, differently sized attachments aren&#039;t supported by my 7950, so i would expect the FBO to be incomplete (i check for that).
3) There is only one format supported for rendering to a stencil buffer with FBOs, and this is DEPTH24_STENCIL8 (&lt;a href=&quot;http://www.opengl.org/registry/specs/EXT/packed_depth_stencil.txt&quot; rel=&quot;nofollow&quot;&gt;GL_EXT_packed_depth_stencil&lt;/a&gt;) so I don&#039;t have a lot of options here.

I haven&#039;t checked GLexpert, so i don&#039;t know if it complains about my setup, so i&#039;ll have to do it sometime. AFAIK this is the only way to check if something is wrong under GL. Hopefully the debug profile under GL 3.0 will make tracing down such problems easier, but i&#039;m not really sure about that until complete GL 3.0 drivers are out (and i&#039;m able to get a GL 3.0 capable card, but this is another story).

One thing I&#039;ve tested since i posted this, is what happens if i don&#039;t render anything to the stencil buffer. I just clear it to the correct value and let the fragments pass if the value is different than the clear value. Again, when rendering to the window FPS is pretty high (because no fragments pass the test), but when i render to the FBO the FPS is the same with or without the test, as if the test happens after the fragment shader (the output is correct in both cases).

If you have any suggestions on how to test this further i&#039;d be glad to hear them.

JD]]></description>
		<content:encoded><![CDATA[<p>Hi Brian,</p>
<p>I&#8217;m not really sure if I could get the setup wrong, and here is why:</p>
<p>1) I&#8217;m using a renderbuffer for the depth/stencil surface, so there is only one way to create and set it up.<br />
2) Dimensions are the same as the color texture, primarily because they are both created in the same function which takes the dimensions as parameters, but most importantly, differently sized attachments aren&#8217;t supported by my 7950, so I would expect the FBO to be incomplete (I check for that).<br />
3) There is only one format supported for rendering to a stencil buffer with FBOs, and this is DEPTH24_STENCIL8 (<a href="http://www.opengl.org/registry/specs/EXT/packed_depth_stencil.txt" rel="nofollow">GL_EXT_packed_depth_stencil</a>) so I don&#8217;t have a lot of options here.</p>
<p>I haven&#8217;t checked GLexpert yet, so I don&#8217;t know whether it complains about my setup; I&#8217;ll have to do that sometime. AFAIK this is the only way to check if something is wrong under GL. Hopefully the debug profile under GL 3.0 will make tracking down such problems easier, but I&#8217;m not really sure about that until complete GL 3.0 drivers are out (and I&#8217;m able to get a GL 3.0 capable card, but this is another story).</p>
<p>One thing I&#8217;ve tested since I posted this is what happens if I don&#8217;t render anything to the stencil buffer. I just clear it to the correct value and let the fragments pass if the value is different from the clear value. Again, when rendering to the window FPS is pretty high (because no fragments pass the test), but when I render to the FBO the FPS is the same with or without the test, as if the test happens after the fragment shader (the output is correct in both cases).</p>
<p>If you have any suggestions on how to test this further, I&#8217;d be glad to hear them.</p>
<p>JD</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Shadow masks and stencil buffer optimization by Brian Richardson</title>
		<link>http://blog.makingartstudios.com/?p=7&#038;cpage=1#comment-12</link>
		<dc:creator>Brian Richardson</dc:creator>
		<pubDate>Tue, 02 Sep 2008 22:58:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=7#comment-12</guid>
		<description><![CDATA[Is it possible that the FBO you&#039;re rendering to doesn&#039;t have the depth/stencil buffer attached to it correctly?  I&#039;ve run into an occasional situation where I wouldn&#039;t always notice if there was a depth/stencil size/format mismatch and the API would just not use the buffer.  With DirectX, you can just use the DX debug runtime and it&#039;ll complain about that.]]></description>
		<content:encoded><![CDATA[<p>Is it possible that the FBO you&#8217;re rendering to doesn&#8217;t have the depth/stencil buffer attached to it correctly?  I&#8217;ve run into an occasional situation where I wouldn&#8217;t always notice if there was a depth/stencil size/format mismatch and the API would just not use the buffer.  With DirectX, you can just use the DX debug runtime and it&#8217;ll complain about that.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Parallel Split Shadow Maps by JD</title>
		<link>http://blog.makingartstudios.com/?p=8&#038;cpage=1#comment-11</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Sat, 02 Aug 2008 18:36:52 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=8#comment-11</guid>
		<description><![CDATA[Hi Eichi,

First of all, thanks for reading.

The actual shader I&#039;m using is the last one. I know that 5 TEXs should be expensive, but unfortunately in my case this seemed like the best approach. I think the 4 TEXs from the matrix texture should be very coherent, so they should be fast, but I&#039;m not really sure about that.

Your approach of rendering all 4 splits into the 4 different channels of the same texture sounds good. May I ask what format you are using for that? I assume RGBA16f (or RGBA32f), in contrast with my implementation where I use one regular depth24 texture for all shadow maps. If this is the case, isn&#039;t there a difference in performance from rendering to a floating point texture?

What I can&#039;t understand is how you sample your shadowmap. The way I think about it, you have 2 options:

1) Perform 4 TEXs (using the 4 different shadowmap space vectors) and select the correct one based on the calculated index (see the second fragment shader from above), or 
2) Select the correct shadowmap space vector and perform only one TEX at the correct location (the channel can be easily extracted using the split index).


In the second case (which is the only one I can think of with 1 TEX), how do you select the correct shadowmap space vector using the split index?

JD]]></description>
		<content:encoded><![CDATA[<p>Hi Eichi,</p>
<p>First of all, thanks for reading.</p>
<p>The actual shader I&#8217;m using is the last one. I know that 5 TEXs should be expensive, but unfortunately in my case this seemed like the best approach. I think the 4 TEXs from the matrix texture should be very coherent, so they should be fast, but I&#8217;m not really sure about that.</p>
<p>Your approach of rendering all 4 splits into the 4 different channels of the same texture sounds good. May I ask what format you are using for that? I assume RGBA16f (or RGBA32f), in contrast with my implementation where I use one regular depth24 texture for all shadow maps. If this is the case, isn&#8217;t there a difference in performance from rendering to a floating point texture?</p>
<p>What I can&#8217;t understand is how you sample your shadowmap. The way I think about it, you have 2 options:</p>
<p>1) Perform 4 TEXs (using the 4 different shadowmap space vectors) and select the correct one based on the calculated index (see the second fragment shader from above), or<br />
2) Select the correct shadowmap space vector and perform only one TEX at the correct location (the channel can be easily extracted using the split index).</p>
<p>In the second case (which is the only one I can think of with 1 TEX), how do you select the correct shadowmap space vector using the split index?</p>
<p>JD</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Parallel Split Shadow Maps by Eichi</title>
		<link>http://blog.makingartstudios.com/?p=8&#038;cpage=1#comment-10</link>
		<dc:creator>Eichi</dc:creator>
		<pubDate>Sat, 02 Aug 2008 17:22:55 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=8#comment-10</guid>
		<description><![CDATA[Hey, I&#039;ve just read your article and it&#039;s a good idea! But I wonder about your shaders. They seem to be a bit costly. Five TEXs per fragment is very heavy! Take a look at my page - I&#039;ve been implementing PSSM for a while, but with a slightly different approach: I render every shadow map in a different color channel (thus making VSM impossible) to spare three additional textures. Furthermore, I came up with a solution with only one (!) TEX per fragment.]]></description>
		<content:encoded><![CDATA[<p>Hey, I&#8217;ve just read your article and it&#8217;s a good idea! But I wonder about your shaders. They seem to be a bit costly. Five TEXs per fragment is very heavy! Take a look at my page &#8211; I&#8217;ve been implementing PSSM for a while, but with a slightly different approach: I render every shadow map in a different color channel (thus making VSM impossible) to spare three additional textures. Furthermore, I came up with a solution with only one (!) TEX per fragment.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First post (and hopefully not the last)&#8230; by Evangelos</title>
		<link>http://blog.makingartstudios.com/?p=3&#038;cpage=1#comment-8</link>
		<dc:creator>Evangelos</dc:creator>
		<pubDate>Thu, 19 Jun 2008 09:00:04 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=3#comment-8</guid>
		<description><![CDATA[Nice work. I cannot wait for more videos... and please give some tips for marching cubes and alternative techniques :)]]></description>
		<content:encoded><![CDATA[<p>Nice work. I cannot wait for more videos&#8230; and please give some tips for marching cubes and alternative techniques <img src='http://blog.makingartstudios.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First post (and hopefully not the last)&#8230; by Isovitis</title>
		<link>http://blog.makingartstudios.com/?p=3&#038;cpage=1#comment-6</link>
		<dc:creator>Isovitis</dc:creator>
		<pubDate>Mon, 26 May 2008 02:41:42 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=3#comment-6</guid>
		<description><![CDATA[Excellent work!
I hope that appropriate people would appreciate it!]]></description>
		<content:encoded><![CDATA[<p>Excellent work!<br />
I hope that appropriate people would appreciate it!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First post (and hopefully not the last)&#8230; by Leverkin</title>
		<link>http://blog.makingartstudios.com/?p=3&#038;cpage=1#comment-5</link>
		<dc:creator>Leverkin</dc:creator>
		<pubDate>Fri, 23 May 2008 08:20:14 +0000</pubDate>
		<guid isPermaLink="false">http://blog.makingartstudios.com/?p=3#comment-5</guid>
		<description><![CDATA[Great website. Keep up the good work. Hope you upload more videos soon!]]></description>
		<content:encoded><![CDATA[<p>Great website. Keep up the good work. Hope you upload more videos soon!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
