DX11 Verlet Cloth Simulation - Threading Problem

mburk · July 9, 2013, 11:06am

Hi.

This is one for the DX11 & Shader Experts.

I’ve been trying to implement a cloth simulation with a physics spring system based on verlet integration. And I pretty much succeeded, as you can see in the screenshot, BUT I still have a very basic problem:
Since everything happens in a compute shader, the whole calculation depends on how many thread groups are initialized and how many dispatches are done. This seems to greatly affect the simulation, and if I use a “wrong” thread configuration, feedback loops appear up to the point that everything explodes (kind of). My guess is that the number of threads and thread groups has to correspond with the shader kernels of my graphics card or something, because I found a stable configuration that runs like a charm on my machine but gets totally out of control on a different one. For some reason it works with 4 dispatches and 4 thread groups for 4096 points. This configuration is also scalable, because it works with 8 dispatches, 8 thread groups and 16384 points as well (and so on). So there seems to be some direct connection, but I can’t figure out a general rule.

So I guess my question would be:

Is there a right and generally working combination of threads and thread groups for this application, or do I need to synchronize the thread groups? Or maybe I made a completely different mistake and all this has nothing to do with threads?

If something in the code or the patch is not clear, please ask.

All thoughts and suggestions welcome!

Thank you all very much for reading all this! ;)

Cheers

Verlet Patch - final.rar (1.1 MB)

vux · July 9, 2013, 5:12pm

Ok so gonna take a little to explain (preparing a docu about that), very cool by the way first thing ;)

The thread group size (numthreads) tells the graphics card how many threads run together in one group.

So they can give you a good boost depending on how to access the data. In your case you use 1 dimension Dispatch, so rule is like that:

[numthreads(countX, 1, 1)]

You have N elements to process, so you need to dispatch enough groups to cover your whole buffer, which gives 2 cases:

Your buffer size is a multiple of countX:
dispatcher X = buffersize / countX
Your buffer size is not a multiple of countX:
In that case you need to send a few more threads than the buffer size, so round up (integer division):
dispatcher X = (buffersize + (countX - 1)) / countX

But then in compute shader you need to discard threads that will overflow your buffer, which is simply done as :

if (DTid.x >= buffersize) { return; }

That should be the first line in your compute shader.
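
To make the rule above concrete, here is a minimal sketch of a 1D kernel with the bounds check, with the group-count arithmetic worked out in the comments. The names ClothPoint, Points, pointCount and CS_Integrate are placeholders and not taken from the attached patch, and the oldPos field is an assumption.

```hlsl
// Hypothetical layout: the thread only shows that elements have a .pos member.
struct ClothPoint
{
    float3 pos;
    float3 oldPos;   // assumed field, not confirmed by the original patch
};

RWStructuredBuffer<ClothPoint> Points : register(u0);

cbuffer Params : register(b0)
{
    uint pointCount;   // number of valid elements in Points
};

#define GROUP_SIZE 64

// Group count, computed on the dispatching side with integer division:
//   groupsX = (pointCount + GROUP_SIZE - 1) / GROUP_SIZE
//   pointCount = 4096 -> groupsX = 64 (exact multiple, no spare threads)
//   pointCount = 5000 -> groupsX = 79 (79 * 64 = 5056, so 56 spare threads)
[numthreads(GROUP_SIZE, 1, 1)]
void CS_Integrate(uint3 DTid : SV_DispatchThreadID)
{
    // Valid indices are 0 .. pointCount - 1, so the spare threads exit here.
    if (DTid.x >= pointCount) { return; }

    ClothPoint p = Points[DTid.x];
    // ... per-point work goes here ...
    Points[DTid.x] = p;
}
```

With this in place the result no longer depends on picking a "lucky" group count: any pointCount is covered exactly once.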

Now generally for 1d buffers, I tend to use:
[numthreads(64, 1, 1)]

Which normally fits pretty well.

Now in your case you will do a lot of neighbour lookups in 2 dimensions, so 1d dispatch is not ideal for cache locality.

So another way to do is for example:
[numthreads(8, 8, 1)]

It will dispatch 64 threads as done above, but the value from SV_DispatchThreadID will be different.

In 1d case it’s more or less just a counter (x will go from 0 to buffer count - 1, y and z will always stay at 0).

Now in 2d case best is to split your buffer into row/column count and do the same equation as above, once for row count and once for column count.

That means that your lookup function will change, since now value from DTId will be grid based instead (eg: x and y will get values).

So you’ll have this inside compute shader
uint index = (i.y * columncount) + i.x;

Use this value for lookups.
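
As a rough sketch of the 2D variant, assuming the cloth is stored row by row: the buffer names, the Grid constant buffer and CS_Grid are made up for illustration, and the midpoint written at the end is only a stand-in for real per-point work.

```hlsl
StructuredBuffer<float3>   PositionsIn  : register(t0); // read-only input
RWStructuredBuffer<float3> PositionsOut : register(u0); // separate output

cbuffer Grid : register(b0)
{
    uint columnCount;   // points per row
    uint rowCount;      // number of rows
};

// Group counts, one round-up per axis (integer division):
//   groupsX = (columnCount + 7) / 8
//   groupsY = (rowCount    + 7) / 8
[numthreads(8, 8, 1)]
void CS_Grid(uint3 i : SV_DispatchThreadID)
{
    // Discard threads outside the grid in either direction.
    if (i.x >= columnCount || i.y >= rowCount) { return; }

    // Flatten the 2D thread id into the 1D buffer index.
    uint index = (i.y * columnCount) + i.x;

    // Neighbour lookups, clamped at the cloth border.
    uint right = (i.y * columnCount) + min(i.x + 1, columnCount - 1);
    uint below = (min(i.y + 1, rowCount - 1) * columnCount) + i.x;

    // Placeholder work: write the midpoint of the two neighbours.
    PositionsOut[index] = 0.5f * (PositionsIn[right] + PositionsIn[below]);
}
```

For a 64 x 64 cloth (4096 points) this gives 8 x 8 groups of 8 x 8 threads each.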

One other important bit with threads (not applicable in your case) is when you use groupshared, which can either increase your perf a lot or decrease it a lot if not used properly. Since this amount of memory is limited, you need to make sure each thread group will not overflow it (but I’ll keep that for separate topics ;)

Last one, while having quick look at your code:

p1 = Output[iterator].pos.xy;
p2 = Output[iterator+addition].pos.xy;

Output[iterator].pos.xy += correctionVectorHalf;
Output[iterator+addition].pos.xy -= correctionVectorHalf;

Be extremely careful when doing that. Normally you should use 2 buffers and do some ping pong, since otherwise you might get race conditions: several threads read and write the same elements, and you don’t know in what order elements are executed.
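
One possible way to set up the ping pong, as a sketch rather than the patch's actual code: each thread reads neighbours only from the previous pass and writes only its own element, so no two threads ever write the same slot. PointsIn, PointsOut, restLength, HalfCorrection and CS_Constraint are hypothetical names; addition is reused from the snippet above and is assumed to be a plain index offset to the linked point.

```hlsl
struct ClothPoint
{
    float3 pos;
    float3 oldPos;   // assumed field
};

StructuredBuffer<ClothPoint>   PointsIn  : register(t0); // previous pass, read-only
RWStructuredBuffer<ClothPoint> PointsOut : register(u0); // current pass, written once per thread

cbuffer Params : register(b0)
{
    uint  pointCount;
    uint  addition;     // index offset to the linked neighbour
    float restLength;   // spring rest length (assumed uniform)
};

// Half of the correction that restores restLength between a and b, seen from a.
float3 HalfCorrection(float3 a, float3 b)
{
    float3 delta = b - a;
    float  dist  = max(length(delta), 1e-6f);   // avoid division by zero
    return delta * (0.5f * (dist - restLength) / dist);
}

[numthreads(64, 1, 1)]
void CS_Constraint(uint3 DTid : SV_DispatchThreadID)
{
    uint id = DTid.x;
    if (id >= pointCount) { return; }

    ClothPoint p = PointsIn[id];
    float3 correction = float3(0, 0, 0);

    // Gather from both linked neighbours instead of scattering to them.
    if (id + addition < pointCount)
        correction += HalfCorrection(p.pos, PointsIn[id + addition].pos);
    if (id >= addition)
        correction += HalfCorrection(p.pos, PointsIn[id - addition].pos);

    p.pos += correction;
    PointsOut[id] = p;   // the only write this thread performs
}
```

After each dispatch the two buffers swap roles, so the next pass reads what this one wrote. In a 2D grid the neighbour test would also need to avoid linking across row ends, which this sketch leaves out.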

That’s it for now, if you have other questions let me know.

mburk · July 9, 2013, 5:21pm

Wow, thank you for the extensive answer! It’ll take some time for my brain to process that and try to implement the changes.
I’ll upload the new patch when it’s working.

Cheers

vux · July 10, 2013, 5:55pm

Working on a couple of cool bits for it too ;)

Noir · October 8, 2013, 10:29am

anything new?

ingo · May 3, 2015, 10:58pm

Same question here… anything new? THX

princemio · May 6, 2015, 12:50am

same question here

mburk · May 12, 2015, 11:01pm

Hi guys,
I kinda “solved” the threading problems by letting every thread compute only one point of the cloth.
See: wrapped-in-code
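
For readers who want the general shape of "one thread per point", here is a hedged sketch of a Verlet integration step. It is not the code from the linked contribution; the struct layout, the Params constant buffer and CS_Verlet are assumptions made for illustration.

```hlsl
struct ClothPoint
{
    float3 pos;
    float3 oldPos;   // position from the previous step (assumed field)
};

RWStructuredBuffer<ClothPoint> Points : register(u0);

cbuffer Params : register(b0)
{
    float3 acceleration;   // e.g. gravity
    float  dt;             // time step
    uint   pointCount;
    float  damping;        // 1.0 = no damping, slightly below 1.0 loses energy
};

[numthreads(64, 1, 1)]
void CS_Verlet(uint3 DTid : SV_DispatchThreadID)
{
    uint id = DTid.x;
    if (id >= pointCount) { return; }

    // Each thread reads and writes only its own point, so updating
    // in place is safe for this particular step.
    ClothPoint p = Points[id];

    // Position Verlet: the velocity is implied by the last two positions.
    float3 velocity = (p.pos - p.oldPos) * damping;
    float3 next     = p.pos + velocity + acceleration * dt * dt;

    p.oldPos = p.pos;
    p.pos    = next;
    Points[id] = p;
}
```

Because no thread touches another thread's point here, the outcome does not depend on the group size or on execution order.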