Spatial sorting on the gpu

digitalwannabe · July 30, 2021, 2:24pm

I’m trying to implement the counting sort algorithm for points in 3D with compute shaders.

Like other sorting algorithms it works by dividing space into cells, lets say 3x3x3. It then counts how many points are in each of the 27 cells (aka a histogram) and generates what is called a prefix sum, which for each cell stores the sum of all points in all cells before it. In other words, when using this to sort, it tells me the start index of where in my output buffer the points of this specific cell should go.

As a last step the algorithm cycles through all points and uses the point’s cell-ID to look up in the prefix sum buffer at which index to place the point in the output buffer. Finally, when a point from cell x gets assigned to the output buffer, the prefix sum value of cell x gets incremented by 1.
As a result the points are sorted into cells as intended.

I’m struggling with the final step/renderpass, which should be simple:
I got 2 shaders here, the first one simply copies my prefix sum buffer to a RWStructuredBuffer so I can write to it (in the other shader)…not sure if this is how you do that in general if you need a writeable buffer coming from another renderer, but seems to work:

StructuredBuffer<uint> Counter;  
RWStructuredBuffer<uint> OutputBuffer : COUNTERBUFFER;

[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
	if (tid.x >= elementCount)
		return;
	OutputBuffer[tid.x] = Counter[tid.x];
}

The second shader does the work. The struct ‘particleID’ stores point-ID and cell-ID for each point/particle:

StructuredBuffer<particleID> ParticleIDs;
RWStructuredBuffer<uint> Counter : COUNTERBUFFER; //prefix sum
RWStructuredBuffer<particleID> ResultBuffer : RESULTBUFFER;

[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
	if (tid.x >= elementCount)
		return;
		
	particleID partID = ParticleIDs[tid.x];
	uint oldval;
	InterlockedAdd(Counter[partID.cellID],1,oldval);
	ResultBuffer[oldval] = partID;
	
}

Unfortunately this doesn’t work. While the cell-IDs in the output buffer seem to be ok, the point-IDs flicker. There seems to be an issue with the InterlockedAdd(), but I don’t really have experience with atomic functions. Maybe the issue also is the copy in the first shader and when all of this is actually executed…

Any help/hint is much appreciated!

Example patch below
gpuSort.zip (7.3 KB)

thank you!

antokhio · August 1, 2021, 9:51am

Hi need to look on this, from description it doesn’t really sounds right.
First i don’t think interlocked add works that way:

InterlockedAdd(Counter[partID.cellID],1,oldval);

It is just an + (Spectral) for integers, basically you can do following:

uint counter;
[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
   if (condition)
      InterlockedAdd(counter, integer) <--- this would just add some integer value to counter...
}

There is also something called GroupShared Resource, witch is closer to what you are describing, basically you can define groups, and thread inside of groups, and each thread would be able to write to GroupShared resource…

So since I’m pretty much out of time anyways can suggest you following:
Group shared example:

Sort example:

digitalwannabe · August 1, 2021, 10:27am

hi @antokhio, thanks for helping out :)

as I’ve said I don’t really have experience with atomic functions, but a few steps before this one I’m creating the cell histogram, ie counting how many particles are in a cell, like so:

[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
	if (tid.x >= elementCount)
		return;
	
	uint oldval;
	InterlockedAdd(OutputBuffer[ParticleIDs[tid.x].cellID],1,oldval);
}

and this does work fine. I’m failing to see the difference here.

Concerning GroupShared Resource, i thought about that, but how would I initialize the shared resource with my prefix sum buffer?

Will have a look at your code tonight, thank you!

antokhio · August 1, 2021, 9:24pm

Ok, maybe, bit over looked since never found good use case for interlocked )
The function from “cell histogram” looks totally valid…
Briefly looked at the patch, gonna try to check tomorrow more closely…

digitalwannabe · August 5, 2021, 8:41pm

if you find the time it would be great, thank you :)

antokhio · August 11, 2021, 8:52am

Hi, I took a look on your patch again, there is still something broken and I’m not sure exactly what, in any case I suppose you have race conditions somewhere or major logic problem.

So let’s finalize what you are trying to do:
On first step you write your cell particle counts (e.g. binsizes) and getting something like instance start index.
Then your second shader takes one of this start indices write something particle on it and increment index.

So few obvious problems:

stride is 8 here, you can test by outputing your input buffer:

Second:
This is you instance start index:

This after interlocked Add:

notice that your index cannot be 20 in this case, since index starts from 0, 20 items index 19
this means, if index is broken on start you likely overwriting buffer

digitalwannabe · August 12, 2021, 9:31am

thanks for taking the time @antokhio!

you are obviously right about the stride- I have simplified the struct for this post and forgot to change that.

concerning indices: I have also suspected that, but I think this is correct. The last particle in cell 1 gets index 19 (oldval), then the buffer is incremented to 20 (but never used I figure):

uint oldval;
InterlockedAdd(Counter[partID.cellID],1,oldval);
ResultBuffer[oldval] = partID;

antokhio · August 12, 2021, 4:50pm

Played a bit more, so the jumping comes from “oldval” variable, as soon as it used the values jumping all over the place ;)
I guess the only way to do that, is to generate proper indexes for particles…

The GPU threads are not always evaluated one after each they all go in parallel, so if you add something on one thread the second thread not guaranteed to receive value after addition, however sometimes it would (race condition)… That’s so far the only guess i have…

I know the way to sort this in order, is to make an foreach loop in the shader, against binsize of cell, this would be slow but working…

However now I’m not even sure that it isn’t working, since all threads in parallel isn’t mean they are in order, so you can’t expect index to be the same between frames…

system · August 13, 2022, 1:23pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.