Compute buffer average

Hello,

I would like to use a compute buffer to calculate a single average of a few million vectors. How can I accomplish this? It seems like an easy task, but I can't find a way to avoid parallel threads writing over each other when summing the vectors, and I can't really find info on the topic. How is such a thing usually done on a GPU?

Thanks in advance

I don't know if I totally understand this, but for computing things on the GPU I use the dynamic buffer nodes and they work well. You can download the GPU-Z software to check your GPU's performance, see how much of your GPU memory you are using, etc.

@sabme For example, if I have 100 values, do some manipulations on them and save them to 100 values, then everything works as expected. The problem is when I have 100 values which I need to add together and save into 1 value.

Sorry if my explanation is hard to understand, I've been programming for 14 hours today and I'm pretty much brain-dead at this point.

Umm, I don't get it at all. Can you be more specific about what you are doing?

The dynamic buffer nodes have different data types: some receive 3D data and need 3 input values (you can use a Vector 3D to give 3 different data inputs), others take 4D, some receive raw values, etc. So maybe you are not using the correct node.

Hi. There are interlocked operations that are used for such stuff… You won't receive a single-slice buffer, but the first value of it would be your goal…

Maybe you can explain a little bit better what exactly you want to do (like bounds); there are also a few tricks around… As a starter, you can have a look at the vertex count example here: https://discourse.vvvv.org/uploads/default/original/2X/b/b74a1e11a9f079bf017c8d86629547e266848f74.zip
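
To illustrate the interlocked approach, here is a minimal sketch of what such a compute shader could look like (the buffer names, thread group size and fixed-point scale are just assumptions for illustration, not code from an existing patch). Note that HLSL's InterlockedAdd only works on int/uint, so the float components are scaled to fixed point before accumulating:

```hlsl
StructuredBuffer<float3> Input;   // the vectors to sum (hypothetical name)
RWStructuredBuffer<int> Accum;    // Accum[0..2] hold the fixed-point component sums
uint Count;                       // number of vectors in Input

#define FIXED_SCALE 1024.0f       // fixed-point scale, chosen arbitrarily here

[numthreads(64, 1, 1)]
void CS_Sum(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= Count)
        return;

    // convert the vector to fixed point so it can be added atomically
    int3 v = (int3)(Input[dtid.x] * FIXED_SCALE);

    // atomic adds serialize access: only one thread at a time can update a slot,
    // which is why this gets slow for millions of slices
    InterlockedAdd(Accum[0], v.x);
    InterlockedAdd(Accum[1], v.y);
    InterlockedAdd(Accum[2], v.z);
}
```

Afterwards you read back the three sums and divide by Count * FIXED_SCALE to get the average.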

@everyoneishappy made InstanceNoodles, which contains a large collection of compute shaders. Seems to me like this is what you need.

It's not an easy task for a compute shader. Compute shaders want to calculate everything in parallel, and building a running sum over all slices is one of the worst things you can do to a GPU: it cannot calculate anything in parallel if you use these locks. It can only use one core while all other cores have to wait, which means your whole frame is blocked.

There is a lot of theory on how this can be achieved on the GPU. The efficient way to go is to make multiple passes and reduce the sums in thread groups step by step in each pass, in a tree-like manner. Check the "Parallel Reduction" section in this PDF if you want to know more about the theory:

But for simplicity, just try the node from InstanceNoodles and see whether it's fast enough for you.
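
For reference, here is a rough sketch of what a single pass of such a parallel reduction could look like (buffer names, the group size and the dispatch scheme are assumptions for illustration, not the InstanceNoodles implementation). Each pass collapses up to 256 values into one partial sum per thread group; you dispatch it again on its own output until only one sum is left, then divide by the original count to get the average:

```hlsl
#define GROUP_SIZE 256

StructuredBuffer<float3> Input;     // values to reduce in this pass
RWStructuredBuffer<float3> Output;  // one partial sum per thread group
uint Count;                         // number of valid elements in Input

groupshared float3 partialSum[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CS_Reduce(uint3 dtid : SV_DispatchThreadID,
               uint gi : SV_GroupIndex,
               uint3 gid : SV_GroupID)
{
    // each thread loads one element into group shared memory (zero-pad past the end)
    partialSum[gi] = dtid.x < Count ? Input[dtid.x] : (float3)0;
    GroupMemoryBarrierWithGroupSync();

    // tree-like reduction inside the group: halve the active threads each step
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (gi < s)
            partialSum[gi] += partialSum[gi + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // thread 0 writes this group's partial sum
    if (gi == 0)
        Output[gid.x] = partialSum[0];
}
```

This way all cores stay busy within each pass and there are no atomics at all; a few million values only need a handful of passes.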

@velcrome @tonfilm Thank you for the answers! It was very helpful, just what I needed.
