Pipet vs Compute Pipet

A patch showing DX9 vs DX11 pipet performance.
Ctrl+F9 to show ticks; DX9 goes faster than DX11.
I'd love a way to improve this, as it's slowing down a patch I have to pipet into from a DX11 texture!

dz9 vx dx11 (7.8 kB)

Sorry, missing module.

fixed file paths (9.5 kB)

Right, just tested, and I'm not using Ctrl+F9 since it's not the way to monitor performance in this use case; see all the random threads about Info (DX9.Texture).

Remove VSync + high mainloop.

Queue on FPS (average of 100 frames).

DX9 : 45fps average
DX11 : 53fps average
DX11 (outputting data as values, skipping vector split + RGB join): 90 fps average.

Also please note that the warp size in the compute shader is way too low for that spread count, but I'll keep this as an exercise for the reader ;)

On another note:

  • The Map operation can be done directly in the compute shader before sampling (since it's your most consuming operation).
  • Change warp size: numthreads(1, 1, 1) is only for little processing; change to numthreads(64, 1, 1) and update your dispatcher accordingly.
  • If you don't sample alpha, set your buffer as float3, which will reduce your bandwidth (not sure that will give you a decent gain, but who knows).
  • Color nodes are slow as hell, just avoid them; better to work with Value nodes.
  • If you just want to sample on a LinearSpread/Cross, offset that in the compute shader too, and get rid of them on the CPU side (including the dynamic buffer).
  • Use Zip/Unzip instead of Vector join/split, they are much faster (note to devvvvs: can we have Vector join/split nodes using the same code as Zip/Unzip? :)
  • Make your own ReadBack that doesn't deal with layout and outputs color directly.
  • Don't read timings on GPU->CPU nodes and take them for granted (that goes for both dx9 and dx11). Technically, don't take any node timing for granted unless you know exactly what the node is doing (and what it involves under the hood).
  • Avoid using GPU->CPU nodes unless you know what it implies.
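Several of these points can be folded into a single kernel. A minimal sketch of what that could look like (texture, buffer, and parameter names here are my own assumptions, not the actual patch):

```hlsl
Texture2D InputTex;
SamplerState LinearSampler;
// float3 instead of float4: alpha is dropped to save bandwidth
RWStructuredBuffer<float3> OutputBuffer : register(u0);

cbuffer cbParams : register(b0)
{
    uint ElementCount;   // number of samples (the former spread count)
    float2 MapFrom;      // Map folded into the shader:
    float2 MapTo;        // sample positions remapped on the GPU
};

[numthreads(64, 1, 1)]   // 64 threads per group instead of 1,1,1
void CS_Pipet(uint3 id : SV_DispatchThreadID)
{
    // Guard: the dispatch count is rounded up, so the last group may overshoot
    if (id.x >= ElementCount) { return; }

    // LinearSpread generated in the shader: no CPU-side dynamic buffer needed
    float t = (id.x + 0.5f) / (float)ElementCount;

    // Map before sampling
    float2 uv = lerp(MapFrom, MapTo, t);

    OutputBuffer[id.x] = InputTex.SampleLevel(LinearSampler, uv, 0).rgb;
}
```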

And one last one: you are not sending the colors back to your GPU later, right? ;)


@vux this post could be part of the wiki (is there documentation about dx11? need to check). Good checklist of what to do and what not to.

Thanks for the info. I was looking at ticks as I have several patches I switch between, and the frame rate varies not only between patches but on the same patch too; the bottleneck seems to be the GPU, but I'll bear your advice in mind!
Colours are being sent via a network, encoded as jitter frames. I'll look at altering the plugin to use float3 or 4 instead of colour; I did look at getting rid of the alpha, but it didn't make any difference.

Change warp size, numthreads(1, 1, 1) is only for little processing, change to numthreads(64, 1, 1) and update your dispatcher accordingly.
Do I just divide thread x by 64 here, and round up to the next int?

Not sure how to do Cross in the shader, but I can always subpatch the Cross and Map, and evaluate them to 0 ;)

Make your own readback,
I think you might be teasing here, you are aware of my coding skills!?

And I do avoid GPU to CPU as much as I can; it's just that sometimes gfx need to get sent to LEDs, for example, as in this case!

Here: DX9 55fps vs. DX11 40fps.
ML 120 filter, GPU load 5%, CPU single core 45%,
and I don't know a way to boost it.
Wishing for a PipetForDummies: mapping shader, no alpha sampling, zip/unzip…

That was what the link was speaking about, by the way. I'm making documentation right now, since I've had this question about timings about 200 times at least (that includes both dx9 and dx11 ;)

And here is a little docu:


Please note that custom background colors for nodes would also be really useful for tutorials (and for custom debugging) ;)

For warp size it's exactly as you mentioned. So your shader will likely be something like 10x faster; using 1,1,1 is only for very specific cases (eg: simple examples, or some more advanced stuff like batch generation).
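So the dispatch count is a ceiling division, which in integer arithmetic is usually written as shown in this sketch (ElementCount is an illustrative name, not from the patch):

```hlsl
// CPU side: number of thread groups for numthreads(64, 1, 1)
//   dispatchX = (ElementCount + 63) / 64   (integer division = round up)
// e.g. 40000 elements -> (40000 + 63) / 64 = 625 groups = 40000 threads exactly;
// if the count is not a multiple of 64, the spare threads must be guarded:

[numthreads(64, 1, 1)]
void CS(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= ElementCount) { return; } // skip the rounded-up remainder
    // ... per-element work ...
}
```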

If you want to do a dynamic cross you can also use a warp of 8,8,1, which will simplify your calculations, but you need to add a bounds check in that case.
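A sketch of what that bounds check looks like with a 2D warp (Width, Height, and the resource names are assumptions for illustration):

```hlsl
[numthreads(8, 8, 1)]   // 2D groups: dispatch ceil(Width/8) x ceil(Height/8) x 1
void CS_Cross(uint3 id : SV_DispatchThreadID)
{
    // Both axes are rounded up independently, so both need a guard
    if (id.x >= Width || id.y >= Height) { return; }

    // id.xy maps directly onto the cross grid
    float2 uv = (id.xy + 0.5f) / float2(Width, Height);

    // Flatten back to the 1D output buffer
    OutputBuffer[id.y * Width + id.x] = InputTex.SampleLevel(LinearSampler, uv, 0).rgb;
}
```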

If you set the Apply pin for the dynamic buffer to 0, your Cross/Map will not be evaluated; no need for a subpatch.

And yeah I expected (actually I hoped) that you would use it for something like that ;)

@vux: dead links

Yeah, the website converts links into something which doesn't work (go figure). Updated the first one; the second one still has an ACL issue.

Wasn't there a who-builds-the-fastest-pipet challenge?

I also tried a warp of 8,8,1 - but it doesn't seem to affect performance noticeably (6.4 kB)

Ah I remember that quote!

Quick check on yours: you have a warp of 128 but your dispatch divides by 64.

Also, since you changed float4 to float3, you need to change the stride parameter to 12.

But yeah that’s the concept :)

Please note that as long as you don't use groupshared, a warp size > 64 will not give you a big gain (and you don't need it in this case).

A group of 8,8,1 will change the access pattern and give you 2D coords instead of 1D for DispatchThreadID. It can be useful in some cases, but since this one is really basic and processes only 40000 elements, you won't see any noticeable difference.

For ReadBack, the output is not much optimized (due to forcing layout).
It's recommended to use a custom dedicated output (and it would be much better if vvvv was using floats; I've heard that one a few times somehow ;)

Attached is C# code for faster, dedicated Vector3/Vector4 versions.

160 fps now ;)

ReadBackVector3.txt (3.1 kB)
ReadBackVector4.txt (3.2 kB)

It’s a genuine joy to read forum posts by you Vux :)

Quick check on yours: you have a warp of 128 but your dispatch divides by 64.
Also, since you changed float4 to float3, you need to change the stride parameter to 12.

Ah, those were some testing leftovers. I experimented with different dispatchers, because in my experience those optimums differ if you use another GPU - for example, a few powerful cores (like on my 580) or many more cores with a bit less power each (like in a Titan).
Of course, my final setup was a stride of 12 and 64,1,1 in the CS. With that, I got 60fps with dx11 (around 43 with the dx9 pipet).
With 8,8,1 or similar, I wasn't sure how to use this …id.x… does dispatchThreadId become some kind of 2-dimensional array then? And if yes, how do I use it?

What I don't understand is why your plugin is so much faster. Isn't it doing the same/similar things/functions, just in a different way? Is the bottleneck elsewhere (perhaps in the nodes after the readback, dealing with very high spreadcounts)? And what's that double*? I've never seen that before… and it's hard to google.

For ReadBack, the output is not much optimized (due to forcing layout).
This is another one I don't understand… forcing layout?

@sebl :

– And, what’s that double*?
It’s a pointer to the output pin data, so if you google pointer instead, you’ll have enough reading for a lifetime about it :)

– Forcing layout
– Isn't it doing the same/similar things/functions, just in a different way?
Since ReadBack is generic, I can't as easily predict what data / how many pins I have to output, so there's a decent amount of overhead dealing with that. Since here I know I want to output Vector3, I can get rid of a decent amount of code and make a much faster loop.

– About the access pattern: dispatchThreadId is a uint3, so if you use 8,8,1 you'll be able to use i.xy instead of i.x. If you use 64,1,1, i.y and i.z will always be 0. So yes, it's more or less I (Spreads) with Cross :)
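A tiny sketch of what that means in practice (Width, meaning elements per row, is an assumed constant):

```hlsl
[numthreads(8, 8, 1)]
void CS(uint3 i : SV_DispatchThreadID)
{
    // i is a plain uint3, not a 2D array: each component counts along one axis.
    // With 8,8,1 both i.x and i.y vary, so flatten them for a 1D buffer:
    uint index = i.y * Width + i.x;   // Width = elements per row (assumed)
    // With 64,1,1 this reduces to index = i.x, since i.y is always 0.
}
```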

Attached is a little example showing some basic spread operations using either a 64 or 8x8 warp size, some kind of morning gymnastics. Please note that I don't use Group semantics yet, otherwise you could do a bit more brain twisting.

@mrboni : Pleasure ;) (4.3 kB)

I shall look forward to researching through all this after my gig tonight! Thank you for your replies, Vux; I think your tutorials will be welcomed by many!
I'm loving where I've got to with dx11, but there's still much to learn, and googling doesn't always turn up the information I need; the msdn site is a little dry on explanations…

The ReadBackVector nodes are super nice! I've just put them in a dynamic plugin, so one can easily change the vector format in the patch and still gain from the performance optimizations.

thanks for the detailed explanations!

@sebl, I just tried doing that: I added the references via Ctrl+J, but despite the plugin editor not showing errors, the plugin is red…
and TTY gives:
00:13:28 ERR : System.Runtime.InteropServices.InvalidComObjectException in VVVV.DX11.Lib: COM object that has been separated from its underlying RCW cannot be used.

@cat: yes, I had that issue in the beginning, but deleting the plugin and recreating the node fixed the problem here.

DX11.BufferReadbackDynamic.7z (112.2 kB)