I’m trying to do an audio analysis using the FFT (DShow9) node in order to find the keys in an audio recording. I was trying to match audio of a human voice singing against, let’s say, this guitar and piano key chart:
and hopefully plot the keys from the voice.
However, I don’t have much understanding of audio analysis (I was tasked to do this because it has something to do with my research project). I looked at vux’s FFT documentation and gave it a go, but I’m not sure if I’m getting the correct output.
You can find my example patch in the attachment.
I didn’t include the audio file. Just letting you guys know in case you thought it didn’t work because I didn’t put it in :)
Basically the patch is attempting to find the frequency from an FFT bin and vice versa. But I’ve got the feeling that I’m doing it wrong. Please have a look and comment a little bit.
Also, if there’s somebody out there who could provide a good explanation of how to use the FFT (DShow9) node and what its pins represent, that would be great.
Any help is much appreciated!
fft_bin_search03.v4p (82.4 kB)
vux doc ftttricks is indeed a great explanation of how it all works.
You set spread count to a power of 2 value, for example I used to use 1024.
The FFT gives you the energy at specific frequencies, but grouped into bins. So you’ll get 1024 bins, of which the second 512 mirror the first 512 and so are not needed. So how do you find the right bin when you need a specific frequency?
Adapting vux’s great explanation: if our FFT has 1024 bins and our file’s sample rate is 44100 Hz, then the bin for 440 Hz is:
1024*440/44100 = 10.22
That means you’ll need bin 10 to get the energy for the note A at 440 Hz.
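As a quick sanity check, the same arithmetic can be sketched in Python. This is only an illustration of the formula above, not of how the vvvv node works internally; the function names and the default values (1024 bins, 44100 Hz) are assumptions taken from this example.

```python
def frequency_to_bin(frequency_hz, fft_size=1024, sample_rate=44100):
    """Return (fractional bin, nearest whole bin) for a frequency in Hz."""
    exact = fft_size * frequency_hz / sample_rate
    return exact, round(exact)


def bin_to_frequency(bin_index, fft_size=1024, sample_rate=44100):
    """Return the frequency in Hz that a bin index corresponds to."""
    return bin_index * sample_rate / fft_size


exact, nearest = frequency_to_bin(440.0)
print(f"440 Hz -> bin {exact:.2f}, nearest whole bin {nearest}")
print(f"bin {nearest} corresponds to {bin_to_frequency(nearest):.1f} Hz")
```

Going back from bin 10 to a frequency does not land exactly on 440 Hz, because each bin covers a range of frequencies rather than a single one.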
Hi, thanks for the reply and I really appreciate your extended explanation of vux doc ftttricks.
However, I think I should have phrased my question differently. The thing is, I get the theory and the calculation part of vux’s FFT explanation. What I don’t get is what the output values from the FFT pins (the FFT L and FFT R pins) represent. Are they the frequency values from the audio, or something else?
P.s: I’m a v4 and audio analysis noob, so please bear with me :)
what you get from the FFT node is frequency values. it basically takes a number of audio samples and analyzes which frequencies are contained in them.
the interesting thing is: it will always cover the whole audible spectrum from 0 Hz (which is not audible) to 22 kHz.
now the frequency resolution over the spectrum depends on the number of audio samples you feed the FFT node (which is basically the spread count).
e.g.: if your spread count is 8, you will get 8 frequency bands from the FFT node and the bands will be quite broad, because you get only 8 bands to cover the whole spectrum.
-> 22000/8 = 2750, that means each band will be 2750 Hz wide and you’ll get the following bands as output:
band 0: 0Hz to 2750Hz
band 1: 2750Hz to 5500Hz
band 2: 5500Hz to 8250Hz
...
band 7: 19250Hz to 22000Hz
if you set your spread count to 16, the calculation is as follows:
-> 22000/16 = 1375
band 0: 0Hz to 1375Hz
band 1: 1375Hz to 2750Hz
band 2: 2750Hz to 4125Hz
...
band 15: 20625Hz to 22000Hz
so the frequency resolution is getting better with increasing spread count and the values output by the node represent the energy present in each band.
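The band tables above can be reproduced with a few lines of Python. This is just a sketch of the simplified model used in this post (a 22000 Hz spectrum divided evenly by the spread count); `band_edges` and `top_hz` are made-up names for illustration, not anything from vvvv.

```python
def band_edges(spread_count, top_hz=22000.0):
    """Split 0..top_hz into spread_count equal bands as (low, high) pairs."""
    width = top_hz / spread_count
    return [(i * width, (i + 1) * width) for i in range(spread_count)]


for count in (8, 16):
    bands = band_edges(count)
    print(f"spread count {count}: each band is {bands[0][1]:.0f} Hz wide")
    for index, (low, high) in enumerate(bands):
        print(f"  band {index}: {low:.0f}Hz to {high:.0f}Hz")
```

Doubling the spread count halves the width of every band, which is the improvement in resolution described above.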
now the bad news:
you won’t be able to do pitch or key tracking with the FFT node alone, which is what you would need for your application. an FFT is not exact enough for this, and you will have a hard time finding the fundamental frequency because of all the harmonics in a real-world pitched sound.
this becomes quite a complex topic quickly and there is lots of research going on with tremendous improvements over the last years, but this is something that you won’t be able to cover as an “audio noob”. also, VVVV is probably the wrong software, you’d better look into PD or Max or other more audio-related software.
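A rough way to see the resolution part of the problem, as a sketch only: this assumes the bin-width relation sample rate / FFT size from the 1024-bin example earlier in the thread, and standard equal temperament with A4 = 440 Hz. The function names are made up for illustration. It compares the distance between neighbouring semitones with the width of one bin; it does not model the harmonics issue.

```python
def note_frequency(midi_note):
    """Equal-tempered frequency in Hz, with A4 (MIDI note 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def bin_width(fft_size=1024, sample_rate=44100):
    """Width of one FFT bin in Hz (assumed: sample_rate / fft_size)."""
    return sample_rate / fft_size


width = bin_width()
print(f"bin width: {width:.2f} Hz")
for name, midi in (("A2", 45), ("A3", 57), ("A4", 69), ("A5", 81)):
    freq = note_frequency(midi)
    gap = note_frequency(midi + 1) - freq  # distance to the next semitone up
    verdict = "narrower" if gap < width else "wider"
    print(f"{name}: {freq:.1f} Hz, next semitone {gap:.2f} Hz away "
          f"({verdict} than one bin)")
```

Under these assumptions a bin is about 43 Hz wide, while A4 and the semitone above it are only about 26 Hz apart, so neighbouring notes in the typical singing range can fall into the same bin.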
i hope i could shed some light on this issue.
Thank you for the helpful explanation, it’s much clearer now! Appreciate it!