From Omar's notebook.
If you pull up a tuner app on your phone and play an A note on a piano, the tuner will say ~440 Hz. Do the same with a violin's A note and you'll also get ~440 Hz.
This seemed impossible based on my understanding of sound, which was:
So if two sound waves have exactly the same frequency, according to my tuner app, they cannot possibly sound different. But they do! Does that mean there's something else, something more than frequency that can differentiate sound?
There isn't! That's all sound is - vibrations in the air of some specific frequency.
Was the tuner lying? Yes, kind of: when we say a sound is 440 Hz, that's not actually a complete description of the sound. It's kind of like telling you a shape has 4 points: you might assume it's a square, but it could be a rectangle, or some completely wacky irregular polygon.
Here's an A note from a piano and a violin that I recorded from online virtual instruments.
Click "Play" to hear them. Below each is a small slice of the audio.
Drag and drop your own sound file in these diagrams to use it throughout the article.
They look structurally similar, but they're not the same sound wave. They both have a pattern that repeats ~4 times in 9 milliseconds (or ~440 times per second). That's what makes them both 440 Hz.
But the repeating pattern itself is different. This is why saying "a sound is 440 Hz" isn't a complete description, so assumption (3) was a misunderstanding on my part.
Here is what the sound would look like if the speaker vibrated exactly 440 times per second in a continuous, regular motion.
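As a rough sketch of what that regular motion looks like as data, here is a pure 440 Hz tone generated in Python. It assumes the regular motion is a sine wave and that audio is sampled 44100 times per second; `sine_wave` and `SAMPLE_RATE` are names made up for this illustration and are not part of the article's demo.

```python
import math

SAMPLE_RATE = 44100  # samples per second (an assumed, commonly used rate)


def sine_wave(frequency_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return the samples of a pure tone as floats between -1 and 1."""
    n_samples = int(round(duration_s * sample_rate))
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]


# A 9 ms slice of a pure 440 Hz tone, comparable to the slices shown above.
slice_9ms = sine_wave(440.0, 0.009)
print(len(slice_9ms), "samples in the slice")
```

Every value in the list is one sample of the waveform, so plotting the list against its index would give the smooth, regular curve.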
The fact that all 3 of these sound different, even though they are all made of a pattern that repeats at 440 Hz, tells me I need to tweak assumption (1). Our ears don't pick up on just a single frequency in the air. If they did, these sounds would be indistinguishable (which I think is what actually happens with our eyes; see metamerism).
So our ears can detect the internal structure of these patterns. To decompose this structure on a computer, we'll use the Fourier transform.
The Fast Fourier Transform (FFT) algorithm allows you to extract frequencies in a signal.
Our signal in this case is a list of amplitudes: the height of the waveform at every sample.
When you run the signal through it, as in FFT(signal), you get a list of numbers that corresponds to a list of frequencies. For example, if the FFT output looks like this:
[0.001, 0.1, 0.7, 0, 12, 0, 2, 1]
You'll also have a corresponding list of frequencies:
[0, 55, 110, 220, 440, 880, 1320, 1760]
We interpret this to mean 440 is the most significant frequency in this signal: the largest FFT coefficient is 12, it's the 5th number in the output, and the 5th frequency in our list is 440.
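That lookup can be sketched in a few lines of Python, using the two example lists from above. The variable names are made up for this illustration.

```python
# The example FFT output and its matching list of frequencies from the text.
fft_output = [0.001, 0.1, 0.7, 0, 12, 0, 2, 1]
frequencies = [0, 55, 110, 220, 440, 880, 1320, 1760]

# Find the position of the largest coefficient, then read off the frequency
# stored at the same position in the other list.
peak_index = max(range(len(fft_output)), key=lambda i: fft_output[i])
peak_frequency = frequencies[peak_index]

print(peak_index, peak_frequency)  # prints "4 440" (index 4 is the 5th item)
```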
I've found it easiest to interpret the results of FFT by putting them in a table of the original sound frequencies and their relative weights. To get the relative weights, I divide every coefficient by the largest one and multiply by 100.
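Here is a minimal sketch of that normalization in Python, applied to the example lists from above. `relative_weights` is a hypothetical helper name, and the cut-off at a weight of 1 mirrors the note under the tables in this article.

```python
fft_output = [0.001, 0.1, 0.7, 0, 12, 0, 2, 1]
frequencies = [0, 55, 110, 220, 440, 880, 1320, 1760]


def relative_weights(coefficients):
    """Scale the coefficients so that the largest one becomes 100."""
    largest = max(coefficients)
    return [100 * c / largest for c in coefficients]


weights = relative_weights(fft_output)

# Build the table rows, leaving out frequencies whose weight is below 1.
table = [
    (freq, round(weight, 1))
    for freq, weight in zip(frequencies, weights)
    if weight >= 1
]

for freq, weight in table:
    print(freq, weight)
```

With these numbers, 440 ends up with a weight of 100 and every other row is read as a percentage of it.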
Below are the results of running the FFT on our two 9 ms slices. Click to expand the chosen slices. You can also scroll back up to change the slices by hand.
(10 ms slice)
Freq. (Hz) | Relative Weight |
---|---|
N/A | N/A |
(10 ms slice)
Freq. (Hz) | Relative Weight |
---|---|
N/A | N/A |
Frequencies with weight < 1 are omitted from this table.
These tables tell us that both sounds do have ~440 Hz as the strongest frequency, but there are other frequencies inside too! The violin appears a lot more complex, in that it seems to contain many more frequencies that contribute significantly.
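To make the table-building step concrete, here is a sketch in Python that goes from raw samples to a frequency table. It makes several assumptions: it uses a slow, naive discrete Fourier transform instead of a real FFT (the results are the same, only slower), it takes the magnitude of each coefficient as its weight, and it runs on a synthetic signal rather than the recorded piano or violin. The sample rate and slice length are picked so that each tone falls exactly on one frequency bin; a real recording usually won't line up this neatly, so its energy smears across neighbouring rows. All names here are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a list of samples."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_table(signal, sample_rate, min_weight=1.0):
    """Return (frequency in Hz, relative weight) rows for a non-silent signal."""
    n = len(signal)
    # For real-valued input the upper half of the spectrum mirrors the lower
    # half, so only the first n // 2 bins are needed.
    magnitudes = [abs(c) for c in dft(signal)[: n // 2]]
    largest = max(magnitudes)
    rows = []
    for k, magnitude in enumerate(magnitudes):
        weight = 100 * magnitude / largest
        if weight >= min_weight:
            # Bin k corresponds to k * sample_rate / n Hz.
            rows.append((k * sample_rate / n, round(weight, 1)))
    return rows


SAMPLE_RATE = 8800  # assumed, so that the bins are spaced exactly 110 Hz apart
N = 80              # 80 samples at 8800 samples per second is about 9 ms

# A made-up sound: a 440 Hz tone plus two quieter tones at 880 and 1320 Hz.
signal = [
    1.00 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    + 0.50 * math.sin(2 * math.pi * 880 * t / SAMPLE_RATE)
    + 0.25 * math.sin(2 * math.pi * 1320 * t / SAMPLE_RATE)
    for t in range(N)
]

rows = frequency_table(signal, SAMPLE_RATE)
for freq, weight in rows:
    print(freq, weight)
```

The strongest row is 440 Hz, and the two quieter tones show up as additional rows with smaller weights, which is the same shape of result as the tables above.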
The last step is reconstructing the sound from this table. Representing sound this way is really powerful. This is the basis for a lot of sound analysis/transformations like:
Below is a sandbox for you to explore these ideas of sound reconstruction.
The code returns an outputMultipliers array that scales the original frequencies. A new sound is reconstructed from those new frequencies/weights. You can try zeroing out all the high frequencies, or isolating a single frequency.
Freq. (Hz) | Relative Weight |
---|---|
N/A | N/A |
Frequencies with weight < 1 are omitted from this table.
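For readers without the interactive sandbox, the same idea can be sketched offline in Python. This is not the sandbox's code: `output_multipliers` is only an analogue of the outputMultipliers array, the transform is a naive discrete Fourier transform rather than an FFT, and the input is a synthetic two-tone signal with an assumed sample rate chosen so the tones fall exactly on frequency bins.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a list of samples."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform; the real part is the reconstructed signal."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def bin_frequencies(n, sample_rate):
    """Frequency in Hz of each bin; bins above n / 2 mirror the ones below."""
    return [min(k, n - k) * sample_rate / n for k in range(n)]


SAMPLE_RATE = 8800  # assumed, so that 440 Hz and 880 Hz land exactly on bins
N = 80              # about 9 ms of audio at this sample rate

signal = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 880 * t / SAMPLE_RATE)
    for t in range(N)
]

# Keep every frequency below 600 Hz and zero out everything above it.
output_multipliers = [
    1.0 if freq < 600 else 0.0 for freq in bin_frequencies(N, SAMPLE_RATE)
]

spectrum = dft(signal)
scaled = [c * m for c, m in zip(spectrum, output_multipliers)]
reconstructed = inverse_dft(scaled)
```

Because a mirrored bin gets the same multiplier as its partner, the reconstructed signal stays real-valued. In this example the 880 Hz tone is removed and `reconstructed` is very close to a pure 440 Hz sine wave.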
So the correct version of assumption (3) is: when we say a sound is 440 Hz, we mean that's the frequency with the most weight in the signal. To give a complete description of the sound you need to know (1) all the frequencies it's made of and (2) how much of each frequency to use.
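As a small illustration of that description, the sketch below builds a waveform directly from a table of frequencies and weights. The two tables are made-up numbers, not measurements of a piano or violin, and `synthesize` is a hypothetical helper. It also assumes every component starts at the same point in its cycle, which is a simplification: the full FFT output carries a phase for each frequency as well. The sample rate of 44000 is chosen only so that one 440 Hz cycle is exactly 100 samples long.

```python
import math


def synthesize(table, duration_s, sample_rate=44000):
    """Add up one sine wave per (frequency in Hz, weight) row of the table."""
    n_samples = int(round(duration_s * sample_rate))
    return [
        sum(
            weight * math.sin(2 * math.pi * freq * n / sample_rate)
            for freq, weight in table
        )
        for n in range(n_samples)
    ]


# Two made-up tables that both have 440 Hz as their strongest frequency.
pure_table = [(440, 100)]
richer_table = [(440, 100), (880, 40), (1320, 25)]

pure_wave = synthesize(pure_table, 0.01)
richer_wave = synthesize(richer_table, 0.01)

# Both waves repeat every 100 samples (440 times per second) ...
repeats = all(
    abs(richer_wave[n] - richer_wave[n + 100]) < 1e-6
    for n in range(len(richer_wave) - 100)
)
# ... but the pattern that repeats is not the same.
largest_difference = max(abs(a - b) for a, b in zip(pure_wave, richer_wave))
print(repeats, largest_difference > 1)
```

A tuner would call both of these 440 Hz, yet the sample values differ, which is the piano-versus-violin situation in miniature.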
I created this to learn how FFT works. This is the end-to-end demonstration I was looking for to help me understand it. It's best used to test your understanding while reading other materials. I don't have the source code up yet but if you'd like to extend this or use it as a teaching tool just let me know!
You can drag and drop any sound file onto the very first two figures and all figures will update to use it, including the FFT tables and the code sandbox. Here are a few notes from Philharmonia you can try dragging in:
A few key points I needed for a correct implementation:
A few insights I learned from exploring the code sandbox:
Below is a list of resources I used while writing this article. Special thanks to the Algorithm Archive Discord community for helping me understand how to interpret the output of FFT in terms of the original signal. And thanks to Better Explained for having the only explanation of FFT that ever made sense to me.