If you pull up a tuner app on your phone and play an A note on a piano, the tuner will say ~440 Hz. Do the same thing with a violin's A note and you'll also get 440 Hz.
This seemed impossible based on my understanding of sound, which was:
1. Our ears perceive sound by picking up on the frequency of the air vibrating around us.
2. Two sounds are different if they have different frequencies, in the same way two colors appear different if their light has different wavelengths.
3. When we say a sound is 440 Hz, that means a point vibrating along that sound wave makes a complete cycle 440 times per second.
So if, according to my tuner app, two sound waves have exactly the same frequency, they cannot possibly sound different. But they do! Does that mean there's something else, something more than frequency, that can differentiate sounds?
There isn't! That's all sound is - vibrations in the air of some specific frequency.
Was the tuner lying? Yes, kind of: when we say a sound is 440 Hz, that's not actually a complete description of the sound. It's kind of like telling you a shape has 4 corners. You might assume it's a square, but it could be any rectangle, or some completely wacky irregular polygon.
Visualizing a couple notes
Here's an A note from a piano and a violin that I recorded from online virtual instruments.
Click "Play" to hear them. Below each is a small slice of the audio.
Drag and drop your own sound file in these diagrams to use it throughout the article.
They look structurally similar, but they're not the same sound wave. They both have a pattern that repeats ~4 times in 9 milliseconds (or ~440 times per second). That's what makes them both 440 Hz.
But the repeating pattern itself is different. This is why saying "a sound is 440 Hz" isn't a complete description, so my assumption (3) was a misunderstanding.
Here is what the sound would look like if the speaker vibrated exactly 440 times per second in a continuous, regular motion.
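For readers who want to generate such a signal themselves, here is a minimal sketch in Python. It assumes that "continuous, regular motion" means a plain sine wave and that the audio is sampled 44100 times per second; the function name `pure_tone` is made up for this illustration and is not part of the article's code.

```python
import math

SAMPLE_RATE = 44100  # samples per second (an assumption; common for audio)


def pure_tone(frequency_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a list of sine-wave samples, one amplitude per sample, in [-1, 1]."""
    n_samples = round(duration_s * sample_rate)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]


# A 9 ms slice of a pure A note.
slice_9ms = pure_tone(440, 0.009)
print(len(slice_9ms))  # 397 samples

# How many samples one full cycle of the pattern takes at this sample rate.
samples_per_cycle = SAMPLE_RATE / 440
print(round(samples_per_cycle, 2))
```

A list like `slice_9ms` is the kind of signal the rest of the article feeds into the FFT.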
The fact that all 3 of these sound different, even though they are all made of a pattern that repeats at 440 Hz, tells me I need to tweak assumption (1). Our ears don't pick up on just a single frequency in the air. If they did, these sounds would be indistinguishable (which I think actually happens for our eyes; see metamerism).
So our ears can detect the internal structure of these patterns. To decompose this structure on a computer, we'll use the Fourier transform.
Performing the Fourier Transform
The Fast Fourier Transform (FFT) algorithm allows you to extract frequencies in a signal.
Our signal in this case is a list of amplitudes: one number per sample, recording how far the sound wave is displaced at that instant.
When you run the signal through it, like FFT(signal), you get a list of numbers that correspond to a list of frequencies. For example, if the FFT output looks like this:
[0.001, 0.1, 0.7, 0, 12, 0, 2, 1]
You'll also have a corresponding list of frequencies:
[0, 55, 110, 220, 440, 880, 1320, 1760]
We interpret this to mean 440 is the most significant frequency in this signal, because its FFT coefficient is 12 (the largest), and it's the 5th number. And the 5th frequency in our list is 440.
I've found it easiest to interpret the results of the FFT by putting them in a table of the original sound frequencies and their relative weights. To get the weights, I divide all the coefficients by the largest one and multiply by 100.
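As a rough sketch of that normalization, here is the example output from above turned into a weight table in Python. The function name `to_weight_table` and its `min_weight` parameter are hypothetical, and the coefficients are assumed to already be magnitudes (real FFT libraries typically return complex numbers whose absolute value you would take first).

```python
# The example FFT output and matching frequency list from the text above.
coefficients = [0.001, 0.1, 0.7, 0, 12, 0, 2, 1]
frequencies = [0, 55, 110, 220, 440, 880, 1320, 1760]


def to_weight_table(frequencies, coefficients, min_weight=1.0):
    """Scale coefficients so the largest becomes 100; drop rows below min_weight."""
    largest = max(coefficients)
    table = []
    for freq, coeff in zip(frequencies, coefficients):
        weight = 100 * coeff / largest
        if weight >= min_weight:
            table.append((freq, round(weight, 1)))
    return table


table = to_weight_table(frequencies, coefficients)
print(table)  # [(110, 5.8), (440, 100.0), (1320, 16.7), (1760, 8.3)]

# The row with the largest weight is the most significant frequency.
dominant = max(table, key=lambda row: row[1])[0]
print(dominant)  # 440
```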
Below are the results of running the FFT on our two 9 ms slices. Click to expand the chosen slices. You can also scroll back up to change the slices by hand.
Piano A note (10 ms slice)
Violin A note (10 ms slice)
Frequencies with weight < 1 are omitted from this table.
These tables tell us that both sounds do have ~440 Hz as the strongest frequency, but there are other frequencies inside too! The violin appears a lot more complex, in that it seems to contain a lot more frequencies that contribute significantly.
The last step is reconstructing the sound from this table. Representing sound this way is really powerful. This is the basis for a lot of sound analysis/transformations like:
Compression? Remove the highest frequencies that our ears can't hear as easily.
Noise filtering? Remove specific known frequencies. That's how you can remove the sound of car horns but keep your voice in a recording.
Auto tune? Check which musical note is closest to the list of frequencies in the sound, and alter/remove/add frequencies to make it sound closer to the note.
Voice recognition? You can think of the frequencies you create by talking as a unique pattern that software can search for. It's like how the letter "E" appears ~11% of the time in most English text: we all have a few frequencies in our voices that appear with predictable probability.
Below is a sandbox for you to explore these ideas of sound reconstruction.
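If you're reading this without the interactive sandbox, the reconstruction idea can be sketched in a few lines of Python. This is a simplified illustration, not the sandbox's actual code: it assumes every frequency starts at the same phase (a full inverse FFT also uses phase information), and the table values are made up rather than measured from the piano or violin recordings.

```python
import math


def reconstruct(table, duration_s, sample_rate=44100):
    """Rebuild a signal by summing one sine wave per (frequency_hz, weight) row.

    Simplifying assumption: every sine starts at phase zero.
    """
    n_samples = round(duration_s * sample_rate)
    total_weight = sum(weight for _, weight in table) or 1.0
    samples = []
    for n in range(n_samples):
        t = n / sample_rate  # time of this sample in seconds
        value = sum(
            weight * math.sin(2 * math.pi * freq * t) for freq, weight in table
        )
        samples.append(value / total_weight)  # scale so output stays in [-1, 1]
    return samples


# Hypothetical table: a 440 Hz note with two weaker frequencies above it.
example_table = [(440, 100), (880, 40), (1320, 15)]
samples = reconstruct(example_table, 0.01)
print(len(samples))  # 441 samples for a 10 ms slice
```

Changing the weights or dropping rows from `example_table` changes the shape of the repeating pattern while leaving its 440 Hz repetition rate alone.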
So the correct version of assumption (3) is: When we say a sound is 440 Hz, we mean that's the frequency with the most weight in the signal. To give a complete description of the sound you need to know (1) all the frequencies it's made of and (2) how much of each frequency to use.
I created this to learn how FFT works. This is the end-to-end demonstration I was looking for to help me understand it. It's best used to test your understanding while reading other materials. I don't have the source code up yet but if you'd like to extend this or use it as a teaching tool just let me know!
You can drag and drop any sound file into the very first two figures and all figures will update to use it, including the FFT tables and the code sandbox. Here are a few notes from Philharmonia you can try dragging in:
A few key points I needed for a correct implementation:
How to retrieve the original frequencies from the FFT output. The FFT only tells you the frequency in terms of how often it repeats in the list of samples. So you need to multiply by the sample rate (like 44100 for most audio) to get the frequency per second (Hz). Most FFT libraries in JS do NOT give you a list of the frequencies (just the coefficients). The formula for figuring out what frequency corresponds to what coefficient is described here.
You have to take a small slice of the audio. Given that the frequency of the sound changes over time, even when playing a single note, you won't get accurate/expected frequencies if you FFT the entire thing. For example, a piano note has an "attack" and a "decay", so you want to take a slice from the middle.
See the Short Time Fourier Transform (STFT). This computes the FFT for a small slice of audio as we just did, but does it for every slice of the audio. You'll get a list of frequencies over time. This is often visualized as a spectrogram.
You have to sample an integer number of cycles. If you take an arbitrary slice of a pure A note, you will likely get a lot more frequencies than just 440 Hz in your FFT table, unless you happen to pick a start and an end that span a whole number of cycles. This article automatically shortens any slice to the nearest cycle (we figure that out by finding the nearest sample where the signal crosses zero). This is also important when playing the reconstructed sound on loop (otherwise you hear a popping sound).
The discrete Fourier transform (DFT) doesn't include all possible frequencies. In theory, the Fourier transform is a continuous sum (AKA an integral) over all possible frequencies in the audio. What we have implemented here is a discrete version, where we sum a finite list of frequencies. In a small slice you may not get exactly 440 as an output, simply because it wasn't one of the frequencies the algorithm was looking for. I think in principle you could have an implementation that allows you to specify what frequencies must be included (if you know what you're looking for) but I haven't seen such an implementation.
FFT isn't magic. A popular analogy is that FFT can take a smoothie and extract all the components that went into it. But this seems impossible??? The trick is that the FFT comes in with an assumption of what frequencies might be in there. We go through each possible frequency and ask "How much 440 Hz is in this sound?" and "How much 880 Hz is in this sound?" and so on. So it's closer to having an unknown substance and figuring out what it is by checking if it reacts to known chemicals/substances.
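Several of these points can be seen in a small Python sketch. It uses a naive, slow DFT instead of a real FFT library, so it shows the idea and is not how the article is implemented. The names `probe`, `bin_frequencies`, `dft_magnitudes` and `significant_bins` are hypothetical, and the sample rate of 44000 is an assumption chosen only so that one 440 Hz cycle is exactly 100 samples.

```python
import cmath
import math


def probe(samples, frequency_hz, sample_rate):
    """Ask "how much of this frequency is in the sound?" by correlating the
    signal with a complex wave of that frequency."""
    return abs(sum(
        s * cmath.exp(-2j * math.pi * frequency_hz * i / sample_rate)
        for i, s in enumerate(samples)
    ))


def bin_frequencies(n_samples, sample_rate):
    """Bin k repeats k times over the slice, which is k * sample_rate / n_samples Hz."""
    return [k * sample_rate / n_samples for k in range(n_samples // 2 + 1)]


def dft_magnitudes(samples, sample_rate):
    """Naive O(N^2) DFT: probe only the fixed grid of bin frequencies."""
    return [probe(samples, f, sample_rate)
            for f in bin_frequencies(len(samples), sample_rate)]


def significant_bins(samples, sample_rate, min_weight=1.0):
    """Count bins whose weight (relative to the largest, out of 100) is >= min_weight."""
    mags = dft_magnitudes(samples, sample_rate)
    largest = max(mags)
    return sum(1 for m in mags if 100 * m / largest >= min_weight)


SAMPLE_RATE = 44000  # assumption: makes one 440 Hz cycle exactly 100 samples
tone = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(450)]
whole = tone[:400]   # exactly 4 cycles
ragged = tone[:450]  # 4.5 cycles

freqs = bin_frequencies(len(whole), SAMPLE_RATE)
mags = dft_magnitudes(whole, SAMPLE_RATE)
peak = max(range(len(mags)), key=mags.__getitem__)
print(freqs[peak])  # 440.0

# A whole number of cycles gives one significant bin; a ragged slice smears
# the same pure tone across several bins.
print(significant_bins(whole, SAMPLE_RATE), significant_bins(ragged, SAMPLE_RATE))

# 440 Hz is not on the ragged slice's grid, but it can still be probed directly.
print(round(probe(ragged, 440, SAMPLE_RATE), 3))
```

The design choice here is to build the DFT out of `probe`, which makes it visible that the transform is the same question asked once per frequency on a fixed grid, and that the grid depends on the slice length and sample rate.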
A few insights I learned from exploring the code sandbox:
Together, the frequencies with weight < 1 have a big effect. I had these hidden in the FFT tables thinking they weren't significant, and any one of them isn't, but removing all of them at once does have a very noticeable effect. Here's a code snippet (scroll up) that removes all frequencies with weight less than 1. Or try the opposite: listen to only the weights less than 1.
The high frequencies are a big part of the violin sound. Removing anything at 5000 Hz and up makes it no longer really sound like a violin (or makes it sound really muted). This isn't true for the piano.
A string tuned to 440 Hz will never emit frequencies any lower than that. This is true because of the physics of standing waves, but it was really cool to learn about this in theory, and then go back to this interface and see that it was indeed true for all the recordings of notes I had, even though I hadn't known this before I implemented it.
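These experiments all come down to filtering rows of a frequency table before reconstructing the sound. Here is a hedged Python sketch of those filters; the table values are invented for illustration (they are not measurements from the recordings), and the helper name `keep` is hypothetical.

```python
# Made-up (frequency_hz, weight) rows standing in for an FFT table.
table = [(440, 100.0), (880, 35.0), (1320, 0.6), (5280, 4.0), (7040, 0.3)]


def keep(table, predicate):
    """Return only the rows for which predicate(frequency_hz, weight) is true."""
    return [(freq, weight) for freq, weight in table if predicate(freq, weight)]


# Drop everything with weight less than 1...
strong_only = keep(table, lambda freq, weight: weight >= 1)
# ...or do the opposite and keep only the faint frequencies.
faint_only = keep(table, lambda freq, weight: weight < 1)
# Remove anything at 5000 Hz and up.
below_5k = keep(table, lambda freq, weight: freq < 5000)

print(strong_only)  # [(440, 100.0), (880, 35.0), (5280, 4.0)]
print(faint_only)   # [(1320, 0.6), (7040, 0.3)]
print(below_5k)     # [(440, 100.0), (880, 35.0), (1320, 0.6)]

# Check that no frequency in the table sits below the 440 Hz the string is tuned to.
lowest = min(freq for freq, _ in table)
print(lowest >= 440)  # True
```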
Below is a list of resources I used while writing this article. Special thanks to the Algorithm Archive Discord community for helping me understand how to interpret the output of FFT in terms of the original signal. And thanks to Better Explained for having the only explanation of FFT that ever made sense to me.