A typical FFT-based spectrogram uses 1024 bins on a 48 kHz audio, with about 50 Hz step per pixel. Most of the interesting audio activity happens below 3 kHz, so 50 Hz per pixel gives only 60 pixels for that area. As a result, the spectrogram is pixelated. One way to get highres spectrograms is to use CWT (continuous wavelet transform), but it’s messy to implement. Another trick is to use regular FFT shifted by 1/2 pixel (by 25 Hz): the DFT Shift Theorem enables such frequency shifting by multiplying the input signal by
exp(-i*pi*k/N). Smoothness in the time direction is easier to achieve: the 1024 bins window can be advanced by arbitrarily small time steps.
Images below use an HSV-based color scheme:
1.0 - FFT[i]^2 / max FFT^2, i.e. the max FFT amplitude has 0 saturation or just white.
Click the images to see fullres versions.
Bird songs recordings were taken from www.fssbirding.org.uk/sonagrams.htm. Unlike musical instruments, birds don’t seem to bother to create a complex multi-layered harmonics pattern. Instead, they create a complex pattern with the base (fundamental) frequency alone. The “cloud” above the main drawing is what would be the 2nd harmonic. These sonograms are remarkably different from other sounds, as if birds “draw” with sound something that’s flying backwards in time.
Compare this with CWT and standard FFT (no overlapping frames, a fixed set of frequencies):
The CWT spectrogram was obtained with soundshader.github.io/?s=cwt. Despite this CWT implementation runs on GPU and this “advanced” FFT runs on JS, CWT is about 50-100x slower.
Musical instruments usually have this multi-level harmonics structure. Flute is one of the simplest instruments, with first 2 harmonics dominating the spectrum. However it would be a mistake to to call flute sound simple: as you see, every level has its own regular pattern that can’t be recreated with a simple mix of sinusoidal tones.
Viloin is the most instresting: on levels 8 and 9 it creates an intricate ornament. I don’t know if it’s a feature or a defect of the instrument. It’s interesting that our ears collapse the entire 20-story harmonics tower with all these different ornaments on different floors into a single tone.
The bright lines are vowel formants: a pair or triple of them uniquely identify a vowel. Each greenish column is a word, usually consisting of two vowels. The horizontal bar is a bell. Vowels have pretty complex structure and look like a mix of bird songs with musical instruments, as they also have harmonics.