
pirate/soundview


SoundView

Real-time audio visualizer that turns microphone input into a rich, multi-layered visual display, mimicking the human auditory system and the features it extracts for free.

In theory, you could learn to read the live display to "hear" and interpret speech, music, and other ambient sound with your eyes.

Live Site: https://pirate.github.io/soundview/


Background / Inspiration: This 💬 GPT-5.4 conversation where I was asking about how human brains process sound.

Related Projects & Resources


What It Shows

Cochleagram (main spectrogram area, ~60% of screen)

A scrolling time-frequency display rendered at native Retina resolution. Each pixel column represents one frame (~16ms) of audio.

  • Vertical axis: Frequency (50Hz at bottom, 16kHz at top), with a piecewise log scale that compresses the extremes and expands the 200-8000Hz speech/music range for maximum detail
  • Color: Thermal colormap from black (silent) through blue, cyan, green, yellow, orange, red to white (loud)
  • Resolution: FFT size 8192 gives ~5.4Hz per bin, with per-pixel rendering via ImageData
  • Sensitivity: Adjustable via slider, with perceptual gamma compression (0.35) and a noise gate to suppress mic self-noise
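A minimal sketch of how such a display mapping could work (not the project's actual code; the region fractions, dB floor, and dB range below are assumed values, only the 50 Hz-16 kHz span, the 200-8000 Hz emphasis, and the 0.35 gamma come from the description above):

```javascript
// Piecewise log scale: most vertical rows go to the 200-8000 Hz range.
const FMIN = 50, FMAX = 16000, F_LO = 200, F_HI = 8000;
const GAMMA = 0.35;
// Fractions of the axis per region (assumed values).
const LO_FRAC = 0.15, MID_FRAC = 0.70; // remaining 0.15 covers 8-16 kHz

function freqToY(freq, height) {
  const log = Math.log2;
  let frac;
  if (freq <= F_LO) {
    frac = LO_FRAC * (log(freq / FMIN) / log(F_LO / FMIN));
  } else if (freq <= F_HI) {
    frac = LO_FRAC + MID_FRAC * (log(freq / F_LO) / log(F_HI / F_LO));
  } else {
    frac = LO_FRAC + MID_FRAC +
      (1 - LO_FRAC - MID_FRAC) * (log(freq / F_HI) / log(FMAX / F_HI));
  }
  return Math.round((1 - frac) * (height - 1)); // low frequencies at the bottom
}

// dB magnitude -> 0..1 brightness with perceptual gamma compression.
function dbToBrightness(db, floorDb = -90, rangeDb = 90, sensDb = 0) {
  const t = Math.min(1, Math.max(0, (db - floorDb + sensDb) / rangeDb));
  return Math.pow(t, GAMMA);
}
```

The brightness value would then index into the thermal colormap (black through blue, cyan, green, yellow, orange, red, white).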

Overlay Lines on Cochleagram

  • White line: Pitch (fundamental frequency) tracked via YIN-style autocorrelation. Snaps instantly on octave jumps instead of drawing diagonals
  • Pink line: Spectral centroid (brightness/timbre), smoothed with jitter rejection — only shows when stable
  • Cyan line: Spectral rolloff (frequency below which 85% of energy lives)
  • White voice lines: Up to 4 simultaneous pitches detected via subharmonic summation
  • Green formant dots: F1/F2/F3 vocal tract resonances at their frequency positions
  • Green harmonic dots: Overtone series when pitch is detected

Noise Fuzz (top of cochleagram)

Three rows of scattered pixels at the top, gated on aperiodic content only (suppressed during speech/music):

  • Top row: High-frequency noise — cyan (hissy) or white (broadband)
  • Middle row: Mid-frequency noise — pink (pink noise) or grey (balanced)
  • Bottom row: Low-frequency noise — brown/red (rumble)

Density and opacity scale with noise loudness. Color indicates noise spectral tilt.

Beat Detection (blue vertical lines)

BTrack-style beat tracker (Adam Stark, 2014):

  1. Onset detection function from spectral flux feeds into a circular buffer
  2. Autocorrelation estimates tempo period (60-164 BPM range) with Rayleigh weighting
  3. Cumulative score array chains evidence backward by one beat period
  4. Beat counter triggers when accumulated score peaks
  5. Requires 6+ consecutive confirmed beats before showing (prevents false positives)
  6. Every 10th beat displays the current BPM as a number
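The tempo-estimation step (2) can be sketched as follows. This is an illustrative simplification, not the project's code: the hop size is a parameter, and the Rayleigh peak placement at ~120 BPM is an assumption.

```javascript
// Estimate a beat period from an onset-detection-function (ODF) history by
// autocorrelation, with a Rayleigh weighting that favors mid-tempo periods.
function estimateBeatPeriod(odf, hopSeconds, minBpm = 60, maxBpm = 164) {
  const minLag = Math.floor(60 / (maxBpm * hopSeconds));
  const maxLag = Math.ceil(60 / (minBpm * hopSeconds));
  const beta = Math.round(60 / (120 * hopSeconds)); // Rayleigh peak near 120 BPM
  let bestLag = minLag, bestScore = -Infinity;
  for (let lag = minLag; lag <= maxLag && lag < odf.length; lag++) {
    let acf = 0;
    for (let i = 0; i + lag < odf.length; i++) acf += odf[i] * odf[i + lag];
    // Rayleigh weighting: w(l) = (l / beta^2) * exp(-l^2 / (2 beta^2))
    const w = (lag / (beta * beta)) * Math.exp(-(lag * lag) / (2 * beta * beta));
    const score = acf * w;
    if (score > bestScore) { bestScore = score; bestLag = lag; }
  }
  return { lagFrames: bestLag, bpm: 60 / (bestLag * hopSeconds) };
}
```

The cumulative-score stage (3) then chains this period backward through the ODF history before the beat counter fires.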

Broadband Transient Lines (white dashed vertical)

Detects sudden broadband energy spikes (claps, thuds, impacts) by counting how many cochleagram rows are "bright" in the current frame vs the running average. Debounced at 250ms.
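A minimal sketch of that scheme (the brightness threshold, trigger ratio, and smoothing factor are assumed; only the 250 ms debounce comes from the description):

```javascript
// Flag a broadband transient when the count of "bright" spectrogram rows
// jumps well above its running average, with a debounce so one clap = one line.
function makeTransientDetector({ brightDb = -40, ratio = 2.5, debounceMs = 250 } = {}) {
  let avgCount = 1, lastFireMs = -Infinity;
  return function detect(rowDbs, nowMs) {
    let count = 0;
    for (const db of rowDbs) if (db > brightDb) count++;
    const fired = count > avgCount * ratio && nowMs - lastFireMs >= debounceMs;
    if (fired) lastFireMs = nowMs;
    avgCount = 0.95 * avgCount + 0.05 * Math.max(count, 1); // running average
    return fired;
  };
}
```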

Harmonic Profile Strip (~25% of screen)

32 rows showing the first 16 harmonics at 2x resolution with interpolation. Each row is color-coded by acoustic role:

Harmonic            Color     Significance
H1                  White     Fundamental strength
H2                  Cyan      Breathiness indicator (H2/H1 ratio)
H3                  Orange    Power/projection
H5, H7              Yellow    Odd-harmonic signature (nasal, reed, square wave)
H8-H10              Magenta   Brilliance region (trained singer, brass)
Even (H4, H6, ...)  Blue      Even harmonics
H11+                Gold      Upper partials

Additional dimensions encoded:

  • Brightness: Harmonic amplitude (dB-compressed for visibility)
  • Saturation: Harmonic purity (peak vs surrounding noise floor)
  • Flash/dim: Temporal derivative (brightens on attack, dims on decay)
  • White line: Tracks the dominant non-fundamental harmonic when stable for 200ms+

Harmonics are computed from the store's autocorrelation-based pitch, with fallback to the strongest voice from multi-pitch detection (works for music through speakers, not just direct voice).
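One way to read harmonic amplitudes once an f0 is known is to sample FFT bins at integer multiples of the pitch. This is a hypothetical helper, not the project's code; the ±2-bin search window is an assumption to tolerate bin quantization:

```javascript
// Read the first N harmonic amplitudes (in dB) from an FFT magnitude array
// by sampling near integer multiples of the detected fundamental.
function harmonicAmplitudes(spectrumDb, f0, sampleRate, fftSize, nHarmonics = 16) {
  const hzPerBin = sampleRate / fftSize;
  const amps = new Float32Array(nHarmonics);
  for (let h = 1; h <= nHarmonics; h++) {
    const bin = Math.round((h * f0) / hzPerBin);
    if (bin >= spectrumDb.length) break;
    let best = -Infinity;
    // Take the max over a small neighborhood around the expected bin.
    for (let b = Math.max(0, bin - 2); b <= Math.min(spectrumDb.length - 1, bin + 2); b++) {
      best = Math.max(best, spectrumDb[b]);
    }
    amps[h - 1] = best;
  }
  return amps;
}
```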

MIDI Note Strip (~7% of screen)

Scrolling piano-roll-style view of detected chord notes, with 12 rows (one per pitch class, C at bottom, B at top):

  • Chord tones light up in their pitch-class color (C=red, C#=orange, D=yellow, D#=yellow-green, E=green, F=teal, F#=cyan, G=blue, G#=indigo, A=purple, A#=magenta, B=pink) at full brightness proportional to chroma energy
  • Non-chord active notes shown as dim versions of their pitch-class color
  • Inactive notes are near-black, creating a clear on/off MIDI-note appearance
  • Chord name overlaid as text on the strip every ~1 second

Under the hood, chromagram energy (12 pitch-class bins folded from the FFT spectrum) is thresholded and compared against detected chord templates (major, minor, diminished, dominant 7th, minor 7th) to determine which notes belong to the current chord.
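The folding step can be sketched as below. This is a simplified illustration (each FFT bin is rounded to its nearest pitch class; a full HPCP may weight and spread energy across neighboring classes):

```javascript
// Fold an FFT magnitude spectrum into 12 pitch-class bins (0 = C).
function foldToChroma(spectrumMag, sampleRate, fftSize, fMin = 60, fMax = 5000) {
  const chroma = new Float32Array(12);
  const hzPerBin = sampleRate / fftSize;
  for (let b = 1; b < spectrumMag.length; b++) {
    const f = b * hzPerBin;
    if (f < fMin || f > fMax) continue;
    const midi = 69 + 12 * Math.log2(f / 440);        // frequency -> MIDI note
    const pc = ((Math.round(midi) % 12) + 12) % 12;   // MIDI note -> pitch class
    chroma[pc] += spectrumMag[b];
  }
  return chroma;
}
```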

Circle of Fifths (overlay, bottom-left)

Interactive key detection visualization rendered on the overlay canvas above the timbre space map:

  • Outer ring: 12 major keys arranged in circle-of-fifths order (C at top, clockwise: C→G→D→A→E→B→F#→C#→G#→D#→A#→F)
  • Inner ring: 12 relative minor keys (Am at top, following the same fifths order)
  • Highlight: The detected key segment lights up blue when confident
  • Center: Shows the currently detected chord name
  • Key detection: Krumhansl-Kessler key profiles matched against the chromagram via Pearson correlation, with a slow accumulator for stability (updates ~4× per second)
  • Chord detection: Cosine similarity matching against chord templates (major, minor, dim, dom7, min7), updated every frame for responsiveness
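The key-matching step might look like the sketch below. The profile values are the published Krumhansl-Kessler probe-tone ratings; the surrounding code is an assumed implementation, and the real one adds the slow accumulator described above:

```javascript
// Krumhansl-Kessler major/minor key profiles, index 0 = tonic.
const KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88];
const KK_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17];

function pearson(a, b) {
  const n = a.length;
  const ma = a.reduce((s, v) => s + v, 0) / n;
  const mb = b.reduce((s, v) => s + v, 0) / n;
  let num = 0, da = 0, db = 0;
  for (let i = 0; i < n; i++) {
    num += (a[i] - ma) * (b[i] - mb);
    da += (a[i] - ma) ** 2;
    db += (b[i] - mb) ** 2;
  }
  return num / Math.sqrt(da * db);
}

// Score all 24 keys by correlating the chromagram against rotated profiles.
function detectKey(chroma) {
  let best = { score: -Infinity, tonic: 0, mode: "major" };
  for (let tonic = 0; tonic < 12; tonic++) {
    for (const [mode, profile] of [["major", KK_MAJOR], ["minor", KK_MINOR]]) {
      const rotated = profile.map((_, i) => profile[((i - tonic) + 12) % 12]);
      const score = pearson(chroma, rotated);
      if (score > best.score) best = { score, tonic, mode };
    }
  }
  return best;
}
```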

Feature Strip (bottom ~15% of screen)

Energy/Spread/Flux band (3 rows):

  • Background color: spectral spread mapped to blue (narrow) → green → yellow → red (wide), modulated by RMS energy as brightness
  • White line: energy envelope (spectral flux, unsmoothed for fast transient response)
  • Black line: derivative of flux (spikes on onsets, dips on releases)
  • Blue squares: beat markers from the BTrack detector

Top Frequencies band (5 rows):

  • Background: instrument classification color (green=vocal, red-orange=drums, gold=brass, purple=strings, blue=piano, grey=noise)
  • Colored lines: top 3 detected frequencies via iterative peak picking with suppression and merge (same colors as voice arrows)
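The iterative peak picking could be sketched as follows (suppression and merge widths here are assumed values):

```javascript
// Pick the top-N spectral peaks by repeated argmax, zeroing out a neighborhood
// around each winner so the next pass finds a genuinely different frequency,
// and skipping near-duplicates of already-picked peaks.
function topFrequencies(spectrumDb, hzPerBin, n = 3, suppressHz = 100, mergeHz = 50) {
  const spec = Float32Array.from(spectrumDb);
  const suppressBins = Math.round(suppressHz / hzPerBin);
  const peaks = [];
  for (let k = 0; k < n; k++) {
    let bestBin = -1, bestDb = -Infinity;
    for (let b = 0; b < spec.length; b++) {
      if (spec[b] > bestDb) { bestDb = spec[b]; bestBin = b; }
    }
    if (bestBin < 0) break;
    const freq = bestBin * hzPerBin;
    if (!peaks.some(p => Math.abs(p.freq - freq) < mergeHz)) {
      peaks.push({ freq, db: bestDb });
    }
    // Suppress the winner's neighborhood before the next pass.
    for (let b = Math.max(0, bestBin - suppressBins);
         b <= Math.min(spec.length - 1, bestBin + suppressBins); b++) {
      spec[b] = -Infinity;
    }
  }
  return peaks;
}
```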

Voice Arrows (right edge, overlay canvas)

Up to 4 simultaneously detected pitches shown as colored arrows pointing left from the right edge, with frequency labels. Drawn on a separate canvas that clears each frame (never scrolls). Black background strip keeps them readable.

Colors match between arrows and their corresponding lines on the cochleagram and top-freq band:

  • Orange: strongest voice
  • Blue: 2nd voice
  • Green: 3rd voice
  • Magenta: 4th voice

Timbre Space Map (overlay, bottom-left)

2D scatter plot showing timbral characteristics as a moving dot with a fading trail:

  • X axis: Spectral centroid (log scale, 200Hz–8kHz) — left=dark, right=bright
  • Y axis: MFCC[1] (spectral tilt, adaptive normalization) — bottom=warm, top=cold
  • Dot color: Tristimulus (T1=red=fundamental dominance, T2=green=mid harmonics H2-H4, T3=blue=upper partials H5+). Each trail point stores its own tristimulus color from the time it was recorded
  • Inharmonicity bar: Orange bar at bottom edge, length proportional to harmonic deviation
  • Computed from raw (un-normalized) harmonic amplitudes for accurate energy ratios
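Tristimulus from raw harmonic amplitudes is a standard computation; a minimal sketch:

```javascript
// Tristimulus: T1 = fundamental's share of total harmonic energy,
// T2 = share of H2-H4, T3 = share of H5 and above.
function tristimulus(harmonics) { // linear amplitudes, harmonics[0] = H1
  const total = harmonics.reduce((s, a) => s + a, 0);
  if (total <= 0) return [0, 0, 0];
  const t1 = harmonics[0] / total;
  const t2 = harmonics.slice(1, 4).reduce((s, a) => s + a, 0) / total;
  const t3 = harmonics.slice(4).reduce((s, a) => s + a, 0) / total;
  return [t1, t2, t3];
}
```

Mapping T1/T2/T3 directly to red/green/blue gives the dot color described above.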

MFCC Strip (~8% of screen)

13 rows showing Mel-Frequency Cepstral Coefficients with a diverging blue↔orange colormap:

  • Adaptive normalization tracks min/max per coefficient over time
  • MFCC[0] (bottom) represents overall spectral energy level
  • Higher MFCCs capture increasingly fine spectral envelope detail
  • Useful for distinguishing vowel sounds, instrument timbres, and speech vs music

Audio Analysis Pipeline

src/audio/engine.js

Microphone capture with AGC/noise suppression/echo cancellation disabled. Creates an AnalyserNode with FFT size 8192.
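A sketch of the capture setup this implies (the constraint names are the standard `MediaTrackConstraints` fields; the browser-only calls are shown as comments since they only run in a browser):

```javascript
// Disable the browser's built-in DSP so the analyser sees the raw signal.
const captureConstraints = {
  audio: {
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
  },
};

// In the browser (assumed wiring, based on the description above):
//   const stream = await navigator.mediaDevices.getUserMedia(captureConstraints);
//   const ctx = new AudioContext();
//   const analyser = ctx.createAnalyser();
//   analyser.fftSize = 8192;
//   ctx.createMediaStreamSource(stream).connect(analyser);
```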

src/audio/filterbank.js

28 BiquadFilter bandpass filters from 30Hz to 20kHz (~1/3 octave spacing), each with its own AnalyserNode for per-band energy extraction.
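The band centers implied by that layout are log-spaced; a sketch of how they might be computed (assumed helper, not the project's code) before each one feeds a `BiquadFilterNode` of type `"bandpass"`:

```javascript
// 28 band-center frequencies, log-spaced from 30 Hz to 20 kHz (~1/3-octave steps).
function bandCenters(n = 28, fLow = 30, fHigh = 20000) {
  const step = Math.log2(fHigh / fLow) / (n - 1); // octaves per band
  return Array.from({ length: n }, (_, i) => fLow * Math.pow(2, i * step));
}
```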

src/audio/pitch.js

YIN-style autocorrelation pitch detection on the first 2048 samples of the time-domain buffer. Collects all NSDF peaks, finds the global best, then accepts the first peak within 50% of the best (proper YIN heuristic). Range: 60-800Hz.
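The steps above (NSDF, peak collection, "first peak within 50% of the best") can be sketched like this; it is a simplified illustration, not the project's code:

```javascript
// NSDF-based pitch detection: normalized autocorrelation, local-maximum
// peak picking, then accept the first peak within 50% of the global best.
function detectPitch(samples, sampleRate, fMin = 60, fMax = 800) {
  const n = samples.length;
  const maxLag = Math.min(Math.floor(sampleRate / fMin), n - 1);
  const minLag = Math.floor(sampleRate / fMax);
  const nsdf = new Float32Array(maxLag + 1);
  for (let lag = minLag; lag <= maxLag; lag++) {
    let acf = 0, norm = 0;
    for (let i = 0; i + lag < n; i++) {
      acf += samples[i] * samples[i + lag];
      norm += samples[i] ** 2 + samples[i + lag] ** 2;
    }
    nsdf[lag] = norm > 0 ? (2 * acf) / norm : 0;
  }
  // Collect local maxima, find the global best, take the first within 50% of it
  // (preferring the shortest qualifying lag avoids octave-down errors).
  const peaks = [];
  for (let lag = minLag + 1; lag < maxLag; lag++) {
    if (nsdf[lag] > nsdf[lag - 1] && nsdf[lag] >= nsdf[lag + 1]) {
      peaks.push({ lag, value: nsdf[lag] });
    }
  }
  if (peaks.length === 0) return 0;
  const best = Math.max(...peaks.map(p => p.value));
  const chosen = peaks.find(p => p.value >= 0.5 * best);
  return sampleRate / chosen.lag;
}
```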

src/audio/features.js

Per-frame feature extraction (~400 lines):

  • Band energy, smoothing, peak tracking, delta, periodicity (autocorrelation with 32 lag limit), roughness
  • Full spectrum copy (spectrumDb)
  • RMS, noise floor estimation, signal-above-noise, signal presence detection
  • Spectral shape: centroid (with smoothing + snap/jitter rejection), spread, flatness, slope, rolloff
  • Noisiness decomposition (tonal vs noise energy)
  • Pitch detection + smoothing (fast tracking with octave-jump snap)
  • Harmonicity + 32 harmonic amplitudes (normalized to fundamental)
  • Modulation depth/rate from envelope analysis
  • Onset detection (spectral flux with adaptive median threshold)
  • Chromagram, key/chord detection (delegated to chroma.js)
  • Timbre descriptors (delegated to timbre.js)
  • Clears key/chord state during silence

src/audio/chroma.js

Chromagram computation + key/chord detection:

  • Folds FFT spectrum into 12 pitch-class bins (HPCP) across the 60–5000Hz range
  • Log-scales chroma energy using the same dB floor/range approach as the cochleagram, with the sensitivity slider applied for consistent brightness control
  • Key detection via Pearson correlation against Krumhansl-Kessler major/minor profiles, with a slow accumulator (updates every 15 frames)
  • Chord detection via cosine similarity against 5 chord templates (major, minor, dim, dom7, min7), every frame

src/audio/timbre.js

Timbre descriptors:

  • 13 MFCCs via 26-band mel filterbank → log compression → DCT-II
  • Tristimulus (T1/T2/T3) from raw harmonic amplitudes (fundamental, mid harmonics H2-H4, upper partials H5+), decays toward zero when no pitch is detected
  • Inharmonicity: weighted deviation of actual partial frequencies from perfect harmonic series
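The final DCT-II stage of the MFCC pipeline is small enough to show in full (the mel filterbank and log compression are omitted here; this is an illustrative sketch):

```javascript
// DCT-II: turn 26 log-mel energies into 13 decorrelated cepstral coefficients.
function dct2(logMel, nCoeffs = 13) {
  const n = logMel.length;
  const out = new Float32Array(nCoeffs);
  for (let k = 0; k < nCoeffs; k++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += logMel[i] * Math.cos((Math.PI * k * (i + 0.5)) / n);
    }
    out[k] = sum;
  }
  return out;
}
```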

src/audio/modulation.js

Per-band modulation spectrum via a 64-point FFT of each band's envelope history. 7 modulation bands spanning <1Hz (slow dynamics) up to the roughness range (30-300Hz). Runs every 4th frame.

src/audio/formants.js

Spectral peak picking for F1/F2/F3 with:

  • Wide smoothing window (~150Hz) to reveal formant envelope over individual harmonics
  • Frequency-constrained assignment (F1: 200-1000Hz, F2: 600-2800Hz, F3: 1500-4500Hz)
  • Hysteresis smoothing (fast on large jumps, slow on jitter)
  • Rule-based sound classifier (silence/voiced harmonic/voiced noisy/fricative/plosive/nasal)

src/store/feature-store.js

Shared typed-array data bus. All audio features written by the analysis pipeline, read by the renderer each frame.

src/scene/engine.js

Minimal render loop using performance.now() for timing. No Three.js — pure 2D canvas.

src/scene/layers/spectrum-wall.js

The main renderer (~1400 lines). Creates two canvases:

  1. Spectrogram canvas: scrolling cochleagram + harmonics + MIDI note strip + feature strip + MFCC strip, rendered with ImageData for pixel-perfect output
  2. Overlay canvas: voice arrows, Circle of Fifths key display, timbre space map — cleared each frame

Also contains:

  • Multi-pitch detection via subharmonic summation
  • Top-frequency extraction (iterative argmax with suppression + merge)
  • Simple instrument classifier
  • BTrack beat tracker
  • Noise fuzz renderer
  • Chord-to-pitch-class parsing for MIDI note highlighting
  • Circle of Fifths rendering with key/chord detection display

Development

pnpm install
pnpm dev

# deploy:
pnpm build
npx gh-pages -d dist

Click "click to start" to grant microphone access. Controls:

  • sens: Sensitivity offset in dB (shifts the brightness curve)
  • speed: Scroll speed in pixels per frame (1-20)

Tech Stack

  • Vanilla JS (no framework)
  • Web Audio API (AnalyserNode, BiquadFilter, MediaStream)
  • Canvas 2D (ImageData for cochleagram, fillRect for features, arc for voice circles)
  • Vite for dev server and build

MIT License

