The problem
Real fire is a broadband sound: crackling and hissing
out to several kilohertz. A physically based flame simulator
reproduces visual fire behavior convincingly but doesn't
directly produce sound: time-stepping a 3D combustion simulation
at audio sample rates is impractically expensive, and small-
scale combustion noise comes from thermo-acoustics that those
simulators don't resolve at typical resolutions
(Chadwick & James 2011, §1).
The paper's workaround (§§2.2, 3) is a simplified
analytical sound model. Combustion noise can be derived from a
wave equation forced by the rate of change of the heat release
rate (Crighton et al. 1992; equation 2 in the paper). Under the
premixed-flame assumption (Strahle 1972), the heat release
happens essentially at the moving flame front, so the volume
integral over the combustion region can be rewritten as a
surface integral over that front. Ignoring propagation delays,
1/r distance attenuation, and overall scaling constants
(all fixed multiplicative factors at a chosen listener
position), the radiated pressure reduces to
p(t) = d/dt ∫S(t) u(x, t) · n(x, t) dS
(equation 6), the time derivative of the velocity flux integrated
over the moving flame front. This is also proportional to the
time derivative of the total heat release rate.
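The reduced model can be evaluated directly from per-frame flux samples: sum u · n over the flame front each simulation step, then finite-difference in time. A minimal sketch, assuming hypothetical flux samples at the paper's 360 Hz solver rate (the flux signal here is made up):

```python
import numpy as np

def flame_pressure(flux, dt):
    """Reduced sound model: p(t) is proportional to d/dt of the
    velocity flux through the moving flame front (paper eq. 6),
    with distance attenuation and scaling constants dropped."""
    # flux[i] = integral of u . n over the flame front at frame i
    return np.gradient(flux, dt)

# Hypothetical per-frame flux samples at the solver rate (360 Hz).
dt = 1.0 / 360.0
t = np.arange(0.0, 1.0, dt)
flux = 0.01 * (1.0 + 0.3 * np.sin(2 * np.pi * 5 * t))  # toy flame activity
p = flame_pressure(flux, dt)
```

`np.gradient` uses central differences in the interior, so the derivative of a smooth flux signal is second-order accurate.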
The flame solver runs at hundreds of steps per second (360 Hz
for the paper's examples). After interpolating its pressure
output up to audio sample rate, the resulting signal is band-
limited to roughly the simulation's Nyquist (≈ 180 Hz):
a slow-envelope rumble that captures the flame's overall
rate of activity but contains essentially no audible structure
above that. This is the "low-frequency, physically based pressure
signal" the paper takes as input to its bandwidth-extension and
texture-synthesis algorithms (§§4–5). The
simulated signal alone therefore sounds dull rather than
crackling, even when the visuals are perfect.
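Interpolating the solver-rate signal up to audio rate can be sketched with linear interpolation (the paper does not prescribe the interpolant, so this is an illustrative choice; the key point is that interpolation adds samples, not spectral content above the solver Nyquist):

```python
import numpy as np

def upsample_to_audio(p_sim, sim_rate=360, audio_rate=44100):
    """Interpolate a solver-rate pressure signal to audio rate.
    The result is still band-limited to roughly sim_rate/2
    (~180 Hz for the paper's 360 Hz solver)."""
    n_out = (len(p_sim) - 1) * audio_rate // sim_rate + 1
    t_audio = np.arange(n_out) / audio_rate
    t_sim = np.arange(len(p_sim)) / sim_rate
    return np.interp(t_audio, t_sim, p_sim)
```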
The companion spectral bandwidth extension demo
(link) fills in the
missing high frequencies with synchronized power-law noise that
matches theoretical and experimental flame spectra
(§4 of the paper). This demo takes the
paper's other approach: it borrows the high-frequency structure
from a real fire-audio recording (the "training" signal),
via the texture-synthesis algorithm of §5. Quoting
the paper: "By varying the flame sounds used for input training
data, users can control the style of synthesized sounds, while
retaining synchronization with simulated flames."
Why “data-driven, partially physically based”
The low-frequency input is physical: it comes from the §3
sound model derived above. The high-frequency content comes from
sampling a real recording: the synthesized output's
micro-structure is drawn from training audio rather than
predicted from physics. Bursts in the input (loud combustion
events) drive bursts in the output because the multi-resolution
search at the top pyramid level locks the synthesized envelope
to the input's, and the optional dynamic-range mapping
(§5.3) further re-scales training amplitudes so the output's
loudness distribution matches the input's.
How the algorithm works
-
Pyramids. Pad each signal to length
2^k + 1; build a
numLevels-level Gaussian pyramid
(default 6) using the 5-tap stencil
[0.05, 0.25, 0.40, 0.25, 0.05]
and 2:1 decimation. The coarsest level holds the slow
envelope; each lower level adds finer detail. The training
signal's pyramid is built with reflected boundaries; the
base signal's pyramid is zero-padded.
-
CDF init. If
scaleCDF is on,
sort the absolute amplitudes of the top pyramid level for both
signals. These sorted arrays are the input/output CDFs used
below.
-
Training-feature dictionaries. For each
non-top pyramid level ℓ,
build a feature vector from each training-signal window. Each
feature has two regions: causal context from level
ℓ immediately to the left
of the window (13 samples by default), and coarser
context from level ℓ+1
symmetrically around the window (33 linearly-interpolated
samples by default). All features are stored in a KD-tree for
fast nearest-neighbour lookup.
-
Initialise the output pyramid to the input
pyramid, then zero every level except the top (so the output's
coarse envelope is the input's).
-
Coarse-to-fine synthesis. For
ℓ = numLevels-2, numLevels-3, …, 0,
for each output window:
-
Build a feature vector from already-synthesized samples
(causal) and the next-coarser level (which is fully
synthesized at this point of the loop).
-
If ℓ+1 is the top level and scaleCDF is on:
compute the average magnitude of the coarser-level feature
entries; find that average's percentile in the
base-signal CDF
FS; look up the
matching amplitude in the training CDF
FT; the ratio
r = trainingAmp / averageMag
is the per-window scale factor (paper §5.3,
equation 7). Blend
r with
scalingAlpha for partial
mapping. Multiply the coarser feature entries by
r so the NN search runs in
a normalised space, and remember
1/r so the matched training
window can be scaled back to the right amplitude when it's
blended into the output.
-
Find the nearest training window in
level ℓ's KD-tree.
-
Add that training window into the output
level via a triangular hat-function blend, scaled by
1/r (or 1 if scaleCDF is
off, or if ℓ+1 is not the top level).
-
Reconstruction. The output is the bottom level
of the output pyramid, trimmed to the original sample range.
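The steps above can be condensed into a runnable skeleton. This is a sketch, not the demo's implementation: it uses brute-force nearest-neighbour instead of a KD-tree, butted windows instead of the triangular overlap-add, no CDF mapping, and simplified fixed-size contexts (`window` and `ctx` are illustrative names):

```python
import numpy as np

# Burt-Adelson 5-tap smoothing kernel used by the demo's pyramids.
KERNEL = np.array([0.05, 0.25, 0.40, 0.25, 0.05])

def pyramid(x, num_levels):
    """Gaussian pyramid: 5-tap smoothing + 2:1 decimation per level.
    levels[0] is the full-rate signal, levels[-1] the coarse envelope."""
    levels = [np.asarray(x, float)]
    for _ in range(num_levels - 1):
        smoothed = np.convolve(levels[-1], KERNEL, mode="same")
        levels.append(smoothed[::2])
    return levels

def feature(level, coarser, start, ctx):
    """Causal context at this level + context one level up."""
    half = ctx // 2
    c = start // 2
    return np.concatenate([level[start - ctx:start],
                           coarser[c - half:c + half]])

def synthesize(base, train, num_levels=4, window=8, ctx=8):
    """Coarse-to-fine: keep the base's coarse envelope, then fill each
    finer level with the best-matching training windows."""
    tp = pyramid(train, num_levels)
    out = pyramid(base, num_levels)
    for lev in range(num_levels - 2, -1, -1):
        out[lev][:] = 0.0           # top level is kept; the rest are rebuilt
        tlev, tcoarse = tp[lev], tp[lev + 1]
        for start in range(ctx, len(out[lev]) - window, window):
            f = feature(out[lev], out[lev + 1], start, ctx)
            # brute-force NN over training windows (KD-tree in the demo)
            dists = [np.sum((feature(tlev, tcoarse, ts, ctx) - f) ** 2)
                     for ts in range(ctx, len(tlev) - window)]
            best = ctx + int(np.argmin(dists))
            out[lev][start:start + window] = tlev[best:best + window]
    return out[0]
```

Note how the feature at level ℓ reads already-synthesized samples of `out[lev]` (causal) and the fully synthesized `out[lev + 1]`, which is what locks the synthesized envelope to the input's at the coarsest level.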
What the controls do
-
numLevels: pyramid depth. More levels
means a coarser top level (slower envelope) and more detail
layers; default 6 (top-level sample rate ≈ 1.4 kHz
at 44.1 kHz audio).
-
windowHW (samples): output window
half-width at every level. Smaller windows let the synthesis
switch training contexts more often; larger windows are smoother
but blockier.
-
featureHW (samples): feature context
half-width. The feature vector at level ℓ has
windowHW · featureHW + 1
causal samples plus 2 · windowHW
· (featureHW + 1) + 1 coarser samples. With the
defaults (windowHW = 4, featureHW = 3) that's 13 + 33 = 46 dims.
-
falloff: exponential weight on feature
dimensions: exp(−falloff ·
|distance|). 0 (the paper default) means uniform; large
values bias the NN search toward local context.
-
scalingAlpha: how aggressively the CDF
mapping rescales training amplitudes. 0 disables it (same as
unticking
scaleCDF); 1 is the paper default.
-
scaleCDF: master toggle for
section 5.3 dynamic-range mapping.
-
RNG seed: integer seed for the shared
PCG32 generator. Used here only for tie-breaking; on these
signals you'll rarely see it change the output.
-
epsANN: nearest-neighbour approximation
tolerance. The browser demo defaults to 5.0, which prunes the
KD-tree very aggressively and is roughly 5-10× faster
than exact search. The perceptual difference vs exact NN is
below the noise floor on the bundled examples; in fact, on
torch and candle, exact NN produces a tonal buzz that
approximation actually breaks (see the failure-mode discussion
below). The original C++ release defaults to 1.0. Set the
slider to 0 for exact NN (deterministic, matches the Python
reference bit-for-bit; the Tier 2 goldens always use 0).
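The dimension bookkeeping and weighting above can be checked directly (function names are illustrative; the formulas are the ones stated for featureHW and falloff):

```python
import numpy as np

def feature_dims(windowHW, featureHW):
    """Feature-vector sizes at a non-top level, per the formulas above."""
    causal = windowHW * featureHW + 1
    coarser = 2 * windowHW * (featureHW + 1) + 1
    return causal, coarser

def falloff_weights(distances, falloff):
    """Per-dimension NN weights: exp(-falloff * |distance|).
    falloff=0 (the paper default) gives uniform weighting."""
    return np.exp(-falloff * np.abs(distances))

causal, coarser = feature_dims(4, 3)  # demo defaults: 13 causal + 33 coarser
```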
Things to notice
-
Pick the Dragon example and toggle
scaleCDF on/off: with it on, the loud "fire-
breathing" bursts of the input drive correspondingly loud
bursts in the output (CDF mapping pumps amplitude to match);
with it off, the output is more uniformly textured.
-
Drag scalingAlpha from 0 to 1: at 0 you hear
the training texture untouched; at 1 it's been re-scaled to
match the input's amplitude statistics.
-
Compare the same flame in this demo with the
spectral bandwidth extension demo
(link). The
bandwidth-extension version is "synthetic crackle" matched to
the envelope; this version is "real recorded crackle" matched
to the envelope. They're often perceptually similar, but the
texture-synthesised version has temporal events (pops, micro-
rumbles) that pure power-law noise can't reproduce.
-
Watch the output pyramid waveforms: the top
level (coarsest) is the input's slow envelope; each lower
level is a higher-octave detail layer added by the synthesis
loop.
-
Watch the spectra plot: the output PSD (green)
should sit roughly between the base PSD (blue) at low
frequencies and the training PSD (orange) at high frequencies,
with a smooth crossover that's chosen automatically by the
coarse-to-fine search.
A failure mode of the CDF mapping
The dynamic-range mapping in §5.3 multiplies matched
training windows by a per-window scale factor so the output's
amplitude statistics track the input's. When the input has a
much wider dynamic range than the training data, those scale
factors get large; consecutive windows tend to receive similar
large factors; the triangular overlap-add then reinforces them
constructively into a tonal buzz at the windowHW stride
frequency. The paper itself flags this in §6 (“...the
method still has difficulty producing a suitable, temporally
coherent output sound. This can occur in cases when the
low-frequency input has a very wide dynamic range, while the
training data has a small range”).
Among the bundled examples, both torch and
candle hit this pathology at
scalingAlpha = 1.0 (the C++
release default). The demo therefore loads them with
scalingAlpha = 0.5, which blends
in half the rescale and avoids the artefact while keeping the
qualitative dynamic-range matching the algorithm wants. Drag the
slider back to 1.0 on either example to hear the buzz; drag down
to 0.0 to hear the texture without any rescale at all. The other
three examples (burning_brick, dragon, flame_jet) ship at the
canonical 1.0.
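The per-window rescale can be sketched under stated assumptions: the sorted-magnitude arrays serve as empirical CDFs, the percentile lookup is a plain index mapping, and the scalingAlpha blend toward 1 is assumed linear (the demo's exact interpolation may differ):

```python
import numpy as np

def cdf_scale(avg_mag, base_sorted, train_sorted, alpha=1.0):
    """Map a window's average magnitude through the base-signal CDF
    into the training CDF; return the alpha-blended rescale factor
    (paper sec. 5.3, eq. 7 plus partial mapping)."""
    # percentile of avg_mag among the base signal's sorted magnitudes
    q = np.searchsorted(base_sorted, avg_mag) / len(base_sorted)
    # amplitude at the same percentile of the training signal
    idx = min(int(q * len(train_sorted)), len(train_sorted) - 1)
    r = train_sorted[idx] / max(avg_mag, 1e-12)
    # assumed linear blend: alpha=0 disables the mapping, alpha=1 is full
    return (1.0 - alpha) + alpha * r
```

When the base signal's magnitude range is much wider than the training signal's, consecutive loud windows all receive similarly extreme factors, which is the regime where the overlap-add buzz described above appears.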
Caveats and things this model is not
-
The output's high-frequency content is borrowed from the
training signal, not predicted by physics. If you train on a
candle, your dragon's roar will sound suspiciously candle-like
at high frequencies. The paper frames this as a feature: by
choosing the training clip you choose the texture, while
keeping the simulator's envelope. Picking training audio that
matches your scene is part of the artistic process.
-
Two simulations with very different small-scale combustion
structure but the same envelope will sound similar. If your
application depends on the high-frequency physics, this
synthesis step is not the right tool.
-
The browser demo uses approximate nearest-neighbour
(epsANN = 5.0 by default), as
does the released C++ (which uses the
ANN library
with epsANN = 1.0). The
parity check runs both ports
with epsANN = 0 (exact NN) and
verifies they match within FFT round-off.
-
Defaults match each example's bundled
default.xml:
numLevels=6, windowHW=4, featureHW=3,
falloff=0, scaleCDF=1, scalingAlpha=1. Two examples
ship with a per-demo override
scalingAlpha=0.5
(torch and candle) to avoid
the CDF-mapping pathology described above. The Reset button
restores each example's bundled defaults.
References
-
Chadwick, J. N., and James, D. L. (2011). Animating Fire
with Sound. ACM Transactions on Graphics (SIGGRAPH 2011),
30(4).
cs.cornell.edu/projects/Sound/fire
·
paper PDF
-
Wei, L.-Y., and Levoy, M. (2000). Fast texture synthesis
using tree-structured vector quantization. SIGGRAPH 2000,
479–488. (parent multi-resolution texture synthesis
algorithm.)
-
Burt, P. J., and Adelson, E. H. (1983). A multiresolution
spline with application to image mosaics. ACM TOG, 2(4),
217–236. (Gaussian pyramid construction.)
-
Heeger, D. J., and Bergen, J. R. (1995). Pyramid-based
texture analysis/synthesis. SIGGRAPH 1995, 229–238.
(source of the histogram-matching technique adapted in
§5.3.)
-
Training audio courtesy of The Recordist's
Ultimate Fire sound library, redistributed under the
permission inherited from the original Cornell release.