Fire Sound: Texture Synthesis

Stitch high-frequency content from a real fire-audio recording onto a low-frequency, physically based fire-simulation signal using a multi-resolution texture synthesis search.

Algorithm 2, §5 of Chadwick & James, Animating Fire with Sound (SIGGRAPH 2011) · paper PDF · attribution · parity check · Python source · Companion demo: Spectral Bandwidth Extension

Signal pair

Pick a flame-simulation base (top row) and a training recording (bottom row), or load your own WAVs. Each preset auto-pairs with its hand-picked training clip (highlighted as “· excerpt”); click any other training button to swap to the full-length WAV. Each click re-runs the synthesis. Adjust the sliders below and hit Synthesize.

Algorithm parameters

Spectra (input vs training vs output)

Plot: power spectral densities of the base input, training, and synthesized output signals.

Playback

Base (input)

Training

Synthesized

Output pyramid (coarsest at top)

Spectrograms

Base input · Synthesized · Training

About this sound model

The problem

Real fire is a broadband sound: crackling and hissing out to several kilohertz. A physically based flame simulator reproduces visual fire behavior convincingly but doesn't directly produce sound: time-stepping a 3D combustion simulation at audio sample rates is impractically expensive, and small-scale combustion noise comes from thermo-acoustics that those simulators don't resolve at typical resolutions (Chadwick & James 2011, §1).

The paper's workaround (§§2.2, 3) is a simplified analytical sound model. Combustion noise can be derived from a wave equation forced by the rate of change of the heat release rate (Crighton et al. 1992; equation 2 in the paper). Under the premixed-flame assumption (Strahle 1972), the heat release happens essentially at the moving flame front, so the volume integral over the combustion region can be rewritten as a surface integral over that front. Ignoring propagation delays, 1/r distance attenuation, and overall scaling constants (all fixed multiplicative factors at a chosen listener position), the radiated pressure reduces to p(t) = d/dt ∫_{S(t)} u(x, t) · n(x, t) dS (equation 6), the time derivative of the velocity flux integrated over the moving flame front. This is also proportional to the time derivative of the total heat release rate.
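A minimal sketch of that reduction, assuming the solver exposes the per-step flux integral ∫_{S(t)} u · n dS as an array (the function name and arguments here are hypothetical, not the paper's code):

```python
import numpy as np

def flame_pressure(flux, sim_rate=360):
    """Equation 6 up to a constant: the time derivative of the velocity
    flux through the flame front, sampled at the solver rate.

    flux     : per-timestep values of the surface integral
               (assumed solver output)
    sim_rate : solver steps per second (360 in the paper's examples)
    """
    # Central finite differences approximate d/dt; the dropped 1/r
    # attenuation and scaling constants are fixed multipliers anyway.
    return np.gradient(flux, 1.0 / sim_rate)
```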

The flame solver runs at hundreds of steps per second (360 Hz for the paper's examples). After interpolating its pressure output up to audio sample rate, the resulting signal is band-limited to roughly the simulation's Nyquist (≈ 180 Hz): a slow-envelope rumble that captures the flame's overall rate of activity but contains essentially no audible structure above that. This is the "low-frequency, physically based pressure signal" the paper takes as input to its bandwidth-extension and texture-synthesis algorithms (§§4–5). The simulated signal alone therefore sounds dull rather than crackling, even when the visuals are perfect.
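For example, with linear interpolation (one plausible scheme; the paper doesn't pin one down) the upsampling adds samples but no spectral content above the solver's Nyquist:

```python
import numpy as np

def upsample_to_audio(p_sim, sim_rate=360, audio_rate=44_100):
    """Interpolate the solver-rate pressure signal up to audio rate.
    The result stays band-limited to ~sim_rate/2 (about 180 Hz):
    interpolation adds samples, not high-frequency structure."""
    t_sim = np.arange(len(p_sim)) / sim_rate
    t_out = np.arange(int(t_sim[-1] * audio_rate)) / audio_rate
    return np.interp(t_out, t_sim, p_sim)
```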

The companion spectral bandwidth extension demo fills in the missing high frequencies with synchronized power-law noise that matches theoretical and experimental flame spectra (§4 of the paper). This demo takes the paper's other approach: it borrows the high-frequency structure from a real fire-audio recording (the "training" signal), via the texture-synthesis algorithm of §5. Quoting the paper: "By varying the flame sounds used for input training data, users can control the style of synthesized sounds, while retaining synchronization with simulated flames."

Why “data-driven, partially physically based”

The low-frequency input is physical: it comes from the §3 sound model derived above. The high-frequency content comes from sampling a real recording: the synthesized output's micro-structure is drawn from training audio rather than predicted from physics. Bursts in the input (loud combustion events) drive bursts in the output because the output's top pyramid level is initialised to the input's (step 4 below), locking the synthesized coarse envelope to the input's, and every finer level is synthesized conditioned on it; the optional dynamic-range mapping (§5.3) further re-scales training amplitudes so the output's loudness distribution matches the input's.

How the algorithm works

  1. Pyramids. Pad each signal to length 2^k + 1; build a numLevels-level Gaussian pyramid (default 6) using the 5-tap stencil [0.05, 0.25, 0.40, 0.25, 0.05] and 2:1 decimation (see the first sketch after this list). The coarsest level holds the slow envelope; each lower level adds finer detail. The training signal's pyramid is built with reflected boundaries; the base signal's pyramid is zero-padded.
  2. CDF init. If scaleCDF is on, sort the absolute amplitudes of the top pyramid level for both signals. These sorted arrays are the input/output CDFs used below.
  3. Training-feature dictionaries. For each non-top pyramid level ℓ, build a feature vector from each training-signal window. Each feature has two regions: causal context from level ℓ immediately to the left of the window (16 samples by default), and coarser context from level ℓ+1 symmetrically around the window (33 linearly interpolated samples by default). All features are stored in a KD-tree for fast nearest-neighbour lookup (see the second sketch after this list).
  4. Initialise the output pyramid to the input pyramid, then zero every level except the top (so the output's coarse envelope is the input's).
  5. Coarse-to-fine synthesis. For ℓ = numLevels-2, numLevels-3, …, 0, for each output window:
    1. Build a feature vector from already-synthesized samples (causal) and the next-coarser level (which is fully synthesized at this point of the loop).
    2. If ℓ+1 is the top level and scaleCDF is on: compute the average magnitude of the coarser-level feature entries; find that average's percentile in the base-signal CDF F_S; look up the matching amplitude in the training CDF F_T; the ratio r = trainingAmp / averageMag is the per-window scale factor (paper §5.3, equation 7). Blend r with scalingAlpha so only part of the rescale is applied. Multiply the coarser feature entries by r so the NN search runs in a normalised space, and remember 1/r so the matched training window can be scaled back to the right amplitude when it's blended into the output.
    3. Find the nearest training window in level ℓ's KD-tree.
    4. Add that training window into the output level via a triangular hat-function blend, scaled by 1/r (or 1 if scaleCDF is off, or if ℓ+1 is not the top level).
  6. Reconstruction. The output is the bottom level of the output pyramid, trimmed to the original sample range.
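A minimal NumPy sketch of step 1 (the first sketch referenced above); the pad-to-2^k + 1 bookkeeping is omitted, and the `reflect` flag stands in for the training-vs-base boundary handling:

```python
import numpy as np

KERNEL = np.array([0.05, 0.25, 0.40, 0.25, 0.05])  # 5-tap stencil

def gaussian_pyramid(signal, num_levels=6, reflect=True):
    """Smooth with the 5-tap kernel and decimate 2:1 at each step.
    levels[0] is the full-rate signal; levels[-1] is the coarsest (top)."""
    mode = "reflect" if reflect else "constant"  # training vs base padding
    levels = [np.asarray(signal, dtype=float)]
    for _ in range(num_levels - 1):
        padded = np.pad(levels[-1], 2, mode=mode)
        smoothed = np.convolve(padded, KERNEL, mode="valid")
        levels.append(smoothed[::2])  # 2:1 decimation
    return levels
```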
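And a sketch of the step 3 dictionaries plus the step 5 query, assuming SciPy's cKDTree; the stride `step=8` is an illustrative stand-in for the demo's windowHW-derived spacing:

```python
import numpy as np
from scipy.spatial import cKDTree

def feature(fine, coarse, center, causal=16, coarse_n=33):
    """Feature for a window centred at `center` on level l: `causal`
    samples of left (already-synthesized) context on level l, plus
    `coarse_n` samples from level l+1, taken symmetrically around the
    window's position at the coarser (half) rate."""
    left = fine[max(0, center - causal):center]
    left = np.pad(left, (causal - len(left), 0))  # zero-fill near signal start
    half = coarse_n // 2
    pos = center / 2.0 + np.arange(-half, half + 1)  # level l+1 is half-rate
    coarse_ctx = np.interp(pos, np.arange(len(coarse)), coarse)
    return np.concatenate([left, coarse_ctx])

def training_tree(train_fine, train_coarse, step=8):
    """Step 3: one KD-tree of training features per non-top level."""
    centers = np.arange(0, len(train_fine), step)
    feats = np.stack([feature(train_fine, train_coarse, c) for c in centers])
    return cKDTree(feats), centers

# Step 5, in outline: for each output window, query with a feature built
# from the partially synthesized output pyramid, then blend the matched
# training window into the output level with a triangular hat:
#   tree, centers = training_tree(train_pyr[l], train_pyr[l + 1])
#   _, i = tree.query(feature(out_pyr[l], out_pyr[l + 1], c))
```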

What the controls do

Things to notice

A failure mode of the CDF mapping

The dynamic-range mapping in §5.3 multiplies matched training windows by a per-window scale factor so the output's amplitude statistics track the input's. When the input has a much wider dynamic range than the training data, those scale factors get large; consecutive windows tend to receive similar large factors; the triangular overlap-add then reinforces them constructively into a tonal buzz at the windowHW stride frequency. The paper itself flags this in §6 (“...the method still has difficulty producing a suitable, temporally coherent output sound. This can occur in cases when the low-frequency input has a very wide dynamic range, while the training data has a small range”).

Among the bundled examples, both torch and candle hit this pathology at scalingAlpha = 1.0 (the C++ release default). The demo therefore loads them with scalingAlpha = 0.5, which blends in half the rescale and avoids the artefact while keeping the qualitative dynamic-range matching the algorithm wants. Drag the slider back to 1.0 on either example to hear the buzz; drag it down to 0.0 to hear the texture without any rescale at all. The other three examples (burning_brick, dragon, flame_jet) ship at the canonical 1.0.
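A sketch of the equation 7 scale factor and the scalingAlpha blend; the linear blend toward 1 is an assumption read off the "blends in half the rescale" description above, not a formula quoted from the paper or the C++ release:

```python
import numpy as np

def cdf_scale(avg_mag, base_sorted, train_sorted, alpha=1.0):
    """Per-window scale factor r from equation 7, with scalingAlpha
    blending (assumed linear) toward no rescale.

    avg_mag      : average magnitude of the coarser-level feature entries
    base_sorted  : sorted |amplitudes| of the base top level (CDF F_S)
    train_sorted : sorted |amplitudes| of the training top level (CDF F_T)
    alpha        : 0 = no rescale, 1 = full dynamic-range mapping
    """
    # Percentile of avg_mag under F_S ...
    q = np.searchsorted(base_sorted, avg_mag) / len(base_sorted)
    # ... and the training amplitude at the same percentile under F_T.
    idx = min(int(q * len(train_sorted)), len(train_sorted) - 1)
    r = train_sorted[idx] / max(avg_mag, 1e-12)
    # When the base range is much wider than the training range, loud
    # windows get r << 1, so the 1/r applied at blend time is large and
    # similar across neighbouring windows -- the buzz described above.
    return (1.0 - alpha) + alpha * r  # assumed blend: linear toward identity
```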

Caveats and things this model is not

References

  1. Chadwick, J. N., and James, D. L. (2011). Animating Fire with Sound. ACM Transactions on Graphics (SIGGRAPH 2011), 30(4). cs.cornell.edu/projects/Sound/fire · paper PDF
  2. Wei, L.-Y., and Levoy, M. (2000). Fast texture synthesis using tree-structured vector quantization. SIGGRAPH 2000, 479–488. (parent multi-resolution texture synthesis algorithm.)
  3. Burt, P. J., and Adelson, E. H. (1983). A multiresolution spline with application to image mosaics. ACM TOG, 2(4), 217–236. (Gaussian pyramid construction.)
  4. Heeger, D. J., and Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. SIGGRAPH 1995, 229–238. (source of the histogram-matching technique adapted in §5.3.)
  5. Training audio courtesy of The Recordist's Ultimate Fire sound library, redistributed with permission inherited from the original Cornell release.