The problem
Real fire is a broadband sound: crackling and hissing
out to several kilohertz. A physically based flame simulator
reproduces visual fire behavior convincingly but doesn't
directly produce sound: time-stepping a 3D combustion simulation
at audio sample rates is impractically expensive, and small-
scale combustion noise comes from thermo-acoustics that those
simulators don't resolve at typical resolutions
(Chadwick & James 2011, §1).
The paper's workaround (§§2.2, 3) is a simplified
analytical sound model. Combustion noise can be derived from a
wave equation forced by the rate of change of the heat release
rate (Crighton et al. 1992; equation 2 in the paper). Under the
premixed-flame assumption (Strahle 1972), the heat release
happens essentially at the moving flame front, so the volume
integral over the combustion region can be rewritten as a
surface integral over that front. Ignoring propagation delays,
1/r distance attenuation, and overall scaling constants
(all fixed multiplicative factors at a chosen listener
position), the radiated pressure reduces to
p(t) = d/dt ∫S(t) u(x, t) · n(x, t) dS
(equation 6), the time derivative of the velocity flux integrated
over the moving flame front. This is also proportional to the
time derivative of the total heat release rate.
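The reduced model can be evaluated directly from per-frame flux samples: sum u · n over the flame front each simulation step, then finite-difference in time. A minimal sketch, assuming hypothetical flux samples at the paper's 360 Hz solver rate (the flux signal here is made up):

```python
import numpy as np

def flame_pressure(flux, dt):
    """Reduced sound model: p(t) is proportional to d/dt of the
    velocity flux through the moving flame front (paper eq. 6),
    with distance attenuation and scaling constants dropped."""
    # flux[i] = integral of u . n over the flame front at frame i
    return np.gradient(flux, dt)

# Hypothetical per-frame flux samples at the solver rate (360 Hz).
dt = 1.0 / 360.0
t = np.arange(0.0, 1.0, dt)
flux = 0.01 * (1.0 + 0.3 * np.sin(2 * np.pi * 5 * t))  # toy flame activity
p = flame_pressure(flux, dt)
```

`np.gradient` uses central differences in the interior, so the derivative of a smooth flux signal is second-order accurate.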
The flame solver runs at hundreds of steps per second (360 Hz
for the paper's examples). After interpolating its pressure
output up to audio sample rate, the resulting signal is band-
limited to roughly the simulation's Nyquist (≈ 180 Hz):
a slow-envelope rumble that captures the flame's overall
rate of activity but contains essentially no audible structure
above that. This is the "low-frequency, physically based pressure
signal" the paper takes as input to its bandwidth-extension and
texture-synthesis algorithms (§§4–5). The
simulated signal alone therefore sounds dull rather than
crackling, even when the visuals are perfect.
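Interpolating the solver-rate signal up to audio rate can be sketched with linear interpolation (the paper does not prescribe the interpolant, so this is an illustrative choice; the key point is that interpolation adds samples, not spectral content above the solver Nyquist):

```python
import numpy as np

def upsample_to_audio(p_sim, sim_rate=360, audio_rate=44100):
    """Interpolate a solver-rate pressure signal to audio rate.
    The result is still band-limited to roughly sim_rate/2
    (~180 Hz for the paper's 360 Hz solver)."""
    n_out = (len(p_sim) - 1) * audio_rate // sim_rate + 1
    t_audio = np.arange(n_out) / audio_rate
    t_sim = np.arange(len(p_sim)) / sim_rate
    return np.interp(t_audio, t_sim, p_sim)
```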
The companion spectral bandwidth extension demo
(link) fills in the
missing high frequencies with synchronized power-law noise that
matches theoretical and experimental flame spectra
(§4 of the paper). This demo takes the
paper's other approach: it borrows the high-frequency structure
from a real fire-audio recording (the "training" signal),
via the texture-synthesis algorithm of §5. Quoting
the paper: "By varying the flame sounds used for input training
data, users can control the style of synthesized sounds, while
retaining synchronization with simulated flames."
Why “data-driven, partially physically based”
The low-frequency input is physical: it comes from the §3
sound model derived above. The high-frequency content comes from
sampling a real recording: the synthesized output's
micro-structure is drawn from training audio rather than
predicted from physics. Bursts in the input (loud combustion
events) drive bursts in the output because the multi-resolution
search at the top pyramid level locks the synthesized envelope
to the input's, and the optional dynamic-range mapping
(§5.3) further re-scales training amplitudes so the output's
loudness distribution matches the input's.
How the algorithm works
-
Pyramids. Pad each signal to length
2^k + 1; build a
numLevels-level Gaussian pyramid
(default 6) using the 5-tap stencil
[0.05, 0.25, 0.40, 0.25, 0.05]
and 2:1 decimation. The coarsest level holds the slow
envelope; each lower level adds finer detail. The training
signal's pyramid is built with reflected boundaries; the
base signal's pyramid is zero-padded.
-
CDF init. If
scaleCDF is on,
sort the absolute amplitudes of the top pyramid level for both
signals. These sorted arrays are the input/output CDFs used
below.
-
Training-feature dictionaries. For each
non-top pyramid level ℓ,
build a feature vector from each training-signal window. Each
feature has two regions: causal context from level
ℓ immediately to the left
of the window (13 samples by default), and coarser
context from level ℓ+1
symmetrically around the window (33 linearly-interpolated
samples by default). All features are stored in a KD-tree for
fast nearest-neighbour lookup.
-
Initialise the output pyramid to the input
pyramid, then zero every level except the top (so the output's
coarse envelope is the input's).
-
Coarse-to-fine synthesis. For
ℓ = numLevels-2, numLevels-3, …, 0,
for each output window:
-
Build a feature vector from already-synthesized samples
(causal) and the next-coarser level (which is fully
synthesized at this point of the loop).
-
If ℓ+1 is the top level and scaleCDF is on:
compute the average magnitude of the coarser-level feature
entries; find that average's percentile in the
base-signal CDF
FS; look up the
matching amplitude in the training CDF
FT; the ratio
r = trainingAmp / averageMag
is the per-window scale factor (paper §5.3,
equation 7). Blend
r with
scalingAlpha for partial
mapping. Multiply the coarser feature entries by
r so the NN search runs in
a normalised space, and remember
1/r so the matched training
window can be scaled back to the right amplitude when it's
blended into the output.
-
Find the nearest training window in
level ℓ's KD-tree.
-
Add that training window into the output
level via a triangular hat-function blend, scaled by
1/r (or 1 if scaleCDF is
off, or if ℓ+1 is not the top level).
-
Reconstruction. The output is the bottom level
of the output pyramid, trimmed to the original sample range.
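The steps above can be condensed into a runnable skeleton. This is a sketch, not the demo's implementation: it uses brute-force nearest-neighbour instead of a KD-tree, butted windows instead of the triangular overlap-add, no CDF mapping, and simplified fixed-size contexts (`window` and `ctx` are illustrative names):

```python
import numpy as np

# Burt-Adelson 5-tap smoothing kernel used by the demo's pyramids.
KERNEL = np.array([0.05, 0.25, 0.40, 0.25, 0.05])

def pyramid(x, num_levels):
    """Gaussian pyramid: 5-tap smoothing + 2:1 decimation per level.
    levels[0] is the full-rate signal, levels[-1] the coarse envelope."""
    levels = [np.asarray(x, float)]
    for _ in range(num_levels - 1):
        smoothed = np.convolve(levels[-1], KERNEL, mode="same")
        levels.append(smoothed[::2])
    return levels

def feature(level, coarser, start, ctx):
    """Causal context at this level + context one level up."""
    half = ctx // 2
    c = start // 2
    return np.concatenate([level[start - ctx:start],
                           coarser[c - half:c + half]])

def synthesize(base, train, num_levels=4, window=8, ctx=8):
    """Coarse-to-fine: keep the base's coarse envelope, then fill each
    finer level with the best-matching training windows."""
    tp = pyramid(train, num_levels)
    out = pyramid(base, num_levels)
    for lev in range(num_levels - 2, -1, -1):
        out[lev][:] = 0.0           # top level is kept; the rest are rebuilt
        tlev, tcoarse = tp[lev], tp[lev + 1]
        for start in range(ctx, len(out[lev]) - window, window):
            f = feature(out[lev], out[lev + 1], start, ctx)
            # brute-force NN over training windows (KD-tree in the demo)
            dists = [np.sum((feature(tlev, tcoarse, ts, ctx) - f) ** 2)
                     for ts in range(ctx, len(tlev) - window)]
            best = ctx + int(np.argmin(dists))
            out[lev][start:start + window] = tlev[best:best + window]
    return out[0]
```

Note how the feature at level ℓ reads already-synthesized samples of `out[lev]` (causal) and the fully synthesized `out[lev + 1]`, which is what locks the synthesized envelope to the input's at the coarsest level.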
What the controls do
-
numLevels: pyramid depth. More levels
means a coarser top level (slower envelope) and more detail
layers; default 6 (top-level sample rate ≈ 1.4 kHz
at 44.1 kHz audio).
-
windowHW (samples): output window
half-width at every level. Smaller windows let the synthesis
switch training contexts more often; larger windows are smoother
but blockier.
-
featureHW (samples): feature context
half-width. The feature vector at level ℓ has
windowHW · featureHW + 1
causal samples plus 2 · windowHW
· (featureHW + 1) + 1 coarser samples. With the
defaults (windowHW = 4, featureHW = 3) that's 13 + 33 = 46 dims.
-
falloff: exponential weight on feature
dimensions: exp(−falloff ·
|distance|). 0 (the paper default) means uniform; large
values bias the NN search toward local context.
-
scalingAlpha: how aggressively the CDF
mapping rescales training amplitudes. 0 disables it (same as
unticking
scaleCDF); 1 is the paper default.
-
scaleCDF: master toggle for
section 5.3 dynamic-range mapping.
-
RNG seed: integer seed for the shared
PCG32 generator. Used here only for tie-breaking; on these
signals you'll rarely see it change the output.
-
epsANN: nearest-neighbour approximation
tolerance. The browser demo defaults to 5.0, which prunes the
KD-tree very aggressively and is roughly 5-10× faster
than exact search. The perceptual difference vs exact NN is
below the noise floor on the bundled examples; in fact, on
torch and candle, exact NN produces a tonal buzz that
approximation actually breaks (see the failure-mode discussion
below). The original C++ release defaults to 1.0. Set the
slider to 0 for exact NN (deterministic, matches the Python
reference bit-for-bit; the Tier 2 goldens always use 0).
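The dimension bookkeeping and weighting above can be checked directly (function names are illustrative; the formulas are the ones stated for featureHW and falloff):

```python
import numpy as np

def feature_dims(windowHW, featureHW):
    """Feature-vector sizes at a non-top level, per the formulas above."""
    causal = windowHW * featureHW + 1
    coarser = 2 * windowHW * (featureHW + 1) + 1
    return causal, coarser

def falloff_weights(distances, falloff):
    """Per-dimension NN weights: exp(-falloff * |distance|).
    falloff=0 (the paper default) gives uniform weighting."""
    return np.exp(-falloff * np.abs(distances))

causal, coarser = feature_dims(4, 3)  # demo defaults: 13 causal + 33 coarser
```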
Things to notice
-
Pick the Dragon example and toggle
scaleCDF on/off: with it on, the loud "fire-
breathing" bursts of the input drive correspondingly loud
bursts in the output (CDF mapping pumps amplitude to match);
with it off, the output is more uniformly textured.
-
Drag scalingAlpha from 0 to 1: at 0 you hear
the training texture untouched; at 1 it's been re-scaled to
match the input's amplitude statistics.
-
Compare the same flame in this demo with the
spectral bandwidth extension demo
(link). The
bandwidth-extension version is "synthetic crackle" matched to
the envelope; this version is "real recorded crackle" matched
to the envelope. They're often perceptually similar, but the
texture-synthesised version has temporal events (pops, micro-
rumbles) that pure power-law noise can't reproduce.
-
Watch the output pyramid waveforms: the top
level (coarsest) is the input's slow envelope; each lower
level is a higher-octave detail layer added by the synthesis
loop.
-
Watch the spectra plot: the output PSD (green)
should sit roughly between the base PSD (blue) at low
frequencies and the training PSD (orange) at high frequencies,
with a smooth crossover that's chosen automatically by the
coarse-to-fine search.
A failure mode of the CDF mapping
The dynamic-range mapping in §5.3 multiplies matched
training windows by a per-window scale factor so the output's
amplitude statistics track the input's. When the input has a
much wider dynamic range than the training data, those scale
factors get large; consecutive windows tend to receive similar
large factors; the triangular overlap-add then reinforces them
constructively into a tonal buzz at the windowHW stride
frequency. The paper itself flags this in §6 (“...the
method still has difficulty producing a suitable, temporally
coherent output sound. This can occur in cases when the
low-frequency input has a very wide dynamic range, while the
training data has a small range”).
Among the bundled examples, both torch and
candle hit this pathology at
scalingAlpha = 1.0 (the C++
release default). The demo therefore loads them with
scalingAlpha = 0.5, which blends
in half the rescale and avoids the artefact while keeping the
qualitative dynamic-range matching the algorithm wants. Drag the
slider back to 1.0 on either example to hear the buzz; drag down
to 0.0 to hear the texture without any rescale at all. The other
three examples (burning_brick, dragon, flame_jet) ship at the
canonical 1.0.
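The per-window rescale can be sketched under stated assumptions: the sorted-magnitude arrays serve as empirical CDFs, the percentile lookup is a plain index mapping, and the scalingAlpha blend toward 1 is assumed linear (the demo's exact interpolation may differ):

```python
import numpy as np

def cdf_scale(avg_mag, base_sorted, train_sorted, alpha=1.0):
    """Map a window's average magnitude through the base-signal CDF
    into the training CDF; return the alpha-blended rescale factor
    (paper sec. 5.3, eq. 7 plus partial mapping)."""
    # percentile of avg_mag among the base signal's sorted magnitudes
    q = np.searchsorted(base_sorted, avg_mag) / len(base_sorted)
    # amplitude at the same percentile of the training signal
    idx = min(int(q * len(train_sorted)), len(train_sorted) - 1)
    r = train_sorted[idx] / max(avg_mag, 1e-12)
    # assumed linear blend: alpha=0 disables the mapping, alpha=1 is full
    return (1.0 - alpha) + alpha * r
```

When the base signal's magnitude range is much wider than the training signal's, consecutive loud windows all receive similarly extreme factors, which is the regime where the overlap-add buzz described above appears.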
Caveats and things this model is not
-
The output's high-frequency content is borrowed from the
training signal, not predicted by physics. If you train on a
candle, your dragon's roar will sound suspiciously candle-like
at high frequencies. The paper frames this as a feature: by
choosing the training clip you choose the texture, while
keeping the simulator's envelope. Picking training audio that
matches your scene is part of the artistic process.
-
Two simulations with very different small-scale combustion
structure but the same envelope will sound similar. If your
application depends on the high-frequency physics, this
synthesis step is not the right tool.
-
The browser demo uses approximate nearest-neighbour
(epsANN = 5.0 by default), as
does the released C++ (which uses the
ANN library
with epsANN = 1.0). The
parity check runs both ports
with epsANN = 0 (exact NN) and
verifies they match within FFT round-off.
-
Defaults match each example's bundled
default.xml:
numLevels=6, windowHW=4, featureHW=3,
falloff=0, scaleCDF=1, scalingAlpha=1. Two examples
ship with a per-demo override
scalingAlpha=0.5
(torch and candle) to avoid
the CDF-mapping pathology described above. The Reset button
restores each example's bundled defaults.
References
-
Chadwick, J. N., and James, D. L. (2011). Animating Fire
with Sound. ACM Transactions on Graphics (SIGGRAPH 2011),
30(4).
cs.cornell.edu/projects/Sound/fire
·
paper PDF
-
Wei, L.-Y., and Levoy, M. (2000). Fast texture synthesis
using tree-structured vector quantization. SIGGRAPH 2000,
479–488. (parent multi-resolution texture synthesis
algorithm.)
-
Burt, P. J., and Adelson, E. H. (1983). A multiresolution
spline with application to image mosaics. ACM TOG, 2(4),
217–236. (Gaussian pyramid construction.)
-
Heeger, D. J., and Bergen, J. R. (1995). Pyramid-based
texture analysis/synthesis. SIGGRAPH 1995, 229–238.
(source of the histogram-matching technique adapted in
§5.3.)
-
Training audio courtesy of The Recordist's
Ultimate Fire sound library, redistributed under the
permission inherited from the original Cornell release.