[mb-devel] f(t) pseudocode

Geoff Schmidt gschmidt at gschmidt.org
Sat Oct 29 16:03:00 UTC 2005


This assumes that the code reads a configuration file that has all of
the constants it needs. I've included a little spec for such a file
format, which hopefully is easily readable with scanf, and should have
enough header data that we can change our mind about what fields we need
in the future. Maybe in the early stages the client could shell out to
wget and get the current config file every night, so we could push a
configuration change and get feedback the next day.

File format:
<the magic word "FingerprintConfiguration">
<the name of this configuration>
<the name of the DSP strategy to use -- for now, the magic word
"logpower-linear-vq">
<signal_sample_rate, a real number giving the sampling rate to
downsample the incoming PCM to>
<analysis_window, the width of the FFT analysis window, in samples>
<sample_interval, the distance between successive evaluations of f(t),
in samples>
<basis_vector_count, a non-negative integer>
<basis_vectors, a series of basis_vector_count 'basis vectors', each of
which is just analysis_window / 2 reals>
<codebook_size, a non-negative integer>
<codebook, a series of codebook_size 'codebook entries', each of which
is just basis_vector_count reals>
<symbols, a series of codebook_size raw ASCII characters, none of which
may be whitespace characters>

Example:
FingerprintConfiguration
sampleconfig
logpower-linear-vq
8192
2048
768
3
.0232 .0423 .1310 .0224 ... (total of 4096 numbers)
.9798 .1223 .0001 -2.9343 ... (total of 4096 numbers)
.0000 .0000 .5000 .0000 ... (total of 4096 numbers)
5
.8970 .7669 -.7688
0 0 0
-1 -2 10 .237
4 -4 4
.123 .456 .789
abc

Pseudocode:

[Processing blocks of audio.]
0. Parse the configuration file. If the DSP strategy isn't
'logpower-linear-vq', fail with the error "Unsupported DSP strategy."
1. If the input stream is compressed, decode it to PCM. If it is stereo,
mix it down to mono, mixing both channels equally. 
2. Resample the input signal to signal_sample_rate. It's very important
to lowpass-filter the signal at the same time, instead of just dropping
samples. You can probably get code to do this out of the open-source
'sox' utility.
3. Read the first analysis_window samples, put them in a buffer, and
apply the following steps. Then slide all elements of that buffer "to
the left" by sample_interval samples, and read sample_interval more
samples from the input signal into the "right" side of the buffer. Do
the following steps again. Repeat until there are fewer than
sample_interval samples available on the input.

[Normalizing and taking a power spectrum.]
4. Make a copy of the analysis_window input samples that we are about to
process. This is a good time to promote the samples to 'double' if they
are still of integer type.
5. Find the average (mean) value of all of the samples. Subtract it from
each sample. Now the mean should be 0. In other words, remove any DC
offset.
6. Find the mean absolute value of all of the samples. Divide each
sample by this mean, making the mean absolute value 1.0. (Unless the
mean absolute value is very small, say less than 1% of the maximum
possible value based on the scaling of the input samples -- in that case
replace the samples with all 0's. 1% would be 1.28 in the case of 8-bit
input samples or 327.68 in the case of 16-bit input samples.) This is
gain control; in other words it ensures that volume level changes won't
change the fingerprint.
7. Multiply the samples by a Hanning window. In other words, compute a
Hanning window in an analysis_window-point-long array, and multiply each
sample by the corresponding point in the Hanning array. Google will give
you the formula for the Hanning function.
8. Perform a real-valued FFT. Ignoring the "mirroring" because of the
real-valued input, the output is analysis_window / 2 + 1 complex
numbers. An excellent GPL FFT library is FFTW.
9. Square the complex numbers to get analysis_window / 2 + 1 real
numbers: the power spectrum. The first one is the DC offset, and we know
it is zero because of the normalization we did. Discard it, leaving
analysis_window / 2 reals. Call this power_spectrum.
10. Take the natural logarithm of each element in power_spectrum, except
that if an element is extremely close to zero, set it to zero instead.
This has the effect of expressing power in terms of a decibel-like
measure relative to some unspecified physical reference signal.

[Arbitrary linear transformation.]
11. Multiply the basis_vectors matrix by power_spectrum to get a vector
called transformed_vector that is basis_vector_count elements long:
11a. Allocate an array transformed_vector, basis_vector_count elements
long.
11b. For each element i of transformed_vector:
11c. transformed_vector[i] = 0
11d. For j = 0 to analysis_window / 2 - 1 inclusive:
11e. transformed_vector[i] += power_spectrum[j] * basis_vectors[i][j]

[Vector quantization.]
12. Find the codebook vector that is closest (in Euclidean distance) to
transformed vector) and record its index (an integer from 0 to
codebook_size-1 inclusive) as matched_vector:
12a. Set matched_vector = -1
12b. Set matched_distance2, a real, = MAXDOUBLE
12c. For i = 0 to codebook_size - 1 inclusive:
12d. Set this_distance, a real = 0
12e. For j = 0 to basis_vector_count:
12f. difference = codebook[i][j] - transformed_vector[j]
12g. this_distance2 += difference * difference
12h. End for loop
12i. If this_distance2 < matched_distance2, set matched_vector = i and
matched_distance2 = this_distance2

[Output.]
13. Output symbols[matched_vector] as the result of f(t) for this block.

Commentary:

My first stab at a simple set of basis vectors will be the product of
three matrices that drop all but the "telephone bandwidth" part of the
spectrum, do some filtering 'across the length of the vector' to reduce
the impact of equalization (since this is all happening in a log domain,
this would seem to be linear), and finally do dimensionality reduction
for convenience. As a first attempt at a codebook, I'll build a training
corpus of transformed_vectors and do basic k-means clustering. This,
together with a small sample_interval compared to analysis_window,
should be enough for us to at least get started on relatively clean
files and see where we stand on fidelity.

Geoff



More information about the MusicBrainz-devel mailing list