[mb-devel] Future-proof FP design sketch
Geoff Schmidt
gschmidt at gschmidt.org
Fri Oct 14 06:03:45 UTC 2005
> So all i really need to get started is the Magic Function that takes PCM
> and hashes it up, as well as a description of how to match the output to
> find similar songs.
OK, let's try to sketch something out.
Let's say I give you a function that takes some fixed amount of audio --
probably in the 250ms-500ms range -- and gives you back a small integer,
say in the range of 0 to 63, such that when the audio undergoes "worst
case distortion" for the worst case we're willing to handle, there is a
50% chance that the function will still return the same number. Let's
say this "distortion" includes not only things like noise, equalization,
and lossy encoding, but the movement of the "start point" of our 250ms
chunk in the track by up to +/- 50ms, and uniform speeding up or slowing
down of the playback rate by up to +/- 3%. I'll come back to how we get
this function in a minute.
You can imagine taking this function and sliding its 250ms "analysis
window" over a song, sample by sample. There are long stretches where
the same symbol is returned, because I specified that the function had
to be reasonably robust to movement of the window. Let's say that we
calculate the function every 100ms and that this is our fingerprint for
a stretch of audio -- 10 of these numbers between 0 and 63 per second,
essentially an extremely compressed description of the song. (The way we
take the samples is important. We shouldn't take them at t, t+100ms,
t+200ms, etc. We should take them at t +/- 50ms, t + 100ms +/- 50ms, t +
200ms +/- 50ms, etc. That way we never consistently get the worst case
of comparing two fingerprints taken at time bases 50ms apart.)
So say we apply this procedure to a bunch of songs and put them in a
database, and provide some kind of efficient substring search index. Now
say we have an audio sequence to be identified. We subject it to the
same fingerprinting process, and then search for substrings of the query
sequence that match substrings of records in the database. (It is
simpler to think of the database not as a collection of songs, but as
one big long sequence, with annotations labeling particular time ranges
of that sequence as particular songs.) This finds us all parts of all
songs in the database that match parts of our query (above a certain
length, presumably.) The substring matching process can't look for
perfect matches; it must be content to have up to 50% of the positions
in a pairing not match, because that is the performance of our
fingerprint function under distortion.
So how do you use this? You can take a 5-10 second sample of a song,
throw it at the database, and see what records contain it, to identify a
song. You can take an entire song, throw it at the database, and look
for the best match across the entire length of the song to distinguish
subtle variations of a song that differ in only a few time intervals,
like some clean edits. You can take a 10 hour recording of a radio
stream and find the songs that were played (find database entries that
are substrings of the query.) And you can do a neat trick where you
take, say, a month's recording of a radio station and load it into the
database blind, without annotations, and then match the same recording
against the database -- that is, you find the places an audio sequence
matches itself, which is to say the repeated segments in it, which is to
say the songs and commercials. A nice thing is that anyone who knows how
to do string handling can hack on applications like these, without
having to dive into DSP land or write a classifier.
(Collisions are essentially impossible in this scheme with any query
sequence length over a second or two. Two songs would have to have the
same time course over the length of the query. In my experience this
simply doesn't happen except for pathologically chosen fingerprint
functions. The exception to this is pathologically chosen audio, like
silence. What all of this means is that the sequence data we collect
could be used by many people for many different purposes for many
years.)
How can we do this indexing? First of all note that this is a
well-studied problem as it is the basis for the BLAST tool that is used
in bioinformatics to search for similar DNA and protein sequences. We
might even be able to use BLAST ourselves if we put an appropriate cache
in front of it and only blasted songs whose hash we had never seen
before. But for now let's assume we implement something optimized to our
needs and do some back-of-the-envelope calculations.
Let's say we want to prove feasibility for a billion songs using the
technology of the near future. Let's say that we want to index the
entire length of each song for posterity, not just a fixed ~30 second
window, which would be more than sufficient for mp3 file identification.
Suppose these songs average 4 minutes in length. Assume that the
worst-case "fidelity" of the fingerprint function (likelihood of
returning the same number under distortion) is 50% as given above. (We
should do much, much better under the expected case of mp3 encoding of a
CD rip. I picked 50% to see how much really degraded cases hurt us.)
Assume that each "symbol" (fingerprint function output) is a number
between 0 and 63. Assume that the fingerprint function f(t) is sampled
every 100ms, and that f(t) and f(t+100ms) are statistically independent.
Assuming that each "symbol" is equally likely to some out of f(t) is too
much to ask. Now, because the symbol distribution will be nonflat, we
won't get as much information out of evaluating f(t) as we would out of
rolling a 64-sided die or pulling 6 bits out of a true random number
generator. Assume that the "entropy of f(t)" is more like 5 bits, that
we learn as much as we would out of rolling a 32-sided die. That could
be 32 equally likely symbols and 32 that have virtually no chance of
being output, or it could be some other nonuniform distribution that has
the same net effect.
To index a sequence, we will divide it into overlapping "fragment," each
a series of 6 symbols. We'll break each song in the database into these
fragments and put each mapping (fragment -> song ID containing it at
least once) in a giant hash table. Given a query sequence, we'll break
it into fragments too, look each fragment up in the hash table, and do a
more detailed analysis on each song that had more than n matching
fragments. The principal here is that as the length of the query
sequence increases (meaning a linear increase in the amount of work),
the expected number of fragment matches for a "correct song match"
increases much faster than that for a random track in the database that
is matching only "coincidentally," so we are better and better able to
pick an n that separates the two classes of songs even as we minimize
the chance that we could miss a correct match.
Because symbols, even if adjacent, are independent, the "entropy of a
fragment" is 6 symbols * 5 bits/symbol = 30 bits. That means if we pick
two random fragments, the chance of them matching is one in 2^30, about
one in a billion. We are free to create as many of these fragments per
second of audio as we want, as long as we do it in a consistent way. We
could take 6 adjacent symbols at each 100ms sample point, giving 10
fragments per second. We could do that every 200ms (or better, do it
every 100ms, hash the fragments, and discard the 50% of the with the
lower set of hash values) and get 5 fragments per second. We could
sample twice at every position, the second time skipping the second
symbol and including an extra one at the end, and get 20 fragments per
second. As we take more fragments, the size of our index increases
linearly, and the length of the query sequence we need under adverse
conditions decreases (sublinearly, I think, because the fragments
overlap.) Let's say we take 10 fragments a second, and let's
make-believe that they don't overlap, for purposes of making some
independence assumptions about them later.
There are 4*60*10^9 seconds of audio in the database, and 4*60*10^9*10
fragments in the index, or 2.4e12 fragment records (assuming for
simplicity that even if a fragment appears twice in one song, we store
it twice instead of optimizing and storing it once.) When we do a query
for a fragment, the chance that a random record in the database will
match by chance is one in 2^30. 2.4e12 / 2^30 = an expected 2235 song
ID's returned for any given fragment due to coincidence. Additionally,
our results might include something noncoincidental: a match to the
"correct" song. Since all six symbols must match and there is a 50%
chance that each of them will, if we assume that the failures are
independent, there is a one in 64 chance that the song ID of the correct
match will be in the results.
If we check 300 fragments from the query, which at 10 fragments a second
is 30 seconds of music, and there is one chance in 64 of getting a
correct match at each position, there is a .9489 chance of getting at
least two fragment matches for the correct song (I used the binomial
distribution calculator at
http://psych.rice.edu/online_stat/java/binomialProb.html for this.) The
expected number of matches is 300 / 64 = 4.7. On the other hand, we'll
have a total of about 2235 * 300 = 670500 coincidentally returned song
IDs, each a randomly (uniformly, independently, we assume) selected
number between 0 and a billion minus one. Using the binomial
distribution again and a Python program, I estimated that the expected
number of song IDs that will coincidentally appear at least twice across
all 300 fragments as 224.
So after we complete this process, we'll have a few hundred song IDs.
Most will be spurious, but with 95% confidence, even if 50% of the
symbols in the fingerprint got changed to random values, the correct
song will be in there (if there are multiple correct matches, each has a
95% chance.) We'll load all the song fingerprints off of disk and send
them to the client, which can take the edit distance or whatever to
distguish the correct match from the spurious matches, which will be
obviously wrong (random symbols plus two six-symbol shared substrings
with the query somewhere, not even in the right place relative to each
other.) I said 95% confidence, so what about the other 5% of the time?
Well, you can keep scanning through the query sequence, boosting the
odds as you go, until you fall off the end of the track and have to
declare defeat.
How long will all of this take? Putting 670500 entries in a hash table
is pretty fast; the dominating factor is a cache miss for each entry.
(You don't need a full hash table, just a big bit array. When you see a
song ID, hash it to get the bit index. If that bit is not set, the song
ID is being seen for the first time, so set the bit and continue. If
it's already set, we're probably seeing this song ID for the second
time, so add the song ID to the results list if it's not already there.)
I did a super-approximate test just now and my low end emachines Celeron
can take that many cache misses in 24ms. The expensive part is the 300
lookups in the fragment database.
I said that there were 2.4e12 fragment records. Let's say that each is a
32-bit integer, with the lower 31 bits giving the song ID (snugly
fitting our billion songs) and the high bit indicating the end of a list
of entries. That means the total amount of data, without the table of
pointers, is 9.6e12 bytes, or 8900 gigabytes, so it must go on disk and
not in RAM. 500 GB drives and even larger are routinely available today,
so let's say that the "near future" I mentioned holds 1 TB drives. At
that capacity a billion song database could feasibly fit on one server
with a big RAID card (voice of experience, except with 2001 technology)
though more likely it'd be distributed over a few machines with queries
routed by the fragment ID requested. If storage prices follow historical
trends so that these drives can be had for under $500 each, the total
cost of the storage (for the index) would be under $5k.
Then we'd need a file that say, given a fragment, where its associated
data is. A fragment is 6*6 bits long (remember, not uniformly
distributed, so "worth" only 5*6 bits) so there are 2^36 possible
fragments, most of which will probably get used at least once somewhere,
and we'll need to associate with each some sort of 64-bit pointer = 2^3
bytes, so we're looking at 2^39 bytes. That is 2^30 * 2^9 = 512 GB,
again large enough that no compression will make it fit in memory, so
figure that's on disk too.
The dominating factor in looking up a fragment to find the song IDs
whose sequences contain it is disk seek time. Typical consumer hard
drives claim to do a random seek in around 10ms. To look up a fragment
takes two seeks, one to read the table, and one to read the actual song
ID list, which at 2235 * 4 = ~9k doesn't take a significant amount of
time to read after the disk head is positioned. We are figuring on 300
fragment lookups, so that's 600 seeks, or 6000ms = 6 sec. This time can
be divided over all of the disks in the system, so in a ten-disk system
we'd have a 600ms query time.
600ms (or worse, 6 sec on a single-disk server!) is ages, but *remember
that this is for the "worst case distortion"* where 50% of the symbols
can change. Let's say that a symbol is only 25% likely to get scrambled,
so the chance of a fragment matching is ~18%. Then instead of 300
fragments, we only have to check 25 to have a .9522 chance of getting
two correct fragment matches, and we average only 1.56 other spurious
song ID matches to send to the client instead of 224. If we check 40
fragments, we'd have a .9962 chance, and average 4.00 spurious matches.
These would take 500ms and 800ms of disk time respectively; 50ms and
80ms divided over 10 disks. In reality I'm hoping we'll get more like
90% fidelity on clean, fresh-ripped audio; after all, there are a lot
more than 64 distinctly different sounds 100ms long. To adapt to both
scenarios, I imagine that the search process would be under client
control. The client would send a "next match" command until it was
satisfied with the results it had gotten.
[Edit: with a clever disk allocation scheme we could eliminate the table
and its seek and speed things up almost 50%. Just seek to an approximate
disk location based on hashing the fragment and read until you see a
header indicating the start of the fragment you want. You'll read for a
lot longer, but you'll avoid the extremely expensive seek.]
Remember also that the above is a thought experiment at the billion-song
scalability point. To get started, we'd just start collecting the
fingerprints and throw together a small-capacity server, with a simple
in-memory database or maybe a SQL backend, possibly even in a language
like Python. We can use whatever inefficient, lightweight indexing
scheme we prefer, and tune it to our small database (for example, make
the fragment length 5 symbols instead of 6, making the match odds much
higher and the required query length much shorter.) As database size and
load increases, we can do whatever we want to the indexing, transparent
to the client.
OK, at the top I promised to say something about the DSP side of things
-- the magic function that takes a block of audio and returns one of
these symbols. It turns out that the naive way to do this works
surprisingly well on digital files, with some basic tuning. Downsample
to a 8khz sampling rate or so ("telephone bandwidth) making sure to use
a lowpass filter to take out anything above the Nyquist frequency as you
do so. Remove DC offset (figure out what you need to subtract or add to
the sample to make the average sample 0 and do so.) Normalize volume
level (for example, find the mean deviation from zero and multiply the
signal by the appropriate value to make it 1.0.) Now take an FFT, using
of course an appropriate window function, and throw away the phase
information leaving only the power spectrum. Beforehand, come up with a
"codebook" of specta, each associated with a symbol. The output of the
function is the symbol associated with the codebook spectrum with the
minimum Euclidean distance to the observed spectrum. To make this
cheaper, you can do dimensionality reduction on the power spectrum and
the codebook, so you're taking Euclidean distance in 16- or
8-dimensional space instead of 2048-dimensional space. Principal
component analysis works just fine, but that's outside the scope of this
email. To build the codebook in the first place, build a training corpus
of example spectra and use your favorite clustering algorithm, such a
k-means.
The reason such a simple thing works is that you are using the FFT to
leak large amounts of temporal information (you are blurring time, much
like space is blurred when you take off your glasses, and very much like
jpegs are blurred at high compression ratios.) This happens two ways.
First of all, when you throw away the phase information, you're throwing
away the half of the FFT's output where most of the time structure
information ended up. Second, because you carefully picked a FFT window
that was a lot wider than your fingerprint function sampling interval,
when the window is jittered back and forth a bit, only a small amount of
information is entering and leaving the analysis window. The other
reason such a simple thing works is that even an 8-dimensional space is
very difficult for people to visualize, and Euclidean distance between
spectra doesn't work out quite the way you think it will.
At Tuneprint, we augmented this with a full psychoacoustic stack, going
from the physical power spectrum to a physiological neural excitation
plot. That is overkill. However it's probably reasonable to rescale the
frequency axis logarithmically, or according to the Bark scale (the
human frequency sensitivity curve) and maybe take the log of the energy.
This is dangerous territory because only mp3 encoders care how human
ears work. (mp3 encoders and people trying to defeat copy protection
measures without fucking up the way a song sounds.) Other distortions
see only the physical signal and distort accordingly. We don't have a
lot to fear from mp3 encoders, especially under 4khz.
As for other distortions, white noise nudges our point a random amount
in spectrum space, and a codebook-based approach will usually be really
resilient to this. Equalization is trickier. The first question is, how
much robustness/insensitivity do you actually want? A bit of sensitivity
to equalization changes can be great for distinguishing different
masters of the same work, but in an application where you're trying to
identify what's playing overhead in the supermarket using a crappy
microphone, you want a great deal of robustness. The quick hack to get
robustness is to filter the power spectrum, dividing each frequency bin
by the average of those around it. A better-performing answer is to use
a "feature-detector" based fingerprint function, which could be based on
features like the locations of likely note attacks. In the
sequence-based server architecture I talked about above, to get a
symbol, we'd combine the features into a vector and throw the vector
into a classifier. One problem with feature-based approaches is, what if
there are no features in this block of audio? Another distortion is
structured noise, like a person talking over a song or two songs mixed
together. If the noise has known properties and you have a good sample
of it, you can increase robustness by fingerprinting in a linear
subspace where the noise isn't, but feature-based approaches seem most
promising to me in environments with lots of structured noise.
To take a bit more of a theoretical view, you can divide everything
after taking the spectrum and before the codebook lookup into two
categories, linear operators and nonlinear operators. Any number of
linear operators can be lumped together into one big matrix
multiplication. The fingerprint function I'm thinking of will have three
parts. First of all, there will be some fixed nonlinearities like the
volume level scaling and the psychoacoustic normalization if we use it.
The output of all of this will be normalized spectra. Then there'll be a
multiplication by a Magic Transformation Matrix, which will project each
spectrum down into 8 dimensions or so. I will probably make this with
independent component analysis and some hacks to suppress dimensions of
variation that correspond to those in a noise training set; this is one
way to sneak in equalization robustness. Then there'll be a search
through a Magic Codebook with from a few dozen to a few hundred entries.
Each entry'd be a point in 8 dimensional space, and you'd scan through
the codebook, computing the square of the distance to each entry to find
the best match. I'd make the codebook with k-means, maybe with a bias to
keep the cells from getting too small (negative effect on robustness)
and as balanced in probability as possible (to maximize the information
content of a symbol.) On the other hand you can also argue that the
codebook should just be a regular quantization grid. I had success with
both approaches at TP.
What I'd like to do is make the client code read the Magic Matrix and
the Magic Codebook out of a file, perhaps even one that can be updated
by the server (they are tables sized around 1024x8 and 64x8
respectively). This would make it easy to do some large-scale testing
around the beginning.
Does this design make sense? Questions/comments/ideas/competing
proposals? Do you see any math errors?
The Powerbook that I usually run Matlab on is in for service right now,
so it'll be a bit before I can get cranking on Magic Matrices and Magic
Codebooks. But if there are people who are interested in trying my
approach, I can get some pseudocode out for the other parts, so we can
slot the tables in when I get my laptop back.
Also, if anyone wants my Tuneprint paper I can send you a PDF in email.
It has some interesting but irrelevant bits on subjects I didn't go into
above, like searching for timebase alignment hypotheses, the design of
TP's server cluster, our data on spectra dimensionality and clustering,
and my coauthor Matthew Belmonte's psychoacoustics work.]
Geoff Schmidt
gschmidt at gschmidt.org
More information about the MusicBrainz-devel
mailing list