[mb-devel] audio fingerprinting part 2: fdmf
Ben
bench at silentmedia.com
Mon Oct 10 20:15:02 UTC 2005
The short version is that I think that the algorithm used by fdmf is
a viable replacement for TRMs. As for the details..... well that's
the long version. :)
Overview of fdmf
-----------
fdmf is a GPL'd package available from http://www.w140.com/audio/. It
is designed to Find Duplicate Music Files (hence the less than catchy
name), and a large part of that task involves fingerprinting songs
and comparing the fingerprints. In other words, it is does the same
thing that TRMs attempt to do. The details of the fingerprint
generation are all laid out in the package README, but basically it
goes like this: take a binary file, decode it to wav, do some
spectral analysis, and then quantize the results into 3 separate 256-
bit fingerprints representing different things about the same song.
That's right, you end up with *three* fingerprints, for a total fdmf
signature of 768 bits. For clarity, I'll use "fingerprint" to talk
about one of the 256-bit keys, and "signature" to refer to all 3
fingerprints combined.
Already we can see a couple differences from TRMs. First, signature
generation is entirely client side. This is a good thing. Second,
signatures are nothing more than a reduction of existing data,
meaning that they are neither random nor likely to collide with
anything other than a very similar wav. This is also a good thing.
Third, the fuzzy logic of the matching is applied at match time, not
signature generation time. While this takes more cycles overall and
makes matching more expensive, I will later argue that this ends up
being a good thing as well.
I might not have made it clear, so let me say this explicitly: like
CD Index IDs, fdmf signatures are overly specific. It's quite likely
that different binary versions of the same song will have different
fdmf signatures. In fact, it's almost certain that there will be a
different signature for most (if not every) different binary file out
there. (This is in stark contrast to TRMs.) But if things that should
match each other will have different signatures, how will the matches
be detected?
Like I said, a fdmf signature is actually comprised of three 256-bit
fingerprints. These are not random, like TRMs are, but decompositions
of the song. In order to tell if two songs are the same, the hamming
distance between each of the corresponding fingerprints is computed,
and if the distance of all three fingerprints is less than some
threshold (call it T), then the songs are considered to be the same.
(The hamming distance is the number of bits that are different
between two bit strings.) Each of the 3 fingerprints might use a
different value for T (remember they measure different things) so T
is really an array of 3 thresholds, each ranging from 0-255. Also,
just to be clear, T should stay constant once settled upon, or else a
match check may return different results.
If T is set to be {0,0,0} then a given fdmf signature will only match
itself. Because fdmf signatures are overly specific, that is
undesirable. If T is set to be {255,255,255} then a given fdmf
signature will match anything. That is obviously bad. So the trick is
to figure out the optimal values for T. We would want values to be
large enough to recognize different versions of the same song, but
not so large as to think that two different songs are the same. The
author suggests some values that work for him, but my experiments
have show them to be too large.
How well fdmf works
-----------
I've run the algorithm over a large, diverse population of music, and
I'm happy to say that I didn't waste my time and that it works pretty
well. Certainly much better than TRMs. What's more, if we can accept
that a logical song might have many fdmf signatures (just like how a
logical album might have many CD Index IDs), then we simply set the
thresholds low and suddenly the algorithm works extremely well -
assuming we have people associating accurate signatures as they find
unknown signatures.
But enough qualitative talk.... here are my quantitative metrics:
1. The time to generate a fingerprint.
Because the entire file must be decoded, this depends on the file.
The fdmf author claims about 90% of this time is actually spent
decoding to wav, but, as that's a required step (unless somebody has
a bunch of wav they want to analyze....), then we can't throw it out.
As two points of reference, it takes an unloaded 2Ghz celeron about 7
seconds to generate a signature for a 4MB mp3 stored on a local
drive. It takes a fully loaded 1.6GHz PPC G5 about 20 seconds to
generate a signature for a similar file stored on an nfs mount.
2. The time to lookup a fingerprint.
As described above, looking for a matching fdmf signatures is not
nearly the trivial problem that a TRM lookup is. This is because the
fuzzy logic is applied at search time. A naive implementation would
need to take the search signature and compare it to every known
signature, returning which known signatures are under the threshold.
Obviously, O(n) algorithms are impractical for 4 million+ signatures.
(This is actually a fascinating computer science problem that many of
my friends spent way too much time on. If anybody wants to exercise
their CS theory, let me know and I'll give you the problem in an
abstract wording.) Unfortunately, we couldn't come up with anything
that was both accurate and less than O(n).... assuming that a newly
added signature has to be included in search results immediately.
If that assumption is NOT true - that is, if a signature doesn't have
to be included in search results until some time after it is known to
the system - then fdmf signatures can be made to simply be a matter
of an index lookup in SQL. In other words, cheap, much less than a
second, and still on par with TRM lookups. I'll get into the "how"
farther down.
3. How many logically different songs have the same fingerprint.
Ideally this would be 1.
Out of the 85,000 files I tested against, every file with the same
fdmf key also had the same md5 hash, was a different bitrate encoding
of the same source wav, or had been severely truncated to be less
than 3 seconds in length. The former two is what you'd expect, and
the latter is a corner case that I don't think any audio
fingerprinting will be to deal with gracefully.
4. How many logically different songs are returned in a given
fingerprint search. Ideally this would be 1.
This is completely dependent upon the threshold values chosen. Lower
thresholds keep this number low, but at the expense of the next
metric. In my sample set, any threshold less than {23,41,22} gave
ideal results. Thresholds less than {30,50,30} gave less than 5
incorrect matches, all of them related to various censoring levels of
lyrics. It is is expected that as the population of known songs
increases, the number of incorrect matches will eventually increase
with thresholds too large.
5. How many logically identical songs are missed by searching for the
fingerprint of any binary version of the song. Ideally this would be 0.
Again, this is completely dependent upon the threshold values chosen.
As the thresholds go up, this number goes down, but at the expense of
the previous metric. Unfortunately it is also completely dependent
upon the known songs. With my sample set, then, using the thresholds
{30,50,30} might have missed up to 36 logical songs, if every song
only had a single signature recorded. Using the thresholds of
{23,41,22}, up to 168 songs might have been missed.
6. How many fingerprints are required for logically identical songs.
Ideally this would be 1.
The fdmf algorithm really has no control over this. Because the
signatures are a reduction of the binary files, it is likely that
there will be as many signatures as there are binary files. While
that sounds daunting, remember that each signature is only 96B in
size, and does not need to be stored in memory.
As for the overall requirements:
- Generating fingerprints must scale. (at least 1,000/sec?)
YES - this is done entirely client-side. Hard to imaging something
more scalable than that. :)
- Looking up fingerprint matches must scale. (at least 1,000/sec?)
YES - although computationally expensive, one-time-only work can be
done before the signature is made available for search results. After
that work is done, it's a simple index lookup. Both the computation
and the SQL lookup can be load balanced over server farms, letting
this scale almost infinitely.
- Storing fingerprints must scale. (hold much more than 4M fingerprints)
YES - even though a signature is only 96B, it's quite likely that
every distinct binary file out there will generate its own signature.
So let's say an average of 20 signatures over 4 million songs, and we
still use less than 10GB of storage. By today's storage standards,
that's chump change. Still, if this is too much, a huffman tree
should be able to compress a signature a little bit, leading to some
unknown (but probably significant) level of space savings at the
expense of computation time when submitting a new signature. It
*should* go without saying that signatures do not need to be stored
in ram... but maybe I need to explicitly say that anyway. :)
- Free.
- Open source?
YES - the code is GPL'd. Does that mean it's patent-free? Hell if I
know. I'm no lawyer, but I suspect at the very least it means we'd
have plausible protection to not get sued before a cease and desist
was issued.
fdmf is not perfect. As noted above, it has trouble distinguishing
between very short snippets of sound, even though they are obviously
different to the ear. fdmf also has trouble differentiating between
versions of songs with uncensored vs. clean lyrics, though having a
tight enough threshold eliminates this problem. If the thresholds are
too tight, then songs that should be considered matches might not be
returned as such. This problem could be avoided, but only at the
expense of requiring more signature associations to a logical song.
That makes this less useful until a large set of signature
associations are built up.
fdmf is also not cheap in terms of backend resources. I'll get into
how to optimize it down below, but even though it can scale across
machines and the expense will be hidden from the end users, it will
require O(n) computations for each new signature, where n is the
number of known signatures. That will just keep getting more
expensive, though at least it will stay linear.
Finally, fdmf has some advantages over TRMs due to the difference in
algorithm strategy. Because the fuzzy logic is applied at search
time, that means that we can adjust those thresholds without having
to recompute signatures. It also means a client could specify the
precision of the matching they want to use. Both of these things are
impossible with TRMs, and seem to be real steps forward.
How fdmf might work for MB
-----------
From an end user perspective, fdmf signatures should be a drop-in
replacement for TRMs. The biggest change would probably be the
addition of libplot as a requirement. The signatures are obviously
longer than TRMs, but even when base64 encoded for UI display, they
are only 129 characters long. (If we wanted to shorten that up we
could encode the strings with ascii85, but I don't know what that
really gains us.)
Things are a little different on the backend. Obviously, the TRM
server goes away. But don't get rid of that hardware just yet.... now
is as good as time as any to describe how I envision to speed up
signature lookups from O(n) to very fast.
The trick is to notice that, as things are today, if a TRM isn't in
the system when you try to look it up, MB simply says it doesn't know
about it. fdmf signatures could work better, but that requires O(n)
computations at lookup time. So instead, this is what could happen.
If somebody searches for a signature that the system doesn't know
about, the system will simply report there are no matches, even if
there might be. At the same time, the system will record the search
signature and kick off O(n) computations, recording which known
signatures it matches. Once it has walked through all known
signatures and recorded the matches, it marks the new search
signature as "ready". The next time somebody searches for that
signature, all the matches are pre-cached, and the time spent is a
simple index lookup.
It's easy to see how this process naturally lends itself to parallel
computing. Therefore, even though adding a signature is expensive, it
can scale extremely well. The biggest bottleneck will be the database
which serves the computing farm, but because most of that load is
read-only, it's trivial to scale it across database servers. Not that
*that* is cheap either.... but it does scale well.
I've been recording all my data in postgres, so database support
shouldn't be a problem. I imagine that MB will want to have a
slightly more refined schema than what I've been using. :) I have
*not* yet cleaned up the install requirements or packaging of fdmf,
so for instance the signature generation is still a perl script which
calls decoders and the spline program from the plotutils package,
before dumping the key into postgres.... obviously, that's hardly
portable. But there's no reason why it shouldn't be easy to make some
ansi C tool that calls uses libmad and libplot to do all the same
work, and then sends the result to some tcp socket or generic web
service. I can't speak for windows, but that would let things run
just fine on posix.
Summary
-----------
Well, this has been a lengthy email. If you've gotten this far,
congratulations. Let me recap.
fdmf.....
1. appears to be a viable replacement to TRMs.
2. looks capable of avoiding all the problems TRMs have.
3. allows us to change the fuzziness of the searching sever side
without having to re-compute all signatures, if it turns out our
thresholds were set wrong.
4. allows clients to specify how fuzzy they want their searches to be.
5. will scale very well, but will probably require more CPU resources
than TRMs do.
As for next steps.... I don't know. I assume people would want to try
this on their own collections? If so, I can polish up the scripts
I've written. Perhaps we want to involve the author of the tool?
Should I spend time repackaging the fdmf signature generator into a
real posix app?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.musicbrainz.org/pipermail/musicbrainz-devel/attachments/20051010/5617f311/attachment.htm
More information about the MusicBrainz-devel
mailing list