[mb-devel] audio fingerprinting part 2: fdmf

Ben bench at silentmedia.com
Mon Oct 10 20:15:02 UTC 2005


The short version is that I think that the algorithm used by fdmf is  
a viable replacement for TRMs. As for the details..... well that's  
the long version. :)

Overview of fdmf
-----------
fdmf is a GPL'd package available from http://www.w140.com/audio/. It  
is designed to Find Duplicate Music Files (hence the less than catchy  
name), and a large part of that task involves fingerprinting songs  
and comparing the fingerprints. In other words, it is does the same  
thing that TRMs attempt to do. The details of the fingerprint  
generation are all laid out in the package README, but basically it  
goes like this: take a binary file, decode it to wav, do some  
spectral analysis, and then quantize the results into 3 separate 256- 
bit fingerprints representing different things about the same song.  
That's right, you end up with *three* fingerprints, for a total fdmf  
signature of 768 bits. For clarity, I'll use "fingerprint" to talk  
about one of the 256-bit keys, and "signature" to refer to all 3  
fingerprints combined.

Already we can see a couple differences from TRMs. First, signature  
generation is entirely client side. This is a good thing. Second,  
signatures are nothing more than a reduction of existing data,  
meaning that they are neither random nor likely to collide with  
anything other than a very similar wav. This is also a good thing.  
Third, the fuzzy logic of the matching is applied at match time, not  
signature generation time. While this takes more cycles overall and  
makes matching more expensive, I will later argue that this ends up  
being a good thing as well.

I might not have made it clear, so let me say this explicitly: like  
CD Index IDs, fdmf signatures are overly specific. It's quite likely  
that different binary versions of the same song will have different  
fdmf signatures. In fact, it's almost certain that there will be a  
different signature for most (if not every) different binary file out  
there. (This is in stark contrast to TRMs.) But if things that should  
match each other will have different signatures, how will the matches  
be detected?

Like I said, a fdmf signature is actually comprised of three 256-bit  
fingerprints. These are not random, like TRMs are, but decompositions  
of the song. In order to tell if two songs are the same, the hamming  
distance between each of the corresponding fingerprints is computed,  
and if the distance of all three fingerprints is less than some  
threshold (call it T), then the songs are considered to be the same.  
(The hamming distance is the number of bits that are different  
between two bit strings.) Each of the 3 fingerprints might use a  
different value for T (remember they measure different things) so T  
is really an array of 3 thresholds, each ranging from 0-255. Also,  
just to be clear, T should stay constant once settled upon, or else a  
match check may return different results.

If T is set to be {0,0,0} then a given fdmf signature will only match  
itself. Because fdmf signatures are overly specific, that is  
undesirable. If T is set to be {255,255,255} then a given fdmf  
signature will match anything. That is obviously bad. So the trick is  
to figure out the optimal values for T. We would want values to be  
large enough to recognize different versions of the same song, but  
not so large as to think that two different songs are the same. The  
author suggests some values that work for him, but my experiments  
have show them to be too large.


How well fdmf works
-----------
I've run the algorithm over a large, diverse population of music, and  
I'm happy to say that I didn't waste my time and that it works pretty  
well. Certainly much better than TRMs. What's more, if we can accept  
that a logical song might have many fdmf signatures (just like how a  
logical album might have many CD Index IDs), then we simply set the  
thresholds low and suddenly the algorithm works extremely well -  
assuming we have people associating accurate signatures as they find  
unknown signatures.

But enough qualitative talk.... here are my quantitative metrics:

1. The time to generate a fingerprint.
Because the entire file must be decoded, this depends on the file.  
The fdmf author claims about 90% of this time is actually spent  
decoding to wav, but, as that's a required step (unless somebody has  
a bunch of wav they want to analyze....), then we can't throw it out.  
As two points of reference, it takes an unloaded 2Ghz celeron about 7  
seconds to generate a signature for a 4MB mp3 stored on a local  
drive. It takes a fully loaded 1.6GHz PPC G5 about 20 seconds to  
generate a signature for a similar file stored on an nfs mount.

2. The time to lookup a fingerprint.
As described above, looking for a matching fdmf signatures is not  
nearly the trivial problem that a TRM lookup is. This is because the  
fuzzy logic is applied at search time. A naive implementation would  
need to take the search signature and compare it to every known  
signature, returning which known signatures are under the threshold.  
Obviously, O(n) algorithms are impractical for 4 million+ signatures.  
(This is actually a fascinating computer science problem that many of  
my friends spent way too much time on. If anybody wants to exercise  
their CS theory, let me know and I'll give you the problem in an  
abstract wording.) Unfortunately, we couldn't come up with anything  
that was both accurate and less than O(n).... assuming that a newly  
added signature has to be included in search results immediately.

If that assumption is NOT true - that is, if a signature doesn't have  
to be included in search results until some time after it is known to  
the system - then fdmf signatures can be made to simply be a matter  
of an index lookup in SQL. In other words, cheap, much less than a  
second, and still on par with TRM lookups. I'll get into the "how"  
farther down.

3. How many logically different songs have the same fingerprint.  
Ideally this would be 1.
Out of the 85,000 files I tested against, every file with the same  
fdmf key also had the same md5 hash, was a different bitrate encoding  
of the same source wav, or had been severely truncated to be less  
than 3 seconds in length. The former two is what you'd expect, and  
the latter is a corner case that I don't think any audio  
fingerprinting will be to deal with gracefully.

4. How many logically different songs are returned in a given  
fingerprint search.  Ideally this would be 1.
This is completely dependent upon the threshold values chosen. Lower  
thresholds keep this number low, but at the expense of the next  
metric. In my sample set, any threshold less than {23,41,22} gave  
ideal results. Thresholds less than {30,50,30} gave less than 5  
incorrect matches, all of them related to various censoring levels of  
lyrics. It is is expected that as the population of known songs  
increases, the number of incorrect matches will eventually increase  
with thresholds too large.

5. How many logically identical songs are missed by searching for the  
fingerprint of any binary version of the song. Ideally this would be 0.
Again, this is completely dependent upon the threshold values chosen.  
As the thresholds go up, this number goes down, but at the expense of  
the previous metric. Unfortunately it is also completely dependent  
upon the known songs. With my sample set, then, using the thresholds  
{30,50,30} might have missed up to 36  logical songs, if every song  
only had a single signature recorded. Using the thresholds of  
{23,41,22}, up to 168 songs might have been missed.

6. How many fingerprints are required for logically identical songs.  
Ideally this would be 1.
The fdmf algorithm really has no control over this. Because the  
signatures are a reduction of the binary files, it is likely that  
there will be as many signatures as there are binary files. While  
that sounds daunting, remember that each signature is only 96B in  
size, and does not need to be stored in memory.

As for the overall requirements:
- Generating fingerprints must scale. (at least 1,000/sec?)
YES - this is done entirely client-side. Hard to imaging something  
more scalable than that. :)

- Looking up fingerprint matches must scale. (at least 1,000/sec?)
YES - although computationally expensive, one-time-only work can be  
done before the signature is made available for search results. After  
that work is done, it's a simple index lookup. Both the computation  
and the SQL lookup can be load balanced over server farms, letting  
this scale almost infinitely.

- Storing fingerprints must scale. (hold much more than 4M fingerprints)
YES - even though a signature is only 96B, it's quite likely that  
every distinct binary file out there will generate its own signature.  
So let's say an average of 20 signatures over 4 million songs, and we  
still use less than 10GB of storage. By today's storage standards,  
that's chump change. Still, if this is too much, a huffman tree  
should be able to compress a signature a little bit, leading to some  
unknown (but probably significant) level of space savings at the  
expense of computation time when submitting a new signature. It  
*should* go without saying that signatures do not need to be stored  
in ram... but maybe I need to explicitly say that anyway. :)

- Free.
- Open source?
YES - the code is GPL'd. Does that mean it's patent-free? Hell if I  
know. I'm no lawyer, but I suspect at the very least it means we'd  
have plausible protection to not get sued before a cease and desist  
was issued.

fdmf is not perfect. As noted above, it has trouble distinguishing  
between very short snippets of sound, even though they are obviously  
different to the ear. fdmf also has trouble differentiating between  
versions of songs with uncensored vs. clean lyrics, though having a  
tight enough threshold eliminates this problem. If the thresholds are  
too tight, then songs that should be considered matches might not be  
returned as such. This problem could be avoided, but only at the  
expense of requiring more signature associations to a logical song.  
That makes this less useful until a large set of signature  
associations are built up.

fdmf is also not cheap in terms of backend resources. I'll get into  
how to optimize it down below, but even though it can scale across  
machines and the expense will be hidden from the end users, it will  
require O(n) computations for each new signature, where n is the  
number of known signatures. That will just keep getting more  
expensive, though at least it will stay linear.

Finally, fdmf has some advantages over TRMs due to the difference in  
algorithm strategy. Because the fuzzy logic is applied at search  
time, that means that we can adjust those thresholds  without having  
to recompute signatures. It also means a client could specify the  
precision of the matching they want to use. Both of these things are  
impossible with TRMs, and seem to be real steps forward.

How fdmf might work for MB
-----------
 From an end user perspective, fdmf signatures should be a drop-in  
replacement for TRMs. The biggest change would probably be the  
addition of libplot as a requirement. The signatures are obviously  
longer than TRMs, but even when base64 encoded for UI display, they  
are only 129 characters long. (If we wanted to shorten that up we  
could encode the strings with ascii85, but I don't know what that  
really gains us.)

Things are a little different on the backend. Obviously, the TRM  
server goes away. But don't get rid of that hardware just yet.... now  
is as good as time as any to describe how I envision to speed up  
signature lookups from O(n) to very fast.

The trick is to notice that, as things are today, if a TRM isn't in  
the system when you try to look it up, MB simply says it doesn't know  
about it. fdmf signatures could work better, but that requires O(n)  
computations at lookup time. So instead, this is what could happen.  
If somebody searches for a signature that the system doesn't know  
about, the system will simply report there are no matches, even if  
there might be. At the same time, the system will record the search  
signature and kick off O(n) computations, recording which known  
signatures it matches. Once it has walked through all known  
signatures and recorded the matches, it marks the new search  
signature as "ready". The next time somebody searches for that  
signature, all the matches are pre-cached, and the time spent is a  
simple index lookup.

It's easy to see how this process naturally lends itself to parallel  
computing. Therefore, even though adding a signature is expensive, it  
can scale extremely well. The biggest bottleneck will be the database  
which serves the computing farm, but because most of that load is  
read-only, it's trivial to scale it across database servers. Not that  
*that* is cheap either.... but it does scale well.

I've been recording all my data in postgres, so database support  
shouldn't be a problem. I imagine that MB will want to have a  
slightly more refined schema than what I've been using. :) I have  
*not* yet cleaned up the install requirements or packaging of fdmf,  
so for instance the signature generation is still a perl script which  
calls decoders and the spline program from the plotutils package,  
before dumping the key into postgres.... obviously, that's hardly  
portable. But there's no reason why it shouldn't be easy to make some  
ansi C tool that calls uses libmad and libplot to do all the same  
work, and then sends the result to some tcp socket or generic web  
service. I can't speak for windows, but that would let things run  
just fine on posix.


Summary
-----------
Well, this has been a lengthy email. If you've gotten this far,  
congratulations. Let me recap.

fdmf.....
1. appears to be a viable replacement to TRMs.
2. looks capable of avoiding all the problems TRMs have.
3. allows us to change the fuzziness of the searching sever side  
without having to re-compute all signatures, if it turns out our  
thresholds were set wrong.
4. allows clients to specify how fuzzy they want their searches to be.
5. will scale very well, but will probably require more CPU resources  
than TRMs do.

As for next steps.... I don't know. I assume people would want to try  
this on their own collections? If so, I can polish up the scripts  
I've written. Perhaps we want to involve the author of the tool?  
Should I spend time repackaging the fdmf signature generator into a  
real posix app?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.musicbrainz.org/pipermail/musicbrainz-devel/attachments/20051010/5617f311/attachment.htm


More information about the MusicBrainz-devel mailing list