[mb-devel] Music Matcher SOC 2007

Arnoldo Muller arnoldomuller at gmail.com
Sun Mar 18 01:40:16 UTC 2007


Hello!

I would like to propose a project for Google summer of code 2007.

I was reading the ideas presented by Geoff Schmidt and Ben wrote
regarding ways of implementing a "music matcher". I have been doing
binary program matching for over a year. My main objective is to help
detecting open source license violations. The idea is to perform
matches even though there are different compilers, obfuscators and
different options that generate very different binaries. I feel our
objectives are similar.

What I do is to "slice" the programs, and match those slices by using
a distance function. Because I do not want to match sequentially all
the database, a dimension reduction technique like s-map (a Japanese
technique) and a spatial tree or the pyramid technique can be used to
speed up the matching performance. Each slice is projected into a
vector, that later can be matched by doing some simple int
comparisons.

In a sense, matching slices of programs and matching chunks of audio
are similar tasks. Each chunk = slice is like a word of a document. In
my lab (AI deparment, Kyushu Institute of Technology, Japan) we have
been matching successfully audio and videos in real time for years by
using a similar approach based on Packed R-trees.

One of my ideas is that, if the pyramid technique is used, the
database can be divided into 2*d (d is the size of the projection, it
is not 3 for fdmf. you can make it as big as you want) chunks. Each
chunk can be distributed in the clients and queries can be distributed
and answered by clients themselves, just like bittorrent does. If data
grows too much on the client side, then the data can partitioned
further by using other criteria (pyramid height) and even further by
using a p+tree.

The only requirement is that the distance function satisfies the
triangular inequality, which is the case for fdmf as it is using the
Hamming distance.

My initial proposal is to speed up fdmf retrieval by using an m-tree
or an spatial index+s-map or better! a p+tree+s-map to speed up fdmf.
All the design should be directed towards distributed queries. If
there is time, I would like to optimize the feature extraction
process. There are no open source implementations of p+trees and one
has to be created from scratch. I want to do this too.

It might also be possible to do the same for libFooID.

 I would like to create a general framework that others can use to
experiment with different distance metrics for audio matching.

What do you think about my proposal? I am looking forward to hear your comments.

Thank you,

Arnoldo Muller



More information about the MusicBrainz-devel mailing list