[mb-devel] Future-proof FP, version 0.0.2
Jim C. Nasby
decibel at decibel.org
Mon May 28 22:07:28 UTC 2007
On Mon, May 28, 2007 at 04:53:50PM +0300, Juha Heljoranta wrote:
> Hi,
>
> Version 0.0.2 of the fingerprinting tool is available.
>
> http://www.fsfe.org/en/fellows/juha/fpfpf
>
> Give it a spin and let me know what do you think about it:)
>
> Personally I think that it is pretty solid and quite ready from the
> design point of view.
>
>
> The next major challenge is to setup the server side stuff... I have
> some ideas which I'd like to test. I've also tried to figure out the
> whole procedure as Geoff described in wiki
> (http://wiki.musicbrainz.org/FutureProofFingerPrint).
> Although, I have to confess that I cannot quite follow all the details.
> Mostly I'm puzzled how he came up with number 230 (something to do with
> "entropy") and equations like 2.4e12 / 230 = 2235.
The "230" should actually be "2^30", at least for the first reference.
I'm fixing it in the wiki right now. (Does that thing run on MySQL?
There are a bunch of single characters missing.)
I've got some comments about the database... in any large database,
trying to do a lot of seeking is going to destroy performance, and
this is an example of that. Even in the best case of 25 lookups to
identify a fingerprint you're still looking at 25 seeks. Let's assume
that this 9TB database is stored on a nice, fast (and expensive) storage
array that has 50 drives in it. At 10ms per seek that's 250ms of seek
time per song; spread perfectly across 50 drives, that still means an
*ideal* lookup time of 5ms per song, which means 200 requests per
second. That's a perfect case, with very expensive storage (I believe
you'd pay on the order of $50k-$100k for a 50 drive SAN). Note that I'm
also ignoring the secondary lookup.
If we want to get closer to reality, we're looking at something closer
to 25 drives. And 10ms *average* seek time may not be representative...
that's a best-case scenario. Let's assume it's 20ms. 25 seeks at 20ms
each, spread across 25 drives, puts us at 20ms per song lookup... 50
requests per second. And that's at 100% IO saturation and efficiency,
something you're not likely to achieve.
That doesn't sound very scalable to me... :)
I don't know the details behind generating the fragments, but from a
storage perspective we'd be much better off with fewer, larger segments.
If each segment is 60 samples instead of 6, that would hopefully cut the
number of seeks by a factor of 10, which would mean 500-2000 requests
per second (again, ignoring the secondary lookup).
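For anyone who wants to play with the numbers, here is a rough sketch of the back-of-envelope model used above. The function name and parameters are made up for illustration, and it assumes seeks spread perfectly evenly across all drives with nothing else competing for IO, so treat the results as upper bounds rather than predictions.

```python
def lookups_per_second(seeks_per_lookup, seek_ms, drives):
    """Upper bound on fingerprint lookups per second.

    Assumes (hypothetically) that every seek costs seek_ms and that the
    seeks are spread perfectly across all drives.
    """
    total_seek_ms = seeks_per_lookup * seek_ms   # seek time for one lookup
    amortized_ms = total_seek_ms / drives        # effective time per lookup
    return 1000.0 / amortized_ms


# Scenarios from the discussion above.
ideal = lookups_per_second(seeks_per_lookup=25, seek_ms=10, drives=50)
realistic = lookups_per_second(seeks_per_lookup=25, seek_ms=20, drives=25)

# Segments 10x larger, assuming that really does cut seeks by a factor of 10.
ideal_big = lookups_per_second(seeks_per_lookup=2.5, seek_ms=10, drives=50)
realistic_big = lookups_per_second(seeks_per_lookup=2.5, seek_ms=20, drives=25)

print(ideal, realistic, ideal_big, realistic_big)
```

None of this accounts for the secondary lookup or for caching, either of which could move the numbers in either direction.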
Anyone have stats on current PUID lookup rates? :)
--
Decibel!, aka Jim C. Nasby, Database Architect decibel at decibel.org
Give your computer some brain candy! www.distributed.net Team #1828