[mb-devel] Future-proof FP, the fingerprint database

Jim C. Nasby decibel at decibel.org
Sun Jun 3 23:38:40 UTC 2007


On Wed, May 30, 2007 at 02:13:44PM +0300, Juha Heljoranta wrote:
> Storage requirements for the whole fingerprint database would be then
> 2^30 * 76 bytes = 76 GiB.
<snip> 
> Because we have 32 of these lookup tables the total storage requirement
> is 32 * 4 GiB = 128 GiB.
> 
> Note: We are doing non-overlapping indexing. If we'd like to do full,
> overlapping, indexing then we would need 94 lookup tables. Storage
> requirement for such is threefold: 94 * 4 GiB = 376 GiB. Full indexing
> should improve search sensitivity.
<snip> 
> Requirements scaled to current musicbrainz database size
> --------------------------------------------------------
> 
> The database has roughly 2^23 = 8388608 tracks.
> 2^23 * 72bytes + track ids = 2^23 * 72 + 2^23 * 4 = 608 MiB.
> Lookup tables:
> 2^23/2^18 = 32 fingerprint ids per lookup table entry. Storage per table
> entry 32 * 4 = 128 bytes.
> Total lookup table size = 2^18 * 128 bytes + 1 MiB for table keys = 33 MiB.
> All 32 lookup tables together = 1056 MiB.
> 
> This would run pretty well on typical workstation with 2 GiB ram.
> Most importantly the lookup tables could be in ram.
> 
> To double the capacity either double the ram or setup similar machine
> and split lookup tables between the machines.
> 
> 
> 
> Boundary conditions
> -------------------
> 
> Minimum fingerprint length: 6 sec
> Highly recommended: only 1 (indexed) fingerprint per track.
> 
> Memory requirements
> 
>  Tracks / RAM
> 2^23  /   2 GiB
> 2^24  /   3 GiB
> 2^25  /   5 GiB
> 2^26  /   9 GiB
> ...
> 2^30  / 129 GiB
> 
> There is probably a small overhead which increases the requirements by
> 3-10%.
> 
> The maximum number of fingerprint ids: 2^32. Using 64bit fingerprint ids
> will roughly double the storage requirements.
> 
> Implementation
> --------------
> 
> This should be very straight forward.
> * Simple database layout
> * Simple indexing system
> * The search function is not too complicated
> * Overhead should be minimal. There should be no need for extra pointers
> etc. That is, the memory requirements should hold pretty well.

I like the potential scalability of this; presumably the only item that
would have to fit on a single machine are the indexes, but that
shouldn't be hard to do.

I agree with the original stuff in the entry; it would be good to get
something running quickly and move on from there. I don't think it'd be
too hard to code up customized data storage for this, but you could also
use a database if you wanted. If PostgreSQL was used, I estimate 28
bytes of overhead for the database and 40 bytes per lookup table entry.

One thing I think is important is to design the system with the
intention of eventually dividing it among multiple machines; not for CPU
reasons but in order to cheaply keep everything in memory.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel at decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.musicbrainz.org/pipermail/musicbrainz-devel/attachments/20070603/571850b4/attachment.pgp


More information about the MusicBrainz-devel mailing list