[mb-devel] Future-proof FP, the fingerprint database
Jim C. Nasby
decibel at decibel.org
Sun Jun 3 23:38:40 UTC 2007
On Wed, May 30, 2007 at 02:13:44PM +0300, Juha Heljoranta wrote:
> Storage requirements for the whole fingerprint database would be then
> 2^30 * 76 bytes = 76 GiB.
<snip>
> Because we have 32 of these lookup tables the total storage requirement
> is 32 * 4 GiB = 128 GiB.
>
> Note: We are doing non-overlapping indexing. If we'd like to do full,
> overlapping, indexing then we would need 94 lookup tables. Storage
> requirement for such is threefold: 94 * 4 GiB = 376 GiB. Full indexing
> should improve search sensitivity.
<snip>
> Requirements scaled to current musicbrainz database size
> --------------------------------------------------------
>
> The database has roughly 2^23 = 8388608 tracks.
> 2^23 * 72bytes + track ids = 2^23 * 72 + 2^23 * 4 = 608 MiB.
> Lookup tables:
> 2^23/2^18 = 32 fingerprint ids per lookup table entry. Storage per table
> entry 32 * 4 = 128 bytes.
> Total lookup table size = 2^18 * 128 bytes + 1 MiB for table keys = 33 MiB.
> All 32 lookup tables together = 1056 MiB.
>
> This would run pretty well on typical workstation with 2 GiB ram.
> Most importantly the lookup tables could be in ram.
>
> To double the capacity either double the ram or setup similar machine
> and split lookup tables between the machines.
>
>
>
> Boundary conditions
> -------------------
>
> Minimum fingerprint length: 6 sec
> Highly recommended: only 1 (indexed) fingerprint per track.
>
> Memory requirements
>
> Tracks / RAM
> 2^23 / 2 GiB
> 2^24 / 3 GiB
> 2^25 / 5 GiB
> 2^26 / 9 GiB
> ...
> 2^30 / 129 GiB
>
> There is probably a small overhead which increases the requirements by
> 3-10%.
>
> The maximum number of fingerprint ids: 2^32. Using 64bit fingerprint ids
> will roughly double the storage requirements.
>
> Implementation
> --------------
>
> This should be very straight forward.
> * Simple database layout
> * Simple indexing system
> * The search function is not too complicated
> * Overhead should be minimal. There should be no need for extra pointers
> etc. That is, the memory requirements should hold pretty well.
I like the potential scalability of this; presumably the only item that
would have to fit on a single machine are the indexes, but that
shouldn't be hard to do.
I agree with the original stuff in the entry; it would be good to get
something running quickly and move on from there. I don't think it'd be
too hard to code up customized data storage for this, but you could also
use a database if you wanted. If PostgreSQL was used, I estimate 28
bytes of overhead for the database and 40 bytes per lookup table entry.
One thing I think is important is to design the system with the
intention of eventually dividing it among multiple machines; not for CPU
reasons but in order to cheaply keep everything in memory.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel at decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.musicbrainz.org/pipermail/musicbrainz-devel/attachments/20070603/571850b4/attachment.pgp
More information about the MusicBrainz-devel
mailing list