[mb-devel] Future-proof FP, the fingerprint database
Juha Heljoranta
juha.heljoranta at iki.fi
Wed May 30 11:13:44 UTC 2007
Jim C. Nasby wrote:
> The "230" should actually be "2^30", at least for the first reference.
> I'm fixing in the wiki right now. (Does that thing run on MySQL? There's
> a bunch of single-characters that are missing).
Aah, yes of course! Now it makes much more sense :)
> I've got some comments about the database... in any large database,
> trying to do a lot of seeking is going to destroy performance, and
> this is an example of that. Even in the best case of 25 lookups to
> identify a fingerprint you're still looking at 25 seeks. Let's assume
> that this 9TB database is stored on a nice, fast (and expensive) storage
> array that has 50 drives in it. That still means an *ideal* lookup time
> of 5ms per song, which means 200 requests per second. That's a perfect
> case, with very expensive storage (I believe you'd pay on the order of
> $50k-$100k for a 50 drive SAN). Note that I'm also ignoring the
> secondary lookup.
>
> If we want to get closer to reality, we're looking at something closer
> to 25 drives. And 10ms *average* seek time may not be representative...
> that's a best-case scenario. Let's assume it's 20ms. We're now at 20ms
> per song lookup... 50 requests per second. And that's at 100% IO
> saturation and efficiency, something you're not likely to achieve.
>
> That doesn't sound very scalable to me... :)
>
> I don't know the details behind generating the fragments, but from a
> storage perspective we'd be much better off with fewer segments that are
> larger. If instead of 6 samples each segment is 60, that would hopefully
> cut the number of seeks by a factor of 10, which would mean 500-2000
> requests per second (again, ignoring the secondary lookup).
Good point. I have some sort of plan which might work. I got some
inspiration from how the BLAST works:)
Note: There are few boundary conditions and I am not sure if they are
acceptable.
Disclaimer: I am not an expert on this kind of stuff...
Fingerprint data
----------------
Database size = 2^30 tracks.
Each track has 6 second long (indexed) fingerprint. The 6 seconds would
translate to 96 chars/symbols == 72 bytes (the symbol rate of the
fingerprintfunction is 16 symbol / sec and each symbol is 6 bit long).
Of course the actual finger print can be longer than 6 seconds but only
a 6 second clip is indexed for searching purposes. And naturally,
nothing prevents us from indexing second 6 second clip per track.
However, adding more than one indexed fingerprint per track will hurt
the scalability badly.
Lets assume that each fingerprint has unique 32bit integer identifying it.
To store the fingerprint id and the fingerprint data we need 4 bytes +
72 bytes = 76 bytes.
Storage requirements for the whole fingerprint database would be then
2^30 * 76 bytes = 76 GiB.
Indexing
--------
Each fingerprint is chopped to 32 equally sized fragments. This means
that each fragment is 3 symbols == 18 bits long. Starting from first
fragment each fragment is placed to its respective lookup table with the
fingerprint id. Each lookup table has 2^18 keys presenting the all
possible 2^18 combinations.
To put it another way, we have 32 lookup tables each presenting
different parts of the fingerprint. Each lookup table has 2^18 keys
which will map to zero or more fingerprint id.
Size of the lookup table key index is 2^18 * 4 bytes = 1 MiB (if each
key/fragment is stored as an integer). If we have 2^30 fingerprints then
the average table will have 2^30/2^18 = 4096 fingerprint ids mapping to
a single fragment. All fingerprint ids which are stored to lookup table
which requires 4096 * 4 bytes * 2^18 = 4 GiB of storage.
Because we have 32 of these lookup tables the total storage requirement
is 32 * 4 GiB = 128 GiB.
Note: We are doing non-overlapping indexing. If we'd like to do full,
overlapping, indexing then we would need 94 lookup tables. Storage
requirement for such is threefold: 94 * 4 GiB = 376 GiB. Full indexing
should improve search sensitivity.
Searching
---------
Example: four minute track.
Four minute track has fingerprint length of 16 / symb per sec * 4 * 60
sec = 3840.
Chop fingerprint to overlapping fragments totaling 3838 fragments.
Take 32 fragments starting at search index == zero and query each
respective lookup table.
Each lookup table should return roughly 4096 fingerprint ids. Count how
many times each fingerprint id occurs in all 32 returned results.
Typically there will be 4096 * 32 = 131072 unique entries, that is, each
fingerprint id occurs only once. Store fingerprint id count results to
array which contains results from last 3 previous lookups. If some
fingerprint id occurs more than once during the last 3 previous lookups
store the fingerprint id to match candidate array.
Increment search index by one. Take 32 fragments and repeat the lookup.
Continue until all fragments are searched against the database. In this
case it would mean 3838 - 32 = 3806 search operations. This means that
the lookup tables are accessed 3806 * 32 = 121792 times during the whole
search operation.
After the search is complete we will need to return the fingerprint(s)
to the client. Under 50 % distortion an average fingerprint has roughly
4.8 undamaged fragments. Probability of picking two same fingerprint ids
during the search is: (4096 * 32)/2^30 = 0.000122. So ending up by
having the same id two or more times by accident is a small miracle.
Now, if under 50 % distortion we have, on average, 4.8 error free
fragments then the signal/noise ratio is *very* good. This practically
means that we will need to perform only single seek operation in order
to fetch the fingerprint. Either we have the fingerprint or we don't.
Simple.
Note: since we are not doing full indexing the signal / noise ratio will
probably suffer little.
Second note: The search function can be improved in many ways: stopping
the search if good enough match is found etc.
Performance
-----------
There is nothing too cpu intensive so that should not be a bottleneck.
Mostly integer comparisons when accessing the lookup tables.
The lookup tables must be in ram or in some other media which has *very*
low seek times.
Tracks / RAM
2^23 / 2 GiB
2^24 / 3 GiB
2^25 / 5 GiB
2^26 / 9 GiB
...
2^30 / 129 GiB
If lookup tables are on disk then the tables needs to be accessed 121792
times per four minute track. I think that this would translate to 121792
hard disk seek operations...
Requirements scaled to current musicbrainz database size
--------------------------------------------------------
The database has roughly 2^23 = 8388608 tracks.
2^23 * 72bytes + track ids = 2^23 * 72 + 2^23 * 4 = 608 MiB.
Lookup tables:
2^23/2^18 = 32 fingerprint ids per lookup table entry. Storage per table
entry 32 * 4 = 128 bytes.
Total lookup table size = 2^18 * 128 bytes + 1 MiB for table keys = 33 MiB.
All 32 lookup tables together = 1056 MiB.
This would run pretty well on typical workstation with 2 GiB ram.
Most importantly the lookup tables could be in ram.
To double the capacity either double the ram or setup similar machine
and split lookup tables between the machines.
Boundary conditions
-------------------
Minimum fingerprint length: 6 sec
Highly recommended: only 1 (indexed) fingerprint per track.
Memory requirements
Tracks / RAM
2^23 / 2 GiB
2^24 / 3 GiB
2^25 / 5 GiB
2^26 / 9 GiB
...
2^30 / 129 GiB
There is probably a small overhead which increases the requirements by
3-10%.
The maximum number of fingerprint ids: 2^32. Using 64bit fingerprint ids
will roughly double the storage requirements.
Implementation
--------------
This should be very straight forward.
* Simple database layout
* Simple indexing system
* The search function is not too complicated
* Overhead should be minimal. There should be no need for extra pointers
etc. That is, the memory requirements should hold pretty well.
More information about the MusicBrainz-devel
mailing list