[mb-devel] Some questions about the lucene index
Paul Taylor
paul_t100 at fastmail.fm
Sun May 13 10:33:30 UTC 2007
I've been trying to get a better understanding of how Lucene works. I
assume that MusicBrainz uses the default Lucene search; previously I
didn't realise that this normalises the scores returned to the range 0-1
(scaled to 0-100), which is why the top ranking result often has a score
approaching 100 even though the query does not match many of the terms.
The records are ranked, but it would be great for developers if there was
an option to get the score returned before the normalization has been
applied. For example, the following query
track:Minus OR artist:beck OR release:odelay
would return 3.0 when all three values matched, 2.0 if two matched and
1.0 if only one matched. I would expect it to take query boosting into
account, so that
track:Minus^3 OR artist:beck OR release:odelay
would return 5.0 if all three terms matched, 4.0 if the track and artist
terms matched, et cetera.
This would then allow us to set a minimum value to accept a potential
match. For example, in the very simplified case above I may only want to
accept a score of 2.0 or above, meaning that either the track matched, or
at least the artist and release matched.
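
The scoring described above can be sketched in a few lines of Python. This
is only a model of the requested behaviour (each matched clause adds its
boost, default 1.0), not Lucene's actual similarity formula, which also
weighs term frequency and field length. The function names, the clause
tuples and the sample records are all hypothetical.

```python
# Hypothetical model of the scoring requested in the post: every matched
# clause contributes its boost. This is NOT how Lucene computes raw scores.

def expected_score(clauses, record):
    """clauses: list of (field, term, boost); record: dict of field -> text."""
    score = 0.0
    for field, term, boost in clauses:
        words = record.get(field, "").lower().split()
        if term.lower() in words:
            score += boost
    return score


def accept(clauses, record, minimum):
    """Accept a candidate only if its unnormalised score reaches the minimum."""
    return expected_score(clauses, record) >= minimum


# track:Minus OR artist:beck OR release:odelay
plain = [("track", "Minus", 1.0), ("artist", "beck", 1.0), ("release", "odelay", 1.0)]
# track:Minus^3 OR artist:beck OR release:odelay
boosted = [("track", "Minus", 3.0), ("artist", "beck", 1.0), ("release", "odelay", 1.0)]

# Made-up candidate records, for illustration only
full_match = {"track": "Minus", "artist": "Beck", "release": "Odelay"}
no_track = {"track": "Devils Haircut", "artist": "Beck", "release": "Odelay"}
artist_only = {"track": "Devils Haircut", "artist": "Beck", "release": "Mellow Gold"}

for name, rec in [("full_match", full_match), ("no_track", no_track),
                  ("artist_only", artist_only)]:
    print(name, expected_score(boosted, rec), accept(boosted, rec, 2.0))
```

With a minimum of 2.0 on the boosted query, a record is accepted when the
track matches on its own or when both artist and release match, and
rejected when only the artist or only the release matches.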
Apparently this can be done using a Lucene searcher that takes a
different type of HitCollector, but I don't know the details or how you
are using Lucene.
Without this functionality in MusicBrainz, having got the results I need
to try to work out the unnormalised scores myself. However, this is not a
simple matching process because of the way the Lucene Analysers work
(removing punctuation, stop words, et cetera). Is a small part of the
Lucene index available as a file for testing purposes, or could it be
made available? I could then use a tool such as Luke
(http://www.getopt.org/luke/) to fully analyse potential queries.
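
To illustrate why a plain string comparison is not enough, here is a
deliberately simplified stand-in for an analyser. It only lowercases,
splits on non-alphanumeric characters and drops a short, assumed stop-word
list; the real Lucene analysers do more than this, and the analyser that
MusicBrainz actually uses is not known here. The names simple_analyze and
STOP_WORDS are hypothetical.

```python
import re

# Assumed, abbreviated stop-word list, for illustration only
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def simple_analyze(text):
    """Lowercase, split on anything that is not a letter or digit,
    then drop stop words. A rough approximation, not Lucene's analyser."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


indexed_title = "The New Pollution (Remix)"
query_text = "new pollution remix"

# The raw strings differ, but the analysed token lists are identical
print(indexed_title.lower() == query_text)
print(simple_analyze(indexed_title) == simple_analyze(query_text))
```

Any attempt to recompute scores outside Lucene would have to reproduce
this kind of token processing exactly, which is why inspecting a sample of
the real index with Luke would be more reliable.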
thanks Paul