[mb-devel] Some questions about the lucene index

Paul Taylor paul_t100 at fastmail.fm
Sun May 13 10:33:30 UTC 2007


I've been trying to get a better understanding of how Lucene works. I
assume that MusicBrainz uses the default Lucene search. Previously I
didn't realise that this normalises the scores returned to the range 0-1
(scaled to 0-100), which is why the top-ranking result often has a score
approaching 100 even though the query does not match many of the terms.
The records are still ranked, but it would be great for developers if
there was an option to get the score returned before the normalization
has been applied. For example, the following query

track:Minus OR artist:beck OR release:odelay

would return 3.0 when all three terms matched, 2.0 if two matched and
1.0 if only one matched. I would expect it to take query boosting into
account, so that

track:Minus^3 OR artist:beck OR release:odelay

would return 5.0 if all three terms matched, 4.0 if the track and artist
terms matched, etcetera.
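
To make the arithmetic concrete, here is a minimal sketch of the additive
scheme described above: each matched clause contributes its boost (1.0 when
no boost is given) and the contributions are summed. The class and method
names are hypothetical, and this is not how Lucene itself computes raw
scores; as far as I understand, its default similarity also factors in term
frequency, inverse document frequency and length norms, so real raw scores
would not be round numbers like these.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the additive scoring wished for above; not Lucene's scoring. */
final class AdditiveScoreSketch {

    /** Sums the boosts of the clauses whose field appears in matchedFields. */
    static double score(Map<String, Double> boosts, Set<String> matchedFields) {
        double total = 0.0;
        for (Map.Entry<String, Double> clause : boosts.entrySet()) {
            if (matchedFields.contains(clause.getKey())) {
                total += clause.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Models: track:Minus^3 OR artist:beck OR release:odelay
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("track", 3.0);
        boosts.put("artist", 1.0);   // unboosted clauses count as 1.0
        boosts.put("release", 1.0);

        System.out.println(score(boosts, Set.of("track", "artist", "release"))); // 5.0
        System.out.println(score(boosts, Set.of("track", "artist")));            // 4.0
        System.out.println(score(boosts, Set.of("artist", "release")));          // 2.0
    }
}
```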

This would then allow us to set a minimum score for accepting a potential
match. For example, in the very simplified case above I may only want to
accept a score of 2.0 or above, meaning that either the track matched, or
at least the artist and release matched.

Apparently this can be done using a Lucene searcher that takes a
different type of HitCollector, but I don't know the details or how you
are using Lucene.
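
The collector idea can be sketched roughly as follows: the searcher calls
back once per hit with the document number and its raw score, and the
collector decides what to keep, here by applying a minimum score. The
RawHitCollector interface, the ThresholdCollector class and runSearch are
hypothetical stand-ins so the example runs without Lucene; they only mirror
the shape of the collect(int doc, float score) callback that Lucene 2.x
exposes on org.apache.lucene.search.HitCollector, and the scores are made
up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the collector pattern; none of these types are Lucene classes. */
final class RawScoreCollectorSketch {

    /** Stand-in for a Lucene-style callback that receives each hit's raw score. */
    interface RawHitCollector {
        void collect(int doc, float rawScore);
    }

    /** Keeps the ids of documents whose raw score reaches a minimum. */
    static final class ThresholdCollector implements RawHitCollector {
        private final float minScore;
        private final List<Integer> accepted = new ArrayList<>();

        ThresholdCollector(float minScore) {
            this.minScore = minScore;
        }

        @Override
        public void collect(int doc, float rawScore) {
            if (rawScore >= minScore) {
                accepted.add(doc);
            }
        }

        List<Integer> accepted() {
            return accepted;
        }
    }

    /** Stand-in for a searcher: reports every document with a non-zero score. */
    static void runSearch(float[] rawScores, RawHitCollector collector) {
        for (int doc = 0; doc < rawScores.length; doc++) {
            if (rawScores[doc] > 0f) {
                collector.collect(doc, rawScores[doc]);
            }
        }
    }

    public static void main(String[] args) {
        float[] rawScores = {5.0f, 1.0f, 2.0f, 0.0f, 3.0f}; // made-up raw scores
        ThresholdCollector collector = new ThresholdCollector(2.0f);
        runSearch(rawScores, collector);
        System.out.println(collector.accepted()); // [0, 2, 4]
    }
}
```

Because the threshold is applied to the raw score inside the callback, no
normalisation against the top hit is involved.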


Without this functionality in MusicBrainz, having got the results I need
to try to work out the unnormalised scores myself. However, this is not
a simple matching process because of the way the Lucene analysers work
(removing punctuation, stop words, etcetera). Is a small part of the
Lucene index available as a file for testing purposes, or could it be
made available? I could then use a tool such as Luke
(http://www.getopt.org/luke/) to fully analyse potential queries.
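
To illustrate why matching by hand is awkward, here is a very simplified
stand-in for what an analyser does to text before it is indexed or
searched: lower-casing, splitting on punctuation and dropping stop words.
The class name, the stop word list and the sample title are all made up for
the example, and a real Lucene analyser will differ in its details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, much-simplified stand-in for an analyser; not a Lucene class. */
final class AnalyserSketch {

    // Made-up stop word list for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    /** Lower-cases the text, splits on anything that is not a letter or digit, drops stop words. */
    static List<String> analyse(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Made-up title; only three of the five words survive as terms.
        System.out.println(analyse("The Queen of Hearts (Live!)")); // [queen, hearts, live]
    }
}
```

A client that compares the original strings word for word would count five
words here, while the index would only ever see three terms.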


Thanks, Paul








