[mb-devel] WebService usage

Vidar Wahlberg canidae at exent.net
Fri Sep 29 15:23:45 UTC 2006


shamelessly continuing this thread as this is related to the issue.
sorry 'bout the long mail.

last few days i didn't have anything useful to do, so i decided to look
a bit at this lucene thingy and see if i could learn something new.
and well, i'd claim i did.

when i tag my gross amounts of music files i don't want to sit there and
tag them more or less manually. i want to start a program, let it run
for as long as it would like and when i come back i want all my files to
be properly tagged. currently i'm having slight difficulties with
getting this to work as well as i wish it would.
in my perfect world, i send all the metadata i got about a song to a
server and the server will tell me exactly which song i got. of course,
this is impossible, but i believe it's possible to come quite close
without too much fuss.

so, i downloaded lucene (the java one), spent some hours trying to get a
fair understanding of how the thing works and went on to build myself an
index. the index i'm stuck with now is not tuned much, it's indexing
stuff i don't even search for and on top of that, it's 2.4gb and it took
more than 7 hours to build.
further on i made myself a simple program to search the index and fed
this program with 976 filenames (stripped the extension) to see if the
results it spat out would somehow match the filenames.

before running the search i thought to myself "no chance in hell this is
gonna be fast with a 2.4gb index", but quite surprisingly i was very
wrong.
i could do 10 queries per second on average with the filenames i fed it,
and just as important the metadata i was looking for was very frequently
returned within the first 10 hits.

so what's my philosophy about this?
put all unique metadata (albumartist, album, tracknum, track,
trackartist, ...) in a single field in a document.
simply search that field with all the metadata you got on a song.
it is fast, you don't have to worry about some moron putting both artist
and track in the track-tag (seen it lots of times in id3-tags) and you
actually don't have to do multiple queries to try different combinations
of filenames ("artist - album - track", "artist - track", "track -
artist", ...).
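a minimal sketch of the single-field idea, in plain java and without
lucene itself: join whatever metadata exists into one string (this is
what would go into the one indexed field) and turn a filename into plain
query words. the class and method names are made up for illustration,
and it assumes java 8 or newer, which is not what the original test ran
on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SingleFieldSketch {
    // Join all non-empty metadata values into one lowercase string.
    // This is the text that would be stored in the single search field.
    static String combine(String... values) {
        List<String> parts = new ArrayList<>();
        for (String v : values) {
            if (v != null && !v.trim().isEmpty()) {
                parts.add(v.trim().toLowerCase(Locale.ROOT));
            }
        }
        return String.join(" ", parts);
    }

    // Turn a filename (extension already stripped) into plain query words:
    // lowercase it and replace every run of non-letter, non-digit
    // characters (" - ", "_", ...) with a single space.
    static String queryFromFilename(String name) {
        return name.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}]+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // Hypothetical values, only to show the shape of the strings.
        System.out.println(combine("Some Artist", "Some Album", "03", "Some Track", null));
        System.out.println(queryFromFilename("Some Artist - 03 - Some Track"));
    }
}
```

since both sides end up as a bag of words, the order of artist, album
and track in the filename stops mattering, which is why one query can
cover all the filename layouts listed above.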

but enough fuss. i came up with the idea of putting all the metadata in
a single field and just search that last night, and the results i get
are very good. if i set up some server which spits out the 50 best
matches i can make the client do a more thorough comparison of the
metadata i got from the server and the metadata i got from the file
(thus making the client do the hard work).
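one way the client-side comparison could look, as a rough sketch: score
each candidate returned by the server by how many words it shares with
the local metadata, then sort. the word-overlap (jaccard) measure and
all names here are assumptions picked for illustration, not the scoring
used in the code linked below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ClientRerankSketch {
    // Split text into a set of lowercase words (letters and digits only).
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Jaccard similarity: shared words divided by all distinct words.
    static double similarity(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        if (wa.isEmpty() && wb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(wa);
        shared.retainAll(wb);
        Set<String> all = new HashSet<>(wa);
        all.addAll(wb);
        return (double) shared.size() / all.size();
    }

    // Return the candidates sorted by similarity to the local metadata,
    // best match first. The input list is left untouched.
    static List<String> rerank(String local, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort((x, y) -> Double.compare(similarity(local, y), similarity(local, x)));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical candidates standing in for the server's hits.
        List<String> hits = Arrays.asList(
                "other artist other song",
                "some artist some album 03 some track");
        System.out.println(rerank("some artist some track", hits));
    }
}
```

a real client would probably weigh fields differently (track number,
duration and so on), but even this crude measure is independent of the
search engine's own score.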

the code i've made can be found here: http://exent.net/~canidae/lousy/
it's only testing and nothing else, so yes it's messy, don't bother me
about the ugly code :p
you'll also find two text files there, "songs.txt" and "results.txt".
"songs.txt" is the filename of the songs i matched, only stripped of
their extension.
"results.txt" contains up to 10 hits from the search, but the "score"
given doesn't reflect how likely it is that the hit is correct. rather,
think of it as how well the words matched; the real scoring must be done
client side.
keep in mind that there are lots of norwegian titles, using our unusual
characters like "æ", "ø" and "å" which apparently were stored in
something else than utf-8, causing the characters to be displayed wrong.
this does affect the search as "sj├ªl" is not "sjæl". i've also _only_
searched using filenames and not taken anything from the tags, but the
idea is to just add the tags to the string you search for.
several of the songs can't be found in the database either, so they're
bound not to give a decent result (especially those songs with
"barne-tv" ("children's tv" or something like that, translated) in their
titles).
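to illustrate why a charset mismatch hurts the search, here is a small
sketch of how the same bytes turn into a different string when decoded
with the wrong charset, and how that can sometimes be undone. which
charsets were actually involved here is not known; iso-8859-1 is used
below purely as an assumption, because every jvm ships it and it
round-trips every byte. the names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchSketch {
    // Encode text as UTF-8, then decode those bytes with the wrong charset.
    // The result is a different string, so it no longer matches as a term.
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    // Reverse the mistake: turn the garbled text back into bytes using the
    // wrong charset, then decode those bytes as UTF-8. This only works
    // when the wrong charset can represent every byte value losslessly.
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "sj\u00e6l"; // "sjæl", written with an escape
        String garbled = misdecode(original, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original));
        System.out.println(repair(garbled, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```

if the charset really used for the stored data can be identified, a
repair step like this before indexing would let titles with "æ", "ø"
and "å" match again.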

anyways, i'd like some thoughts about this.
i would love to see a similar feature over at musicbrainz.org where i
can just supply as much metadata as i got and get 50 or so results back
which i then process further client side.
in fact, i'd like to help create something like this, as the music
archive i'm supposed to maintain gets new tracks way faster than i can
tag them manually, so i need something automated with high precision.
although, python is not my favourite language, so if someone who has
played with this pylucene could point me to some docs and files i should
pay attention to, that would save me a lot of hassle.

just for reference, here are the specs of the computer i tested on:
ibm r40 type 2681 (laptop)
2.0ghz pentium 4
512mb ram
using java 1.5


-- 
Regards,
Vidar Wahlberg


