[mb-devel] New proposal for artist similarity project
Luyi Chen
lychen1109 at gmail.com
Sun Mar 25 17:50:00 UTC 2007
I have submitted a proposal for the artist similarity project to Google
SoC. The following is a stripped-down version that describes the main
idea. I would love to hear feedback.
What?
Use objective data to generate a similarity table of artist pairs
according to the co-occurrence of artists in various sources. Besides
multi-artists and user subscriptions, I also include the following two
new sources. As mentioned on the list, the search log is not suitable.
- Edit voting. I assume people mainly vote on edits to their favorite
artists.
- Editing and tagging history of light users. This won't make sense
for heavy users, because their collections are too large. But if a
user has only two CDs, those two should have some relationship. As for
tagging history, I am just assuming we record it when PicardTagger
queries the service.
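
A minimal Python sketch of how co-occurrence hits could be counted from
such sources. The size cutoff for a "light" user, the function name, and
the sample data are all made up for illustration; nothing here is
existing MusicBrainz code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cutoff: collections with more distinct artists than this
# are treated as belonging to "heavy" users and are skipped.
MAX_LIGHT_COLLECTION = 5


def count_cooccurrences(collections, max_size=MAX_LIGHT_COLLECTION):
    """Count how often each unordered artist pair shares a small collection."""
    pair_counts = Counter()
    for artists in collections:
        unique = sorted(set(artists))
        # One artist gives no pair; too many artists says little about similarity.
        if len(unique) < 2 or len(unique) > max_size:
            continue
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Made-up sample: each inner list is one user's (or one source's) artists.
sample = [
    ["Artist A", "Artist B"],
    ["Artist A", "Artist B", "Artist C"],
    ["Artist A"],
    ["Artist %d" % i for i in range(50)],  # heavy user, ignored
]
counts = count_cooccurrences(sample)
```

Each pair is stored once with its names in sorted order, so (A, B) and
(B, A) end up in the same counter.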
How?
All the data will be used to generate a 2D array in which each row and
each column represents an artist. We simply add up all co-occurrence
hits, then divide them by the frequency of the artists to generate a
normalized vector that represents each artist.
Components in the vector represent co-occurrence with other artists.
This co-occurrence data can represent direct similarity between
artists.
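
A rough sketch of that normalization in Python. It assumes "frequency"
means the number of sources an artist appears in, and that each row is
divided by the frequency of the row's own artist; that is only one
possible reading. The function name and the numbers are invented.

```python
from collections import defaultdict


def normalized_vectors(pair_counts, artist_freq):
    """Return {artist: {other_artist: hits / frequency(artist)}}."""
    vectors = defaultdict(dict)
    for (a, b), hits in pair_counts.items():
        # The raw hit count is symmetric, but each row is scaled by its
        # own artist's frequency, so the normalized table is not.
        vectors[a][b] = hits / artist_freq[a]
        vectors[b][a] = hits / artist_freq[b]
    return dict(vectors)


# Made-up counts: co-occurrence hits per pair and appearances per artist.
pair_counts = {("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 1}
artist_freq = {"A": 4, "B": 2, "C": 1}
vectors = normalized_vectors(pair_counts, artist_freq)
```

With these numbers the row for A is {B: 0.5, C: 0.25}, while the row for
B is {A: 1.0, C: 0.5}. The table is stored sparsely, so pairs that never
co-occur are simply absent.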
Indirect similarity, such as two artists both being similar to a third
artist and therefore probably similar to each other, is generated from
the correlation of their vectors. If two artists are very similar, the
correlation should be 1. If they are completely different, the
correlation should be -1.
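
As an illustration, the correlation could be the Pearson coefficient of
the two vectors. This sketch assumes both vectors are dense lists over
the same artist ordering; whether the two artists' own entries should be
left out is not decided here. The sample vectors are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors, in [-1, 1]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant vector has no defined correlation; treat as unrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Made-up normalized co-occurrence vectors over the same four artists.
artist_1 = [0.0, 0.5, 0.25, 0.0]
artist_2 = [0.0, 1.0, 0.5, 0.0]    # same pattern as artist_1, scaled
artist_3 = [0.5, 0.0, 0.25, 0.5]   # opposite pattern to artist_1

similar = pearson(artist_1, artist_2)
different = pearson(artist_1, artist_3)
```

Here `similar` comes out at (about) 1 and `different` at (about) -1,
matching the two extremes described above.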
The final similarity should then combine co-occurrence and correlation.
We should put weights on the variables and tune them according to the
real results.
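
One simple way to combine the two would be a weighted sum, sketched
below. The linear form and the default weights of 0.5 are placeholders
only; the actual weights would have to come from looking at real
results.

```python
def combined_similarity(cooccurrence, correlation,
                        w_direct=0.5, w_indirect=0.5):
    """Weighted sum of direct (co-occurrence) and indirect (correlation) scores."""
    return w_direct * cooccurrence + w_indirect * correlation


# Made-up scores for one artist pair.
score_default = combined_similarity(0.5, 1.0)
score_tuned = combined_similarity(0.5, 1.0, w_direct=0.75, w_indirect=0.25)
```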
Why?
Simple, objective data approach. No personal data in the final data
table, so it stays anonymous.
Luyi