[mb-devel] New proposal for artist similarity project

Luyi Chen lychen1109 at gmail.com
Sun Mar 25 17:50:00 UTC 2007


I have submitted a proposal for the artist similarity project to Google
SoC. The following is a stripped-down version that describes the main
idea. I would love to hear feedback.

What?

Use object data to generate a similarity table of artist pairs based on
the co-occurrence of artists in various sources. Besides multi-artist
data and user subscriptions, I also include the following two new
sources. As mentioned on the list, the search log is not suitable.

- Edit voting. I assume people mainly vote on their favorite artists.
- Editing and tagging history of light users. This won't make sense
for heavy users, because their collections are too large. But if a
user has only two CDs, the artists on them should have some
relationship. As for tagging history, I assume we would record it when
PicardTagger queries the service.

How?

All the data will be used to generate a 2D array in which each row and
each column represents an artist. We simply add up all co-occurrence
hits, then divide them by the frequency of the artists to generate a
normalized vector that represents each artist.
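
As a rough sketch of this step (not a final design), the counting and
normalization could look like the following. The function name, the
input shape (one list of artist names per source item, such as one
user's collection or one voter's edits), and the choice to divide each
row by the number of groups the artist appears in are all assumptions
made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_vectors(groups):
    """Build one normalized co-occurrence vector per artist.

    groups: iterable of artist-name collections, one per source item
    (hypothetical input format).
    """
    counts = defaultdict(lambda: defaultdict(int))  # counts[a][b] = hits
    freq = defaultdict(int)  # freq[a] = number of groups containing a

    for group in groups:
        artists = sorted(set(group))  # ignore duplicates within a group
        for a in artists:
            freq[a] += 1
        for a, b in combinations(artists, 2):
            counts[a][b] += 1
            counts[b][a] += 1

    # Assumed normalization: divide each row by that artist's frequency.
    return {
        a: {b: hits / freq[a] for b, hits in counts[a].items()}
        for a in freq
    }


groups = [["A", "B"], ["A", "C"], ["A", "B"], ["D"]]
vectors = cooccurrence_vectors(groups)
```

The vectors are kept sparse (a dict per artist) because most artist
pairs never co-occur.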

Components in the vector represent co-occurrence with other artists.
This co-occurrence data can represent direct similarity between
artists.

Indirect similarity, such as two artists both being similar to a third
artist and therefore probably similar to each other, is generated from
the correlation of their vectors. If two artists are very similar, the
correlation should be close to 1. If they are completely different, the
correlation should be close to -1.
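
To illustrate, here is a minimal sketch that assumes "correlation"
means the Pearson correlation coefficient and that each artist vector
is a sparse dict mapping other artists to normalized co-occurrence
values. The function name and the handling of constant vectors
(returning 0.0) are assumptions.

```python
from math import sqrt


def vector_correlation(vec_a, vec_b, all_artists):
    """Pearson correlation of two sparse artist vectors.

    Missing components are treated as 0. Returns 0.0 when either
    vector is constant, since the correlation is undefined there.
    """
    xs = [vec_a.get(k, 0.0) for k in all_artists]
    ys = [vec_b.get(k, 0.0) for k in all_artists]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


artists = ["A", "B", "C", "D"]
same_neighbors = vector_correlation(
    {"C": 0.5, "D": 0.5}, {"C": 1.0, "D": 1.0}, artists)
opposite_neighbors = vector_correlation(
    {"C": 0.5, "D": 0.5}, {"A": 0.5, "B": 0.5}, artists)
```

In this toy data the first pair co-occurs with the same artists and
correlates at 1, while the second pair has exactly opposite neighbors
and correlates at -1.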

The final similarity should then combine co-occurrence and
correlation. We should put weights on these variables and tune them
according to the real results.
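
One possible form of this combination, assuming a plain weighted sum,
is sketched below. The function name and the default weights of 0.5 are
placeholders, not tuned values.

```python
def combined_similarity(cooccurrence, correlation, w_cooc=0.5, w_corr=0.5):
    """Weighted sum of direct (co-occurrence) and indirect (correlation)
    similarity. The weights are placeholders to be tuned on real data."""
    return w_cooc * cooccurrence + w_corr * correlation


score = combined_similarity(0.4, 1.0)
```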

Why?

It is a simple approach based on object data. The final data table
contains no personal data, so it is anonymous.


Luyi
