[mb-style] Mechanically-assisted updating of typography on MusicBrainz
Per Starbäck
per.starback at gmail.com
Tue Feb 1 00:17:28 UTC 2011
I think it sounds good. A couple of points:
> * for the above case, skip any titles that contain words not in the
> dictionary for the language the word becomes in (i.e., make sure there
> are no non-English words in a title containing “don't”, or non-French
> ones in a title with “C'est”
This seems overly cautious to me. What are you trying to avoid?
Looking at "c'est" in track_name, the most "suspicious-looking" ones
would be
Nous Deux C'est Pour La Vie (雨のソナタ 〜La Pluie〜French Version)
愛野美奈子(小松彩夏)/C'est La Vie〜私の中の恋する部分
C'est La Vie (σήμερα)
C'est la vie 〜 私のなかの恋する部分
but there is no reason not to replace those, right?
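
To illustrate why the dictionary check looks unnecessary here, a minimal sketch in Python. It assumes the rule is simply "an ASCII apostrophe directly between two letters becomes U+2019"; the function name and the rule itself are my own illustration, not the proposed script. The other scripts in the title are never touched.

```python
import re

# Assumption: an ASCII apostrophe with a letter on both sides is a
# typographic apostrophe. [^\W\d_] matches any Unicode letter.
_APOSTROPHE = re.compile(r"(?<=[^\W\d_])'(?=[^\W\d_])")

def curl_apostrophes(title: str) -> str:
    """Replace letter-flanked ASCII apostrophes with U+2019."""
    return _APOSTROPHE.sub("\u2019", title)

if __name__ == "__main__":
    for t in ["C'est La Vie (σήμερα)", "C'est la vie 〜 私のなかの恋する部分"]:
        print(curl_apostrophes(t))
```

Only the apostrophe in "C'est" changes; the Greek and Japanese parts pass through unmodified, and apostrophes used as quote marks around a word are left alone.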
> * depending on the success with apostrophes, I might add other tests
> for quotes (with appropriate tests for pairs of quotes and language),
> en-dashes (e.g. hyphens between things that make sense as a range,
> avoiding BootlegTitleStyle and the like) and em-dashes (most
> space-hyphen-space sequences in the database should really be
> em-dashes, but it may be hard to test for exceptions so much of this
> will be manual).
Dashes are tricky, since different style guides use them differently.
What's written like " - " in ASCII might be
  em-dash, or
  space + em-dash + space, or
  space + en-dash + space,
depending on preferences. See
http://en.wikipedia.org/wiki/Dash#En_dash_versus_em_dash .
Just as a publisher can have a house style for this,
MusicBrainz could have one, rather than mimicking each release.
(A printed publication listing the same titles would likewise use its
own house style all the way through.)
Complicating the issue, it is also language-dependent. In my native
Swedish, for instance, all style guides (nowadays) call for
space + en-dash + space.
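
A house style that depends on language could be expressed as a small lookup table. The sketch below is hypothetical: only the Swedish entry comes from what I wrote above, the English entry is a placeholder for whatever the house style would decide, and the names DASH_STYLE and restyle_dash are invented for the example.

```python
EN_DASH = "\u2013"
EM_DASH = "\u2014"

# Hypothetical per-language house style for an ASCII " - ".
DASH_STYLE = {
    "sv": f" {EN_DASH} ",  # Swedish: space + en-dash + space
    "en": EM_DASH,         # placeholder assumption, not a decided style
}

def restyle_dash(title: str, lang: str) -> str:
    """Replace " - " according to the table; unknown languages are untouched."""
    replacement = DASH_STYLE.get(lang)
    if replacement is None:
        return title
    return title.replace(" - ", replacement)

if __name__ == "__main__":
    print(restyle_dash("Artist - Title", "sv"))
```

Titles in a language without an entry are returned unchanged, which keeps the automatic part conservative and leaves the rest for manual editing.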
> * I’m not sure what to do with actual hyphens. I’ve only recently
> found out there’s a separate Unicode character for a hyphen (U+2010),
> distinct from the ASCII hyphen-minus char, but I’m not sure if it’s
> actually preferred. (I currently type an ASCII hyphen/minus for
> hyphens and the other dashes as Unicode, but I might change this if I
> find a good reason to.)
Even though I haven't used it myself yet, I think the true HYPHEN
character should be used, if for no other reason than as a statement
that this HYPHEN shouldn't be changed by further heuristics.
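
As a rough sketch of that idea: once a hyphen-minus has been judged to be a real hyphen, converting it to U+2010 takes it out of reach of any later rule that only looks for the ASCII character. The "letter on both sides" test below is my own simplifying assumption about what counts as a real hyphen, and mark_hyphens is an invented name.

```python
import re

HYPHEN = "\u2010"  # Unicode HYPHEN

# Assumption: a hyphen-minus with a letter on both sides is a true hyphen.
_INTRAWORD = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")

def mark_hyphens(title: str) -> str:
    """Convert intra-word hyphen-minus to U+2010; isolated ones are kept."""
    return _INTRAWORD.sub(HYPHEN, title)

if __name__ == "__main__":
    print(mark_hyphens("Rock-n-Roll - Live"))
```

The isolated " - " stays as ASCII and remains available for a later dash rule, while the hyphens inside "Rock-n-Roll" no longer match it.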
> Anyway, there probably isn’t a lot of reason
> to worry about these. There are about 25k unique titles with this
> character, and about half of them are isolated hyphens, most of which
> will probably become em-dashes.
There are a few that should be MINUS SIGN as well (like "Liquid Cool
(Space -320°F Biostatic Ambient mix)"), but just space + hyphen-minus
+ digits isn't enough to detect them, since there are lots of cases
like "Hallo Boss, Hallo -1962" as well.
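
One possible tightening, shown as a sketch only: require a degree sign after the digits before treating the hyphen-minus as a minus sign. That extra condition is my assumption and will certainly miss other legitimate cases; fix_minus is an invented name.

```python
import re

MINUS_SIGN = "\u2212"

# Assumption: whitespace, hyphen-minus, digits, then a degree sign
# indicates a negative quantity rather than a dash before a year.
_NEGATIVE_DEGREES = re.compile(r"(?<=\s)-(?=\d+(?:\.\d+)?\s?°)")

def fix_minus(title: str) -> str:
    """Replace hyphen-minus with U+2212 before a number followed by a degree sign."""
    return _NEGATIVE_DEGREES.sub(MINUS_SIGN, title)

if __name__ == "__main__":
    print(fix_minus("Liquid Cool (Space -320°F Biostatic Ambient mix)"))
    print(fix_minus("Hallo Boss, Hallo -1962"))
```

The first title gets a MINUS SIGN; the second is returned unchanged because nothing after the digits suggests a quantity.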