[Playlist] CDATA in URL sections?
Matthias Friedrich
matt at mafr.de
Sun May 13 20:59:14 UTC 2007
On Sunday, 2007-05-13, Sebastian Pipping wrote:
> Lucas Gonze wrote:
>> Here's the situation I'm thinking of:
>> <location>
>> http://example.com
>> </location>
>> That's legal, but the URL is invalid until you trim off the leading and
>> trailing whitespace.
> That's confusing me. Shouldn't either the URL be valid
> because whitespace is ignored or the whole location
> element be invalid since the body contains more than
> just an URI?
Things are a bit tricky when it comes to whitespace in XML. In
data-oriented XML it is pretty common to use whitespace freely to indent
elements, but to be stricter when it comes to the content of leaf
elements like <location>. The second edition of XML 1.0 even specifies
an xml:space attribute to indicate whitespace handling [1], but I have
never seen anyone use it.
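
To make Lucas's example concrete, here is a rough sketch of what an
application gets from a parser that knows nothing about the schema. It
uses Python's xml.etree.ElementTree purely as an illustration; the
document string is just his example written on three lines.

```python
import xml.etree.ElementTree as ET

# Lucas's example: the URL sits on its own line inside <location>.
DOC = "<location>\nhttp://example.com\n</location>"

location = ET.fromstring(DOC)

# No schema is involved here, so the parser reports the element content
# verbatim, including the line breaks around the URL.
raw = location.text
print(repr(raw))  # '\nhttp://example.com\n'
```

The line breaks are still there, so whatever cleanup is needed has to
happen in the application.
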
In our case, we have a schema, and the content of <location> is declared
as XSD's anyURI. Since anyURI is an atomic type, is not derived from
"string", and has the whiteSpace facet, the whitespace processing
strategy is "collapse". This means the processing application first has
to replace every tab (#x9), line feed (#xA) and carriage return (#xD)
with #x20 (ASCII 0x20, the space character), then collapse sequences of
#x20 into one and remove leading and trailing space characters, as per
[2]. This behavior is why spaces in URLs should be encoded, BTW, because
the normalization can break the URI. Note that this all only applies to
these four characters, not to whitespace characters in general (i.e.
whatever isspace(3) or other functions consider whitespace). Validators
like xmlstarlet usually don't catch this subtlety.
Summing it all up, this means the processing application has to turn
tabs and line breaks into spaces, remove leading and trailing spaces and
collapse others into one if it wants to follow the specs precisely.
Anything else that merely looks like whitespace must not be touched,
although this most probably leads to invalid URIs. The XML parser itself
won't help you unless it has access to the schema, so this is up to the
application.
In the interest of interoperability and common sense (Postel's principle
etc.), generating applications should not use any unencoded whitespace
in anyURI elements. Receiving applications must at least perform the
"collapse" normalization. Other leading whitespace can probably be
removed, too, while you can never tell whether other trailing
whitespace characters are part of the URI or not.
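
For illustration, a minimal sketch of both sides in Python. It follows
my reading of [2], where "collapse" is defined as running after the
"replace" step (tab, line feed and carriage return become #x20); the
function names xsd_collapse and encode_whitespace are made up for this
example and are not part of any library.

```python
import re


def xsd_collapse(value: str) -> str:
    """Receiving side: whiteSpace="collapse" as I read it in [2]."""
    # Step 1 ("replace"): tab, line feed and carriage return become #x20.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    # Step 2: runs of #x20 shrink to a single #x20.
    collapsed = re.sub(r" {2,}", " ", replaced)
    # Step 3: remove leading and trailing #x20 only, hence strip(" ")
    # instead of a bare strip(), which would eat other characters too.
    return collapsed.strip(" ")


def encode_whitespace(url: str) -> str:
    """Generating side: percent-encode the characters collapse would touch."""
    return re.sub(r"[\t\n\r ]", lambda m: "%{:02X}".format(ord(m.group())), url)


print(xsd_collapse("\nhttp://example.com\n"))       # http://example.com
print(xsd_collapse("http://example.com/a  b"))      # http://example.com/a b
print(encode_whitespace("http://example.com/a  b")) # http://example.com/a%20%20b
```

The second call shows how the normalization silently changes a URL that
contains two unencoded spaces, while the encoded form passes through
xsd_collapse untouched.
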
So please, be very strict when it comes to specifying formatting unless
you're *exactly* sure what the consequences are. I had a hard time
getting strictness into the draft, so let's not give this up. People
usually don't understand XML completely (I don't say I do!), so giving
them choices is not a good idea.
Cheers,
Matthias
[1] http://www.w3.org/TR/2000/WD-xml-2e-20000814#sec-white-space
[2] http://www.w3.org/TR/xmlschema-2/#rf-whiteSpace