Dan asks:

what's up with it being HTML 4.01 Transitional//EN in the only template that comes in the zip? i was just perusing the code and that caught my eye because of the supposed concentration on new technologies and best practices... what gives?

Dan, talk about walking into a minefield. I will try to distill more than a hundred messages used to come to a decision on this topic into a concise reply to your query.

In order to serve correct XHTML, the server must not only serve correct markup, but a correct content-type. If the content type is not an XML type (what is supposed to be used with XHTML, since it is XML) then the browser will interpret the XHTML code as poorly-formed HTML.

What really happens when you receive XHTML markup with a text/html content type (this is how WordPress serves pages, for the most part) is your browser ignores all of the extra characters and invalid non-HTML markup that is part of the XHTML. That it does this is a blessing for you, because otherwise your beautiful but improperly served XHTML code would render like garbage. A byproduct of browsers having to deal with sloppy HTML code over the years is that they are used to taking garbage-like XHTML and making it look nice in spite of itself.
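As a rough illustration of that leniency, here is a minimal sketch using Python's standard library. Python's html.parser is not a browser engine, and the names TagCollector, parse_as_html, parse_as_xml, and the sample markup are hypothetical, but the contrast between a forgiving HTML parser and a strict XML parser is the same in kind.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_html(markup):
    """Parse forgivingly, roughly the way a text/html consumer would."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def parse_as_xml(markup):
    """Parse strictly; return None on success or the error message on failure."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as err:
        return str(err)


# Hypothetical snippet: the <br> is never closed, which HTML tolerates and XML does not.
sloppy = "<div><p>Hello<br>world</p></div>"
strict = "<div><p>Hello<br />world</p></div>"

print(parse_as_html(sloppy))  # the HTML parser reports the tags without complaint
print(parse_as_xml(sloppy))   # the XML parser reports a parse error
print(parse_as_xml(strict))   # the self-closed version parses cleanly
```

The HTML parser walks straight through the unclosed tag, while the XML parser refuses the whole document.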

Serving XHTML with a text/html content-type rather than a valid XML content type (application/xhtml+xml seems correct) is allowed by the W3C spec, but it causes the browser to render the XHTML as if it's HTML. So why not code it as HTML in the first place? The browser is less likely to misinterpret HTML when it arrives with the text/html content type.

Serving with a correct content type is possible, but it requires that your markup is XML-valid. There is so much user-provided content on a blog that would need to be filtered to make it XML-valid that doing so is prohibitive. Note that a single incorrectly placed tag or unencoded character (remember, XML doesn't have all of the entities that HTML does!) would cause your entire page, and possibly your entire site, not to render in the browser -- at all.
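To make those failure modes concrete, here is a small sketch that runs a few pieces of typical user content through a strict XML parser. The xml_error helper, the post wrapper element, and the sample strings are all made up for illustration; this is not how Habari handles content.

```python
import xml.etree.ElementTree as ET


def xml_error(fragment):
    """Wrap a content fragment in a root element and return the parse error, if any."""
    try:
        ET.fromstring("<post>" + fragment + "</post>")
        return None
    except ET.ParseError as err:
        return str(err)


# Hypothetical examples of the kind of content a blog receives.
samples = {
    "well-formed": "<p>Fish &amp; chips</p>",
    "bare ampersand": "<p>Fish & chips</p>",
    "HTML-only entity": "<p>Fish&nbsp;and chips</p>",
    "misnested tags": "<p><b>Fish</p></b>",
}

for label, fragment in samples.items():
    print(label, "->", xml_error(fragment) or "ok")
```

Only the first sample parses. Each of the other three would be enough, on its own, to stop a page served as XML from rendering.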

After consulting with experts in markup and standards who recommended that we not use XHTML because it's mostly a broken standard, we decided that HTML works, it does what we want, and we can serve pages with it that validate. Compared to invalidly serving XHTML markup as text/html content, we would rather serve content that is valid for the content type we specify. Compared to attempting to assure that all themes, posts, and comments were valid XML before they are served by converting them somehow, we would rather focus on a standard that is well-traveled and has a future with WHATWG and HTML5.

That said, it is entirely possible to create and serve valid XHTML pages with any content type you like, valid, semi-valid, or invalid, from Habari. However, there are no tools in Habari for validating your output as XML before output, so if you screw it up, it's on you. The only place you will continue to find HTML even if you change your public site's content type is in Habari's admin.

A proper blogging tool that outputs true XHTML would have an XML parser/validator and force you to add new content nodes to a DOM before output. It's simply not practical to expect that concatenated strings will always result in valid XHTML, given the abundance of user-supplied content on a blog.
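For a sense of what the DOM-first approach looks like, here is a minimal sketch in Python. The build_comment and render functions and the div/cite/p structure are hypothetical; the point is only that content is attached to nodes and the serializer, not string concatenation, produces the markup.

```python
import xml.etree.ElementTree as ET


def build_comment(author, body_text):
    """Build a comment node; user text is stored as text nodes, never as raw markup."""
    comment = ET.Element("div", {"class": "comment"})
    name = ET.SubElement(comment, "cite")
    name.text = author
    body = ET.SubElement(comment, "p")
    body.text = body_text
    return comment


def render(node):
    """Serialize the tree; the serializer escapes special characters itself."""
    return ET.tostring(node, encoding="unicode")


# Hypothetical comment text that would wreck concatenated markup.
risky = "if (a < b && c) { </div> }"
markup = render(build_comment("Pat", risky))

# The output parses as XML again and the original text survives the round trip.
reparsed = ET.fromstring(markup)
print(reparsed.find("p").text == risky)
```

The cost of this design is that user content goes in as plain text, so any formatting a commenter typed as HTML would first have to be parsed into nodes, which is exactly the conversion problem described above.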

Comments

Comment by owen on .
It is one thing to check if user-supplied content is valid; it's another to convert content that is typically invalid as XML, but parseable as HTML, into XML. If you could guarantee that all user input could be converted into valid XML, then that might be possible. Putting things into CDATA would just put them into CDATA. Unless the CDATA is a required part of the XHTML that gets parsed, the markup that's in it will be ignored. Sure, it's technically possible to validate and potentially convert all input so you could output what you want, but it's a lot more work to attain something that's pretty insignificant. The question about XHTML that I failed to ask before is why you would really want it when the HTML renders just fine. Machines have gotten really good at reading even cruddy HTML, so why output XHTML and take the chance at malformedness when HTML does a better job of what you're trying to accomplish? Just a thought.
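The CDATA point can be checked with a short sketch. The describe helper and the sample strings are hypothetical; the behavior shown is standard XML parsing, where everything inside a CDATA section is character data rather than markup.

```python
import xml.etree.ElementTree as ET


def describe(markup):
    """Return (number of child elements, leading text) for the root element."""
    root = ET.fromstring(markup)
    return len(root), root.text


# The same user content, once wrapped in CDATA and once as real markup.
wrapped = "<div><![CDATA[<b>bold</b> & more]]></div>"
inline = "<div><b>bold</b> &amp; more</div>"

print(describe(wrapped))  # no child elements; the tags are just text
print(describe(inline))   # one child element, the <b>
```

Wrapping content in CDATA keeps the document well-formed, but the bold tag is no longer an element, so it would be displayed literally instead of rendered.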
Comment by Pat on .
As the application taking in the information and writing out the information, shouldn't you be able to validate this pretty easily? While there could be contextual/syntax issues, it seems provable that embedding valid XML into other valid XML would always result in valid XML. If you validated each possible element inserted into the page, it seems like you should be able to create valid XML. It wouldn't be efficient, but it'd be simple and work. Couldn't you just stick everything into a cdata tag? I'm not trying to push you to do anything in particular, but it doesn't seem like it should be too hard to make this happen, if you wanted. Hmm... You might get screwed with the "live comment preview" type stuff, though.
Comment by owen on .
Incidentally, Dan, your site serves XHTML markup using a text/html content type. You currently have a post on your home page that includes an unescaped character as part of a code example. Even in HTML this isn't valid, but when rendered as HTML, the browser is lenient enough to give you a pass. If the browser were to render your XHTML code as actual XML, this wouldn't validate. It's little things like this that make rendering output as truly valid XHTML so difficult.
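The comment doesn't say which character was involved, so the following sketch assumes the usual suspects in code samples: <, >, and &. The escape_code_sample helper and the sample source line are hypothetical.

```python
from html import escape
import xml.etree.ElementTree as ET


def escape_code_sample(source):
    """Escape a code sample so it is safe inside both HTML and XML markup."""
    return "<pre><code>" + escape(source, quote=False) + "</code></pre>"


# Hypothetical line of code pasted into a post.
source = "if (x < y && y > z) return;"

raw = "<pre><code>" + source + "</code></pre>"
safe = escape_code_sample(source)

print(safe)

# The escaped version parses as XML; the raw version does not.
ET.fromstring(safe)
try:
    ET.fromstring(raw)
    raw_is_well_formed = True
except ET.ParseError:
    raw_is_well_formed = False
print(raw_is_well_formed)
```

Escaping a single code sample is easy. The difficulty is doing it reliably for every post, theme, and comment without also escaping the markup the author actually meant as markup.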
Comment by Geoffrey Sneddon on .
With the requirements for parsing DTDs, it is questionable whether XML is any easier to parse. Certainly in terms of code size, runtime size, and speed there is very little difference between the two. In terms of rendering correctly, whether the document is sent as XHTML or HTML is irrelevant; both are just used as a means to an end to create a DOM, and it is from that DOM that the document is rendered. The main issue with HTML is its current near-complete lack of definition. The current draft of HTML 5 does have a defined parsing algorithm, but it has one or two major issues (namely that |html|, |head|, and |body| aren't always implied). That isn't to say that XML is perfectly defined, e.g., it is undefined what to do when you have an invalid URI in an XML document. Once HTML is properly defined, it is very hard to see any advantages for XML in the vast majority of cases: very few websites have needs for things like MathML and SVG (which both might actually be serialisable in HTML5).
Comment by Pat on .
It depends on your particular agenda. XHTML is MUCH easier to parse... anyone could write a parser with relatively little work. This would allow for easy entrance into that arena. HTML, on the other hand, is painful to render correctly, especially if you want to handle the wonky cases that exist out there. It makes the minimum bar for a usable new parsing engine very high, keeping competition low. The earlier people start to switch, the sooner XHTML can become "the new standard".
Comment by Dan on .
Very, very, interesting points... somehow this topic passed me by, but I looked into it and it's intriguing. Yes, my personal site is a really bad example, but I fixed the errors that I stopped trying to keep up with and even had it serving application/xhtml+xml temporarily (until I realized AdSense was breaking (not that I've made even a dime with it yet)), so now it's serving text/html again until I get around to implementing this: http://www.cssplay.co.uk/menu/adsense.html And check out the "29th May 2006" Update on there... It didn't take long to get it validating (sans the adsense thing), but I can see where the decision here came from. However, I can't imagine that it's that much of a burden on the browser/CPU to render the page properly with xHTML markup. As you mentioned, it's allowable by the W3C spec for XHTML 1.0. Probably a good percentage of bloggers aren't going to be concerned with any of this. I can see where delivering application/xhtml+xml headers with xHTML 1.1 would be a complete PITA at this point. I hadn't looked at the admin code when I wrote the comment. I was genuinely curious, not trying to be a jerk, and hadn't heard the whole Content-Type argument just yet. Thanks for cluing me in on this!
Comment by owen on .
Using your personal site was just an example of the seemingly simple difficulties that present themselves when filtering user input for an XML target on any site. Yes, it's easy to correct the problems after the fact, but most users don't want to bother with that, and so the burden lies with the tools they use - in other words, Habari. To be clear about what I'm saying, it's not a burden on the browser to render an XHTML-like page served with a text/html content-type, but rather it's difficult for the tool on the back end to accept user input in its varied forms for generating valid XML in output. You are absolutely right that bloggers aren't going to care about this. But I hope that they do care that we're thinking about these decisions in advance for them to give them the most flexibility and to be more future-proof. I hope that after an explanation of what we've decided that they assent because they either understand our decision and accept it or they simply trust us to make a good decision not by fiat or default but consensus.