The Problem with XML

XML is an amazingly successful standard for representing data in a portable, cross-platform way. Unfortunately, it also suffers from the limitation Crocodile Dundee noted about certain bush foods: you can live on it, but it tastes like sh*t. XML is mostly unfit for human consumption.

There, let’s see how many are still reading beyond this point. 🙂

So What Is XML?

XML is a markup language. Wikipedia gives us three types of markup languages: presentational, procedural, and descriptive. As a language, XML is a bit of an odd beast, because by itself it adds no semantics at all to the data and can fit all three categories. The most you can say about it is that it adds structure, but what the purpose of that structure is, is up to the reader. Literally. The reader parses the XML and needs to know the purpose of the structure, because the syntax of the document won’t help much.

As for the syntax of an XML document, you’ve got three choices:

1. Do nothing. Just stick to the simple rules of matching open and close tags (those <some-tag> and </some-tag> thingies) and you’ll be fine. Don’t expect anything while reading, however, because which tags are used is entirely up to the writer.
2. Use a Document Type Definition (a DTD). A DTD is another document that describes which tags are allowed and where. The best-known DTDs are those written to describe HTML.
3. Use an XML Schema Definition (an XSD). This one is like a DTD, but better. An XSD is itself a valid XML document (oh how we love those self-referring things), but the most exciting bit is that you can combine them. So a SOAP-based Web Services document can have part of it conform to a separately defined WS-Security XSD, even if the latter was unknown when the former was defined.

If you want to check the syntax of an XML document for correctness, the first option leaves you without much help. You can check whether open and close tags are nicely matched and whether the document is structurally sound, but that’s about it. The second option (a DTD) is a lot more versatile, but XSDs (the third) allow much better type checking; you can specify things like “here goes a number between 1 and 25” or even “here goes a word of 5 to 7 characters with the third an ‘x’ or a ‘z’”. As a result, in the world of IT, XSDs are the most widely used documents for specifying XML document syntax.
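To make the difference concrete, here is a minimal Python sketch. It is not an XSD validator: the first function only does what option 1 allows (checking that the document parses), and the other two are hand-written stand-ins for the two example constraints quoted above. The function names are made up for illustration, and I am assuming “a word” means plain ASCII letters and “a number” means a whole number.

```python
import re
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Option 1: only check that the document parses (tags match, one root)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two letters, then an x or a z, then two to four more letters: 5 to 7 in total.
WORD_PATTERN = re.compile(r"[A-Za-z]{2}[xz][A-Za-z]{2,4}")


def is_valid_count(value: str) -> bool:
    """'A number between 1 and 25' (inclusive, assumed to be a whole number)."""
    return value.isascii() and value.isdigit() and 1 <= int(value) <= 25


def is_valid_word(value: str) -> bool:
    """'A word of 5 to 7 characters with the third an x or a z'."""
    return WORD_PATTERN.fullmatch(value) is not None
```

With an XSD you would declare rules like these once, in the schema, and let a validating parser apply them; without one, every reader ends up writing its own version of the last two functions.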

So What is XML Used For?

Well, basically, everything, as long as it involves a program reading or writing data. Even though many today may not know it, we came from a world where data from a machine of brand A would not necessarily be readable by a machine of brand B without work. Computer scientists (yes indeed, not just students) love to argue about such mind-bogglingly silly things as which is better: big-endian or little-endian. “Aha!” you declare, “I read Gulliver’s Travels. This is about eggs, right?” Eh, no actually. Ultra short (‘scuse the pun): if I have a two-byte value (so 16 bits, usually referred to as a short integer), which byte do I store first: the low-order (little-endian) or the high-order (big-endian)? Read Wikipedia for a summary of the arguments. The Internet protocols solved this by introducing “network byte order” (see the section on “Endianness in networking”), but still, doing it in text might be the only way to prevent any discussions.
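For those who would rather see the bytes than read the arguments, a small Python sketch using the standard struct module. The value 0x1234 is just an example I picked because its two bytes are easy to tell apart.

```python
import struct

value = 0x1234  # a 16-bit "short": high-order byte 0x12, low-order byte 0x34

little = struct.pack("<H", value)   # little-endian: low-order byte first
big = struct.pack(">H", value)      # big-endian: high-order byte first
network = struct.pack("!H", value)  # network byte order, which is big-endian

# The text form sidesteps the question: the digits are written in reading order.
as_text = str(value).encode("ascii")

print(little.hex(), big.hex(), network.hex(), as_text)
# prints: 3412 1234 1234 b'4660'
```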

So here we are: a data format with technology neutrality as one of its main advantages. Even better: a human can read it. So there again, this is the big thing with XML: it is a standard, its syntax can be strictly specified, and even humans can read and write it. So who can guess what we humans do next? Indeed: we fight over what character set it is written in (Sigh. You can specify it, so choose what you want and be done with it), complain about its bulkiness, and start using it for whatever comes to mind.

So What is XML Good For?

Remember that bit about exchanging data between different systems? Yes, XML is great for exchanging data. This whole rage about Service Oriented Architectures wouldn’t have taken off as it did without SOAP-based Web Services. Practically all software development platforms provide a relatively simple way to produce or consume XML. If XSDs have been used, a lot of checking on the validity of the data (the actual content) can be automated as well. If you feel like it, you can read and write the message’s content by hand, because it really is human readable. If you’re lucky, the developer (or designer) even made sure the tag names and the document’s structure are understandable. Because if they didn’t, you’re in hell…

So Where Does It Suck?

XML is a terrible format for a human with a simple text editor. Since we don’t want to waste bandwidth unnecessarily, XML by default is not “pretty printed”, because that involves adding a lot of whitespace so a human can (somewhat) recognize the structure. But even if it is formatted, the file soon becomes too large to read easily. I’ve long held the conviction that XML is great for tool builders, because you need a specialized tool to show the content in its proper form. You want tables, lists, structured dialogs. Take a Maven POM. If you add Maven support to Eclipse, you get a multi-page editor for POMs, because the bare XML is not really quick to read. You use the scrollbar big-time. Take deployment descriptors, Faces configuration, persistence configuration, application server configurations, worse: BPEL, XSLT, even simply XHTML! You either need to be well-versed in the format, or you’re stuck trying to find a good tool for handling the files.
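To see how much whitespace “pretty printing” adds, here is a small Python sketch using xml.dom.minidom from the standard library. The two-element document is a made-up fragment, not a complete POM.

```python
from xml.dom import minidom

# The compact form, as it would typically travel over the wire.
compact = (
    "<dependency><groupId>org.jboss.spec</groupId>"
    "<artifactId>jboss-javaee-6.0</artifactId></dependency>"
)

# The same document with line breaks and two-space indentation added.
pretty = minidom.parseString(compact).toprettyxml(indent="  ")

print(pretty)
print(len(compact), "characters compact,", len(pretty), "characters pretty")
```

The pretty version carries exactly the same data; every extra character is there only for the benefit of a human reader.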

So Why Do We Keep Doing This to Ourselves?

I’ve got this feeling it’s all down to a form of laziness. Not necessarily a bad form of laziness, because I’m all for reuse, but since we have high-quality parsers and generators for XML, which are relatively easy to use, why bother with anything else? Designing a good data model is non-trivial. We’ve been pampered with not just great tooling, but also a methodology that explicitly states you shouldn’t do a design up-front, because “You Ain’t Gonna Need It”. So we just write our classes and forget the decades of research on database modelling, annotate the classes so they’ll get mapped to database tables, and let our lovely JPA layer create the database schema for us. Designing for performance up-front? Bah! Humbug! Create an understandable syntax? Why? I can read it! Use any XML-capable editor, you lazy-bones! Do I need to pre-cook everything for you?

I guess you could say I beg to differ. Take the following two options. Which do you prefer?

Option 1:

<dependency>
    <groupId>org.jboss.spec</groupId>
    <artifactId>jboss-javaee-6.0</artifactId>
    <version>3.0.1.Final</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>

Option 2:

dependency("jboss-javaee-6.0", groupId="org.jboss.spec", version="3.0.1.Final", type="pom", scope="import");

For a piece of software, the second one is just as easy to read as the first, while for a human, the second is a lot easier. I can think of no reason why we need to be tortured with XML-for-everything, other than not wanting to spend the time on a good technical data model and a fitting textual representation, given that humans need to read and write it. And don’t complain about having to write that parser; there are tools for that, just as with XML. Sure, they may need some dusting off and could probably use an update for our modern-day languages, but I wouldn’t be surprised if they already have been updated. It’s just that they don’t get the attention they deserve, because nobody thinks about using them.
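As a rough illustration of “just as easy for software”, here is a Python sketch that reads both options into the same dictionary. It cheats: Option 2 happens to look like a Python function call, so I borrow Python’s own ast module instead of generating a parser from a grammar, which is what a real design would do. The function names from_xml and from_call are made up for this example.

```python
import ast
import xml.etree.ElementTree as ET

XML_FORM = """
<dependency>
    <groupId>org.jboss.spec</groupId>
    <artifactId>jboss-javaee-6.0</artifactId>
    <version>3.0.1.Final</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>
"""

CALL_FORM = (
    'dependency("jboss-javaee-6.0", groupId="org.jboss.spec", '
    'version="3.0.1.Final", type="pom", scope="import");'
)


def from_xml(text: str) -> dict:
    """Option 1: one dictionary entry per child element."""
    root = ET.fromstring(text.strip())
    return {child.tag: child.text for child in root}


def from_call(text: str) -> dict:
    """Option 2: parse the call syntax without ever executing it."""
    # Assumption: the trailing semicolon is only a statement terminator.
    call = ast.parse(text.strip().rstrip(";"), mode="eval").body
    if not (isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "dependency"
            and len(call.args) == 1):
        raise ValueError("expected dependency(<artifactId>, name=value, ...)")
    # Assumption: the single positional argument is the artifactId.
    fields = {"artifactId": ast.literal_eval(call.args[0])}
    for keyword in call.keywords:
        fields[keyword.arg] = ast.literal_eval(keyword.value)
    return fields


print(from_xml(XML_FORM) == from_call(CALL_FORM))
# prints: True
```

Both functions are about the same size, which is the point: once a parser exists, the machine does not care which notation it gets.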

Now where did I keep that copy of those LL, LR, and LALR parser generators? Written in Modular Pascal, and who uses that nowadays, but they could be converted to a more modern language…