XML: the most terrifying thing on the planet

Rule 1 of the Internet is never, ever trust the Internet. Reading from a socket is the equivalent of putting your hand into an unknown box that’s full of snakes, except you don’t know it’s full of snakes, and it’s pretty touch-and-go as to whether you get out in one piece.

Getting XML over the wire is like that, but more terrifying and more dangerous.

Here’s one heinous example:

Exponential Entity Expansion

By defining entities that reference each other in chains which fan out exponentially when they’re dereferenced, an attacker can exhaust the CPU and memory of the host doing the parsing. It goes a little something like this:

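{% highlight xml %}
<!DOCTYPE bomb [
<!ENTITY a "1234567890">
<!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;">
<!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;">
<!ENTITY d "&c;&c;&c;&c;&c;&c;&c;&c;">
]>
<bomb>&d;</bomb>
{% endhighlight %}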

Each reference triggers a growing number of dereferences, so a few more levels of nesting can turn a few hundred bytes of XML into a multi-gigabyte ordeal.
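To put rough numbers on that fan-out, here is a minimal Python sketch. The function names `build_bomb` and `expanded_size`, the element name `bomb`, and the entity names `e0`, `e1`, … are all made up for illustration. It only builds the payload as a string and does the arithmetic; nothing is handed to a parser.

```python
def build_bomb(levels=4, fanout=8, seed="1234567890"):
    """Build an illustrative entity-chain document as a string.

    Only for inspecting the payload; do not hand it to a real parser.
    """
    names = [f"e{i}" for i in range(levels)]
    decls = [f'<!ENTITY {names[0]} "{seed}">']
    for prev, cur in zip(names, names[1:]):
        # Each level references the previous level `fanout` times.
        refs = f"&{prev};" * fanout
        decls.append(f'<!ENTITY {cur} "{refs}">')
    body = "\n".join(decls)
    return f"<!DOCTYPE bomb [\n{body}\n]>\n<bomb>&{names[-1]};</bomb>"


def expanded_size(levels=4, fanout=8, seed="1234567890"):
    """Characters produced by fully dereferencing the last entity once."""
    return len(seed) * fanout ** (levels - 1)


print(len(build_bomb(10, 10, "lol")), "characters on the wire")
print(expanded_size(10, 10, "lol"), "characters after expansion")  # 3 * 10**9
```

The size on the wire grows linearly with the number of levels, while the expanded size is multiplied by the fan-out at every level. That asymmetry is the whole attack.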

Exponential entity expansion is about as much fun as a wet fart.

External Entity Expansion

More fun along similar lines: using the SYSTEM identifier, an entity can load an external resource and, if you’re very lucky, you end up with a bit of arbitrary code execution.

When the URI is a URL (e.g. a http:// locator) some parsers download the resource from the remote location and embed them into the XML document verbatim. (Explains Christian Heimes.)

Maybe I’ll start giving these ratings. That one gets 4 / 5 frowny faces.
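One way to see an external entity coming is to watch the declarations as they are parsed. The sketch below uses the standard library's `xml.parsers.expat` and bails out the moment an entity is declared with a SYSTEM or PUBLIC identifier. The names `ExternalEntityForbidden` and `reject_external_entities` are hypothetical, and this is an illustration of the idea rather than a complete defence.

```python
import xml.parsers.expat


class ExternalEntityForbidden(ValueError):
    """Raised when a document declares an entity with an external identifier."""


def reject_external_entities(data):
    """Parse `data` with expat, raising as soon as an external entity is declared."""

    def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
        # Internal entities carry a value; external ones carry a SYSTEM/PUBLIC id.
        if system_id is not None or public_id is not None:
            raise ExternalEntityForbidden(f"entity {name!r} points at {system_id!r}")

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(data, True)


evil = b'<!DOCTYPE r [<!ENTITY x SYSTEM "file:///etc/passwd">]><r>&x;</r>'
try:
    reject_external_entities(evil)
except ExternalEntityForbidden as exc:
    print("rejected:", exc)
```

The exception is raised from inside the declaration handler, so parsing stops before the `&x;` reference is ever reached.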

DTD Retrieval

Nearly identical to External Entity Expansion. But for DTDs. There’s a further variant for local files too, if you aren’t depressed enough.
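The DTD case can be spotted the same way, by looking at the DOCTYPE itself. This is a small sketch using `xml.parsers.expat` from the standard library; `find_external_dtd` is a made-up helper name and `example.com` is a placeholder URL. Expat only reports the identifiers here; it does not fetch anything.

```python
import xml.parsers.expat


def find_external_dtd(data):
    """Return (system_id, public_id) if the DOCTYPE names an external DTD, else None."""
    seen = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        seen.append((system_id, public_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)

    # A DOCTYPE with only an internal subset has no system identifier.
    if seen and seen[0][0] is not None:
        return seen[0]
    return None


doc = b'<!DOCTYPE r SYSTEM "http://example.com/evil.dtd"><r/>'
print(find_external_dtd(doc))
```

A caller that gets anything other than `None` back knows a more trusting parser would have gone off to retrieve that DTD.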

PANIC!

Thankfully there are lovely libraries (mmmmmm libraries) for Python. defusedxml is now available on PyPI and there are some good best practices to follow in this python.org blog post.
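As a rough idea of what using defusedxml looks like, here is a minimal sketch. It assumes the package is installed (`pip install defusedxml`); `parse_untrusted` is a made-up wrapper name, and passing `forbid_dtd=True` is a stricter choice of mine rather than the library default.

```python
try:
    from defusedxml import DefusedXmlException
    from defusedxml.ElementTree import fromstring
except ImportError:  # third-party package: pip install defusedxml
    fromstring = None


def parse_untrusted(xml_text):
    """Parse XML from an untrusted source, rejecting DTDs and entities."""
    if fromstring is None:
        # Deliberately no fallback to the stdlib parser for untrusted input.
        raise RuntimeError("defusedxml is not installed")
    # forbid_dtd=True rejects any DOCTYPE, which also rules out entity tricks.
    return fromstring(xml_text, forbid_dtd=True)


if fromstring is not None:
    try:
        parse_untrusted('<!DOCTYPE r [<!ENTITY a "x">]><r>&a;</r>')
    except DefusedXmlException as exc:
        print("rejected:", type(exc).__name__)
```

The wrapper refuses to run at all without defusedxml, on the theory that quietly falling back to an unhardened parser is worse than failing loudly.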

Also worth a read are the following: