XML: the most terrifying thing on the planet
Rule 1 of the Internet: never, ever trust the Internet. Reading from a socket is the equivalent of putting your hand into an unknown box that’s full of snakes, but you don’t know it’s full of snakes and it ends up pretty touch-and-go as to whether you get out in one piece.
Getting XML over the wire is like that, but more terrifying and more dangerous.
Here’s one heinous example:
Exponential Entity Expansion
By defining entities which form chains that fan out exponentially when they’re dereferenced, the load and memory used on a host can be attacked. It goes a little something like this:
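{% highlight xml %}
<!DOCTYPE xmlbomb [
<!ENTITY a "1234567890">
<!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;">
<!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;">
<!ENTITY d "&c;&c;&c;&c;&c;&c;&c;&c;">
]>
<bomb>&d;</bomb>
{% endhighlight %}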
Each reference triggers a growing number of further dereferences; add enough levels and a few hundred bytes of XML turn into a multi-gigabyte ordeal.
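To put rough numbers on that growth, here is a minimal Python sketch. The helpers `build_bomb` and `expanded_size` are made up for illustration, and the entity names `e0`, `e1`, ... are arbitrary. It only builds the document and does the arithmetic; it deliberately never hands the payload to a parser.

```python
def build_bomb(levels: int = 4, fanout: int = 8, seed: str = "1234567890") -> str:
    """Build an XML document whose last entity fans out exponentially."""
    names = [f"e{i}" for i in range(levels)]
    decls = [f'<!ENTITY {names[0]} "{seed}">']
    for prev, cur in zip(names, names[1:]):
        # Each entity is just `fanout` references to the previous one.
        refs = f"&{prev};" * fanout
        decls.append(f'<!ENTITY {cur} "{refs}">')
    body = "\n".join(decls)
    return f"<!DOCTYPE bomb [\n{body}\n]>\n<bomb>&{names[-1]};</bomb>"


def expanded_size(levels: int = 4, fanout: int = 8, seed_len: int = 10) -> int:
    """Characters produced once the last entity is fully expanded."""
    return seed_len * fanout ** (levels - 1)


# Ten entities, ten references each, seeded with "lol".
doc = build_bomb(levels=10, fanout=10, seed="lol")
print(len(doc), "characters on the wire")
print(expanded_size(levels=10, fanout=10, seed_len=3), "characters after expansion")
```

Every extra level multiplies the output by `fanout` while adding only one short line to the document, which is the whole trick.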
Exponential entity expansion is about as much fun as a wet fart.
External Entity Expansion
More fun along similar lines: using the SYSTEM identifier, an entity can load an external resource and, if you’re very lucky, that ends in a bit of arbitrary code execution.
When the URI is a URL (e.g. a http:// locator) some parsers download the resource from the remote location and embed them into the XML document verbatim. (Explains Christian Heimes.)
Maybe I’ll start giving these ratings. That one gets 4 / 5 frowny faces.
DTD Retrieval
Nearly identical to External Entity Expansion. But for DTDs. There’s a further variant for local files too, if you aren’t depressed enough.
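For both the external entity and the external DTD flavours, one way to spot trouble before it starts is to let expat walk the declarations and bail out as soon as something points outside the document. This is only a sketch using the standard library's `xml.parsers.expat`; the function `reject_external_references` and the exception `ExternalReferenceFound` are names invented here, and it makes no attempt to catch the exponential expansion case.

```python
from xml.parsers import expat


class ExternalReferenceFound(ValueError):
    """Raised when a document points at an external DTD or entity."""


def reject_external_references(xml_bytes: bytes) -> None:
    """Parse xml_bytes and raise if it declares anything external."""
    parser = expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # <!DOCTYPE foo SYSTEM "..."> or PUBLIC "..." means an external DTD.
        if system_id is not None or public_id is not None:
            raise ExternalReferenceFound(f"external DTD: {system_id!r}")

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Internal entities carry a value; external ones carry a system ID.
        if system_id is not None:
            raise ExternalReferenceFound(
                f"external entity {name!r}: {system_id!r}"
            )

    parser.StartDoctypeDeclHandler = on_doctype
    parser.EntityDeclHandler = on_entity_decl
    # Exceptions raised inside the handlers propagate out of Parse().
    parser.Parse(xml_bytes, True)
```

Nothing is fetched here: the handlers only look at the declarations, and the exception fires before any entity reference in the body is reached.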
PANIC!
Thankfully there are lovely libraries (mmmmmm libraries) for Python. defusedxml is now available on PyPI, and there are some good best practices to follow in this python.org blog post.
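As a rough idea of what using it looks like, here is a small sketch that assumes defusedxml is installed. `parse_untrusted` and `BOMB` are hypothetical names, and returning `None` on a rejected document is just one possible policy.

```python
# A deliberately tiny entity chain; the entity names are arbitrary.
BOMB = """<!DOCTYPE bomb [
<!ENTITY a "1234567890">
<!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;">
]>
<bomb>&b;</bomb>"""


def parse_untrusted(xml_text):
    """Parse XML from the wire, returning None if defusedxml rejects it."""
    # Imported lazily so this module loads even where defusedxml is absent.
    from defusedxml import DefusedXmlException
    from defusedxml.ElementTree import fromstring

    try:
        return fromstring(xml_text)
    except DefusedXmlException as exc:
        # Covers forbidden entities, DTDs and external references.
        print(f"rejected untrusted XML: {exc!r}")
        return None
```

Calling `parse_untrusted("<a>ok</a>")` should hand back an ordinary element, while `parse_untrusted(BOMB)` should come back as `None`, because defusedxml refuses entity declarations by default.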
Also worth a read are the following: