Created attachment 276475 [details]
screenshot of broken page layout due to user-controlled markup
github.com displays lots of user-controlled markup in comments on issues and pull requests. Users submit Markdown (which can include raw HTML) and we transform it to HTML and display it on our pages. Serving user-controlled markup poses a security risk, which we mitigate in two ways:
1. We sanitize the HTML that comes out of our Markdown parser. We only allow certain elements, attributes, and attribute values and strip out anything that doesn’t match our whitelist.
2. We use mechanisms like Content Security Policy to provide an extra layer of protection if any unwanted content slips through the sanitization.
But even with these mitigations it is still possible for users to unintentionally break our UI (and thus it may be possible for malicious users to use this for phishing etc.). One way for this to happen is via interactions between lists and tables in the HTML parser. A live example of this exists on https://github.com/mantisbt/mantisbt/pull/272. See the attached screenshot. (The rendering issues are less dramatic when you’re logged out, but still present.) I’ll explain what’s going on here in detail.
One of the comments on this page contains the following Markdown:
- echo '<div><table class="width75">', "\n",
+ echo '<div><table class="width-70">', "\n",
Clearly this is an excerpt of a diff, and if the user had enclosed it in triple-backticks (```) it would have been rendered as such. But since it is not escaped by backticks, it gets interpreted as an unordered list, generating the following HTML:
<li>echo '<div><table class="width75">', "\n",</li>
<li>echo '<div><table class="width-70">', "\n",</li>
This HTML is then sent through our HTML sanitizer. The sanitizer is based on an HTML-compliant parser (https://github.com/google/gumbo-parser) which parses this HTML as a document fragment, strips out elements, attributes, and attribute values that don’t match our whitelist, and then reserializes the document fragment to produce the following HTML:
<li>echo '<div>', "\n",<li>echo '<div></div></li><table>
Looking at the innerHTML view on the Live DOM Viewer shows the same result (other than the class attributes that the sanitizer removed): http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4062
This sanitized HTML is then inserted into our page’s markup. You can see the skeleton of our page’s markup here: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4063
Inserting the sanitized HTML between the “user content start/end” comments gives this: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4064
You can see that the “this should be inside the box” text has moved outside the box. The user-controlled markup has broken our page’s markup and layout.
We’ve talked about working around this by wrapping all user-controlled markup in <object> tags (http://software.hixie.ch/utilities/js/live-dom-viewer/saved/4065), or possibly putting each comment inside <table><td> (http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4066). I don’t think there are many ways to break outside of those in the parser. But this solution (if it actually works) is obviously a bit of a hack.
It would be great if there were a way to tell the browser to keep the user-controlled markup from breaking out of its container element, both in terms of the DOM and visually.
<iframe seamless sandbox srcdoc> seemed like a solution for this, but seamless iframes have been removed from the spec. (This solution is extra appealing because of the browser-provided sandboxing; perhaps we could someday get rid of the sanitizing we do on our servers to remove scripts, etc.)
Some strawman proposals:
<untrusted srcdoc="<p>HTML goes here</p>">
This is similar to <iframe srcdoc>. We'd of course have to ensure the attribute value does not contain any quotes on the server side.
<p>HTML goes here</p>
For this one to work, the parsing would need to be similar to <textarea> where it consumes all characters until the next instance of "</untrusted>". And of course we'd have to ensure to remove "</untrusted>" from the content itself on the server side.