Bug 68089 - WebKit doesn't parse "#" as delimiter for fragment identifier in data URIs
Summary: WebKit doesn't parse "#" as delimiter for fragment identifier in data URIs
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL: data:text/html,<html>no hash should b...
Keywords:
Depends on:
Blocks: 91791
  Show dependency treegraph
 
Reported: 2011-09-14 10:32 PDT by Daniel Holbert
Modified: 2019-03-13 07:28 PDT (History)
18 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Holbert 2011-09-14 10:32:55 PDT
STEPS TO REPRODUCE:
 1. Load this URI directly:
      data:text/html,<html>no hash should be visible:#

EXPECTED RESULTS:
  No # should be visible on the resulting rendered content.  The "#" should signify the end of the URI and the start of the fragment identifier.

ACTUAL RESULTS:
  The page renders with a "#" as part of the content.

Firefox versions > 6 as well as Opera 11.51 show EXPECTED RESULTS.

Chromium shows ACTUAL RESULTS
 version tested: 14.0.835.126 (Developer Build 99097 Linux) Ubuntu 11.10
Comment 1 Daniel Holbert 2011-09-14 10:54:35 PDT
ALTERNATE STEPS TO REPRODUCE (slightly trickier, but same underlying issue):
 Load this URI:
=========
data:application/xhtml+xml,<html xmlns='http://www.w3.org/1999/xhtml'><div style='height:3000px; width: 300px; background: red'/><div id='target' style='height: 3000px; width: 300px; background: green;'/></html>#target
=========

ACTUAL RESULTS:  Error page, for an XML parse error, saying:
>  This page contains the following errors:
>  error on line 1 at column 47: Extra content at the end of the document
>  Below is a rendering of the page up to the first error.
> [rendering of the page]

EXPECTED RESULTS:
 (a) No errors reported
 (b) tall green rect should be visible (and no red should be visible).  That is -- the initial load should seek past the red to the 'target' element.

Again, Opera & Firefox give EXPECTED RESULTS, and Chromium 14 gives ACTUAL RESULTS.
Comment 2 Daniel Holbert 2011-09-14 10:58:48 PDT
(note: in both comment 0 and comment 1, Opera only gives "expected results" if I view the data URIs *directly*, whereas it matches WebKit's behavior if I view it in an <img> / <embed>.  That inconsistency appears to be a bug in Opera.)
Comment 3 Alexey Proskuryakov 2011-09-14 14:53:12 PDT
Hmm, there is probably some history behind this that I don't know. Looks like Chrome and Safari might have disagreed on this (per bug 22882), but no longer do.

What does IE do?
Comment 4 Adam Barth 2011-09-14 14:59:33 PDT
There's a thread about this topic on whatwg currently.  This issue is subtle.
Comment 5 Alexey Proskuryakov 2011-09-14 15:44:50 PDT
Thanks, that is <http://www.mail-archive.com/whatwg@lists.whatwg.org/msg28146.html>.

As mentioned on the thread, one can expect consistency between data:, javascript: and possibly even mailto: in this regard.
Comment 6 Adam Barth 2011-09-14 15:49:32 PDT
The underlying issue is that there's a dream that all URIs should be parseable with a single algorithm, but I don't think we can realize that dream given compatibility constraints.  Compounded on that are tricky issues of using relative fragments in documents created from data URLs.  WebKit does something somewhat magical to make this "work".

It's very unclear to me what the best solution to all these constraints is.
Comment 7 Robert O'Callahan 2011-09-18 22:22:08 PDT
Ryosuke said "It seems like sticking to RFC is a cleaner option if possible." Is it possible for Webkit? If not, we need to change the spec.
Comment 8 Adam Barth 2011-09-18 22:26:26 PDT
> Is it possible for Webkit? If not, we need to change the spec.

I haven't finished reading the discussion on whatwg yet.  My understanding is that the specification is currently being revised in the IRI working group.  I'll try to catch up on the discussion on Monday.  Is there more to read besides this bug and the whatwg thread?
Comment 9 Robert O'Callahan 2011-09-18 22:33:04 PDT
Not that I know of.
Comment 10 Adam Barth 2011-09-19 17:47:22 PDT
Ok.  I've caught up on my reading.  I don't see a big reason to change behavior here.  The dream of having all URIs parse the same way is just a dream.  Some common URI schemes, such as javascript URIs, can't be done this way.  There are also obscure ones, like ed2k, that can't work this way either.

The main reason I see to change our parsing of data URLs is in connection with fragment references in SVG.  I'm not sure what the best solution there is, but using the RFC 3986 regexp on data URLs doesn't seem like the best approach, especially give that doing so has resulted in Firefox getting compat complaints.
Comment 11 Robert O'Callahan 2011-09-19 20:54:04 PDT
OK, can you jump back into the WHATWG thread and propose a suitable spec change? I expect some opposition and I don't want Daniel or I to be the man in the middle :-).
Comment 12 Ian 'Hixie' Hickson 2011-09-19 20:59:30 PDT
IMHO what Firefox does here makes much more sense. Otherwise, how do you do fragment identifiers with data: URLs? There's simply no good reason to not support fragment identifiers here as far as I can tell, and there's no reason not to — it's trivial to escape the # character.
Comment 13 Alexey Proskuryakov 2011-09-19 22:46:23 PDT
I agree with Adam.

If same document fragment references in data: URLs somehow become really important, we'll likely want to shortcut calculation by not completing the "reltative" fragment-only references for performance reasons anyway.
Comment 14 Julian Reschke 2011-09-20 00:48:03 PDT
(In reply to comment #10)
> Ok.  I've caught up on my reading.  I don't see a big reason to change behavior here.  The dream of having all URIs parse the same way is just a dream.  Some common URI schemes, such as javascript URIs, can't be done this way.  There are also obscure ones, like ed2k, that can't work this way either.
> 
> The main reason I see to change our parsing of data URLs is in connection with fragment references in SVG.  I'm not sure what the best solution there is, but using the RFC 3986 regexp on data URLs doesn't seem like the best approach, especially give that doing so has resulted in Firefox getting compat complaints.

Sounds like "(b) and (c) are weird as well, so we can't make (a) less weirder". 

ed2k, as you say, is obscure. Does it affect browsers?

javascript is indeed interesting. I still think these aren't proper identifiers at all, and only live in browsers, so browser people may want to reserve the scheme name, and just specify what they do with them as something being in use where URIs/IRIs appear, but not really belonging to this class of identifiers.

What makes data different from the other schemes you mentioned is:

- browser implementations differ (as far as I can tell)

- they are produced/consumed in other places (per spec)

- they actually denote a representation to which a fragment can be applied (I don't think this is the case for javascript or ed2k)
Comment 15 Adam Barth 2011-09-20 00:51:44 PDT
I guess I'm still not clear on what problem we're solving.  Is there a web site that would work better if we changed WebKit's behavior?
Comment 16 Robert O'Callahan 2011-09-20 01:25:25 PDT
The problem I want to solve here is interop. If you think Webkit's behavior is fine, please document it and propose it to the list.
Comment 17 Robert O'Callahan 2011-09-20 01:40:25 PDT
When you do that it would be good if you can describe how relative URIs should work in data: URIs.

Given
data:text/html,<a href="#foo">hello</a><div id="#foo" style="margin-top:10000px;">pick me!</div>
clicking on the link seems to navigate to about:blank in Chrome, which is unexpected.
Comment 18 Adam Barth 2011-09-20 10:39:54 PDT
> The problem I want to solve here is interop. If you think Webkit's behavior is fine, please document it and propose it to the list.

I'm sorry that I don't have time to do that at the moment.  Previously I was working a full specification of URL handling, but I've stopped working on that spec.  You can find my work-in-progress here:

https://github.com/abarth/url-spec/blob/master/drafts/url.xml

> When you do that it would be good if you can describe how relative URIs should work in data: URIs.

It's pretty simple:

1) If the relative URL is an absolute URL, then the resolved URL is that absolute URL (appropriately canonicalized).

2) Otherwise, the resolved URL is an invalid URL.

Separately, trying to navigate to an invalid URL results in a navigation to about:blank.

In the spec, this is marked as a TODO:

        <t>TODO: If base-url's scheme is not hierarchical, we can't resolve as
        a relative URL.  We'll probably want to return an invalid URL.  Check
        what happens when resolving an empty string as a relative URL with a
        non-hierarchical base.</t>

From the point of view of the spec, data URLs are not hierarchical, which essentially means we parse out the scheme and leave the remainder of the URL to be parsed by a scheme-specific parser.
Comment 19 Robert O'Callahan 2011-09-20 11:51:05 PDT
OK. The following HTML displays a green rectangle in Chrome (the #greenRect is allowed to resolve to an element in the data: document), so I guess you'd consider that a bug? Should I file it?

<img src="data:image/svg+xml,&lt;svg xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink'&gt;&lt;defs&gt;&lt;rect id='greenRect' height='50' width='300' fill='green'/&gt;&lt;/defs&gt;&lt;text y='30' fill='red'&gt;fail&lt;/text&gt;&lt;use xlink:href='#greenRect'/&gt;&lt;/svg&gt;">
Comment 20 Adam Barth 2011-09-20 12:03:36 PDT
> OK. The following HTML displays a green rectangle in Chrome (the #greenRect is allowed to resolve to an element in the data: document), so I guess you'd consider that a bug? Should I file it?

We can look at the code, but I think SVG has hacks to avoid URL resolution.  I believe that was discussed in more detail on the whatwg thread.  I don't have any specific knowledge about how that works.
Comment 21 Robert O'Callahan 2011-09-20 12:26:36 PDT
(In reply to comment #20)
> We can look at the code, but I think SVG has hacks to avoid URL resolution.  I
> believe that was discussed in more detail on the whatwg thread.

I can't find that in the thread.

When you say "SVG has hacks", I assume you mean Webkit's SVG implementation? Given that changing <img src> to <iframe src> makes the #greenRect not resolve, it does look like something to do with SVG images specifically. I can't imagine that we really want SVG images to behave differently here ... do you agree?
Comment 22 Adam Barth 2011-09-20 12:44:47 PDT
> I can't find that in the thread.

It was mentioned in Comment #10 and Comment #13 in this bug.  In the thread, it was mentioned in Boris Zbarsky's message on Sat, Sep 10, 2011 at 5:32 PM.

> When you say "SVG has hacks", I assume you mean Webkit's SVG implementation?

Yes.

> Given that changing <img src> to <iframe src> makes the #greenRect not resolve, it does look like something to do with SVG images specifically. I can't imagine that we really want SVG images to behave differently here ... do you agree?

My perspective is that SVG image itself a giant hack itself (or, at least, the WebKit implementation thereof is a giant hack).  Our implementation is really very bad and doesn't have a path forward to not being full of bugs.

I'm sorry to not be more helpful about this topic, but we've reached the limits of my knowledge.  I can go read the code to understand what's going on in the SVG case, but I learned about it from Boris' email.  I've studied the URL resolution behavior in more detail, and it works as I've described.

If we'd had this discussion back in April, I would have been much more excited about digging in here and finishing the URL spec.  At the moment, I'd rather give the IRIbis working group some breathing room and a chance to succeed.  Once they've either succeeded or failed, we should talk more about writing a high-quality specification for URL handling.
Comment 23 Robert O'Callahan 2011-09-20 13:20:06 PDT
(In reply to comment #22)
> > I can't find that in the thread.
> 
> It was mentioned in Comment #10 and Comment #13 in this bug.  In the thread,
> it was mentioned in Boris Zbarsky's message on Sat, Sep 10, 2011 at 5:32 PM.

Thanks.

> If we'd had this discussion back in April, I would have been much more excited
> about digging in here and finishing the URL spec.  At the moment, I'd rather
> give the IRIbis working group some breathing room and a chance to succeed.
> Once they've either succeeded or failed, we should talk more about writing a
> high-quality specification for URL handling.

Fair enough, although the status quo is quite unsatisfactory.

Of the three options:
1) Current behavior of Firefox and Opera: allow data: URIs to have fragment references, parse everything after the first # as a fragment reference, and allow relative URIs consisting of fragment references to resolve relative to a data: URI.
2) Your suggestion in comment #18 applied consistently to SVG images: disallow fragment references in data: URIs, resolve all URIs relative to a data: URI to an invalid URI.
3) Your suggestion in comment #18, except with something magical happening for SVG (possibly only in an image context) (e.g., have SVG special-case fragment-only relative URIs to bypass normal relative URI resolution)
#2 would cripple data: URIs for SVG images. #3 seems unnecessarily complex, and limiting it to SVG would be difficult as we increase the integration of SVG with CSS and HTML. #1 seems best to me.
Comment 24 Adam Barth 2011-09-20 14:01:37 PDT
Doesn't option (1) have compatibility problems?  I thought that's why this discussion started.
Comment 25 Robert O'Callahan 2011-09-20 14:24:49 PDT
The issue arose on our side when one of our developers complained that an SVG image with a data: URI containing unescaped #s worked in Webkit but not in Firefox. I don't know whether or not Web content depends on this.
Comment 26 Adam Barth 2011-09-20 14:26:57 PDT
The net-net of this discussion seems to be that you're not interesting in changing your behavior at this time and we're not interested in changing our behavior at this time.  That means we'll probably need to revisit this topic in the future.
Comment 27 Robert O'Callahan 2011-09-20 14:38:00 PDT
Personally I wouldn't mind changing our behavior if there was some agreement on what to change it to. In fact, the WHATWG thread started with Daniel and I proposing a behavior change (which, unfortunately, everyone hated :-) ).
Comment 28 Adam Barth 2011-09-20 14:45:08 PDT
It doesn't seem like there is an agreement based on Comment #10 and Comment #13.  I agree that WebKit's current behavior isn't particularly aesthetic, but treating # in data URLs as the beginning of the fragment is also pretty unappealing.

I certainly don't have a strong design to strong-arm you into adopting WebKit's behavior.  I'm sorry that there doesn't appear to be a solution that's compelling enough for everyone to adopt.  As you know, I'm a big fan of interoperability.
Comment 29 Henry S. Thompson 2011-10-19 04:30:17 PDT
(In reply to comment #18)

> . . .

> > When you do that it would be good if you can describe how relative URIs should work in data: URIs.
> 
> It's pretty simple:
> 
> 1) If the relative URL is an absolute URL, then the resolved URL is that absolute URL (appropriately canonicalized).
> 
> 2) Otherwise, the resolved URL is an invalid URL.
> 
> Separately, trying to navigate to an invalid URL results in a navigation to about:blank.
> 
> In the spec, this is marked as a TODO:
> 
>         <t>TODO: If base-url's scheme is not hierarchical, we can't resolve as
>         a relative URL.  We'll probably want to return an invalid URL.  Check
>         what happens when resolving an empty string as a relative URL with a
>         non-hierarchical base.</t>
> 
> From the point of view of the spec, data URLs are not hierarchical, which essentially means we parse out the scheme and leave the remainder of the URL to be parsed by a scheme-specific parser.

I think we have to be careful here to distinguish relative URIs in general from same-document references.  I claim that same-document references (and all the examples of relative data: URIs we've seen so far _are_ same-document references) _may_ make sense for non-hierarchical URI schemes, and should not be treated as errors.

So, for example, clicking on the link in that appears at the beginning of the result of interpreting

 data:application/xhtml+xml,<html xmlns='http://www.w3.org/1999/xhtml'><a href='#target'>go green</a><div style='height:3000px; width: 300px; background: red'/><div id='target' style='height: 3000px; width: 300px; background: green;'/></html>

should work, IMO.
Comment 30 Daniel Holbert 2011-10-21 13:11:10 PDT
(In reply to comment #29)
> So, for example, clicking on the link in that appears at the beginning of the result of interpreting
> 
>  data:application/xhtml+xml,<html xmlns='http://www.w3.org/1999/xhtml'><a href='#target'>go green</a>  [etc]

I would agree with you if you replaced "#" with "%23".

But in the URI as you provided it, the "#" technically terminates the main portion of the URI and indicates the beginning of the fragment-identifier.  So your URI just contains this (invalid) XML:
> <html xmlns='http://www.w3.org/1999/xhtml'><a href='
and that's it.

This is how both Opera and Firefox handle that URI, when they load it directly.  I agree that this strict behavior is a bit unintuitive (see the whatwg thread I started about trying to make it better[1]), but the general consensus in that thread was that the strict behavior (that Firefox implements) is correct.

http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2011-September/033162.html
Comment 31 Dominic Mazzoni 2014-06-19 09:48:14 PDT
We've been discussing this on the Chromium bug tracker: http://crbug.com/123004

Does anyone have any new thoughts on this?
Comment 32 j.j. 2018-01-24 15:21:45 PST
Blink
Intent to Deprecate and Remove: Support for '#' in data URI body

https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/_g_HnAHkMPo
Comment 33 Stephen McGruer 2019-03-13 07:28:32 PDT
FYI (without any comment on what anyone should or shouldn't do!): Blink's (second) attempt at removing this landed in our M72 milestone, and has now survived an entire milestone in the field (as M73 has started rolling out). Chromium is now aligned with the spec and Firefox; '#' in data URIs mark the end of the content and the start of the fragment.