Bug 131587

Summary: userVisibleString should not try to "encode" host names
Product: WebKit Reporter: Darin Adler <darin>
Component: WebCore Misc.Assignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, jeffm, mitz
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Mac   
OS: All   
Attachments:
Description Flags
Patch
none
Patch ap: review+

Description Darin Adler 2014-04-13 00:54:00 PDT
userVisibleString should not try to "encode" host names
Comment 1 Darin Adler 2014-04-13 01:00:20 PDT
Created attachment 229224 [details]
Patch
Comment 2 Alexey Proskuryakov 2014-04-13 11:49:41 PDT
Comment on attachment 229224 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=229224&action=review

> Source/WebCore/ChangeLog:3
> +        userVisibleString should not try to "encode" host names

The optional encoding was added in <http://trac.webkit.org/changeset/32886> as part of fixing a security bug. I spent some time trying to figure out why it was added, but was not successful.

Did the changes in <http://trac.webkit.org/changeset/32886> become obsolete when we decided to show punycode for suspicious domain names? If so, can createStringWithEscapedUnsafeCharacters() be removed now? Can you add a regression test for <rdar://problem/5884383>, given that you substantially change the code added to fix it?

> Source/WebCore/ChangeLog:19
> +        (WebCore::userVisibleString): Only call mapHostNames is host name decoding is

Typo: is/if.

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:861
> +    if (needsHostNameDecoding)

needsHostNameDecoding is true any time "xn--" is seen anywhere in the URL string, not just in host name. It may be appropriate to rename it to something like hostNameMayBeIDNEncoded in this patch.

> Tools/TestWebKitAPI/Tests/mac/URLExtras.mm:75
> +    EXPECT_STREQ("http://site.com\xEF\xBC\x8Fothersite.org", originalDataAsString(literalURL("http://site.com\xEF\xBC\x8Fothersite.org")));

It's not clear to me why this is correct behavior.
Comment 3 Darin Adler 2014-04-13 18:00:48 PDT
(In reply to comment #2)
> Did the changes in <http://trac.webkit.org/changeset/32886> become obsolete when we decided to show punycode for suspicious domain names?

Not sure. The test case from that old bug involves %-escaped characters in a webpage. When I test it in Safari, I see %-escape sequences, not puny code, in the smart search field.

> Can you add a regression test for <rdar://problem/5884383>, given that you substantially change the code added to fix it?

Yes.

I will add test cases to cover these kinds of URLs when they come in through URLWithUserTypedString, and also when they do not.

It seems that original issue is almost identical to the bug being looked at now. It’s just that the character is a space rather than a slash, and the trick is to push the end of the hostname off the right side of the field rather than having the browser consider it not part of the hostname.

I think the real issue here is sufficient test coverage. Once we add enough tests we can try to make refinements to the code, and maybe discover more that is unneeded.

> Typo: is/if.

Fixed.

> > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:861
> > +    if (needsHostNameDecoding)
> 
> needsHostNameDecoding is true any time "xn--" is seen anywhere in the URL string, not just in host name. It may be appropriate to rename it to something like hostNameMayBeIDNEncoded in this patch.

Yes. I believe it’s just a performance optimization, so I think it should be named mayNeedHostNameDecoding.

> > Tools/TestWebKitAPI/Tests/mac/URLExtras.mm:75
> > +    EXPECT_STREQ("http://site.com\xEF\xBC\x8Fothersite.org", originalDataAsString(literalURL("http://site.com\xEF\xBC\x8Fothersite.org")));
> 
> It's not clear to me why this is correct behavior.

The job of the originalData family of functions is primarily to pass URL bytes through unchanged. That’s what is supposed to happen here and what does happen. I’m not saying we might never considering changing this behavior, but here neither the input nor the output function is the one that’s supposed to process URLs; these functions are supposed to just pass them through unmodified.
Comment 4 Darin Adler 2014-04-13 18:04:47 PDT
(In reply to comment #2)
> > +    EXPECT_STREQ("http://site.com\xEF\xBC\x8Fothersite.org", originalDataAsString(literalURL("http://site.com\xEF\xBC\x8Fothersite.org")));
> 
> It's not clear to me why this is correct behavior.

Oh, I see, you were saying that you’d expect this URL to get encoded with xn--. I think that’s something we might do some day, but at the moment the functions do not do that. The actual URLs in the webpages have to have the xn-- in them.

We have separate bug reports saying this is a missing feature, and to fix that we might end up changing these functions.
Comment 5 Darin Adler 2014-04-13 18:56:31 PDT
Created attachment 229255 [details]
Patch
Comment 6 Alexey Proskuryakov 2014-04-13 22:22:29 PDT
Comment on attachment 229255 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=229255&action=review

> Oh, I see, you were saying that you’d expect this URL to get encoded with xn--.

Possibly, or perhaps we should reject such URLs in URLWithData outright. I just don't expect non-ASCII bytes to come out of any URL accessors ever, except for "user visible string" one.

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:918
> -    result = mapHostNames(result, !needsHostNameDecoding);
> +    if (mayNeedHostNameDecoding)
> +        result = mapHostNames(result, NO);

I'm still not sure if I understand why we needed !needsHostNameDecoding six years ago, and don't need it any more. Will go with the theory that it was a mistake back then - the boolean argument may be misread as "encode or do not encode" instead of "encode or decode" that it is.
Comment 7 Darin Adler 2014-04-13 23:12:15 PDT
(In reply to comment #6)
> I just don't expect non-ASCII bytes to come out of any URL accessors ever, except for "user visible string" one.

For parts of the URL other than the hostname, preserving non-ASCII bytes can be critical. There are some servers that need to get back the exact same bytes they gave us to fetch a webpage, even though if we interpret those bytes as UTF-8 they are invalid.

Keep in mind that we are just writing functions here, not designing the NSURL class. We can’t invent an invariant for the class that doesn’t exist. The class can, and does, store arbitrary bytes and return those bytes. Sometimes we use NSURL to carry those bytes from one place to another without interpreting or mangling them.

I think there is room for further improvement here, but this is not something I would change lightly.
Comment 8 Darin Adler 2014-04-13 23:58:55 PDT
Committed r167211: <http://trac.webkit.org/changeset/167211>