Bug 13167 - Unescape %-escaped hostnames and convert them to punycode before DNS lookup
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 523.x (Safari 3)
Hardware: All All
Importance: P2 Normal
Assignee: Nobody
URL: http://sailor%e6%9c%88.com/
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2007-03-22 17:55 PDT by Jungshik Shin
Modified: 2008-01-17 09:54 PST
Description Jungshik Shin 2007-03-22 17:55:50 PDT
ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt

RFC 3986, section 3.2.2, specifies that %-encoded hostnames need to be supported.

Needless to say, IDN support needs to be added first. Is it now supported in the trunk? If not, has a bug been filed for it here? My quick (not so thorough) search turned up nothing.

The corresponding Gecko bug is at 

https://bugzilla.mozilla.org/show_bug.cgi?id=309671
Comment 1 Mark Rowe (bdash) 2007-03-23 04:01:36 PDT
I believe IDN support is presently at the WebKit level. A good test URL is http://www.xn--sailor-183m.com/ (Safari loads it correctly, and it handles http://www.sailor月.com/ in the URL bar correctly too). It doesn't handle http://www.sailor%e6%9c%88.com/ though, which is the case this bug report describes. What I observe is that http://www.sailor%e6%9c%88.com/ is converted into http://www.sailor月.com/ in the Safari address bar, but the page load fails because www.sailor%e6%9c%88.com is used in the DNS lookup.
Comment 2 Mark Rowe (bdash) 2007-03-23 04:42:25 PDT
Sigh.  The mangled URL is intended to be the kanji character equivalent of the %-escaped triplet.
Comment 3 Jungshik Shin 2007-03-23 10:23:31 PDT
Thanks for the info. Indeed, WebKit trunk supports IDN. Can you tell me when it was fixed? 

I've just tried http://www.청와대.kr and it worked fine. (Before submitting a comment with non-ASCII characters, make sure that View | Encoding is set to UTF-8. If you had done that, you wouldn't have had the problem you mentioned in comment #2.)
Comment 4 Mark Rowe (bdash) 2007-03-23 10:31:36 PDT
As far as I am aware, Safari 2.0 supports IDN correctly too.  Unless I am mistaken it is not a recent addition to WebKit.

As far as UTF-8 goes, your comment shows up with garbled characters too as Bugzilla doesn't specify any character set in its HTTP headers or document header.  I should look at fixing this on the server side so that all pages are served as UTF-8 and forms are submitted as the same.
Comment 5 Jungshik Shin 2007-03-23 10:55:28 PDT
(In reply to comment #4)
> As far as I am aware, Safari 2.0 supports IDN correctly too.  Unless I am
> mistaken it is not a recent addition to WebKit.

Thanks a lot for the info. Indeed, Safari 2.0.4 on my Mac supports it well. I should have tried it before asking. 
 
> As far as UTF-8 goes, your comment shows up with garbled characters too as
> Bugzilla doesn't specify any character set in its HTTP headers or document
> header.  

Of course, I'm well aware of that. :-) I thought it was obvious that you should set View | Encoding to UTF-8 when reading my comment :-) In your case, characters not covered by the encoding in effect when you submitted your comment (most likely ISO-8859-1 or Windows-1252) were converted to NCRs and stored that way in the Bugzilla DB, so simply changing the encoding on the browser side does not give back the original. In my case, UTF-8 byte sequences are stored in the DB and 'emitted' to the browser, so just changing the encoding works.


> I should look at fixing this on the server side so that all pages are
> served as UTF-8 and forms are submitted as the same.
 
It took bugzilla.mozilla.org 5+ years to fix that problem! WebKit Bugzilla has only 13k bugs, and I guess most of them are straight ASCII, so it should be easier. See http://bugzilla.mozilla.org/show_bug.cgi?id=126266 (and the bugs duped to it or spun off from it) for the long and winding road they took.


Comment 6 Mark Rowe (bdash) 2007-04-27 03:00:14 PDT
<rdar://problem/5166146>
Comment 7 Rosyna 2007-05-14 05:16:21 PDT
I believe rdar://4379131 is also this exact bug.
Comment 8 Eric Seidel (no email) 2008-01-17 01:28:55 PST
My guess is that this bug lies in:
static DeprecatedString encodeHostname(const DeprecatedString &s)

which uses uidna_IDNToASCII (I believe to handle unicode # escapes).

If that's true, then uidna_IDNToASCII probably doesn't handle % escapes and we'd just have to fix them up first before passing it through.

This is all just a guess however.
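
To make the "fix them up first" idea concrete, here is a minimal sketch of the unescaping step only. It is not WebKit code: percentDecodeHostname and hexDigitValue are hypothetical helper names, std::string stands in for DeprecatedString, and the choice to copy malformed escapes through unchanged is an assumption rather than anything the RFC or WebKit prescribes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of a hex digit, or -1 if c is not a hex digit.
int hexDigitValue(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: replaces every well-formed %XX triplet in a hostname
// with the single byte it encodes. Malformed or truncated escapes are copied
// through unchanged (an assumption made for this sketch).
std::string percentDecodeHostname(const std::string& host)
{
    std::string decoded;
    decoded.reserve(host.size());
    for (std::size_t i = 0; i < host.size(); ++i) {
        if (host[i] == '%' && i + 2 < host.size()) {
            int high = hexDigitValue(host[i + 1]);
            int low = hexDigitValue(host[i + 2]);
            if (high >= 0 && low >= 0) {
                decoded.push_back(static_cast<char>((high << 4) | low));
                i += 2; // skip the two hex digits just consumed
                continue;
            }
        }
        decoded.push_back(host[i]);
    }
    return decoded;
}
```

With this, www.sailor%e6%9c%88.com becomes www.sailor followed by the UTF-8 bytes E6 9C 88 and then .com. Those bytes would presumably still have to be interpreted as UTF-8 and converted to UTF-16 before being handed to uidna_IDNToASCII, which takes UChar input; that conversion is not shown here.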