Bug 7461 - Always encode the path part of an URI as UTF-8
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform
Version: 420+
Hardware: Mac OS X 10.4
Importance: P2 Normal
Assignee: Alexey Proskuryakov
URL: http://www.w3.org/2001/08/iri-test/re...
Reported: 2006-02-25 01:51 PST by Alexey Proskuryakov
Modified: 2006-06-24 08:11 PDT


Attachments
proposed fix (4.62 KB, patch)
2006-06-24 05:45 PDT, Alexey Proskuryakov
darin: review+

Description Alexey Proskuryakov 2006-02-25 01:51:24 PST
From <https://bugzilla.mozilla.org/show_bug.cgi?id=261929>.

WinIE 6 and Opera by default encode the path part of the URL as UTF-8, and use the page encoding only for the query part. A proposed standard on Internationalized Resource Identifiers <http://www.w3.org/International/iri-edit/> says that UTF-8 should be unconditionally used for all parts, and IE 7 beta preview2 reportedly works this way, intentionally or not.

Safari uses the page encoding even for the path part, matching Firefox (see the Mozilla bug mentioned above).
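
To make the difference concrete, here is a minimal Python sketch of the two behaviors described above. It is only an illustration, not WebKit code: the function name `encode_url`, the `utf8_path` flag, and the sample path and query are all hypothetical, and a windows-1251 page is assumed.

```python
from urllib.parse import quote


def encode_url(path: str, query: str, page_encoding: str, utf8_path: bool) -> str:
    """Percent-encode a path and a query string (hypothetical helper).

    utf8_path=True models the WinIE 6 / Opera behavior (path always UTF-8);
    utf8_path=False models the page-encoding behavior for the path.
    The query is encoded with the page encoding in both cases.
    """
    path_encoding = "utf-8" if utf8_path else page_encoding
    encoded_path = quote(path, safe="/", encoding=path_encoding)
    encoded_query = quote(query, safe="=&", encoding=page_encoding)
    return f"{encoded_path}?{encoded_query}"


# Assumed example: a link on a windows-1251 page.
print(encode_url("/файл", "q=файл", "cp1251", utf8_path=True))
# /%D1%84%D0%B0%D0%B9%D0%BB?q=%F4%E0%E9%EB
print(encode_url("/файл", "q=файл", "cp1251", utf8_path=False))
# /%F4%E0%E9%EB?q=%F4%E0%E9%EB
```

The same four characters produce eight escaped bytes in UTF-8 and four in windows-1251, so a server sees entirely different path bytes depending on which behavior the browser follows.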

Besides the W3C test from the bug URL, the following page has been mentioned as an example: <http://www.cdpkorea.com/zboard4/zboard.php?id=pdsboard&page=1&page_num=20&select_arrange=headnum&desc=&sn=off&ss=on&sc=on&keyword=&no=43865&category> (there should be four photos, not four replacement images).
Comment 1 Alexey Proskuryakov 2006-06-24 05:45:09 PDT
Created attachment 9001
proposed fix

The major browsers disagree on many details of non-ASCII URI handling; also, both Firefox 3 and WinIE 7 include major changes to it. This patch makes a single modification that seems undisputed, and includes a test that verifies the status quo.
Comment 2 Alexey Proskuryakov 2006-06-24 05:46:17 PDT
Comment on attachment 9001
proposed fix

Please disregard the empty utf8-window-location.html in the patch.
Comment 3 Darin Adler 2006-06-24 07:39:33 PDT
Comment on attachment 9001
proposed fix

I wonder what the real-world impact of this is going to be. It's interesting hearing what the various browsers do, but I also wonder what the various websites do. Do we know any websites where the old Safari would work and the new one would fail?

r=me
Comment 4 Alexey Proskuryakov 2006-06-24 08:11:23 PDT
Committed revision 15010.

(In reply to comment #3)
The Mozilla bug mentions one page that needs this change, and has the following comment:
"In the past, I saw many web sites asking their visitors to turn off 'Always send URLs in UTF-8' in MSIE. These days, I rarely see it."

Sites that would regress are those running older Unices, with file systems not in UTF-8 (and without an Apache module recoding file paths). Since WinIE and Opera default to UTF-8 for paths, such sites are apparently rare. I do not know any examples.
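
A rough Python sketch of that regression scenario, assuming a server that compares the raw bytes of the requested path with file names stored in windows-1251 and does no recoding. The helper `request_matches_file` and the file name are hypothetical and chosen only for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def request_matches_file(request_path: str, stored_name: bytes) -> bool:
    """Compare the unescaped bytes of the last path segment with an on-disk name."""
    last_segment = request_path.rsplit("/", 1)[-1]
    return unquote_to_bytes(last_segment) == stored_name


# Assumed setup: the file system stores names as windows-1251 bytes.
stored_name = "файл.html".encode("cp1251")

# Path as sent when the browser uses the page encoding (windows-1251 here).
legacy_request = "/" + quote("файл.html", encoding="cp1251")
# Path as sent when the browser always uses UTF-8.
utf8_request = "/" + quote("файл.html", encoding="utf-8")

print(request_matches_file(legacy_request, stored_name))  # True
print(request_matches_file(utf8_request, stored_name))    # False
```

With a UTF-8 file system, or a server module that recodes paths, the second request would be the one that matches.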

Actually, I'm surprised that we didn't have bug reports with this issue being a root cause (or are they all in Radar?). Myself, I did see people building in-house .asp pages with Windows Cyrillic charset and Russian file names; those won't work in current Firefox and Safari releases (I think; never had a chance to actually test that).