RESOLVED FIXED 21635
Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or encoding the query part of a URL
https://bugs.webkit.org/show_bug.cgi?id=21635
Summary Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or en...
Jungshik Shin
Reported 2008-10-15 22:43:22 PDT
HTML5 specifies that a form submitted from a UTF-16 page should use UTF-8. Firefox ( before WHATWG was formed) had used UTF-8 not just for UTF-16(LE|BE) but also for UTF-32(LE|BE) and UTF-7. The same can be said of the way the query part of the URL is encoded when it's in a UTF-7,16,32 page.
Attachments
test case in UTF-16LE (1.24 KB, text/html)
2008-10-15 22:49 PDT, Jungshik Shin
no flags
patch without layout test (5.05 KB, patch)
2008-10-15 23:22 PDT, Jungshik Shin
no flags
updated patch (still without layout tests) (5.05 KB, patch)
2008-10-16 00:01 PDT, Jungshik Shin
no flags
UTF-7 test case (335 bytes, text/html)
2008-10-16 00:02 PDT, Jungshik Shin
no flags
patch with layout tests (52.57 KB, patch)
2008-10-21 13:43 PDT, Jungshik Shin
darin: review-
upated patch (53.82 KB, patch)
2008-11-05 10:29 PST, Jungshik Shin
darin: review+
patch for check-in (53.27 KB, patch)
2008-11-24 18:19 PST, Jungshik Shin
jshin: review+
patch for check-in with 'url' => 'URL' (53.27 KB, patch)
2008-11-25 10:59 PST, Jungshik Shin
no flags
UTF-16,32 test files (2.31 KB, application/octet-stream)
2008-12-01 21:27 PST, Jungshik Shin
no flags
Jungshik Shin
Comment 1 2008-10-15 22:49:21 PDT
Created attachment 24384 [details] test case in UTF-16LE Two characters in the forms are U+2122 and U+5341 Try submitting 'GET' and also pressing the link below POST.
Jungshik Shin
Comment 2 2008-10-15 23:16:29 PDT
Change the summary to cover both form submission and the query part encoding.
Jungshik Shin
Comment 3 2008-10-15 23:22:15 PDT
Created attachment 24385 [details] patch without layout test Not yet made layout tests. And, it does not check for BOCU, SCSU, CESU-8.
Jungshik Shin
Comment 4 2008-10-16 00:01:58 PDT
Created attachment 24389 [details] updated patch (still without layout tests) In the previous patch, I forgot to replace closest8BitEquivalent with encodingForFormSubmission in HTMLFormElement.cpp BTW, this issue was reported for Chrome ( http://code.google.com/p/chromium/issues/detail?id=3433 )
Jungshik Shin
Comment 5 2008-10-16 00:02:41 PDT
Created attachment 24390 [details] UTF-7 test case
Alexey Proskuryakov
Comment 6 2008-10-16 00:27:10 PDT
What does IE do in this case?
Jungshik Shin
Comment 7 2008-10-16 10:54:04 PDT
Two test files uploaded include what IE does for a url with a query parameter. For UTF-16LE, it appears that it converts to the default encoding (en-US IE does For form submission from a UTF-16 page, IE does what Firefox does (what HTML5 spec recommends and what my patch implements). That is, it uses UTF-8. For form submission in UTF-7, I haven't yet tried it with IE.
Jungshik Shin
Comment 8 2008-10-21 13:43:23 PDT
Created attachment 24544 [details] patch with layout tests I added layout tests for UTF-{7,16LE,16BE,32LE,32BE} and url, post and get.
Darin Adler
Comment 9 2008-10-24 11:59:36 PDT
Comment on attachment 24544 [details] patch with layout tests Looks great! > + // Treat UTF-16, UTF-32, UTF-7 as UTF-8 > + init(base, relative, encoding.encodingForFormSubmission()); I think this comment is unclear. The comment really means "using the encodingForFormSubmission function will ..." and that's clear right now when we're making the change but someone coming later to look at this function won't necessarily understand that. Could you write a comment that says that a little more specifically? It seems unfortunate that the URL code is calling encodingForFormSubmission. In what way is URL encoding the same thing as form submission? Should there be a differently named synonym here instead? Or maybe the comment can explain that encoding of queries is essentially a "part of form submission"? > -static void addEncodingName(HashSet<const char*>& set, const char* name) > +namespace { > + > +void addEncodingName(HashSet<const char*>& set, const char* name) Throughout the WebKit project, we've been using static to get internal linkage instead of using anonymous namespaces, with very few exceptions. I understand that you can accomplish the same thing with an anonymous namespace, but is there a reason to start using that technique here? I'd prefer to make the change only if there's a good reason to do so. Using a mix of the two is probably less clear that consistently using one or the other. The anonymous namespace technique doesn't work as well with some debuggers, for example, so there needs to be an advantage to offset that disadvantage. > +bool TextEncoding::isNonByteBasedEncoding() const > +{ > + return *this == UTF16LittleEndianEncoding() || > + *this == UTF16BigEndianEncoding() || > + *this == UTF32BigEndianEncoding() || > + *this == UTF32LittleEndianEncoding(); > +} Our coding style is to put such operators at the beginnings of lines rather than the ends of lines. For this new function you've used the term "ByteBased", and for the existing function we used the term "8Bit". I'd like to see the functions use the same terminology. So we could rename the old one or the new one. > +// TODO(jungshik) : Do we have to care about BOCU, SCSU, and CESU-8? > +// Those encodings along with UTF-{7,8,16,32} come with 'ICU'. This is not the format we use to leave comments behind. We always use FIXME and we don't attach our names in that format. Also, I think this should be clearer. Try to write a comment someone can notice later. > +// HTML5 > +// specifies that UA should not support UTF-7, BOCU, SCSU and CESU-8. > +// To block them, closest8BitEquivalent (called in TextResourceLoader) > +// should be changed. I don't think this comment is in the right place. It's not about encodingForFormSubmission, is it? I don't understand what "blocking" these character sets would entail, but it seems that we wouldn't want to decode them at all. I'm going to say review- but these comments are really minor issues.
Jungshik Shin
Comment 10 2008-10-29 16:19:31 PDT
Thank you, Darin, for review. I'll address your comments and upload a new patch early next week.
Jungshik Shin
Comment 11 2008-11-05 10:29:14 PST
Created attachment 24910 [details] upated patch Darin, can you take another look? I used 'NonByteBased' for both functions because 'Non8Bit' can mean both '7bit encodings' like ISO-2022-JP and GB-HZ and non-byte-based encodings like UTF-16/32. Also, I just used 'file static' instead of anonymous namespace to follow the webkit convention. I updated comments and try to be as clear as possible (but I'm afraid I'm not so successful).
Jungshik Shin
Comment 12 2008-11-24 13:28:48 PST
Darin, would you take a look at my latest patch? Thank you !
Darin Adler
Comment 13 2008-11-24 13:36:55 PST
Comment on attachment 24910 [details] upated patch > + // For UTF-{7,16,32}, we want to use UTF-8 for the query part as > + // we do when submitting a form. A form with GET method > + // has its contents added to a url as query params and it makes sense > + // to be consistent whether we're dealing with a literal url with > + // the query part referred to in an HTML document in UTF-{7,16,32} > + // or a url formed for form submission. So, in both cases, > + // we use encodingForFormSubmission, which returns UTF-8 for > + // UTF-{7,16,32} while leaving other encodings alone. In the comment it should be URL rather than url. This comment is quite good, but probably a bit too wordy. I'd say: // For UTF-{7,16,32}, we want to use UTF-8 for the query part as // we do when submitting a form. A form with GET method // has its contents added to a URL as query params and it makes sense // to be consistent. And leave it at that. > +// HTML5 specifies that UTF-8 be used in form submission when a form is > +// is a part of a document in UTF-16 probably because UTF-16 is not a > +// byte-based encoding and can contain 0x00. By extension, the same > +// should be done for UTF-32. In case of UTF-7, it is a byte-based encoding, > +// but it's fraught with problems and we'd rather steer clear of it. Great comment! We normally use one space after a period, not two. > +// FIXME : Do we also have to return UTF-8 for BOCU, SCSU, and CESU-8? > +// Those algorithmic encodings are supported in ICU, but I have never > +// seen them being used for a web page. Instead of dealing with them > +// here for form submission, it's probably better not to decode them at > +// all. I don't think we need this comment. Lets just omit it. If you do want to keep the comment, please remove the space after "FIXME". If you did want this comment, I'd suggest putting it in the ICU source file. And yes, lets turn these off in TextCodecICU. > + * http/tests/misc/submit-get-in-utf16be-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf16be.html: Added. > + * http/tests/misc/submit-get-in-utf16le-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf16le.html: Added. > + * http/tests/misc/submit-get-in-utf32be-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf32be.html: Added. > + * http/tests/misc/submit-get-in-utf32le-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf32le.html: Added. > + * http/tests/misc/submit-get-in-utf7-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf7.html: Added. > + * http/tests/misc/submit-post-in-utf16be-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf16be.html: Added. > + * http/tests/misc/submit-post-in-utf16le-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf16le.html: Added. > + * http/tests/misc/submit-post-in-utf32be-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf32be.html: Added. > + * http/tests/misc/submit-post-in-utf32le-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf32le.html: Added. > + * http/tests/misc/submit-post-in-utf7-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf7.html: Added. > + * http/tests/misc/url-in-utf16be-expected.txt: Added. > + * http/tests/misc/url-in-utf16be.html: Added. > + * http/tests/misc/url-in-utf16le-expected.txt: Added. > + * http/tests/misc/url-in-utf16le.html: Added. > + * http/tests/misc/url-in-utf32be-expected.txt: Added. > + * http/tests/misc/url-in-utf32be.html: Added. > + * http/tests/misc/url-in-utf32le-expected.txt: Added. > + * http/tests/misc/url-in-utf32le.html: Added. > + * http/tests/misc/url-in-utf7-expected.txt: Added. > + * http/tests/misc/url-in-utf7.html: Added. These tests are great, but it's too bad we didn't find a way to do this testing with a single test driver file instead of having them all be separate tests. For example take a look at the techniques used in char-encoding.html used put lots of tests all into a single driver. r=me as-is; would be nice to make those slight comment improvements
Jungshik Shin
Comment 14 2008-11-24 18:19:56 PST
Created attachment 25465 [details] patch for check-in Thank you for the review. Transferring review+ to this patch. This patch ddressed Darin's comments on comments :-) This patch is ready for landing. I'll file another bug on making a driver test to test all the related encodings.
Jungshik Shin
Comment 15 2008-11-25 10:59:57 PST
Created attachment 25489 [details] patch for check-in with 'url' => 'URL' I forgot to capitalize url in KURL.cpp (in a comment). This patch does that.
Darin Fisher (:fishd, Google)
Comment 16 2008-11-25 11:31:59 PST
Darin Fisher (:fishd, Google)
Comment 17 2008-11-25 13:01:25 PST
The UTF7 tests have been disabled due to failures observed on some of the buildbots: http://trac.webkit.org/changeset/38759
Darin Fisher (:fishd, Google)
Comment 18 2008-11-25 13:04:19 PST
See https://bugs.webkit.org/show_bug.cgi?id=22492 for the tests that were disabled.
Jungshik Shin
Comment 19 2008-12-01 21:27:06 PST
Created attachment 25662 [details] UTF-16,32 test files This file contains utf-16,32 test files that were omitted in the previous check-in. I'm transferring r+ because they're a part of the previous patch. The files in this patch should be landed AFTER the patch for bug 22492 is landed.
Jungshik Shin
Comment 20 2008-12-02 09:53:46 PST
Comment on attachment 25662 [details] UTF-16,32 test files I'm marking this as obsolete. It turned out that 'svn-apply' can deal with binary files. I uploaded a new patch to bug 22492 that includes utf-16/32 tests.
Alexey Proskuryakov
Comment 21 2009-04-02 02:00:46 PDT
*** Bug 18277 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.