21635 – Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or encoding the query part of a URL

RESOLVED FIXED 21635

Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or encoding the query part of a URL

https://bugs.webkit.org/show_bug.cgi?id=21635

Summary Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or en...

Jungshik Shin

Reported 2008-10-15 22:43:22 PDT

HTML5 specifies that a form submitted from a UTF-16 page should use UTF-8. Firefox ( before WHATWG was formed) had used UTF-8 not just for UTF-16(LE|BE) but also for UTF-32(LE|BE) and UTF-7. The same can be said of the way the query part of the URL is encoded when it's in a UTF-7,16,32 page.

Attachments
test case in UTF-16LE (1.24 KB, text/html) 2008-10-15 22:49 PDT, Jungshik Shin	no flags	Details
patch without layout test (5.05 KB, patch) 2008-10-15 23:22 PDT, Jungshik Shin	no flags	Details Formatted Diff Diff
updated patch (still without layout tests) (5.05 KB, patch) 2008-10-16 00:01 PDT, Jungshik Shin	no flags	Details Formatted Diff Diff
UTF-7 test case (335 bytes, text/html) 2008-10-16 00:02 PDT, Jungshik Shin	no flags	Details
patch with layout tests (52.57 KB, patch) 2008-10-21 13:43 PDT, Jungshik Shin	darin: review-	Details Formatted Diff Diff
upated patch (53.82 KB, patch) 2008-11-05 10:29 PST, Jungshik Shin	darin: review+	Details Formatted Diff Diff
patch for check-in (53.27 KB, patch) 2008-11-24 18:19 PST, Jungshik Shin	jshin: review+	Details Formatted Diff Diff
patch for check-in with 'url' => 'URL' (53.27 KB, patch) 2008-11-25 10:59 PST, Jungshik Shin	no flags	Details Formatted Diff Diff
UTF-16,32 test files (2.31 KB, application/octet-stream) 2008-12-01 21:27 PST, Jungshik Shin	no flags	Details
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Jungshik Shin

Comment 1 2008-10-15 22:49:21 PDT

Created attachment 24384 [details] test case in UTF-16LE Two characters in the forms are U+2122 and U+5341 Try submitting 'GET' and also pressing the link below POST.

Jungshik Shin

Comment 2 2008-10-15 23:16:29 PDT

Change the summary to cover both form submission and the query part encoding.

Jungshik Shin

Comment 3 2008-10-15 23:22:15 PDT

Created attachment 24385 [details] patch without layout test Not yet made layout tests. And, it does not check for BOCU, SCSU, CESU-8.

Jungshik Shin

Comment 4 2008-10-16 00:01:58 PDT

Created attachment 24389 [details] updated patch (still without layout tests) In the previous patch, I forgot to replace closest8BitEquivalent with encodingForFormSubmission in HTMLFormElement.cpp BTW, this issue was reported for Chrome ( http://code.google.com/p/chromium/issues/detail?id=3433 )

Jungshik Shin

Comment 5 2008-10-16 00:02:41 PDT

Created attachment 24390 [details] UTF-7 test case

Alexey Proskuryakov

Comment 6 2008-10-16 00:27:10 PDT

What does IE do in this case?

Jungshik Shin

Comment 7 2008-10-16 10:54:04 PDT

Two test files uploaded include what IE does for a url with a query parameter. For UTF-16LE, it appears that it converts to the default encoding (en-US IE does For form submission from a UTF-16 page, IE does what Firefox does (what HTML5 spec recommends and what my patch implements). That is, it uses UTF-8. For form submission in UTF-7, I haven't yet tried it with IE.

Jungshik Shin

Comment 8 2008-10-21 13:43:23 PDT

Created attachment 24544 [details] patch with layout tests I added layout tests for UTF-{7,16LE,16BE,32LE,32BE} and url, post and get.

Darin Adler

Comment 9 2008-10-24 11:59:36 PDT

Comment on attachment 24544 [details] patch with layout tests Looks great! > + // Treat UTF-16, UTF-32, UTF-7 as UTF-8 > + init(base, relative, encoding.encodingForFormSubmission()); I think this comment is unclear. The comment really means "using the encodingForFormSubmission function will ..." and that's clear right now when we're making the change but someone coming later to look at this function won't necessarily understand that. Could you write a comment that says that a little more specifically? It seems unfortunate that the URL code is calling encodingForFormSubmission. In what way is URL encoding the same thing as form submission? Should there be a differently named synonym here instead? Or maybe the comment can explain that encoding of queries is essentially a "part of form submission"? > -static void addEncodingName(HashSet<const char*>& set, const char* name) > +namespace { > + > +void addEncodingName(HashSet<const char*>& set, const char* name) Throughout the WebKit project, we've been using static to get internal linkage instead of using anonymous namespaces, with very few exceptions. I understand that you can accomplish the same thing with an anonymous namespace, but is there a reason to start using that technique here? I'd prefer to make the change only if there's a good reason to do so. Using a mix of the two is probably less clear that consistently using one or the other. The anonymous namespace technique doesn't work as well with some debuggers, for example, so there needs to be an advantage to offset that disadvantage. > +bool TextEncoding::isNonByteBasedEncoding() const > +{ > + return *this == UTF16LittleEndianEncoding() || > + *this == UTF16BigEndianEncoding() || > + *this == UTF32BigEndianEncoding() || > + *this == UTF32LittleEndianEncoding(); > +} Our coding style is to put such operators at the beginnings of lines rather than the ends of lines. For this new function you've used the term "ByteBased", and for the existing function we used the term "8Bit". I'd like to see the functions use the same terminology. So we could rename the old one or the new one. > +// TODO(jungshik) : Do we have to care about BOCU, SCSU, and CESU-8? > +// Those encodings along with UTF-{7,8,16,32} come with 'ICU'. This is not the format we use to leave comments behind. We always use FIXME and we don't attach our names in that format. Also, I think this should be clearer. Try to write a comment someone can notice later. > +// HTML5 > +// specifies that UA should not support UTF-7, BOCU, SCSU and CESU-8. > +// To block them, closest8BitEquivalent (called in TextResourceLoader) > +// should be changed. I don't think this comment is in the right place. It's not about encodingForFormSubmission, is it? I don't understand what "blocking" these character sets would entail, but it seems that we wouldn't want to decode them at all. I'm going to say review- but these comments are really minor issues.

Jungshik Shin

Comment 10 2008-10-29 16:19:31 PDT

Thank you, Darin, for review. I'll address your comments and upload a new patch early next week.

Jungshik Shin

Comment 11 2008-11-05 10:29:14 PST

Created attachment 24910 [details] upated patch Darin, can you take another look? I used 'NonByteBased' for both functions because 'Non8Bit' can mean both '7bit encodings' like ISO-2022-JP and GB-HZ and non-byte-based encodings like UTF-16/32. Also, I just used 'file static' instead of anonymous namespace to follow the webkit convention. I updated comments and try to be as clear as possible (but I'm afraid I'm not so successful).

Jungshik Shin

Comment 12 2008-11-24 13:28:48 PST

Darin, would you take a look at my latest patch? Thank you !

Darin Adler

Comment 13 2008-11-24 13:36:55 PST

Comment on attachment 24910 [details] upated patch > + // For UTF-{7,16,32}, we want to use UTF-8 for the query part as > + // we do when submitting a form. A form with GET method > + // has its contents added to a url as query params and it makes sense > + // to be consistent whether we're dealing with a literal url with > + // the query part referred to in an HTML document in UTF-{7,16,32} > + // or a url formed for form submission. So, in both cases, > + // we use encodingForFormSubmission, which returns UTF-8 for > + // UTF-{7,16,32} while leaving other encodings alone. In the comment it should be URL rather than url. This comment is quite good, but probably a bit too wordy. I'd say: // For UTF-{7,16,32}, we want to use UTF-8 for the query part as // we do when submitting a form. A form with GET method // has its contents added to a URL as query params and it makes sense // to be consistent. And leave it at that. > +// HTML5 specifies that UTF-8 be used in form submission when a form is > +// is a part of a document in UTF-16 probably because UTF-16 is not a > +// byte-based encoding and can contain 0x00. By extension, the same > +// should be done for UTF-32. In case of UTF-7, it is a byte-based encoding, > +// but it's fraught with problems and we'd rather steer clear of it. Great comment! We normally use one space after a period, not two. > +// FIXME : Do we also have to return UTF-8 for BOCU, SCSU, and CESU-8? > +// Those algorithmic encodings are supported in ICU, but I have never > +// seen them being used for a web page. Instead of dealing with them > +// here for form submission, it's probably better not to decode them at > +// all. I don't think we need this comment. Lets just omit it. If you do want to keep the comment, please remove the space after "FIXME". If you did want this comment, I'd suggest putting it in the ICU source file. And yes, lets turn these off in TextCodecICU. > + * http/tests/misc/submit-get-in-utf16be-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf16be.html: Added. > + * http/tests/misc/submit-get-in-utf16le-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf16le.html: Added. > + * http/tests/misc/submit-get-in-utf32be-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf32be.html: Added. > + * http/tests/misc/submit-get-in-utf32le-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf32le.html: Added. > + * http/tests/misc/submit-get-in-utf7-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf7.html: Added. > + * http/tests/misc/submit-post-in-utf16be-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf16be.html: Added. > + * http/tests/misc/submit-post-in-utf16le-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf16le.html: Added. > + * http/tests/misc/submit-post-in-utf32be-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf32be.html: Added. > + * http/tests/misc/submit-post-in-utf32le-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf32le.html: Added. > + * http/tests/misc/submit-post-in-utf7-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf7.html: Added. > + * http/tests/misc/url-in-utf16be-expected.txt: Added. > + * http/tests/misc/url-in-utf16be.html: Added. > + * http/tests/misc/url-in-utf16le-expected.txt: Added. > + * http/tests/misc/url-in-utf16le.html: Added. > + * http/tests/misc/url-in-utf32be-expected.txt: Added. > + * http/tests/misc/url-in-utf32be.html: Added. > + * http/tests/misc/url-in-utf32le-expected.txt: Added. > + * http/tests/misc/url-in-utf32le.html: Added. > + * http/tests/misc/url-in-utf7-expected.txt: Added. > + * http/tests/misc/url-in-utf7.html: Added. These tests are great, but it's too bad we didn't find a way to do this testing with a single test driver file instead of having them all be separate tests. For example take a look at the techniques used in char-encoding.html used put lots of tests all into a single driver. r=me as-is; would be nice to make those slight comment improvements

Jungshik Shin

Comment 14 2008-11-24 18:19:56 PST

Created attachment 25465 [details] patch for check-in Thank you for the review. Transferring review+ to this patch. This patch ddressed Darin's comments on comments :-) This patch is ready for landing. I'll file another bug on making a driver test to test all the related encodings.

Jungshik Shin

Comment 15 2008-11-25 10:59:57 PST

Created attachment 25489 [details] patch for check-in with 'url' => 'URL' I forgot to capitalize url in KURL.cpp (in a comment). This patch does that.

Darin Fisher (:fishd, Google)

Comment 16 2008-11-25 11:31:59 PST

http://trac.webkit.org/changeset/38755

Darin Fisher (:fishd, Google)

Comment 17 2008-11-25 13:01:25 PST

The UTF7 tests have been disabled due to failures observed on some of the buildbots: http://trac.webkit.org/changeset/38759

Darin Fisher (:fishd, Google)

Comment 18 2008-11-25 13:04:19 PST

See https://bugs.webkit.org/show_bug.cgi?id=22492 for the tests that were disabled.

Jungshik Shin

Comment 19 2008-12-01 21:27:06 PST

Created attachment 25662 [details] UTF-16,32 test files This file contains utf-16,32 test files that were omitted in the previous check-in. I'm transferring r+ because they're a part of the previous patch. The files in this patch should be landed AFTER the patch for bug 22492 is landed.

Jungshik Shin

Comment 20 2008-12-02 09:53:46 PST

Comment on attachment 25662 [details] UTF-16,32 test files I'm marking this as obsolete. It turned out that 'svn-apply' can deal with binary files. I uploaded a new patch to bug 22492 that includes utf-16/32 tests.

Alexey Proskuryakov

Comment 21 2009-04-02 02:00:46 PDT

*** Bug 18277 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component Forms

Assignee

Jungshik Shin

Reported

2008-10-15 22:43 PDT

Modified

2009-04-02 02:00 PDT History

CC List

3 users Show

URL

Keywords

Duplicates (1)

18277 View as bug list

Depends on

Blocks