Bug 21635 - Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or encoding the query part of a URL
Summary: Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or en...
Alias: None
Product: WebKit
Classification: Unclassified
Component: Forms (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Jungshik Shin
: 18277 (view as bug list)
Depends on:
Reported: 2008-10-15 22:43 PDT by Jungshik Shin
Modified: 2009-04-02 02:00 PDT (History)
3 users (show)

See Also:

test case in UTF-16LE (1.24 KB, text/html)
2008-10-15 22:49 PDT, Jungshik Shin
no flags Details
patch without layout test (5.05 KB, patch)
2008-10-15 23:22 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
updated patch (still without layout tests) (5.05 KB, patch)
2008-10-16 00:01 PDT, Jungshik Shin
no flags Details | Formatted Diff | Diff
UTF-7 test case (335 bytes, text/html)
2008-10-16 00:02 PDT, Jungshik Shin
no flags Details
patch with layout tests (52.57 KB, patch)
2008-10-21 13:43 PDT, Jungshik Shin
darin: review-
Details | Formatted Diff | Diff
upated patch (53.82 KB, patch)
2008-11-05 10:29 PST, Jungshik Shin
darin: review+
Details | Formatted Diff | Diff
patch for check-in (53.27 KB, patch)
2008-11-24 18:19 PST, Jungshik Shin
jshin: review+
Details | Formatted Diff | Diff
patch for check-in with 'url' => 'URL' (53.27 KB, patch)
2008-11-25 10:59 PST, Jungshik Shin
no flags Details | Formatted Diff | Diff
UTF-16,32 test files (2.31 KB, application/octet-stream)
2008-12-01 21:27 PST, Jungshik Shin
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jungshik Shin 2008-10-15 22:43:22 PDT
HTML5 specifies that a form submitted from a UTF-16 page should use UTF-8.  
Firefox ( before WHATWG was formed) had used UTF-8 not just for UTF-16(LE|BE) but also for UTF-32(LE|BE) and UTF-7. 

The same can be said of the way the query part of the URL is encoded when it's in a UTF-7,16,32 page.
Comment 1 Jungshik Shin 2008-10-15 22:49:21 PDT
Created attachment 24384 [details]
test case in UTF-16LE

Two characters in the forms are U+2122 and U+5341
Try submitting 'GET' and also pressing the link below POST.
Comment 2 Jungshik Shin 2008-10-15 23:16:29 PDT
Change the summary to cover both form submission and the query part encoding. 
Comment 3 Jungshik Shin 2008-10-15 23:22:15 PDT
Created attachment 24385 [details]
patch without layout test

Not yet made layout tests. And, it does not check for BOCU, SCSU, CESU-8.
Comment 4 Jungshik Shin 2008-10-16 00:01:58 PDT
Created attachment 24389 [details]
updated patch (still without layout tests)

In the previous patch, I forgot to replace closest8BitEquivalent with encodingForFormSubmission in HTMLFormElement.cpp

BTW, this issue was reported for Chrome ( http://code.google.com/p/chromium/issues/detail?id=3433 )
Comment 5 Jungshik Shin 2008-10-16 00:02:41 PDT
Created attachment 24390 [details]
UTF-7 test case
Comment 6 Alexey Proskuryakov 2008-10-16 00:27:10 PDT
What does IE do in this case?
Comment 7 Jungshik Shin 2008-10-16 10:54:04 PDT
Two test files uploaded include what IE does for a url with a query parameter. 
For UTF-16LE, it appears that it converts to the default encoding (en-US IE does 

For form submission from a UTF-16 page, IE does what Firefox does (what HTML5 spec recommends and what my patch implements). That is, it uses UTF-8.

For form submission in UTF-7, I haven't yet tried it with IE. 

Comment 8 Jungshik Shin 2008-10-21 13:43:23 PDT
Created attachment 24544 [details]
patch with layout tests

I added layout tests for UTF-{7,16LE,16BE,32LE,32BE} and url, post and get.
Comment 9 Darin Adler 2008-10-24 11:59:36 PDT
Comment on attachment 24544 [details]
patch with layout tests

Looks great!

> +    // Treat UTF-16, UTF-32, UTF-7 as UTF-8
> +    init(base, relative, encoding.encodingForFormSubmission());

I think this comment is unclear. The comment really means "using the encodingForFormSubmission function will ..." and that's clear right now when we're making the change but someone coming later to look at this function won't necessarily understand that. Could you write a comment that says that a little more specifically?

It seems unfortunate that the URL code is calling encodingForFormSubmission. In what way is URL encoding the same thing as form submission? Should there be a differently named synonym here instead? Or maybe the comment can explain that encoding of queries is essentially a "part of form submission"?

> -static void addEncodingName(HashSet<const char*>& set, const char* name)
> +namespace {
> +
> +void addEncodingName(HashSet<const char*>& set, const char* name)

Throughout the WebKit project, we've been using static to get internal linkage instead of using anonymous namespaces, with very few exceptions. I understand that you can accomplish the same thing with an anonymous namespace, but is there a reason to start using that technique here? I'd prefer to make the change only if there's a good reason to do so. Using a mix of the two is probably less clear that consistently using one or the other.

The anonymous namespace technique doesn't work as well with some debuggers, for example, so there needs to be an advantage to offset that disadvantage.

> +bool TextEncoding::isNonByteBasedEncoding() const
> +{
> +    return *this == UTF16LittleEndianEncoding() ||
> +           *this == UTF16BigEndianEncoding() ||
> +           *this == UTF32BigEndianEncoding() ||
> +           *this == UTF32LittleEndianEncoding();
> +}

Our coding style is to put such operators at the beginnings of lines rather than the ends of lines.

For this new function you've used the term "ByteBased", and for the existing function we used the term "8Bit". I'd like to see the functions use the same terminology. So we could rename the old one or the new one.

> +// TODO(jungshik) : Do we have to care about BOCU, SCSU, and CESU-8?
> +// Those encodings along with UTF-{7,8,16,32} come with 'ICU'.

This is not the format we use to leave comments behind. We always use FIXME and we don't attach our names in that format. Also, I think this should be clearer. Try to write a comment someone can notice later.

> +// HTML5
> +// specifies that UA should not support UTF-7, BOCU, SCSU and CESU-8. 
> +// To block them, closest8BitEquivalent (called in TextResourceLoader)
> +// should be changed. 

I don't think this comment is in the right place. It's not about encodingForFormSubmission, is it? I don't understand what "blocking" these character sets would entail, but it seems that we wouldn't want to decode them at all.

I'm going to say review- but these comments are really minor issues.
Comment 10 Jungshik Shin 2008-10-29 16:19:31 PDT
Thank you, Darin, for review. I'll address your comments and upload a new patch early next week. 
Comment 11 Jungshik Shin 2008-11-05 10:29:14 PST
Created attachment 24910 [details]
upated patch

Darin, can you take another look?  
I used 'NonByteBased' for both functions because 'Non8Bit' can mean both '7bit encodings' like ISO-2022-JP and GB-HZ and non-byte-based encodings like UTF-16/32.  
Also, I just used 'file static' instead of anonymous namespace to follow the webkit convention. 
I updated comments and try to be as clear as possible (but I'm afraid I'm not so successful).
Comment 12 Jungshik Shin 2008-11-24 13:28:48 PST
Darin, would you take a look at my latest patch? Thank you !

Comment 13 Darin Adler 2008-11-24 13:36:55 PST
Comment on attachment 24910 [details]
upated patch

> +    // For UTF-{7,16,32}, we want to use UTF-8 for the query part as 
> +    // we do when submitting a form. A form with GET method
> +    // has its contents added to a url as query params and it makes sense
> +    // to be consistent whether we're dealing with a literal url with
> +    // the query part referred to in an HTML document in UTF-{7,16,32} 
> +    // or a url formed for form submission. So, in both cases, 
> +    // we use encodingForFormSubmission, which returns UTF-8 for 
> +    // UTF-{7,16,32} while leaving other encodings alone. 

In the comment it should be URL rather than url. This comment is quite good, but probably a bit too wordy. I'd say:

    // For UTF-{7,16,32}, we want to use UTF-8 for the query part as 
    // we do when submitting a form. A form with GET method
    // has its contents added to a URL as query params and it makes sense
    // to be consistent.

And leave it at that.

> +// HTML5 specifies that UTF-8 be used in form submission when a form is 
> +// is a part of a document in UTF-16 probably because UTF-16 is not a 
> +// byte-based encoding and can contain 0x00.  By extension, the same
> +// should be done for UTF-32. In case of UTF-7, it is a byte-based encoding,
> +// but it's fraught with problems and we'd rather steer clear of it. 

Great comment! We normally use one space after a period, not two.

> +// FIXME : Do we also have to return UTF-8 for BOCU, SCSU, and CESU-8?
> +// Those algorithmic encodings are supported in ICU, but I have never
> +// seen them being used for a web page. Instead of dealing with them 
> +// here for form submission, it's probably better not to decode them at
> +// all.

I don't think we need this comment. Lets just omit it. If you do want to keep the comment, please remove the space after "FIXME".

If you did want this comment, I'd suggest putting it in the ICU source file. And yes, lets turn these off in TextCodecICU.

> +        * http/tests/misc/submit-get-in-utf16be-expected.txt: Added.
> +        * http/tests/misc/submit-get-in-utf16be.html: Added.
> +        * http/tests/misc/submit-get-in-utf16le-expected.txt: Added.
> +        * http/tests/misc/submit-get-in-utf16le.html: Added.
> +        * http/tests/misc/submit-get-in-utf32be-expected.txt: Added.
> +        * http/tests/misc/submit-get-in-utf32be.html: Added.
> +        * http/tests/misc/submit-get-in-utf32le-expected.txt: Added.
> +        * http/tests/misc/submit-get-in-utf32le.html: Added.
> +        * http/tests/misc/submit-get-in-utf7-expected.txt: Added.
> +        * http/tests/misc/submit-get-in-utf7.html: Added.
> +        * http/tests/misc/submit-post-in-utf16be-expected.txt: Added.
> +        * http/tests/misc/submit-post-in-utf16be.html: Added.
> +        * http/tests/misc/submit-post-in-utf16le-expected.txt: Added.
> +        * http/tests/misc/submit-post-in-utf16le.html: Added.
> +        * http/tests/misc/submit-post-in-utf32be-expected.txt: Added.
> +        * http/tests/misc/submit-post-in-utf32be.html: Added.
> +        * http/tests/misc/submit-post-in-utf32le-expected.txt: Added.
> +        * http/tests/misc/submit-post-in-utf32le.html: Added.
> +        * http/tests/misc/submit-post-in-utf7-expected.txt: Added.
> +        * http/tests/misc/submit-post-in-utf7.html: Added.
> +        * http/tests/misc/url-in-utf16be-expected.txt: Added.
> +        * http/tests/misc/url-in-utf16be.html: Added.
> +        * http/tests/misc/url-in-utf16le-expected.txt: Added.
> +        * http/tests/misc/url-in-utf16le.html: Added.
> +        * http/tests/misc/url-in-utf32be-expected.txt: Added.
> +        * http/tests/misc/url-in-utf32be.html: Added.
> +        * http/tests/misc/url-in-utf32le-expected.txt: Added.
> +        * http/tests/misc/url-in-utf32le.html: Added.
> +        * http/tests/misc/url-in-utf7-expected.txt: Added.
> +        * http/tests/misc/url-in-utf7.html: Added.

These tests are great, but it's too bad we didn't find a way to do this testing with a single test driver file instead of having them all be separate tests. For example take a look at the techniques used in char-encoding.html used put lots of tests all into a single driver.

r=me as-is; would be nice to make those slight comment improvements
Comment 14 Jungshik Shin 2008-11-24 18:19:56 PST
Created attachment 25465 [details]
patch for check-in

Thank you for the review. Transferring review+ to this patch.

This patch ddressed Darin's comments on comments :-)
This patch is ready for landing. I'll file another bug on making a driver test to test all the related encodings.
Comment 15 Jungshik Shin 2008-11-25 10:59:57 PST
Created attachment 25489 [details]
patch for check-in with 'url' => 'URL'

I forgot to capitalize url in KURL.cpp (in a comment). This patch does that.
Comment 16 Darin Fisher (:fishd, Google) 2008-11-25 11:31:59 PST
Comment 17 Darin Fisher (:fishd, Google) 2008-11-25 13:01:25 PST
The UTF7 tests have been disabled due to failures observed on some of the buildbots:
Comment 18 Darin Fisher (:fishd, Google) 2008-11-25 13:04:19 PST
See https://bugs.webkit.org/show_bug.cgi?id=22492 for the tests that were disabled.
Comment 19 Jungshik Shin 2008-12-01 21:27:06 PST
Created attachment 25662 [details]
UTF-16,32 test files 

This file contains utf-16,32 test files that were omitted in the previous check-in. 
I'm transferring r+ because they're a part of the previous patch. 

The files in this patch should be landed AFTER the patch for bug 22492 is landed.
Comment 20 Jungshik Shin 2008-12-02 09:53:46 PST
Comment on attachment 25662 [details]
UTF-16,32 test files 

I'm marking this as obsolete. It turned out that 'svn-apply' can deal with binary files. I uploaded a new patch to bug 22492 that includes utf-16/32 tests.
Comment 21 Alexey Proskuryakov 2009-04-02 02:00:46 PDT
*** Bug 18277 has been marked as a duplicate of this bug. ***