WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
21635
Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or encoding the query part of a URL
https://bugs.webkit.org/show_bug.cgi?id=21635
Summary
Use UTF-8 for Unicode encodings other than UTF-8 when submitting a form or en...
Jungshik Shin
Reported
2008-10-15 22:43:22 PDT
HTML5 specifies that a form submitted from a UTF-16 page should use UTF-8. Firefox ( before WHATWG was formed) had used UTF-8 not just for UTF-16(LE|BE) but also for UTF-32(LE|BE) and UTF-7. The same can be said of the way the query part of the URL is encoded when it's in a UTF-7,16,32 page.
Attachments
test case in UTF-16LE
(1.24 KB, text/html)
2008-10-15 22:49 PDT
,
Jungshik Shin
no flags
Details
patch without layout test
(5.05 KB, patch)
2008-10-15 23:22 PDT
,
Jungshik Shin
no flags
Details
Formatted Diff
Diff
updated patch (still without layout tests)
(5.05 KB, patch)
2008-10-16 00:01 PDT
,
Jungshik Shin
no flags
Details
Formatted Diff
Diff
UTF-7 test case
(335 bytes, text/html)
2008-10-16 00:02 PDT
,
Jungshik Shin
no flags
Details
patch with layout tests
(52.57 KB, patch)
2008-10-21 13:43 PDT
,
Jungshik Shin
darin
: review-
Details
Formatted Diff
Diff
upated patch
(53.82 KB, patch)
2008-11-05 10:29 PST
,
Jungshik Shin
darin
: review+
Details
Formatted Diff
Diff
patch for check-in
(53.27 KB, patch)
2008-11-24 18:19 PST
,
Jungshik Shin
jshin
: review+
Details
Formatted Diff
Diff
patch for check-in with 'url' => 'URL'
(53.27 KB, patch)
2008-11-25 10:59 PST
,
Jungshik Shin
no flags
Details
Formatted Diff
Diff
UTF-16,32 test files
(2.31 KB, application/octet-stream)
2008-12-01 21:27 PST
,
Jungshik Shin
no flags
Details
Show Obsolete
(6)
View All
Add attachment
proposed patch, testcase, etc.
Jungshik Shin
Comment 1
2008-10-15 22:49:21 PDT
Created
attachment 24384
[details]
test case in UTF-16LE Two characters in the forms are U+2122 and U+5341 Try submitting 'GET' and also pressing the link below POST.
Jungshik Shin
Comment 2
2008-10-15 23:16:29 PDT
Change the summary to cover both form submission and the query part encoding.
Jungshik Shin
Comment 3
2008-10-15 23:22:15 PDT
Created
attachment 24385
[details]
patch without layout test Not yet made layout tests. And, it does not check for BOCU, SCSU, CESU-8.
Jungshik Shin
Comment 4
2008-10-16 00:01:58 PDT
Created
attachment 24389
[details]
updated patch (still without layout tests) In the previous patch, I forgot to replace closest8BitEquivalent with encodingForFormSubmission in HTMLFormElement.cpp BTW, this issue was reported for Chrome (
http://code.google.com/p/chromium/issues/detail?id=3433
)
Jungshik Shin
Comment 5
2008-10-16 00:02:41 PDT
Created
attachment 24390
[details]
UTF-7 test case
Alexey Proskuryakov
Comment 6
2008-10-16 00:27:10 PDT
What does IE do in this case?
Jungshik Shin
Comment 7
2008-10-16 10:54:04 PDT
Two test files uploaded include what IE does for a url with a query parameter. For UTF-16LE, it appears that it converts to the default encoding (en-US IE does For form submission from a UTF-16 page, IE does what Firefox does (what HTML5 spec recommends and what my patch implements). That is, it uses UTF-8. For form submission in UTF-7, I haven't yet tried it with IE.
Jungshik Shin
Comment 8
2008-10-21 13:43:23 PDT
Created
attachment 24544
[details]
patch with layout tests I added layout tests for UTF-{7,16LE,16BE,32LE,32BE} and url, post and get.
Darin Adler
Comment 9
2008-10-24 11:59:36 PDT
Comment on
attachment 24544
[details]
patch with layout tests Looks great!
> + // Treat UTF-16, UTF-32, UTF-7 as UTF-8 > + init(base, relative, encoding.encodingForFormSubmission());
I think this comment is unclear. The comment really means "using the encodingForFormSubmission function will ..." and that's clear right now when we're making the change but someone coming later to look at this function won't necessarily understand that. Could you write a comment that says that a little more specifically? It seems unfortunate that the URL code is calling encodingForFormSubmission. In what way is URL encoding the same thing as form submission? Should there be a differently named synonym here instead? Or maybe the comment can explain that encoding of queries is essentially a "part of form submission"?
> -static void addEncodingName(HashSet<const char*>& set, const char* name) > +namespace { > + > +void addEncodingName(HashSet<const char*>& set, const char* name)
Throughout the WebKit project, we've been using static to get internal linkage instead of using anonymous namespaces, with very few exceptions. I understand that you can accomplish the same thing with an anonymous namespace, but is there a reason to start using that technique here? I'd prefer to make the change only if there's a good reason to do so. Using a mix of the two is probably less clear that consistently using one or the other. The anonymous namespace technique doesn't work as well with some debuggers, for example, so there needs to be an advantage to offset that disadvantage.
> +bool TextEncoding::isNonByteBasedEncoding() const > +{ > + return *this == UTF16LittleEndianEncoding() || > + *this == UTF16BigEndianEncoding() || > + *this == UTF32BigEndianEncoding() || > + *this == UTF32LittleEndianEncoding(); > +}
Our coding style is to put such operators at the beginnings of lines rather than the ends of lines. For this new function you've used the term "ByteBased", and for the existing function we used the term "8Bit". I'd like to see the functions use the same terminology. So we could rename the old one or the new one.
> +// TODO(jungshik) : Do we have to care about BOCU, SCSU, and CESU-8? > +// Those encodings along with UTF-{7,8,16,32} come with 'ICU'.
This is not the format we use to leave comments behind. We always use FIXME and we don't attach our names in that format. Also, I think this should be clearer. Try to write a comment someone can notice later.
> +// HTML5 > +// specifies that UA should not support UTF-7, BOCU, SCSU and CESU-8. > +// To block them, closest8BitEquivalent (called in TextResourceLoader) > +// should be changed.
I don't think this comment is in the right place. It's not about encodingForFormSubmission, is it? I don't understand what "blocking" these character sets would entail, but it seems that we wouldn't want to decode them at all. I'm going to say review- but these comments are really minor issues.
Jungshik Shin
Comment 10
2008-10-29 16:19:31 PDT
Thank you, Darin, for review. I'll address your comments and upload a new patch early next week.
Jungshik Shin
Comment 11
2008-11-05 10:29:14 PST
Created
attachment 24910
[details]
upated patch Darin, can you take another look? I used 'NonByteBased' for both functions because 'Non8Bit' can mean both '7bit encodings' like ISO-2022-JP and GB-HZ and non-byte-based encodings like UTF-16/32. Also, I just used 'file static' instead of anonymous namespace to follow the webkit convention. I updated comments and try to be as clear as possible (but I'm afraid I'm not so successful).
Jungshik Shin
Comment 12
2008-11-24 13:28:48 PST
Darin, would you take a look at my latest patch? Thank you !
Darin Adler
Comment 13
2008-11-24 13:36:55 PST
Comment on
attachment 24910
[details]
upated patch
> + // For UTF-{7,16,32}, we want to use UTF-8 for the query part as > + // we do when submitting a form. A form with GET method > + // has its contents added to a url as query params and it makes sense > + // to be consistent whether we're dealing with a literal url with > + // the query part referred to in an HTML document in UTF-{7,16,32} > + // or a url formed for form submission. So, in both cases, > + // we use encodingForFormSubmission, which returns UTF-8 for > + // UTF-{7,16,32} while leaving other encodings alone.
In the comment it should be URL rather than url. This comment is quite good, but probably a bit too wordy. I'd say: // For UTF-{7,16,32}, we want to use UTF-8 for the query part as // we do when submitting a form. A form with GET method // has its contents added to a URL as query params and it makes sense // to be consistent. And leave it at that.
> +// HTML5 specifies that UTF-8 be used in form submission when a form is > +// is a part of a document in UTF-16 probably because UTF-16 is not a > +// byte-based encoding and can contain 0x00. By extension, the same > +// should be done for UTF-32. In case of UTF-7, it is a byte-based encoding, > +// but it's fraught with problems and we'd rather steer clear of it.
Great comment! We normally use one space after a period, not two.
> +// FIXME : Do we also have to return UTF-8 for BOCU, SCSU, and CESU-8? > +// Those algorithmic encodings are supported in ICU, but I have never > +// seen them being used for a web page. Instead of dealing with them > +// here for form submission, it's probably better not to decode them at > +// all.
I don't think we need this comment. Lets just omit it. If you do want to keep the comment, please remove the space after "FIXME". If you did want this comment, I'd suggest putting it in the ICU source file. And yes, lets turn these off in TextCodecICU.
> + * http/tests/misc/submit-get-in-utf16be-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf16be.html: Added. > + * http/tests/misc/submit-get-in-utf16le-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf16le.html: Added. > + * http/tests/misc/submit-get-in-utf32be-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf32be.html: Added. > + * http/tests/misc/submit-get-in-utf32le-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf32le.html: Added. > + * http/tests/misc/submit-get-in-utf7-expected.txt: Added. > + * http/tests/misc/submit-get-in-utf7.html: Added. > + * http/tests/misc/submit-post-in-utf16be-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf16be.html: Added. > + * http/tests/misc/submit-post-in-utf16le-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf16le.html: Added. > + * http/tests/misc/submit-post-in-utf32be-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf32be.html: Added. > + * http/tests/misc/submit-post-in-utf32le-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf32le.html: Added. > + * http/tests/misc/submit-post-in-utf7-expected.txt: Added. > + * http/tests/misc/submit-post-in-utf7.html: Added. > + * http/tests/misc/url-in-utf16be-expected.txt: Added. > + * http/tests/misc/url-in-utf16be.html: Added. > + * http/tests/misc/url-in-utf16le-expected.txt: Added. > + * http/tests/misc/url-in-utf16le.html: Added. > + * http/tests/misc/url-in-utf32be-expected.txt: Added. > + * http/tests/misc/url-in-utf32be.html: Added. > + * http/tests/misc/url-in-utf32le-expected.txt: Added. > + * http/tests/misc/url-in-utf32le.html: Added. > + * http/tests/misc/url-in-utf7-expected.txt: Added. > + * http/tests/misc/url-in-utf7.html: Added.
These tests are great, but it's too bad we didn't find a way to do this testing with a single test driver file instead of having them all be separate tests. For example take a look at the techniques used in char-encoding.html used put lots of tests all into a single driver. r=me as-is; would be nice to make those slight comment improvements
Jungshik Shin
Comment 14
2008-11-24 18:19:56 PST
Created
attachment 25465
[details]
patch for check-in Thank you for the review. Transferring review+ to this patch. This patch ddressed Darin's comments on comments :-) This patch is ready for landing. I'll file another bug on making a driver test to test all the related encodings.
Jungshik Shin
Comment 15
2008-11-25 10:59:57 PST
Created
attachment 25489
[details]
patch for check-in with 'url' => 'URL' I forgot to capitalize url in KURL.cpp (in a comment). This patch does that.
Darin Fisher (:fishd, Google)
Comment 16
2008-11-25 11:31:59 PST
http://trac.webkit.org/changeset/38755
Darin Fisher (:fishd, Google)
Comment 17
2008-11-25 13:01:25 PST
The UTF7 tests have been disabled due to failures observed on some of the buildbots:
http://trac.webkit.org/changeset/38759
Darin Fisher (:fishd, Google)
Comment 18
2008-11-25 13:04:19 PST
See
https://bugs.webkit.org/show_bug.cgi?id=22492
for the tests that were disabled.
Jungshik Shin
Comment 19
2008-12-01 21:27:06 PST
Created
attachment 25662
[details]
UTF-16,32 test files This file contains utf-16,32 test files that were omitted in the previous check-in. I'm transferring r+ because they're a part of the previous patch. The files in this patch should be landed AFTER the patch for
bug 22492
is landed.
Jungshik Shin
Comment 20
2008-12-02 09:53:46 PST
Comment on
attachment 25662
[details]
UTF-16,32 test files I'm marking this as obsolete. It turned out that 'svn-apply' can deal with binary files. I uploaded a new patch to
bug 22492
that includes utf-16/32 tests.
Alexey Proskuryakov
Comment 21
2009-04-02 02:00:46 PDT
***
Bug 18277
has been marked as a duplicate of this bug. ***
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug