VERIFIED FIXED 4821
Text in submitted forms should be entity-encoded if the current encoding doesn't support it
https://bugs.webkit.org/show_bug.cgi?id=4821
Summary Text in submitted forms should be entity-encoded if the current encoding does...
Alexey Proskuryakov
Reported 2005-09-03 02:06:48 PDT
This is potentially a reduction of bug 4575 (but its source is not available, so I can only guess). Steps to reproduce: 1. Open the attached test case. 2. Paste some non-Roman text in the text field (e.g., copy something from www.apple.jp or www.apple.ru). 3. Click POST Results: the URL contains question marks (%3F) instead of non-Roman characters Expected results: Such characters should be encoded as HTML entities, for example "form.html?q=%26% 231090%3B%26%231077%3B%26%231089%3B%26%231090%3B" for "тест" Discussion: Firefox does encode text this way. The test case explicitly specifies Latin-1 encoding, so it should work regardless of your browser settings.
Attachments
manual test case (253 bytes, text/html)
2005-09-03 02:07 PDT, Alexey Proskuryakov
no flags
proposed patch (4.42 KB, patch)
2005-09-18 07:37 PDT, Alexey Proskuryakov
no flags
proposed patch (1.98 KB, patch)
2005-09-18 07:39 PDT, Alexey Proskuryakov
darin: review-
Automated test (966 bytes, text/html)
2005-09-19 11:42 PDT, Alexey Proskuryakov
no flags
Alternative patch (4.24 KB, patch)
2005-09-28 09:33 PDT, Alexey Proskuryakov
mjs: review+
Alexey Proskuryakov
Comment 1 2005-09-03 02:07:27 PDT
Created attachment 3730 [details] manual test case
Mark Rowe (bdash)
Comment 2 2005-09-03 02:21:25 PDT
Confirmed with both WebKit 412.7 and ToT WebKit.
Alexey Proskuryakov
Comment 3 2005-09-05 21:39:12 PDT
Actually, this of course doesn't test POST, but rather GET... Renamed.
Alexey Proskuryakov
Comment 4 2005-09-18 07:37:18 PDT
Created attachment 3936 [details] proposed patch Change in QTextCodec::fromUnicode(). It is only used twice - from html_formimpl.cpp and from KWQKURL.mm. In both cases, it seems that this change shouldn't do any harm (and of course, it fixes this issue). Unfortunately, I have no idea about how to make an automated test for this.
Alexey Proskuryakov
Comment 5 2005-09-18 07:39:09 PDT
Created attachment 3937 [details] proposed patch An unrelated change sneaked into the previous patch...
Eric Seidel (no email)
Comment 6 2005-09-18 13:21:50 PDT
Comment on attachment 3937 [details] proposed patch Looks fine to me. The //FIXME: support surogate pairs should be turned into a bugzilla bug instead of a code comment. I'll leave this to darin or mjs for final approval however.
Alexey Proskuryakov
Comment 7 2005-09-18 13:32:14 PDT
(In reply to comment #6) > The //FIXME: support surogate pairs should be turned into a > bugzilla bug instead of a code comment. I took a shortcut because of other FIXMEs in the code suggesting that it may be completely rewritten at some point... If really needed, it's not too hard to implement this.
Maciej Stachowiak
Comment 8 2005-09-19 00:33:56 PDT
I think this manual test case can be turned into an automated one, with some trickery. You can either do the submission in a frame, and then examine its URL from the outside to see if the encoding worked, or you can submit to a data: URL which contains an embedded script that checks document.location (the latter is trickier to pull off). Also, is it right to just change this in QTextCodec::fromUnicode? That function is called in a couple of different places and it's not clear to me if they should all have this behavior. Please either limit this escaping behavior to the form submission case, or provide test cases to show it is right in other cases as well. In general, though this seems like a very good change. The code itself looks right but I want to make sure we have covered our bases on testing.
Alexey Proskuryakov
Comment 9 2005-09-19 11:42:48 PDT
Created attachment 3946 [details] Automated test
Alexey Proskuryakov
Comment 10 2005-09-19 12:41:37 PDT
(In reply to comment #8) A search reveals that fromUnicode is used in: 1) html_formimpl.cpp, FormDataList::appendString(). This is the place where this problem occurs. 2) html_formimpl.cpp, HTMLFormElementImpl::formData(). Here the names of attached files are converted. Having question marks in uploaded file names isn't nice at all, and a smart receiver will probably be even able to handle entities. 3) KWQKURL.mm, encodeRelativeString(), which is only used in KURL constructor. The only place where this constructor is used with a non-default codec is URLWithAttributeString from WebCoreBridge. This function is then exported from WebDOMOperations.h. Well, I do not have much evidence for (3), and do not know how to test its usage... But it is also a URL, and converting parts of URLs to question marks is highly unlikely to do any good IMHO.
Alexey Proskuryakov
Comment 11 2005-09-19 13:47:39 PDT
Comment on attachment 3937 [details] proposed patch I haven't done exactly what was requested, but thought I'll try again with this analysis...
Darin Adler
Comment 12 2005-09-19 16:02:04 PDT
Comment on attachment 3937 [details] proposed patch This looks OK, but I'd rather see this ported to ICU and not using CFString.
Alexey Proskuryakov
Comment 13 2005-09-19 23:13:16 PDT
(In reply to comment #12) Moving from a working API to an SPI doesn't make me very excited, so I'd prefer not to do that in a fix for this particular issue...
Darin Adler
Comment 14 2005-09-20 17:00:44 PDT
ICU is a cross-platform API. Calling it an SPI is very Apple-centric, and in my opinion, wrong.
Alexey Proskuryakov
Comment 15 2005-09-21 00:48:48 PDT
(In reply to comment #14) We 3rd party developers have been strongly discouraged from even looking in this direction by Deborah Goldsmith herself (who is a member of ICU Project Management Committee and Apple's Unicode liaison). She probably forgot more reasons for that than I will ever know, and I know quite a few :) Sorry for stating the obvious, but every time an Apple app uses an (SPI, API, whatever) that 3rd parties cannot use, it reduces the attention to real APIs. This case is particularly clear - every time CFString doesn't serve WebKit well, it is something that should be fixed in CFString. Hopefully, this explains my aversion to ICU migration.
Maciej Stachowiak
Comment 16 2005-09-21 02:03:25 PDT
We have discussed this with Deborah Goldsmith and she's happy with WebCore and JavaScriptCore using ICU. In fact, she OK'd us shipping the headers in JavaScriptCore/WebCore even though ICU is theoretically SPI at the moment. We are in general moving towards more and more direct use of ICU for our text encoding needs, as it allows the code to be more portable. Please consider ICU (I'll mark r- so you can think about it) but I think it's fine to land this change as-is, since the previous version already used CFString for this.
Alexey Proskuryakov
Comment 17 2005-09-23 11:06:11 PDT
(In reply to comment #16) Well... Ok :). But this isn't going to be straightforward, I'm afraid. What I do not quite understand is the hack that's used for a backslash when the encoding is kCFStringEncodingShiftJIS_X0213_00 or kCFStringEncodingEUC_JP. The Shift-JIS case is more or less clear (there are a lot of conflicting versions of this encoding, but at least the end result matches Shift- JIS from ICU). EUC_JP is weird: backslash is converted fine without this hack, but fails to get converted at all with it. There's also a similar unhandled issue with tilde: CFString converts it into 0x8160 = U+301C = WAVE DASH, while in ICU version it stays at 0x7e. Then, this page <http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx> says the backslash problem also exists for Korean, which doesn't get such special treatment in WebKit. So far, looks like ICU should be just used directly, and this hack should be removed... But the function backslashAsCurrencySymbol() is also present in RenderObject and in KWQKHTMLPart - it's used a lot, and I'm not sure for what reason.
Alexey Proskuryakov
Comment 18 2005-09-25 12:03:44 PDT
Comment on attachment 3937 [details] proposed patch I'll work on an ICU-based patch, but because of the aforementioned complications, it should probably be done as a separate step, not while fixing this specific issue.
Darin Adler
Comment 19 2005-09-25 19:38:45 PDT
Sure, we should do the ICU thing separately. The special backslash handling predates the use of ICU and is a complicated issue. The backslash character and the yen character share the same character code, and which of the two the character represents ends up depending on the context.
Darin Adler
Comment 20 2005-09-25 19:44:17 PDT
Comment on attachment 3937 [details] proposed patch Looks close, but should not be checked in as-is. This code is not correct for the use of fromUnicode in KURL. Question marks aren't correct either. I believe that we want to use %-escape sequences in the KURL code. As Maciej said in his earlier comment, we should not make this change to fromUnicode -- we should make it in a function used only by the form code.
Alexey Proskuryakov
Comment 21 2005-09-25 21:30:12 PDT
(In reply to comment #20) As I understand it, %-encoding won't work for KURL - these codes are supposed to be from the same encoding we are converting to, and the characters in question aren't representable in this encoding. So, we have to choose between entities, question marks and omitting the offending characters. OTOH, there is at least one case when entities are certainly a correct choice for KURL - it's when URLWithAttributeString is used to do something similar to form submission. Since we cannot know in advance what URLWithAttributeString is used for, this probably means that entity-encoding is the right choice for all uses of fromUnicode.
Alexey Proskuryakov
Comment 22 2005-09-28 09:33:55 PDT
Created attachment 4078 [details] Alternative patch This patch limits changes to forms code, preserving the old (question marks) behavior for KURL.
Maciej Stachowiak
Comment 23 2005-09-28 15:18:52 PDT
Comment on attachment 4078 [details] Alternative patch r=me
Arno Bosse
Comment 24 2005-10-10 11:33:40 PDT
Hooray, it works! The original case that prompted this (unicode entered into text boxes in Blackboard v6 - which uses ISO-Latin 1 throughout) now behaves correctly using the latest 10/10/05 build. Very many thanks to all concerned.
Alexey Proskuryakov
Comment 25 2005-10-10 12:46:53 PDT
*** Bug 4575 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.