Bug 4821 - Text in submitted forms should be entity-encoded if the current encoding doesn't support it
Summary: Text in submitted forms should be entity-encoded if the current encoding does...
Status: VERIFIED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Forms (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Alexey Proskuryakov
URL:
Keywords:
: 4575 (view as bug list)
Depends on:
Blocks: 5232
  Show dependency treegraph
 
Reported: 2005-09-03 02:06 PDT by Alexey Proskuryakov
Modified: 2005-10-10 12:46 PDT (History)
2 users (show)

See Also:


Attachments
manual test case (253 bytes, text/html)
2005-09-03 02:07 PDT, Alexey Proskuryakov
no flags Details
proposed patch (4.42 KB, patch)
2005-09-18 07:37 PDT, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
proposed patch (1.98 KB, patch)
2005-09-18 07:39 PDT, Alexey Proskuryakov
darin: review-
Details | Formatted Diff | Diff
Automated test (966 bytes, text/html)
2005-09-19 11:42 PDT, Alexey Proskuryakov
no flags Details
Alternative patch (4.24 KB, patch)
2005-09-28 09:33 PDT, Alexey Proskuryakov
mjs: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2005-09-03 02:06:48 PDT
This is potentially a reduction of bug 4575 (but its source is not available, so I can only guess).

Steps to reproduce:
1. Open the attached test case.
2. Paste some non-Roman text in the text field (e.g., copy something from www.apple.jp or 
www.apple.ru).
3. Click POST

Results: the URL contains question marks (%3F) instead of non-Roman characters
Expected results: Such characters should be encoded as HTML entities, for example "form.html?q=%26%
231090%3B%26%231077%3B%26%231089%3B%26%231090%3B" for "тест"

Discussion: Firefox does encode text this way.

The test case explicitly specifies Latin-1 encoding, so it should work regardless of your browser 
settings.
Comment 1 Alexey Proskuryakov 2005-09-03 02:07:27 PDT
Created attachment 3730 [details]
manual test case
Comment 2 Mark Rowe (bdash) 2005-09-03 02:21:25 PDT
Confirmed with both WebKit 412.7 and ToT WebKit.
Comment 3 Alexey Proskuryakov 2005-09-05 21:39:12 PDT
Actually, this of course doesn't test POST, but rather GET... Renamed.
Comment 4 Alexey Proskuryakov 2005-09-18 07:37:18 PDT
Created attachment 3936 [details]
proposed patch

Change in QTextCodec::fromUnicode(). It is only used twice - from
html_formimpl.cpp and from KWQKURL.mm. In both cases, it seems that this change
shouldn't do any harm (and of course, it fixes this issue).

Unfortunately, I have no idea about how to make an automated test for this.
Comment 5 Alexey Proskuryakov 2005-09-18 07:39:09 PDT
Created attachment 3937 [details]
proposed patch

An unrelated change sneaked into the previous patch...
Comment 6 Eric Seidel (no email) 2005-09-18 13:21:50 PDT
Comment on attachment 3937 [details]
proposed patch

Looks fine to me.  The //FIXME: support surogate pairs should be turned into a
bugzilla bug instead of a code comment.  I'll leave this to darin or mjs for
final approval however.
Comment 7 Alexey Proskuryakov 2005-09-18 13:32:14 PDT
(In reply to comment #6)
> The //FIXME: support surogate pairs should be turned into a
> bugzilla bug instead of a code comment.

I took a shortcut because of other FIXMEs in the code suggesting that it may be completely rewritten at 
some point... If really needed, it's not too hard to implement this.
Comment 8 Maciej Stachowiak 2005-09-19 00:33:56 PDT
I think this manual test case can be turned into an automated one, with some trickery. You can either 
do the submission in a frame, and then examine its URL from the outside to see if the encoding 
worked, or you can submit to a data: URL which contains an embedded script that checks 
document.location (the latter is trickier to pull off).

Also, is it right to just change this in QTextCodec::fromUnicode? That function is called in a couple of 
different places and it's not clear to me if they should all have this behavior. Please either limit this 
escaping behavior to the form submission case, or provide test cases to show it is right in other cases 
as well.

In general, though this seems like a very good change. The code itself looks right but I want to make 
sure we have covered our bases on testing.
Comment 9 Alexey Proskuryakov 2005-09-19 11:42:48 PDT
Created attachment 3946 [details]
Automated test
Comment 10 Alexey Proskuryakov 2005-09-19 12:41:37 PDT
(In reply to comment #8)
A search reveals that fromUnicode is used in:
1) html_formimpl.cpp, FormDataList::appendString(). This is the place where this problem occurs.
2) html_formimpl.cpp, HTMLFormElementImpl::formData(). Here the names of attached files are 
converted. Having question marks in uploaded file names isn't nice at all, and a smart receiver will 
probably be even able to handle entities.
3) KWQKURL.mm, encodeRelativeString(), which is only used in KURL constructor. The only place where 
this constructor is used with a non-default codec is URLWithAttributeString from WebCoreBridge. This 
function is then exported from WebDOMOperations.h.

Well, I do not have much evidence for (3), and do not know how to test its usage... But it is also a URL, 
and converting parts of URLs to question marks is highly unlikely to do any good IMHO.
Comment 11 Alexey Proskuryakov 2005-09-19 13:47:39 PDT
Comment on attachment 3937 [details]
proposed patch

I haven't done exactly what was requested, but thought I'll try again with this
analysis...
Comment 12 Darin Adler 2005-09-19 16:02:04 PDT
Comment on attachment 3937 [details]
proposed patch

This looks OK, but I'd rather see this ported to ICU and not using CFString.
Comment 13 Alexey Proskuryakov 2005-09-19 23:13:16 PDT
(In reply to comment #12)
Moving from a working API to an SPI doesn't make me very excited, so I'd prefer not to do that in a fix for 
this particular issue...
Comment 14 Darin Adler 2005-09-20 17:00:44 PDT
ICU is a cross-platform API. Calling it an SPI is very Apple-centric, and in my opinion, wrong.
Comment 15 Alexey Proskuryakov 2005-09-21 00:48:48 PDT
(In reply to comment #14)

We 3rd party developers have been strongly discouraged from even looking in this direction by Deborah 
Goldsmith herself (who is a member of ICU Project Management Committee and Apple's Unicode 
liaison). She probably forgot more reasons for that than I will ever know, and I know quite a few :)

Sorry for stating the obvious, but every time an Apple app uses an (SPI, API, whatever) that 3rd parties 
cannot use, it reduces the attention to real APIs. This case is particularly clear - every time CFString 
doesn't serve WebKit well, it is something that should be fixed in CFString.

Hopefully, this explains my aversion to ICU migration.
Comment 16 Maciej Stachowiak 2005-09-21 02:03:25 PDT
We have discussed this with Deborah Goldsmith and she's happy with WebCore and JavaScriptCore using 
ICU. In fact, she OK'd us shipping the headers in JavaScriptCore/WebCore even though ICU is theoretically 
SPI at the moment. We are in general moving towards more and more direct use of ICU for our text 
encoding needs, as it allows the code to be more portable.

Please consider ICU (I'll mark r- so you can think about it) but I think it's fine to land this change as-is, 
since the previous version already used CFString for this.
Comment 17 Alexey Proskuryakov 2005-09-23 11:06:11 PDT
(In reply to comment #16)

Well... Ok :). But this isn't going to be straightforward, I'm afraid.

What I do not quite understand is the hack that's used for a backslash when the encoding is 
kCFStringEncodingShiftJIS_X0213_00 or kCFStringEncodingEUC_JP. The Shift-JIS case is more or less 
clear (there are a lot of conflicting versions of this encoding, but at least the end result matches Shift-
JIS from ICU). EUC_JP is weird: backslash is converted fine without this hack, but fails to get converted 
at all with it.

There's also a similar unhandled issue with tilde: CFString converts it into 0x8160 = U+301C = WAVE 
DASH, while in ICU version it stays at 0x7e.

Then, this page <http://blogs.msdn.com/michkap/archive/2005/09/17/469941.aspx> says the 
backslash problem also exists for Korean, which doesn't get such special treatment in WebKit.

So far, looks like ICU should be just used directly, and this hack should be removed... But the function 
backslashAsCurrencySymbol() is also present in RenderObject and in KWQKHTMLPart - it's used a lot, 
and I'm not sure for what reason.
Comment 18 Alexey Proskuryakov 2005-09-25 12:03:44 PDT
Comment on attachment 3937 [details]
proposed patch

I'll work on an ICU-based patch, but because of the aforementioned
complications, it should probably be done as a separate step, not while fixing
this specific issue.
Comment 19 Darin Adler 2005-09-25 19:38:45 PDT
Sure, we should do the ICU thing separately.

The special backslash handling predates the use of ICU and is a complicated issue. The backslash 
character and the yen character share the same character code, and which of the two the character 
represents ends up depending on the context.
Comment 20 Darin Adler 2005-09-25 19:44:17 PDT
Comment on attachment 3937 [details]
proposed patch

Looks close, but should not be checked in as-is.

This code is not correct for the use of fromUnicode in KURL. Question marks
aren't correct either. I believe that we want to use %-escape sequences in the
KURL code. As Maciej said in his earlier comment, we should not make this
change to fromUnicode -- we should make it in a function used only by the form
code.
Comment 21 Alexey Proskuryakov 2005-09-25 21:30:12 PDT
(In reply to comment #20)
As I understand it, %-encoding won't work for KURL - these codes are supposed to be from the same 
encoding we are converting to, and the characters in question aren't representable in this encoding. So, we 
have to choose between entities, question marks and omitting the offending characters.

OTOH, there is at least one case when entities are certainly a correct choice for KURL - it's when 
URLWithAttributeString is used to do something similar to form submission. Since we cannot know in 
advance what URLWithAttributeString is used for, this probably means that entity-encoding is the right 
choice for all uses of fromUnicode.
Comment 22 Alexey Proskuryakov 2005-09-28 09:33:55 PDT
Created attachment 4078 [details]
Alternative patch

This patch limits changes to forms code, preserving the old (question marks)
behavior for KURL.
Comment 23 Maciej Stachowiak 2005-09-28 15:18:52 PDT
Comment on attachment 4078 [details]
Alternative patch

r=me
Comment 24 Arno Bosse 2005-10-10 11:33:40 PDT
Hooray, it works! The original case that prompted this (unicode entered into text boxes in Blackboard v6 - 
which uses ISO-Latin 1 throughout) now behaves correctly using the latest 10/10/05 build. Very many 
thanks to all concerned.
Comment 25 Alexey Proskuryakov 2005-10-10 12:46:53 PDT
*** Bug 4575 has been marked as a duplicate of this bug. ***