Bug 15119

Summary: Non-UTF-8 query parameters not handled properly when not generated by the form code
Product: WebKit Reporter: Brett Wilson (Google) <brettw>
Component: WebCore Misc.Assignee: Brett Wilson (Google) <brettw>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, gavin.sharp, ian, julian.reschke, mrowe, webkit
Priority: P2 Keywords: InRadar
Version: 523.x (Safari 3)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Test case
none
Patch with layout test
darin: review+
Revised patch
darin: review-
Revised patch x2 darin: review+

Description Brett Wilson (Google) 2007-08-30 13:15:58 PDT
If I have a link on a page with query characters not ASCII, KURL will convert them to the encoding of the page. If this link contains characters not present in the page character set (it could have been generated by script) that character will be replaced with a "?" in the final URL.

Instead, it should do like the form encoder, which causes ICU to generate entity names. In addition, the non-numeric characters ("&#x") in these entity names should be escaped, but these characters should NOT be escaped when they were not generated by ICU.
Comment 1 Brett Wilson (Google) 2007-08-30 13:37:10 PDT
Created attachment 16166 [details]
Test case

Here is an example that illustrates this problem. The page is in Big5, and script generates a URL that is not encodable in Big5.

Firefox, IE, and Safari all generate a query in the encoding of the page when possible. (In this case, you will see gibberish in the Google result page since Google expects UTF-8.)

When the character is not representable, Firefox falls back on UTF-8, which amusingly gives the "correct" answer for Google. IE and Safari both substitute a literal question mark for the invalid character.

The only right thing to do in this case would be to use the same logic as the form code does when you paste the character into a form field on such a page and submit it.
Comment 2 Alexey Proskuryakov 2007-08-30 22:39:15 PDT
> Firefox, IE, and Safari all generate a query in the encoding of the page when
> possible.

Firefox has URL encoding behavior configurable via its "network.standard-url.encode-utf8" about:config option. Searching b.m.o comments for this option name gives a good list of issues to consider when fixing this bug. Also see bug 7461.

Obviously, a question mark is not a good replacement for an un-encodable character.
Comment 3 Robert Blaut 2008-01-29 22:42:34 PST
I'm not sure but searching from test case shows me proper results so this bug is probably fixed.
Comment 4 Alexey Proskuryakov 2008-01-30 00:06:20 PST
I do not think anything has changed here: clicking on the first link still results is a search for "?".
Comment 5 Robert Blaut 2008-01-30 00:11:13 PST
So the test case is confusing :( It should contain clear message how to perform test, expected result and current result.
 
Comment 6 Alexey Proskuryakov 2008-02-01 14:28:12 PST
<rdar://problem/4306856>
Comment 7 Brett Wilson (Google) 2008-02-28 23:10:09 PST
Created attachment 19443 [details]
Patch with layout test

This adds an additional optional argument to the text encoding functions. This is required since the query may contain "?" and ";" that should not be escaped, yet the ones in the added entities should.

The form code does not need this flag. It builds the string up using unescaped entities. When doing a post they shouldn't be escaped. When doing a get with the parameters in the URL, all "&", ";" and "#" will be percent-escaped (since they shouldn't interfere with the regular parsing of the URL).
Comment 8 Darin Adler 2008-02-29 10:02:01 PST
Comment on attachment 19443 [details]
Patch with layout test

Looks good.

It's too bad that there's so much repeated code in the various codecs. Is there some way to share more of the code? More helper functions in the base class, perhaps?

+// Invalid character handler when wriuting escaped entities for unrepresentable

Typo here: "wriuting".

+        char number[32];
+        sprintf(number, "%%26%%23%d%%3B", codePoint);

Security screening tools are going to complain about our use of sprintf. Lets use snprintf as we do in source files like String.cpp using the <wtf/StringExtras.h> header. I know the existing code was using sprintf, but lets not.

+        UChar outChar;
+        if ((outChar = getGbkEscape(codePoint))) {

If you're going to initialize the variable in the if statement, then please define it inside the if statement too. Then you won't need the double parentheses.

+            return;
+        } else {

Please don't use else after return.

+        // allowEntities will cause characters not encodable in the output
+        // character encoding to be escaped as in XML/HTML entities.
+        // For example, U+06DE will be "&#1758;" (in octal). When unset, some
+        // replacement character will be unsed for unencodable characters.
+        //
+        // When escapeEntities is set, non-alphanumeric characters in the
+        // entity will be escaped using URL rules. In the above example, of
+        // "&#1758;" it will generate "%26%231758#3B". This flag has no effect
+        // when allowEntities is false.
+        CString encode(const UChar*, size_t length, bool allowEntities = false, bool escapeEntities = false) const;

I think that two booleans is an ugly way to pass the entity policy. How about an enum instead? Or maybe separate virtual  encode functions rather than a single function with a flag. Do we really need all three policies? 

Sorry to be so nitpicky, but I find the spacing in comments like this hard to read. If there are spaces between the paragraphs, but not between  the last paragraph and the function definition it makes it hard to read.

I'm going to say review+, but I'd prefer to see a new patch with the issues I mention above fixed.
Comment 9 Brett Wilson (Google) 2008-02-29 20:02:01 PST
Created attachment 19468 [details]
Revised patch

> It's too bad that there's so much repeated code in the various codecs. Is there
> some way to share more of the code? More helper functions in the base class,
> perhaps?

I added a static function in TextCodec to do character replacement. This consolidates all the printfs in one place. I would have made it protected, but it is used by several file-static functions. It makes the code a little shorter and reduces duplication.

> I think that two booleans is an ugly way to pass the entity policy. How about
> an enum instead? Or maybe separate virtual  encode functions rather than a
> single function with a flag. Do we really need all three policies?

You're right, I was trying to minimize changes to calling code, but that's probably a silly reason and it turns out this flag was only used in a couple of places. I made this an enum in TextCodec which makes it more clear. It also makes the function more self-documenting, and I removed the comment where you were complaining about spacing.

I fixed everything else.
Comment 10 Darin Adler 2008-03-02 18:43:06 PST
Comment on attachment 19468 [details]
Revised patch

Looking good!

As documented in http://webkit.org/coding/coding-style.html, we don't use all capitals for enums. It should be Substitute, Entity and EscapedEntity rather than SUBSTIUTE, ENTITY and ESCAPED_ENTITY. I also would suggest making UnencodableHandling a top level member of WebCore and naming the enum values QuestionMarksForUnencodables, EntitiesForUnencodables, and URLEncodedEntitiesForUnencodables.

(The all capital style for macros was invented to keep the macros from colliding with other uses of the same identifier and it's not needed for things that aren't macros.)

I don't think the encode function needs a default value for UnencodableHandling. I'd prefer to see it capitalized at every call site.

I'm slightly uncomfortable with the non-thread-safe implementation of unencodeableCharReplacement. How about having the function take a reference to an array of the right size instead? Then the caller can supply the buffer. I'd also suggest using int& rather than int* for the length out pointer.

 284                                      UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err) {

We put braces on separate lines not on the same line as the declaration.

 295 // Substutites special GBK characters, escaping all other unassigned entities.

Typo here in the word "Substutites".

 309 // Combines both gbkUrlEscapedEntityCallack with GBK character substitution.

Strange wording here: Both/with.

 79             int replLen;
 80             const char* repl = TextCodec::unencodableCharReplacement(c, unencodable, &replLen);

I prefer actual words rather than fragments like "repl" for variable names.

I'm going to say review- this time because I'd like at least some of these style issues to be fixed.
Comment 11 Brett Wilson (Google) 2008-03-03 22:27:07 PST
Created attachment 19514 [details]
Revised patch x2

> As documented in http://webkit.org/coding/coding-style.html, we don't use all
> capitals for enums. It should be Substitute, Entity and EscapedEntity rather
> than SUBSTIUTE, ENTITY and ESCAPED_ENTITY. I also would suggest making
> UnencodableHandling a top level member of WebCore and naming the enum values
> QuestionMarksForUnencodables, EntitiesForUnencodables, and
> URLEncodedEntitiesForUnencodables.

Ugh, I've got way too many coding styles in my head nowadays.


> I don't think the encode function needs a default value for
> UnencodableHandling. I'd prefer to see it capitalized at every call site.

I assume you mean specified at every call site? I was initially very resistant to this because I didn't want to change 1000 files, but there are only a couple places it's used. I agree, don't think the default value is very clear here.


> I'm slightly uncomfortable with the non-thread-safe implementation of
> unencodeableCharReplacement. How about having the function take a reference to
> an array of the right size instead? Then the caller can supply the buffer. I'd
> also suggest using int& rather than int* for the length out pointer.

I used a buffer and returned the length (since I don't use the return value for anything) which I think is more clear than an out param.

I changed XMLHTTPRequest code to specify entities rather than substitution. It looks like it's currently hard-coded for UTF-8, so this won't be an issue, but I'm pretty sure when you correctly support the document encoding, entities will be correct here.
Comment 12 Darin Adler 2008-03-07 12:28:56 PST
Comment on attachment 19514 [details]
Revised patch x2

Looks great.

 59     switch (unencodable) {
 60     case QuestionMarksForUnencodables:
 61         replacement[0] = '?';
 62         replacement[1] = 0;
 63         return 1;
 64     case EntitiesForUnencodables:
 65         snprintf(replacement, 32, "&#%u;", codePoint);
 66         break;
 67     case URLEncodedEntitiesForUnencodables:
 68         snprintf(replacement, 32, "%%26%%23%u%%3B", codePoint);
 69         break;
 70     default:
 71         ASSERT_NOT_REACHED();
 72         replacement[0] = 0;
 73     }
 74     return static_cast<int>(strlen(replacement));

I normally like to leave out the default: case so that GCC will warn if we ever add a new value for the enum. That requires a little bit of reorganization, though. Not very important.

I noticed that at some other call sites, you did not include a default: ASSERT_NOT_REACHED. It would be better to be consistent.

 44         // Substitues the replacement character "?".

Typo "Substitutes" is spelled wrong.

It's a little annoying that the value "32" is repeated at all the various sites. Maybe it should have been named. Maybe a typedef for char[32] would have been the best way to not repeat the value in multiple places; could use sizeof(type) in the few places where the number is needed (passing in to snprintf).

r=me as is, but there is still a little room for improvement.
Comment 13 Mark Rowe (bdash) 2008-03-12 10:31:49 PDT
Marvin, can you please make the improvements Darin has suggested and upload a new revision of the patch to be landed?
Comment 14 Darin Adler 2008-03-16 20:47:23 PDT
I decided to do the requested tweaks myself.

Committed revision 31089.
Comment 15 Brett Wilson (Google) 2008-04-10 08:18:52 PDT
Thanks Darin! Sorry I took so long. I was on vacation and then so busy as a result of begin on vacation I didn't get to this.
Comment 16 Julian Reschke 2008-06-29 05:12:57 PDT
Exactly how is this approach better than either sticking with what Webkit did before (mirroring IE/Opera), *or* doing what Firefox does (which at least has a minimal chance of being understood by the server)?
Comment 17 Ian 'Hixie' Hickson 2008-06-29 14:23:05 PDT
(HTML5 makes this non-conforming at the moment.)
Comment 18 Brett Wilson (Google) 2008-07-30 17:46:50 PDT
(In reply to comment #16)
> Exactly how is this approach better than either sticking with what Webkit did
> before (mirroring IE/Opera), *or* doing what Firefox does (which at least has a
> minimal chance of being understood by the server)?

This is how things are encoded by the form code of all browsers. Therefore, most servers are already setup to handle this type of input in the query encodings. IE/Opera lose the data, and Firefox generates gibberish in the encoding the server expects. Those options seem worse, no?