[WTFURL] Add URLQueryCanonicalizer
Created attachment 66334 [details]
Comment on attachment 66334 [details]
View in context: https://bugs.webkit.org/attachment.cgi?id=66334&action=review
r- because at least the all-caps typenames definitely need to be fixed.
> +extern const char hexCharacterTable[0x10];
It would be more conventional to say [16] instead of [0x10].
> +// Write a single character, escaped, to the output. This always escapes: it
> +// does no checking that the character requires escaping.
> +// Escaping makes sense only for 8-bit chars, so code works in all cases of
> +// input parameters (8/16 bit).
Comment seems excessive - the function name is reasonably descriptive as it is.
> +template<typename INCHAR, typename OUTCHAR>
> +inline void appendEscapedCharacter(INCHAR ch, URLBuffer<OUTCHAR>& buffer)
- All-caps typenames are not standard WebKit style.
- Consider naming the function appendURLEscapedCharacter. Given only the context of the WTF namespace, it might not be clear what kind of escaping is intended, but "URL escaped" or "percent escaped" would be pretty unambiguous.
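A minimal sketch of what the suggested rename and WebKit-style template parameters might look like. This is illustrative only: it uses std::string in place of URLBuffer, and the table layout is an assumption, not code from the patch.

```cpp
#include <string>

// Hexadecimal digits used for percent-escaping; size written as [16]
// rather than [0x10], per the review comment above.
static const char hexCharacterTable[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7',
    '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'
};

// WebKit style: CamelCase template parameter, not ALL_CAPS, and a name
// that makes the kind of escaping unambiguous.
template<typename InChar>
inline void appendURLEscapedCharacter(InChar ch, std::string& buffer)
{
    // Percent-escape a single byte: '%' followed by two hex digits.
    unsigned char byte = static_cast<unsigned char>(ch);
    buffer += '%';
    buffer += hexCharacterTable[byte >> 4];
    buffer += hexCharacterTable[byte & 0xf];
}
```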
> +// Query canonicalization in IE
> +// ----------------------------
> +// IE is very permissive for query parameters specified in links on the page
> +// (in contrast to links that it constructs itself based on form data). It does
> +// not unescape any character. It does not reject any escape sequence (be they
> +// invalid like "%2y" or freaky like %00).
> +// IE only escapes spaces and nothing else. Embedded NULLs, tabs (0x09),
> +// LF (0x0a), and CR (0x0d) are removed (this probably happens at an earlier
> +// layer since they are removed from all portions of the URL). All other
> +// characters are passed unmodified. Invalid UTF-16 sequences are preserved as
> +// well, with each character in the input being converted to UTF-8. It is the
> +// server's job to make sense of this invalid query.
> +// Invalid multibyte sequences (for example, invalid UTF-8 on a UTF-8 page)
> +// are converted to the invalid character and sent as unescaped UTF-8 (0xef,
> +// 0xbf, 0xbd). This may not be canonicalization, the parser may generate these
> +// strings before the URL handler ever sees them.
> +// Our query canonicalization
> +// --------------------------
> +// We escape all non-ASCII characters and control characters, like Firefox.
> +// This is more conformant to the URL spec, and there do not seem to be many
> +// problems relating to Firefox's behavior.
> +// Like IE, we will never unescape (although the application may want to try
> +// unescaping to present the user with a more understandable URL). We will
> +// replace all invalid sequences (including invalid UTF-16 sequences, which IE
> +// doesn't) with the "invalid character," and we will escape it.
Does this comment really need to be essay-length? Perhaps it could be pared down to the key non-obvious information. In particular, anything that someone needing to read or modify this code is likely to need to know. I'm not sure a detailed spec of IE's behavior, which is not in fact implemented here, is likely to be all that relevant.
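For reference, the behavior the comment describes ("escape all non-ASCII characters and control characters, never unescape") can be sketched in a few lines. This is a hypothetical standalone illustration using std::string, not the patch's actual URLBuffer-based implementation:

```cpp
#include <string>

// Sketch of the canonicalization described above: percent-escape every
// control character and every non-ASCII byte of a UTF-8 query, pass
// printable ASCII through, and never unescape existing "%xx" sequences
// (even invalid ones like "%2y").
inline std::string canonicalizeQuerySketch(const std::string& utf8Query)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string result;
    for (char c : utf8Query) {
        unsigned char byte = static_cast<unsigned char>(c);
        if (byte < 0x20 || byte >= 0x7f) {
            result += '%';
            result += hexDigits[byte >> 4];
            result += hexDigits[byte & 0xf];
        } else
            result += c; // '%' is printable ASCII, so "%2y" passes through
    }
    return result;
}
```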
> +template<typename CHAR, typename OUTCHAR, void convertCharset(const CHAR*, int length, URLBuffer<char>&)>
Please don't use all-caps for template parameters. Also, why CHAR/OUTCHAR here instead of INCHAR/OUTCHAR as in the code above?
> + static void DoCanonicalizeQuery(const CHAR* spec, const URLComponent& query, URLBuffer<OUTCHAR>& buffer, URLComponent& resultQuery)
Member function names should not start with a capital letter in WebKit style. Also "do" seems redundant - why not just "canonicalizeQuery"?
> + // FIXME: This should be an unsigned comparison.
Does this FIXME really need fixing? If so, please do. Tentatively, it seems like it might be broken when CHAR is "char" and the platform has signed char.
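To illustrate the concern (this is a hypothetical example, not code from the patch): when the character type is plain "char" and the platform's char is signed, bytes at or above 0x80 become negative, so an unadorned comparison against 0x80 is silently wrong unless the value is first cast to an unsigned type.

```cpp
// Correct: force an unsigned comparison before testing the high bit.
template<typename CharType>
inline bool isHighByte(CharType ch)
{
    return static_cast<unsigned char>(ch) >= 0x80;
}

// Buggy when char is signed: a signed char can never reach 128, so this
// comparison is always false (0xE9 reads as -23, for example).
inline bool isHighByteBuggy(signed char ch)
{
    return ch >= 0x80;
}
```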
> + // Appends the given string to the output, escaping characters that do not
> + // match the given |type| in SharedCharTypes. This version will accept 8 or 16
> + // bit characters, but assumes that they have only 7-bit values. It also assumes
> + // that all UTF-8 values are correct, so doesn't bother checking
Comment is too verbose. I can see how it's useful to document the preconditions of the function, but that would be even better done with assertions than with comments. From reading the source, it's not clear how SharedCharTypes gets involved. SharedCharTypes does not appear elsewhere in this patch. Is that part of the comment accurate?
Thanks for the review. I'll be more aggressive about removing comments in the future. :) I'm sad the stylebot and I both missed the capitalization issues.
Created attachment 70625 [details]
Comment on attachment 70625 [details]
Clearing flags on attachment: 70625
Committed r69686: <http://trac.webkit.org/changeset/69686>
All reviewed patches have been landed. Closing bug.
http://trac.webkit.org/changeset/69686 might have broken Chromium Mac Release