decodeURLEscapeSequences will unescape "%00" in a URL to generate a string with embedded NULLs. I don't know of a specific problem with this, but it can be very dangerous, because then all the string handling in the rest of the application must handle NULLs correctly. There have been security holes in other web browsers because of doing stuff like this. IE doesn't unescape NULLs in this case (it converts them to "%2500"), and I don't think WebKit should, either.
When decodeURLEscapeSequences encounters an escape sequence, it will unescape it and then try to convert it using the given encoding. Normally, this is UTF-8. If you just request the path() of a URL, it will automatically decode it! This is a destructive operation.
If you generate a character not in the character set (in this case, UTF-8), then it will be replaced with the "invalid character." If you try to compose this again, the URL will be corrupted.
This is actually a pretty big problem. If I'm on a Japanese page with a path encoded as ShiftJIS (escaped), if that page requests document.location.pathname, it will be wrong.
I don't think that these URL components should be unescaped at all by the component getters like KURL.host() and KURL.path(). Neither Firefox nor IE does this. IE is a little tricky. It will keep around the unescaped form so it will depend on whether the original link the URL was Unicode or escaped.
I can do a patch for this, but it would be nice to get some comments on my approach first.
CCing folks who would know/feel-responsible-for URL handling in WebKit.
Can we create a test that demonstrates these problems? Do these affect real websites?
We should make fixes based on effects rather than on critique of the code. I'd like to see us start with a test case. And it would be really great to have an example of at least one website that will work better if we make the code change.
I'm not fully convinced by the "embedded null characters are dangerous" argument. We don't have code that treats null characters as a special case, and I don't find the fact that other browsers had code like that to be a compelling argument. I'm more convinced by the "be consistent with IE" argument, though, so it may not be important what I think about the other argument. But also, you can change my mind if I'm wrong.
> This is actually a pretty big problem. If I'm on a Japanese page with a path
> encoded as ShiftJIS (escaped), if that page requests
> document.location.pathname, it will be wrong.
I'd like to understand exactly what "wrong" means here. We do need to match the behavior of other browsers.
But as you probably know, in general URLs don't necessarily correspond to a Unicode string. They are sent, byte for byte, to the server, and the server responds, so even URLs with invalid %-escape sequences might work on some sites and servers.
> We don't have code that treats null characters as a special case
I'm not arguing for or against this change, but Windows APIs do special-case NULLs, so routines that validate strings for using in Windows system calls often must understand the magical, string-terminating semantics of NULL.
(In reply to comment #5)
> I'm not arguing for or against this change, but Windows APIs do special-case
> NULLs, so routines that validate strings for using in Windows system calls
> often must understand the magical, string-terminating semantics of NULL.
Makes sense. Any code path that involves calling String::charactersWithNullTermination could run afoul of a null character, and although I can't find any examples of doing that with a URL or a piece of a URL it's possible there could be some.
(In reply to comment #7)
> Another thought is that if we have a code path that has a problem with URLs
> with null characters, it can possibly be triggered by making a URL with
I believe that will get escaped by KURL automatically.
Sorry, I forgot to add the effect for the encoding problem:
I have part of a layout test that demonstrates the path being messed up. If you have a page at "foo.com/asdf%F0" and you say document.location.pathname, it will give you "/asdf\xFFFD" (the %F0 being replaced by the Unicode "replacement character"). Although I don't know of any sites that break because of this, it's hard to argue that it's not wrong.