Bug 23546

Summary: Upstream GoogleURL implementation of KURL
Product: WebKit Reporter: Darin Fisher (:fishd, Google) <fishd>
Component: PlatformAssignee: Darin Fisher (:fishd, Google) <fishd>
Status: RESOLVED FIXED    
Severity: Normal CC: brettw, darin
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
v1 patch darin: review+

Comment 1 Darin Fisher (:fishd, Google) 2009-01-26 09:48:46 PST
Created attachment 27037 [details]
v1 patch

See corresponding USE(GOOGLEURL) defines in KURL.h
Comment 2 Darin Adler 2009-01-27 09:48:20 PST
Comment on attachment 27037 [details]
v1 patch

Since we're planning to change KURL's name to URL in the future, it's unfortunate the the new filename is GKURL.cpp. I think URLGoogle.cpp might fit in better with the naming in the platform layer in general. Just like KURLMac.mm, only forward-looking in that it omits the K.

> +#include "config.h"
> +#include "KURL.h"
> +
> +#include "CString.h"
> +#include "NotImplemented.h"
> +#include "TextEncoding.h"
> +#include <wtf/Vector.h>
> +
> +#if USE(GOOGLEURL)

It's better to put the #if up a little higher, before the includes. That's how we do it in all the SVG source files.

> +    WebCoreCharsetConverter(const TextEncoding* encoding)

Since a TextEncoding is really just a wrapped const char*, it makes sense to pass it by value rather than using a TextEncoding*. Unless you're just using TextEncoding* as a way to pass the null value. I guess it's OK to leave it that way since that's how it's done in the KURL constructor, but that really seems like something weak we may later want to change.

> +inline void assertProtocolIsGood(const char* protocol)

> +inline const url_parse::UTF16Char* CharactersOrEmpty(const String& str)

> +inline bool isUnicodeEncoding(const TextEncoding* encoding)

> +bool lowerCaseEqualsASCII(const char* begin, const char* end, const char* str)

Functions used only inside a source file should be qualified with the word "static" to give them internal linkage. lowerCaseEqualsASCII seems like something that should be in a shared source file, not inside this one particular Google-specific one. Also, it should assert that the string has no non-ASCII or uppercase characters in it. I could have sworn we already have something similar somewhere in the tree, but I couldn't quickly find it.

> +GoogleURLPrivate::GoogleURLPrivate()

Why isn't this in the WebCore namespace?

> +CFURLRef KURL::createCFURL() const {

Wrong formatting.

> +// We handle "parameters" separated be a semicolon, while the old KURL does
> +// not, which can lead to different results in some cases.

Typo here. "be a semicolon".

I'd like to see a test case demonstrating the effect of this difference. The comment implies it has some observable effect.

> +String KURL::host() const
> +{
> +    // Note: WebKit decode_string()s here.
> +    return m_url.componentString(m_url.m_parsed.host);
> +}

Is that a bug? Test case?

> +// Returns 0 when there is no port or it is invalid.
> +//
> +// We define invalid port numbers to be invalid URLs, and they will be
> +// rejected by the canonicalizer. WebKit's old one will allow them in
> +// parsing, and return 0 from this port() function.

This comment is confusing. I think what you're saying is that this function returns 0 when the port number is invalid or missing. I don't see the point of saying "WebKit's old one". This is an implementation of KURL::port too! Both functions have this behavior because it's the KURL class design.

> +unsigned short int KURL::port() const

Excess "int" here.

> +    // Note: WebKit decode_string()s here.
> +    return m_url.componentString(m_url.m_parsed.password);

Bug? Test case?

> +// Returns the empty string if there is no username.
> +String KURL::user() const
> +{
> +    // Note: WebKit decode_string()s here.
> +    return m_url.componentString(m_url.m_parsed.username);
> +}

Bug? Test case? I have the same question about all those places where you say "WebKit decode_string()s here". Also, since this code is part of WebKit, it's a little curious to refer to the other KURL as "WebKit". Welcome to WebKit, guys!

We need more test cases rather than just code comments about these various behaviors.

> +// This function is used only in the JSC build.
> +void KURL::setHostAndPort(const String& s) {

Misplaced brace here.

> +        // When set with the empty string or something that doesn't begin with
> +        // a question mark, WebKit will add a question mark for you. The only
> +        // way this isn't compatible is if you call this function with an empty
> +        // string. Old KURL will leave a '?' with nothing following it in the
> +        // URL, whereas we'll clear it.

Why is it OK to have this difference in behavior?

> +// Copy the KURL version here on Sept 12, 2008 while doing the webkit merge.

I don't understand this comment. Maybe it's just the verb tense that's confusing me.

> +// FIXME(erg) Somehow share this with KURL? Like we'd theoretically merge with
> +// decodeURLEscapeSequences below?

Our syntax is "FIXME:" without the person's initials.

> +#ifdef KURL_DECORATE_GLOBALS

What's this ifdef about?

> +// In WebKit's implementation, this is called by every component getter.
> +// It will unescape every character, including NULL. This is scary, and may
> +// cause security holes.

Please contribute to the shared WebKit project by creating test cases demonstrating why this is a bad idea. Our project policy is that when you fix a bug, you write a test case demonstrating what was wrong before. Ideally, this policy applies even if you're fixing the bug by replacing the entire KURL class with your own library!

r=me -- please do consider my comments though; I really think you need test cases demonstrating the many bugs this file claims to be fixing
Comment 3 Darin Fisher (:fishd, Google) 2009-01-27 10:07:45 PST
Hi Darin, thanks for all the thoughtful feedback.  I will revise appropriately, and look into test cases.  This code is actually not my personal work.  I'm just playing code shepherd :)

Brett has filed many bugs against KURL that he believes to be important and which GKURL avoids.  I agree that expressing those differences in terms of layout tests would be good.

I think the issue of decode_string is not about a particular bug per se but about a general rule of thumb that unescaping URL fragments is risky business.  In Mozilla, we had security bugs related to URL unescaping.  It turned out in most cases, that the browser simply didn't need to be unescaping those strings.  By pushing the unescaping out to only the places where it is really needed (e.g., HTTP authentication code), things got a lot more sane and easy to reason about.
Comment 4 Darin Fisher (:fishd, Google) 2009-01-27 10:08:21 PST
+darin@apple.  forgot to cc you when I replied.  please see comment #3.  thanks!
Comment 5 Darin Fisher (:fishd, Google) 2009-01-27 10:14:27 PST
On the topic of naming, how about I go with KURLGoogle and KURLGooglePrivate for now.  Once we are ready to drop the "K" prefix, we can do so across the board.  Otherwise, things might be a bit confusing in the meantime.  Let me know if you object.
Comment 6 Brett Wilson (Google) 2009-01-27 11:48:31 PST
>> + // When set with the empty string or something that doesn't begin with
>> + // a question mark, WebKit will add a question mark for you. The only
>> + // way this isn't compatible is if you call this function with an empty
>> + // string. Old KURL will leave a '?' with nothing following it in the
>> + // URL, whereas we'll clear it.
>
> Why is it OK to have this difference in behavior?

I think I did it this way because it would require a little more complexity to support this, and http://foo.com/bar? is the same as http://foo.com/bar

We can add a FIXME if you are concerned about this.
Comment 7 Darin Adler 2009-01-27 11:57:46 PST
(In reply to comment #5)
> On the topic of naming, how about I go with KURLGoogle and KURLGooglePrivate
> for now.  Once we are ready to drop the "K" prefix, we can do so across the
> board.

Perfect!
Comment 8 Darin Adler 2009-01-27 11:58:52 PST
(In reply to comment #3)
> I think the issue of decode_string is not about a particular bug per se but
> about a general rule of thumb that unescaping URL fragments is risky business. 
> In Mozilla, we had security bugs related to URL unescaping.  It turned out in
> most cases, that the browser simply didn't need to be unescaping those strings.
>  By pushing the unescaping out to only the places where it is really needed
> (e.g., HTTP authentication code), things got a lot more sane and easy to reason
> about.

I agree in principle; the current design is influenced by how KDE's KURL worked, and it could be very easily changed given the small number of callers. That's why I'm so interested in the test cases. Fixes could be easily done!
Comment 9 Darin Adler 2009-01-27 12:00:19 PST
(In reply to comment #6)
> I think I did it this way because it would require a little more complexity to
> support this, and http://foo.com/bar? is the same as http://foo.com/bar
> 
> We can add a FIXME if you are concerned about this.

I think we should fix the two implementations to match unless there's a compelling reason for them to differ. If the current KURL behavior without Google URL is no better, then lets do the trivial work to fix it so they can be identical.

A FIXME would be OK. A fix (with a test case demonstrating what changed or demonstrating there is no change) even better.
Comment 10 Brett Wilson (Google) 2009-01-27 18:07:43 PST
(In reply to comment #9)
> I think we should fix the two implementations to match unless there's a
> compelling reason for them to differ. If the current KURL behavior without
> Google URL is no better, then lets do the trivial work to fix it so they can be
> identical.
> 
> A FIXME would be OK. A fix (with a test case demonstrating what changed or
> demonstrating there is no change) even better.

Let's do the FIXME and I'll address this once fishd lands this patch.
Comment 11 Darin Fisher (:fishd, Google) 2009-01-28 00:20:26 PST
http://trac.webkit.org/changeset/40307

I added the FIXME comment and cleaned up the other issues as discussed.