Bug 16538 - KURL should use String instead of DeprecatedString
Summary: KURL should use String instead of DeprecatedString
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Darin Adler
URL:
Keywords:
Depends on: 16485 16487
Blocks:
  Show dependency treegraph
 
Reported: 2007-12-20 11:14 PST by Brett Wilson (Google)
Modified: 2008-02-28 16:45 PST (History)
2 users (show)

See Also:


Attachments
Patch (67.66 KB, patch)
2008-01-29 17:42 PST, Brett Wilson (Google)
darin: review-
Details | Formatted Diff | Diff
Patch for KURL to use string internally (77.12 KB, patch)
2008-02-01 11:28 PST, Brett Wilson (Google)
no flags Details | Formatted Diff | Diff
New patch addressing most comments (88.08 KB, patch)
2008-02-05 20:01 PST, Brett Wilson (Google)
no flags Details | Formatted Diff | Diff
patch rebased and with backwards protocolIs fixed (88.49 KB, patch)
2008-02-10 08:07 PST, Darin Adler
no flags Details | Formatted Diff | Diff
more ambitious approach, eliminating more DeprecatedString (163.95 KB, patch)
2008-02-10 10:54 PST, Darin Adler
no flags Details | Formatted Diff | Diff
patch (403.46 KB, patch)
2008-02-12 13:15 PST, Darin Adler
eric: review+
Details | Formatted Diff | Diff
patch, now with fewer broken platforms (430.01 KB, patch)
2008-02-14 18:15 PST, Darin Adler
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Brett Wilson (Google) 2007-12-20 11:14:18 PST
KURL should use String instead of DeprecatedString
Comment 1 Brett Wilson (Google) 2008-01-29 17:42:28 PST
Created attachment 18785 [details]
Patch

This replaces String with DeprecatedString in KURL's interface.

The internal representation is still a DeprecatedString. I tried two different timer to do the change the internals, but it is challenging because the parsing code relies on the utf-16/utf-8 duality of the string. This might be more doable with a fresh look. MacDome was also expressing interest in doing this task. This interface patch is designed to be the first step in fixing KURL; clearly the internals will need to be fixed, but these are both complicated patches which I think are better separated.

There are more conversions which will make the code slower, although I am pretty sure it will not be measurable: About half the callers already had a String, and this patch does not change the number of conversions in those cases. The other half (which had a DeprecatedString) will need an extra string conversion. This also breaks the optimization where the string inside KURL would just reference the argument without a malloc in some cases. I left this in with some FIXMEs for the arguments, so it will be easy to fix when the internals are also converted to String.

-----

I made a number of changes to KURL aside from just String conversions:

Added a lot of comments to the header file

I made the constructor explicit. I kept finding places where there was a String, and it was being converted to a KURL when being passed to a function without the code realizing that this was a potentially slow operation. This highlights some really stupid usage, often involving resolution of a relative URL with a string base. For example, in CSSParser::parseBorderImage:
  context.commitImage(new CSSImageValue(
      KURL(KURL(styleElement->baseURL()), uri).string(), styleElement));
Now the fact that baseURL should probably return a KURL is more obvious, or, if not, at least tells you how expensive the operation you are doing is. I would say everybody should use KURL and the above code should be:
  context.commitImage(new CSSImageValue(
      KURL(styleElement->baseURL(), uri), styleElement));
A lot of the diff relates to this change.

I added a new function, protocolIs(const char*). A lot of users were doing this pattern:
  if (url.protocol() == "http");
This is bad for several reasons. First, it does a case-sensitive compare, which is incorrect (KURL does not canonicalize the scheme). To work around this, some callers do something like this:
  if (url.protocol().lower() == "http")
This makes the second problem even worse, which is that lots of string temporaries are getting created for an operation which doesn't need it.  My new function handles different cases and does not create new objects, while making the usage more clear.

I fixed some signed/unsigned types in findHostnamesInMailToURL and other places which are suspicious and generate warnings.

-----

I made two changes outside of KURL relating to bug 16538, which this patch fixes. Some code expects the string in a KURL to be non-NULL, despite there being many ways to make a NULL KURL. This code is not tripped because URLs oftem go through string conversions, i.e. KURL->String->KURL, and this process will lose the isNull-ness of the string and it will report empty instead. I made a fix in IconDatabase.cpp, and one in DocLoader.cpp (which I previously filed bug 16485 for).

In loader.cpp I removed some dead code.
Comment 2 Eric Seidel (no email) 2008-01-30 03:06:26 PST
Comment on attachment 18785 [details]
Patch

This looks great.

I think all additions of String( are probably bogus.  String does not have any explicit constructors (I don't think).

I think this patch clearly points out that certain things like:
HTMLAnchorElement::href()
Document::url()
Element::baseURL()

We can convert those to use KURL as soon as we teach the JS bindings how to silently convert to a UString in the getters.  Perhaps we already have a StringImp* toJS(KURL) function...


isAllASCII could move onto StringImpl if there are any other callers of DeprecatedString::isAllASCII in the sourcebase

It's very sad that:
char *strBuffer;
uses manual memory management.  However you didn't change that, so we can fix that later in a separate patch.

I understand that you are not adding ifs here, but any ifs that you touch where you're changing:
if (foo) {
return DeprecatedString();
}
you could update them to match our style guidelines and remove the { }.  That's not required for this patch, but it's good to know for the future at least.

Why not use your new protocolIs function here?
return protocol().lower() == "file";

This seems incorrect:
unsigned separator = findFirstOf(s, 0, "://");

// WILL include the "?".
you use // Will
on the next line.

All * and & should be next to the argument names.  Per our style guidelines.

I think there is a String version of completeURL:
KURL link(element->document()->completeURL(href.deprecatedString()));

I'm going to leave this as ? since I expect darin may have further comments.
Comment 3 Eric Seidel (no email) 2008-01-30 03:07:15 PST
It's unfortunate that we don't have more URL handling tests.  :sigh:
Comment 4 Darin Adler 2008-01-30 06:06:52 PST
Comment on attachment 18785 [details]
Patch

This patch looks great! I particularly like protocolIs().

> but it is challenging because the parsing code relies on the utf-16/utf-8 duality of the string

I'm not sure what you mean here. DeprecatedString does not have UTF-16/UTF-8 duality. It does have a feature where you can just call ascii() on it and see the characters as an 8-bit string but:

   1) it doesn't actually do anything really efficient -- it pretty much just allocates a new 8-bit buffer
   2) it chops characters and does not do a UTF-16 to UTF-8 conversion
   3) there's very little code that actually takes advantage of this

Would you be willing to take another pass at that aspect of the patch? I ask because I think it would be better to not land this with the comments "this optimization is broken for now". I really don't think we need to stage this so that there's a stage with the broken optimization.

String is inefficient if you do thinks like concatenate; any modifications to string length require reallocating the buffer. The same is not true about DeprecatedString. So I'm concerned about replacing functions that build up a DeprecatedString by appending with functions that do the same thing with String. This is the most challenging issue to be considered when replacing DeprecatedString with String.

encodeHostnames and substituteBackslashes are examples of this. To do this efficiently with String you need to build up the new string as a Vector<UChar> and then use adopt() to turn it into a string once done modifying it. There many not be the appropriate helper functions yet for String and Vector<UChar>, and we've even considered adding a new StringBuilder class.

This is one of the reasons I personally haven't converted these functions yet yet. Just converting to String and hoping that performance is acceptable may not produce acceptable results. Please consider Vector<UChar> for functions that build strings by appending. It's pretty straightforward, and anything inconvenient can probably be papered over with a few helper functions.

The mix of signed and unsigned integers is starting to get a little awkward. Since our existing string functions mostly use int, I'd prefer that you continue that to avoid ugly casts in new code. Or we could move in the other direction and try to move away from using int for string character indices. I'm not pleased with the mix.

When checking for the 0 character this new code has a mix of styles. Sometimes it's "!c", other times "c == 0", other times "c == '\0'". I'd prefer more consistency.

-    ASSERT(!pageURLOriginal.isNull());

Was this assertion not actually guaranteed? If so, then I understand the bug fix. If not, then I'm puzzled by why you made the change here.

+static bool isAllASCII(const String& str)
+{
+    for (unsigned i = 0; i < str.length(); i++) {
+        if (str[i] > 0x7f)
+            return false;
+    }
+    return true;
+}

This function already exists in RenderText.cpp, although it works on a StringImpl rather than a String. I would prefer that we unify and share a single one rather than adding another. The RenderText.cpp one uses an algorithm that's considerably faster for long strings where characters actually are all ASCII, and since all-ASCII is the normal case I'd prefer to do it that way.

+// ASCII. The detination buffer must be large enough, and it is not NULL
+// terminated unless the source has a NULL.

Typo. Missing the "s" in destination.

Normally NULL refers the pointer value and not the 0 character. The 0 character is usually called NUL, with one L. Or you can say "null-terminated" and "null character" and not SHOUT about the null character. There are many other places you call this character NULL -- those should all be fixed.

+        dest[i] = static_cast<char>(src[i]);

There's no need for an explicit cast here. I don't know of any compiler WebKit supports that gives a warning for narrowing conversions like this one. And while it would be nice to be explicit that there's a type conversion going on, I think the general guideline to avoid casts when possible is probably slightly more important.

+// Returns the index of the first index in string |s| of any of the characters
+// in |toFind|. |toFind| should be a NULL-terminated string, all characters up
+// to the NULL will be searched. Returns UINT_MAX if not found.
+unsigned findFirstOf(const String& s, unsigned startPos, const char* toFind)

I would prefer that if you need something like this that you add it to the String class rather than put in a local function inside KURL. Or if you really think it should be local, it should be marked static so it gets internal linkage.

If you went to add it to the String class you'd see the mismatch, where you're using unsigned and the other functions use int. It would be best if we were consistent about this. If I was starting a new class I'd probably use unsigned, but I'd prefer to not have the single new function go in a different direction from all the existing ones. We could change them all in the future if you think unsigned is better for this sort of thing.

+        // Note that we don't pass anything as the second argument. When the
+        // innards of KURL are converted from DeprecatedString to String, this
+        // should be put back. It is an optimization that prevents allocating
+        // another string in the common case that nothing changes.

Are you sure it's OK to lose this optimization? I'm pretty sure it's a real issue. We added this optimization for a reason, and I don't think we should take it out lightly.

+    for (i = 0; i < schemeEndPos; i++) {
+        if (!proto[i] || toASCIILower(urlString[i].unicode()) != proto[i])
+            return false;
+    }
+    return proto[i] == 0;  // We should have consumed all characters in the argument.

Why use ! inside the loop and == 0 outside the loop? I'd prefer that you be consistent. I think we should add an assert here that the passed-in protocol does not contain any ASCII uppercase characters or non-ASCII characters. I also think that the name "proto" isn't so great, since it could easily mean "prototype" instead. Why not use the entire word protocol?

-    return protocol() == "file";
+    return protocol().lower() == "file";

This should be using protocolIs().

+    // DANGER! Thos does not do things like IDN, and the resulting host name

Typo "this" not "thos".

+    // Returns true if this URL hsa a path. Note that "http://foo.com/" has a

Typo "has" not "hsa".

+    // Returns true of the current URL's protocol is the same as the NULL-

Typo "if" not "of".

+    void setProtocol(const String &);
+    void setHost(const String &);

There should not be spaces before these "&" symbols. In general it would be good if you fixed the positions of * and & characters even in existing code you touch. Our WebKit coding style is Type* and Type& rather than Type * and Type &.

For the functions where you're adding comments before the function declaration, please also add blank lines to group the comments with the functions they are describing. I find this hard to read:

    void f();
    // Pass a full path.
    void g(const char*).
    void h();

I'd prefer that there be a blank line after f and before h in a case like this.

+        m_frame->loader()->begin(KURL("http://localhost/placeholder.svg")); // create the empty document

Why is http a good idea here? Can we come up with something better? I think it would be bad if something decided to do I/O and talk to the local HTTP server! That could well be a valid URL to a different SVG.

There are enough comments here that I'm going to say review- for now, but this seems like great work. Thanks for taking this on.
Comment 5 Brett Wilson (Google) 2008-02-01 11:28:45 PST
Created attachment 18850 [details]
Patch for KURL to use string internally

I tried to do as Darin suggested, changing the entire KURL innards to use String.

This took me more than 1.5 days of solid work, and mostly works, but has some bugs. For example, GMail takes forever to load and some bogus looking URLs flash by in the URL bar in Safari. I have also seen some stack corruption that I have not been able to reproduce, and I broke dom/xhtml/level3/core/nodegetbaseuri04.xhtml in a way that I don't understand.

I do not think that fixing this is technically worthwhile. Even if we fix the above bugs, I am certain that I have introduced some subtle bugs, incorrect behavior, and crashes in this patch that I do not know about. Lacking any tests, and given the poor state of this code, doing this change would be risky and take a very long time.

You mentioned that the concern  was the broken optimization for storing Strings without copies when they are unchanged, but I do not think that this is a particularly important operation. It is less important than fixing all the String<->DeprecatedString conversions in the way that KURL is called.

I think that checking in the interface fix, and then fixing bug 17098 is the best way to go, since that library contains a lot of tests that we can run to validate the results. With the interface patch in, the integration for that bug will be easy and much lower risk.
Comment 6 Brett Wilson (Google) 2008-02-05 20:01:37 PST
Created attachment 18947 [details]
New patch addressing most comments

MacDome's review:

I think all additions of String( are probably bogus.  String does not have any
explicit constructors (I don't think).

You are correct, I removed these.

> you could update them to match our style guidelines and remove the { }.  That's
> not required for this patch, but it's good to know for the future at least.

I fixed a bunch of these, although I don't pretend to have fixed all cases.

> All * and & should be next to the argument names.  Per our style guidelines.

...as well as these when I noticed them.

I fixed your other suggestions.

-----------
Darin's review:

> The mix of signed and unsigned integers is starting to get a little awkward.

It's unfortunate that String uses both unsigned and int to refer to string indices. With this design, it is impossible for any code using strings to be very consistent. I changed some things when I got unsigneds out of string to use int. I have changed some of these back and used static cast on string.length instead. The code is a little more consistent now.

> encodeHostnames and substituteBackslashes are examples of this.

I replaced these functions with the versions from the patch in the middle I gave up on. These worked well, and were not the cause of the problems I reported (the problems were in KURL::parse).

The rest of the functions still use DeprecatedString when concatenating, with the exception of substituteBackslashes. This function isn't very commonly used, it only does one string concat, and to replace, I would have to write my own version of replace() which seemed like a waste of time to me. I can change this too if you really want.

> -    ASSERT(!pageURLOriginal.isNull());
>
> Was this assertion not actually guaranteed? If so, then I understand the bug
> fix. If not, then I'm puzzled by why you made the change here.

That's correct. The code that calls it is closed source, but I think it gets the URL from some frame or loader related class, and blindly passes it to this function. Because old KURL would lose the NULLness of strings, it would be an empty but non-NULL URL in some cases when now it is NULL when it was constructed as a NULL KURL (not sure exactly what cases those are).

Was this assertion not actually guaranteed? If so, then I understand the bug
fix. If not, then I'm puzzled by why you made the change here.

> +static bool isAllASCII(const String& str)
> ...
> This function already exists in RenderText.cpp, although it works on a
> StringImpl rather than a String. I would prefer that we unify and share a
> single one rather than adding another. The RenderText.cpp one uses an algorithm
> that's considerably faster for long strings where characters actually are all
> ASCII, and since all-ASCII is the normal case I'd prefer to do it that way.

FIXME FIGURE THIS OUT

> +        dest[i] = static_cast<char>(src[i]);
>
> There's no need for an explicit cast here. I don't know of any compiler WebKit
> supports that gives a warning for narrowing conversions like this one. And
> while it would be nice to be explicit that there's a type conversion going on,
> I think the general guideline to avoid casts when possible is probably slightly
> more important.

You are mistaken, Visual C++ gives:
warning C4244: '=' : conversion from 'unsigned short' to 'char', possible loss of data.
I think it's very important when you are losing data to know that you meant to lose it when you read the code, even if the compiler issued no warning.

> +unsigned findFirstOf(const String& s, unsigned startPos, const char* toFind)
>
> I would prefer that if you need something like this that you add it to the
> String class rather than put in a local function inside KURL. Or if you really
> think it should be local, it should be marked static so it gets internal
> linkage.

I forgot the static here, I have added it. This is used on Vectors as well, so I can't add it to the String class.

I think a general problem here with this function, isAllASCII, and substituteBackslashes is that you are requiring that String not be used for this type of manipulation (which I understand), while WebKit lacks the infrastructure to do anything beyond stdlib outside of the String/StringImpl class. I think there are four options: move all StringImpl's guts out to a file that takes lengths and offsets, make StringImpl able to use a buffer it doesn't own, have a nice place to put this type of function outside of String[Impl], or accumulate a whole lot of custom-written string manipulation functions in each file.

Fixing this is decidedly outside the scope of this patch, but should be considered soon.

> +        // Note that we don't pass anything as the second argument. When the
> +        // innards of KURL are converted from DeprecatedString to String, this
> +        // should be put back. It is an optimization that prevents allocating
> +        // another string in the common case that nothing changes.
>
> Are you sure it's OK to lose this optimization? I'm pretty sure it's a real
> issue. We added this optimization for a reason, and I don't think we should
> take it out lightly.

I think this is a nice optimization. See my previous comment about why I don't think this is practical in this patch. Plus, since probably half of the callers of the string class have a DeprecatedString to start out with, you already lost the optimization for those callers. The cost of not doing this is only the cost of doing a String<->DeprecatedString conversion, which is also done all over.

I think an even better way of improving this would be to reduce the number of KURL->String->KURL operations. A large number of the cases where the string doesn't change and this optimization could apply are a result of this sequence of events. Making more things use KURL instead of String would make it even faster than this optimization. I have filed a few bugs on this and am hoping to do some work in this area once this patch is in.

After talking to MacDome, I think fixing bug 17098 would be a faster and, surprisingly, a less risky way of fixing this optimization. I would consider my KURL-to-use-String-internally patch pretty risky without any tests.


> +    for (i = 0; i < schemeEndPos; i++) {
> +        if (!proto[i] || toASCIILower(urlString[i].unicode()) != proto[i])
> +            return false;
> +    }
> +    return proto[i] == 0;  // We should have consumed all characters in the
> argument.
>
> Why use ! inside the loop and == 0 outside the loop? I'd prefer that you be
> consistent. I think we should add an assert here that the passed-in protocol
> does not contain any ASCII uppercase characters or non-ASCII characters. I also
> think that the name "proto" isn't so great, since it could easily mean
> "prototype" instead. Why not use the entire word protocol?

I fixed the name and 0 comparisons, and added the assertion you requested.

I fixed the grammar and spelling errors, and placement of '&' in function prototypes.

> +        m_frame->loader()->begin(KURL("http://localhost/placeholder.svg")); //
> create the empty document

> Why is http a good idea here? Can we come up with something better? I think it
> would be bad if something decided to do I/O and talk to the local HTTP server!
> That could well be a valid URL to a different SVG.

Do you think KURL() would be OK here? I changed it to this.

---------
The main thing I didn't address is where isAllASCII should live. I would like some specific guidance on this given that I don't always have a String to call it on.
Comment 7 Darin Adler 2008-02-06 06:57:37 PST
(In reply to comment #6)
> I think it's very important when you are losing data to know that you meant to
> lose it when you read the code, even if the compiler issued no warning.

Not to be pedantic, but I don't agree with this principle. We are often losing data when working with integers in C. For example.

    short x;
    short y;
    x = y + 1;

The above loses data. You might need a larger integer type, perhaps int, to hold the value "y + 1".

I think the cast in this case is only useful because the warning exists. Unnecessary casts have costs -- they can do unnecessary or incorrect type conversions or extra work and suppress useful warnings -- so I prefer to include them only when they have a clear benefit.

> I think a general problem here with this function, isAllASCII, and
> substituteBackslashes is that you are requiring that String not be used for
> this type of manipulation (which I understand), while WebKit lacks the
> infrastructure to do anything beyond stdlib outside of the String/StringImpl
> class. I think there are four options: move all StringImpl's guts out to a file
> that takes lengths and offsets, make StringImpl able to use a buffer it doesn't
> own, have a nice place to put this type of function outside of String[Impl], or
> accumulate a whole lot of custom-written string manipulation functions in each
> file.

I think that any StringImpl function that can run on characters outside a string that we need outside the string classes should be moved into a separately accessible free function. Because StringImpl is such an ugly name, I'd suggest putting those functions into String.h (sadly, its real name is PlatformString.h) like this:

    bool isAllASCII(const UChar*, int length);

They can be called from StringImpl.cpp and String.cpp as needed, or in some cases we can have a shared inline version if function call overhead is too much.

There's no need to do this all at once. If you need 2 or 3 functions, you can move them. We'll get there eventually as long as we are clear on what the direction is.

> > +        // Note that we don't pass anything as the second argument. When the
> > +        // innards of KURL are converted from DeprecatedString to String, this
> > +        // should be put back. It is an optimization that prevents allocating
> > +        // another string in the common case that nothing changes.
> >
> > Are you sure it's OK to lose this optimization? I'm pretty sure it's a real
> > issue. We added this optimization for a reason, and I don't think we should
> > take it out lightly.
> 
> I think this is a nice optimization. See my previous comment about why I don't
> think this is practical in this patch. Plus, since probably half of the callers
> of the string class have a DeprecatedString to start out with, you already lost
> the optimization for those callers. The cost of not doing this is only the cost
> of doing a String<->DeprecatedString conversion, which is also done all over.

I really worry about arguments like this unless we can couple them with performance measurements. I don't want to end up in an intermediate state with a slowdown.

Too many times I have believed my change was OK and it actually resulted in a measurable slowdown. I don't know how to mitigate that possibility for this patch.

> I think an even better way of improving this would be to reduce the number of
> KURL->String->KURL operations.

Agreed. My concern is not about getting maximum possible future improvement, but rather about landing this patch without being in a slower state.

> After talking to MacDome, I think fixing bug 17098 would be a faster and,
> surprisingly, a less risky way of fixing this optimization. I would consider my
> KURL-to-use-String-internally patch pretty risky without any tests.

I'm pretty skeptical about adding the new library layer and getting faster. I believe you did performance measurements and things were slower, not faster.

Patches for WebKit need to make the project faster or keep them the same speed except under very unusual circumstances.

> > +        m_frame->loader()->begin(KURL("http://localhost/placeholder.svg")); //
> > create the empty document
> 
> > Why is http a good idea here? Can we come up with something better? I think it
> > would be bad if something decided to do I/O and talk to the local HTTP server!
> > That could well be a valid URL to a different SVG.
> 
> Do you think KURL() would be OK here? I changed it to this.

Probably.

> The main thing I didn't address is where isAllASCII should live. I would like
> some specific guidance on this given that I don't always have a String to call
> it on.

As I said above, PlatformString.h seems a fine place for now. It's not hard to move it later.
Comment 8 Darin Adler 2008-02-06 07:08:54 PST
(In reply to comment #7)
> > > +        // Note that we don't pass anything as the second argument. When the
> > > +        // innards of KURL are converted from DeprecatedString to String, this
> > > +        // should be put back. It is an optimization that prevents allocating
> > > +        // another string in the common case that nothing changes.
> > >
> > > Are you sure it's OK to lose this optimization? I'm pretty sure it's a real
> > > issue. We added this optimization for a reason, and I don't think we should
> > > take it out lightly.
> > 
> > I think this is a nice optimization. See my previous comment about why I don't
> > think this is practical in this patch. Plus, since probably half of the callers
> > of the string class have a DeprecatedString to start out with, you already lost
> > the optimization for those callers. The cost of not doing this is only the cost
> > of doing a String<->DeprecatedString conversion, which is also done all over.
> 
> I really worry about arguments like this unless we can couple them with
> performance measurements. I don't want to end up in an intermediate state with
> a slowdown.

Let me be more clear here.

It may be that this optimization only matters at exactly one call site, or a very small number of call sites. To prove to ourselves that we're OK, we need some kind of measurement. Perhaps a count of the number of allocations in some browsing session with and without this change or something along those lines would work. We need some kind of testing, not just "cross our fingers" mode.

You're probably right that almost all callers don't depend on this optimization, but there may be a single heavily used callsite that currently does. If you can disprove this then I have no objection to getting rid of it. And I suggest removing it entirely rather than leaving it half in place.

Another way to test the impact would be to set up the i-Bench HTML test suite -- it's challenging but possible to do this.

(At Apple we have an internal page load speed test that we sometimes use to check changes like this to see if they slow things down, but we haven't been able to find one that doesn't rely on material that we don't hold the copyright to. The WebKit project doesn't have a shared open source speed test for web browsing yet.)

I'll review your new patch as soon as I get a chance.

PS: I also don't know which previous comment you're referring to when you say "why I don't think this is practical". But maybe that will become more clear when I re-review the patch and re-read the comments.
Comment 9 Darin Adler 2008-02-06 07:10:46 PST
(In reply to comment #8)
> PS: I also don't know which previous comment you're referring to when you say
> "why I don't think this is practical". But maybe that will become more clear
> when I re-review the patch and re-read the comments.

I see now. You tried again and it didn't work at all well.
Comment 10 Eric Seidel (no email) 2008-02-08 14:16:23 PST
Comment on attachment 18947 [details]
New patch addressing most comments

I think the parse() optimization hack should probably be pushed down into parse (so that it's in one place, instead of at all the call sites).  the parse() signature would become:
String parse(char*, StringImpl*), and the StringImpl* would just be ignored until parse() is converted to use Stirng/UChar internally.

I wonder if it wouldn't be helpful to convert document->url() to KURL before this patch.  I could potentially do that  for you if you need.

In general the patch looks fine.  Darin should take a peak.
Comment 11 Eric Seidel (no email) 2008-02-08 14:18:40 PST
Comment on attachment 18947 [details]
New patch addressing most comments

Regarding the perf concern.  I'm not sure you can do much about it.  We can't be scared of this code.  There is no externally-available Page Load Test (similar to moz's page cycler test), so one of the apple folks will just have to validate if it's faster/slower.  I expect it won't be any slower.  This patch gets rid of string conversions if anything.
Comment 12 Darin Adler 2008-02-09 23:03:46 PST
(In reply to comment #11)
> (From update of attachment 18947 [details] [edit])
> There is no externally-available Page Load Test
> (similar to moz's page cycler test), so one of the apple folks will just have
> to validate if it's faster/slower.

Eric, is it really true that there's no way for anyone to do any performance tests outside Apple?

What about my suggestion above of doing some web browsing and counting the number of calls to the memory allocator?
Comment 13 Eric Seidel (no email) 2008-02-09 23:49:49 PST
(In reply to comment #12)
> (In reply to comment #11)
> > (From update of attachment 18947 [details] [edit] [edit])
> > There is no externally-available Page Load Test
> > (similar to moz's page cycler test), so one of the apple folks will just have
> > to validate if it's faster/slower.
> 
> Eric, is it really true that there's no way for anyone to do any performance
> tests outside Apple?

My interpretation of the status quo: Passing the Apple PLT w/o regression would be sufficient for performance testing.  Passing some other test of performance but not passing the Apple PLT would not be sufficient.

If Marvin is capable of writing such a patch, I expect he's probably capable of coming up with some way to test it too. But since we have the PLT and it (has at least historically been) "the final say", we might as well just run it against the patch.
Comment 14 Darin Adler 2008-02-10 07:27:03 PST
I've applied the patch, and I'm doing some performance testing now.

I don't understand the status of this relative to the two bugs blocking this. Do we need fixes for those? Are they included in this patch?
Comment 15 Darin Adler 2008-02-10 07:34:09 PST
My performance test indicates a 0.7% slowdown in page loading from this change.

But I need to do some debugging too. I see this message:

    2008-02-10 07:32:00.195 Safari[98730:10b] *** -[NSObject URL]: unrecognized selector sent to instance 0x1cc7480
    2008-02-10 07:32:00.196 Safari[98730:10b] *** WebKit discarded an uncaught exception in the webView:resource:didFailLoadingWithError:fromDataSource: delegate: <NSInvalidArgumentException> *** -[NSObject URL]: unrecognized selector sent to instance 0x1cc7480

I'll figure out next what's causing that.
Comment 16 Darin Adler 2008-02-10 07:40:08 PST
With this patch applied, and a debug version of Safari, I immediately hit an assertion failure inside Safari. NSURLConnection is returning error -1002 "unsupported URL" so a load is failing, and then Safari's activity window decided that there's an inconsistency of some sort. I'll do some debugging to learn more.
Comment 17 Darin Adler 2008-02-10 07:46:30 PST
The problem is that the string for iconURL() is returning ":". I'll figure out why.
Comment 18 Darin Adler 2008-02-10 07:52:06 PST
The bug was here:

-    if (m_URL.protocol() != "http" && m_URL.protocol() != "https")
-        return "";

+    if (m_URL.protocolIs("http") && m_URL.protocolIs("https"))
+        return KURL();

The above change reverses the sense of the protocol check.
Comment 19 Darin Adler 2008-02-10 08:07:09 PST
Created attachment 19035 [details]
patch rebased and with backwards protocolIs fixed

I tested this version of the patch carefully. I still see 0.7% slowdown. We need to come up with a version that's a speedup rather than a slowdown.
Comment 20 Darin Adler 2008-02-10 10:52:25 PST
I've been working on a version of the patch that *does* change KURL to use String internally. It going well. I measured a 1.3% speedup with this one. It's not done, but I'll attach my work in progress.
Comment 21 Darin Adler 2008-02-10 10:54:12 PST
Created attachment 19037 [details]
more ambitious approach, eliminating more DeprecatedString

Assigning this to myself for now, since this is nearly ready to review I think.
Comment 22 Brett Wilson (Google) 2008-02-11 08:52:35 PST
(In reply to comment #21)
> Created an attachment (id=19037) [edit]
> more ambitious approach, eliminating more DeprecatedString
> Assigning this to myself for now, since this is nearly ready to review I think.

OK! :) I hope my work was at least moderately helpful.
Comment 23 Darin Adler 2008-02-11 09:22:04 PST
(In reply to comment #22)
> I hope my work was at least moderately helpful.

Sure, I started with your work and left most of it as-is.
Comment 24 Darin Adler 2008-02-12 13:15:32 PST
Created attachment 19095 [details]
patch
Comment 25 Eric Seidel (no email) 2008-02-12 14:02:06 PST
Comment on attachment 19095 [details]
patch

You're going for the "largest patch of the year" award it would seem.  Sad to see all these style changes mixed in. :(
Comment 26 Darin Adler 2008-02-12 16:25:08 PST
Comment on attachment 19095 [details]
patch

It was pretty hard to whip the patch into shape, so I fell a little short of my usual standards for single-purpose patches.

One big challenge was making sure the patch was a speedup, not a slowdown.

It would be practical to break off various pieces (KJS namespace change, style tweaks, additions to PlatformString.h, specific cases of transition away from DeprecatedString that aren't involved with KURL). When someone reviews this I'll look for specific advice about that.

The meaty bit of the patch is still pretty big though.
Comment 27 Eric Seidel (no email) 2008-02-12 16:55:22 PST
(In reply to comment #26)
> (From update of attachment 19095 [details] [edit])
> It was pretty hard to whip the patch into shape, so I fell a little short of my
> usual standards for single-purpose patches.

It's not a big deal.  I look forward to reviewing it... I just will have to find time this week to set aside the 1-2 hours required.
Comment 28 Eric Seidel (no email) 2008-02-13 23:53:17 PST
Comment on attachment 19095 [details]
patch

I survived!

Seems odd to have this outside of setPort:
          case Port: {
180           int port = str.toInt();
181           if (port < 0 || port > 0xFFFF)
182               port = 0;
183           url.setPort(port);
184           break;
185       }

I think you should s/string/url/ on these comments:
KJS::JSValue* jsStringOrNull(const KURL&); // null if the string is null

These could be one line ifs:
102             if (HTMLAnchorElement* anchor = [self anchorElement]) {
1103                 NSURL *href = anchor->href();
1104                 return href;
1105             }
} else if (m_renderer->isImage() && m_renderer->element() && m_renderer->element()->hasTagName(imgTag)) {
1107             NSURL *src = static_cast<HTMLImageElement*>(m_renderer->element())->src();
1108             return src;
1109         }

Better if you could file a bug:
2185                 // FIXME: What's the point of changing the value of a local variable
2186                 // (title) that's not looked at afterward? Obviously there's no test
2187                 // case for this!
2188                 if (!n->isHTMLElement()) {
2189                     String modifiedTitle = title.domString();
2190                     title = modifiedTitle.replace('&', "&&");
2191                 }
21872192 #endif

Sigh.  Sadly KURL Node::baseURI() const is yet-another recursive walk through your parent chain using virtual methods. :(

We might want to add mailtoProtocol, httpProtocol, ftpProtocol AtomicStrings some day.

Missing space before ':'
page->chrome()->addMessageToConsole(HTMLMessageSource,
1481         isWarning(errorCode) ? WarningMessageLevel: ErrorMessageLevel,
1482         message, lineNumber, document->url().string());


Those would be different depending on how it's exposed to the DOM bindings:
      if (m_frame->tree() && m_frame->tree()->parent())
-          return "";       
+         return KURL();

Seems like something we should test:
         KURL r = dl->doc()->url();
-          if (r.protocol().startsWith("http") && r.path().isEmpty())
+         if ((r.protocolIs("http") || r.protocolIs("https")) && r.path().isEmpty())
              r.setPath("/");

Why?
-          DeprecatedString domain = r.host();
-          if (dl->doc()->isHTMLDocument())
-              domain = static_cast<HTMLDocument*>(dl->doc())->domain().deprecatedString();

Looks good to me!
Comment 29 Brett Wilson (Google) 2008-02-14 08:17:20 PST
> Seems like something we should test:
>          KURL r = dl->doc()->url();
> -          if (r.protocol().startsWith("http") && r.path().isEmpty())
> +         if ((r.protocolIs("http") || r.protocolIs("https")) &&
> r.path().isEmpty())
>               r.setPath("/");

This seems a little suspicious to me. I think KURL should canonicalize empty schemes to "/"  for all "hierarchical" URLs in parse(). If you can get here with an empty path for http, it seems like you could get here with an empty path for ftp, etc.
Comment 30 Darin Adler 2008-02-14 09:35:44 PST
(In reply to comment #29)
> > Seems like something we should test:
> >          KURL r = dl->doc()->url();
> > -          if (r.protocol().startsWith("http") && r.path().isEmpty())
> > +         if ((r.protocolIs("http") || r.protocolIs("https")) &&
> > r.path().isEmpty())
> >               r.setPath("/");
> 
> This seems a little suspicious to me. I think KURL should canonicalize empty
> schemes to "/"  for all "hierarchical" URLs in parse(). If you can get here
> with an empty path for http, it seems like you could get here with an empty
> path for ftp, etc.

It's worth looking into.

Note that the fact that there is no such thing as an empty path is a property of the HTTP protocol; I'm not certain that it's necessarily true of all hierarchical schemes. I believe it's possible that you could have an FTP server that could handle "/" different from no path at all. That's not possible with HTTP, though, since the path has to be sent in the GET or POST, and there's no syntax for an empty path. Or maybe it's true of FTP too, but not necessarily of every single hierarchical scheme.

I'm not saying that what you're suggesting is wrong, but rather that it needs to be considered carefully. Moving the logic into KURL would be great! I don't like where it currently is.

I'd be happy to see this simplified and resolved further, although that seems a separate issue from this patch. The attempt with this patch is to make a more efficient implementation *without* changing semantics. We can also fix bugs having to do with incorrect semantics, and we should do it!
Comment 31 Darin Adler 2008-02-14 09:41:56 PST
(In reply to comment #28)
> Seems odd to have this outside of setPort:
>           case Port: {
> 180           int port = str.toInt();
> 181           if (port < 0 || port > 0xFFFF)
> 182               port = 0;
> 183           url.setPort(port);
> 184           break;
> 185       }

The setPort function takes an unsigned short. That's why the caller needs to check that the integer value fits into an unsigned short. This is simply "type conversion" code that knows that out of range values need to change to zeroes. There are probably many more-elegant ways of structuring this. If you have a suggestion, I'd be happy to improve it!

> I think you should s/string/url/ on these comments:
> KJS::JSValue* jsStringOrNull(const KURL&); // null if the string is null

Will do.

> These could be one line ifs:
> 102             if (HTMLAnchorElement* anchor = [self anchorElement]) {
> 1103                 NSURL *href = anchor->href();
> 1104                 return href;
> 1105             }
> } else if (m_renderer->isImage() && m_renderer->element() &&
> m_renderer->element()->hasTagName(imgTag)) {
> 1107             NSURL *src =
> static_cast<HTMLImageElement*>(m_renderer->element())->src();
> 1108             return src;
> 1109         }

Actually, they could not. We need the local variable to convert the type from KURL to NSURL *. The return type is "id", which is not specific enough. I originally wrote these as single lines. I decided that a local variable was superior to an explicit static_cast or a function call of some sort.

> Better if you could file a bug:
> 2185                 // FIXME: What's the point of changing the value of a
> local variable
> 2186                 // (title) that's not looked at afterward? Obviously
> there's no test
> 2187                 // case for this!
> 2188                 if (!n->isHTMLElement()) {
> 2189                     String modifiedTitle = title.domString();
> 2190                     title = modifiedTitle.replace('&', "&&");
> 2191                 }
> 21872192 #endif

Will do.

> Sigh.  Sadly KURL Node::baseURI() const is yet-another recursive walk through
> your parent chain using virtual methods. :(

Very sad, yes.

> We might want to add mailtoProtocol, httpProtocol, ftpProtocol AtomicStrings
> some day.

Could be. But the protocolIs function wants a const char*; an AtomicSting would not be helpful for those callsites. Perhaps some constants in the KURL header of type const char* would be good.

> Missing space before ':'
> page->chrome()->addMessageToConsole(HTMLMessageSource,
> 1481         isWarning(errorCode) ? WarningMessageLevel: ErrorMessageLevel,
> 1482         message, lineNumber, document->url().string());

Will fix.

> Those would be different depending on how it's exposed to the DOM bindings:
>       if (m_frame->tree() && m_frame->tree()->parent())
> -          return "";       
> +         return KURL();

Good point. The function is iconURL() and it's an internal function, not exposed to the DOM at all. I looked at the callers when I made this change.

> Seems like something we should test:
>          KURL r = dl->doc()->url();
> -          if (r.protocol().startsWith("http") && r.path().isEmpty())
> +         if ((r.protocolIs("http") || r.protocolIs("https")) &&
> r.path().isEmpty())
>               r.setPath("/");

I'm not sure what you mean exactly. The old code checked for both "http" and "https" using startsWith. It would also handle "httpasdfads", which was silly.

I don't think this merits a test case. I do think it's not really very good code. But I preserved its old semantics pretty closely.

> Why?
> -          DeprecatedString domain = r.host();
> -          if (dl->doc()->isHTMLDocument())
> -              domain =
> static_cast<HTMLDocument*>(dl->doc())->domain().deprecatedString();

I mentioned this in the change log. The domain string was computed, but never used.
Comment 32 Darin Adler 2008-02-14 18:15:36 PST
Created attachment 19132 [details]
patch, now with fewer broken platforms
Comment 34 Brett Wilson (Google) 2008-02-28 16:45:53 PST
Thanks for all the work, Darin!