Bug 39701 - UTF-16 code points compare() for String objects
Summary: UTF-16 code points compare() for String objects
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Luiz Agostini
URL:
Keywords:
Depends on:
Blocks: 39220
  Show dependency treegraph
 
Reported: 2010-05-25 17:16 PDT by Luiz Agostini
Modified: 2010-07-01 01:06 PDT (History)
7 users (show)

See Also:


Attachments
patch 1 (5.84 KB, patch)
2010-05-25 17:26 PDT, Luiz Agostini
darin: review-
Details | Formatted Diff | Diff
patch 2 (5.81 KB, patch)
2010-05-25 18:15 PDT, Luiz Agostini
no flags Details | Formatted Diff | Diff
patch 3 (6.50 KB, patch)
2010-05-25 19:50 PDT, Luiz Agostini
no flags Details | Formatted Diff | Diff
patch 4 (6.14 KB, patch)
2010-05-26 14:26 PDT, Luiz Agostini
darin: review-
Details | Formatted Diff | Diff
patch 5 (7.44 KB, patch)
2010-05-27 09:05 PDT, Luiz Agostini
darin: review-
darin: commit-queue-
Details | Formatted Diff | Diff
patch 6 (7.13 KB, patch)
2010-05-27 14:05 PDT, Luiz Agostini
darin: review+
Details | Formatted Diff | Diff
patch 7 (7.13 KB, patch)
2010-05-27 14:24 PDT, Luiz Agostini
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Luiz Agostini 2010-05-25 17:16:38 PDT
There is no good option for ascii String comparison that can be used for lexicographical sort in webkit.
Comment 1 Luiz Agostini 2010-05-25 17:26:52 PDT
Created attachment 57056 [details]
patch 1
Comment 2 Darin Adler 2010-05-25 17:28:37 PDT
Comment on attachment 57056 [details]
patch 1

This is not an ASCII-only comparison. It's comparing UTF-16 code points, so asciiOnlyCompare is not the right name for it. ASCII doesn't enter into it. Could you explain your goals here? Where do you want to use this function?
Comment 3 WebKit Review Bot 2010-05-25 17:29:05 PDT
Attachment 57056 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
JavaScriptCore/wtf/text/StringImpl.cpp:486:  l is incorrectly named. Don't use the single letter 'l' as an identifier name.  [readability/naming] [4]
Total errors found: 1 in 6 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 Luiz Agostini 2010-05-25 17:39:31 PDT
(In reply to comment #2)
> (From update of attachment 57056 [details])
> This is not an ASCII-only comparison. It's comparing UTF-16 code points, so asciiOnlyCompare is not the right name for it. ASCII doesn't enter into it. Could you explain your goals here? Where do you want to use this function?

I need to sort a vector of String objects in lexicographical order. For that I plan to use std::sort. The problem is that I did not find a way to do it easily.

Looking for sorts in webkit I found this

    return strcmp(parentDirectory().utf8().data(), compareTo.parentDirectory().utf8().data());

in PluginPackage::compare. This seems to be very expensive because it does a lot of copy/conversion to do the comparison.

Is there any good option for that?

The name is not good. I chose it just because the method do not consider localizations and is ok for ascii.
Comment 5 Luiz Agostini 2010-05-25 18:15:04 PDT
Created attachment 57059 [details]
patch 2
Comment 6 Darin Adler 2010-05-25 18:34:29 PDT
Comment on attachment 57059 [details]
patch 2

Seems fine to move this code where it can be used on a WTF string.

>  int compare(const UString& s1, const UString& s2)

The old version of this considers a null string equal to an empty string. The new version considers a null string to be less than an empty string. Are you sure it's OK to change behavior in that way?

Do we really need a UString::compare member function? Why is that better than what we already had?

> +    static int compare(const StringImpl*, const StringImpl*);

Is there some reason this is better as a String and StringImpl static member instead of being a free function? I think a free function would be better.
Comment 7 Luiz Agostini 2010-05-25 19:39:21 PDT
(In reply to comment #6)
> (From update of attachment 57059 [details])
> Seems fine to move this code where it can be used on a WTF string.
> 
> >  int compare(const UString& s1, const UString& s2)
> 
> The old version of this considers a null string equal to an empty string. The new version considers a null string to be less than an empty string. Are you sure it's OK to change behavior in that way?

no. do not want to change any behavior.

> 
> Do we really need a UString::compare member function? Why is that better than what we already had?
> 
> > +    static int compare(const StringImpl*, const StringImpl*);
> 
> Is there some reason this is better as a String and StringImpl static member instead of being a free function? I think a free function would be better.

I am concerned about overloads because of all these string types and all those not explicit constructors.
I think that some scope delimiters would help. :)
Comment 8 Luiz Agostini 2010-05-25 19:50:18 PDT
Created attachment 57065 [details]
patch 3
Comment 9 Alexey Proskuryakov 2010-05-25 19:57:56 PDT
I still don't understand what this function is for. You said that you needed lexicographical order, but that's not what it implements. Could you perhaps want Collator.h?

If it's to be exposed, it needs a name that makes it clear that the function does something that's almost never appropriate.
Comment 10 Luiz Agostini 2010-05-25 20:33:28 PDT
(In reply to comment #9)
> I still don't understand what this function is for. You said that you needed lexicographical order, but that's not what it implements. Could you perhaps want Collator.h?
> 
> If it's to be exposed, it needs a name that makes it clear that the function does something that's almost never appropriate.

I want to compare lists of strings. I will sort then and then compare each string. Just to identify lists that contain exactly the same strings any comparison is ok.

It happens that the strings I will manipulate are compose by ascii characters only. I am considering that this proposed code point comparison, for ascii characters only in english, would be lexicographical.

Collator.h may be used.
Comment 11 Luiz Agostini 2010-05-25 21:01:29 PDT
(In reply to comment #10)
> (In reply to comment #9)
> > I still don't understand what this function is for. You said that you needed lexicographical order, but that's not what it implements. Could you perhaps want Collator.h?
> > 
> > If it's to be exposed, it needs a name that makes it clear that the function does something that's almost never appropriate.
> 
> I want to compare lists of strings. I will sort then and then compare each string. Just to identify lists that contain exactly the same strings any comparison is ok.
> 
> It happens that the strings I will manipulate are compose by ascii characters only. I am considering that this proposed code point comparison, for ascii characters only in english, would be lexicographical.
> 
> Collator.h may be used.

Actual use cases is serialisation of media queries (bug 39220).
Comment 12 Alexey Proskuryakov 2010-05-26 00:13:46 PDT
I see. Comparing code points is generally not sufficient even for equality checking (because the same character can be encoded in composed and decomposed forms). This will of course work for ASCII, but it's important to not make this function look like general purpose one.

> Actual use cases is serialisation of media queries (bug 39220).

Are these case sensitive?
Comment 13 Kenneth Rohde Christiansen 2010-05-26 05:32:49 PDT
(In reply to comment #12)
> I see. Comparing code points is generally not sufficient even for equality checking (because the same character can be encoded in composed and decomposed forms). This will of course work for ASCII, but it's important to not make this function look like general purpose one.
> 
> > Actual use cases is serialisation of media queries (bug 39220).
> 
> Are these case sensitive?

They are supposed to be written in lowercase, but if you write "NOT" it will be seen as "not". I am not sure if it is allowed to write (COLOR) instead of (color), but that is at least not handled by the parser right now.
Comment 14 Luiz Agostini 2010-05-26 13:41:29 PDT
(In reply to comment #13)
> (In reply to comment #12)
> > I see. Comparing code points is generally not sufficient even for equality checking (because the same character can be encoded in composed and decomposed forms). This will of course work for ASCII, but it's important to not make this function look like general purpose one.
> > 
> > > Actual use cases is serialisation of media queries (bug 39220).
> > 
> > Are these case sensitive?
> 
> They are supposed to be written in lowercase, but if you write "NOT" it will be seen as "not". I am not sure if it is allowed to write (COLOR) instead of (color), but that is at least not handled by the parser right now.

The comparison if the strings is case sensitive. (http://dev.w3.org/csswg/cssom/#comparing-media-queries)
Comment 15 Luiz Agostini 2010-05-26 14:26:35 PDT
Created attachment 57137 [details]
patch 4
Comment 16 Luiz Agostini 2010-05-26 17:08:21 PDT
Darin, could you take a look if you have time?
Comment 17 Darin Adler 2010-05-26 17:29:36 PDT
Comment on attachment 57137 [details]
patch 4

It’s fine to have a comparison function that just compares the UChars in a string and does no case folding or normalization.

A) But I don't see any reason to make this comparison function be a class member. Neither a static member nor a non-static member. It should just be a free function, the way the function in UString.h already is. We can overload it to take more types. This patch instead adds a member function to StringImpl, but there's no reason for it to be a member function. It should just be a free function.

I think codePointCompare is an OK name for the function. Better than just "compare" since it might convince people not to use it in cases where they shouldn’t.

B) We should rename the one that works on UString too, to be consistent. I believe it's only called in exactly one place in JavaScriptCore.

C) The comparison function for UString that just calls through to the codePointCompare that takes StringImpl* should be inlined. It's in a hot code path and I don't want an extra level of function call added.

Now that I see the one place where codePointCompare is called inside JavaScriptCore, I realize that it's a code path where the strings can never be null strings. That has two implications:

    D) We don't have to treat null strings and empty strings as equal.
    E) If we wanted to we could make things slightly more efficient by making that call site not even check for null.

review-, but I think you're getting close now to something we should land. You should fix at least (A) and (C) above. And (B) is also a good idea and probably easy to do. No real need to do (D) or (E), though.
Comment 18 Luiz Agostini 2010-05-27 09:05:42 PDT
Created attachment 57251 [details]
patch 5
Comment 19 Luiz Agostini 2010-05-27 09:24:04 PDT
(In reply to comment #17)
> (From update of attachment 57137 [details])
> It’s fine to have a comparison function that just compares the UChars in a string and does no case folding or normalization.
> 
> A) But I don't see any reason to make this comparison function be a class member. Neither a static member nor a non-static member. It should just be a free function, the way the function in UString.h already is. We can overload it to take more types. This patch instead adds a member function to StringImpl, but there's no reason for it to be a member function. It should just be a free function.

done

> 
> I think codePointCompare is an OK name for the function. Better than just "compare" since it might convince people not to use it in cases where they shouldn’t.

done

> 
> B) We should rename the one that works on UString too, to be consistent. I believe it's only called in exactly one place in JavaScriptCore.

done

> 
> C) The comparison function for UString that just calls through to the codePointCompare that takes StringImpl* should be inlined. It's in a hot code path and I don't want an extra level of function call added.

done

> 
> Now that I see the one place where codePointCompare is called inside JavaScriptCore, I realize that it's a code path where the strings can never be null strings. That has two implications:
> 
>     D) We don't have to treat null strings and empty strings as equal.
>     E) If we wanted to we could make things slightly more efficient by making that call site not even check for null.

Because of not being a committer, although I am always watching the bots I may not be able to react in time in an eventual problem with any of the platforms. That is why I decided not to make D and E for now.
I am currently waiting for a committer nomination approval. :)

> 
> review-, but I think you're getting close now to something we should land. You should fix at least (A) and (C) above. And (B) is also a good idea and probably easy to do. No real need to do (D) or (E), though.

thanks for the review so far. :)
Comment 20 Darin Adler 2010-05-27 13:11:49 PDT
Comment on attachment 57251 [details]
patch 5

> +        friend int codePointCompare(const UString&, const UString&);

There's no need for this function to be a friend.

> +    inline int codePointCompare(const UString& s1, const UString& s2)
> +    {
> +        return codePointCompare(s1.m_rep.get(), s2.m_rep.get());
> +    }

This can use the rep() function instead of getting at m_rep directly.

r=me if you remove the unnecessary friendship
Comment 21 Luiz Agostini 2010-05-27 14:05:47 PDT
Created attachment 57274 [details]
patch 6
Comment 22 Darin Adler 2010-05-27 14:08:58 PDT
Comment on attachment 57274 [details]
patch 6

> +int codePointCompare(const String& a, const String& b);

Damn, forgot to ask you to remove these argument names. We can always do that later.
Comment 23 Luiz Agostini 2010-05-27 14:18:46 PDT
(In reply to comment #22)
> (From update of attachment 57274 [details])
> > +int codePointCompare(const String& a, const String& b);
> 
> Damn, forgot to ask you to remove these argument names. We can always do that later.

I can do it now. just one minute
Comment 24 Luiz Agostini 2010-05-27 14:24:35 PDT
Created attachment 57276 [details]
patch 7
Comment 25 WebKit Commit Bot 2010-05-27 16:44:06 PDT
Comment on attachment 57276 [details]
patch 7

Clearing flags on attachment: 57276

Committed r60332: <http://trac.webkit.org/changeset/60332>
Comment 26 WebKit Commit Bot 2010-05-27 16:44:13 PDT
All reviewed patches have been landed.  Closing bug.
Comment 27 Simon Hausmann 2010-05-31 05:44:11 PDT
Luiz, this is currently scheduled for integration into the 2.0 release branch. Is this intentional?

The patch doesn't apply - due to renamings - so I'd need your help if we really want to integrate it :)
Comment 28 Luiz Agostini 2010-05-31 06:47:48 PDT
(In reply to comment #27)
> Luiz, this is currently scheduled for integration into the 2.0 release branch. Is this intentional?
> 
> The patch doesn't apply - due to renamings - so I'd need your help if we really want to integrate it :)

Simon,

bugs 39220 and 37205 depend on this one. I expect bug 39220 to get in soon. 37205 depends on how specification evolves.

I think that we could wait until 37205 get in and then integrate them all.
Comment 29 Simon Hausmann 2010-05-31 07:46:00 PDT
(In reply to comment #28)
> (In reply to comment #27)
> > Luiz, this is currently scheduled for integration into the 2.0 release branch. Is this intentional?
> > 
> > The patch doesn't apply - due to renamings - so I'd need your help if we really want to integrate it :)
> 
> Simon,
> 
> bugs 39220 and 37205 depend on this one. I expect bug 39220 to get in soon. 37205 depends on how specification evolves.
> 
> I think that we could wait until 37205 get in and then integrate them all.

I see, that makes sense.

Could you make a version of the string patch that applies against qtwebkit-2.0? It should be straightforward, most of the string classes simply moved from WebCore/platform/text into JavaScriptCore/wtf/text but the methods and class names remained.
Comment 30 Simon Hausmann 2010-07-01 01:06:37 PDT
Luiz: ping. I need your help with a patch for the release branch :)