Bug 137165 - Support modern for loops over StringView code units and code points
Summary: Support modern for loops over StringView code units and code points
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Myles C. Maxfield
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-26 19:49 PDT by Myles C. Maxfield
Modified: 2014-10-03 10:58 PDT (History)
10 users (show)

See Also:


Attachments
Patch (9.45 KB, patch)
2014-09-26 20:03 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (9.95 KB, patch)
2014-09-28 21:53 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (10.67 KB, patch)
2014-09-28 22:37 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (10.66 KB, patch)
2014-09-28 22:41 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (10.67 KB, patch)
2014-09-28 22:43 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (13.50 KB, patch)
2014-09-29 12:18 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (10.50 KB, patch)
2014-09-30 09:48 PDT, Myles C. Maxfield
no flags Details | Formatted Diff | Diff
Patch (10.04 KB, patch)
2014-10-02 17:31 PDT, Myles C. Maxfield
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Myles C. Maxfield 2014-09-26 19:49:44 PDT
Support modern for loops over StringViews
Comment 1 Myles C. Maxfield 2014-09-26 20:03:16 PDT
Created attachment 238754 [details]
Patch
Comment 2 Darin Adler 2014-09-28 16:37:54 PDT
Comment on attachment 238754 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=238754&action=review

I think this is a pretty good idea.

But what guarantees these objects don’t last longer than the underlying StringView? It seems really easy to make the mistake of keeping one of these alive when the StringView is gone. For example, this code would not work:

    for (UChar character : functionReturningStringView().codeUnits())

I think we should consider a different version of these that will work even if the StringView has been destroyed. Probably the best way to do this is to use a StringView rather than a const StringView& as a data member of the iterator.

We need to test how codePoints works on invalid sequences; such as when the last character is the lead and there is no trail, and when a trail appears with no lead. U16_NEXT handles these cases, but I am not sure that our code handles them correctly.

I am not sure that

> Source/WTF/ChangeLog:8
> +        This patch adds two functions, codepoints() and codeunits(), on StringView.

Code point and code unit are each two word phrases, not single words. Lets not smash them together into one word. uNless there is a good reason.

> Source/WTF/wtf/text/StringView.h:40
> +class StringViewCodepoints {

I think this should be a member class. So it would be declared inside the StringView class.

    class StringView {
        ... 
        class CodePoints;

Then later, like this:

    class StringView::CodePoints {
        ...
    };

> Source/WTF/wtf/text/StringView.h:47
> +        bool operator!=(const Iterator&) const;

I think we should supply both == and != operators.

> Source/WTF/wtf/text/StringView.h:57
> +    StringViewCodepoints(const StringView&);

Might want this to be explicit.

> Source/WTF/wtf/text/StringView.h:60
> +private:

Need a blank line before this.

> Source/WTF/wtf/text/StringView.h:240
> +    StringViewCodepoints codepoints() const
> +    {
> +        return StringViewCodepoints(*this);
> +    }
> +
> +    StringViewCodeunits codeunits() const
> +    {
> +        return StringViewCodeunits(*this);
> +    }

Please put the bodies of these functions outside the class. I have a patch coming soon that does this for all StringView member functions.

> Source/WTF/wtf/text/StringView.h:348
> +    if (m_cacheIsValid) {
> +        m_index = m_nextIndexCache;
> +        m_cacheIsValid = false;

Would read better with early return.

> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:158
> +    for (auto codepoint : StringView(builder.toString()).codepoints()) {

I think this code would crash if it was run under a memory debugger; the StringView won’t stay alive.
Comment 3 Myles C. Maxfield 2014-09-28 21:47:57 PDT
Comment on attachment 238754 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=238754&action=review

>> Source/WTF/ChangeLog:8
>> +        This patch adds two functions, codepoints() and codeunits(), on StringView.
> 
> Code point and code unit are each two word phrases, not single words. Lets not smash them together into one word. uNless there is a good reason.

Done.

>> Source/WTF/wtf/text/StringView.h:40
>> +class StringViewCodepoints {
> 
> I think this should be a member class. So it would be declared inside the StringView class.
> 
>     class StringView {
>         ... 
>         class CodePoints;
> 
> Then later, like this:
> 
>     class StringView::CodePoints {
>         ...
>     };

Done. I really like the translation of StringViewCodePoints to StringView::CodePoints. There is a way to describe exactly what I'm trying to describe!

>> Source/WTF/wtf/text/StringView.h:47
>> +        bool operator!=(const Iterator&) const;
> 
> I think we should supply both == and != operators.

Done.

>> Source/WTF/wtf/text/StringView.h:57
>> +    StringViewCodepoints(const StringView&);
> 
> Might want this to be explicit.

Done.

>> Source/WTF/wtf/text/StringView.h:60
>> +private:
> 
> Need a blank line before this.

Done.

>> Source/WTF/wtf/text/StringView.h:240
>> +    }
> 
> Please put the bodies of these functions outside the class. I have a patch coming soon that does this for all StringView member functions.

Done.

>> Source/WTF/wtf/text/StringView.h:348
>> +        m_cacheIsValid = false;
> 
> Would read better with early return.

Done.

>> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:158
>> +    for (auto codepoint : StringView(builder.toString()).codepoints()) {
> 
> I think this code would crash if it was run under a memory debugger; the StringView won’t stay alive.

I tested this out and you're right. I've made the internal member a StringView instead of a const StringView&.
Comment 4 Myles C. Maxfield 2014-09-28 21:53:23 PDT
Created attachment 238839 [details]
Patch
Comment 5 Myles C. Maxfield 2014-09-28 22:37:56 PDT
Created attachment 238840 [details]
Patch
Comment 6 Myles C. Maxfield 2014-09-28 22:41:41 PDT
Created attachment 238841 [details]
Patch
Comment 7 Myles C. Maxfield 2014-09-28 22:43:30 PDT
Created attachment 238842 [details]
Patch
Comment 8 Said Abou-Hallawa 2014-09-29 10:29:38 PDT
I have two comments on this change:

1. There is no way to use this code to traverse the string backward.  The std::string creates  reverse beginning and  reverse end iterators which can be used to traverse the string backward.  Maybe you need to do the same for your iterators.  Or maybe we can add operator--() to your iterators.

2. It was had for me to keep track to with your code in StringView::CodePoints::Iterator::operator++() and StringView::CodePoints::Iterator::operator*().  It seems you have two different cases: is8Bit() and !is8Bit().  When is8Bit() is true you do not need to cache the values m_nextIndexCache and m_codepointCache.  But you are doing the following:

  a) you are not initializing m_codepointCache in the constructor. I know it is guarded by having m_cacheIsValid set to false in the constructor.  And m_codepointCache is going to be fixed in operator++() or operator*() if is8Bit() is false.  But I think initializing it to nullptr in the constructor might make things clearer.
  b) In StringView::CodePoints::Iterator::operator*() and if m_stringView.is8Bit() you are ++m_nextIndexCache.  This looks weird to me to have the operator*() increments the index cache in this cache.  And actually it can produce bugs: Consider this example:
    StringView strView = StringView(String("ABC")).codePoints();
    Iterator it = strView.begin();
    std::cout << *it;
    std::cout << *it;
    it++;
    std::cout << *it;

Result will be:
  AAC

Expected:
  AAB

The reason is in  operator*() we moved the m_nextIndexCache twice and in operator++() we set the m_index to m_nextIndexCache which was '2'.
  c) I would suggest in StringView::CodePoints::Iterator::operator*() and StringView::CodePoints::Iterator::operator++() to handle the cases is8Bit() and !is8Bit() in completely different code paths since when is8Bit() is true, you do not need the cache and when is8Bit()  is true you need the cache.  I personally like the code when it looks like a tree:  The code starts and diverges and keep diverging.  But I got confused when it looks like a diamond: The code starts, diverge and then converge again.  The code in these two functions looks like a diamond.
Comment 9 Myles C. Maxfield 2014-09-29 11:27:56 PDT
Comment on attachment 238842 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=238842&action=review

> Source/WTF/wtf/text/StringView.h:387
> +        ++m_nextIndexCache;

This is wrong if operator++() has been called multiple times since the last call to operator*()
Comment 10 Myles C. Maxfield 2014-09-29 12:14:23 PDT
Replies inline.
(In reply to comment #8)
> I have two comments on this change:
> 
> 1. There is no way to use this code to traverse the string backward.  The std::string creates  reverse beginning and  reverse end iterators which can be used to traverse the string backward.  Maybe you need to do the same for your iterators.  Or maybe we can add operator--() to your iterators.

Yep. That is a non-goal. If we need this in the future, we can add it in.

> 
> 2. It was had for me to keep track to with your code in StringView::CodePoints::Iterator::operator++() and StringView::CodePoints::Iterator::operator*().  It seems you have two different cases: is8Bit() and !is8Bit().  When is8Bit() is true you do not need to cache the values m_nextIndexCache and m_codepointCache.  But you are doing the following:
> 
>   a) you are not initializing m_codepointCache in the constructor. I know it is guarded by having m_cacheIsValid set to false in the constructor.  And m_codepointCache is going to be fixed in operator++() or operator*() if is8Bit() is false.  But I think initializing it to nullptr in the constructor might make things clearer.

The m_cacheIsValid should be enough, but it's cheap enough to do this that I'll do it for readability.

>   b) In StringView::CodePoints::Iterator::operator*() and if m_stringView.is8Bit() you are ++m_nextIndexCache.  This looks weird to me to have the operator*() increments the index cache in this cache.  And actually it can produce bugs: Consider this example:
>     StringView strView = StringView(String("ABC")).codePoints();
>     Iterator it = strView.begin();
>     std::cout << *it;
>     std::cout << *it;
>     it++;
>     std::cout << *it;
> 
> Result will be:
>   AAC
> 
> Expected:
>   AAB
> 
> The reason is in  operator*() we moved the m_nextIndexCache twice and in operator++() we set the m_index to m_nextIndexCache which was '2'.

Line 390 makes this not possible. However, you bring up a good point - I don't have any tests regarding calling operator++() or operator*() multiple times in a row. I've added some tests.

>   c) I would suggest in StringView::CodePoints::Iterator::operator*() and StringView::CodePoints::Iterator::operator++() to handle the cases is8Bit() and !is8Bit() in completely different code paths since when is8Bit() is true, you do not need the cache and when is8Bit()  is true you need the cache.  I personally like the code when it looks like a tree:  The code starts and diverges and keep diverging.  But I got confused when it looks like a diamond: The code starts, diverge and then converge again.  The code in these two functions looks like a diamond.

I agree with your premise but disagree with your conclusion. I don't think that having four iterator classes would be worth the amount of complexity that could be avoided. There are only 4 variables inside CodePoints, and each of the 4 is a (variation on an) int.
Comment 11 Myles C. Maxfield 2014-09-29 12:18:20 PDT
(In reply to comment #8)
> I have two comments on this change:
> 
> 1. There is no way to use this code to traverse the string backward.  The std::string creates  reverse beginning and  reverse end iterators which can be used to traverse the string backward.  Maybe you need to do the same for your iterators.  Or maybe we can add operator--() to your iterators.
> 
> 2. It was had for me to keep track to with your code in StringView::CodePoints::Iterator::operator++() and StringView::CodePoints::Iterator::operator*().  It seems you have two different cases: is8Bit() and !is8Bit().  When is8Bit() is true you do not need to cache the values m_nextIndexCache and m_codepointCache.  But you are doing the following:
> 
>   a) you are not initializing m_codepointCache in the constructor. I know it is guarded by having m_cacheIsValid set to false in the constructor.  And m_codepointCache is going to be fixed in operator++() or operator*() if is8Bit() is false.  But I think initializing it to nullptr in the constructor might make things clearer.
>   b) In StringView::CodePoints::Iterator::operator*() and if m_stringView.is8Bit() you are ++m_nextIndexCache.  This looks weird to me to have the operator*() increments the index cache in this cache.  And actually it can produce bugs: Consider this example:
>     StringView strView = StringView(String("ABC")).codePoints();
>     Iterator it = strView.begin();
>     std::cout << *it;
>     std::cout << *it;
>     it++;
>     std::cout << *it;
> 
> Result will be:
>   AAC
> 
> Expected:
>   AAB
> 
> The reason is in  operator*() we moved the m_nextIndexCache twice and in operator++() we set the m_index to m_nextIndexCache which was '2'.
>   c) I would suggest in StringView::CodePoints::Iterator::operator*() and StringView::CodePoints::Iterator::operator++() to handle the cases is8Bit() and !is8Bit() in completely different code paths since when is8Bit() is true, you do not need the cache and when is8Bit()  is true you need the cache.  I personally like the code when it looks like a tree:  The code starts and diverges and keep diverging.  But I got confused when it looks like a diamond: The code starts, diverge and then converge again.  The code in these two functions looks like a diamond.

I've added some ASSERTs that should make it more obvious what is going on.
Comment 12 Myles C. Maxfield 2014-09-29 12:18:44 PDT
Created attachment 238865 [details]
Patch
Comment 13 Myles C. Maxfield 2014-09-30 09:48:12 PDT
Created attachment 238931 [details]
Patch
Comment 14 Darin Adler 2014-10-02 16:17:54 PDT
Comment on attachment 238931 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=238931&action=review

review- because we don’t want StringView == StringView to do what is coded here.

> Source/WTF/wtf/text/StringView.h:41
> +    class CodePoints;
> +    class CodeUnits;

I suggest defining these right before they are used instead of at the top of the class.

> Source/WTF/wtf/text/StringView.h:188
> +    bool operator==(const StringView&) const;
> +    bool operator!=(const StringView&) const;

These operators should be implemented as free functions, not member functions. If these are member functions, then no type conversions can be done on the left side of the == and !=, but if they are free functions they can be.

> Source/WTF/wtf/text/StringView.h:247
> +#if !ASSERT_DISABLED
> +        mutable Mode m_mode;
> +#endif

I think a boolean named:

    bool m_alreadyIncremented;

or something like that, would be clearer than this enum, which I find mysterious.

> Source/WTF/wtf/text/StringView.h:339
> +inline bool StringView::operator==(const StringView& other) const
> +{
> +    return m_characters == other.m_characters && m_length == other.m_length;
> +}

I don’t think it’s good idea to have a StringView == operator that checks identity rather than contents of the strings. I think an == operator on two string views should compare the actual characters, not the pointers. And should return true if the two views look at two different ranges of memory with the same characters, even if one is 8-bit and the other 16-bit.

Or we could just not have an == operator, but having one with this meaning seems like a bad idea.

> Source/WTF/wtf/text/StringView.h:396
> +    return m_stringView == other.m_stringView && m_index == other.m_index;

This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds.

> Source/WTF/wtf/text/StringView.h:438
> +    return m_stringView == other.m_stringView && m_index == other.m_index;

This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds.
Comment 15 Myles C. Maxfield 2014-10-02 17:01:50 PDT
Comment on attachment 238931 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=238931&action=review

>> Source/WTF/wtf/text/StringView.h:41
>> +    class CodeUnits;
> 
> I suggest defining these right before they are used instead of at the top of the class.

Done.

>> Source/WTF/wtf/text/StringView.h:188
>> +    bool operator!=(const StringView&) const;
> 
> These operators should be implemented as free functions, not member functions. If these are member functions, then no type conversions can be done on the left side of the == and !=, but if they are free functions they can be.

Given your comment below, I think providing these operators is out of scope for this patch.

>> Source/WTF/wtf/text/StringView.h:247
>> +#endif
> 
> I think a boolean named:
> 
>     bool m_alreadyIncremented;
> 
> or something like that, would be clearer than this enum, which I find mysterious.

Done.

>> Source/WTF/wtf/text/StringView.h:339
>> +}
> 
> I don’t think it’s good idea to have a StringView == operator that checks identity rather than contents of the strings. I think an == operator on two string views should compare the actual characters, not the pointers. And should return true if the two views look at two different ranges of memory with the same characters, even if one is 8-bit and the other 16-bit.
> 
> Or we could just not have an == operator, but having one with this meaning seems like a bad idea.

Given your comment below, I think providing an == operator is out of scope for this patch. There are different ways to look at the characters inside a StringView: we could compare them code point by code point, or we could use ICU to canonicalize the strings and then compare the canonicalized versions. We could compare code units instead if we wanted to avoid the U16_NEXT() and know that both strings are in UTF-16.

>> Source/WTF/wtf/text/StringView.h:396
>> +    return m_stringView == other.m_stringView && m_index == other.m_index;
> 
> This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds.

Done.

>> Source/WTF/wtf/text/StringView.h:438
>> +    return m_stringView == other.m_stringView && m_index == other.m_index;
> 
> This should assert that &m_stringView == &other.m_stringView and should not look at m_stringView at all in non-debug builds.

Done.
Comment 16 Myles C. Maxfield 2014-10-02 17:31:09 PDT
Created attachment 239162 [details]
Patch
Comment 17 Darin Adler 2014-10-02 17:36:15 PDT
Comment on attachment 239162 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=239162&action=review

> Source/WTF/wtf/text/StringView.h:223
> +        mutable bool m_alreadyIncremented;

I guess this isn’t as clear as I thought it would be, because I meant exactly the opposite of what you did!

> Source/WTF/wtf/text/StringView.h:389
> +    m_alreadyIncremented = false;

My idea is that we would be setting this to true here, because as part of the * operator, we “already incremented” the iterator.
Comment 18 Darin Adler 2014-10-03 09:08:20 PDT
Comment on attachment 239162 [details]
Patch

This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this!
Comment 19 Myles C. Maxfield 2014-10-03 09:11:32 PDT
(In reply to comment #18)
> (From update of attachment 239162 [details])
> This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this!

Interesting. I thought you meant I should make line 380 set it to true because, once you've run the increment operator, the iterator has already been incremented, so you can't increment it again.
Comment 20 Darin Adler 2014-10-03 09:17:04 PDT
(In reply to comment #19)
> (In reply to comment #18)
> > This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this!
> 
> Interesting. I thought you meant I should make line 380 set it to true because, once you've run the increment operator, the iterator has already been incremented, so you can't increment it again.

Yes, makes logical sense but is not what I meant.

I meant that once you use the * operator the iterator is in an unusual state. It has already been incremented, so when you then actually use the ++ operator you don’t have to do any incrementing. Further, it’s not safe to do operations like == on it when it’s in this “already incremented” state because they will yield incorrect results. So, critically, StringView::CodePoints::Iterator::operator== is missing assertions that check that neither of the two iterators is in this unusual “already incremented” state, whatever we call the boolean.
Comment 21 Myles C. Maxfield 2014-10-03 09:34:44 PDT
(In reply to comment #20)
> (In reply to comment #19)
> > (In reply to comment #18)
> > > This is OK, but I think the m_alreadyIncremented boolean is confusing. What I proposed was actually the opposite of what you wrote, showing that the two of us don’t even agree on this!
> > 
> > Interesting. I thought you meant I should make line 380 set it to true because, once you've run the increment operator, the iterator has already been incremented, so you can't increment it again.
> 
> Yes, makes logical sense but is not what I meant.
> 
> I meant that once you use the * operator the iterator is in an unusual state. It has already been incremented, so when you then actually use the ++ operator you don’t have to do any incrementing. Further, it’s not safe to do operations like == on it when it’s in this “already incremented” state because they will yield incorrect results. So, critically, StringView::CodePoints::Iterator::operator== is missing assertions that check that neither of the two iterators is in this unusual “already incremented” state, whatever we call the boolean.

Ah, I understand. Your thinking of the term "increment" is referring to the internal m_index variable, and my thinking of the term "increment" is referring to the operator++ API call.  I'll switch the boolean (and add the missing ASSERTs).
Comment 22 Myles C. Maxfield 2014-10-03 10:58:11 PDT
http://trac.webkit.org/changeset/174271