Bug 163225

Summary: StringView.split() should use an iterator design pattern instead of allocating a Vector
Product: WebKit Reporter: Alex Christensen <achristensen>
Component: Web Template FrameworkAssignee: Daniel Bates <dbates>
Status: RESOLVED FIXED    
Severity: Normal CC: benjamin, cdumez, cmarcelo, commit-queue, darin, dbates
Priority: P2    
Version: Other   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch and unit tests
none
Patch and unit tests darin: review+

Description Alex Christensen 2016-10-10 11:08:42 PDT
See feedback from https://bugs.webkit.org/show_bug.cgi?id=161920
Maybe we should do the same with String.split.
Comment 1 Daniel Bates 2017-01-10 17:20:00 PST
Created attachment 298528 [details]
Patch and unit tests
Comment 2 WebKit Commit Bot 2017-01-10 17:20:51 PST
Attachment 298528 [details] did not pass style-queue:


ERROR: Source/WTF/wtf/text/StringView.cpp:104:  Code inside a namespace should not be indented.  [whitespace/indent] [4]
Total errors found: 1 in 7 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Darin Adler 2017-01-10 23:39:01 PST
Comment on attachment 298528 [details]
Patch and unit tests

View in context: https://bugs.webkit.org/attachment.cgi?id=298528&action=review

Some of these new functions should be inlined, I think. I mentioned some specific ones, but maybe even more of them. I’m not 100% sure how to figure out the right level of inlining.

> Source/WTF/wtf/text/StringView.cpp:117
> +auto StringView::Fields::end() const -> Iterator
> +{
> +    return Iterator { *this, m_stringView.length() };
> +}

Might want to make an even more efficient implementation of this that uses a different constructor, although we might have to pull some trick since there’s no trivial way to have two constructors that both have no arguments (the enum trick perhaps). There is no reason to call StringView::find when initializing. Could call something that sticks the length into both m_beginPosition and m_endPosition.

> Source/WTF/wtf/text/StringView.cpp:125
> +    unsigned length = fields.m_stringView.length();
> +    if (m_beginPosition > length)
> +        m_beginPosition = length;

I would have used std::min for this, but I suggest removing this code entirely, since I don’t see the need to support startOffset.

> Source/WTF/wtf/text/StringView.cpp:129
> +void StringView::Fields::Iterator::findNextField()

This function should be named findNextSubstring; I am not sure that the things we get when splitting are "fields".

> Source/WTF/wtf/text/StringView.cpp:131
> +    const StringView& stringView = m_fields.m_stringView;

Since the stringView is only used twice, I am not sure we should bother with the local variable. Or if you want to keep it, I would use auto:

    auto& string = m_fields.m_string;

> Source/WTF/wtf/text/StringView.cpp:137
> +    while ((m_endPosition = stringView.find(m_fields.m_separator, m_beginPosition)) != notFound) {
> +        if (m_endPosition != m_beginPosition)
> +            return;
> +        m_beginPosition = m_endPosition + 1;
> +    }
> +    m_endPosition = stringView.length();

If we change m_endPosition to be unsigned then we can’t have it directly get the result of find, but I think that might be better anyway, writing it more like this. However, in this version the for line is so long, so maybe that’s a reason to not do this version.

    for (size_t separatorPosition; (separatorPosition = m_fields.m_string.find(m_fields.separator, m_beginPosition)) != notFound, ++m_beginPosition) {
        if (separatorPosition > m_beginPosition) {
            m_endPosition = separatorPosition;
            return;
        }
    }
    m_endPosition = m_fields.m_string.length();

> Source/WTF/wtf/text/StringView.cpp:142
> +    return m_fields.m_stringView.substring(m_beginPosition, m_endPosition - m_beginPosition);

This returns the empty string when at the end. But I think it would be even better to assert. We would just add this:

    ASSERT(m_beginPosition < m_endPosition);
    ASSERT(m_endPosition < m_fields.m_string.length());

Seems like we should consider inlining this function since boils down to a single function call.

> Source/WTF/wtf/text/StringView.cpp:152
> +    unsigned length = m_fields.m_stringView.length();
> +    m_beginPosition = m_endPosition + 1;
> +    if (m_beginPosition > length)
> +        m_beginPosition = length;
> +    findNextField();
> +    return *this;

I suggest this alternate implementation:

    ASSERT(m_beginPosition < m_endPosition);
    if (m_endPosition == m_fields.m_stringView.length()) {
        m_beginPosition = m_endPosition;
        return *this;
    }
    m_beginPosition = m_endPosition + 1;
    findNextField();
    return *this;

The assertion is the one we would hit if we erroneously call ++ after reaching the end. Could also add this internal consistency check assertion:

    ASSERT(m_endPosition <= m_fields.m_string.length());

> Source/WTF/wtf/text/StringView.cpp:159
> +    return m_beginPosition == other.m_beginPosition && m_endPosition == other.m_endPosition
> +        && &m_fields.m_stringView == &other.m_fields.m_stringView
> +        && m_fields.m_separator == other.m_fields.m_separator;

If I was writing this, I would assert that the iterators are from the same call to split, rather than returning false in such cases. There is no valid reason to compare iterators from two different calls to split.

There is some confused code here. The last two comparisons do the same thing as this:

    &m_fields == &other.m_fields

Since m_stringView is a StringView, and not a const StringView&, and should be, its address is just an internal pointer inside the Fields structure. So it will only be equal if we have two references to the same Fields structure. And also, each substring is uniquely identified by the begin position and the end position. So we don’t need to compare both. This implementation would work well, the assertion would catch you if you used == on iterators from two different calls to split.

    ASSERT(&m_fields == &other.m_fields);
    return m_beginPosition == other.m_beginPosition;

Once we write it this way, it becomes clear that is a great candidate for inlining since the non-assert part of the function is a single super-simple expression.

Here are five other internal consistency check assertions we could include if we like:

    ASSERT((m_beginPosition == other.m_beginPosition) == (m_endPosition == other.m_endPosition));
    ASSERT(m_beginPosition <= m_endPosition);
    ASSERT(m_endPosition <= m_fields.m_string.length());
    ASSERT(other.m_beginPosition <= other.m_endPosition);
    ASSERT(other.m_endPosition <= other.m_fields.m_string.length());

> Source/WTF/wtf/text/StringView.h:121
> +    class Fields;

I don’t think Fields is a great name for this class. I would call this SplitResult or SplitSubstrings.

Later we might need different names for variations because we might need different flavors of split. Split based on a character, a string, and maybe even a regular expression. Split returning the appropriate number of empty substrings, alongside this kind of split that skips all empty substrings.

> Source/WTF/wtf/text/StringView.h:122
> +    WTF_EXPORT_STRING_API Fields split(UChar, unsigned startOffset = 0) const;

It’s easy to use substring on a StringView to start at an arbitrary location, and it is efficient and does not memory allocation, so the startOffset seems like an unneeded feature and I suggest omitting it. These were needed in the String class because substrings there are expensive and to be avoided. If one was going to call string.split('/', 1) one could call string.subview(1).split('/') instead.

> Source/WTF/wtf/text/StringView.h:620
> +    explicit Fields(const StringView&, UChar separator, unsigned startOffset);

The type of the argument here should be StringView, not const StringView&, because StringView is efficient to pass by value so there is no reason to use a reference here.

> Source/WTF/wtf/text/StringView.h:627
> +    StringView m_stringView;

In the context of this class, which is a member of StringView, I think we can just name this m_string since there is no ambiguity about the fact that this is a string view. But roughly speaking, a string.

> Source/WTF/wtf/text/StringView.h:673
> +    Iterator& operator++(int) = delete;

I believe this is unneeded. If you define operator++() and don’t define operator++(int), it automatically will give an error if you try to post-increment. An explicit delete is not necessary.

> Source/WTF/wtf/text/StringView.h:683
> +    friend Fields;

Why does this work? I’ve always seen this written as friend class Fields before. Is this preferred syntax?

> Source/WTF/wtf/text/StringView.h:685
> +    const Fields& m_fields;

I would probably rename this m_result after changing the type name from Fields to SplitResult or m_substrings if we use the name SplitSubstrings.

(Hard extra credit project: Would be nice if we could figure out a way to make the iterator assert at runtime if we accidentally used this after the Fields object was destroyed. I wouldn’t want to greatly complicate these classes just to check that, but it would be a bad bug if we kept an iterator around after m_fields was gone.)

> Source/WTF/wtf/text/StringView.h:687
> +    size_t m_beginPosition;
> +    size_t m_endPosition;

These should be unsigned, not size_t. Positions within strings are unsigned, except that annoyingly the result of find is a size_t.

I think "begin position" is a strange name for a data member, because "begin" is a verb, not a noun. The noun is "beginning". I know that the C++ standard library uses begin in this peculiar way, but I don’t think we need to follow suit. Maybe startPosition instead. Or even m_substringStart/m_substringEnd or m_substringPosition/m_substringEndPosition or m_position/m_endPosition.

I also think it would be easy to write this in terms of a length instead of an end position. The three places that currently set the end position (there are only three) would just subtract the start position to compute a length, and then the code in ++ would just say: m_position += m_length + 1.

I think that m_position/m_length may in fact be my favorite way to write this, but it would change almost all the assertions I suggested above.

> Source/WebCore/platform/URLParser.cpp:2709
> -    for (auto& bytes : sequences) {
> +    for (const auto& bytes : input.split('&')) {

This should just be "auto" or "StringView". There is no reason to write "const auto&", which does the same thing as auto in this case, and I don’t think it’s particularly clear.

> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:274
> +    Vector<String> actual = vectorFromFields(a.split('T'));
> +    Vector<String> expected({ "his is a sentence." });
> +    ASSERT_EQ(expected.size(), actual.size());
> +    for (size_t i = 0; i < actual.size(); ++i)
> +        EXPECT_STREQ(expected[i].ascii().data(), actual[i].ascii().data()) << "Vectors differ at index " << i;

Shouldn’t use ascii() unless debugging. Should use utf8() instead.

I don’t understand the use of ASSERT for the size check and EXPECT for the values of each element. I think we would want EXPECT for both.

Is there no better idiom for checking if two vectors of strings are the same in one of our unit tests? Code is so repetitive; we should probably use a helper function instead of repeating ourselves so many times. Maybe even have a function where you pass a StringView, a split character, and an initializer list, so each test case could be just one line; I find that style helps encourage me to write a lot of test cases since they don’t occupy much space. Unless it’s valuable to have the line number of the EXPECT failure give us a clue which test failed. I am not familiar enough with the behavior of the test macros to know whether it helps.

> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:277
> +    expected = Vector<String>({ "This is a sentence" });

I don’t think we need to state the type here, just the braces should work.

> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:314
> +    actual = vectorFromFields(a.split(' ', a.length() + 1));
> +    EXPECT_EQ(0U, actual.size());

If we are keeping startOffset, we probably also want a test where it is the maximum unsigned value, to make sure we don’t have any overflow.

> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:351
> +    // Different separators
> +    EXPECT_FALSE(a.split(' ', a.length()).begin() == a.split(',', a.length()).begin());
> +    EXPECT_FALSE(a.split(' ', a.length()).end() == a.split(',', a.length()).end());
> +
> +    // Different StringView objects
> +    EXPECT_FALSE(a.split(' ').begin() == b.split(' ').begin());
> +    EXPECT_FALSE(a.split(' ', a.length()).begin() == b.split(' ', b.length()).begin());
> +    EXPECT_FALSE(a.split(' ', a.length()).end() == b.split(' ', b.length()).end());

This does not seem like an important property of the iterators to test. If you take my suggestion above, these tests are all just going to trigger assertion failures.

Not sure what the best practice is for testing assertion failures. If there was a way I would want to test for the assertion in ++ too.
Comment 4 Daniel Bates 2017-01-17 17:03:14 PST
Comment on attachment 298528 [details]
Patch and unit tests

View in context: https://bugs.webkit.org/attachment.cgi?id=298528&action=review

>> Source/WTF/wtf/text/StringView.cpp:117
>> +}
> 
> Might want to make an even more efficient implementation of this that uses a different constructor, although we might have to pull some trick since there’s no trivial way to have two constructors that both have no arguments (the enum trick perhaps). There is no reason to call StringView::find when initializing. Could call something that sticks the length into both m_beginPosition and m_endPosition.

Will add another constructor, Iterator(const SplitResult&, PositionTag), so that we can differentiate between iterator instantiation in StringView::Fields::{begin, end}(). The Iterator(const SplitResult&, PositionTag) constructor will be used to instantiate and initialize the "at-end" positioned iterator returned by StringView::Fields::end() and will not call StringView::find().

>> Source/WTF/wtf/text/StringView.cpp:125
>> +        m_beginPosition = length;
> 
> I would have used std::min for this, but I suggest removing this code entirely, since I don’t see the need to support startOffset.

Will remove support for a start offset and hence remove this code.

>> Source/WTF/wtf/text/StringView.cpp:129
>> +void StringView::Fields::Iterator::findNextField()
> 
> This function should be named findNextSubstring; I am not sure that the things we get when splitting are "fields".

Will rename to findNextSubstring.

>> Source/WTF/wtf/text/StringView.cpp:131
>> +    const StringView& stringView = m_fields.m_stringView;
> 
> Since the stringView is only used twice, I am not sure we should bother with the local variable. Or if you want to keep it, I would use auto:
> 
>     auto& string = m_fields.m_string;

Will remove the local.

>> Source/WTF/wtf/text/StringView.cpp:137
>> +    m_endPosition = stringView.length();
> 
> If we change m_endPosition to be unsigned then we can’t have it directly get the result of find, but I think that might be better anyway, writing it more like this. However, in this version the for line is so long, so maybe that’s a reason to not do this version.
> 
>     for (size_t separatorPosition; (separatorPosition = m_fields.m_string.find(m_fields.separator, m_beginPosition)) != notFound, ++m_beginPosition) {
>         if (separatorPosition > m_beginPosition) {
>             m_endPosition = separatorPosition;
>             return;
>         }
>     }
>     m_endPosition = m_fields.m_string.length();

Will change data type of m_{begin, end}Position from size_t to unsigned and will replace the implementation of this function with your suggested implementation.

>> Source/WTF/wtf/text/StringView.cpp:142
>> +    return m_fields.m_stringView.substring(m_beginPosition, m_endPosition - m_beginPosition);
> 
> This returns the empty string when at the end. But I think it would be even better to assert. We would just add this:
> 
>     ASSERT(m_beginPosition < m_endPosition);
>     ASSERT(m_endPosition < m_fields.m_string.length());
> 
> Seems like we should consider inlining this function since boils down to a single function call.

Will assert that the iterator is not at or past the end of the string.

>> Source/WTF/wtf/text/StringView.cpp:152
>> +    return *this;
> 
> I suggest this alternate implementation:
> 
>     ASSERT(m_beginPosition < m_endPosition);
>     if (m_endPosition == m_fields.m_stringView.length()) {
>         m_beginPosition = m_endPosition;
>         return *this;
>     }
>     m_beginPosition = m_endPosition + 1;
>     findNextField();
>     return *this;
> 
> The assertion is the one we would hit if we erroneously call ++ after reaching the end. Could also add this internal consistency check assertion:
> 
>     ASSERT(m_endPosition <= m_fields.m_string.length());

Will use suggested implementation and add internal consistency check assertion after the first assertion in your suggested implementation.

>> Source/WTF/wtf/text/StringView.cpp:159
>> +        && m_fields.m_separator == other.m_fields.m_separator;
> 
> If I was writing this, I would assert that the iterators are from the same call to split, rather than returning false in such cases. There is no valid reason to compare iterators from two different calls to split.
> 
> There is some confused code here. The last two comparisons do the same thing as this:
> 
>     &m_fields == &other.m_fields
> 
> Since m_stringView is a StringView, and not a const StringView&, and should be, its address is just an internal pointer inside the Fields structure. So it will only be equal if we have two references to the same Fields structure. And also, each substring is uniquely identified by the begin position and the end position. So we don’t need to compare both. This implementation would work well, the assertion would catch you if you used == on iterators from two different calls to split.
> 
>     ASSERT(&m_fields == &other.m_fields);
>     return m_beginPosition == other.m_beginPosition;
> 
> Once we write it this way, it becomes clear that is a great candidate for inlining since the non-assert part of the function is a single super-simple expression.
> 
> Here are five other internal consistency check assertions we could include if we like:
> 
>     ASSERT((m_beginPosition == other.m_beginPosition) == (m_endPosition == other.m_endPosition));
>     ASSERT(m_beginPosition <= m_endPosition);
>     ASSERT(m_endPosition <= m_fields.m_string.length());
>     ASSERT(other.m_beginPosition <= other.m_endPosition);
>     ASSERT(other.m_endPosition <= other.m_fields.m_string.length());

Will assert that the iterators are from the same call to split, add internal consistency checks, and make this an inline function.

>> Source/WTF/wtf/text/StringView.h:121
>> +    class Fields;
> 
> I don’t think Fields is a great name for this class. I would call this SplitResult or SplitSubstrings.
> 
> Later we might need different names for variations because we might need different flavors of split. Split based on a character, a string, and maybe even a regular expression. Split returning the appropriate number of empty substrings, alongside this kind of split that skips all empty substrings.

Will rename class to SplitResult.

>> Source/WTF/wtf/text/StringView.h:122
>> +    WTF_EXPORT_STRING_API Fields split(UChar, unsigned startOffset = 0) const;
> 
> It’s easy to use substring on a StringView to start at an arbitrary location, and it is efficient and does not memory allocation, so the startOffset seems like an unneeded feature and I suggest omitting it. These were needed in the String class because substrings there are expensive and to be avoided. If one was going to call string.split('/', 1) one could call string.subview(1).split('/') instead.

Will remove support for specifying a start offset.

>> Source/WTF/wtf/text/StringView.h:620
>> +    explicit Fields(const StringView&, UChar separator, unsigned startOffset);
> 
> The type of the argument here should be StringView, not const StringView&, because StringView is efficient to pass by value so there is no reason to use a reference here.

Will change data type of first parameter from "const StringView&" to StringView.

>> Source/WTF/wtf/text/StringView.h:627
>> +    StringView m_stringView;
> 
> In the context of this class, which is a member of StringView, I think we can just name this m_string since there is no ambiguity about the fact that this is a string view. But roughly speaking, a string.

Will rename to m_string.

>> Source/WTF/wtf/text/StringView.h:673
>> +    Iterator& operator++(int) = delete;
> 
> I believe this is unneeded. If you define operator++() and don’t define operator++(int), it automatically will give an error if you try to post-increment. An explicit delete is not necessary.

Will remove this statement.

I chose to explicitly delete this function to convey to a reader in both code and in a compile-time failure that the post-increment operator was explicitly not implemented (*) as opposed to imply that it was inadvertently not implemented. I take it that such a distinction is not meaningful and trying to invoke deliberate consideration on the usefulness of having a post-increment operator should someone invoke it is not meaningful.

(*) In the hopes that this would make a person consider the benefit of implementing it and whether they really need it.

>> Source/WTF/wtf/text/StringView.h:683
>> +    friend Fields;
> 
> Why does this work? I’ve always seen this written as friend class Fields before. Is this preferred syntax?

This makes use of the C++11 simple type specifier syntax of the friend specifier. See (4) of <http://en.cppreference.com/w/cpp/language/friend> for more details. I do not see a reason to be against such syntax. Let me know if you are against it.

>> Source/WTF/wtf/text/StringView.h:685
>> +    const Fields& m_fields;
> 
> I would probably rename this m_result after changing the type name from Fields to SplitResult or m_substrings if we use the name SplitSubstrings.
> 
> (Hard extra credit project: Would be nice if we could figure out a way to make the iterator assert at runtime if we accidentally used this after the Fields object was destroyed. I wouldn’t want to greatly complicate these classes just to check that, but it would be a bad bug if we kept an iterator around after m_fields was gone.)

Will rename m_fields to m_result as I chose to rename the class from Fields to SplitResult.

>> Source/WTF/wtf/text/StringView.h:687
>> +    size_t m_endPosition;
> 
> These should be unsigned, not size_t. Positions within strings are unsigned, except that annoyingly the result of find is a size_t.
> 
> I think "begin position" is a strange name for a data member, because "begin" is a verb, not a noun. The noun is "beginning". I know that the C++ standard library uses begin in this peculiar way, but I don’t think we need to follow suit. Maybe startPosition instead. Or even m_substringStart/m_substringEnd or m_substringPosition/m_substringEndPosition or m_position/m_endPosition.
> 
> I also think it would be easy to write this in terms of a length instead of an end position. The three places that currently set the end position (there are only three) would just subtract the start position to compute a length, and then the code in ++ would just say: m_position += m_length + 1.
> 
> I think that m_position/m_length may in fact be my favorite way to write this, but it would change almost all the assertions I suggested above.

Will use m_position/m_length (both unsigned datatypes) and update implementation and suggested assertions.

>> Source/WebCore/platform/URLParser.cpp:2709
>> +    for (const auto& bytes : input.split('&')) {
> 
> This should just be "auto" or "StringView". There is no reason to write "const auto&", which does the same thing as auto in this case, and I don’t think it’s particularly clear.

Will change "const auto&" to StringView.

>> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:274
>> +        EXPECT_STREQ(expected[i].ascii().data(), actual[i].ascii().data()) << "Vectors differ at index " << i;
> 
> Shouldn’t use ascii() unless debugging. Should use utf8() instead.
> 
> I don’t understand the use of ASSERT for the size check and EXPECT for the values of each element. I think we would want EXPECT for both.
> 
> Is there no better idiom for checking if two vectors of strings are the same in one of our unit tests? Code is so repetitive; we should probably use a helper function instead of repeating ourselves so many times. Maybe even have a function where you pass a StringView, a split character, and an initializer list, so each test case could be just one line; I find that style helps encourage me to write a lot of test cases since they don’t occupy much space. Unless it’s valuable to have the line number of the EXPECT failure give us a clue which test failed. I am not familiar enough with the behavior of the test macros to know whether it helps.

Will substitute utf8() for ascii() here and throughout this function.

Notice that the EXPECT*() macro functions continue test execution on failure and that if expected.size() != actual.size() then the test driver's helper process will crash due to an out-of-bounds read. I thought it would be clearer to a person to abort the test before we attempt to perform an out-of-bounds read that crashes the test driver's helper process. Therefore I chose to use the ASSERT*() macro function to abort the current test.

I am also unhappy to duplicate the logic to compare two vectors. I did not see an existing idiom and defining a "check vector" function would affect the line numbers of a failed ASSERT/EXPECT (as they would be relative to the "check vector" function as opposed to the code that called it) and would require emitting some other information to know about the caller of this function. Knowing that you can always rely on the line number of a failed ASSERT/EXPECT can save time (maybe we already violate this assumption in other tests?) compared to determining that the line number is meaningless and needing to read the test failure output again and grep the code for the caller of the "check vector" routine or set breakpoints to determine the caller of the "check vector" function. Let me know if we favor a "check vector" function over meaningful line numbers.

>> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:277
>> +    expected = Vector<String>({ "This is a sentence" });
> 
> I don’t think we need to state the type here, just the braces should work.

Will remove.

>> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:314
>> +    EXPECT_EQ(0U, actual.size());
> 
> If we are keeping startOffset, we probably also want a test where it is the maximum unsigned value, to make sure we don’t have any overflow.

Will remove the offset tests as I will remove support for splitting from a start offset.

>> Tools/TestWebKitAPI/Tests/WTF/StringView.cpp:351
>> +    EXPECT_FALSE(a.split(' ', a.length()).end() == b.split(' ', b.length()).end());
> 
> This does not seem like an important property of the iterators to test. If you take my suggestion above, these tests are all just going to trigger assertion failures.
> 
> Not sure what the best practice is for testing assertion failures. If there was a way I would want to test for the assertion in ++ too.

For now, I hope you do not mind that I remove these tests.

Additional remarks:

We may be able to test for an assertion failure using a Death Test, <https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#death-tests>. Obviously this has the disadvantages of slowing down tests runs (the test driver will need to spawn a new subprocess for each death test and we need to wait to generate the crash report) and filling the disk with crash reports. We may be able to come up with a clever way to avoid generating a crash report. Let me know if you would like me to investigate this further.
Comment 5 Daniel Bates 2017-01-17 19:48:25 PST
Created attachment 299113 [details]
Patch and unit tests

Updated patch based on Darin Adler's feedback.
Comment 6 Daniel Bates 2017-01-17 19:50:00 PST
Comment on attachment 299113 [details]
Patch and unit tests

View in context: https://bugs.webkit.org/attachment.cgi?id=299113&action=review

> Source/WTF/ChangeLog:15
> +        StringView.split() now returns a StringView::Fields object that implements begin()/end()

Will substitute SplitResult for Fields before landing.

> Source/WTF/ChangeLog:17
> +        character. For example, to iterate over the 'c'-separated fields of a StringView v, you

Will substitute substring for field before landing.

> Source/WTF/ChangeLog:20
> +        for (StringView field : v.split('c'))

Ditto.

> Source/WTF/ChangeLog:21
> +            // Do something with field.

Ditto.
Comment 7 WebKit Commit Bot 2017-01-17 19:51:19 PST
Attachment 299113 [details] did not pass style-queue:


ERROR: Source/WTF/wtf/text/StringView.h:887:  Code inside a namespace should not be indented.  [whitespace/indent] [4]
Total errors found: 1 in 7 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 8 Daniel Bates 2017-01-17 19:53:36 PST
(In reply to comment #3)
> (Hard extra credit project: Would be nice if we could figure out a way to
> make the iterator assert at runtime if we accidentally used this after the
> Fields object was destroyed. I wouldn’t want to greatly complicate these
> classes just to check that, but it would be a bad bug if we kept an iterator
> around after m_fields was gone.)
> 

I did not mean to ignore this remark. I have not thought about this problem yet. I will shortly.
Comment 9 Darin Adler 2017-01-17 21:49:11 PST
Comment on attachment 299113 [details]
Patch and unit tests

View in context: https://bugs.webkit.org/attachment.cgi?id=299113&action=review

Looks excellent as is. A few thoughts for a tiny bit of refinement.

> Source/WTF/wtf/text/StringView.cpp:111
> +    ASSERT(m_length <= m_result.m_string.length());
> +    ASSERT(m_position <= m_result.m_string.length() - m_length);

These are essentially the same assertions as in operator*, but for some reason you left out the ASSERT(m_length) here. But I suggest we not initialize length and write this instead:

    ASSERT(m_position < m_result.m_string.length());

The above is the assertion that is important. That catches callers that use the iterator when we are past the end of the strings. The following m_length assertions are OK but I think not really all that valuable, so maybe just omit them:

    ASSERT(m_length);
    ASSERT(m_length <= m_result.m_string.length() - m_position);

> Source/WTF/wtf/text/StringView.cpp:119
> +    if (m_position == m_result.m_string.length() - m_length) {
> +        m_position = m_result.m_string.length();
> +        m_length = 0;
> +        return *this;
> +    }
> +    m_position += m_length + 1;
> +    findNextSubstring();
> +    return *this;

I think this can be written a bit more tightly:

    m_position += m_length;
    if (m_position < m_result.m_string.length()) {
        ++m_position;
        findNextSubstring();
    }
    return *this;

> Source/WTF/wtf/text/StringView.h:687
> +    unsigned m_length { 0 };

We should consider not initializing m_length for slightly better efficiency; why not save one store of a zero in both constructors? The assertions I suggested above in the order listed make it clearer that we don’t have to set m_length except as a result of calling findNextSubstring. When we are at the end of the string we don’t care what m_length is set to, since calling the * or ++ operators in that case is an error.

> Source/WTF/wtf/text/StringView.h:917
> +    ASSERT(m_length);
> +    ASSERT(m_length <= m_result.m_string.length());
> +    ASSERT(m_position <= m_result.m_string.length() - m_length);

See above for my suggestion of what to replace these assertions with that does not depend on m_length being set.

> Source/WTF/wtf/text/StringView.h:927
> +    ASSERT(m_length <= m_result.m_string.length());
> +    ASSERT(m_position <= m_result.m_string.length() - m_length);
> +    ASSERT(other.m_length <= other.m_result.m_string.length());
> +    ASSERT(other.m_position <= other.m_result.m_string.length() - other.m_length);

If we don’t set m_length at the end of the string, we have to live without these assertions or write longer versions. But I think we can live without them. No real need to assert all these things. Could assert position <= length on both *this and other, but even that seems unneeded to me; the important assertion is that both iterators are over the same string, since that can detect programming mistakes. These are internal consistency checks that check invariants that won’t ever be wrong. So I think we can just leave the one assertion above and remove all four of these.
Comment 10 Daniel Bates 2017-01-23 16:18:03 PST
(In reply to comment #9)
> [...]
> > Source/WTF/wtf/text/StringView.cpp:111
> > +    ASSERT(m_length <= m_result.m_string.length());
> > +    ASSERT(m_position <= m_result.m_string.length() - m_length);
> 
> These are essentially the same assertions as in operator*, but for some
> reason you left out the ASSERT(m_length) here. But I suggest we not
> initialize length and write this instead:
> 
>     ASSERT(m_position < m_result.m_string.length());
> 

Will not initialize m_length and add this assert.

> The above is the assertion that is important. That catches callers that use
> the iterator when we are past the end of the strings. The following m_length
> assertions are OK but I think not really all that valuable, so maybe just
> omit them:
> 
>     ASSERT(m_length);
>     ASSERT(m_length <= m_result.m_string.length() - m_position);
>

Will omit these assertions.

> > Source/WTF/wtf/text/StringView.cpp:119
> > +    if (m_position == m_result.m_string.length() - m_length) {
> > +        m_position = m_result.m_string.length();
> > +        m_length = 0;
> > +        return *this;
> > +    }
> > +    m_position += m_length + 1;
> > +    findNextSubstring();
> > +    return *this;
> 
> I think this can be written a bit more tightly:
> 
>     m_position += m_length;
>     if (m_position < m_result.m_string.length()) {
>         ++m_position;
>         findNextSubstring();
>     }
>     return *this;
> 

Will update code to match your proposed implementation.

> > Source/WTF/wtf/text/StringView.h:687
> > +    unsigned m_length { 0 };
> 

Will remove initialization of m_length.

> We should consider not initializing m_length for slightly better efficiency;
> why not save one store of a zero in both constructors? The assertions I
> suggested above in the order listed make it clearer that we don’t have to
> set m_length except as a result of calling findNextSubstring. When we are at
> the end of the string we don’t care what m_length is set to, since calling
> the * or ++ operators in that case is an error.
> 
> > Source/WTF/wtf/text/StringView.h:917
> > +    ASSERT(m_length);
> > +    ASSERT(m_length <= m_result.m_string.length());
> > +    ASSERT(m_position <= m_result.m_string.length() - m_length);
> 
> See above for my suggestion of what to replace these assertions with that
> does not depend on m_length being set.
> 

Will replace these assertions with:

ASSERT(m_position < m_result.m_string.length());

> > Source/WTF/wtf/text/StringView.h:927
> > +    ASSERT(m_length <= m_result.m_string.length());
> > +    ASSERT(m_position <= m_result.m_string.length() - m_length);
> > +    ASSERT(other.m_length <= other.m_result.m_string.length());
> > +    ASSERT(other.m_position <= other.m_result.m_string.length() - other.m_length);
> 
> If we don’t set m_length at the end of the string, we have to live without
> these assertions or write longer versions. But I think we can live without
> them. No real need to assert all these things. Could assert position <=
> length on both *this and other, but even that seems unneeded to me; the
> important assertion is that both iterators are over the same string, since
> that can detect programming mistakes. These are internal consistency checks
> that check invariants that won’t ever be wrong. So I think we can just leave
> the one assertion above and remove all four of these.

Will remove these assertions such that StringView::SplitResult::Iterator::operator==() has exactly one assertion:

ASSERT(&m_result == &other.m_result);
Comment 11 Daniel Bates 2017-01-24 09:46:24 PST
Committed r211087: <http://trac.webkit.org/changeset/211087>