Bug 162315 - Optimize StringView::CodePoints::Iterator
Summary: Optimize StringView::CodePoints::Iterator
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-20 13:55 PDT by Alex Christensen
Modified: 2021-03-23 08:57 PDT (History)
6 users (show)

See Also:


Attachments
Patch (4.65 KB, patch)
2016-09-20 13:56 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (4.48 KB, patch)
2021-03-22 13:25 PDT, Alex Christensen
ews-feeder: commit-queue-
Details | Formatted Diff | Diff
Patch (4.46 KB, patch)
2021-03-22 14:07 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (7.41 KB, patch)
2021-03-22 17:04 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (4.48 KB, patch)
2021-03-22 18:48 PDT, Alex Christensen
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2016-09-20 13:55:06 PDT
Optimize StringView::CodePoints::Iterator
Comment 1 Alex Christensen 2016-09-20 13:56:51 PDT
Created attachment 289396 [details]
Patch
Comment 2 Darin Adler 2016-12-09 20:32:40 PST
Comment on attachment 289396 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=289396&action=review

> Source/WTF/wtf/text/StringView.h:626
> -    StringView m_stringView;
> +    const StringView& m_stringView;

This change is not good. If the StringView that was passed in to this iterator is destroyed, we will access the memory location the StringView was once stored at. The old implementation did not require that the StringView still be around, just that the underlying string data still be around. This is especially dangerous if the StringView was a temporary, perhaps the result of a call to StringView::substring.

So if you want to optimize this, please do it a safer way.

> Source/WTF/wtf/text/StringView.h:678
> +    const void* m_current;
> +    const void* m_end;
> +    bool m_is8Bit;

Doing this gets rid of all the ASSERT(underlyingStringIsValid()) checks that would have been done each time the string view is used when StringView::length, StringView::characters8, and StringView::characters16 was called; this is what catches us if we use a stale iterator after the string is destroyed. To preserve those checks we need to add additional code inside #if CHECK_STRINGVIEW_LIFETIME and presumably the easiest approach is to keep a copy of the StringView around in a separate data member only when those checks are enabled.

To avoid all the type casting, I suggest we do things one of these two ways. Something like this:

    union {
        const LChar* m_current8;
        const UChar* m_current16;
    };
    union {
        const LChar* m_end8;
        const UChar* m_end16;
    };
    bool m_is8Bit;

Or we something different, like this; is8Bit could be based on the variant.

    Variant<std::pair<const LChar*, const LChar*>, std::pair<const UChar*, const UChar*>> m_characters;

The use of pair is a bit ugly, so maybe if you like that separate style, something more elegant.

> Source/WTF/wtf/text/StringView.h:733
>  {

Should ASSERT(index <= stringView.length()).

> Source/WTF/wtf/text/StringView.h:735
> +        const LChar* begin = stringView.characters8();

I suggest auto or auto* here.

> Source/WTF/wtf/text/StringView.h:737
> +        m_current = reinterpret_cast<const void*>(begin + index);
> +        m_end = reinterpret_cast<const void*>(begin + stringView.length());

Even if we continue to use const void*, there is no need for these type casts. It is legal to assign const LChar* to const void*. But if we use unions we won’t need type casts here.

> Source/WTF/wtf/text/StringView.h:739
> +        const UChar* begin = stringView.characters16();

Ditto.

> Source/WTF/wtf/text/StringView.h:741
> +        m_current = reinterpret_cast<const void*>(begin + index);
> +        m_end = reinterpret_cast<const void*>(begin + stringView.length());

Ditto.

> Source/WTF/wtf/text/StringView.h:749
> +        m_current = reinterpret_cast<const void*>(reinterpret_cast<const LChar*>(m_current) + 1);

No need for all the reinterpret_cast, because static_cast has the power to cast from const void* because of idioms like this one, and there is no need to cast *to* const void*. Just:

    m_current = static_cast<const LChar*>(m_current) + 1;

Or:

    ++m_current8;

Depending on whether you take my advice above.

> Source/WTF/wtf/text/StringView.h:752
> +        size_t length = reinterpret_cast<const UChar*>(m_end) - reinterpret_cast<const UChar*>(m_current);

These should be static_cast, or no cast at all if we use union.

> Source/WTF/wtf/text/StringView.h:753
> +        U16_FWD_1(reinterpret_cast<const UChar*>(m_current), i, length);

Ditto.

> Source/WTF/wtf/text/StringView.h:754
> +        m_current = reinterpret_cast<const void*>(reinterpret_cast<const UChar*>(m_current) + i);

Ditto.

> Source/WTF/wtf/text/StringView.h:765
>  inline auto StringView::CodePoints::Iterator::operator=(const Iterator& other) -> Iterator&
>  {
> -    m_stringView = other.m_stringView;
> -    m_nextCodePointOffset = other.m_nextCodePointOffset;
> -    m_codePoint = other.m_codePoint;
> +    m_current = other.m_current;
> +    m_end = other.m_end;
> +    m_is8Bit = other.m_is8Bit;
>      return *this;
>  }

We don’t need to write this out. This is what would be generated if we just didn’t define an explicit operator= function. So I suggest not defining one and letting the compiler take care of generating a correct function.

> Source/WTF/wtf/text/StringView.h:771
> +        return *reinterpret_cast<const LChar*>(m_current);

Should be static_cast or no cast at all if we use the union.

> Source/WTF/wtf/text/StringView.h:772
> +    UChar32 c;

I would prefer to use a word or phrase, rather than a single letter, for this local variable. I think "codePoint" would do nicely.

> Source/WTF/wtf/text/StringView.h:773
> +    U16_GET(reinterpret_cast<const UChar*>(m_current), 0, 0, reinterpret_cast<const UChar*>(m_end) - reinterpret_cast<const UChar*>(m_current), c);

Ditto. I am also slightly concerned that doing this work separately from U16_FWD_1 might result in code that checks for surrogate pairs twice, so is always unnecessarily inefficient, but I am not sure.
Comment 3 Alex Christensen 2021-03-22 13:25:59 PDT
Created attachment 423927 [details]
Patch
Comment 4 Alex Christensen 2021-03-22 14:07:22 PDT
Created attachment 423934 [details]
Patch
Comment 5 Darin Adler 2021-03-22 15:45:49 PDT
Comment on attachment 423934 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=423934&action=review

> Source/WTF/wtf/text/StringView.h:847
> +    union {
> +        const LChar* m_current8;
> +        const UChar* m_current16;
> +    };
> +    union {
> +        const LChar* m_end8;
> +        const UChar* m_end16;
> +    };

We could consider using const void* for both of these. This would help us understand the code is in operator== without relying on properties of unions that aren’t deeply guaranteed.

We would use static_cast to pull values out: there would be only 4 places we’d need them.

> Source/WTF/wtf/text/StringView.h:933
> +        size_t length = m_end16 - m_current16;

Why local variable here?

> Source/WTF/wtf/text/StringView.h:951
> +    U16_GET(m_current16, 0, 0, m_end16 - m_current16, codePoint);

But not here for the same kind of argument?
Comment 6 Alex Christensen 2021-03-22 17:04:34 PDT
Created attachment 423965 [details]
Patch
Comment 7 Darin Adler 2021-03-22 18:06:21 PDT
Comment on attachment 423965 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=423965&action=review

> Source/WTF/ChangeLog:10
> +        In order to make const members, I need to remove the ability to reassign an iterator to a different iterator.
> +        This ability was only used in tests, so I modified the test to still compile successfully after this const addition.

Why did you need to make const members?
Comment 8 Alex Christensen 2021-03-22 18:07:20 PDT
Because I can?  I think it's good to have iterators being unable to switch between 8 and 16 bit at least.
Comment 9 Darin Adler 2021-03-22 18:26:56 PDT
(In reply to Alex Christensen from comment #8)
> Because I can?  I think it's good to have iterators being unable to switch
> between 8 and 16 bit at least.

That doesn’t seem like a worthwhile thing.

Iterators can’t "switch"; part of the benefit of them being objects is that they stay consistent, for example "current" and "end" will stay consistent and not get mixed up. But like any other typical value-type object, they can be overwritten by a completely different iterator using the assignment operator. I don’t think it is valuable to make a value object like this intrinsically const; if a client wants one they can’t modify by accident they can make it const. It’s like making an int const so it can’t change from an even number to an odd number.

Iterators in the standard library are like this. For example, you can take an iterator pointing to the beginning of a std::set and overwrite it with an iterator pointing to the beginning of another set::set, or whatever, as long as the sets are of the same type.

I suggest we drop this const.
Comment 10 Alex Christensen 2021-03-22 18:48:32 PDT
Created attachment 423976 [details]
Patch
Comment 11 Darin Adler 2021-03-22 19:09:34 PDT
Comment on attachment 423976 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=423976&action=review

> Source/WTF/wtf/text/StringView.h:922
> +        ASSERT(m_current < m_end);

Seems like we can hoist this assertion out of the if statement.

> Source/WTF/wtf/text/StringView.h:940
> +        ASSERT(m_current < m_end);

Ditto.
Comment 12 Alex Christensen 2021-03-23 08:57:00 PDT
r274872