Bug 229907 - Add a fast path for atomizing strings when parsing HTML
Summary: Add a fast path for atomizing strings when parsing HTML
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Wenson Hsieh
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-09-03 18:02 PDT by Wenson Hsieh
Modified: 2021-09-08 07:01 PDT (History)
11 users (show)

See Also:


Attachments
For EWS (6.63 KB, patch)
2021-09-05 13:26 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Patch (6.92 KB, patch)
2021-09-06 11:38 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
Patch (7.00 KB, patch)
2021-09-07 08:30 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
v2 (18.43 KB, patch)
2021-09-07 15:27 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
For EWS (19.05 KB, patch)
2021-09-07 17:49 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff
For EWS (18.92 KB, patch)
2021-09-07 21:36 PDT, Wenson Hsieh
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Wenson Hsieh 2021-09-03 18:02:58 PDT
.
Comment 1 Wenson Hsieh 2021-09-05 13:26:55 PDT
Created attachment 437365 [details]
For EWS
Comment 2 Cameron McCormack (:heycam) 2021-09-05 17:41:44 PDT
Comment on attachment 437365 [details]
For EWS

View in context: https://bugs.webkit.org/attachment.cgi?id=437365&action=review

Looks good, so a non-reviewer r+ from me.

> Source/WebCore/html/parser/AtomHTMLToken.h:215
> +    static constexpr auto maxStringLengthForCache = 36;

Might be worth a comment (here or in the ChangeLog entry) to say where 36 comes from.

> Source/WebCore/html/parser/AtomHTMLToken.h:222
> +    if (!WTF::equal(foundImpl.get(), string.data(), string.size())) {

Use length rather than string.size(), or drop the length variable earlier.

> Source/WebCore/html/parser/AtomHTMLToken.h:258
> +        m_name = atomize<AtomStringType::TagOrAttributeName>(token.name());

I guess this will let us fast path one more atomization, since the name in the <!DOCTYPE> is almost certainly "html", and we'll later probably have an <html> and an </html> tag.
Comment 3 Wenson Hsieh 2021-09-06 11:27:05 PDT
Comment on attachment 437365 [details]
For EWS

View in context: https://bugs.webkit.org/attachment.cgi?id=437365&action=review

Thanks for taking a look!

>> Source/WebCore/html/parser/AtomHTMLToken.h:215
>> +    static constexpr auto maxStringLengthForCache = 36;
> 
> Might be worth a comment (here or in the ChangeLog entry) to say where 36 comes from.

Good point — I'll add an explanation for this in the ChangeLog.

I chose to add an upper bound longer strings are less likely in practice to be very frequently re-atomized, compared to shorter strings. This specific 36-character limit actually allows us to cache certain UUIDs in Speedometer that are repeatedly atomized during parsing, which significantly increases the cache hit rate in Speedometer.

>> Source/WebCore/html/parser/AtomHTMLToken.h:222
>> +    if (!WTF::equal(foundImpl.get(), string.data(), string.size())) {
> 
> Use length rather than string.size(), or drop the length variable earlier.

👍🏻 Changed to use `length` here.

>> Source/WebCore/html/parser/AtomHTMLToken.h:258
>> +        m_name = atomize<AtomStringType::TagOrAttributeName>(token.name());
> 
> I guess this will let us fast path one more atomization, since the name in the <!DOCTYPE> is almost certainly "html", and we'll later probably have an <html> and an </html> tag.

Indeed — while this part is not really critical for the performance boost, I figured it would be nice to just change all of the atomization call sites to use the `atomize<>` helper for consistency.
Comment 4 Wenson Hsieh 2021-09-06 11:38:04 PDT
Created attachment 437430 [details]
Patch
Comment 5 Yusuke Suzuki 2021-09-07 07:23:44 PDT
Comment on attachment 437430 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=437430&action=review

> Source/WebCore/html/parser/AtomHTMLToken.h:200
> +    static NeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];

Use `MainThreadNeverDestroyed<>` instead :)
Comment 6 Wenson Hsieh 2021-09-07 08:30:43 PDT
Created attachment 437510 [details]
Patch
Comment 7 Wenson Hsieh 2021-09-07 08:30:58 PDT
(In reply to Yusuke Suzuki from comment #5)
> Comment on attachment 437430 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=437430&action=review
> 
> > Source/WebCore/html/parser/AtomHTMLToken.h:200
> > +    static NeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];
> 
> Use `MainThreadNeverDestroyed<>` instead :)

Fixed — thanks!
Comment 8 Yusuke Suzuki 2021-09-07 11:33:17 PDT
Comment on attachment 437510 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=437510&action=review

> Source/WebCore/html/parser/AtomHTMLToken.h:200
> +    static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];

One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name.
So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings.
Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes).
Comment 9 Wenson Hsieh 2021-09-07 11:46:23 PDT
(In reply to Yusuke Suzuki from comment #8)
> Comment on attachment 437510 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=437510&action=review
> 
> > Source/WebCore/html/parser/AtomHTMLToken.h:200
> > +    static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];
> 
> One problem I found is that this table is caching arbitrary string. It is
> possible that this string is very large and it is not know tag name.
> So I think we need a good story of clearing this cache, otherwise, we will
> potentially leak large strings.
> Can you add some mechanism to make this cache cleared? (e.g. clearing this
> when innerHTML finishes).

I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/.

Do you think that the upper-bound on cached String length (36) might be enough to mitigate this?
Comment 10 Yusuke Suzuki 2021-09-07 11:59:49 PDT
Comment on attachment 437510 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=437510&action=review

>>> Source/WebCore/html/parser/AtomHTMLToken.h:200
>>> +    static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];
>> 
>> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name.
>> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings.
>> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes).
> 
> I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/.
> 
> Do you think that the upper-bound on cached String length (36) might be enough to mitigate this?

I think adding an upperbound sounds good. And at least, we should clear this cache when memory warning happens.

1. Can you add length upperbound here?
2. Can you wire memory warning thing to this cache to clear things when warning happens?
3. And can you add FIXME+bugzilla URL about more periodic clearing?
4. What do you think about clearing this cache when GC timer is fired? (GCController.cpp in WebCore)
Comment 11 Yusuke Suzuki 2021-09-07 12:02:38 PDT
Comment on attachment 437510 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=437510&action=review

>>>> Source/WebCore/html/parser/AtomHTMLToken.h:200
>>>> +    static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];
>>> 
>>> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name.
>>> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings.
>>> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes).
>> 
>> I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/.
>> 
>> Do you think that the upper-bound on cached String length (36) might be enough to mitigate this?
> 
> I think adding an upperbound sounds good. And at least, we should clear this cache when memory warning happens.
> 
> 1. Can you add length upperbound here?
> 2. Can you wire memory warning thing to this cache to clear things when warning happens?
> 3. And can you add FIXME+bugzilla URL about more periodic clearing?
> 4. What do you think about clearing this cache when GC timer is fired? (GCController.cpp in WebCore)

Maybe we should clear this cache when full GC happens.
Comment 12 Wenson Hsieh 2021-09-07 12:03:10 PDT
(In reply to Yusuke Suzuki from comment #10)
> Comment on attachment 437510 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=437510&action=review
> 
> >>> Source/WebCore/html/parser/AtomHTMLToken.h:200
> >>> +    static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];
> >> 
> >> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name.
> >> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings.
> >> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes).
> > 
> > I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/.
> > 
> > Do you think that the upper-bound on cached String length (36) might be enough to mitigate this?
> 
> I think adding an upperbound sounds good. And at least, we should clear this
> cache when memory warning happens.
> 
> 1. Can you add length upperbound here?

Ah, so this patch already enforces a 36-character upper limit on strings that are cached.

> 2. Can you wire memory warning thing to this cache to clear things when
> warning happens?

Will do!

> 3. And can you add FIXME+bugzilla URL about more periodic clearing?

👍🏻

> 4. What do you think about clearing this cache when GC timer is fired?
> (GCController.cpp in WebCore)

Yes, I think that makes sense! Will look into this too.
Comment 13 Wenson Hsieh 2021-09-07 14:25:54 PDT
(In reply to Yusuke Suzuki from comment #11)
> Comment on attachment 437510 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=437510&action=review
> 
> >>>> Source/WebCore/html/parser/AtomHTMLToken.h:200
> >>>> +    static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2];
> >>> 
> >>> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name.
> >>> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings.
> >>> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes).
> >> 
> >> I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/.
> >> 
> >> Do you think that the upper-bound on cached String length (36) might be enough to mitigate this?
> > 
> > I think adding an upperbound sounds good. And at least, we should clear this cache when memory warning happens.
> > 
> > 1. Can you add length upperbound here?
> > 2. Can you wire memory warning thing to this cache to clear things when warning happens?
> > 3. And can you add FIXME+bugzilla URL about more periodic clearing?
> > 4. What do you think about clearing this cache when GC timer is fired? (GCController.cpp in WebCore)
> 
> Maybe we should clear this cache when full GC happens.

We discussed this in Slack, and think it might be sufficient to clear the cache when navigating.
Comment 14 Wenson Hsieh 2021-09-07 15:27:23 PDT
Created attachment 437562 [details]
v2
Comment 15 Yusuke Suzuki 2021-09-07 16:30:00 PDT
Comment on attachment 437562 [details]
v2

View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review

r=me

> Source/WebCore/page/MemoryRelease.cpp:89
> +    HTMLAtomStringCache::clear();

I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check?

> Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:104
> +    HTMLAtomStringCache::clear();

This runs only under the main thread, correct? If so, then looks good to me!
Comment 16 Wenson Hsieh 2021-09-07 16:45:17 PDT
Comment on attachment 437562 [details]
v2

View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review

Thanks for the review!

>> Source/WebCore/page/MemoryRelease.cpp:89
>> +    HTMLAtomStringCache::clear();
> 
> I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check?

Yes, I think that's the case…it at a least looks like we can guarantee that isMainThread() is true, since there are some places in WebKitLegacy on iOS that call into `releaseMemory` through a block in `WebThreadRun`. But (IIUC) that should be okay.

>> Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:104
>> +    HTMLAtomStringCache::clear();
> 
> This runs only under the main thread, correct? If so, then looks good to me!

Yes! This one definitely only runs on the (real) main thread.
Comment 17 Darin Adler 2021-09-07 16:59:29 PDT
Comment on attachment 437562 [details]
v2

View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review

> Source/WebCore/html/parser/HTMLAtomStringCache.h:36
> +    enum class Type : uint8_t { TagOrAttributeName = 0, AttributeValue = 1 };

How about bool instead of uint8_t?

Also, it’s so wordy. Can we have two overloads of get so the caller doesn’t have to write such a long identifier? The Type can be a secret implementation detail.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:39
> +    ALWAYS_INLINE static AtomString get(const Vector<UChar, inlineCapacity>& string)

I’d be tempted to call this "create" even though it does a "create if needed and return existing if available" operation. Usually "get" only means "return existing if available". Or "make", or "atom".

> Source/WebCore/html/parser/HTMLAtomStringCache.h:49
> +        auto firstChararacter = string[0];
> +        auto lastChararacter = string[length - 1];

I suggest the traditional spelling for character; extra "ra" here.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:68
> +    ALWAYS_INLINE static RefPtr<AtomStringImpl>& getImpl(Type type, UChar firstChararacter, UChar lastCharacter, UChar length)

More misspelling here.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:73
> +        unsigned hash = (firstChararacter << 6) ^ ((lastCharacter << 14) ^ firstChararacter);
> +        hash += (hash >> 14) + (length << 14);
> +        hash ^= hash << 14;
> +        return cache(type)[(hash + (hash >> 6)) % capacity];

Would be nice to comment what kind of hash function this is, and why it’s a good choice.

>>> Source/WebCore/page/MemoryRelease.cpp:89
>>> +    HTMLAtomStringCache::clear();
>> 
>> I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check?
> 
> Yes, I think that's the case…it at a least looks like we can guarantee that isMainThread() is true, since there are some places in WebKitLegacy on iOS that call into `releaseMemory` through a block in `WebThreadRun`. But (IIUC) that should be okay.

Is there something that will do ASSERT(isMainThread()) to help keep us honest? I don’t see anything explicit, but maybe something implicit?

> Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:102
> +#endif // PLATFORM(IOS_FAMILY)

I don’t think the #endif needs a comment when it is *so* close to the #if.
Comment 18 Wenson Hsieh 2021-09-07 17:38:29 PDT
Comment on attachment 437562 [details]
v2

View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:36
>> +    enum class Type : uint8_t { TagOrAttributeName = 0, AttributeValue = 1 };
> 
> How about bool instead of uint8_t?
> 
> Also, it’s so wordy. Can we have two overloads of get so the caller doesn’t have to write such a long identifier? The Type can be a secret implementation detail.

Sounds good — changed the `uint8_t` to `bool`, and made this enum private (with `makeTagOrAttributeName` and `makeAttributeValue` exposed as public methods instead).

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:39
>> +    ALWAYS_INLINE static AtomString get(const Vector<UChar, inlineCapacity>& string)
> 
> I’d be tempted to call this "create" even though it does a "create if needed and return existing if available" operation. Usually "get" only means "return existing if available". Or "make", or "atom".

Good point — I'll go with `makeTagOrAttributeName` and `makeAttributeValue`.

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:49
>> +        auto lastChararacter = string[length - 1];
> 
> I suggest the traditional spelling for character; extra "ra" here.

Whoops! Fixed.

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:68
>> +    ALWAYS_INLINE static RefPtr<AtomStringImpl>& getImpl(Type type, UChar firstChararacter, UChar lastCharacter, UChar length)
> 
> More misspelling here.

Fixed!

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:73
>> +        return cache(type)[(hash + (hash >> 6)) % capacity];
> 
> Would be nice to comment what kind of hash function this is, and why it’s a good choice.

👍🏻 Added an explanation to the ChangeLog.

>>>> Source/WebCore/page/MemoryRelease.cpp:89
>>>> +    HTMLAtomStringCache::clear();
>>> 
>>> I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check?
>> 
>> Yes, I think that's the case…it at a least looks like we can guarantee that isMainThread() is true, since there are some places in WebKitLegacy on iOS that call into `releaseMemory` through a block in `WebThreadRun`. But (IIUC) that should be okay.
> 
> Is there something that will do ASSERT(isMainThread()) to help keep us honest? I don’t see anything explicit, but maybe something implicit?

That's a good question! It looks like anything here that would attempt to ref or deref RefCounted objects would trigger an ASSERT, if this were to fire when `!isMainThread()`. From code inspection, I think the call to `InlineStyleSheetOwner::clearCache();` above would trigger this as long as the style sheet cache is non-empty, since the values of that map are `RefPtr<StyleSheetContents>`.

>> Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:102
>> +#endif // PLATFORM(IOS_FAMILY)
> 
> I don’t think the #endif needs a comment when it is *so* close to the #if.

True — removed the #endif.
Comment 19 Wenson Hsieh 2021-09-07 17:49:54 PDT
Created attachment 437574 [details]
For EWS
Comment 20 Darin Adler 2021-09-07 18:22:43 PDT
Comment on attachment 437574 [details]
For EWS

View in context: https://bugs.webkit.org/attachment.cgi?id=437574&action=review

Pondered this a little more.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:56
> +    enum class Type : bool { TagOrAttributeName = 0, AttributeValue = 1 };

Just a super-tiny nit pick in an already great patch: the "= 0" and "= 1" here aren’t really needed, even though we do use these booleans as array indices. For one thing, it doesn’t matter which is 0 and which is 1. And for another, converting enum to a size_t doesn’t somehow become more "well defined" if we specify numeric values for them.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:70
> +        auto& foundImpl = getImpl(type, firstCharacter, lastCharacter, length);

I probably would not use the word "impl" at all in the implementation; would look for different words. I think maybe I would name both the function and the local variable cacheSlot, cacheBucket, slot, or bucket.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:71
> +        if (!WTF::equal(foundImpl.get(), string.data(), length)) {

Surprised that you need to use the WTF:: prefix explicitly here.

> Source/WebCore/html/parser/HTMLAtomStringCache.h:90
> +    using Cache = std::array<RefPtr<AtomStringImpl>, capacity>;

Any good reason the array elements are not AtomString rather than RefPtr<AtomStringImpl>?
Comment 21 Wenson Hsieh 2021-09-07 21:25:17 PDT
Comment on attachment 437574 [details]
For EWS

View in context: https://bugs.webkit.org/attachment.cgi?id=437574&action=review

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:56
>> +    enum class Type : bool { TagOrAttributeName = 0, AttributeValue = 1 };
> 
> Just a super-tiny nit pick in an already great patch: the "= 0" and "= 1" here aren’t really needed, even though we do use these booleans as array indices. For one thing, it doesn’t matter which is 0 and which is 1. And for another, converting enum to a size_t doesn’t somehow become more "well defined" if we specify numeric values for them.

Makes sense! Removed the =0 and =1.

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:70
>> +        auto& foundImpl = getImpl(type, firstCharacter, lastCharacter, length);
> 
> I probably would not use the word "impl" at all in the implementation; would look for different words. I think maybe I would name both the function and the local variable cacheSlot, cacheBucket, slot, or bucket.

👍🏻 Renamed `getImpl` to `cacheSlot` (and made it return an AtomString&), and also renamed `foundImpl` to just `slot`.

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:71
>> +        if (!WTF::equal(foundImpl.get(), string.data(), length)) {
> 
> Surprised that you need to use the WTF:: prefix explicitly here.

Ah, I must've copied it from elsewhere but kept the WTF::. Removed.

>> Source/WebCore/html/parser/HTMLAtomStringCache.h:90
>> +    using Cache = std::array<RefPtr<AtomStringImpl>, capacity>;
> 
> Any good reason the array elements are not AtomString rather than RefPtr<AtomStringImpl>?

That's true, making this an AtomString directly should simplify this a bit.

I've changed this to be a std::array of AtomString, and changed the `RefPtr<AtomStringImpl>& getImpl(…)` method above to just `AtomString& get(…)`.
Comment 22 Radar WebKit Bug Importer 2021-09-07 21:27:46 PDT
<rdar://problem/82854612>
Comment 23 Wenson Hsieh 2021-09-07 21:36:14 PDT
Created attachment 437590 [details]
For EWS
Comment 24 EWS 2021-09-08 07:01:57 PDT
Committed r282142 (241439@main): <https://commits.webkit.org/241439@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 437590 [details].