RESOLVED FIXED Bug 229907
Add a fast path for atomizing strings when parsing HTML
https://bugs.webkit.org/show_bug.cgi?id=229907
Summary Add a fast path for atomizing strings when parsing HTML
Wenson Hsieh
Reported 2021-09-03 18:02:58 PDT
.
Attachments
For EWS (6.63 KB, patch)
2021-09-05 13:26 PDT, Wenson Hsieh
no flags
Patch (6.92 KB, patch)
2021-09-06 11:38 PDT, Wenson Hsieh
no flags
Patch (7.00 KB, patch)
2021-09-07 08:30 PDT, Wenson Hsieh
no flags
v2 (18.43 KB, patch)
2021-09-07 15:27 PDT, Wenson Hsieh
no flags
For EWS (19.05 KB, patch)
2021-09-07 17:49 PDT, Wenson Hsieh
no flags
For EWS (18.92 KB, patch)
2021-09-07 21:36 PDT, Wenson Hsieh
no flags
Wenson Hsieh
Comment 1 2021-09-05 13:26:55 PDT
Cameron McCormack (:heycam)
Comment 2 2021-09-05 17:41:44 PDT
Comment on attachment 437365 [details] For EWS View in context: https://bugs.webkit.org/attachment.cgi?id=437365&action=review Looks good, so a non-reviewer r+ from me. > Source/WebCore/html/parser/AtomHTMLToken.h:215 > + static constexpr auto maxStringLengthForCache = 36; Might be worth a comment (here or in the ChangeLog entry) to say where 36 comes from. > Source/WebCore/html/parser/AtomHTMLToken.h:222 > + if (!WTF::equal(foundImpl.get(), string.data(), string.size())) { Use length rather than string.size(), or drop the length variable earlier. > Source/WebCore/html/parser/AtomHTMLToken.h:258 > + m_name = atomize<AtomStringType::TagOrAttributeName>(token.name()); I guess this will let us fast path one more atomization, since the name in the <!DOCTYPE> is almost certainly "html", and we'll later probably have an <html> and an </html> tag.
Wenson Hsieh
Comment 3 2021-09-06 11:27:05 PDT
Comment on attachment 437365 [details] For EWS View in context: https://bugs.webkit.org/attachment.cgi?id=437365&action=review Thanks for taking a look! >> Source/WebCore/html/parser/AtomHTMLToken.h:215 >> + static constexpr auto maxStringLengthForCache = 36; > > Might be worth a comment (here or in the ChangeLog entry) to say where 36 comes from. Good point — I'll add an explanation for this in the ChangeLog. I chose to add an upper bound longer strings are less likely in practice to be very frequently re-atomized, compared to shorter strings. This specific 36-character limit actually allows us to cache certain UUIDs in Speedometer that are repeatedly atomized during parsing, which significantly increases the cache hit rate in Speedometer. >> Source/WebCore/html/parser/AtomHTMLToken.h:222 >> + if (!WTF::equal(foundImpl.get(), string.data(), string.size())) { > > Use length rather than string.size(), or drop the length variable earlier. 👍🏻 Changed to use `length` here. >> Source/WebCore/html/parser/AtomHTMLToken.h:258 >> + m_name = atomize<AtomStringType::TagOrAttributeName>(token.name()); > > I guess this will let us fast path one more atomization, since the name in the <!DOCTYPE> is almost certainly "html", and we'll later probably have an <html> and an </html> tag. Indeed — while this part is not really critical for the performance boost, I figured it would be nice to just change all of the atomization call sites to use the `atomize<>` helper for consistency.
Wenson Hsieh
Comment 4 2021-09-06 11:38:04 PDT
Yusuke Suzuki
Comment 5 2021-09-07 07:23:44 PDT
Comment on attachment 437430 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=437430&action=review > Source/WebCore/html/parser/AtomHTMLToken.h:200 > + static NeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; Use `MainThreadNeverDestroyed<>` instead :)
Wenson Hsieh
Comment 6 2021-09-07 08:30:43 PDT
Wenson Hsieh
Comment 7 2021-09-07 08:30:58 PDT
(In reply to Yusuke Suzuki from comment #5) > Comment on attachment 437430 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=437430&action=review > > > Source/WebCore/html/parser/AtomHTMLToken.h:200 > > + static NeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; > > Use `MainThreadNeverDestroyed<>` instead :) Fixed — thanks!
Yusuke Suzuki
Comment 8 2021-09-07 11:33:17 PDT
Comment on attachment 437510 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=437510&action=review > Source/WebCore/html/parser/AtomHTMLToken.h:200 > + static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name. So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings. Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes).
Wenson Hsieh
Comment 9 2021-09-07 11:46:23 PDT
(In reply to Yusuke Suzuki from comment #8) > Comment on attachment 437510 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=437510&action=review > > > Source/WebCore/html/parser/AtomHTMLToken.h:200 > > + static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; > > One problem I found is that this table is caching arbitrary string. It is > possible that this string is very large and it is not know tag name. > So I think we need a good story of clearing this cache, otherwise, we will > potentially leak large strings. > Can you add some mechanism to make this cache cleared? (e.g. clearing this > when innerHTML finishes). I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/. Do you think that the upper-bound on cached String length (36) might be enough to mitigate this?
Yusuke Suzuki
Comment 10 2021-09-07 11:59:49 PDT
Comment on attachment 437510 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=437510&action=review >>> Source/WebCore/html/parser/AtomHTMLToken.h:200 >>> + static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; >> >> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name. >> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings. >> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes). > > I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/. > > Do you think that the upper-bound on cached String length (36) might be enough to mitigate this? I think adding an upperbound sounds good. And at least, we should clear this cache when memory warning happens. 1. Can you add length upperbound here? 2. Can you wire memory warning thing to this cache to clear things when warning happens? 3. And can you add FIXME+bugzilla URL about more periodic clearing? 4. What do you think about clearing this cache when GC timer is fired? (GCController.cpp in WebCore)
Yusuke Suzuki
Comment 11 2021-09-07 12:02:38 PDT
Comment on attachment 437510 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=437510&action=review >>>> Source/WebCore/html/parser/AtomHTMLToken.h:200 >>>> + static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; >>> >>> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name. >>> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings. >>> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes). >> >> I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/. >> >> Do you think that the upper-bound on cached String length (36) might be enough to mitigate this? > > I think adding an upperbound sounds good. And at least, we should clear this cache when memory warning happens. > > 1. Can you add length upperbound here? > 2. Can you wire memory warning thing to this cache to clear things when warning happens? > 3. And can you add FIXME+bugzilla URL about more periodic clearing? > 4. What do you think about clearing this cache when GC timer is fired? (GCController.cpp in WebCore) Maybe we should clear this cache when full GC happens.
Wenson Hsieh
Comment 12 2021-09-07 12:03:10 PDT
(In reply to Yusuke Suzuki from comment #10) > Comment on attachment 437510 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=437510&action=review > > >>> Source/WebCore/html/parser/AtomHTMLToken.h:200 > >>> + static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; > >> > >> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name. > >> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings. > >> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes). > > > > I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/. > > > > Do you think that the upper-bound on cached String length (36) might be enough to mitigate this? > > I think adding an upperbound sounds good. And at least, we should clear this > cache when memory warning happens. > > 1. Can you add length upperbound here? Ah, so this patch already enforces a 36-character upper limit on strings that are cached. > 2. Can you wire memory warning thing to this cache to clear things when > warning happens? Will do! > 3. And can you add FIXME+bugzilla URL about more periodic clearing? 👍🏻 > 4. What do you think about clearing this cache when GC timer is fired? > (GCController.cpp in WebCore) Yes, I think that makes sense! Will look into this too.
Wenson Hsieh
Comment 13 2021-09-07 14:25:54 PDT
(In reply to Yusuke Suzuki from comment #11) > Comment on attachment 437510 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=437510&action=review > > >>>> Source/WebCore/html/parser/AtomHTMLToken.h:200 > >>>> + static MainThreadNeverDestroyed<std::array<RefPtr<AtomStringImpl>, atomizedStringImplTableCapacity>> tables[2]; > >>> > >>> One problem I found is that this table is caching arbitrary string. It is possible that this string is very large and it is not know tag name. > >>> So I think we need a good story of clearing this cache, otherwise, we will potentially leak large strings. > >>> Can you add some mechanism to make this cache cleared? (e.g. clearing this when innerHTML finishes). > >> > >> I did think about clearing after `setInnerHTML`, but I was worried that that would hinder some of the performance benefit of using this approach :/. > >> > >> Do you think that the upper-bound on cached String length (36) might be enough to mitigate this? > > > > I think adding an upperbound sounds good. And at least, we should clear this cache when memory warning happens. > > > > 1. Can you add length upperbound here? > > 2. Can you wire memory warning thing to this cache to clear things when warning happens? > > 3. And can you add FIXME+bugzilla URL about more periodic clearing? > > 4. What do you think about clearing this cache when GC timer is fired? (GCController.cpp in WebCore) > > Maybe we should clear this cache when full GC happens. We discussed this in Slack, and think it might be sufficient to clear the cache when navigating.
Wenson Hsieh
Comment 14 2021-09-07 15:27:23 PDT
Yusuke Suzuki
Comment 15 2021-09-07 16:30:00 PDT
Comment on attachment 437562 [details] v2 View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review r=me > Source/WebCore/page/MemoryRelease.cpp:89 > + HTMLAtomStringCache::clear(); I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check? > Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:104 > + HTMLAtomStringCache::clear(); This runs only under the main thread, correct? If so, then looks good to me!
Wenson Hsieh
Comment 16 2021-09-07 16:45:17 PDT
Comment on attachment 437562 [details] v2 View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review Thanks for the review! >> Source/WebCore/page/MemoryRelease.cpp:89 >> + HTMLAtomStringCache::clear(); > > I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check? Yes, I think that's the case…it at a least looks like we can guarantee that isMainThread() is true, since there are some places in WebKitLegacy on iOS that call into `releaseMemory` through a block in `WebThreadRun`. But (IIUC) that should be okay. >> Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:104 >> + HTMLAtomStringCache::clear(); > > This runs only under the main thread, correct? If so, then looks good to me! Yes! This one definitely only runs on the (real) main thread.
Darin Adler
Comment 17 2021-09-07 16:59:29 PDT
Comment on attachment 437562 [details] v2 View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review > Source/WebCore/html/parser/HTMLAtomStringCache.h:36 > + enum class Type : uint8_t { TagOrAttributeName = 0, AttributeValue = 1 }; How about bool instead of uint8_t? Also, it’s so wordy. Can we have two overloads of get so the caller doesn’t have to write such a long identifier? The Type can be a secret implementation detail. > Source/WebCore/html/parser/HTMLAtomStringCache.h:39 > + ALWAYS_INLINE static AtomString get(const Vector<UChar, inlineCapacity>& string) I’d be tempted to call this "create" even though it does a "create if needed and return existing if available" operation. Usually "get" only means "return existing if available". Or "make", or "atom". > Source/WebCore/html/parser/HTMLAtomStringCache.h:49 > + auto firstChararacter = string[0]; > + auto lastChararacter = string[length - 1]; I suggest the traditional spelling for character; extra "ra" here. > Source/WebCore/html/parser/HTMLAtomStringCache.h:68 > + ALWAYS_INLINE static RefPtr<AtomStringImpl>& getImpl(Type type, UChar firstChararacter, UChar lastCharacter, UChar length) More misspelling here. > Source/WebCore/html/parser/HTMLAtomStringCache.h:73 > + unsigned hash = (firstChararacter << 6) ^ ((lastCharacter << 14) ^ firstChararacter); > + hash += (hash >> 14) + (length << 14); > + hash ^= hash << 14; > + return cache(type)[(hash + (hash >> 6)) % capacity]; Would be nice to comment what kind of hash function this is, and why it’s a good choice. >>> Source/WebCore/page/MemoryRelease.cpp:89 >>> + HTMLAtomStringCache::clear(); >> >> I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check? > > Yes, I think that's the case…it at a least looks like we can guarantee that isMainThread() is true, since there are some places in WebKitLegacy on iOS that call into `releaseMemory` through a block in `WebThreadRun`. But (IIUC) that should be okay. Is there something that will do ASSERT(isMainThread()) to help keep us honest? I don’t see anything explicit, but maybe something implicit? > Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:102 > +#endif // PLATFORM(IOS_FAMILY) I don’t think the #endif needs a comment when it is *so* close to the #if.
Wenson Hsieh
Comment 18 2021-09-07 17:38:29 PDT
Comment on attachment 437562 [details] v2 View in context: https://bugs.webkit.org/attachment.cgi?id=437562&action=review >> Source/WebCore/html/parser/HTMLAtomStringCache.h:36 >> + enum class Type : uint8_t { TagOrAttributeName = 0, AttributeValue = 1 }; > > How about bool instead of uint8_t? > > Also, it’s so wordy. Can we have two overloads of get so the caller doesn’t have to write such a long identifier? The Type can be a secret implementation detail. Sounds good — changed the `uint8_t` to `bool`, and made this enum private (with `makeTagOrAttributeName` and `makeAttributeValue` exposed as public methods instead). >> Source/WebCore/html/parser/HTMLAtomStringCache.h:39 >> + ALWAYS_INLINE static AtomString get(const Vector<UChar, inlineCapacity>& string) > > I’d be tempted to call this "create" even though it does a "create if needed and return existing if available" operation. Usually "get" only means "return existing if available". Or "make", or "atom". Good point — I'll go with `makeTagOrAttributeName` and `makeAttributeValue`. >> Source/WebCore/html/parser/HTMLAtomStringCache.h:49 >> + auto lastChararacter = string[length - 1]; > > I suggest the traditional spelling for character; extra "ra" here. Whoops! Fixed. >> Source/WebCore/html/parser/HTMLAtomStringCache.h:68 >> + ALWAYS_INLINE static RefPtr<AtomStringImpl>& getImpl(Type type, UChar firstChararacter, UChar lastCharacter, UChar length) > > More misspelling here. Fixed! >> Source/WebCore/html/parser/HTMLAtomStringCache.h:73 >> + return cache(type)[(hash + (hash >> 6)) % capacity]; > > Would be nice to comment what kind of hash function this is, and why it’s a good choice. 👍🏻 Added an explanation to the ChangeLog. >>>> Source/WebCore/page/MemoryRelease.cpp:89 >>>> + HTMLAtomStringCache::clear(); >>> >>> I believe this releaseNoncriticalMemory function is only running under the main thread, but can you double-check? >> >> Yes, I think that's the case…it at a least looks like we can guarantee that isMainThread() is true, since there are some places in WebKitLegacy on iOS that call into `releaseMemory` through a block in `WebThreadRun`. But (IIUC) that should be okay. > > Is there something that will do ASSERT(isMainThread()) to help keep us honest? I don’t see anything explicit, but maybe something implicit? That's a good question! It looks like anything here that would attempt to ref or deref RefCounted objects would trigger an ASSERT, if this were to fire when `!isMainThread()`. From code inspection, I think the call to `InlineStyleSheetOwner::clearCache();` above would trigger this as long as the style sheet cache is non-empty, since the values of that map are `RefPtr<StyleSheetContents>`. >> Source/WebCore/page/cocoa/MemoryReleaseCocoa.mm:102 >> +#endif // PLATFORM(IOS_FAMILY) > > I don’t think the #endif needs a comment when it is *so* close to the #if. True — removed the #endif.
Wenson Hsieh
Comment 19 2021-09-07 17:49:54 PDT
Darin Adler
Comment 20 2021-09-07 18:22:43 PDT
Comment on attachment 437574 [details] For EWS View in context: https://bugs.webkit.org/attachment.cgi?id=437574&action=review Pondered this a little more. > Source/WebCore/html/parser/HTMLAtomStringCache.h:56 > + enum class Type : bool { TagOrAttributeName = 0, AttributeValue = 1 }; Just a super-tiny nit pick in an already great patch: the "= 0" and "= 1" here aren’t really needed, even though we do use these booleans as array indices. For one thing, it doesn’t matter which is 0 and which is 1. And for another, converting enum to a size_t doesn’t somehow become more "well defined" if we specify numeric values for them. > Source/WebCore/html/parser/HTMLAtomStringCache.h:70 > + auto& foundImpl = getImpl(type, firstCharacter, lastCharacter, length); I probably would not use the word "impl" at all in the implementation; would look for different words. I think maybe I would name both the function and the local variable cacheSlot, cacheBucket, slot, or bucket. > Source/WebCore/html/parser/HTMLAtomStringCache.h:71 > + if (!WTF::equal(foundImpl.get(), string.data(), length)) { Surprised that you need to use the WTF:: prefix explicitly here. > Source/WebCore/html/parser/HTMLAtomStringCache.h:90 > + using Cache = std::array<RefPtr<AtomStringImpl>, capacity>; Any good reason the array elements are not AtomString rather than RefPtr<AtomStringImpl>?
Wenson Hsieh
Comment 21 2021-09-07 21:25:17 PDT
Comment on attachment 437574 [details] For EWS View in context: https://bugs.webkit.org/attachment.cgi?id=437574&action=review >> Source/WebCore/html/parser/HTMLAtomStringCache.h:56 >> + enum class Type : bool { TagOrAttributeName = 0, AttributeValue = 1 }; > > Just a super-tiny nit pick in an already great patch: the "= 0" and "= 1" here aren’t really needed, even though we do use these booleans as array indices. For one thing, it doesn’t matter which is 0 and which is 1. And for another, converting enum to a size_t doesn’t somehow become more "well defined" if we specify numeric values for them. Makes sense! Removed the =0 and =1. >> Source/WebCore/html/parser/HTMLAtomStringCache.h:70 >> + auto& foundImpl = getImpl(type, firstCharacter, lastCharacter, length); > > I probably would not use the word "impl" at all in the implementation; would look for different words. I think maybe I would name both the function and the local variable cacheSlot, cacheBucket, slot, or bucket. 👍🏻 Renamed `getImpl` to `cacheSlot` (and made it return an AtomString&), and also renamed `foundImpl` to just `slot`. >> Source/WebCore/html/parser/HTMLAtomStringCache.h:71 >> + if (!WTF::equal(foundImpl.get(), string.data(), length)) { > > Surprised that you need to use the WTF:: prefix explicitly here. Ah, I must've copied it from elsewhere but kept the WTF::. Removed. >> Source/WebCore/html/parser/HTMLAtomStringCache.h:90 >> + using Cache = std::array<RefPtr<AtomStringImpl>, capacity>; > > Any good reason the array elements are not AtomString rather than RefPtr<AtomStringImpl>? That's true, making this an AtomString directly should simplify this a bit. I've changed this to be a std::array of AtomString, and changed the `RefPtr<AtomStringImpl>& getImpl(…)` method above to just `AtomString& get(…)`.
Radar WebKit Bug Importer
Comment 22 2021-09-07 21:27:46 PDT
Wenson Hsieh
Comment 23 2021-09-07 21:36:14 PDT
EWS
Comment 24 2021-09-08 07:01:57 PDT
Committed r282142 (241439@main): <https://commits.webkit.org/241439@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 437590 [details].
Note You need to log in before you can comment on or make changes to this bug.