Bug 228858

Summary: white space atomization during parsing is expensive
Product: WebKit Reporter: Cameron McCormack (:heycam) <heycam>
Component: DOMAssignee: Cameron McCormack (:heycam) <heycam>
Status: RESOLVED FIXED    
Severity: Normal CC: cdumez, changseok, esprehn+autocc, ews-watchlist, gyuyoung.kim, kangil.han, pvollan, webkit-bug-importer, ysuzuki
Priority: P2 Keywords: InRadar
Version: WebKit Local Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch none

Description Cameron McCormack (:heycam) 2021-08-05 22:41:56 PDT
We have a memory optimization where the HTML parser will atomize any text node string that is all white space.  The process for this is a bit expensive, since we must loop over all the characters in the string three times:

* First, to check that all the characters are white space
* Second, to hash the string when looking up the atom hash table
* Third, to check the string for equality with any existing atom hash table entry

Most white space strings we encounter have a limited form -- they have at most three or four runs of consecutive equal white space characters, e.g. it's common to see a newline followed by a number of space characters.  We can take advantage of this by compressing the white space string into a simple run-length encoded form while we check that the string is entirely white space.  If we keep a cache of recently atomized white space strings that can be quickly looked up, keyed off the encoded form, we can re-use a previous result of atomizing an identical string and avoid the hashing and hash entry equality checks.

I have a WIP patch for this that is showing a 1% improvement on Speedometer 2 overall (due to 2-4% improvements on a few of the subtests) and no change to PLT5 on my local machine.
Comment 1 Radar WebKit Bug Importer 2021-08-05 22:42:07 PDT
<rdar://problem/81601565>
Comment 2 Cameron McCormack (:heycam) 2021-08-06 01:03:56 PDT
Created attachment 435057 [details]
Patch
Comment 3 Cameron McCormack (:heycam) 2021-08-07 17:10:47 PDT
Created attachment 435138 [details]
Patch
Comment 4 Yusuke Suzuki 2021-08-08 23:19:12 PDT
Comment on attachment 435138 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=435138&action=review

r=me

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:858
> +        code |= (end - startOfRun);

It should be `character - startOfRun`.
Comment 5 Cameron McCormack (:heycam) 2021-08-08 23:52:34 PDT
Created attachment 435169 [details]
Patch
Comment 6 EWS 2021-08-09 04:05:13 PDT
Committed r280772 (240356@main): <https://commits.webkit.org/240356@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 435169 [details].
Comment 7 Yusuke Suzuki 2021-08-09 05:35:22 PDT
Committed r280773 (240357@main): <https://commits.webkit.org/240357@main>
Comment 8 Per Arne Vollan 2021-08-19 18:37:29 PDT
Comment on attachment 435169 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=435169&action=review

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:896
> +        WTFLogAlways("reuse code %llx", code);

Will this create a lot of logging?

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:901
> +        WTFLogAlways("override");

Ditto.

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:906
> +        WTFLogAlways("replace code %llx", code);

Ditto.

> Source/WebCore/html/parser/HTMLConstructionSite.cpp:913
> +    WTFLogAlways("new code %llx", code);

Ditto.
Comment 9 Cameron McCormack (:heycam) 2021-08-19 18:44:07 PDT
(In reply to Per Arne Vollan from comment #8)
> Comment on attachment 435169 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=435169&action=review
> 
> > Source/WebCore/html/parser/HTMLConstructionSite.cpp:896
> > +        WTFLogAlways("reuse code %llx", code);
> 
> Will this create a lot of logging?

Yes, Yusuke removed them shortly after this patch landed.