Bug 107337 - BackgroundHTMLParser should be able to atomize well-known strings
Summary: BackgroundHTMLParser should be able to atomize well-known strings
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Adam Barth
URL:
Keywords:
Depends on:
Blocks: 111645
  Show dependency treegraph
 
Reported: 2013-01-18 16:20 PST by Adam Barth
Modified: 2013-03-13 13:33 PDT (History)
14 users (show)

See Also:


Attachments
Work in progress (14.64 KB, patch)
2013-01-18 16:22 PST, Adam Barth
no flags Details | Formatted Diff | Diff
Work in progress (14.77 KB, patch)
2013-01-18 17:03 PST, Adam Barth
no flags Details | Formatted Diff | Diff
alternative approach, does not work yet (15.50 KB, patch)
2013-03-06 17:38 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Updated version of Adam's HTMLIdentifier patch (26.69 KB, patch)
2013-03-07 04:31 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Updated version of Adam's HTMLIdentifier patch (26.75 KB, patch)
2013-03-07 04:36 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
updated, and more busticated version (34.82 KB, patch)
2013-03-07 14:47 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch (40.00 KB, patch)
2013-03-07 18:35 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Now with more build goop (44.26 KB, patch)
2013-03-07 20:23 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Should now build with the threaded parser off (44.53 KB, patch)
2013-03-07 22:26 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Fix Mac (52.18 KB, patch)
2013-03-07 22:46 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch (49.75 KB, patch)
2013-03-07 23:07 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch for landing (50.05 KB, patch)
2013-03-08 15:59 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch for landing (50.05 KB, patch)
2013-03-08 16:23 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
Patch for landing (49.79 KB, patch)
2013-03-08 16:43 PST, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Barth 2013-01-18 16:20:41 PST
BackgroundHTMLParser should be able to atomize well-known strings
Comment 1 Adam Barth 2013-01-18 16:22:07 PST
Created attachment 183569 [details]
Work in progress
Comment 2 Adam Barth 2013-01-18 17:02:26 PST
== Before patch ==

avg 464.0280999999959 ms
median 429.24000000493834 ms
stdev 134.56573926949213 ms

== After patch ==

avg 475.42640000028763 ms
median 440.3119999988121 ms
stdev 135.6254289739386 ms

Speedup: 2.4%
Comment 3 Adam Barth 2013-01-18 17:03:39 PST
Created attachment 183577 [details]
Work in progress
Comment 4 Adam Barth 2013-01-19 11:08:01 PST
I had an idea for another approach.  I'll test it out an report back.
Comment 5 Eric Seidel (no email) 2013-03-06 17:38:54 PST
Created attachment 191873 [details]
alternative approach, does not work yet
Comment 6 Eric Seidel (no email) 2013-03-06 17:44:04 PST
This approach is not going to work with String::isSafeToSendToAnotherThread, as that only allows strings with a single ref to be sent.  We'd need to make it smarter to also allow these cached strings to be sent.  Possibly by checking the exact ref count to the refs we find in the chunk.
Comment 7 Eric Seidel (no email) 2013-03-06 23:18:49 PST
(In reply to comment #2)
> Speedup: 2.4%

Which benchmark was this?
Comment 8 Eric Seidel (no email) 2013-03-06 23:24:23 PST
Your earlier patch is very interesting.  It has the potential to be significantly stronger than the one I took.  It trades 4 bytes per string pointer on CompactToken for the potential to have less memory taken up by strings.  It's probably the right tradeoff.  I also like the fact that it can be pre-populated easily with HTMLNames.
Comment 9 Eric Seidel (no email) 2013-03-06 23:38:22 PST
Sorry, I kinda hijacked this bug with a patch solving a related (but different) problem.

My patch addresses the idea of multiple of the same string appearing in tokens in the same chunk.

Yours addresses using known constant strings (HTMLNames currently) in any chunk.

Looking again, your patch is probably the quicker path to victory, so I may try and clean that up and re-post.
Comment 10 Eric Seidel (no email) 2013-03-06 23:41:11 PST
I originally thought that your approach would add new strings to a shared table, but that's not the case. Your patch only ever "uniques" strings which exist in HTMLNames::getHTMLTags(), which is probably strong enough for a first pass.

It won't help us with common attribute values, or unknown tag names (so it's not as strong as a real AtomicString solution could be), but it's very simple and appears thread-safe.
Comment 11 Adam Barth 2013-03-07 00:02:14 PST
> > Speedup: 2.4%
> 
> Which benchmark was this?

I don't really remember, but I bet it was the threaded HTML parser benchmark that you wrote originally with the hack not to build render trees for display none iframes.
Comment 12 Eric Seidel (no email) 2013-03-07 03:41:45 PST
I updated your patch, will post shortly.

Before:

Running Parser/html-parser-threaded.html (1 of 1)
RESULT Parser: html-parser-threaded: Time= 441.885400002 ms
median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms
RESULT Parser: html-parser-threaded: JSHeap= 3521735.4 bytes
median= 3602932.0 bytes, stdev= 623509.897211 bytes, min= 2623300.0 bytes, max= 4377948.0 bytes
RESULT Parser: html-parser-threaded: Malloc= 52747410.8 bytes
median= 56256220.0 bytes, stdev= 11587237.5555 bytes, min= 36093864.0 bytes, max= 69862568.0 bytes
Finished: 13.947144 s

After:

Running Parser/html-parser-threaded.html (1 of 1)
RESULT Parser: html-parser-threaded: Time= 451.509050003 ms
median= 451.107000001 ms, stdev= 5.71298626931 ms, min= 444.229000015 ms, max= 463.285000005 ms
RESULT Parser: html-parser-threaded: JSHeap= 3534839.0 bytes
median= 3602932.0 bytes, stdev= 638172.662467 bytes, min= 2623304.0 bytes, max= 4508988.0 bytes
RESULT Parser: html-parser-threaded: Malloc= 55495976.4 bytes
median= 57171872.0 bytes, stdev= 11837219.6543 bytes, min= 36895656.0 bytes, max= 70339216.0 bytes
Finished: 13.771727 s

We appear to be mallocing more after the patch?  I must have doen something wrong.
Comment 13 Eric Seidel (no email) 2013-03-07 04:31:21 PST
Created attachment 191968 [details]
Updated version of Adam's HTMLIdentifier patch
Comment 14 Eric Seidel (no email) 2013-03-07 04:36:40 PST
Created attachment 191971 [details]
Updated version of Adam's HTMLIdentifier patch
Comment 15 Eric Seidel (no email) 2013-03-07 04:37:56 PST
My limited testing seems to indicate that this patch is actually a slowdown for Parser/html-parser-threaded.html (with the display: none iframes patch applied)  I'll look more tomorrow, there may be something obvious I'm missing.
Comment 16 Adam Barth 2013-03-07 13:13:01 PST
The data from Comment #2 is super duper old.
Comment 17 Eric Seidel (no email) 2013-03-07 13:20:40 PST
(In reply to comment #16)
> The data from Comment #2 is super duper old.

I think my 5am patch is super-duper broken still. :)  Did find a bunch of related wins during the search however.  Including bug 111708. :)  I will look again shortly.
Comment 18 Eric Seidel (no email) 2013-03-07 14:47:52 PST
Created attachment 192087 [details]
updated, and more busticated version
Comment 19 Eric Seidel (no email) 2013-03-07 18:35:07 PST
Created attachment 192124 [details]
Patch
Comment 20 Early Warning System Bot 2013-03-07 18:45:52 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass qt-ews (qt):
Output: http://webkit-commit-queue.appspot.com/results/17009461
Comment 21 Early Warning System Bot 2013-03-07 18:48:23 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass qt-wk2-ews (qt):
Output: http://webkit-commit-queue.appspot.com/results/16987302
Comment 22 WebKit Review Bot 2013-03-07 19:03:08 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass cr-linux-debug-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/17037250
Comment 23 Build Bot 2013-03-07 19:20:12 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass win-ews (win):
Output: http://webkit-commit-queue.appspot.com/results/17081269
Comment 24 Peter Beverloo (cr-android ews) 2013-03-07 19:21:13 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass cr-android-ews (chromium-android):
Output: http://webkit-commit-queue.appspot.com/results/16992477
Comment 25 EFL EWS Bot 2013-03-07 19:26:50 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass efl-ews (efl):
Output: http://webkit-commit-queue.appspot.com/results/17009478
Comment 26 Eric Seidel (no email) 2013-03-07 20:03:08 PST
Oh, thats right.  We have non-gyp ports. :p
Comment 27 Build Bot 2013-03-07 20:03:30 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-commit-queue.appspot.com/results/16992502
Comment 28 Eric Seidel (no email) 2013-03-07 20:03:38 PST
Could I get some code commentary before I got through the 8 rounds of EWS fixups? :)
Comment 29 WebKit Review Bot 2013-03-07 20:07:45 PST
Comment on attachment 192124 [details]
Patch

Attachment 192124 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://webkit-commit-queue.appspot.com/results/16992504
Comment 30 Eric Seidel (no email) 2013-03-07 20:08:28 PST
Chromium linux, I thought you had my back.  I feel betrayed.
Comment 31 Eric Seidel (no email) 2013-03-07 20:23:29 PST
Created attachment 192136 [details]
Now with more build goop
Comment 32 Early Warning System Bot 2013-03-07 20:30:21 PST
Comment on attachment 192136 [details]
Now with more build goop

Attachment 192136 [details] did not pass qt-wk2-ews (qt):
Output: http://webkit-commit-queue.appspot.com/results/17066282
Comment 33 Early Warning System Bot 2013-03-07 20:31:05 PST
Comment on attachment 192136 [details]
Now with more build goop

Attachment 192136 [details] did not pass qt-ews (qt):
Output: http://webkit-commit-queue.appspot.com/results/17054334
Comment 34 EFL EWS Bot 2013-03-07 20:32:38 PST
Comment on attachment 192136 [details]
Now with more build goop

Attachment 192136 [details] did not pass efl-ews (efl):
Output: http://webkit-commit-queue.appspot.com/results/16987342
Comment 35 Build Bot 2013-03-07 20:59:54 PST
Comment on attachment 192136 [details]
Now with more build goop

Attachment 192136 [details] did not pass win-ews (win):
Output: http://webkit-commit-queue.appspot.com/results/17087166
Comment 36 Eric Seidel (no email) 2013-03-07 22:26:10 PST
Created attachment 192146 [details]
Should now build with the threaded parser off
Comment 37 Eric Seidel (no email) 2013-03-07 22:46:53 PST
Created attachment 192149 [details]
Fix Mac
Comment 38 Eric Seidel (no email) 2013-03-07 23:04:00 PST
Oh, XCode.  You're so fun.
Comment 39 Eric Seidel (no email) 2013-03-07 23:07:38 PST
Created attachment 192152 [details]
Patch
Comment 40 Adam Barth 2013-03-08 11:11:17 PST
Comment on attachment 192152 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=192152&action=review

> Source/WebCore/ChangeLog:20
> +        median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms
> +        to:
> +        median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms

You're reporting the median, but not the mean.  We usually work with the mean.

> Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125
> +    // FIXME: Using CaseFoldingHash::equal instead of equalIgnoringCase, as equalIgnoringCase
> +    // wants non-const StringImpl* (even though it never modifies them).

Maybe we should change equalIgnoringCase to use a const StringImpl* ?

> Source/WebCore/html/parser/HTMLIdentifier.cpp:49
> +    return table;

Should we ASSERT(isMainThread() || !table.isEmpty()) ?

> Source/WebCore/html/parser/HTMLIdentifier.cpp:126
> +        // We expect some hash collisions, but only for identical strings.
> +        // Since all of these names are AtomicStrings pointers should be equal.
> +        ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second);

Is this just dumb luck?  What happens when we happen to have two HTML tag names with the same hash value?  We should at least have a comment here that explain what to do when that happens.

> Source/WebCore/html/parser/HTMLIdentifier.cpp:133
> +    // This is safe to call from any thread, but must be called before
> +    // using HTMLIdentifier.

I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway.

> Source/WebCore/html/parser/HTMLIdentifier.h:51
> +        if (vector.size() <= maxNameLength) {

Why not put this branch into findIndex?  That would simply this code dramatically.

> Source/WebCore/html/parser/HTMLIdentifier.h:62
> +        if (width == Likely8Bit)
> +            m_string = StringImpl::create8BitIfPossible(vector);
> +        else if (width == Force8Bit)
> +            m_string = String::make8BitFrom16BitSource(vector);
> +        else
> +            m_string = String(vector);

This work should probably be done by String.  It's useful in other places.

After these two changes, this function just becomes:

m_index = findIndex(vector.data(), vector.size());
if (m_index != invalidIndex)
    return;
m_string = String(vector, width);

> Source/WebCore/html/parser/HTMLParserIdioms.cpp:286
> +    // Note: This hash compare is not correct for all possible StringImpls, but
> +    // since we're always comparing against constants, it's OK.

Why is it not correct?  How can two strings be equal if their hashes aren't equal?
Comment 41 Eric Seidel (no email) 2013-03-08 15:53:17 PST
Comment on attachment 192152 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=192152&action=review

>> Source/WebCore/ChangeLog:20
>> +        median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms
> 
> You're reporting the median, but not the mean.  We usually work with the mean.

I was just copying the output from run-perf-tests.  Filed https://bugs.webkit.org/show_bug.cgi?id=111895

>> Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125
>> +    // wants non-const StringImpl* (even though it never modifies them).
> 
> Maybe we should change equalIgnoringCase to use a const StringImpl* ?

I considered it.  I worried it would bloat this change unnecessarily.  I've filed:
https://bugs.webkit.org/show_bug.cgi?id=111892
and have updated the comment to say so.

>> Source/WebCore/html/parser/HTMLIdentifier.cpp:49
>> +    return table;
> 
> Should we ASSERT(isMainThread() || !table.isEmpty()) ?

Done.

>> Source/WebCore/html/parser/HTMLIdentifier.cpp:126
>> +        ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second);
> 
> Is this just dumb luck?  What happens when we happen to have two HTML tag names with the same hash value?  We should at least have a comment here that explain what to do when that happens.

Yup. :)  There aren't that many strings in this table, so it doesn't shock me that they all have unique hashes.  I've added a note for anyone hitting this in the future.

>> Source/WebCore/html/parser/HTMLIdentifier.cpp:133
>> +    // using HTMLIdentifier.
> 
> I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway.

Done.

>> Source/WebCore/html/parser/HTMLIdentifier.h:51
>> +        if (vector.size() <= maxNameLength) {
> 
> Why not put this branch into findIndex?  That would simply this code dramatically.

Duh.  Fixed.  I think I was trying to avoid even calling findIndex in the case where we knew we were going to malloc.  But the malloc should overshadow this function call overhead. :)

>> Source/WebCore/html/parser/HTMLIdentifier.h:62
>> +            m_string = String(vector);
> 
> This work should probably be done by String.  It's useful in other places.
> 
> After these two changes, this function just becomes:
> 
> m_index = findIndex(vector.data(), vector.size());
> if (m_index != invalidIndex)
>     return;
> m_string = String(vector, width);

Excellent idea, but a much more involved change (as we need to handle more than just UChar as a char type).  Filed bug 111896.

>> Source/WebCore/html/parser/HTMLParserIdioms.cpp:286
>> +    // since we're always comparing against constants, it's OK.
> 
> Why is it not correct?  How can two strings be equal if their hashes aren't equal?

I think it was this comment which scared me into writing that one:
http://trac.webkit.org/browser/trunk/Source/WTF/wtf/text/StringImpl.h#L312
I'm happy to remove it.
Comment 42 Eric Seidel (no email) 2013-03-08 15:59:12 PST
Created attachment 192300 [details]
Patch for landing
Comment 43 WebKit Review Bot 2013-03-08 16:01:52 PST
Comment on attachment 192300 [details]
Patch for landing

Rejecting attachment 192300 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192300, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue

Last 500 characters of output:

patching file Source/WebCore/html/parser/HTMLIdentifier.cpp
patching file Source/WebCore/html/parser/HTMLIdentifier.h
patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp
patching file Source/WebCore/html/parser/HTMLParserIdioms.h
patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp
patching file Source/WebCore/html/parser/HTMLPreloadScanner.h

Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue

Full output: http://webkit-commit-queue.appspot.com/results/17109259
Comment 44 Eric Seidel (no email) 2013-03-08 16:23:43 PST
Created attachment 192304 [details]
Patch for landing
Comment 45 WebKit Review Bot 2013-03-08 16:26:34 PST
Comment on attachment 192304 [details]
Patch for landing

Rejecting attachment 192304 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192304, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue

Last 500 characters of output:

patching file Source/WebCore/html/parser/HTMLIdentifier.cpp
patching file Source/WebCore/html/parser/HTMLIdentifier.h
patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp
patching file Source/WebCore/html/parser/HTMLParserIdioms.h
patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp
patching file Source/WebCore/html/parser/HTMLPreloadScanner.h

Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue

Full output: http://webkit-commit-queue.appspot.com/results/17047463
Comment 46 Eric Seidel (no email) 2013-03-08 16:43:54 PST
Created attachment 192309 [details]
Patch for landing
Comment 47 WebKit Review Bot 2013-03-08 17:08:47 PST
Comment on attachment 192309 [details]
Patch for landing

Clearing flags on attachment: 192309

Committed r145292: <http://trac.webkit.org/changeset/145292>
Comment 48 WebKit Review Bot 2013-03-08 17:08:55 PST
All reviewed patches have been landed.  Closing bug.
Comment 50 Adam Barth 2013-03-13 01:30:44 PDT
That's exciting.  Thank you for letting us know!
Comment 51 Eric Seidel (no email) 2013-03-13 13:33:13 PDT
Thank you very much for the update Nils!