Bug 107337

Summary:

BackgroundHTMLParser should be able to atomize well-known strings

Product:

WebKit

Reporter:

Adam Barth <abarth>

Component:

New Bugs

Assignee:

Adam Barth <abarth>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

buildbot, dglazkov, eric, esprehn+autocc, gyuyoung.kim, nbarth, ojan.autocc, peter+ews, rakuco, rego+ews, rniwa, webkit-ews, webkit.review.bot, xan.lopez

Priority:

Version:

528+ (Nightly build)

Hardware:

Unspecified

OS:

Unspecified

Bug Depends on:

Bug Blocks:

111645

Attachments:

Description	Flags
Work in progress	none
Work in progress	none
alternative approach, does not work yet	none
Updated version of Adam's HTMLIdentifier patch	none
Updated version of Adam's HTMLIdentifier patch	none
updated, and more busticated version	none
Patch	none
Now with more build goop	none
Should now build with the threaded parser off	none
Fix Mac	none
Patch	none
Patch for landing	none
Patch for landing	none
Patch for landing	none

Adam Barth

Reported 2013-01-18 16:20:41 PST

BackgroundHTMLParser should be able to atomize well-known strings

Attachments
Work in progress (14.64 KB, patch) 2013-01-18 16:22 PST, Adam Barth	no flags	Details Formatted Diff Diff
Work in progress (14.77 KB, patch) 2013-01-18 17:03 PST, Adam Barth	no flags	Details Formatted Diff Diff
alternative approach, does not work yet (15.50 KB, patch) 2013-03-06 17:38 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Updated version of Adam's HTMLIdentifier patch (26.69 KB, patch) 2013-03-07 04:31 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Updated version of Adam's HTMLIdentifier patch (26.75 KB, patch) 2013-03-07 04:36 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
updated, and more busticated version (34.82 KB, patch) 2013-03-07 14:47 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch (40.00 KB, patch) 2013-03-07 18:35 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Now with more build goop (44.26 KB, patch) 2013-03-07 20:23 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Should now build with the threaded parser off (44.53 KB, patch) 2013-03-07 22:26 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Fix Mac (52.18 KB, patch) 2013-03-07 22:46 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch (49.75 KB, patch) 2013-03-07 23:07 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (50.05 KB, patch) 2013-03-08 15:59 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (50.05 KB, patch) 2013-03-08 16:23 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (49.79 KB, patch) 2013-03-08 16:43 PST, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Show Obsolete (13) View All Add attachment proposed patch, testcase, etc.

Adam Barth

Comment 1 2013-01-18 16:22:07 PST

Created attachment 183569 [details] Work in progress

Adam Barth

Comment 2 2013-01-18 17:02:26 PST

== Before patch == avg 464.0280999999959 ms median 429.24000000493834 ms stdev 134.56573926949213 ms == After patch == avg 475.42640000028763 ms median 440.3119999988121 ms stdev 135.6254289739386 ms Speedup: 2.4%

Adam Barth

Comment 3 2013-01-18 17:03:39 PST

Created attachment 183577 [details] Work in progress

Adam Barth

Comment 4 2013-01-19 11:08:01 PST

I had an idea for another approach. I'll test it out an report back.

Eric Seidel (no email)

Comment 5 2013-03-06 17:38:54 PST

Created attachment 191873 [details] alternative approach, does not work yet

Eric Seidel (no email)

Comment 6 2013-03-06 17:44:04 PST

This approach is not going to work with String::isSafeToSendToAnotherThread, as that only allows strings with a single ref to be sent. We'd need to make it smarter to also allow these cached strings to be sent. Possibly by checking the exact ref count to the refs we find in the chunk.

Eric Seidel (no email)

Comment 7 2013-03-06 23:18:49 PST

(In reply to comment #2) > Speedup: 2.4% Which benchmark was this?

Eric Seidel (no email)

Comment 8 2013-03-06 23:24:23 PST

Your earlier patch is very interesting. It has the potential to be significantly stronger than the one I took. It trades 4 bytes per string pointer on CompactToken for the potential to have less memory taken up by strings. It's probably the right tradeoff. I also like the fact that it can be pre-populated easily with HTMLNames.

Eric Seidel (no email)

Comment 9 2013-03-06 23:38:22 PST

Sorry, I kinda hijacked this bug with a patch solving a related (but different) problem. My patch addresses the idea of multiple of the same string appearing in tokens in the same chunk. Yours addresses using known constant strings (HTMLNames currently) in any chunk. Looking again, your patch is probably the quicker path to victory, so I may try and clean that up and re-post.

Eric Seidel (no email)

Comment 10 2013-03-06 23:41:11 PST

I originally thought that your approach would add new strings to a shared table, but that's not the case. Your patch only ever "uniques" strings which exist in HTMLNames::getHTMLTags(), which is probably strong enough for a first pass. It won't help us with common attribute values, or unknown tag names (so it's not as strong as a real AtomicString solution could be), but it's very simple and appears thread-safe.

Adam Barth

Comment 11 2013-03-07 00:02:14 PST

> > Speedup: 2.4% > > Which benchmark was this? I don't really remember, but I bet it was the threaded HTML parser benchmark that you wrote originally with the hack not to build render trees for display none iframes.

Eric Seidel (no email)

Comment 12 2013-03-07 03:41:45 PST

I updated your patch, will post shortly. Before: Running Parser/html-parser-threaded.html (1 of 1) RESULT Parser: html-parser-threaded: Time= 441.885400002 ms median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms RESULT Parser: html-parser-threaded: JSHeap= 3521735.4 bytes median= 3602932.0 bytes, stdev= 623509.897211 bytes, min= 2623300.0 bytes, max= 4377948.0 bytes RESULT Parser: html-parser-threaded: Malloc= 52747410.8 bytes median= 56256220.0 bytes, stdev= 11587237.5555 bytes, min= 36093864.0 bytes, max= 69862568.0 bytes Finished: 13.947144 s After: Running Parser/html-parser-threaded.html (1 of 1) RESULT Parser: html-parser-threaded: Time= 451.509050003 ms median= 451.107000001 ms, stdev= 5.71298626931 ms, min= 444.229000015 ms, max= 463.285000005 ms RESULT Parser: html-parser-threaded: JSHeap= 3534839.0 bytes median= 3602932.0 bytes, stdev= 638172.662467 bytes, min= 2623304.0 bytes, max= 4508988.0 bytes RESULT Parser: html-parser-threaded: Malloc= 55495976.4 bytes median= 57171872.0 bytes, stdev= 11837219.6543 bytes, min= 36895656.0 bytes, max= 70339216.0 bytes Finished: 13.771727 s We appear to be mallocing more after the patch? I must have doen something wrong.

Eric Seidel (no email)

Comment 13 2013-03-07 04:31:21 PST

Created attachment 191968 [details] Updated version of Adam's HTMLIdentifier patch

Eric Seidel (no email)

Comment 14 2013-03-07 04:36:40 PST

Created attachment 191971 [details] Updated version of Adam's HTMLIdentifier patch

Eric Seidel (no email)

Comment 15 2013-03-07 04:37:56 PST

My limited testing seems to indicate that this patch is actually a slowdown for Parser/html-parser-threaded.html (with the display: none iframes patch applied) I'll look more tomorrow, there may be something obvious I'm missing.

Adam Barth

Comment 16 2013-03-07 13:13:01 PST

The data from Comment #2 is super duper old.

Eric Seidel (no email)

Comment 17 2013-03-07 13:20:40 PST

(In reply to comment #16) > The data from Comment #2 is super duper old. I think my 5am patch is super-duper broken still. :) Did find a bunch of related wins during the search however. Including bug 111708. :) I will look again shortly.

Eric Seidel (no email)

Comment 18 2013-03-07 14:47:52 PST

Created attachment 192087 [details] updated, and more busticated version

Eric Seidel (no email)

Comment 19 2013-03-07 18:35:07 PST

Created attachment 192124 [details] Patch

Early Warning System Bot

Comment 20 2013-03-07 18:45:52 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass qt-ews (qt): Output: http://webkit-commit-queue.appspot.com/results/17009461

Early Warning System Bot

Comment 21 2013-03-07 18:48:23 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass qt-wk2-ews (qt): Output: http://webkit-commit-queue.appspot.com/results/16987302

WebKit Review Bot

Comment 22 2013-03-07 19:03:08 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass cr-linux-debug-ews (chromium-xvfb): Output: http://webkit-commit-queue.appspot.com/results/17037250

Build Bot

Comment 23 2013-03-07 19:20:12 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass win-ews (win): Output: http://webkit-commit-queue.appspot.com/results/17081269

Peter Beverloo (cr-android ews)

Comment 24 2013-03-07 19:21:13 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass cr-android-ews (chromium-android): Output: http://webkit-commit-queue.appspot.com/results/16992477

EFL EWS Bot

Comment 25 2013-03-07 19:26:50 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass efl-ews (efl): Output: http://webkit-commit-queue.appspot.com/results/17009478

Eric Seidel (no email)

Comment 26 2013-03-07 20:03:08 PST

Oh, thats right. We have non-gyp ports. :p

Build Bot

Comment 27 2013-03-07 20:03:30 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass mac-wk2-ews (mac-wk2): Output: http://webkit-commit-queue.appspot.com/results/16992502

Eric Seidel (no email)

Comment 28 2013-03-07 20:03:38 PST

Could I get some code commentary before I got through the 8 rounds of EWS fixups? :)

WebKit Review Bot

Comment 29 2013-03-07 20:07:45 PST

Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass chromium-ews (chromium-xvfb): Output: http://webkit-commit-queue.appspot.com/results/16992504

Eric Seidel (no email)

Comment 30 2013-03-07 20:08:28 PST

Chromium linux, I thought you had my back. I feel betrayed.

Eric Seidel (no email)

Comment 31 2013-03-07 20:23:29 PST

Created attachment 192136 [details] Now with more build goop

Early Warning System Bot

Comment 32 2013-03-07 20:30:21 PST

Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass qt-wk2-ews (qt): Output: http://webkit-commit-queue.appspot.com/results/17066282

Early Warning System Bot

Comment 33 2013-03-07 20:31:05 PST

Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass qt-ews (qt): Output: http://webkit-commit-queue.appspot.com/results/17054334

EFL EWS Bot

Comment 34 2013-03-07 20:32:38 PST

Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass efl-ews (efl): Output: http://webkit-commit-queue.appspot.com/results/16987342

Build Bot

Comment 35 2013-03-07 20:59:54 PST

Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass win-ews (win): Output: http://webkit-commit-queue.appspot.com/results/17087166

Eric Seidel (no email)

Comment 36 2013-03-07 22:26:10 PST

Created attachment 192146 [details] Should now build with the threaded parser off

Eric Seidel (no email)

Comment 37 2013-03-07 22:46:53 PST

Created attachment 192149 [details] Fix Mac

Eric Seidel (no email)

Comment 38 2013-03-07 23:04:00 PST

Oh, XCode. You're so fun.

Eric Seidel (no email)

Comment 39 2013-03-07 23:07:38 PST

Created attachment 192152 [details] Patch

Adam Barth

Comment 40 2013-03-08 11:11:17 PST

Comment on attachment 192152 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=192152&action=review > Source/WebCore/ChangeLog:20 > + median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms > + to: > + median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms You're reporting the median, but not the mean. We usually work with the mean. > Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125 > + // FIXME: Using CaseFoldingHash::equal instead of equalIgnoringCase, as equalIgnoringCase > + // wants non-const StringImpl* (even though it never modifies them). Maybe we should change equalIgnoringCase to use a const StringImpl* ? > Source/WebCore/html/parser/HTMLIdentifier.cpp:49 > + return table; Should we ASSERT(isMainThread() || !table.isEmpty()) ? > Source/WebCore/html/parser/HTMLIdentifier.cpp:126 > + // We expect some hash collisions, but only for identical strings. > + // Since all of these names are AtomicStrings pointers should be equal. > + ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second); Is this just dumb luck? What happens when we happen to have two HTML tag names with the same hash value? We should at least have a comment here that explain what to do when that happens. > Source/WebCore/html/parser/HTMLIdentifier.cpp:133 > + // This is safe to call from any thread, but must be called before > + // using HTMLIdentifier. I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway. > Source/WebCore/html/parser/HTMLIdentifier.h:51 > + if (vector.size() <= maxNameLength) { Why not put this branch into findIndex? That would simply this code dramatically. > Source/WebCore/html/parser/HTMLIdentifier.h:62 > + if (width == Likely8Bit) > + m_string = StringImpl::create8BitIfPossible(vector); > + else if (width == Force8Bit) > + m_string = String::make8BitFrom16BitSource(vector); > + else > + m_string = String(vector); This work should probably be done by String. It's useful in other places. After these two changes, this function just becomes: m_index = findIndex(vector.data(), vector.size()); if (m_index != invalidIndex) return; m_string = String(vector, width); > Source/WebCore/html/parser/HTMLParserIdioms.cpp:286 > + // Note: This hash compare is not correct for all possible StringImpls, but > + // since we're always comparing against constants, it's OK. Why is it not correct? How can two strings be equal if their hashes aren't equal?

Eric Seidel (no email)

Comment 41 2013-03-08 15:53:17 PST

Comment on attachment 192152 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=192152&action=review >> Source/WebCore/ChangeLog:20 >> + median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms > > You're reporting the median, but not the mean. We usually work with the mean. I was just copying the output from run-perf-tests. Filed https://bugs.webkit.org/show_bug.cgi?id=111895 >> Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125 >> + // wants non-const StringImpl* (even though it never modifies them). > > Maybe we should change equalIgnoringCase to use a const StringImpl* ? I considered it. I worried it would bloat this change unnecessarily. I've filed: https://bugs.webkit.org/show_bug.cgi?id=111892 and have updated the comment to say so. >> Source/WebCore/html/parser/HTMLIdentifier.cpp:49 >> + return table; > > Should we ASSERT(isMainThread() || !table.isEmpty()) ? Done. >> Source/WebCore/html/parser/HTMLIdentifier.cpp:126 >> + ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second); > > Is this just dumb luck? What happens when we happen to have two HTML tag names with the same hash value? We should at least have a comment here that explain what to do when that happens. Yup. :) There aren't that many strings in this table, so it doesn't shock me that they all have unique hashes. I've added a note for anyone hitting this in the future. >> Source/WebCore/html/parser/HTMLIdentifier.cpp:133 >> + // using HTMLIdentifier. > > I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway. Done. >> Source/WebCore/html/parser/HTMLIdentifier.h:51 >> + if (vector.size() <= maxNameLength) { > > Why not put this branch into findIndex? That would simply this code dramatically. Duh. Fixed. I think I was trying to avoid even calling findIndex in the case where we knew we were going to malloc. But the malloc should overshadow this function call overhead. :) >> Source/WebCore/html/parser/HTMLIdentifier.h:62 >> + m_string = String(vector); > > This work should probably be done by String. It's useful in other places. > > After these two changes, this function just becomes: > > m_index = findIndex(vector.data(), vector.size()); > if (m_index != invalidIndex) > return; > m_string = String(vector, width); Excellent idea, but a much more involved change (as we need to handle more than just UChar as a char type). Filed bug 111896. >> Source/WebCore/html/parser/HTMLParserIdioms.cpp:286 >> + // since we're always comparing against constants, it's OK. > > Why is it not correct? How can two strings be equal if their hashes aren't equal? I think it was this comment which scared me into writing that one: http://trac.webkit.org/browser/trunk/Source/WTF/wtf/text/StringImpl.h#L312 I'm happy to remove it.

Eric Seidel (no email)

Comment 42 2013-03-08 15:59:12 PST

Created attachment 192300 [details] Patch for landing

WebKit Review Bot

Comment 43 2013-03-08 16:01:52 PST

Comment on attachment 192300 [details] Patch for landing Rejecting attachment 192300 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192300, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue Last 500 characters of output: patching file Source/WebCore/html/parser/HTMLIdentifier.cpp patching file Source/WebCore/html/parser/HTMLIdentifier.h patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp patching file Source/WebCore/html/parser/HTMLParserIdioms.h patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp patching file Source/WebCore/html/parser/HTMLPreloadScanner.h Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue Full output: http://webkit-commit-queue.appspot.com/results/17109259

Eric Seidel (no email)

Comment 44 2013-03-08 16:23:43 PST

Created attachment 192304 [details] Patch for landing

WebKit Review Bot

Comment 45 2013-03-08 16:26:34 PST

Comment on attachment 192304 [details] Patch for landing Rejecting attachment 192304 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192304, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue Last 500 characters of output: patching file Source/WebCore/html/parser/HTMLIdentifier.cpp patching file Source/WebCore/html/parser/HTMLIdentifier.h patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp patching file Source/WebCore/html/parser/HTMLParserIdioms.h patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp patching file Source/WebCore/html/parser/HTMLPreloadScanner.h Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue Full output: http://webkit-commit-queue.appspot.com/results/17047463

Eric Seidel (no email)

Comment 46 2013-03-08 16:43:54 PST

Created attachment 192309 [details] Patch for landing

WebKit Review Bot

Comment 47 2013-03-08 17:08:47 PST

Comment on attachment 192309 [details] Patch for landing Clearing flags on attachment: 192309 Committed r145292: <http://trac.webkit.org/changeset/145292>

WebKit Review Bot

Comment 48 2013-03-08 17:08:55 PST

All reviewed patches have been landed. Closing bug.

Nils Barth

Comment 49 2013-03-13 00:01:30 PDT

Just a note that this patch looks like it's sped up Page Cycler Moz on Win7. Note the step down in time take after revision 145280: http://chromium-perf.appspot.com/?tab=chromium-rel-win7-webkit&graph=times&trace=t_extcs1&rev=187389&history=80&master=ChromiumWebkit&testSuite=page_cycler_moz&details=true http://chromium-perf.appspot.com/?tab=chromium-rel-win7-webkit&graph=times&trace=t_extwr&rev=187389&history=80&master=ChromiumWebkit&testSuite=page_cycler_moz&details=true ...so looks like some confirmation that it's doing some good!

Adam Barth

Comment 50 2013-03-13 01:30:44 PDT

That's exciting. Thank you for letting us know!

Eric Seidel (no email)

Comment 51 2013-03-13 13:33:13 PDT

Thank you very much for the update Nils!

Note You need to log in before you can comment on or make changes to this bug.