RESOLVED FIXED 107337
BackgroundHTMLParser should be able to atomize well-known strings
https://bugs.webkit.org/show_bug.cgi?id=107337
Summary BackgroundHTMLParser should be able to atomize well-known strings
Adam Barth
Reported 2013-01-18 16:20:41 PST
BackgroundHTMLParser should be able to atomize well-known strings
Attachments
Work in progress (14.64 KB, patch)
2013-01-18 16:22 PST, Adam Barth
no flags
Work in progress (14.77 KB, patch)
2013-01-18 17:03 PST, Adam Barth
no flags
alternative approach, does not work yet (15.50 KB, patch)
2013-03-06 17:38 PST, Eric Seidel (no email)
no flags
Updated version of Adam's HTMLIdentifier patch (26.69 KB, patch)
2013-03-07 04:31 PST, Eric Seidel (no email)
no flags
Updated version of Adam's HTMLIdentifier patch (26.75 KB, patch)
2013-03-07 04:36 PST, Eric Seidel (no email)
no flags
updated, and more busticated version (34.82 KB, patch)
2013-03-07 14:47 PST, Eric Seidel (no email)
no flags
Patch (40.00 KB, patch)
2013-03-07 18:35 PST, Eric Seidel (no email)
no flags
Now with more build goop (44.26 KB, patch)
2013-03-07 20:23 PST, Eric Seidel (no email)
no flags
Should now build with the threaded parser off (44.53 KB, patch)
2013-03-07 22:26 PST, Eric Seidel (no email)
no flags
Fix Mac (52.18 KB, patch)
2013-03-07 22:46 PST, Eric Seidel (no email)
no flags
Patch (49.75 KB, patch)
2013-03-07 23:07 PST, Eric Seidel (no email)
no flags
Patch for landing (50.05 KB, patch)
2013-03-08 15:59 PST, Eric Seidel (no email)
no flags
Patch for landing (50.05 KB, patch)
2013-03-08 16:23 PST, Eric Seidel (no email)
no flags
Patch for landing (49.79 KB, patch)
2013-03-08 16:43 PST, Eric Seidel (no email)
no flags
Adam Barth
Comment 1 2013-01-18 16:22:07 PST
Created attachment 183569 [details] Work in progress
Adam Barth
Comment 2 2013-01-18 17:02:26 PST
== Before patch == avg 464.0280999999959 ms median 429.24000000493834 ms stdev 134.56573926949213 ms == After patch == avg 475.42640000028763 ms median 440.3119999988121 ms stdev 135.6254289739386 ms Speedup: 2.4%
Adam Barth
Comment 3 2013-01-18 17:03:39 PST
Created attachment 183577 [details] Work in progress
Adam Barth
Comment 4 2013-01-19 11:08:01 PST
I had an idea for another approach. I'll test it out an report back.
Eric Seidel (no email)
Comment 5 2013-03-06 17:38:54 PST
Created attachment 191873 [details] alternative approach, does not work yet
Eric Seidel (no email)
Comment 6 2013-03-06 17:44:04 PST
This approach is not going to work with String::isSafeToSendToAnotherThread, as that only allows strings with a single ref to be sent. We'd need to make it smarter to also allow these cached strings to be sent. Possibly by checking the exact ref count to the refs we find in the chunk.
Eric Seidel (no email)
Comment 7 2013-03-06 23:18:49 PST
(In reply to comment #2) > Speedup: 2.4% Which benchmark was this?
Eric Seidel (no email)
Comment 8 2013-03-06 23:24:23 PST
Your earlier patch is very interesting. It has the potential to be significantly stronger than the one I took. It trades 4 bytes per string pointer on CompactToken for the potential to have less memory taken up by strings. It's probably the right tradeoff. I also like the fact that it can be pre-populated easily with HTMLNames.
Eric Seidel (no email)
Comment 9 2013-03-06 23:38:22 PST
Sorry, I kinda hijacked this bug with a patch solving a related (but different) problem. My patch addresses the idea of multiple of the same string appearing in tokens in the same chunk. Yours addresses using known constant strings (HTMLNames currently) in any chunk. Looking again, your patch is probably the quicker path to victory, so I may try and clean that up and re-post.
Eric Seidel (no email)
Comment 10 2013-03-06 23:41:11 PST
I originally thought that your approach would add new strings to a shared table, but that's not the case. Your patch only ever "uniques" strings which exist in HTMLNames::getHTMLTags(), which is probably strong enough for a first pass. It won't help us with common attribute values, or unknown tag names (so it's not as strong as a real AtomicString solution could be), but it's very simple and appears thread-safe.
Adam Barth
Comment 11 2013-03-07 00:02:14 PST
> > Speedup: 2.4% > > Which benchmark was this? I don't really remember, but I bet it was the threaded HTML parser benchmark that you wrote originally with the hack not to build render trees for display none iframes.
Eric Seidel (no email)
Comment 12 2013-03-07 03:41:45 PST
I updated your patch, will post shortly. Before: Running Parser/html-parser-threaded.html (1 of 1) RESULT Parser: html-parser-threaded: Time= 441.885400002 ms median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms RESULT Parser: html-parser-threaded: JSHeap= 3521735.4 bytes median= 3602932.0 bytes, stdev= 623509.897211 bytes, min= 2623300.0 bytes, max= 4377948.0 bytes RESULT Parser: html-parser-threaded: Malloc= 52747410.8 bytes median= 56256220.0 bytes, stdev= 11587237.5555 bytes, min= 36093864.0 bytes, max= 69862568.0 bytes Finished: 13.947144 s After: Running Parser/html-parser-threaded.html (1 of 1) RESULT Parser: html-parser-threaded: Time= 451.509050003 ms median= 451.107000001 ms, stdev= 5.71298626931 ms, min= 444.229000015 ms, max= 463.285000005 ms RESULT Parser: html-parser-threaded: JSHeap= 3534839.0 bytes median= 3602932.0 bytes, stdev= 638172.662467 bytes, min= 2623304.0 bytes, max= 4508988.0 bytes RESULT Parser: html-parser-threaded: Malloc= 55495976.4 bytes median= 57171872.0 bytes, stdev= 11837219.6543 bytes, min= 36895656.0 bytes, max= 70339216.0 bytes Finished: 13.771727 s We appear to be mallocing more after the patch? I must have doen something wrong.
Eric Seidel (no email)
Comment 13 2013-03-07 04:31:21 PST
Created attachment 191968 [details] Updated version of Adam's HTMLIdentifier patch
Eric Seidel (no email)
Comment 14 2013-03-07 04:36:40 PST
Created attachment 191971 [details] Updated version of Adam's HTMLIdentifier patch
Eric Seidel (no email)
Comment 15 2013-03-07 04:37:56 PST
My limited testing seems to indicate that this patch is actually a slowdown for Parser/html-parser-threaded.html (with the display: none iframes patch applied) I'll look more tomorrow, there may be something obvious I'm missing.
Adam Barth
Comment 16 2013-03-07 13:13:01 PST
The data from Comment #2 is super duper old.
Eric Seidel (no email)
Comment 17 2013-03-07 13:20:40 PST
(In reply to comment #16) > The data from Comment #2 is super duper old. I think my 5am patch is super-duper broken still. :) Did find a bunch of related wins during the search however. Including bug 111708. :) I will look again shortly.
Eric Seidel (no email)
Comment 18 2013-03-07 14:47:52 PST
Created attachment 192087 [details] updated, and more busticated version
Eric Seidel (no email)
Comment 19 2013-03-07 18:35:07 PST
Early Warning System Bot
Comment 20 2013-03-07 18:45:52 PST
Early Warning System Bot
Comment 21 2013-03-07 18:48:23 PST
WebKit Review Bot
Comment 22 2013-03-07 19:03:08 PST
Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass cr-linux-debug-ews (chromium-xvfb): Output: http://webkit-commit-queue.appspot.com/results/17037250
Build Bot
Comment 23 2013-03-07 19:20:12 PST
Peter Beverloo (cr-android ews)
Comment 24 2013-03-07 19:21:13 PST
Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass cr-android-ews (chromium-android): Output: http://webkit-commit-queue.appspot.com/results/16992477
EFL EWS Bot
Comment 25 2013-03-07 19:26:50 PST
Eric Seidel (no email)
Comment 26 2013-03-07 20:03:08 PST
Oh, thats right. We have non-gyp ports. :p
Build Bot
Comment 27 2013-03-07 20:03:30 PST
Eric Seidel (no email)
Comment 28 2013-03-07 20:03:38 PST
Could I get some code commentary before I got through the 8 rounds of EWS fixups? :)
WebKit Review Bot
Comment 29 2013-03-07 20:07:45 PST
Comment on attachment 192124 [details] Patch Attachment 192124 [details] did not pass chromium-ews (chromium-xvfb): Output: http://webkit-commit-queue.appspot.com/results/16992504
Eric Seidel (no email)
Comment 30 2013-03-07 20:08:28 PST
Chromium linux, I thought you had my back. I feel betrayed.
Eric Seidel (no email)
Comment 31 2013-03-07 20:23:29 PST
Created attachment 192136 [details] Now with more build goop
Early Warning System Bot
Comment 32 2013-03-07 20:30:21 PST
Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass qt-wk2-ews (qt): Output: http://webkit-commit-queue.appspot.com/results/17066282
Early Warning System Bot
Comment 33 2013-03-07 20:31:05 PST
Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass qt-ews (qt): Output: http://webkit-commit-queue.appspot.com/results/17054334
EFL EWS Bot
Comment 34 2013-03-07 20:32:38 PST
Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass efl-ews (efl): Output: http://webkit-commit-queue.appspot.com/results/16987342
Build Bot
Comment 35 2013-03-07 20:59:54 PST
Comment on attachment 192136 [details] Now with more build goop Attachment 192136 [details] did not pass win-ews (win): Output: http://webkit-commit-queue.appspot.com/results/17087166
Eric Seidel (no email)
Comment 36 2013-03-07 22:26:10 PST
Created attachment 192146 [details] Should now build with the threaded parser off
Eric Seidel (no email)
Comment 37 2013-03-07 22:46:53 PST
Eric Seidel (no email)
Comment 38 2013-03-07 23:04:00 PST
Oh, XCode. You're so fun.
Eric Seidel (no email)
Comment 39 2013-03-07 23:07:38 PST
Adam Barth
Comment 40 2013-03-08 11:11:17 PST
Comment on attachment 192152 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=192152&action=review > Source/WebCore/ChangeLog:20 > + median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms > + to: > + median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms You're reporting the median, but not the mean. We usually work with the mean. > Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125 > + // FIXME: Using CaseFoldingHash::equal instead of equalIgnoringCase, as equalIgnoringCase > + // wants non-const StringImpl* (even though it never modifies them). Maybe we should change equalIgnoringCase to use a const StringImpl* ? > Source/WebCore/html/parser/HTMLIdentifier.cpp:49 > + return table; Should we ASSERT(isMainThread() || !table.isEmpty()) ? > Source/WebCore/html/parser/HTMLIdentifier.cpp:126 > + // We expect some hash collisions, but only for identical strings. > + // Since all of these names are AtomicStrings pointers should be equal. > + ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second); Is this just dumb luck? What happens when we happen to have two HTML tag names with the same hash value? We should at least have a comment here that explain what to do when that happens. > Source/WebCore/html/parser/HTMLIdentifier.cpp:133 > + // This is safe to call from any thread, but must be called before > + // using HTMLIdentifier. I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway. > Source/WebCore/html/parser/HTMLIdentifier.h:51 > + if (vector.size() <= maxNameLength) { Why not put this branch into findIndex? That would simply this code dramatically. > Source/WebCore/html/parser/HTMLIdentifier.h:62 > + if (width == Likely8Bit) > + m_string = StringImpl::create8BitIfPossible(vector); > + else if (width == Force8Bit) > + m_string = String::make8BitFrom16BitSource(vector); > + else > + m_string = String(vector); This work should probably be done by String. It's useful in other places. After these two changes, this function just becomes: m_index = findIndex(vector.data(), vector.size()); if (m_index != invalidIndex) return; m_string = String(vector, width); > Source/WebCore/html/parser/HTMLParserIdioms.cpp:286 > + // Note: This hash compare is not correct for all possible StringImpls, but > + // since we're always comparing against constants, it's OK. Why is it not correct? How can two strings be equal if their hashes aren't equal?
Eric Seidel (no email)
Comment 41 2013-03-08 15:53:17 PST
Comment on attachment 192152 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=192152&action=review >> Source/WebCore/ChangeLog:20 >> + median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms > > You're reporting the median, but not the mean. We usually work with the mean. I was just copying the output from run-perf-tests. Filed https://bugs.webkit.org/show_bug.cgi?id=111895 >> Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125 >> + // wants non-const StringImpl* (even though it never modifies them). > > Maybe we should change equalIgnoringCase to use a const StringImpl* ? I considered it. I worried it would bloat this change unnecessarily. I've filed: https://bugs.webkit.org/show_bug.cgi?id=111892 and have updated the comment to say so. >> Source/WebCore/html/parser/HTMLIdentifier.cpp:49 >> + return table; > > Should we ASSERT(isMainThread() || !table.isEmpty()) ? Done. >> Source/WebCore/html/parser/HTMLIdentifier.cpp:126 >> + ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second); > > Is this just dumb luck? What happens when we happen to have two HTML tag names with the same hash value? We should at least have a comment here that explain what to do when that happens. Yup. :) There aren't that many strings in this table, so it doesn't shock me that they all have unique hashes. I've added a note for anyone hitting this in the future. >> Source/WebCore/html/parser/HTMLIdentifier.cpp:133 >> + // using HTMLIdentifier. > > I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway. Done. >> Source/WebCore/html/parser/HTMLIdentifier.h:51 >> + if (vector.size() <= maxNameLength) { > > Why not put this branch into findIndex? That would simply this code dramatically. Duh. Fixed. I think I was trying to avoid even calling findIndex in the case where we knew we were going to malloc. But the malloc should overshadow this function call overhead. :) >> Source/WebCore/html/parser/HTMLIdentifier.h:62 >> + m_string = String(vector); > > This work should probably be done by String. It's useful in other places. > > After these two changes, this function just becomes: > > m_index = findIndex(vector.data(), vector.size()); > if (m_index != invalidIndex) > return; > m_string = String(vector, width); Excellent idea, but a much more involved change (as we need to handle more than just UChar as a char type). Filed bug 111896. >> Source/WebCore/html/parser/HTMLParserIdioms.cpp:286 >> + // since we're always comparing against constants, it's OK. > > Why is it not correct? How can two strings be equal if their hashes aren't equal? I think it was this comment which scared me into writing that one: http://trac.webkit.org/browser/trunk/Source/WTF/wtf/text/StringImpl.h#L312 I'm happy to remove it.
Eric Seidel (no email)
Comment 42 2013-03-08 15:59:12 PST
Created attachment 192300 [details] Patch for landing
WebKit Review Bot
Comment 43 2013-03-08 16:01:52 PST
Comment on attachment 192300 [details] Patch for landing Rejecting attachment 192300 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192300, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue Last 500 characters of output: patching file Source/WebCore/html/parser/HTMLIdentifier.cpp patching file Source/WebCore/html/parser/HTMLIdentifier.h patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp patching file Source/WebCore/html/parser/HTMLParserIdioms.h patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp patching file Source/WebCore/html/parser/HTMLPreloadScanner.h Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue Full output: http://webkit-commit-queue.appspot.com/results/17109259
Eric Seidel (no email)
Comment 44 2013-03-08 16:23:43 PST
Created attachment 192304 [details] Patch for landing
WebKit Review Bot
Comment 45 2013-03-08 16:26:34 PST
Comment on attachment 192304 [details] Patch for landing Rejecting attachment 192304 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192304, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue Last 500 characters of output: patching file Source/WebCore/html/parser/HTMLIdentifier.cpp patching file Source/WebCore/html/parser/HTMLIdentifier.h patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp patching file Source/WebCore/html/parser/HTMLParserIdioms.h patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp patching file Source/WebCore/html/parser/HTMLPreloadScanner.h Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue Full output: http://webkit-commit-queue.appspot.com/results/17047463
Eric Seidel (no email)
Comment 46 2013-03-08 16:43:54 PST
Created attachment 192309 [details] Patch for landing
WebKit Review Bot
Comment 47 2013-03-08 17:08:47 PST
Comment on attachment 192309 [details] Patch for landing Clearing flags on attachment: 192309 Committed r145292: <http://trac.webkit.org/changeset/145292>
WebKit Review Bot
Comment 48 2013-03-08 17:08:55 PST
All reviewed patches have been landed. Closing bug.
Adam Barth
Comment 50 2013-03-13 01:30:44 PDT
That's exciting. Thank you for letting us know!
Eric Seidel (no email)
Comment 51 2013-03-13 13:33:13 PDT
Thank you very much for the update Nils!
Note You need to log in before you can comment on or make changes to this bug.