WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
107337
BackgroundHTMLParser should be able to atomize well-known strings
https://bugs.webkit.org/show_bug.cgi?id=107337
Summary
BackgroundHTMLParser should be able to atomize well-known strings
Adam Barth
Reported
2013-01-18 16:20:41 PST
BackgroundHTMLParser should be able to atomize well-known strings
Attachments
Work in progress
(14.64 KB, patch)
2013-01-18 16:22 PST
,
Adam Barth
no flags
Details
Formatted Diff
Diff
Work in progress
(14.77 KB, patch)
2013-01-18 17:03 PST
,
Adam Barth
no flags
Details
Formatted Diff
Diff
alternative approach, does not work yet
(15.50 KB, patch)
2013-03-06 17:38 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Updated version of Adam's HTMLIdentifier patch
(26.69 KB, patch)
2013-03-07 04:31 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Updated version of Adam's HTMLIdentifier patch
(26.75 KB, patch)
2013-03-07 04:36 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
updated, and more busticated version
(34.82 KB, patch)
2013-03-07 14:47 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch
(40.00 KB, patch)
2013-03-07 18:35 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Now with more build goop
(44.26 KB, patch)
2013-03-07 20:23 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Should now build with the threaded parser off
(44.53 KB, patch)
2013-03-07 22:26 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Fix Mac
(52.18 KB, patch)
2013-03-07 22:46 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch
(49.75 KB, patch)
2013-03-07 23:07 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch for landing
(50.05 KB, patch)
2013-03-08 15:59 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch for landing
(50.05 KB, patch)
2013-03-08 16:23 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch for landing
(49.79 KB, patch)
2013-03-08 16:43 PST
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Show Obsolete
(13)
View All
Add attachment
proposed patch, testcase, etc.
Adam Barth
Comment 1
2013-01-18 16:22:07 PST
Created
attachment 183569
[details]
Work in progress
Adam Barth
Comment 2
2013-01-18 17:02:26 PST
== Before patch == avg 464.0280999999959 ms median 429.24000000493834 ms stdev 134.56573926949213 ms == After patch == avg 475.42640000028763 ms median 440.3119999988121 ms stdev 135.6254289739386 ms Speedup: 2.4%
Adam Barth
Comment 3
2013-01-18 17:03:39 PST
Created
attachment 183577
[details]
Work in progress
Adam Barth
Comment 4
2013-01-19 11:08:01 PST
I had an idea for another approach. I'll test it out an report back.
Eric Seidel (no email)
Comment 5
2013-03-06 17:38:54 PST
Created
attachment 191873
[details]
alternative approach, does not work yet
Eric Seidel (no email)
Comment 6
2013-03-06 17:44:04 PST
This approach is not going to work with String::isSafeToSendToAnotherThread, as that only allows strings with a single ref to be sent. We'd need to make it smarter to also allow these cached strings to be sent. Possibly by checking the exact ref count to the refs we find in the chunk.
Eric Seidel (no email)
Comment 7
2013-03-06 23:18:49 PST
(In reply to
comment #2
)
> Speedup: 2.4%
Which benchmark was this?
Eric Seidel (no email)
Comment 8
2013-03-06 23:24:23 PST
Your earlier patch is very interesting. It has the potential to be significantly stronger than the one I took. It trades 4 bytes per string pointer on CompactToken for the potential to have less memory taken up by strings. It's probably the right tradeoff. I also like the fact that it can be pre-populated easily with HTMLNames.
Eric Seidel (no email)
Comment 9
2013-03-06 23:38:22 PST
Sorry, I kinda hijacked this bug with a patch solving a related (but different) problem. My patch addresses the idea of multiple of the same string appearing in tokens in the same chunk. Yours addresses using known constant strings (HTMLNames currently) in any chunk. Looking again, your patch is probably the quicker path to victory, so I may try and clean that up and re-post.
Eric Seidel (no email)
Comment 10
2013-03-06 23:41:11 PST
I originally thought that your approach would add new strings to a shared table, but that's not the case. Your patch only ever "uniques" strings which exist in HTMLNames::getHTMLTags(), which is probably strong enough for a first pass. It won't help us with common attribute values, or unknown tag names (so it's not as strong as a real AtomicString solution could be), but it's very simple and appears thread-safe.
Adam Barth
Comment 11
2013-03-07 00:02:14 PST
> > Speedup: 2.4% > > Which benchmark was this?
I don't really remember, but I bet it was the threaded HTML parser benchmark that you wrote originally with the hack not to build render trees for display none iframes.
Eric Seidel (no email)
Comment 12
2013-03-07 03:41:45 PST
I updated your patch, will post shortly. Before: Running Parser/html-parser-threaded.html (1 of 1) RESULT Parser: html-parser-threaded: Time= 441.885400002 ms median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms RESULT Parser: html-parser-threaded: JSHeap= 3521735.4 bytes median= 3602932.0 bytes, stdev= 623509.897211 bytes, min= 2623300.0 bytes, max= 4377948.0 bytes RESULT Parser: html-parser-threaded: Malloc= 52747410.8 bytes median= 56256220.0 bytes, stdev= 11587237.5555 bytes, min= 36093864.0 bytes, max= 69862568.0 bytes Finished: 13.947144 s After: Running Parser/html-parser-threaded.html (1 of 1) RESULT Parser: html-parser-threaded: Time= 451.509050003 ms median= 451.107000001 ms, stdev= 5.71298626931 ms, min= 444.229000015 ms, max= 463.285000005 ms RESULT Parser: html-parser-threaded: JSHeap= 3534839.0 bytes median= 3602932.0 bytes, stdev= 638172.662467 bytes, min= 2623304.0 bytes, max= 4508988.0 bytes RESULT Parser: html-parser-threaded: Malloc= 55495976.4 bytes median= 57171872.0 bytes, stdev= 11837219.6543 bytes, min= 36895656.0 bytes, max= 70339216.0 bytes Finished: 13.771727 s We appear to be mallocing more after the patch? I must have doen something wrong.
Eric Seidel (no email)
Comment 13
2013-03-07 04:31:21 PST
Created
attachment 191968
[details]
Updated version of Adam's HTMLIdentifier patch
Eric Seidel (no email)
Comment 14
2013-03-07 04:36:40 PST
Created
attachment 191971
[details]
Updated version of Adam's HTMLIdentifier patch
Eric Seidel (no email)
Comment 15
2013-03-07 04:37:56 PST
My limited testing seems to indicate that this patch is actually a slowdown for Parser/html-parser-threaded.html (with the display: none iframes patch applied) I'll look more tomorrow, there may be something obvious I'm missing.
Adam Barth
Comment 16
2013-03-07 13:13:01 PST
The data from
Comment #2
is super duper old.
Eric Seidel (no email)
Comment 17
2013-03-07 13:20:40 PST
(In reply to
comment #16
)
> The data from
Comment #2
is super duper old.
I think my 5am patch is super-duper broken still. :) Did find a bunch of related wins during the search however. Including
bug 111708
. :) I will look again shortly.
Eric Seidel (no email)
Comment 18
2013-03-07 14:47:52 PST
Created
attachment 192087
[details]
updated, and more busticated version
Eric Seidel (no email)
Comment 19
2013-03-07 18:35:07 PST
Created
attachment 192124
[details]
Patch
Early Warning System Bot
Comment 20
2013-03-07 18:45:52 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass qt-ews (qt): Output:
http://webkit-commit-queue.appspot.com/results/17009461
Early Warning System Bot
Comment 21
2013-03-07 18:48:23 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass qt-wk2-ews (qt): Output:
http://webkit-commit-queue.appspot.com/results/16987302
WebKit Review Bot
Comment 22
2013-03-07 19:03:08 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass cr-linux-debug-ews (chromium-xvfb): Output:
http://webkit-commit-queue.appspot.com/results/17037250
Build Bot
Comment 23
2013-03-07 19:20:12 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass win-ews (win): Output:
http://webkit-commit-queue.appspot.com/results/17081269
Peter Beverloo (cr-android ews)
Comment 24
2013-03-07 19:21:13 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass cr-android-ews (chromium-android): Output:
http://webkit-commit-queue.appspot.com/results/16992477
EFL EWS Bot
Comment 25
2013-03-07 19:26:50 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass efl-ews (efl): Output:
http://webkit-commit-queue.appspot.com/results/17009478
Eric Seidel (no email)
Comment 26
2013-03-07 20:03:08 PST
Oh, thats right. We have non-gyp ports. :p
Build Bot
Comment 27
2013-03-07 20:03:30 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass mac-wk2-ews (mac-wk2): Output:
http://webkit-commit-queue.appspot.com/results/16992502
Eric Seidel (no email)
Comment 28
2013-03-07 20:03:38 PST
Could I get some code commentary before I got through the 8 rounds of EWS fixups? :)
WebKit Review Bot
Comment 29
2013-03-07 20:07:45 PST
Comment on
attachment 192124
[details]
Patch
Attachment 192124
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://webkit-commit-queue.appspot.com/results/16992504
Eric Seidel (no email)
Comment 30
2013-03-07 20:08:28 PST
Chromium linux, I thought you had my back. I feel betrayed.
Eric Seidel (no email)
Comment 31
2013-03-07 20:23:29 PST
Created
attachment 192136
[details]
Now with more build goop
Early Warning System Bot
Comment 32
2013-03-07 20:30:21 PST
Comment on
attachment 192136
[details]
Now with more build goop
Attachment 192136
[details]
did not pass qt-wk2-ews (qt): Output:
http://webkit-commit-queue.appspot.com/results/17066282
Early Warning System Bot
Comment 33
2013-03-07 20:31:05 PST
Comment on
attachment 192136
[details]
Now with more build goop
Attachment 192136
[details]
did not pass qt-ews (qt): Output:
http://webkit-commit-queue.appspot.com/results/17054334
EFL EWS Bot
Comment 34
2013-03-07 20:32:38 PST
Comment on
attachment 192136
[details]
Now with more build goop
Attachment 192136
[details]
did not pass efl-ews (efl): Output:
http://webkit-commit-queue.appspot.com/results/16987342
Build Bot
Comment 35
2013-03-07 20:59:54 PST
Comment on
attachment 192136
[details]
Now with more build goop
Attachment 192136
[details]
did not pass win-ews (win): Output:
http://webkit-commit-queue.appspot.com/results/17087166
Eric Seidel (no email)
Comment 36
2013-03-07 22:26:10 PST
Created
attachment 192146
[details]
Should now build with the threaded parser off
Eric Seidel (no email)
Comment 37
2013-03-07 22:46:53 PST
Created
attachment 192149
[details]
Fix Mac
Eric Seidel (no email)
Comment 38
2013-03-07 23:04:00 PST
Oh, XCode. You're so fun.
Eric Seidel (no email)
Comment 39
2013-03-07 23:07:38 PST
Created
attachment 192152
[details]
Patch
Adam Barth
Comment 40
2013-03-08 11:11:17 PST
Comment on
attachment 192152
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=192152&action=review
> Source/WebCore/ChangeLog:20 > + median= 443.726500002 ms, stdev= 7.25002679952 ms, min= 430.244000047 ms, max= 455.511000007 ms > + to: > + median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms
You're reporting the median, but not the mean. We usually work with the mean.
> Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125 > + // FIXME: Using CaseFoldingHash::equal instead of equalIgnoringCase, as equalIgnoringCase > + // wants non-const StringImpl* (even though it never modifies them).
Maybe we should change equalIgnoringCase to use a const StringImpl* ?
> Source/WebCore/html/parser/HTMLIdentifier.cpp:49 > + return table;
Should we ASSERT(isMainThread() || !table.isEmpty()) ?
> Source/WebCore/html/parser/HTMLIdentifier.cpp:126 > + // We expect some hash collisions, but only for identical strings. > + // Since all of these names are AtomicStrings pointers should be equal. > + ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second);
Is this just dumb luck? What happens when we happen to have two HTML tag names with the same hash value? We should at least have a comment here that explain what to do when that happens.
> Source/WebCore/html/parser/HTMLIdentifier.cpp:133 > + // This is safe to call from any thread, but must be called before > + // using HTMLIdentifier.
I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway.
> Source/WebCore/html/parser/HTMLIdentifier.h:51 > + if (vector.size() <= maxNameLength) {
Why not put this branch into findIndex? That would simply this code dramatically.
> Source/WebCore/html/parser/HTMLIdentifier.h:62 > + if (width == Likely8Bit) > + m_string = StringImpl::create8BitIfPossible(vector); > + else if (width == Force8Bit) > + m_string = String::make8BitFrom16BitSource(vector); > + else > + m_string = String(vector);
This work should probably be done by String. It's useful in other places. After these two changes, this function just becomes: m_index = findIndex(vector.data(), vector.size()); if (m_index != invalidIndex) return; m_string = String(vector, width);
> Source/WebCore/html/parser/HTMLParserIdioms.cpp:286 > + // Note: This hash compare is not correct for all possible StringImpls, but > + // since we're always comparing against constants, it's OK.
Why is it not correct? How can two strings be equal if their hashes aren't equal?
Eric Seidel (no email)
Comment 41
2013-03-08 15:53:17 PST
Comment on
attachment 192152
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=192152&action=review
>> Source/WebCore/ChangeLog:20 >> + median= 427.849500004 ms, stdev= 9.96967058292 ms, min= 417.914000049 ms, max= 461.528000014 ms > > You're reporting the median, but not the mean. We usually work with the mean.
I was just copying the output from run-perf-tests. Filed
https://bugs.webkit.org/show_bug.cgi?id=111895
>> Source/WebCore/html/parser/BackgroundHTMLParser.cpp:125 >> + // wants non-const StringImpl* (even though it never modifies them). > > Maybe we should change equalIgnoringCase to use a const StringImpl* ?
I considered it. I worried it would bloat this change unnecessarily. I've filed:
https://bugs.webkit.org/show_bug.cgi?id=111892
and have updated the comment to say so.
>> Source/WebCore/html/parser/HTMLIdentifier.cpp:49 >> + return table; > > Should we ASSERT(isMainThread() || !table.isEmpty()) ?
Done.
>> Source/WebCore/html/parser/HTMLIdentifier.cpp:126 >> + ASSERT_UNUSED(addResult, !addResult.isNewEntry || name == addResult.iterator->value.second); > > Is this just dumb luck? What happens when we happen to have two HTML tag names with the same hash value? We should at least have a comment here that explain what to do when that happens.
Yup. :) There aren't that many strings in this table, so it doesn't shock me that they all have unique hashes. I've added a note for anyone hitting this in the future.
>> Source/WebCore/html/parser/HTMLIdentifier.cpp:133 >> + // using HTMLIdentifier. > > I would just ASSERT(isMainThread()) since that's easier and how we plan to use it anyway.
Done.
>> Source/WebCore/html/parser/HTMLIdentifier.h:51 >> + if (vector.size() <= maxNameLength) { > > Why not put this branch into findIndex? That would simply this code dramatically.
Duh. Fixed. I think I was trying to avoid even calling findIndex in the case where we knew we were going to malloc. But the malloc should overshadow this function call overhead. :)
>> Source/WebCore/html/parser/HTMLIdentifier.h:62 >> + m_string = String(vector); > > This work should probably be done by String. It's useful in other places. > > After these two changes, this function just becomes: > > m_index = findIndex(vector.data(), vector.size()); > if (m_index != invalidIndex) > return; > m_string = String(vector, width);
Excellent idea, but a much more involved change (as we need to handle more than just UChar as a char type). Filed
bug 111896
.
>> Source/WebCore/html/parser/HTMLParserIdioms.cpp:286 >> + // since we're always comparing against constants, it's OK. > > Why is it not correct? How can two strings be equal if their hashes aren't equal?
I think it was this comment which scared me into writing that one:
http://trac.webkit.org/browser/trunk/Source/WTF/wtf/text/StringImpl.h#L312
I'm happy to remove it.
Eric Seidel (no email)
Comment 42
2013-03-08 15:59:12 PST
Created
attachment 192300
[details]
Patch for landing
WebKit Review Bot
Comment 43
2013-03-08 16:01:52 PST
Comment on
attachment 192300
[details]
Patch for landing Rejecting
attachment 192300
[details]
from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192300, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue Last 500 characters of output: patching file Source/WebCore/html/parser/HTMLIdentifier.cpp patching file Source/WebCore/html/parser/HTMLIdentifier.h patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp patching file Source/WebCore/html/parser/HTMLParserIdioms.h patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp patching file Source/WebCore/html/parser/HTMLPreloadScanner.h Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue Full output:
http://webkit-commit-queue.appspot.com/results/17109259
Eric Seidel (no email)
Comment 44
2013-03-08 16:23:43 PST
Created
attachment 192304
[details]
Patch for landing
WebKit Review Bot
Comment 45
2013-03-08 16:26:34 PST
Comment on
attachment 192304
[details]
Patch for landing Rejecting
attachment 192304
[details]
from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', '--bot-id=gce-cq-04', 'apply-attachment', '--no-update', '--non-interactive', 192304, '--port=chromium-xvfb']" exit_code: 2 cwd: /mnt/git/webkit-commit-queue Last 500 characters of output: patching file Source/WebCore/html/parser/HTMLIdentifier.cpp patching file Source/WebCore/html/parser/HTMLIdentifier.h patching file Source/WebCore/html/parser/HTMLParserIdioms.cpp patching file Source/WebCore/html/parser/HTMLParserIdioms.h patching file Source/WebCore/html/parser/HTMLPreloadScanner.cpp patching file Source/WebCore/html/parser/HTMLPreloadScanner.h Failed to run "[u'/mnt/git/webkit-commit-queue/Tools/Scripts/svn-apply', '--force']" exit_code: 1 cwd: /mnt/git/webkit-commit-queue Full output:
http://webkit-commit-queue.appspot.com/results/17047463
Eric Seidel (no email)
Comment 46
2013-03-08 16:43:54 PST
Created
attachment 192309
[details]
Patch for landing
WebKit Review Bot
Comment 47
2013-03-08 17:08:47 PST
Comment on
attachment 192309
[details]
Patch for landing Clearing flags on attachment: 192309 Committed
r145292
: <
http://trac.webkit.org/changeset/145292
>
WebKit Review Bot
Comment 48
2013-03-08 17:08:55 PST
All reviewed patches have been landed. Closing bug.
Nils Barth
Comment 49
2013-03-13 00:01:30 PDT
Just a note that this patch looks like it's sped up Page Cycler Moz on Win7. Note the step down in time take after revision 145280:
http://chromium-perf.appspot.com/?tab=chromium-rel-win7-webkit&graph=times&trace=t_extcs1&rev=187389&history=80&master=ChromiumWebkit&testSuite=page_cycler_moz&details=true
http://chromium-perf.appspot.com/?tab=chromium-rel-win7-webkit&graph=times&trace=t_extwr&rev=187389&history=80&master=ChromiumWebkit&testSuite=page_cycler_moz&details=true
...so looks like some confirmation that it's doing some good!
Adam Barth
Comment 50
2013-03-13 01:30:44 PDT
That's exciting. Thank you for letting us know!
Eric Seidel (no email)
Comment 51
2013-03-13 13:33:13 PDT
Thank you very much for the update Nils!
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug