Bug 106127 - [meta] HTML parser shouldn't block the main thread
Status: RESOLVED FIXED
: WebKit
WebCore Misc.
: 528+ (Nightly build)
: All All
: P2 Normal
: 57376 63531 90751 106128 106251 106256 106268 106375 106401 106496 106595 106597 106607 106615 106618 106694 106722 106854 106919 107068 107069 107070 107071 107082 107083 107086 107087 107105 107140 107150 107158 107159 107160 107170 107190 107201 107317 107320 107330 107332 107367 107368 107519 107522 107561 107569 107575 107584 107593 107596 107603 107664 107713 107751 107753 107755 107807 107876 107975 107983 108027 108096 108394 108531 108557 108655 108666 108698 108726 108880 108970 108984 109076 109237 109240 109477 109485 109486 109495 109598 109607 109738 109742 109750 109754 109760 109764 109995 110251 110258 110276 110408 110517 110529 110532 110537 110538 110637 110643 110647 110678 110801 110907 110929 110937 110949 110951 111021 111023 111043 111044 111130 111135 111200 111248 111249 111253 111272 111365 111423 111610 112057
Reported: 2013-01-04 13:08 PST by
Modified: 2013-03-11 19:52 PST (History)


Attachments
HTML parser runtime (measured on chromium-mac on a Macbook Pro via inspector instrumentation) (17.46 KB, text/html)
2013-01-08 14:25 PST, Adam Barth
HTML parser runtime (measured on chromium-android on a Nexus 7 via inspector instrumentation) (17.22 KB, text/html)
2013-01-09 16:05 PST, Adam Barth




Description From 2013-01-04 13:08:22 PST
This is a meta bug for moving the HTML parser off the main thread.

We're currently evaluating how much performance there is to be gained from this change.  The performance gains might arise in two ways:

1) Moving parsing off the main thread could make web pages more responsive because the main thread is available for handling input events and executing JavaScript.
2) Moving parsing off the main thread could make web pages load more quickly because WebCore can do other work in parallel with parsing HTML (such as parsing CSS or attaching elements to the render tree).

While we investigate these possible performance benefits, we might refactor the parser a bit to remove main-thread dependencies from the core objects (e.g., HTMLTokenizer and HTMLTreeBuilder).  Once we have more data, we'll start a discussion on webkit-dev before making any major architectural changes.
------- Comment #1 From 2013-01-06 18:42:17 PST -------
Here's a slide deck from Mozilla related to this topic:
http://people.mozilla.com/~roc/Samsung/MozillaParallelism.pdf
------- Comment #2 From 2013-01-07 05:03:08 PST -------
Over a run of a bunch of real-world web sites we seem to have ~3% of main thread CPU time in the HTML tokenization and parsing (excluding the actual tree building, the most expensive part). There are surely individual cases much worse than that. This is big enough to support architectural changes like this.

The goal should be to eventually move the whole path, from networking onward, off the main thread and do only the actual tree building there.
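
The split described here can be sketched roughly as follows. This is a toy illustration in standard C++, not WebKit code: `TokenQueue` and `parseOffMainThread` are hypothetical names, the "tokenizer" merely splits on whitespace, and the calling thread stands in for the main thread that does the tree building.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue that carries tokens from the background
// tokenizer thread to the thread that builds the tree.
class TokenQueue {
public:
    void push(std::string token)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tokens.push(std::move(token));
        }
        m_condition.notify_one();
    }

    // Signals that no more tokens will be pushed.
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed = true;
        }
        m_condition.notify_one();
    }

    // Waits for the next token. Returns false once the queue is closed and drained.
    bool pop(std::string& token)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return !m_tokens.empty() || m_closed; });
        if (m_tokens.empty())
            return false;
        token = std::move(m_tokens.front());
        m_tokens.pop();
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::queue<std::string> m_tokens;
    bool m_closed = false;
};

// Tokenizes on a background thread; the calling thread only consumes tokens.
// The returned vector is a stand-in for the tree that would be built.
std::vector<std::string> parseOffMainThread(const std::string& source)
{
    TokenQueue queue;
    std::thread tokenizer([&queue, &source] {
        std::istringstream stream(source);
        std::string token;
        while (stream >> token)
            queue.push(token);
        queue.close();
    });

    std::vector<std::string> tree;
    std::string token;
    while (queue.pop(token))
        tree.push_back(token);

    tokenizer.join();
    return tree;
}
```

In this sketch the consuming thread blocks in `pop` for simplicity. A real main thread would presumably pick up batches of tokens from its run loop instead, so that it stays free for input events and JavaScript between batches.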
------- Comment #3 From 2013-01-07 12:02:50 PST -------
(In reply to comment #2)
> Over a run of a bunch of real-world web sites we seem to have ~3% of main thread CPU time in the HTML tokenization and parsing (excluding the actual tree building, the most expensive part). There are surely individual cases much worse than that. This is big enough to support architectural changes like this.
> 
> The goal should be to eventually move the whole path, from networking onward, off the main thread and do only the actual tree building there.

I'm very curious about this number!

Adam and I briefly looked into generating a number like that last Friday.  I had assumed parse time would be larger than 3% of total active main thread time, especially on mobile.

Could you share some of your methodology?  Or other percentages of main thread usage?  I'd be very interested in what you know about how we're spending time on the main thread, and happy to help you reduce it.

I assume you just used dtrace + the PLT or similar?  Our first-crack plan had been to use inspector timeline events and page cyclers (same idea as the plt), but I'm less interested in what events the inspector happens to record, and more about total time on the main thread and where it's going.  We could also use systrace for this, and I might go that route next.
------- Comment #4 From 2013-01-07 12:03:45 PST -------
(In reply to comment #3)
> (In reply to comment #2)
>
> I'm very curious about this number!

We can also discuss this offline or in a separate bug.  I don't need to hijack Adam's meta-bug.  But I remain very interested in your testing and in being able to repeat it, compare numbers, and speed up WebKit.
------- Comment #5 From 2013-01-08 10:53:22 PST -------
(In reply to comment #1)
> Here's a slide deck from Mozilla related to this topic:
> http://people.mozilla.com/~roc/Samsung/MozillaParallelism.pdf

And some more in-depth design discussion:
https://developer.mozilla.org/en/Gecko/HTML_parser_threading
------- Comment #6 From 2013-01-08 11:48:25 PST -------
Note that Mozilla's implementation of this wasn't necessarily that evidence-driven: https://twitter.com/hsivonen/status/129457178368151552
------- Comment #7 From 2013-01-08 12:44:49 PST -------
> Over a run of a bunch of real-world web sites we seem to have ~3% of main thread CPU time in the HTML tokenization and parsing (excluding the actual tree building, the most expensive part).

When you say "tree building," do you mean the work done by the HTMLTreeBuilder object or the actual parserAppendChild/attach calls?  We should be able to move HTMLTreeBuilder onto the background thread, but we probably would not be able to move parserAppendChild or attach.

nduca did some measurements with Chromium's telemetry profiler (which uses the inspector timeline's notion of what constitutes HTML parsing time).  On a selection of 25 popular web sites, he sees the parser using between 2% and 8% of main thread CPU time (with an average of 5%).  Some examples on the high end (i.e., >=7%) are games.yahoo.com, www.youtube.com, http://en.wikipedia.org/wiki/Wikipedia, and pinterest.com.

These numbers seem consistent with Antti's measurements given that Antti is likely excluding some amount of tree building work that the inspector is charging to the parser.
------- Comment #8 From 2013-01-08 14:25:53 PST -------
Created an attachment (id=181766) [details]
HTML parser runtime (measured on chromium-mac on a Macbook Pro via inspector instrumentation)

Here are more details from the dataset Nat took on his Macbook Pro.  The "ParseHTML" column represents the total amount of time attributed to the HTML parser by the inspector instrumentation.  The "ParseHTML_max" column is the largest contiguous chunk of time (in a single load of the page).

Looking at the ParseHTML_max column, the parser seems to often consume multiple frames (by which I mean 60 Hz time slices on the main thread).  In some cases, such as http://en.wikipedia.org/wiki/Wikipedia and http://games.yahoo.com the parser creates 7-9 frames of jank.

Note: These measurements were taken on a Macbook Pro.  It would be interesting to see how these measurements compare on a mobile device.
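
To make the "frames of jank" arithmetic concrete, here is a small sketch of the conversion between a contiguous block of main-thread time and 60 Hz frames. The function names are made up for illustration, and rounding up to whole frames is an assumption about how a partially used frame should be counted.

```cpp
#include <cassert>
#include <cmath>

// One frame at a 60 Hz refresh rate, in milliseconds (about 16.67 ms).
constexpr double kFrameBudgetMs = 1000.0 / 60.0;

// Number of 60 Hz frames that a contiguous block of main-thread work
// overlaps. Rounded up, on the assumption that a partially used frame
// is still a missed frame.
int framesOfJank(double blockedMs)
{
    assert(blockedMs >= 0.0);
    return static_cast<int>(std::ceil(blockedMs / kFrameBudgetMs));
}

// Inverse direction: how long the main thread is blocked for a given
// number of whole frames.
double blockedMsForFrames(int frames)
{
    assert(frames >= 0);
    return frames * kFrameBudgetMs;
}
```

By this conversion, the 7-9 frames mentioned above correspond to roughly 117 to 150 ms during which the main thread cannot respond to input.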
------- Comment #9 From 2013-01-08 14:31:47 PST -------
(In reply to comment #8)
> Created an attachment (id=181766) [details]
> Looking at the ParseHTML_max column, the parser seems to often consume multiple frames (by which I mean 60 Hz time slices on the main thread).  In some cases, such as http://en.wikipedia.org/wiki/Wikipedia and http://games.yahoo.com the parser creates 7-9 frames of jank.

The parser is currently set to yield only every 4000 tokens or 500ms, which is likely way too long on a touch device.

http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLParserScheduler.cpp#L34

It would be interesting to build with a much lower threshold (like 30ms) and see how the web feels.

Pulling the parser off the main thread would likely help with this sort of jank as well.
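
For illustration, the yield rule described in this comment can be sketched as below. This is a hypothetical model, not the code in HTMLParserScheduler.cpp: `YieldPolicy` and `shouldYield` are invented names, and treating "4000 tokens or 500ms" as "whichever limit is hit first" is an assumption.

```cpp
#include <cassert>

// Hypothetical yield policy: yield after tokenLimit tokens or after
// timeLimitMs of elapsed time, whichever comes first (an assumption).
struct YieldPolicy {
    int tokenLimit;
    double timeLimitMs;
};

// Limits quoted in the comment: 4000 tokens or 500 ms.
constexpr YieldPolicy kCurrentPolicy { 4000, 500.0 };

// The suggested experiment: same token limit, time limit lowered to 30 ms.
constexpr YieldPolicy kProposedPolicy { 4000, 30.0 };

// Returns true when the parser should hand control back to the main thread's
// event loop, given the work done since the last yield.
bool shouldYield(const YieldPolicy& policy, int tokensProcessed, double elapsedMs)
{
    assert(tokensProcessed >= 0 && elapsedMs >= 0.0);
    return tokensProcessed >= policy.tokenLimit || elapsedMs >= policy.timeLimitMs;
}
```

With these values, a parse chunk that has run for 40 ms over 100 tokens keeps going under the current limits but yields under the 30 ms threshold.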
------- Comment #10 From 2013-01-08 15:31:54 PST -------
(In reply to comment #7)
> > Over a run of a bunch of real-world web sites we seem to have ~3% of main thread CPU time in the HTML tokenization and parsing (excluding the actual tree building, the most expensive part).
> 
> When you say "tree building," do you mean the work done by the HTMLTreeBuilder object or the actual parserAppendChild/attach calls?  We should be able to move HTMLTreeBuilder onto the background thread, but we probably would not be able to move parserAppendChild or attach.

I was pruning out entire HTMLTreeBuilder::constructTreeFromAtomicToken(). Pruning more carefully (calls to Element functions only) leaves ~3.5% in total.

> nduca did some measurements with Chromium's telemetry profiler (which uses the inspector timeline's notion of what constitutes HTML parsing time).  On a selection of 25 popular web sites, he sees the parser using between 2% and 8% of main thread CPU time (with an average of 5%).  Some examples on the high end (i.e., >=7%) are games.yahoo.com, www.youtube.com, http://en.wikipedia.org/wiki/Wikipedia, and pinterest.com.

I would like to see measurements done without relying on inspector infrastructure (for example by simple instrumentation code) so we know what exactly is being measured. 

As I said, I think this is worth doing based on the current data already. However, it would be good to have a realistic understanding of what kinds of gains to expect.
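
As a rough idea of what "simple instrumentation code" could look like, here is a sketch of a scoped timer that accumulates total parser time and the largest single chunk. The class names are hypothetical and nothing here is existing WebKit code; where the timer would be placed in the parser is left open.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical accumulator: total time spent in the instrumented code and
// the longest single contiguous chunk, both in milliseconds.
struct ParseTimeStats {
    double totalMs = 0.0;
    double maxChunkMs = 0.0;

    void record(double chunkMs)
    {
        assert(chunkMs >= 0.0);
        totalMs += chunkMs;
        maxChunkMs = std::max(maxChunkMs, chunkMs);
    }
};

// RAII timer: measures the lifetime of the object with a monotonic clock
// and records it into the given stats when it goes out of scope.
class ScopedParseTimer {
public:
    explicit ScopedParseTimer(ParseTimeStats& stats)
        : m_stats(stats)
        , m_start(std::chrono::steady_clock::now())
    {
    }

    ~ScopedParseTimer()
    {
        std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - m_start;
        m_stats.record(elapsed.count());
    }

    ScopedParseTimer(const ScopedParseTimer&) = delete;
    ScopedParseTimer& operator=(const ScopedParseTimer&) = delete;

private:
    ParseTimeStats& m_stats;
    std::chrono::steady_clock::time_point m_start;
};
```

Declaring a `ScopedParseTimer` at the top of exactly the functions of interest makes it explicit which work is charged to the parser, which is the concern raised above about relying on the inspector's definition.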
------- Comment #11 From 2013-01-09 16:05:38 PST -------
Created an attachment (id=182002) [details]
HTML parser runtime (measured on chromium-android on a Nexus 7 via inspector instrumentation)

Here are the results from the chromium-android port on a Nexus 7 (using a content_shell build from this afternoon).  The parser takes up more time on the main thread.  For example, on games.yahoo.com the HTML parser takes up 1.2 seconds.  On average, the HTML parser takes 486 ms of main thread time.

The "max" times are also considerably worse on the Nexus 7.  The average "max" value is about 10 frames (60 Hz time slices), with the worst case being 38 frames.
------- Comment #12 From 2013-01-09 16:07:07 PST -------
> I would like to see measurements done without relying on inspector infrastructure (for example by simple instrumentation code) so we know what exactly is being measured. 

I would prefer to use a tool like Instruments as well, but unfortunately the measurement harness we're using is built out of the inspector instrumentation.

> As I said, I think this is worth doing based on the current data already. However, it would be good to have a realistic understanding of what kinds of gains to expect.

I agree.  I'll send an email to webkit-dev.
------- Comment #13 From 2013-01-24 11:53:59 PST -------
I ran the prototype through the top 25 suite in Telemetry on a Galaxy Nexus (with the exception of Calendar which didn't load with the threaded parser). This benchmark loads cached sites from a local web page replay instance.

Results are preliminary but encouraging:

                  Default  Threaded  Improvement
DOMContentLoaded     4972      4304          13%
ParseHTML total       702       593          14%
ParseHTML avg           9         5          44%
ParseHTML max         309       107          65%

Full results:
https://docs.google.com/a/chromium.org/spreadsheet/ccc?key=0AmVDuVhIZxCTdGdLUlhkbnVUaDlCQ01uVm92S05saHc#gid=0

One suspicious thing is that the absolute value of DOMContentLoaded improved more than ParseHTML. Perhaps due to our doc.write bug, we are actually doing less work on some of the pages. I'm also a little surprised the ParseHTML numbers didn't improve more. That suggests tree building is still taking a fair amount of time.
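
The comparison behind that observation can be spelled out with a few lines of arithmetic. The helper names below are hypothetical, and the values are taken from the table above in whatever unit it uses.

```cpp
#include <cassert>

// One row of the results table: the value with the default parser and the
// value with the threaded parser.
struct Measurement {
    double defaultValue;
    double threadedValue;
};

// Absolute saving, in the same unit as the inputs.
double absoluteSaving(const Measurement& m)
{
    return m.defaultValue - m.threadedValue;
}

// Saving as a fraction of the default value.
double relativeSaving(const Measurement& m)
{
    assert(m.defaultValue > 0.0);
    return absoluteSaving(m) / m.defaultValue;
}
```

Applied to the table, DOMContentLoaded drops by 4972 - 4304 = 668 while ParseHTML total drops by only 702 - 593 = 109, so most of the DOMContentLoaded saving is not explained by the reduction in parser time alone.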
------- Comment #14 From 2013-02-22 13:36:11 PST -------
The spreadsheet for triaging the remaining test failures is at

https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdE5IbVJESW00V2F5RUIwRDk3WEhMblE&usp=sharing
------- Comment #15 From 2013-03-02 02:02:25 PST -------
This is now on by default in Chromium Canary:
https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/hBUVtg7gacE
See the announcement for details on the (substantial) perf win (even for single-core devices!?)

Other ports probably want to wait a couple days before turning this on, in case there are other bugs we should shake out.

Bug 110937 may also block at least Mac WK1 from enabling this for the time being.
------- Comment #16 From 2013-03-05 02:05:06 PST -------
*** Bug 57376 has been marked as a duplicate of this bug. ***
------- Comment #17 From 2013-03-05 02:08:11 PST -------
The threaded parser was disabled on Chromium Canary due to a couple of crashers, which we fixed today.  It should be back on as of Wednesday's Canary.
------- Comment #18 From 2013-03-06 17:34:19 PST -------
This bug is almost ready to close.  Filed bug 111645 for tracking further perf improvements to the threaded parser codepath.
------- Comment #19 From 2013-03-07 12:37:07 PST -------
IMHO, we should fix bug 109764 before closing this bug.
------- Comment #20 From 2013-03-11 19:52:49 PST -------
The parser appears to work.  :)