Bug 106127

Summary: [meta] HTML parser shouldn't block the main thread
Product: WebKit
Reporter: Adam Barth <abarth>
Component: WebCore Misc.
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED
Severity: Normal
CC: ap, benjamin, dbates, dongseong.hwang, eric, huangxueqing, jamesr, jay.bhaskar, johnme, kalyan.kondapally, koivisto, leviw, mike, mjs, nayankk, nduca, priyajeet.hora, psolanki, sam, skyul, syoichi, tonikitoo, tonyg, vivekg
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on: 110929, 111043, 111044, 57376, 63531, 90751, 106128, 106251, 106256, 106268, 106375, 106401, 106496, 106595, 106597, 106607, 106615, 106618, 106694, 106722, 106854, 106919, 107068, 107069, 107070, 107071, 107082, 107083, 107086, 107087, 107105, 107140, 107150, 107158, 107159, 107160, 107170, 107190, 107201, 107317, 107320, 107330, 107332, 107367, 107368, 107519, 107522, 107561, 107569, 107575, 107584, 107593, 107596, 107603, 107664, 107713, 107751, 107753, 107755, 107807, 107876, 107975, 107983, 108027, 108096, 108394, 108531, 108557, 108655, 108666, 108698, 108726, 108880, 108970, 108984, 109076, 109237, 109240, 109477, 109485, 109486, 109495, 109598, 109607, 109738, 109742, 109750, 109754, 109760, 109764, 109995, 110251, 110258, 110276, 110408, 110517, 110529, 110532, 110537, 110538, 110637, 110643, 110647, 110678, 110801, 110907, 110937, 110949, 110951, 111021, 111023, 111130, 111135, 111200, 111248, 111249, 111253, 111272, 111365, 111423, 111610, 112057    
Bug Blocks:    
Attachments:
Description Flags
HTML parser runtime (measured on chromium-mac on a Macbook Pro via inspector instrumentation) none
HTML parser runtime (measured on chromium-android on a Nexus 7 via inspector instrumentation) none

Description Adam Barth 2013-01-04 13:08:22 PST
This is a meta bug for moving the HTML parser off the main thread.

We're currently evaluating how much performance there is to be gained from this change.  The performance gains might arise in two ways:

1) Moving parsing off the main thread could make web pages more responsive because the main thread is available for handling input events and executing JavaScript.
2) Moving parsing off the main thread could make web pages load more quickly because WebCore can do other work in parallel with parsing HTML (such as parsing CSS or attaching elements to the render tree).

While we investigate these possible performance benefits, we might refactor the parser a bit to remove main-thread dependencies from the core objects (e.g., HTMLTokenizer and HTMLTreeBuilder).  Once we have more data, we'll start a discussion on webkit-dev before making any major architectural changes.
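
As a rough illustration of the split described above, here is a minimal sketch in which a worker thread tokenizes and the main thread only consumes token chunks.  Everything in it is hypothetical (the regex tokenizer, the chunk size, the names tokenize_in_background and parse); it is not WebKit code and says nothing about how HTMLTokenizer or HTMLTreeBuilder would actually be divided.

```python
import queue
import re
import threading

# Hypothetical, highly simplified tokenizer: tags and text runs only.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize_in_background(html, out_queue, chunk_size=2):
    """Runs on a worker thread and hands tokens to the main thread in chunks."""
    chunk = []
    for match in TOKEN_RE.finditer(html):
        chunk.append(match.group(0))
        if len(chunk) >= chunk_size:
            out_queue.put(chunk)
            chunk = []
    if chunk:
        out_queue.put(chunk)
    out_queue.put(None)  # Sentinel: tokenization is finished.


def parse(html):
    """Main thread: receives token chunks and does the stand-in tree building."""
    tokens = queue.Queue()
    worker = threading.Thread(target=tokenize_in_background, args=(html, tokens))
    worker.start()
    tree = []  # Placeholder for DOM construction, which stays on the main thread.
    while True:
        chunk = tokens.get()
        if chunk is None:
            break
        tree.extend(chunk)
    worker.join()
    return tree
```

Handing over chunks rather than single tokens keeps the number of cross-thread handoffs small; the chunk size here is arbitrary.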
Comment 1 Adam Barth 2013-01-06 18:42:17 PST
Here's a slide deck from Mozilla related to this topic:
http://people.mozilla.com/~roc/Samsung/MozillaParallelism.pdf
Comment 2 Antti Koivisto 2013-01-07 05:03:08 PST
Over a run of a bunch of real-world web sites, we seem to have ~3% of main thread CPU time in HTML tokenization and parsing (excluding the actual tree building, the most expensive part). There are surely individual cases much worse than that. This is big enough to support architectural changes like this.

The goal should be to eventually have the whole path, from networking onward, off the main thread and only do the actual tree building there.
Comment 3 Eric Seidel (no email) 2013-01-07 12:02:50 PST
(In reply to comment #2)
> Over a run of a bunch of real-world web sites, we seem to have ~3% of main thread CPU time in HTML tokenization and parsing (excluding the actual tree building, the most expensive part). There are surely individual cases much worse than that. This is big enough to support architectural changes like this.
> 
> The goal should be to eventually have the whole path, from networking onward, off the main thread and only do the actual tree building there.

I'm very curious about this number!

Adam and I briefly looked into generating a number like that last Friday.  I had assumed parse time would be larger than 3% of total active main thread time, especially on Mobile.

Could you share some of your methodology?  Or other percentages of main thread usage?  I'd be very interested in what you know about how we're spending time on the main thread, and happy to help you reduce it.

I assume you just used dtrace + the PLT or similar?  Our first-crack plan had been to use inspector timeline events and page cyclers (same idea as the plt), but I'm less interested in what events the inspector happens to record, and more about total time on the main thread and where it's going.  We could also use systrace for this, and I might go that route next.
Comment 4 Eric Seidel (no email) 2013-01-07 12:03:45 PST
(In reply to comment #3)
> (In reply to comment #2)
>
> I'm very curious about this number!

We can also discuss this offline or in a separate bug.  I don't need to hijack Adam's meta-bug.  But I remain very interested in your testing and being able to repeat it/compare numbers/speed-up webkit.
Comment 5 Tony Gentilcore 2013-01-08 10:53:22 PST
(In reply to comment #1)
> Here's a slide deck from Mozilla related to this topic:
> http://people.mozilla.com/~roc/Samsung/MozillaParallelism.pdf

And some more in-depth design discussion:
https://developer.mozilla.org/en/Gecko/HTML_parser_threading
Comment 6 Antti Koivisto 2013-01-08 11:48:25 PST
Note that Mozilla's implementation of this wasn't necessarily that evidence-driven: https://twitter.com/hsivonen/status/129457178368151552
Comment 7 Adam Barth 2013-01-08 12:44:49 PST
> Over a run of a bunch of real-world web sites, we seem to have ~3% of main thread CPU time in HTML tokenization and parsing (excluding the actual tree building, the most expensive part).

When you say "tree building," do you mean the work done by the HTMLTreeBuilder object or the actual parserAppendChild/attach calls?  We should be able to move HTMLTreeBuilder onto the background thread, but we probably would not be able to move parserAppendChild or attach.

nduca did some measurements with Chromium's telemetry profiler (which uses the inspector timeline's notion of what constitutes HTML parsing time).  On a selection of 25 popular web sites, he sees the parser using between 2% and 8% of main thread CPU time (with an average of 5%).  Some examples on the high end (i.e., >=7%) are games.yahoo.com, www.youtube.com, http://en.wikipedia.org/wiki/Wikipedia, and pinterest.com.

These numbers seem consistent with Antti's measurements given that Antti is likely excluding some amount of tree building work that the inspector is charging to the parser.
Comment 8 Adam Barth 2013-01-08 14:25:53 PST
Created attachment 181766 [details]
HTML parser runtime (measured on chromium-mac on a Macbook Pro via inspector instrumentation)

Here are more details from the dataset Nat took on his Macbook Pro.  The "ParseHTML" column represents the total amount of time attributed to the HTML parser by the inspector instrumentation.  The "ParseHTML_max" column is the largest contiguous chunk of time (in a single load of the page).

Looking at the ParseHTML_max column, the parser seems to often consume multiple frames (by which I mean 60 Hz time slices on the main thread).  In some cases, such as http://en.wikipedia.org/wiki/Wikipedia and http://games.yahoo.com the parser creates 7-9 frames of jank.

Note: These measurements were taken on a Macbook Pro.  It would be interesting to see how these measurements compare on a mobile device.
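
For reference, a small sketch of the frame arithmetic used above, assuming a "frame" is exactly one 60 Hz time slice (about 16.7 ms).  The helper names are made up for illustration.

```python
FRAMES_PER_SECOND = 60  # One frame is one 60 Hz time slice on the main thread.


def ms_to_frames(duration_ms):
    """Frames' worth of main-thread time occupied by a contiguous task."""
    return duration_ms * FRAMES_PER_SECOND / 1000


def frames_to_ms(frames):
    """Wall-clock duration, in milliseconds, of a number of 60 Hz frames."""
    return frames * 1000 / FRAMES_PER_SECOND
```

By this arithmetic, 7-9 frames of jank corresponds to roughly 117-150 ms of uninterrupted parser time.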
Comment 9 Eric Seidel (no email) 2013-01-08 14:31:47 PST
(In reply to comment #8)
> Created an attachment (id=181766) [details]
> Looking at the ParseHTML_max column, the parser seems to often consume multiple frames (by which I mean 60 Hz time slices on the main thread).  In some cases, such as http://en.wikipedia.org/wiki/Wikipedia and http://games.yahoo.com the parser creates 7-9 frames of jank.

The parser is currently set only to yield every 4000 tokens or 500ms.  Which is likely waaay too long on a touch device.

http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLParserScheduler.cpp#L34

It would be interesting to build with a much lower threshold (like 30ms) and see how the web feels.

Pulling the parser off the main thread might well help with this sort of jank.
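
To make the threshold experiment concrete, here is a minimal sketch of a yield check with a token limit and a time limit.  It follows the description above (4000 tokens or 500 ms) rather than the actual logic in HTMLParserScheduler.cpp; the class name, parameters, and the injectable clock are all hypothetical.

```python
import time


class YieldBudget:
    """Hypothetical parser yield check: a token limit plus a time limit."""

    def __init__(self, max_tokens=4000, max_seconds=0.5, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.clock = clock  # Injectable so the check can be tested deterministically.
        self.reset()

    def reset(self):
        """Start a new parsing slice after control returns from the event loop."""
        self.tokens = 0
        self.start = self.clock()

    def should_yield(self):
        """Call once per processed token; True means yield to the event loop."""
        self.tokens += 1
        if self.tokens >= self.max_tokens:
            return True
        return self.clock() - self.start >= self.max_seconds
```

Trying the lower threshold suggested above would then amount to constructing the budget with max_seconds=0.03 instead of 0.5.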
Comment 10 Antti Koivisto 2013-01-08 15:31:54 PST
(In reply to comment #7)
> > Over a run of a bunch of real-world web sites, we seem to have ~3% of main thread CPU time in HTML tokenization and parsing (excluding the actual tree building, the most expensive part).
> 
> When you say "tree building," do you mean the work done by the HTMLTreeBuilder object or the actual parserAppendChild/attach calls?  We should be able to move HTMLTreeBuilder onto the background thread, but we probably would not be able to move parserAppendChild or attach.

I was pruning out the entire HTMLTreeBuilder::constructTreeFromAtomicToken(). Pruning more carefully (calls to Element functions only) leaves ~3.5% in total.

> nduca did some measurements with Chromium's telemetry profiler (which uses the inspector timeline's notion of what constitutes HTML parsing time).  On a selection of 25 popular web sites, he sees the parser using between 2% and 8% of main thread CPU time (with an average of 5%).  Some examples on the high end (i.e., >=7%) are games.yahoo.com, www.youtube.com, http://en.wikipedia.org/wiki/Wikipedia, and pinterest.com.

I would like to see measurements done without relying on inspector infrastructure (for example by simple instrumentation code) so we know what exactly is being measured. 

As I said, I think this is worth doing based on the current data already. However, it would be good to have a realistic understanding of what kinds of gains to expect.
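
One possible shape for such simple instrumentation is a scoped timer that accumulates time per labeled phase, so that nested phases (such as tree building inside parsing) can be pruned explicitly.  This is only a sketch; the class, the labels, and the injectable clock are hypothetical and not taken from WebKit.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates inclusive time per labeled phase."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def measure(self, label):
        """Adds the time spent inside the with-block to the label's total."""
        start = self.clock()
        try:
            yield
        finally:
            self.totals[label] += self.clock() - start

    def exclusive(self, label, nested_label):
        """Time under `label` with the nested phase pruned out."""
        return self.totals[label] - self.totals[nested_label]
```

Because the totals are inclusive, what is being excluded is stated at the call site, for example timer.exclusive("parse", "tree") for parser time without tree building.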
Comment 11 Adam Barth 2013-01-09 16:05:38 PST
Created attachment 182002 [details]
HTML parser runtime (measured on chromium-android on a Nexus 7 via inspector instrumentation)

Here are the results from the chromium-android port on a Nexus 7 (using a content_shell build from this afternoon).  The parser takes up more time on the main thread.  For example, on games.yahoo.com the HTML parser takes up 1.2 seconds.  On average, the HTML parser takes 486 ms of main thread time.

The "max" times are also considerably worse on the Nexus 7.  The average "max" value is about 10 frames (60 Hz time slices), with the worst case being 38 frames.
Comment 12 Adam Barth 2013-01-09 16:07:07 PST
> I would like to see measurements done without relying on inspector infrastructure (for example by simple instrumentation code) so we know what exactly is being measured. 

I would prefer to use a tool like Instruments as well, but unfortunately the measurement harness we're using is built out of the inspector instrumentation.

> As I said, I think this is worth doing based on the current data already. However, it would be good to have a realistic understanding of what kinds of gains to expect.

I agree.  I'll send an email to webkit-dev.
Comment 13 Tony Gentilcore 2013-01-24 11:53:59 PST
I ran the prototype through the top 25 suite in Telemetry on a Galaxy Nexus (with the exception of Calendar which didn't load with the threaded parser). This benchmark loads cached sites from a local web page replay instance.

Results are preliminary but encouraging:

                  Default  Threaded  Improvement
DOMContentLoaded     4972      4304          13%
ParseHTML total       702       593          14%
ParseHTML avg           9         5          44%
ParseHTML max         309       107          65%

Full results:
https://docs.google.com/a/chromium.org/spreadsheet/ccc?key=0AmVDuVhIZxCTdGdLUlhkbnVUaDlCQ01uVm92S05saHc#gid=0

One suspicious thing is that the absolute value of DOMContentLoaded improved more than ParseHTML. Perhaps due to our doc.write bug, we are actually doing less work on some of the pages. I'm also a little surprised the ParseHTML numbers didn't improve more. That suggests tree building is still taking a fair amount of time.
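
The comparison between absolute and relative gains can be reproduced from the summary table with a few lines.  This sketch uses only the rounded figures shown above (in the table's own units), so recomputed percentages may differ slightly from the Improvement column; the names are made up for illustration.

```python
# (default, threaded) pairs copied from the summary table above.
RESULTS = {
    "DOMContentLoaded": (4972, 4304),
    "ParseHTML total": (702, 593),
    "ParseHTML avg": (9, 5),
    "ParseHTML max": (309, 107),
}


def absolute_gain(default, threaded):
    """How much the threaded run saved, in the table's own units."""
    return default - threaded


def relative_gain_percent(default, threaded):
    """Saving as a percentage of the default (non-threaded) value."""
    return 100.0 * (default - threaded) / default


for name, (default, threaded) in RESULTS.items():
    gain = absolute_gain(default, threaded)
    percent = relative_gain_percent(default, threaded)
    print(f"{name:<18}{gain:>6}{percent:>8.1f}%")
```

The absolute gain for DOMContentLoaded (668) is several times the gain for ParseHTML total (109), which is the mismatch noted above.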
Comment 14 Adam Barth 2013-02-22 13:36:11 PST
The spreadsheet for triaging the remaining test failures is at

https://docs.google.com/spreadsheet/ccc?key=0AlC4tS7Ao1fIdE5IbVJESW00V2F5RUIwRDk3WEhMblE&usp=sharing
Comment 15 Eric Seidel (no email) 2013-03-02 02:02:25 PST
This is now on by default in Chromium Canary:
https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/hBUVtg7gacE
See the announcement for details on the (substantial) perf win (even for single-core devices!?)

Other ports probably want to wait a couple days before turning this on, in case there are other bugs we should shake out.

Bug 110937 may also block at least Mac WK1 from enabling this for the time being.
Comment 16 Eric Seidel (no email) 2013-03-05 02:05:06 PST
*** Bug 57376 has been marked as a duplicate of this bug. ***
Comment 17 Eric Seidel (no email) 2013-03-05 02:08:11 PST
The parser was disabled on Chromium Canary due to a couple crashers we fixed today.  It should be back on as of Weds' Canary.
Comment 18 Eric Seidel (no email) 2013-03-06 17:34:19 PST
This bug is almost ready to close.  Filed bug 111645 for tracking further perf improvements to the threaded parser codepath.
Comment 19 Adam Barth 2013-03-07 12:37:07 PST
IMHO, we should fix bug 109764 before closing this bug.
Comment 20 Adam Barth 2013-03-11 19:52:49 PDT
The parser appears to work.  :)