Bug 90751

Summary: Introduce speculative HTML parsing
Product: WebKit Reporter: Kwang Yul Seo <skyul>
Component: DOMAssignee: Kwang Yul Seo <skyul>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, eric, koivisto, syoichi, tonyg, zoltan
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 91703, 92057, 92224, 93070, 93113, 93137, 93455, 93608, 93612    
Bug Blocks: 106127    

Description Kwang Yul Seo 2012-07-08 22:36:16 PDT
Make HTMLDocumentParser parse the remaining input stream speculatively while it is waiting for scripts. If the pending script doesn't end up calling document.write(), the speculation is successful.

We can borrow the idea of Firefox's HTML parser threading:

https://developer.mozilla.org/en/Gecko/HTML_parser_threading

We can implement speculative parsing with or without moving HTML parsing off the main thread. My plan is two steps: step 1 is to implement speculative parsing in the main thread and step 2 is to make HTML parsing off the main thread.

Also preload scanning and speculative parsing can be combined because they can share the same tokenizer.
Comment 1 Kwang Yul Seo 2012-07-25 00:46:53 PDT
Once we have speculative parsing in action, preload scanner can reuse the tokens generated by the speculative parser. This will fix Bug 57376. Firefore's HTML parser threading does this.
Comment 2 Zoltan Horvath 2012-07-25 00:56:54 PDT
I'm really excited to see you working on this.
Comment 3 Kwang Yul Seo 2012-08-01 03:58:05 PDT
I am working on this. I will setup a GitHub repo so that people can see my work-in-progress implementation and give me some feedbacks.
Comment 4 Kwang Yul Seo 2012-08-02 05:52:15 PDT
(In reply to comment #3)
> I am working on this. I will setup a GitHub repo so that people can see my work-in-progress implementation and give me some feedbacks.

I've just forked a new webkit repo from https://github.com/WebKit/webkit

https://github.com/kseo/webkit

I will create "speculativeparser" branch and upload my changes when my initial implementation is ready.
Comment 5 Adam Barth 2012-08-02 08:18:09 PDT
Let me know when you'd like me to take a look at your branch.  I can give you feedback directly on GitHub if you like.
Comment 6 Kwang Yul Seo 2012-08-02 17:12:46 PDT
(In reply to comment #5)
> Let me know when you'd like me to take a look at your branch.  I can give you feedback directly on GitHub if you like.

That's great! Thanks. I will let you know.
Comment 7 Kwang Yul Seo 2012-08-06 10:25:10 PDT
(In reply to comment #5)
> Let me know when you'd like me to take a look at your branch.  I can give you feedback directly on GitHub if you like.

Adam, would you take a look at my branch? I've just pushed the preparation patches for speculative parser to my repo as "speculativeparser" branch. I ensured that each commit passes the parser-related layout tests (fast/parser and html5lib). Please refer to the commit logs for detailed explanation of each commit. 

https://github.com/kseo/webkit/tree/speculativeparser

I haven't fully implemented the speculation parser yet, but I want to get some feedback on these preparation patches early because they greatly affect the follow-up speculative parser implementation.

After I replaced HTMLConstructionSiteTask and HTMLConstructionSite::attachLater with Function and bind, there is a 2.4% performance regression in the html5-parser benchmark. I found that the regression is mostly due to the virtual method call used in Function implementation. If you think this regression is not small, I will re-implement task classes with no virtual method. My goal is to reduce the overhead by less than 1% so that we will be more confident that speculating parsing has real performance benefits.
Comment 8 Kwang Yul Seo 2012-08-06 10:55:05 PDT
The biggest change in preparation of speculative parsing is the introduction of node proxy. Please refer to the following commit log for detailed explanation:

"""Introduce node proxy classes and make the parser use them.

When the parser runs in speculation mode, it can't actually create nodes or elements because element creation has visible side effects such as registering image or form elements. So the speculative parser must be able to pump the tokenizer and run the tree builder without performing construction site tasks such as attachment and element creation. The parser needs to queue such tasks and run them all if the speculation ends up succeeding or throw them away otherwise.

Currently, each item of the stack of open elements and the active formatting element list has a pointer to its element. Thnaks to r123577 and its follow-up pathces, we no longer read tag names and attributes from these elements, but we still keep the element pointer in each item because element pointers are passed when creating tree construction tasks.

We need a way to refer to an element without actually creating it. So I introduced proxy nodes. A proxy node is an object which can hold a RefPtr pointer to an element. When we need to create a node, we create an empty node proxy instead and queue an element creation task that will be performed later. This element creation task actually creates an element and stores it in its proxy node. So with these proxy nodes, we can create follow-up construction site tasks. When these tasks are performed, nodes are retrieved from proxy nodes

Other parts of code including the stack of open elements and the active formatting element list treats a proxy node as if it is a real node. This is okay because we never (and must not) call methods on the node outside construction site tasks. The tree builder must be oblivious to the difference between Node and NodeProxy.

We have 4 node proxy classes with the same class hierarchy as Node: NodeProxy,
ContainerNodeProxy, ElementProxy and HTMLFormElementProxy. They essentially play the same role but I added subclasses because I wanted to keep the type information present in the current code. NodeProxy itself is also ref-counted to prevent memory leak.

Node proxies are similar to content handles used in Firefox's HTML parsing threading implementation, but they differ in that contents handles stick around for the life time of the parser and have no type information. A content handle is just a pointer to a pointer to an element (nsIContent**). Refer to https://developer.mozilla.org/en-US/docs/Gecko/HTML_parser_threading "Memory Management When Crossing the Thread Boundary" for further information.

After this patch, we have ugly code like element->element(). The first element is of type ElementProxy*, and the second element() is of type Element*. This patch mechanically replaces Element* with ElementProxy*. So I had to retrieve Element* from ElementProxy* when calling methods on Element*. This ugliness will be removed in the follow-up patches."""
Comment 9 Kwang Yul Seo 2012-08-06 21:46:42 PDT
(In reply to comment #7)
> After I replaced HTMLConstructionSiteTask and HTMLConstructionSite::attachLater with Function and bind, there is a 2.4% performance regression in the html5-parser benchmark. I found that the regression is mostly due to the virtual method call used in Function implementation. If you think this regression is not small, I will re-implement task classes with no virtual method. My goal is to reduce the overhead by less than 1% so that we will be more confident that speculating parsing has real performance benefits.

Adam, please look at only "Introduce node proxy classes and make the parser use them." patch. 

I have decided to rework HTMLConstructionSiteTask class instead of replacing it with Function objects. It will (hopefully) solve the performance regression.

Also it seems Function is not flexible enough to solve the remaining implementation issues e.g, script execution task. I will write more on this later.
Comment 10 Kwang Yul Seo 2012-08-09 05:47:38 PDT
I uploaded 3 preparation patches for speculative parsing: Bug 93455, Bug 93608, Bug 93612.

These are large (but mostly mechanical) refactoring changes. Because the rest of speculative parser implementation depends heavily on these refactoring patches, I want to get some early feedback.

You can see them from my GitHub repo too:

https://github.com/kseo/webkit/tree/speculativeparser2
Comment 11 Eric Seidel (no email) 2013-03-05 12:44:58 PST
This was a great bug.  BUt I think we can close it as fixed, or as a dupe of the larger threaded parser work.