90751 – Introduce speculative HTML parsing

RESOLVED FIXED90751

Introduce speculative HTML parsing

https://bugs.webkit.org/show_bug.cgi?id=90751

Summary Introduce speculative HTML parsing

Kwang Yul Seo

Reported 2012-07-08 22:36:16 PDT

Make HTMLDocumentParser parse the remaining input stream speculatively while it is waiting for scripts. If the pending script doesn't end up calling document.write(), the speculation is successful. We can borrow the idea of Firefox's HTML parser threading: https://developer.mozilla.org/en/Gecko/HTML_parser_threading We can implement speculative parsing with or without moving HTML parsing off the main thread. My plan is two steps: step 1 is to implement speculative parsing in the main thread and step 2 is to make HTML parsing off the main thread. Also preload scanning and speculative parsing can be combined because they can share the same tokenizer.

Attachments
Add attachment proposed patch, testcase, etc.

Kwang Yul Seo

Comment 1 2012-07-25 00:46:53 PDT

Once we have speculative parsing in action, preload scanner can reuse the tokens generated by the speculative parser. This will fix Bug 57376. Firefore's HTML parser threading does this.

Zoltan Horvath

Comment 2 2012-07-25 00:56:54 PDT

I'm really excited to see you working on this.

Kwang Yul Seo

Comment 3 2012-08-01 03:58:05 PDT

I am working on this. I will setup a GitHub repo so that people can see my work-in-progress implementation and give me some feedbacks.

Kwang Yul Seo

Comment 4 2012-08-02 05:52:15 PDT

(In reply to comment #3) > I am working on this. I will setup a GitHub repo so that people can see my work-in-progress implementation and give me some feedbacks. I've just forked a new webkit repo from https://github.com/WebKit/webkit https://github.com/kseo/webkit I will create "speculativeparser" branch and upload my changes when my initial implementation is ready.

Adam Barth

Comment 5 2012-08-02 08:18:09 PDT

Let me know when you'd like me to take a look at your branch. I can give you feedback directly on GitHub if you like.

Kwang Yul Seo

Comment 6 2012-08-02 17:12:46 PDT

(In reply to comment #5) > Let me know when you'd like me to take a look at your branch. I can give you feedback directly on GitHub if you like. That's great! Thanks. I will let you know.

Kwang Yul Seo

Comment 7 2012-08-06 10:25:10 PDT

(In reply to comment #5) > Let me know when you'd like me to take a look at your branch. I can give you feedback directly on GitHub if you like. Adam, would you take a look at my branch? I've just pushed the preparation patches for speculative parser to my repo as "speculativeparser" branch. I ensured that each commit passes the parser-related layout tests (fast/parser and html5lib). Please refer to the commit logs for detailed explanation of each commit. https://github.com/kseo/webkit/tree/speculativeparser I haven't fully implemented the speculation parser yet, but I want to get some feedback on these preparation patches early because they greatly affect the follow-up speculative parser implementation. After I replaced HTMLConstructionSiteTask and HTMLConstructionSite::attachLater with Function and bind, there is a 2.4% performance regression in the html5-parser benchmark. I found that the regression is mostly due to the virtual method call used in Function implementation. If you think this regression is not small, I will re-implement task classes with no virtual method. My goal is to reduce the overhead by less than 1% so that we will be more confident that speculating parsing has real performance benefits.

Kwang Yul Seo

Comment 8 2012-08-06 10:55:05 PDT

The biggest change in preparation of speculative parsing is the introduction of node proxy. Please refer to the following commit log for detailed explanation: """Introduce node proxy classes and make the parser use them. When the parser runs in speculation mode, it can't actually create nodes or elements because element creation has visible side effects such as registering image or form elements. So the speculative parser must be able to pump the tokenizer and run the tree builder without performing construction site tasks such as attachment and element creation. The parser needs to queue such tasks and run them all if the speculation ends up succeeding or throw them away otherwise. Currently, each item of the stack of open elements and the active formatting element list has a pointer to its element. Thnaks to r123577 and its follow-up pathces, we no longer read tag names and attributes from these elements, but we still keep the element pointer in each item because element pointers are passed when creating tree construction tasks. We need a way to refer to an element without actually creating it. So I introduced proxy nodes. A proxy node is an object which can hold a RefPtr pointer to an element. When we need to create a node, we create an empty node proxy instead and queue an element creation task that will be performed later. This element creation task actually creates an element and stores it in its proxy node. So with these proxy nodes, we can create follow-up construction site tasks. When these tasks are performed, nodes are retrieved from proxy nodes Other parts of code including the stack of open elements and the active formatting element list treats a proxy node as if it is a real node. This is okay because we never (and must not) call methods on the node outside construction site tasks. The tree builder must be oblivious to the difference between Node and NodeProxy. We have 4 node proxy classes with the same class hierarchy as Node: NodeProxy, ContainerNodeProxy, ElementProxy and HTMLFormElementProxy. They essentially play the same role but I added subclasses because I wanted to keep the type information present in the current code. NodeProxy itself is also ref-counted to prevent memory leak. Node proxies are similar to content handles used in Firefox's HTML parsing threading implementation, but they differ in that contents handles stick around for the life time of the parser and have no type information. A content handle is just a pointer to a pointer to an element (nsIContent**). Refer to https://developer.mozilla.org/en-US/docs/Gecko/HTML_parser_threading "Memory Management When Crossing the Thread Boundary" for further information. After this patch, we have ugly code like element->element(). The first element is of type ElementProxy*, and the second element() is of type Element*. This patch mechanically replaces Element* with ElementProxy*. So I had to retrieve Element* from ElementProxy* when calling methods on Element*. This ugliness will be removed in the follow-up patches."""

Kwang Yul Seo

Comment 9 2012-08-06 21:46:42 PDT

(In reply to comment #7) > After I replaced HTMLConstructionSiteTask and HTMLConstructionSite::attachLater with Function and bind, there is a 2.4% performance regression in the html5-parser benchmark. I found that the regression is mostly due to the virtual method call used in Function implementation. If you think this regression is not small, I will re-implement task classes with no virtual method. My goal is to reduce the overhead by less than 1% so that we will be more confident that speculating parsing has real performance benefits. Adam, please look at only "Introduce node proxy classes and make the parser use them." patch. I have decided to rework HTMLConstructionSiteTask class instead of replacing it with Function objects. It will (hopefully) solve the performance regression. Also it seems Function is not flexible enough to solve the remaining implementation issues e.g, script execution task. I will write more on this later.

Kwang Yul Seo

Comment 10 2012-08-09 05:47:38 PDT

I uploaded 3 preparation patches for speculative parsing: Bug 93455, Bug 93608, Bug 93612. These are large (but mostly mechanical) refactoring changes. Because the rest of speculative parser implementation depends heavily on these refactoring patches, I want to get some early feedback. You can see them from my GitHub repo too: https://github.com/kseo/webkit/tree/speculativeparser2

Eric Seidel (no email)

Comment 11 2013-03-05 12:44:58 PST

This was a great bug. BUt I think we can close it as fixed, or as a dupe of the larger threaded parser work.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component DOM

Assignee

Kwang Yul Seo

Reported

2012-07-08 22:36 PDT

Modified

2013-03-05 12:44 PST History

CC List

7 users Show

URL

Keywords

Depends on

91703 92057 92224 93070 93113 93137 93455 93608 93612

Blocks

106127

Dependencies

tree graph