Bug 41141

Summary:

Add a new baseclass for XML, HTML and Text DocumentParsers to clean up DocumentParser call sites

Product:

WebKit

Reporter:

Eric Seidel (no email) <eric>

Component:

New Bugs

Assignee:

Eric Seidel (no email) <eric>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, commit-queue, darin, dbates, eric, tonyg, webkit.review.bot

Priority:

Version:

528+ (Nightly build)

Hardware:

Other

OS:

OS X 10.5

Bug Depends on:

41136

Bug Blocks:

Attachments:

Description	Flags
Patch	none
Ready for review	none
Patch	none
Patch for landing	none
Patch for landing	none
Patch for landing	none

Eric Seidel (no email)

Reported 2010-06-24 02:46:14 PDT

Add a new baseclass for XML, HTML and Text DocumentParsers to clean up DocumentParser call sites

Attachments
Patch (42.87 KB, patch) 2010-06-24 02:52 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Ready for review (47.19 KB, patch) 2010-06-24 03:26 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch (49.00 KB, patch) 2010-06-26 15:29 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (49.29 KB, patch) 2010-06-26 19:40 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (49.29 KB, patch) 2010-06-27 22:40 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (54.26 KB, patch) 2010-06-28 00:33 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Show Obsolete (5) View All Add attachment proposed patch, testcase, etc.

Eric Seidel (no email)

Comment 1 2010-06-24 02:52:26 PDT

Created attachment 59626 [details] Patch

Eric Seidel (no email)

Comment 2 2010-06-24 02:57:16 PDT

Comment on attachment 59626 [details] Patch I view this patch as an imperfect step in the right direction. It's actually probably ready for review except for missing build system modifications and possible need of name change. ScriptableDocumentParser was the best I could come up with. But there is probably better. It may make sense to make TextDocumentParser a direct subclass of DocumentParser and then make ScriptableDocumentParser a subclass of TextDocumentParser. It's unclear. Right now TextDocumentParser is "scriptable" so that it can have "write()" instead of writeRawData. There are really two divisions here. One for what sort of data do you want (raw vs. decoded) which might be solved best by moving the decoding into the base DocumentParser class. and a second of whether you as a DocumentParser support all the scripting stuff that was previously on the DocumentParser base interface. One of the nice side-effects of this patch is now nearly every method on ScriptableDocumentParser is abstract since we don't need default implementations to support the RawDataDocumentParsers anymore!

Eric Seidel (no email)

Comment 3 2010-06-24 03:01:55 PDT

Maybe a "TextDataDocumentParser" intermediary base class to handle write() and wellFormed() would be best. It's also unclear if I should move synchronousScriptWrite() down onto the base DocumentParser class in preparation for one day solving https://bugs.webkit.org/show_bug.cgi?id=25397. I think in any non-textdata document parser synchronousScriptWrite() just ends up calling open, so maybe that's what the DocumentWriter should do on its behalf.

Eric Seidel (no email)

Comment 4 2010-06-24 03:26:03 PDT

Created attachment 59627 [details] Ready for review

Adam Barth

Comment 5 2010-06-24 08:48:12 PDT

Comment on attachment 59627 [details] Ready for review Some initial comments. I'll review this more this afternoon. WebCore/dom/Document.cpp:1723 + || (scriptableDocumentParser() && scriptableDocumentParser()->isExecutingScript())) This calls the virtual function twice for no reason. You should save the result in a local. WebCore/dom/Document.cpp:1762 + if (m_frame && scriptableDocumentParser()) Same bad pattern. WebCore/dom/Document.cpp:2004 + // FIXME: JS code can always call document.write, all DocumentParsers may I don't think this comment is correct. document.write should only exist for HTML documents. WebCore/dom/ScriptableDocumentParser.cpp:38 + void ScriptableDocumentParser::synchronousScriptWrite(const SegmentedString& source) Why not "insert" ? WebCore/dom/ScriptableDocumentParser.cpp:43 + void ScriptableDocumentParser::appendNetworkData(const SegmentedString& source) Why not "append" ?

Adam Barth

Comment 6 2010-06-24 15:23:00 PDT

Comment on attachment 59627 [details] Ready for review As we discussed in person, the patch has too much RTTI. I think that's because the interface is being drawn at the wrong place.

Eric Seidel (no email)

Comment 7 2010-06-26 15:29:19 PDT

Created attachment 59844 [details] Patch

Darin Adler

Comment 8 2010-06-26 15:42:39 PDT

Comment on attachment 59844 [details] Patch I'm not sure about the name "scriptable document parser". What does "scriptable" mean here? I think it's a bug that scripts don't work in image documents, for example, so we probably should not be enshrining that bug in our design or class names. If that's what scriptable means. > - if (DocumentParser* parser = m_frame->document()->parser()) > + ScriptableDocumentParser* parser = m_frame->document()->scriptableDocumentParser(); > + if (parser) > return parser->lineNumber() + 1; Why did you take the definition of the variable out of the if statement? If you're going to do that, then I suggest changing the style to early return. Or you could leave the style the way it was. > + // FIXME: This is a hack for HTMLFormControlElement::removedFromTree > virtual LegacyHTMLTreeBuilder* htmlTreeBuilder() const { return 0; } I find this comment confusing. Is there some specific sense in which having a way to downcast the tree builder is a hack? I understand that RTTI is not elegant, but hack seems a strong word here. > + ScriptableDocumentParser(Document* document, bool viewSourceMode = false); The argument name "document" could be omitted here. This constructor should be protected. Can any of the other public members be protected? > + // Only used by Document::open for deciding if its safe to act on a > + // javscript document.open() call right now, or it should be ignored. > + virtual bool isExecutingScript() const { return false; } I suggest spelling it "JavaScript". Generally speaking I don't think functions should have comments listing who calls them. Is there some other way you can say this? > +}; > + There's an extra semicolon here after the namespace ends. > +// FIXME: Why is this different from SVGDocumentExtentions parserLineNumber? Spelling error: SVGDocumentExtensions. > +static int parserLineNumber(Document* document) > { > - DocumentParser* parser = document->parser(); > + if (document && document->scriptableDocumentParser()) > + return document->scriptableDocumentParser()->lineNumber() + 1; > + return 0; > +} It's not good to call a function that does RTTI like scriptableDocumentParser twice in a row. Is there a better way to write this? Local variable and early return perhaps? > +// FIXME: TextDocumentParser could just be an HTMLDocumentParser > +// which started the Tokenizer in the PlainText state. Really? Once you're in the PlainText state you can't get out? Seems OK. r=me despite the concerns above

Eric Seidel (no email)

Comment 9 2010-06-26 19:12:27 PDT

(In reply to comment #8) > (From update of attachment 59844 [details]) > I'm not sure about the name "scriptable document parser". What does "scriptable" mean here? I think it's a bug that scripts don't work in image documents, for example, so we probably should not be enshrining that bug in our design or class names. If that's what scriptable means. Agreed. I'm not exactly sure how you would run a script in an image document however. The user typing a javascript: url? > > - if (DocumentParser* parser = m_frame->document()->parser()) > > + ScriptableDocumentParser* parser = m_frame->document()->scriptableDocumentParser(); > > + if (parser) > > return parser->lineNumber() + 1; > > Why did you take the definition of the variable out of the if statement? If you're going to do that, then I suggest changing the style to early return. Or you could leave the style the way it was. I did it to make the line shorter. Your'e right that early return would be better, will fix. > > + // FIXME: This is a hack for HTMLFormControlElement::removedFromTree > > virtual LegacyHTMLTreeBuilder* htmlTreeBuilder() const { return 0; } > > I find this comment confusing. Is there some specific sense in which having a way to downcast the tree builder is a hack? I understand that RTTI is not elegant, but hack seems a strong word here. Oh, it's not just a specific downcast this is doing here. It's grabbing at the treebuilder which is otherwise considered an "implementation detail" of the parser. Even worse, it's expecting back a LegacyHTMLTreeBuilder, which once Adam and I finish the new tree builder won't even be possible. In order to finish the new tree builder we'll need to fix/remove this hack. > > + ScriptableDocumentParser(Document* document, bool viewSourceMode = false); > > The argument name "document" could be omitted here. This constructor should be protected. Can any of the other public members be protected? Agreed! I'll look for more members to make private/protected. That was part of the point of all these changes, I'm sad I overlooked a chance to do more API restriction here. Thanks for the catch. > > + // Only used by Document::open for deciding if its safe to act on a > > + // javscript document.open() call right now, or it should be ignored. > > + virtual bool isExecutingScript() const { return false; } > > I suggest spelling it "JavaScript". Generally speaking I don't think functions should have comments listing who calls them. Is there some other way you can say this? I agree, those comments have a high chance of bitrot. They were an attempt to document the previously sprawling API of the DocumentParser. The comments are intended to help someone else (like myself or you) understand the existing backwards interaction between the DocumntParser and the rest of WebCore in hopes of fixing it. > > +}; > > + > > There's an extra semicolon here after the namespace ends. I can never remember which definitions have semicolons after them and which don't. (Classes do, namespaces don't I guess). > > +// FIXME: Why is this different from SVGDocumentExtentions parserLineNumber? > > Spelling error: SVGDocumentExtensions. > > > +static int parserLineNumber(Document* document) > > { > > - DocumentParser* parser = document->parser(); > > + if (document && document->scriptableDocumentParser()) > > + return document->scriptableDocumentParser()->lineNumber() + 1; > > + return 0; > > +} > > It's not good to call a function that does RTTI like scriptableDocumentParser twice in a row. Is there a better way to write this? Local variable and early return perhaps? Woh, that's just wrong. Will fix. > > +// FIXME: TextDocumentParser could just be an HTMLDocumentParser > > +// which started the Tokenizer in the PlainText state. > > Really? Once you're in the PlainText state you can't get out? Yup! The plaintext state is amazing. Throw a <plaintext> tag anywhere in your document and the rest will get eaten. :) Try it in the next blog comment you write... > Seems OK. r=me despite the concerns above Thanks. Will post a new patch with corrections. I'm not super-proud of this patch. It's an attempt to bring more sanity to teh parser system, but I may be over-engineering the problem. Really I just don't understand enough of the parser/loader interactions yet to make the perfect parser API, so this is an attempt at an incremental step towards tomorrow.

Eric Seidel (no email)

Comment 10 2010-06-26 19:13:49 PDT

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#plaintext-state Note how there is no way to leave the state. Even a </plaintext> tag would be emitted as character tokens from the tokenizer. :)

Eric Seidel (no email)

Comment 11 2010-06-26 19:25:37 PDT

One advantage to just setting the HTMLTokenizer in the PlainText state and using that for the TextDocumentParser is that then we'd get all the \n\r handling for free, as well as any possible scripting needs. It wouldn't automatically get yielding, since yielding is done per-token currently. But we could add some yielding at the character level and then it would yield.

Eric Seidel (no email)

Comment 12 2010-06-26 19:40:13 PDT

Created attachment 59845 [details] Patch for landing

WebKit Commit Bot

Comment 13 2010-06-26 20:20:15 PDT

Comment on attachment 59845 [details] Patch for landing Rejecting patch 59845 from commit-queue. Unexpected failure when processing patch! Please file a bug against webkit-patch. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', 'land-attachment', '--force-clean', '--build', '--non-interactive', '--ignore-builders', '--build-style=both', '--quiet', 59845, '--test', '--parent-command=commit-queue', '--no-update']" exit_code: 1 Last 500 characters of output: date_status return NetworkTransaction().run(lambda: self._post_status_to_server(queue_name, status, patch, results_file)) File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 56, in run self._check_for_timeout() File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 63, in _check_for_timeout raise NetworkTimeout() webkitpy.common.net.networktransaction.NetworkTimeout

Darin Adler

Comment 14 2010-06-27 09:55:14 PDT

(In reply to comment #9) > (In reply to comment #8) > > I'm not sure about the name "scriptable document parser". What does "scriptable" mean here? I think it's a bug that scripts don't work in image documents, for example, so we probably should not be enshrining that bug in our design or class names. If that's what scriptable means. > > Agreed. I'm not exactly sure how you would run a script in an image document however. The user typing a javascript: url? A JavaScript URL is one way, or someone could insert additional elements using JavaScript running in another frame with the same origin. Those elements could contain attributes with script in them. There may be other ways.

Eric Seidel (no email)

Comment 15 2010-06-27 22:40:41 PDT

Created attachment 59873 [details] Patch for landing

WebKit Commit Bot

Comment 16 2010-06-27 23:21:13 PDT

Comment on attachment 59873 [details] Patch for landing Rejecting patch 59873 from commit-queue. Unexpected failure when processing patch! Please file a bug against webkit-patch. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', 'land-attachment', '--force-clean', '--build', '--non-interactive', '--ignore-builders', '--build-style=both', '--quiet', 59873, '--test', '--parent-command=commit-queue', '--no-update']" exit_code: 1 Last 500 characters of output: date_status return NetworkTransaction().run(lambda: self._post_status_to_server(queue_name, status, patch, results_file)) File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 56, in run self._check_for_timeout() File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 63, in _check_for_timeout raise NetworkTimeout() webkitpy.common.net.networktransaction.NetworkTimeout

Eric Seidel (no email)

Comment 17 2010-06-28 00:33:43 PDT

Created attachment 59878 [details] Patch for landing

WebKit Commit Bot

Comment 18 2010-06-28 00:56:32 PDT

Comment on attachment 59878 [details] Patch for landing Clearing flags on attachment: 59878 Committed r61985: <http://trac.webkit.org/changeset/61985>

WebKit Commit Bot

Comment 19 2010-06-28 00:56:39 PDT

All reviewed patches have been landed. Closing bug.

WebKit Review Bot

Comment 20 2010-06-28 01:09:02 PDT

http://trac.webkit.org/changeset/61985 might have broken Chromium Linux Release

Eric Seidel (no email)

Comment 21 2010-06-28 01:47:21 PDT

Committed r61987: <http://trac.webkit.org/changeset/61987>

Note You need to log in before you can comment on or make changes to this bug.