Bug 41141

Summary: Add a new baseclass for XML, HTML and Text DocumentParsers to clean up DocumentParser call sites
Product: WebKit Reporter: Eric Seidel (no email) <eric>
Component: New BugsAssignee: Eric Seidel (no email) <eric>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, commit-queue, darin, dbates, eric, tonyg, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Other   
OS: OS X 10.5   
Bug Depends on: 41136    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
Ready for review
none
Patch
none
Patch for landing
none
Patch for landing
none
Patch for landing none

Description Eric Seidel (no email) 2010-06-24 02:46:14 PDT
Add a new baseclass for XML, HTML and Text DocumentParsers to clean up DocumentParser call sites
Comment 1 Eric Seidel (no email) 2010-06-24 02:52:26 PDT
Created attachment 59626 [details]
Patch
Comment 2 Eric Seidel (no email) 2010-06-24 02:57:16 PDT
Comment on attachment 59626 [details]
Patch

I view this patch as an imperfect step in the right direction.

It's actually probably ready for review except for missing build system modifications and possible need of name change.

ScriptableDocumentParser was the best I could come up with.  But there is probably better.

It may make sense to make TextDocumentParser a direct subclass of DocumentParser and then make ScriptableDocumentParser a subclass of TextDocumentParser.  It's unclear.  Right now TextDocumentParser is "scriptable" so that it can have "write()" instead of writeRawData.

There are really two divisions here.  One for what sort of data do you want (raw vs. decoded) which might be solved best by moving the decoding into the base DocumentParser class. and a second of whether you as a DocumentParser support all the scripting stuff that was previously on the DocumentParser base interface.

One of the nice side-effects of this patch is now nearly every method on ScriptableDocumentParser is abstract since we don't need default implementations to support the RawDataDocumentParsers anymore!
Comment 3 Eric Seidel (no email) 2010-06-24 03:01:55 PDT
Maybe a "TextDataDocumentParser" intermediary base class to handle write() and wellFormed() would be best.

It's also unclear if I should move synchronousScriptWrite() down onto the base DocumentParser class in preparation for one day solving https://bugs.webkit.org/show_bug.cgi?id=25397.  I think in any non-textdata document parser synchronousScriptWrite() just ends up calling open, so maybe that's what the DocumentWriter should do on its behalf.
Comment 4 Eric Seidel (no email) 2010-06-24 03:26:03 PDT
Created attachment 59627 [details]
Ready for review
Comment 5 Adam Barth 2010-06-24 08:48:12 PDT
Comment on attachment 59627 [details]
Ready for review

Some initial comments.  I'll review this more this afternoon.

WebCore/dom/Document.cpp:1723
 +              || (scriptableDocumentParser() && scriptableDocumentParser()->isExecutingScript()))
This calls the virtual function twice for no reason.  You should save the result in a local.

WebCore/dom/Document.cpp:1762
 +      if (m_frame && scriptableDocumentParser())
Same bad pattern.

WebCore/dom/Document.cpp:2004
 +      // FIXME: JS code can always call document.write, all DocumentParsers may
I don't think this comment is correct.  document.write should only exist for HTML documents.

WebCore/dom/ScriptableDocumentParser.cpp:38
 +  void ScriptableDocumentParser::synchronousScriptWrite(const SegmentedString& source)
Why not "insert" ?

WebCore/dom/ScriptableDocumentParser.cpp:43
 +  void ScriptableDocumentParser::appendNetworkData(const SegmentedString& source)
Why not "append" ?
Comment 6 Adam Barth 2010-06-24 15:23:00 PDT
Comment on attachment 59627 [details]
Ready for review

As we discussed in person, the patch has too much RTTI.  I think that's because the interface is being drawn at the wrong place.
Comment 7 Eric Seidel (no email) 2010-06-26 15:29:19 PDT
Created attachment 59844 [details]
Patch
Comment 8 Darin Adler 2010-06-26 15:42:39 PDT
Comment on attachment 59844 [details]
Patch

I'm not sure about the name "scriptable document parser". What does "scriptable" mean here? I think it's a bug that scripts don't work in image documents, for example, so we probably should not be enshrining that bug in our design or class names. If that's what scriptable means.

> -    if (DocumentParser* parser = m_frame->document()->parser())
> +    ScriptableDocumentParser* parser = m_frame->document()->scriptableDocumentParser();
> +    if (parser)
>          return parser->lineNumber() + 1;

Why did you take the definition of the variable out of the if statement? If you're going to do that, then I suggest changing the style to early return. Or you could leave the style the way it was.

> +    // FIXME: This is a hack for HTMLFormControlElement::removedFromTree
>      virtual LegacyHTMLTreeBuilder* htmlTreeBuilder() const { return 0; }

I find this comment confusing. Is there some specific sense in which having a way to downcast the tree builder is a hack? I understand that RTTI is not elegant, but hack seems a strong word here.

> +    ScriptableDocumentParser(Document* document, bool viewSourceMode = false);

The argument name "document" could be omitted here. This constructor should be protected. Can any of the other public members be protected?

> +    // Only used by Document::open for deciding if its safe to act on a
> +    // javscript document.open() call right now, or it should be ignored.
> +    virtual bool isExecutingScript() const { return false; }

I suggest spelling it "JavaScript". Generally speaking I don't think functions should have comments listing who calls them. Is there some other way you can say this?

> +};
> +

There's an extra semicolon here after the namespace ends.

> +// FIXME: Why is this different from SVGDocumentExtentions parserLineNumber?

Spelling error: SVGDocumentExtensions.

> +static int parserLineNumber(Document* document)
>  {
> -    DocumentParser* parser = document->parser();
> +    if (document && document->scriptableDocumentParser())
> +        return document->scriptableDocumentParser()->lineNumber() + 1;
> +    return 0;
> +}

It's not good to call a function that does RTTI like scriptableDocumentParser twice in a row. Is there a better way to write this? Local variable and early return perhaps?

> +// FIXME: TextDocumentParser could just be an HTMLDocumentParser
> +// which started the Tokenizer in the PlainText state.

Really? Once you're in the PlainText state you can't get out?

Seems OK. r=me despite the concerns above
Comment 9 Eric Seidel (no email) 2010-06-26 19:12:27 PDT
(In reply to comment #8)
> (From update of attachment 59844 [details])
> I'm not sure about the name "scriptable document parser". What does "scriptable" mean here? I think it's a bug that scripts don't work in image documents, for example, so we probably should not be enshrining that bug in our design or class names. If that's what scriptable means.

Agreed.  I'm not exactly sure how you would run a script in an image document however.  The user typing a javascript: url?

> > -    if (DocumentParser* parser = m_frame->document()->parser())
> > +    ScriptableDocumentParser* parser = m_frame->document()->scriptableDocumentParser();
> > +    if (parser)
> >          return parser->lineNumber() + 1;
> 
> Why did you take the definition of the variable out of the if statement? If you're going to do that, then I suggest changing the style to early return. Or you could leave the style the way it was.

I did it to make the line shorter.  Your'e right that early return would be better, will fix.


> > +    // FIXME: This is a hack for HTMLFormControlElement::removedFromTree
> >      virtual LegacyHTMLTreeBuilder* htmlTreeBuilder() const { return 0; }
> 
> I find this comment confusing. Is there some specific sense in which having a way to downcast the tree builder is a hack? I understand that RTTI is not elegant, but hack seems a strong word here.

Oh, it's not just a specific downcast this is doing here.  It's grabbing at the treebuilder which is otherwise considered an "implementation detail" of the parser.  Even worse, it's expecting back a LegacyHTMLTreeBuilder, which once Adam and I finish the  new tree builder won't even be possible.  In order to finish the new tree builder we'll need to fix/remove this hack.

> > +    ScriptableDocumentParser(Document* document, bool viewSourceMode = false);
> 
> The argument name "document" could be omitted here. This constructor should be protected. Can any of the other public members be protected?

Agreed!  I'll look for more members to make private/protected.  That was part of the point of all these changes, I'm sad I overlooked a chance to do more API restriction here.  Thanks for the catch.

> > +    // Only used by Document::open for deciding if its safe to act on a
> > +    // javscript document.open() call right now, or it should be ignored.
> > +    virtual bool isExecutingScript() const { return false; }
> 
> I suggest spelling it "JavaScript". Generally speaking I don't think functions should have comments listing who calls them. Is there some other way you can say this?

I agree, those comments have a high chance of bitrot.  They were an attempt to document the previously sprawling API of the DocumentParser.  The comments are intended to help someone else (like myself or you) understand the existing backwards interaction between the DocumntParser and the rest of WebCore in hopes of fixing it.

> > +};
> > +
> 
> There's an extra semicolon here after the namespace ends.

I can never remember which definitions have semicolons after them and which don't.  (Classes do, namespaces don't I guess).

> > +// FIXME: Why is this different from SVGDocumentExtentions parserLineNumber?
> 
> Spelling error: SVGDocumentExtensions.
> 
> > +static int parserLineNumber(Document* document)
> >  {
> > -    DocumentParser* parser = document->parser();
> > +    if (document && document->scriptableDocumentParser())
> > +        return document->scriptableDocumentParser()->lineNumber() + 1;
> > +    return 0;
> > +}
> 
> It's not good to call a function that does RTTI like scriptableDocumentParser twice in a row. Is there a better way to write this? Local variable and early return perhaps?

Woh, that's just wrong.  Will fix.

> > +// FIXME: TextDocumentParser could just be an HTMLDocumentParser
> > +// which started the Tokenizer in the PlainText state.
> 
> Really? Once you're in the PlainText state you can't get out?

Yup!  The plaintext state is amazing.  Throw a <plaintext> tag anywhere in your document and the rest will get eaten. :)  Try it in the next blog comment you write...

> Seems OK. r=me despite the concerns above

Thanks.  Will post a new patch with corrections.  I'm not super-proud of this patch.  It's an attempt to bring more sanity to teh parser system, but I may be over-engineering the problem.  Really I just don't understand enough of the parser/loader interactions yet to make the perfect parser API, so this is an attempt at an incremental step towards tomorrow.
Comment 10 Eric Seidel (no email) 2010-06-26 19:13:49 PDT
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#plaintext-state
Note how there is no way to leave the state.  Even a </plaintext> tag would be emitted as character tokens from the tokenizer. :)
Comment 11 Eric Seidel (no email) 2010-06-26 19:25:37 PDT
One advantage to just setting the HTMLTokenizer in the PlainText state and using that for the TextDocumentParser is that then we'd get all the \n\r handling for free, as well as any possible scripting needs.  It wouldn't automatically get yielding, since yielding is done per-token currently.  But we could add some yielding at the character level and then it would yield.
Comment 12 Eric Seidel (no email) 2010-06-26 19:40:13 PDT
Created attachment 59845 [details]
Patch for landing
Comment 13 WebKit Commit Bot 2010-06-26 20:20:15 PDT
Comment on attachment 59845 [details]
Patch for landing

Rejecting patch 59845 from commit-queue.

Unexpected failure when processing patch!  Please file a bug against webkit-patch.
Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', 'land-attachment', '--force-clean', '--build', '--non-interactive', '--ignore-builders', '--build-style=both', '--quiet', 59845, '--test', '--parent-command=commit-queue', '--no-update']" exit_code: 1
Last 500 characters of output:
date_status
    return NetworkTransaction().run(lambda: self._post_status_to_server(queue_name, status, patch, results_file))
  File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 56, in run
    self._check_for_timeout()
  File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 63, in _check_for_timeout
    raise NetworkTimeout()
webkitpy.common.net.networktransaction.NetworkTimeout
Comment 14 Darin Adler 2010-06-27 09:55:14 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > I'm not sure about the name "scriptable document parser". What does "scriptable" mean here? I think it's a bug that scripts don't work in image documents, for example, so we probably should not be enshrining that bug in our design or class names. If that's what scriptable means.
> 
> Agreed.  I'm not exactly sure how you would run a script in an image document however.  The user typing a javascript: url?

A JavaScript URL is one way, or someone could insert additional elements using JavaScript running in another frame with the same origin. Those elements could contain attributes with script in them. There may be other ways.
Comment 15 Eric Seidel (no email) 2010-06-27 22:40:41 PDT
Created attachment 59873 [details]
Patch for landing
Comment 16 WebKit Commit Bot 2010-06-27 23:21:13 PDT
Comment on attachment 59873 [details]
Patch for landing

Rejecting patch 59873 from commit-queue.

Unexpected failure when processing patch!  Please file a bug against webkit-patch.
Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=webkit-commit-queue.appspot.com', 'land-attachment', '--force-clean', '--build', '--non-interactive', '--ignore-builders', '--build-style=both', '--quiet', 59873, '--test', '--parent-command=commit-queue', '--no-update']" exit_code: 1
Last 500 characters of output:
date_status
    return NetworkTransaction().run(lambda: self._post_status_to_server(queue_name, status, patch, results_file))
  File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 56, in run
    self._check_for_timeout()
  File "/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/webkitpy/common/net/networktransaction.py", line 63, in _check_for_timeout
    raise NetworkTimeout()
webkitpy.common.net.networktransaction.NetworkTimeout
Comment 17 Eric Seidel (no email) 2010-06-28 00:33:43 PDT
Created attachment 59878 [details]
Patch for landing
Comment 18 WebKit Commit Bot 2010-06-28 00:56:32 PDT
Comment on attachment 59878 [details]
Patch for landing

Clearing flags on attachment: 59878

Committed r61985: <http://trac.webkit.org/changeset/61985>
Comment 19 WebKit Commit Bot 2010-06-28 00:56:39 PDT
All reviewed patches have been landed.  Closing bug.
Comment 20 WebKit Review Bot 2010-06-28 01:09:02 PDT
http://trac.webkit.org/changeset/61985 might have broken Chromium Linux Release
Comment 21 Eric Seidel (no email) 2010-06-28 01:47:21 PDT
Committed r61987: <http://trac.webkit.org/changeset/61987>