Bug 41344

Summary: HTMLTokenizer needs EndOfFile support
Product: WebKit Reporter: Eric Seidel (no email) <eric>
Component: New BugsAssignee: Eric Seidel (no email) <eric>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, eric, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Other   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 41123    
Attachments:
Description Flags
work in progress
none
Patch
none
Patch for landing eric: commit-queue+

Description Eric Seidel (no email) 2010-06-29 05:14:16 PDT
HTMLTokenizer needs EndOfFile support
Comment 1 Eric Seidel (no email) 2010-06-29 05:16:43 PDT
Created attachment 60011 [details]
work in progress
Comment 2 Adam Barth 2010-06-29 09:55:50 PDT
Please register with master bug.
Comment 3 Adam Barth 2010-06-29 09:59:38 PDT
Comment on attachment 60011 [details]
work in progress

Driveby comments

WebCore/html/HTMLToken.h:65
 +          // Should we ASSERT(m_type == Uninitialized) here?
Yes.  EOF is a separate token.

WebCore/html/HTMLToken.h:67
 +          // in HTMLTokenizer before calling makeEndOfFile().
That's correct.  Those started tokens are never emitted.

WebCore/html/HTMLTokenizer.cpp:43
 +  const UChar endOfFile = 0;
Lame that this happens twice.

WebCore/html/HTMLTokenizer.cpp:189
 +          if (m_token->type() == HTMLToken::Character)                       \
You should be calling shouldEmitBufferedCharacterWhatever here.

WebCore/html/HTMLTokenizer.cpp:278
 +              if (m_token->type() == HTMLToken::Character) {
You should use shouldEmitBufferedCharacter whatever
Comment 4 Eric Seidel (no email) 2010-06-29 12:04:54 PDT
(In reply to comment #3)
> WebCore/html/HTMLToken.h:65
>  +          // Should we ASSERT(m_type == Uninitialized) here?
> Yes.  EOF is a separate token.

Well, then we should just call clear() right before makeEndOfFile() or as part of makeEndOfFile().

> WebCore/html/HTMLToken.h:67
>  +          // in HTMLTokenizer before calling makeEndOfFile().
> That's correct.  Those started tokens are never emitted.

How does a document of "<scr" end up emitting <scr?

> WebCore/html/HTMLTokenizer.cpp:43
>  +  const UChar endOfFile = 0;
> Lame that this happens twice.

Agreed.  Plan to clean up in a real patch.

> WebCore/html/HTMLTokenizer.cpp:189
>  +          if (m_token->type() == HTMLToken::Character)                       \
> You should be calling shouldEmitBufferedCharacterWhatever here.
> 
> WebCore/html/HTMLTokenizer.cpp:278
>  +              if (m_token->type() == HTMLToken::Character) {
> You should use shouldEmitBufferedCharacter whatever

This was copied from the if directly above:
    BEGIN_STATE(DataState) {
        if (cc == '&')
            ADVANCE_TO(CharacterReferenceInDataState);
        else if (cc == '<') {
            if (m_token->type() == HTMLToken::Character) {
                // We have a bunch of character tokens queued up that we
                // are emitting lazily here.
                return true;
            }
            ADVANCE_TO(TagOpenState);
        } 

Should that use shouldEmitBufferedCharacterToken?  And would it look like:

if (shouldEmitBufferedCharacterToken())
    return true;?
Comment 5 Adam Barth 2010-06-29 13:34:53 PDT
> This was copied from the if directly above:
>     BEGIN_STATE(DataState) {
>         if (cc == '&')
>             ADVANCE_TO(CharacterReferenceInDataState);
>         else if (cc == '<') {
>             if (m_token->type() == HTMLToken::Character) {
>                 // We have a bunch of character tokens queued up that we
>                 // are emitting lazily here.
>                 return true;
>             }
>             ADVANCE_TO(TagOpenState);
>         } 
> 
> Should that use shouldEmitBufferedCharacterToken?  And would it look like:

That branch is special because it's the place where we emit the buffered character tokens for an *open* input stream.  shouldEmitBufferedCharacterToken is how we know to emit the last character token in the input stream (because we're not going to get a '<' to cause us to flush).
Comment 6 Eric Seidel (no email) 2010-06-29 16:47:34 PDT
I think you and I will need to discuss this patch more in person.
Comment 7 Adam Barth 2010-06-29 17:20:33 PDT
Comment on attachment 60011 [details]
work in progress

WebCore/html/HTMLDocumentParser.cpp:295
 +      m_input.appendToEnd(String(&endOfFile, 1));
Sure, we can discuss more in person.  I don't understand how this patch works at app.  Won't this put a null character into the decoded character stream before NULLs are replaced with REPLACEMENT CHARACTER?  I'm not sure what's the best approach threading the EOF character through the tokenizer's machine.
Comment 8 Eric Seidel (no email) 2010-06-30 00:36:26 PDT
Created attachment 60093 [details]
Patch
Comment 9 Adam Barth 2010-06-30 00:50:45 PDT
Comment on attachment 60093 [details]
Patch

Might be a slight hack, but ok.  :)

cq- for adding the assert to emitCharacter.

WebCore/html/HTMLTokenizer.cpp:187
 +  #define EMIT_END_OF_FILE()                                                 \
This should be a function.
Comment 10 Eric Seidel (no email) 2010-06-30 00:59:09 PDT
Created attachment 60094 [details]
Patch for landing
Comment 11 Eric Seidel (no email) 2010-06-30 01:28:54 PDT
Committed r62168: <http://trac.webkit.org/changeset/62168>
Comment 12 WebKit Review Bot 2010-06-30 01:54:46 PDT
http://trac.webkit.org/changeset/62168 might have broken Qt Linux Release
The following changes are on the blame list:
http://trac.webkit.org/changeset/62168
http://trac.webkit.org/changeset/62166
http://trac.webkit.org/changeset/62167
Comment 13 Eric Seidel (no email) 2010-06-30 02:14:42 PDT
Committed r62172: <http://trac.webkit.org/changeset/62172>
Comment 14 Eric Seidel (no email) 2010-06-30 02:46:21 PDT
Committed r62173: <http://trac.webkit.org/changeset/62173>