Bug 41344

Summary: HTMLTokenizer needs EndOfFile support
Product: WebKit Reporter: Eric Seidel (no email) <eric>
Component: New BugsAssignee: Eric Seidel (no email) <eric>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, eric, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Other   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 41123    
Attachments:
Description Flags
work in progress
none
Patch
none
Patch for landing eric: commit-queue+

Eric Seidel (no email)
Reported 2010-06-29 05:14:16 PDT
HTMLTokenizer needs EndOfFile support
Attachments
work in progress (34.42 KB, patch)
2010-06-29 05:16 PDT, Eric Seidel (no email)
no flags
Patch (35.22 KB, patch)
2010-06-30 00:36 PDT, Eric Seidel (no email)
no flags
Patch for landing (35.87 KB, patch)
2010-06-30 00:59 PDT, Eric Seidel (no email)
eric: commit-queue+
Eric Seidel (no email)
Comment 1 2010-06-29 05:16:43 PDT
Created attachment 60011 [details] work in progress
Adam Barth
Comment 2 2010-06-29 09:55:50 PDT
Please register with master bug.
Adam Barth
Comment 3 2010-06-29 09:59:38 PDT
Comment on attachment 60011 [details] work in progress Driveby comments WebCore/html/HTMLToken.h:65 + // Should we ASSERT(m_type == Uninitialized) here? Yes. EOF is a separate token. WebCore/html/HTMLToken.h:67 + // in HTMLTokenizer before calling makeEndOfFile(). That's correct. Those started tokens are never emitted. WebCore/html/HTMLTokenizer.cpp:43 + const UChar endOfFile = 0; Lame that this happens twice. WebCore/html/HTMLTokenizer.cpp:189 + if (m_token->type() == HTMLToken::Character) \ You should be calling shouldEmitBufferedCharacterWhatever here. WebCore/html/HTMLTokenizer.cpp:278 + if (m_token->type() == HTMLToken::Character) { You should use shouldEmitBufferedCharacter whatever
Eric Seidel (no email)
Comment 4 2010-06-29 12:04:54 PDT
(In reply to comment #3) > WebCore/html/HTMLToken.h:65 > + // Should we ASSERT(m_type == Uninitialized) here? > Yes. EOF is a separate token. Well, then we should just call clear() right before makeEndOfFile() or as part of makeEndOfFile(). > WebCore/html/HTMLToken.h:67 > + // in HTMLTokenizer before calling makeEndOfFile(). > That's correct. Those started tokens are never emitted. How does a document of "<scr" end up emitting <scr? > WebCore/html/HTMLTokenizer.cpp:43 > + const UChar endOfFile = 0; > Lame that this happens twice. Agreed. Plan to clean up in a real patch. > WebCore/html/HTMLTokenizer.cpp:189 > + if (m_token->type() == HTMLToken::Character) \ > You should be calling shouldEmitBufferedCharacterWhatever here. > > WebCore/html/HTMLTokenizer.cpp:278 > + if (m_token->type() == HTMLToken::Character) { > You should use shouldEmitBufferedCharacter whatever This was copied from the if directly above: BEGIN_STATE(DataState) { if (cc == '&') ADVANCE_TO(CharacterReferenceInDataState); else if (cc == '<') { if (m_token->type() == HTMLToken::Character) { // We have a bunch of character tokens queued up that we // are emitting lazily here. return true; } ADVANCE_TO(TagOpenState); } Should that use shouldEmitBufferedCharacterToken? And would it look like: if (shouldEmitBufferedCharacterToken()) return true;?
Adam Barth
Comment 5 2010-06-29 13:34:53 PDT
> This was copied from the if directly above: > BEGIN_STATE(DataState) { > if (cc == '&') > ADVANCE_TO(CharacterReferenceInDataState); > else if (cc == '<') { > if (m_token->type() == HTMLToken::Character) { > // We have a bunch of character tokens queued up that we > // are emitting lazily here. > return true; > } > ADVANCE_TO(TagOpenState); > } > > Should that use shouldEmitBufferedCharacterToken? And would it look like: That branch is special because it's the place where we emit the buffered character tokens for an *open* input stream. shouldEmitBufferedCharacterToken is how we know to emit the last character token in the input stream (because we're not going to get a '<' to cause us to flush).
Eric Seidel (no email)
Comment 6 2010-06-29 16:47:34 PDT
I think you and I will need to discuss this patch more in person.
Adam Barth
Comment 7 2010-06-29 17:20:33 PDT
Comment on attachment 60011 [details] work in progress WebCore/html/HTMLDocumentParser.cpp:295 + m_input.appendToEnd(String(&endOfFile, 1)); Sure, we can discuss more in person. I don't understand how this patch works at app. Won't this put a null character into the decoded character stream before NULLs are replaced with REPLACEMENT CHARACTER? I'm not sure what's the best approach threading the EOF character through the tokenizer's machine.
Eric Seidel (no email)
Comment 8 2010-06-30 00:36:26 PDT
Adam Barth
Comment 9 2010-06-30 00:50:45 PDT
Comment on attachment 60093 [details] Patch Might be a slight hack, but ok. :) cq- for adding the assert to emitCharacter. WebCore/html/HTMLTokenizer.cpp:187 + #define EMIT_END_OF_FILE() \ This should be a function.
Eric Seidel (no email)
Comment 10 2010-06-30 00:59:09 PDT
Created attachment 60094 [details] Patch for landing
Eric Seidel (no email)
Comment 11 2010-06-30 01:28:54 PDT
WebKit Review Bot
Comment 12 2010-06-30 01:54:46 PDT
Eric Seidel (no email)
Comment 13 2010-06-30 02:14:42 PDT
Eric Seidel (no email)
Comment 14 2010-06-30 02:46:21 PDT
Note You need to log in before you can comment on or make changes to this bug.