HTMLTokenizer needs EndOfFile support
Created attachment 60011 [details] work in progress
Please register with master bug.
Comment on attachment 60011 [details] work in progress Driveby comments WebCore/html/HTMLToken.h:65 + // Should we ASSERT(m_type == Uninitialized) here? Yes. EOF is a separate token. WebCore/html/HTMLToken.h:67 + // in HTMLTokenizer before calling makeEndOfFile(). That's correct. Those started tokens are never emitted. WebCore/html/HTMLTokenizer.cpp:43 + const UChar endOfFile = 0; Lame that this happens twice. WebCore/html/HTMLTokenizer.cpp:189 + if (m_token->type() == HTMLToken::Character) \ You should be calling shouldEmitBufferedCharacterWhatever here. WebCore/html/HTMLTokenizer.cpp:278 + if (m_token->type() == HTMLToken::Character) { You should use shouldEmitBufferedCharacter whatever
(In reply to comment #3) > WebCore/html/HTMLToken.h:65 > + // Should we ASSERT(m_type == Uninitialized) here? > Yes. EOF is a separate token. Well, then we should just call clear() right before makeEndOfFile() or as part of makeEndOfFile(). > WebCore/html/HTMLToken.h:67 > + // in HTMLTokenizer before calling makeEndOfFile(). > That's correct. Those started tokens are never emitted. How does a document of "<scr" end up emitting <scr? > WebCore/html/HTMLTokenizer.cpp:43 > + const UChar endOfFile = 0; > Lame that this happens twice. Agreed. Plan to clean up in a real patch. > WebCore/html/HTMLTokenizer.cpp:189 > + if (m_token->type() == HTMLToken::Character) \ > You should be calling shouldEmitBufferedCharacterWhatever here. > > WebCore/html/HTMLTokenizer.cpp:278 > + if (m_token->type() == HTMLToken::Character) { > You should use shouldEmitBufferedCharacter whatever This was copied from the if directly above: BEGIN_STATE(DataState) { if (cc == '&') ADVANCE_TO(CharacterReferenceInDataState); else if (cc == '<') { if (m_token->type() == HTMLToken::Character) { // We have a bunch of character tokens queued up that we // are emitting lazily here. return true; } ADVANCE_TO(TagOpenState); } Should that use shouldEmitBufferedCharacterToken? And would it look like: if (shouldEmitBufferedCharacterToken()) return true;?
> This was copied from the if directly above: > BEGIN_STATE(DataState) { > if (cc == '&') > ADVANCE_TO(CharacterReferenceInDataState); > else if (cc == '<') { > if (m_token->type() == HTMLToken::Character) { > // We have a bunch of character tokens queued up that we > // are emitting lazily here. > return true; > } > ADVANCE_TO(TagOpenState); > } > > Should that use shouldEmitBufferedCharacterToken? And would it look like: That branch is special because it's the place where we emit the buffered character tokens for an *open* input stream. shouldEmitBufferedCharacterToken is how we know to emit the last character token in the input stream (because we're not going to get a '<' to cause us to flush).
I think you and I will need to discuss this patch more in person.
Comment on attachment 60011 [details] work in progress WebCore/html/HTMLDocumentParser.cpp:295 + m_input.appendToEnd(String(&endOfFile, 1)); Sure, we can discuss more in person. I don't understand how this patch works at app. Won't this put a null character into the decoded character stream before NULLs are replaced with REPLACEMENT CHARACTER? I'm not sure what's the best approach threading the EOF character through the tokenizer's machine.
Created attachment 60093 [details] Patch
Comment on attachment 60093 [details] Patch Might be a slight hack, but ok. :) cq- for adding the assert to emitCharacter. WebCore/html/HTMLTokenizer.cpp:187 + #define EMIT_END_OF_FILE() \ This should be a function.
Created attachment 60094 [details] Patch for landing
Committed r62168: <http://trac.webkit.org/changeset/62168>
http://trac.webkit.org/changeset/62168 might have broken Qt Linux Release The following changes are on the blame list: http://trac.webkit.org/changeset/62168 http://trac.webkit.org/changeset/62166 http://trac.webkit.org/changeset/62167
Committed r62172: <http://trac.webkit.org/changeset/62172>
Committed r62173: <http://trac.webkit.org/changeset/62173>