Bug 41344

Summary:

HTMLTokenizer needs EndOfFile support

Product:

WebKit

Reporter:

Eric Seidel (no email) <eric>

Component:

New Bugs

Assignee:

Eric Seidel (no email) <eric>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

abarth, eric, webkit.review.bot

Priority:

Version:

528+ (Nightly build)

Hardware:

Other

OS:

OS X 10.5

Bug Depends on:

Bug Blocks:

41123

Attachments:

Description	Flags
work in progress	none
Patch	none
Patch for landing	eric: commit-queue+

Eric Seidel (no email)

Reported 2010-06-29 05:14:16 PDT

HTMLTokenizer needs EndOfFile support

Attachments
work in progress (34.42 KB, patch) 2010-06-29 05:16 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch (35.22 KB, patch) 2010-06-30 00:36 PDT, Eric Seidel (no email)	no flags	Details Formatted Diff Diff
Patch for landing (35.87 KB, patch) 2010-06-30 00:59 PDT, Eric Seidel (no email)	eric: commit-queue+	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.

Eric Seidel (no email)

Comment 1 2010-06-29 05:16:43 PDT

Created attachment 60011 [details] work in progress

Adam Barth

Comment 2 2010-06-29 09:55:50 PDT

Please register with master bug.

Adam Barth

Comment 3 2010-06-29 09:59:38 PDT

Comment on attachment 60011 [details] work in progress Driveby comments WebCore/html/HTMLToken.h:65 + // Should we ASSERT(m_type == Uninitialized) here? Yes. EOF is a separate token. WebCore/html/HTMLToken.h:67 + // in HTMLTokenizer before calling makeEndOfFile(). That's correct. Those started tokens are never emitted. WebCore/html/HTMLTokenizer.cpp:43 + const UChar endOfFile = 0; Lame that this happens twice. WebCore/html/HTMLTokenizer.cpp:189 + if (m_token->type() == HTMLToken::Character) \ You should be calling shouldEmitBufferedCharacterWhatever here. WebCore/html/HTMLTokenizer.cpp:278 + if (m_token->type() == HTMLToken::Character) { You should use shouldEmitBufferedCharacter whatever

Eric Seidel (no email)

Comment 4 2010-06-29 12:04:54 PDT

(In reply to comment #3) > WebCore/html/HTMLToken.h:65 > + // Should we ASSERT(m_type == Uninitialized) here? > Yes. EOF is a separate token. Well, then we should just call clear() right before makeEndOfFile() or as part of makeEndOfFile(). > WebCore/html/HTMLToken.h:67 > + // in HTMLTokenizer before calling makeEndOfFile(). > That's correct. Those started tokens are never emitted. How does a document of "<scr" end up emitting <scr? > WebCore/html/HTMLTokenizer.cpp:43 > + const UChar endOfFile = 0; > Lame that this happens twice. Agreed. Plan to clean up in a real patch. > WebCore/html/HTMLTokenizer.cpp:189 > + if (m_token->type() == HTMLToken::Character) \ > You should be calling shouldEmitBufferedCharacterWhatever here. > > WebCore/html/HTMLTokenizer.cpp:278 > + if (m_token->type() == HTMLToken::Character) { > You should use shouldEmitBufferedCharacter whatever This was copied from the if directly above: BEGIN_STATE(DataState) { if (cc == '&') ADVANCE_TO(CharacterReferenceInDataState); else if (cc == '<') { if (m_token->type() == HTMLToken::Character) { // We have a bunch of character tokens queued up that we // are emitting lazily here. return true; } ADVANCE_TO(TagOpenState); } Should that use shouldEmitBufferedCharacterToken? And would it look like: if (shouldEmitBufferedCharacterToken()) return true;?

Adam Barth

Comment 5 2010-06-29 13:34:53 PDT

> This was copied from the if directly above: > BEGIN_STATE(DataState) { > if (cc == '&') > ADVANCE_TO(CharacterReferenceInDataState); > else if (cc == '<') { > if (m_token->type() == HTMLToken::Character) { > // We have a bunch of character tokens queued up that we > // are emitting lazily here. > return true; > } > ADVANCE_TO(TagOpenState); > } > > Should that use shouldEmitBufferedCharacterToken? And would it look like: That branch is special because it's the place where we emit the buffered character tokens for an *open* input stream. shouldEmitBufferedCharacterToken is how we know to emit the last character token in the input stream (because we're not going to get a '<' to cause us to flush).

Eric Seidel (no email)

Comment 6 2010-06-29 16:47:34 PDT

I think you and I will need to discuss this patch more in person.

Adam Barth

Comment 7 2010-06-29 17:20:33 PDT

Comment on attachment 60011 [details] work in progress WebCore/html/HTMLDocumentParser.cpp:295 + m_input.appendToEnd(String(&endOfFile, 1)); Sure, we can discuss more in person. I don't understand how this patch works at app. Won't this put a null character into the decoded character stream before NULLs are replaced with REPLACEMENT CHARACTER? I'm not sure what's the best approach threading the EOF character through the tokenizer's machine.

Eric Seidel (no email)

Comment 8 2010-06-30 00:36:26 PDT

Created attachment 60093 [details] Patch

Adam Barth

Comment 9 2010-06-30 00:50:45 PDT

Comment on attachment 60093 [details] Patch Might be a slight hack, but ok. :) cq- for adding the assert to emitCharacter. WebCore/html/HTMLTokenizer.cpp:187 + #define EMIT_END_OF_FILE() \ This should be a function.

Eric Seidel (no email)

Comment 10 2010-06-30 00:59:09 PDT

Created attachment 60094 [details] Patch for landing

Eric Seidel (no email)

Comment 11 2010-06-30 01:28:54 PDT

Committed r62168: <http://trac.webkit.org/changeset/62168>

WebKit Review Bot

Comment 12 2010-06-30 01:54:46 PDT

http://trac.webkit.org/changeset/62168 might have broken Qt Linux Release The following changes are on the blame list: http://trac.webkit.org/changeset/62168 http://trac.webkit.org/changeset/62166 http://trac.webkit.org/changeset/62167

Eric Seidel (no email)

Comment 13 2010-06-30 02:14:42 PDT

Committed r62172: <http://trac.webkit.org/changeset/62172>

Eric Seidel (no email)

Comment 14 2010-06-30 02:46:21 PDT

Committed r62173: <http://trac.webkit.org/changeset/62173>

Note You need to log in before you can comment on or make changes to this bug.