WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
41344
HTMLTokenizer needs EndOfFile support
https://bugs.webkit.org/show_bug.cgi?id=41344
Summary
HTMLTokenizer needs EndOfFile support
Eric Seidel (no email)
Reported
2010-06-29 05:14:16 PDT
HTMLTokenizer needs EndOfFile support
Attachments
work in progress
(34.42 KB, patch)
2010-06-29 05:16 PDT
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch
(35.22 KB, patch)
2010-06-30 00:36 PDT
,
Eric Seidel (no email)
no flags
Details
Formatted Diff
Diff
Patch for landing
(35.87 KB, patch)
2010-06-30 00:59 PDT
,
Eric Seidel (no email)
eric
: commit-queue+
Details
Formatted Diff
Diff
Show Obsolete
(2)
View All
Add attachment
proposed patch, testcase, etc.
Eric Seidel (no email)
Comment 1
2010-06-29 05:16:43 PDT
Created
attachment 60011
[details]
work in progress
Adam Barth
Comment 2
2010-06-29 09:55:50 PDT
Please register with master bug.
Adam Barth
Comment 3
2010-06-29 09:59:38 PDT
Comment on
attachment 60011
[details]
work in progress Driveby comments WebCore/html/HTMLToken.h:65 + // Should we ASSERT(m_type == Uninitialized) here? Yes. EOF is a separate token. WebCore/html/HTMLToken.h:67 + // in HTMLTokenizer before calling makeEndOfFile(). That's correct. Those started tokens are never emitted. WebCore/html/HTMLTokenizer.cpp:43 + const UChar endOfFile = 0; Lame that this happens twice. WebCore/html/HTMLTokenizer.cpp:189 + if (m_token->type() == HTMLToken::Character) \ You should be calling shouldEmitBufferedCharacterWhatever here. WebCore/html/HTMLTokenizer.cpp:278 + if (m_token->type() == HTMLToken::Character) { You should use shouldEmitBufferedCharacter whatever
Eric Seidel (no email)
Comment 4
2010-06-29 12:04:54 PDT
(In reply to
comment #3
)
> WebCore/html/HTMLToken.h:65 > + // Should we ASSERT(m_type == Uninitialized) here? > Yes. EOF is a separate token.
Well, then we should just call clear() right before makeEndOfFile() or as part of makeEndOfFile().
> WebCore/html/HTMLToken.h:67 > + // in HTMLTokenizer before calling makeEndOfFile(). > That's correct. Those started tokens are never emitted.
How does a document of "<scr" end up emitting <scr?
> WebCore/html/HTMLTokenizer.cpp:43 > + const UChar endOfFile = 0; > Lame that this happens twice.
Agreed. Plan to clean up in a real patch.
> WebCore/html/HTMLTokenizer.cpp:189 > + if (m_token->type() == HTMLToken::Character) \ > You should be calling shouldEmitBufferedCharacterWhatever here. > > WebCore/html/HTMLTokenizer.cpp:278 > + if (m_token->type() == HTMLToken::Character) { > You should use shouldEmitBufferedCharacter whatever
This was copied from the if directly above: BEGIN_STATE(DataState) { if (cc == '&') ADVANCE_TO(CharacterReferenceInDataState); else if (cc == '<') { if (m_token->type() == HTMLToken::Character) { // We have a bunch of character tokens queued up that we // are emitting lazily here. return true; } ADVANCE_TO(TagOpenState); } Should that use shouldEmitBufferedCharacterToken? And would it look like: if (shouldEmitBufferedCharacterToken()) return true;?
Adam Barth
Comment 5
2010-06-29 13:34:53 PDT
> This was copied from the if directly above: > BEGIN_STATE(DataState) { > if (cc == '&') > ADVANCE_TO(CharacterReferenceInDataState); > else if (cc == '<') { > if (m_token->type() == HTMLToken::Character) { > // We have a bunch of character tokens queued up that we > // are emitting lazily here. > return true; > } > ADVANCE_TO(TagOpenState); > } > > Should that use shouldEmitBufferedCharacterToken? And would it look like:
That branch is special because it's the place where we emit the buffered character tokens for an *open* input stream. shouldEmitBufferedCharacterToken is how we know to emit the last character token in the input stream (because we're not going to get a '<' to cause us to flush).
Eric Seidel (no email)
Comment 6
2010-06-29 16:47:34 PDT
I think you and I will need to discuss this patch more in person.
Adam Barth
Comment 7
2010-06-29 17:20:33 PDT
Comment on
attachment 60011
[details]
work in progress WebCore/html/HTMLDocumentParser.cpp:295 + m_input.appendToEnd(String(&endOfFile, 1)); Sure, we can discuss more in person. I don't understand how this patch works at app. Won't this put a null character into the decoded character stream before NULLs are replaced with REPLACEMENT CHARACTER? I'm not sure what's the best approach threading the EOF character through the tokenizer's machine.
Eric Seidel (no email)
Comment 8
2010-06-30 00:36:26 PDT
Created
attachment 60093
[details]
Patch
Adam Barth
Comment 9
2010-06-30 00:50:45 PDT
Comment on
attachment 60093
[details]
Patch Might be a slight hack, but ok. :) cq- for adding the assert to emitCharacter. WebCore/html/HTMLTokenizer.cpp:187 + #define EMIT_END_OF_FILE() \ This should be a function.
Eric Seidel (no email)
Comment 10
2010-06-30 00:59:09 PDT
Created
attachment 60094
[details]
Patch for landing
Eric Seidel (no email)
Comment 11
2010-06-30 01:28:54 PDT
Committed
r62168
: <
http://trac.webkit.org/changeset/62168
>
WebKit Review Bot
Comment 12
2010-06-30 01:54:46 PDT
http://trac.webkit.org/changeset/62168
might have broken Qt Linux Release The following changes are on the blame list:
http://trac.webkit.org/changeset/62168
http://trac.webkit.org/changeset/62166
http://trac.webkit.org/changeset/62167
Eric Seidel (no email)
Comment 13
2010-06-30 02:14:42 PDT
Committed
r62172
: <
http://trac.webkit.org/changeset/62172
>
Eric Seidel (no email)
Comment 14
2010-06-30 02:46:21 PDT
Committed
r62173
: <
http://trac.webkit.org/changeset/62173
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug