WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
65000
Refactor HTML tokenizer code in preparation for a new XML tokenizer
https://bugs.webkit.org/show_bug.cgi?id=65000
Summary
Refactor HTML tokenizer code in preparation for a new XML tokenizer
Vicki Pfau
Reported
2011-07-21 17:41:04 PDT
Refactor HTML tokenizer code in preparation for a new XML tokenizer
Attachments
Patch
(116.66 KB, patch)
2011-07-21 17:50 PDT
,
Vicki Pfau
no flags
Details
Formatted Diff
Diff
Patch
(119.74 KB, patch)
2011-07-22 14:25 PDT
,
Vicki Pfau
no flags
Details
Formatted Diff
Diff
Patch
(119.79 KB, patch)
2011-07-22 18:12 PDT
,
Vicki Pfau
no flags
Details
Formatted Diff
Diff
Patch
(119.80 KB, patch)
2011-07-22 18:54 PDT
,
Vicki Pfau
no flags
Details
Formatted Diff
Diff
Show Obsolete
(3)
View All
Add attachment
proposed patch, testcase, etc.
Vicki Pfau
Comment 1
2011-07-21 17:50:26 PDT
Created
attachment 101678
[details]
Patch
WebKit Review Bot
Comment 2
2011-07-21 17:53:26 PDT
Attachment 101678
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:217: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 3
2011-07-21 18:05:29 PDT
Comment on
attachment 101678
[details]
Patch Spoke with jpfau in #webkit a bit about this change. This seems like a good direction. The main question revolves around whether we want InputStreamPreprocessor for XML.
Adam Barth
Comment 4
2011-07-21 18:06:31 PDT
Comment on
attachment 101678
[details]
Patch Can you include before/after results from the HTML parser benchmark? The tokenizer perf is very sensitive to subtle changes.
WebKit Review Bot
Comment 5
2011-07-21 18:11:57 PDT
Comment on
attachment 101678
[details]
Patch
Attachment 101678
[details]
did not pass mac-ews (mac): Output:
http://queues.webkit.org/results/9192900
WebKit Review Bot
Comment 6
2011-07-21 18:35:33 PDT
Comment on
attachment 101678
[details]
Patch
Attachment 101678
[details]
did not pass chromium-ews (chromium-xvfb): Output:
http://queues.webkit.org/results/9192910
WebKit Review Bot
Comment 7
2011-07-21 18:37:25 PDT
Comment on
attachment 101678
[details]
Patch
Attachment 101678
[details]
did not pass cr-mac-ews (chromium): Output:
http://queues.webkit.org/results/9201974
Adam Barth
Comment 8
2011-07-22 11:11:57 PDT
Comment on
attachment 101678
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=101678&action=review
> Source/WebCore/xml/parser/MarkupTokenizerBase.h:155 > + // This method can get hidden in subclasses
Can we make these methods ASSERT_NOT_REACHED instead? I'm not quite sure why we have them.
Adam Barth
Comment 9
2011-07-22 11:12:55 PDT
Comment on
attachment 101678
[details]
Patch Generally, this looks good. I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring.
Vicki Pfau
Comment 10
2011-07-22 14:20:15 PDT
(In reply to
comment #8
)
> (From update of
attachment 101678
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=101678&action=review
> > > Source/WebCore/xml/parser/MarkupTokenizerBase.h:155 > > + // This method can get hidden in subclasses > > Can we make these methods ASSERT_NOT_REACHED instead? I'm not quite sure why we have them.
These functions contain shared code that is used by the HTML tokenizer as well as the XML tokenizer. The base functions won't get hidden by the XML tokenizer because it doesn't need to worry about saving the open tag name or any special stuff that HTML does. (In reply to
comment #9
)
> (From update of
attachment 101678
[details]
) > Generally, this looks good. I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring.
The refactored code seems to cause about a 2.5-3.5% performance hit to the parser. Before the refactor, the HTML parser performance test gave me results that were typically like this: avg 1638.45 median 1633.5 stdev 16.719674039884868 min 1616 max 1684 And after the refractor, the numbers typically looked like this: avg 1690.95 median 1690 stdev 10.086996579755542 min 1673 max 1709 The input preprocessor is also reused because XML doesn't accept null bytes and does CRLF folding.
Vicki Pfau
Comment 11
2011-07-22 14:25:08 PDT
Created
attachment 101765
[details]
Patch
WebKit Review Bot
Comment 12
2011-07-22 14:28:22 PDT
Attachment 101765
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:210: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 13
2011-07-22 14:38:32 PDT
Can we avoid the 3% regression? Where is it coming from?
Vicki Pfau
Comment 14
2011-07-22 18:12:04 PDT
Created
attachment 101799
[details]
Patch
WebKit Review Bot
Comment 15
2011-07-22 18:14:29 PDT
Attachment 101799
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:221: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 16
2011-07-22 18:15:15 PDT
Comment on
attachment 101799
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=101799&action=review
> Source/WebCore/html/parser/HTMLTokenizer.cpp:110 > + && (m_state == HTMLTokenizerState::DataState > + || m_state == HTMLTokenizerState::RCDATAState > + || m_state == HTMLTokenizerState::RAWTEXTState > + || m_state == HTMLTokenizerState::PLAINTEXTState);
Bad indent
Vicki Pfau
Comment 17
2011-07-22 18:54:52 PDT
Created
attachment 101802
[details]
Patch
WebKit Review Bot
Comment 18
2011-07-22 18:58:13 PDT
Attachment 101802
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:221: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 19
2011-07-24 03:07:43 PDT
Comment on
attachment 101802
[details]
Patch Clearing flags on attachment: 101802 Committed
r91643
: <
http://trac.webkit.org/changeset/91643
>
WebKit Review Bot
Comment 20
2011-07-24 03:07:49 PDT
All reviewed patches have been landed. Closing bug.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug