Bug 65000 - Refactor HTML tokenizer code in preparation for a new XML tokenizer
Summary: Refactor HTML tokenizer code in preparation for a new XML tokenizer
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Vicki Pfau
URL:
Keywords:
Depends on:
Blocks: 64566
  Show dependency treegraph
 
Reported: 2011-07-21 17:41 PDT by Vicki Pfau
Modified: 2011-07-24 03:07 PDT (History)
3 users (show)

See Also:


Attachments
Patch (116.66 KB, patch)
2011-07-21 17:50 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (119.74 KB, patch)
2011-07-22 14:25 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (119.79 KB, patch)
2011-07-22 18:12 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (119.80 KB, patch)
2011-07-22 18:54 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Vicki Pfau 2011-07-21 17:41:04 PDT
Refactor HTML tokenizer code in preparation for a new XML tokenizer
Comment 1 Vicki Pfau 2011-07-21 17:50:26 PDT
Created attachment 101678 [details]
Patch
Comment 2 WebKit Review Bot 2011-07-21 17:53:26 PDT
Attachment 101678 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/html/parser/HTMLTokenizer.cpp:217:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 3 Adam Barth 2011-07-21 18:05:29 PDT
Comment on attachment 101678 [details]
Patch

Spoke with jpfau in #webkit a bit about this change.  This seems like a good direction.  The main question revolves around whether we want InputStreamPreprocessor for XML.
Comment 4 Adam Barth 2011-07-21 18:06:31 PDT
Comment on attachment 101678 [details]
Patch

Can you include before/after results from the HTML parser benchmark?  The tokenizer perf is very sensitive to subtle changes.
Comment 5 WebKit Review Bot 2011-07-21 18:11:57 PDT
Comment on attachment 101678 [details]
Patch

Attachment 101678 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/9192900
Comment 6 WebKit Review Bot 2011-07-21 18:35:33 PDT
Comment on attachment 101678 [details]
Patch

Attachment 101678 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/9192910
Comment 7 WebKit Review Bot 2011-07-21 18:37:25 PDT
Comment on attachment 101678 [details]
Patch

Attachment 101678 [details] did not pass cr-mac-ews (chromium):
Output: http://queues.webkit.org/results/9201974
Comment 8 Adam Barth 2011-07-22 11:11:57 PDT
Comment on attachment 101678 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=101678&action=review

> Source/WebCore/xml/parser/MarkupTokenizerBase.h:155
> +    // This method can get hidden in subclasses

Can we make these methods ASSERT_NOT_REACHED instead?  I'm not quite sure why we have them.
Comment 9 Adam Barth 2011-07-22 11:12:55 PDT
Comment on attachment 101678 [details]
Patch

Generally, this looks good.  I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring.
Comment 10 Vicki Pfau 2011-07-22 14:20:15 PDT
(In reply to comment #8)
> (From update of attachment 101678 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=101678&action=review
> 
> > Source/WebCore/xml/parser/MarkupTokenizerBase.h:155
> > +    // This method can get hidden in subclasses
> 
> Can we make these methods ASSERT_NOT_REACHED instead?  I'm not quite sure why we have them.

These functions contain shared code that is used by the HTML tokenizer as well as the XML tokenizer. The base functions won't get hidden by the XML tokenizer because it doesn't need to worry about saving the open tag name or any special stuff that HTML does.

(In reply to comment #9)
> (From update of attachment 101678 [details])
> Generally, this looks good.  I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring.

The refactored code seems to cause about a 2.5-3.5% performance hit to the parser. Before the refactor, the HTML parser performance test gave me results that were typically like this:

avg 1638.45
median 1633.5
stdev 16.719674039884868
min 1616
max 1684

And after the refractor, the numbers typically looked like this:

avg 1690.95
median 1690
stdev 10.086996579755542
min 1673
max 1709

The input preprocessor is also reused because XML doesn't accept null bytes and does CRLF folding.
Comment 11 Vicki Pfau 2011-07-22 14:25:08 PDT
Created attachment 101765 [details]
Patch
Comment 12 WebKit Review Bot 2011-07-22 14:28:22 PDT
Attachment 101765 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/html/parser/HTMLTokenizer.cpp:210:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 13 Adam Barth 2011-07-22 14:38:32 PDT
Can we avoid the 3% regression?  Where is it coming from?
Comment 14 Vicki Pfau 2011-07-22 18:12:04 PDT
Created attachment 101799 [details]
Patch
Comment 15 WebKit Review Bot 2011-07-22 18:14:29 PDT
Attachment 101799 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/html/parser/HTMLTokenizer.cpp:221:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 16 Adam Barth 2011-07-22 18:15:15 PDT
Comment on attachment 101799 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=101799&action=review

> Source/WebCore/html/parser/HTMLTokenizer.cpp:110
> +    && (m_state == HTMLTokenizerState::DataState
> +        || m_state == HTMLTokenizerState::RCDATAState
> +        || m_state == HTMLTokenizerState::RAWTEXTState
> +        || m_state == HTMLTokenizerState::PLAINTEXTState);

Bad indent
Comment 17 Vicki Pfau 2011-07-22 18:54:52 PDT
Created attachment 101802 [details]
Patch
Comment 18 WebKit Review Bot 2011-07-22 18:58:13 PDT
Attachment 101802 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1

Source/WebCore/html/parser/HTMLTokenizer.cpp:221:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 15 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 19 WebKit Review Bot 2011-07-24 03:07:43 PDT
Comment on attachment 101802 [details]
Patch

Clearing flags on attachment: 101802

Committed r91643: <http://trac.webkit.org/changeset/91643>
Comment 20 WebKit Review Bot 2011-07-24 03:07:49 PDT
All reviewed patches have been landed.  Closing bug.