RESOLVED FIXED 65000
Refactor HTML tokenizer code in preparation for a new XML tokenizer
https://bugs.webkit.org/show_bug.cgi?id=65000
Summary Refactor HTML tokenizer code in preparation for a new XML tokenizer
Vicki Pfau
Reported 2011-07-21 17:41:04 PDT
Refactor HTML tokenizer code in preparation for a new XML tokenizer
Attachments
Patch (116.66 KB, patch)
2011-07-21 17:50 PDT, Vicki Pfau
no flags
Patch (119.74 KB, patch)
2011-07-22 14:25 PDT, Vicki Pfau
no flags
Patch (119.79 KB, patch)
2011-07-22 18:12 PDT, Vicki Pfau
no flags
Patch (119.80 KB, patch)
2011-07-22 18:54 PDT, Vicki Pfau
no flags
Vicki Pfau
Comment 1 2011-07-21 17:50:26 PDT
WebKit Review Bot
Comment 2 2011-07-21 17:53:26 PDT
Attachment 101678 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:217: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 3 2011-07-21 18:05:29 PDT
Comment on attachment 101678 [details] Patch Spoke with jpfau in #webkit a bit about this change. This seems like a good direction. The main question revolves around whether we want InputStreamPreprocessor for XML.
Adam Barth
Comment 4 2011-07-21 18:06:31 PDT
Comment on attachment 101678 [details] Patch Can you include before/after results from the HTML parser benchmark? The tokenizer perf is very sensitive to subtle changes.
WebKit Review Bot
Comment 5 2011-07-21 18:11:57 PDT
WebKit Review Bot
Comment 6 2011-07-21 18:35:33 PDT
Comment on attachment 101678 [details] Patch Attachment 101678 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/9192910
WebKit Review Bot
Comment 7 2011-07-21 18:37:25 PDT
Comment on attachment 101678 [details] Patch Attachment 101678 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/9201974
Adam Barth
Comment 8 2011-07-22 11:11:57 PDT
Comment on attachment 101678 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=101678&action=review > Source/WebCore/xml/parser/MarkupTokenizerBase.h:155 > + // This method can get hidden in subclasses Can we make these methods ASSERT_NOT_REACHED instead? I'm not quite sure why we have them.
Adam Barth
Comment 9 2011-07-22 11:12:55 PDT
Comment on attachment 101678 [details] Patch Generally, this looks good. I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring.
Vicki Pfau
Comment 10 2011-07-22 14:20:15 PDT
(In reply to comment #8) > (From update of attachment 101678 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=101678&action=review > > > Source/WebCore/xml/parser/MarkupTokenizerBase.h:155 > > + // This method can get hidden in subclasses > > Can we make these methods ASSERT_NOT_REACHED instead? I'm not quite sure why we have them. These functions contain shared code that is used by the HTML tokenizer as well as the XML tokenizer. The base functions won't get hidden by the XML tokenizer because it doesn't need to worry about saving the open tag name or any special stuff that HTML does. (In reply to comment #9) > (From update of attachment 101678 [details]) > Generally, this looks good. I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring. The refactored code seems to cause about a 2.5-3.5% performance hit to the parser. Before the refactor, the HTML parser performance test gave me results that were typically like this: avg 1638.45 median 1633.5 stdev 16.719674039884868 min 1616 max 1684 And after the refractor, the numbers typically looked like this: avg 1690.95 median 1690 stdev 10.086996579755542 min 1673 max 1709 The input preprocessor is also reused because XML doesn't accept null bytes and does CRLF folding.
Vicki Pfau
Comment 11 2011-07-22 14:25:08 PDT
WebKit Review Bot
Comment 12 2011-07-22 14:28:22 PDT
Attachment 101765 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:210: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 13 2011-07-22 14:38:32 PDT
Can we avoid the 3% regression? Where is it coming from?
Vicki Pfau
Comment 14 2011-07-22 18:12:04 PDT
WebKit Review Bot
Comment 15 2011-07-22 18:14:29 PDT
Attachment 101799 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:221: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Adam Barth
Comment 16 2011-07-22 18:15:15 PDT
Comment on attachment 101799 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=101799&action=review > Source/WebCore/html/parser/HTMLTokenizer.cpp:110 > + && (m_state == HTMLTokenizerState::DataState > + || m_state == HTMLTokenizerState::RCDATAState > + || m_state == HTMLTokenizerState::RAWTEXTState > + || m_state == HTMLTokenizerState::PLAINTEXTState); Bad indent
Vicki Pfau
Comment 17 2011-07-22 18:54:52 PDT
WebKit Review Bot
Comment 18 2011-07-22 18:58:13 PDT
Attachment 101802 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:221: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Review Bot
Comment 19 2011-07-24 03:07:43 PDT
Comment on attachment 101802 [details] Patch Clearing flags on attachment: 101802 Committed r91643: <http://trac.webkit.org/changeset/91643>
WebKit Review Bot
Comment 20 2011-07-24 03:07:49 PDT
All reviewed patches have been landed. Closing bug.
Note You need to log in before you can comment on or make changes to this bug.