Refactor HTML tokenizer code in preparation for a new XML tokenizer
Created attachment 101678 [details] Patch
Attachment 101678 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:217: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 101678 [details] Patch Spoke with jpfau in #webkit a bit about this change. This seems like a good direction. The main question revolves around whether we want InputStreamPreprocessor for XML.
Comment on attachment 101678 [details] Patch Can you include before/after results from the HTML parser benchmark? The tokenizer perf is very sensitive to subtle changes.
Comment on attachment 101678 [details] Patch Attachment 101678 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/9192900
Comment on attachment 101678 [details] Patch Attachment 101678 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/9192910
Comment on attachment 101678 [details] Patch Attachment 101678 [details] did not pass cr-mac-ews (chromium): Output: http://queues.webkit.org/results/9201974
Comment on attachment 101678 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=101678&action=review > Source/WebCore/xml/parser/MarkupTokenizerBase.h:155 > + // This method can get hidden in subclasses Can we make these methods ASSERT_NOT_REACHED instead? I'm not quite sure why we have them.
Comment on attachment 101678 [details] Patch Generally, this looks good. I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring.
(In reply to comment #8) > (From update of attachment 101678 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=101678&action=review > > > Source/WebCore/xml/parser/MarkupTokenizerBase.h:155 > > + // This method can get hidden in subclasses > > Can we make these methods ASSERT_NOT_REACHED instead? I'm not quite sure why we have them. These functions contain shared code that is used by the HTML tokenizer as well as the XML tokenizer. The base functions won't get hidden by the XML tokenizer because it doesn't need to worry about saving the open tag name or any special stuff that HTML does. (In reply to comment #9) > (From update of attachment 101678 [details]) > Generally, this looks good. I'd like to see a perf comparison (to be safe) and there's some question about how much of the input stream pre-processor you want for XML, which affects some of the factoring. The refactored code seems to cause about a 2.5-3.5% performance hit to the parser. Before the refactor, the HTML parser performance test gave me results that were typically like this: avg 1638.45 median 1633.5 stdev 16.719674039884868 min 1616 max 1684 And after the refractor, the numbers typically looked like this: avg 1690.95 median 1690 stdev 10.086996579755542 min 1673 max 1709 The input preprocessor is also reused because XML doesn't accept null bytes and does CRLF folding.
Created attachment 101765 [details] Patch
Attachment 101765 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:210: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Can we avoid the 3% regression? Where is it coming from?
Created attachment 101799 [details] Patch
Attachment 101799 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:221: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 101799 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=101799&action=review > Source/WebCore/html/parser/HTMLTokenizer.cpp:110 > + && (m_state == HTMLTokenizerState::DataState > + || m_state == HTMLTokenizerState::RCDATAState > + || m_state == HTMLTokenizerState::RAWTEXTState > + || m_state == HTMLTokenizerState::PLAINTEXTState); Bad indent
Created attachment 101802 [details] Patch
Attachment 101802 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/html/parser/HTMLTokenizer.cpp:221: Non-label code inside switch statements should be indented. [whitespace/indent] [4] Total errors found: 1 in 15 files If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 101802 [details] Patch Clearing flags on attachment: 101802 Committed r91643: <http://trac.webkit.org/changeset/91643>
All reviewed patches have been landed. Closing bug.