Bug 47397 - TextResourceDecoder::checkForHeadCharset can look way past the limit.
Summary: TextResourceDecoder::checkForHeadCharset can look way past the limit.
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Jenn Braithwaite
URL:
Keywords:
Depends on:
Blocks: 54582
Reported: 2010-10-07 20:36 PDT by Dmitry Titov
Modified: 2011-02-16 13:56 PST
CC List: 8 users

See Also:


Attachments
patch (work in progress) (5.85 KB, patch)
2010-11-12 16:02 PST, Jenn Braithwaite
no flags
patch - HTMLTokenizer version (work in progress) (35.12 KB, patch)
2010-11-22 14:45 PST, Jenn Braithwaite
no flags
patch (474.11 KB, patch)
2010-12-02 11:05 PST, Jenn Braithwaite
abarth: review+
abarth: commit-queue-
updated patch (474.39 KB, patch)
2010-12-08 10:48 PST, Jenn Braithwaite
abarth: review+
commit-queue: commit-queue-
updated patch - static const change (474.41 KB, patch)
2010-12-09 10:12 PST, Jenn Braithwaite
no flags
Patch for landing (482.70 KB, patch)
2010-12-09 20:28 PST, Adam Barth
commit-queue: commit-queue-
failure (55.98 KB, text/plain)
2010-12-10 00:32 PST, Adam Barth
no flags

Description Dmitry Titov 2010-10-07 20:36:22 PDT
TextResourceDecoder::checkForHeadCharset looks for a charset in the portion of the data received so far and does not feed the parser until it finds the encoding. The limit seems to be set to the end of the head section (before the body) or 1024 bytes, whichever comes last. However, the logic does not stop if one has HTML with just text in the body, like this:

<head></head><body>... lots of plain text ... </body>

In this case, the decoder will accumulate all the text, no matter how big, and only then feed it into the parser. It may scan it multiple times as well.
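
For illustration, the stopping rule is supposed to be roughly this (a made-up sketch, not the actual WebKit code):

static const size_t maxSniffBytes = 1024;

bool doneSniffing(size_t bytesSeen, bool pastHeadSection)
{
    // Stop only once we are past <head> AND past the 1024-byte window,
    // whichever comes last. The bug: if pastHeadSection never becomes
    // true, this never returns true and the decoder buffers everything.
    return pastHeadSection && bytesSeen >= maxSniffBytes;
}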

Test and patch coming. If for some reason we actually want this to work specifically like this, please voice concern.
Comment 1 Alexey Proskuryakov 2010-10-07 20:54:09 PDT
I think that this was sort of known, with the idea that looking a little more ahead (in the common case) wasn't a problem. But I don't think that's intentional behavior.
Comment 2 Alexey Proskuryakov 2010-10-08 14:02:50 PDT
Also, the thinking at some point was that we'd like to switch the charset sniffer to use the real parser, so we didn't care about imperfections in the existing code, as long as they didn't affect real sites.

Perhaps using the real parser is more practical now!
Comment 3 Dmitry Titov 2010-10-08 15:07:23 PDT
I agree that having a lot of text in the body w/o any other tags is a fringe case; I only hit it trying to write a test for another issue.

However, looking at this code in a debugger, I found another scenario where it does not behave as intended:

<script> if (foo < bar) bar = foo; </script>
<body> ... more stuff, including other tags but not <script>

This causes checkForHeadCharset to think it never leaves the head section, since when it finds "<" in the JS it starts looking for the closing ">" of what it thinks is a tag. So it skips "</script" and happily continues with enclosingTagName being "script" and no chance of resetting this state.

This seems to be a more probable case, and it will cause the whole page to accumulate before starting to parse it.

I'm still not sure what kind of test I can create for this that would not rely on a timeout.
Comment 4 Dmitry Titov 2010-10-08 15:16:20 PDT
(In reply to comment #2)
> Perhaps using the real parser is more practical now!

Hmm, it'd be cool. I'll look into it.
Comment 5 Alexey Proskuryakov 2010-10-08 15:17:26 PDT
If the sniffer thinks that it's in script, then the below test should fail, since we ignore tags in <script>. But it doesn't fail for me:

<script> < </script><meta charset=koi8-r><script>alert(document.charset)</script>
Comment 6 Adam Barth 2010-10-08 15:20:11 PDT
> Hmm, it'd be cool. I'll look into it.

If you look, there are a couple of examples of folks who use the HTMLTokenizer by itself.  It's designed to be straightforward to reuse without the rest of the parsing machinery.  The one trick might be handling both 16bit and 8bit input streams at the same time.
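
Roughly, standalone use might look like this (untested sketch; I'm guessing the exact signatures):

OwnPtr<HTMLTokenizer> tokenizer = HTMLTokenizer::create(false);
SegmentedString input;
HTMLToken token;

input.append(SegmentedString(decodedChunk)); // decodedChunk: a WTF::String
while (tokenizer->nextToken(input, token)) {
    if (token.type() == HTMLToken::StartTag) {
        // Inspect the tag name and attributes here.
    }
    token.clear();
}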
Comment 7 Dmitry Titov 2010-10-08 16:05:33 PDT
(In reply to comment #5)
> If the sniffer thinks that it's in script, then the below test should fail, since we ignore tags in <script>. But it doesn't fail for me:
> 
> <script> < </script><meta charset=koi8-r><script>alert(document.charset)</script>

This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section - so the detection of charset works fine. However, if there is no charset, the detection never stops and the detector accumulates the whole page (perhaps scanning through the whole of it multiple times) before giving it to the parser when the data load finishes.
Comment 8 Dmitry Titov 2010-10-08 16:11:47 PDT
(In reply to comment #6)
> > Hmm, it'd be cool. I'll look into it.
> 
> If you look, there are a couple of examples of folks who use the HTMLTokenizer by itself.  It's designed to be straightforward to reuse without the rest of the parsing machinery.  The one trick might be handling both 16bit and 8bit input streams at the same time.

Interesting. HTMLTokenizer works on a decoded stream and it's not clear how to do parsing before decoding, which is what we need to do here. Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.

Using the imperfect but all-in-one-function sniffer might be a better alternative, since it allows someone to inspect it more easily while keeping in mind the assumption that the input text is actually an undecoded stream. It also has some functionality that will still not be part of the HTMLTokenizer, like looking for the BOM, XML processing instructions, etc.

Still looking at HTMLTokenizer...
Comment 9 Alexey Proskuryakov 2010-10-08 16:17:37 PDT
> This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section

I guess I could just check for myself in a debugger, but I don't follow this explanation. The sniffer is supposed to ignore <meta> and other tags inside <script>, and you said that it never bails out of "in script" mode.
Comment 10 Dmitry Titov 2010-10-08 16:22:15 PDT
(In reply to comment #9)
> > This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section
> 
> I guess I could just check for myself in debugger, but I don't follow this explanation. The sniffer is supposed to ignore <meta> and other tags inside <script>, and you said that it never bails out of "in script" mode.

Sorry Alexey, I was thinking about my issue while answering :-). Well, I've just traced your example in the debugger and the code does skip the closing </script> tag and then processes the <meta> even while having a non-null enclosingTagName. So I'm not sure it actually skips tags inside <script>...
Comment 11 Adam Barth 2010-10-08 16:25:24 PDT
> Interesting. HTMLTokenizer works on a decoded stream and it's not clear how to do parsing before decoding, which is what we need to do here. Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.

If the width of the characters is ok, it should be fine.  By design, we don't support encodings that tokenize differently than UTF8 (e.g., UTF7).

The bigger problem, I'd expect, is that HTMLTokenizer expects a 16bit character but we might have UTF8 input.  You might be able to finesse that because the HTMLTokenizer only looks at one character at a time.  You could just zero-pad the 8bit characters into 16bit characters.
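
The zero-padding itself would be trivial, e.g. (made-up sketch in plain C++ rather than WTF types):

#include <cstddef>
#include <cstdint>
#include <vector>

// Widen each 8-bit byte into a 16-bit code unit by zero-extension. This
// is the same mapping as decoding the bytes as ISO Latin-1, which can
// never fail.
std::vector<uint16_t> zeroPadTo16Bit(const char* data, std::size_t length)
{
    std::vector<uint16_t> widened;
    widened.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
        widened.push_back(static_cast<uint8_t>(data[i])); // zero-extend byte
    return widened;
}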
Comment 12 Alexey Proskuryakov 2010-10-08 16:27:35 PDT
I see. We only ignore tags in <script> for the purpose of ending <head> section - for example, <script>"<body>"</script> won't make inHeadSection false.
Comment 13 Dmitry Titov 2010-10-08 16:34:42 PDT
(In reply to comment #12)
> I see. We only ignore tags in <script> for the purpose of ending <head> section - for example, <script>"<body>"</script> won't make inHeadSection false.

Curiously, even this detects the charset (incorrectly):
<script>foo='<meta charset=koi8-r>'</script><script>alert(document.charset)</script>

+1 for HTMLTokenizer :-)
Comment 14 Darin Adler 2010-10-08 17:33:43 PDT
(In reply to comment #8)
> Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.

You’d want to use something like ISO Latin-1, where it can’t spuriously fail to decode.
Comment 15 Jenn Braithwaite 2010-11-12 16:02:29 PST
Created attachment 73787 [details]
patch (work in progress)

Preliminary patch showing possible fixes to the charset sniffer to address two issues brought up by dimich. Includes tests for those fixes, plus tests to baseline the current charset sniffer behavior (quotes and meta-tag-in-body).

Next step: investigate HTMLTokenizer to decide if we should replace the charset sniffer code with it.
Comment 16 Jenn Braithwaite 2010-11-22 14:45:22 PST
Created attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

This patch implements charset encoding detection using HTMLTokenizer.  It also has additional tests, including those in the previous patch.

The version with the HTMLTokenizer is about 1% slower than the version in the previous patch, averaged across 5 runs of:
time WebkitBuild/Debug/DumpRenderTree `ls LayoutTests/fast/encoding/*.html`

However, the parsing is more precise in the HTMLTokenizer version (e.g., correct handling of quoted text and escaped characters), and it passes all tests. The version in the previous patch fails 4 of the new tests: quotes, escaped characters, mismatched end tag, and unexpected 'charset=' text in a meta tag.  "More precise" also means "stricter", so there may be other tests that need the same corrections as fast/encoding/namespace-tolerance.html (see patch).

Please take a look at this new patch and weigh in on whether we should switch to using a real parser versus more bug fixes to the existing code.
Comment 17 Adam Barth 2010-11-22 15:34:23 PST
Comment on attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review

> LayoutTests/fast/encoding/namespace-tolerance.html:2
> -<meta content="charset=UTF-8">
> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

Aren't we supposed to detect the original charset according to HTML5?

> WebCore/loader/TextResourceDecoder.cpp:575
> +    // the HTTP-EQUIV meta has no effect on XHTML

the => The

Also, please end sentences with a period.

> WebCore/loader/TextResourceDecoder.cpp:581
> +    // Create tokenizer with usePreHTML5ParserQuirks set to false to match the
> +    // behavior that was in place prior to switching to using the HTMLTokenizer
> +    // for parsing.

We don't really need this comment.

> WebCore/loader/TextResourceDecoder.cpp:583
> +    return checkForMetaCharset(data, len);

len => length.

> WebCore/loader/TextResourceDecoder.cpp:586
> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)

Where does this algorithm come from?  If it's from the HTML5 spec, we should put in links to the spec.  In general, this algorithm looks complicated with lots of Boolean flags.  Can we make it a state machine?  Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder?  Should that class go in WebCore/html/parser?

> WebCore/loader/TextResourceDecoder.cpp:609
> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));

I bet this is where your slowdown is coming from.

> WebCore/loader/TextResourceDecoder.cpp:623
> +                        m_enclosingTagName = titleTag.localName().impl();

Why not just store the whole Qualified name?

> WebCore/loader/TextResourceDecoder.cpp:636
> +                            String attributeValue = String(iter->m_value.data(), iter->m_value.size());

You can avoid this memcpy if you don't care about the attr name, but maybe that doesn't matter.

> WebCore/loader/TextResourceDecoder.h:104
> +    OwnPtr<HTMLTokenizer> m_tokenizer;
> +    OwnPtr<TextCodec> m_assumedCodec;
> +    SegmentedString m_input;
> +    HTMLToken m_token;
> +    AtomicStringImpl* m_enclosingTagName;
> +    bool m_inHeadSection;

Yeah, this block of state seems like it should be encapsulated into its own object.
Comment 18 Jenn Braithwaite 2010-11-22 15:54:09 PST
Comment on attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review

Thanks for taking a look so quickly.  Still waiting for a go/no-go on the HTMLTokenizer before I clean things up and make a proper patch.  What's your vote?

>> LayoutTests/fast/encoding/namespace-tolerance.html:2
>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
> 
> Aren't we supposed to detect the original charset according to HTML5?

HTML5 would be <meta charset=UTF-8>  Notice the "content=" difference.

>> WebCore/loader/TextResourceDecoder.cpp:586
>> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)
> 
> Where does this algorithm come from?  If it's from the HTML5 spec, we should put in links to the spec.  In general, this algorithm looks complicated with lots of Boolean flags.  Can we make it a state machine?  Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder?  Should that class go in WebCore/html/parser?

The logic is basically the existing algorithm, with parsing replaced by the tokenizer. A few bug fixes were made to the algorithm.  The Boolean flags are because http-equiv could come before/after content in a meta tag.  I also have a version where this is implemented in a separate class in WebCore/html/parser, but it seemed like overkill as this is the only use of it.  I'll revisit that version.

>> WebCore/loader/TextResourceDecoder.cpp:609
>> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
> 
> I bet this is where your slowdown is coming from.

Is this necessary?  If so, is the slowdown worth the added parsing precision?

>> WebCore/loader/TextResourceDecoder.cpp:623
>> +                        m_enclosingTagName = titleTag.localName().impl();
> 
> Why not just store the whole Qualified name?

This is from the existing code.  I'm guessing it's more efficient to use the ptr. (@ap: is that right?)
Comment 19 Adam Barth 2010-11-22 16:11:45 PST
Comment on attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review

Is there an algorithm in the HTML5 spec for extracting the charset from the meta tag?  If so, we should switch to that rather than passing through a state that is almost, but not quite, identical to our historical behavior.

You say there's a 1% slowdown, but on what benchmark?  We should try the PLT.

>>> LayoutTests/fast/encoding/namespace-tolerance.html:2
>>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
>> 
>> Aren't we supposed to detect the original charset according to HTML5?
> 
> HTML5 would be <meta charset=UTF-8>  Notice the "content=" difference.

That's an authoring requirement, but what are the user-agent requirements like?

>>> WebCore/loader/TextResourceDecoder.cpp:586
>>> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)
>> 
>> Where does this algorithm come from?  If it's from the HTML5 spec, we should put in links to the spec.  In general, this algorithm looks complicated with lots of Boolean flags.  Can we make it a state machine?  Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder?  Should that class go in WebCore/html/parser?
> 
> The logic is basically the existing algorithm, with parsing replaced by the tokenizer. A few bug fixes were made to the algorithm.  The Boolean flags are because http-equiv could come before/after content in a meta tag.  I also have a version where this is implemented in a separate class in WebCore/html/parser, but it seemed like overkill as this is the only use of it.  I'll revisit that version.

The idea behind using separate logic is to reduce the complexity of the system, because this block of state has a very narrow API (given some bits, what's the charset?).

>>> WebCore/loader/TextResourceDecoder.cpp:609
>>> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
>> 
>> I bet this is where your slowdown is coming from.
> 
> Is this necessary?  If so, is the slowdown worth the added parsing precision?

Well, somewhere you need to blow up the string into 16 bit characters.  We've talked about adding an 8 bit path through the parser, which is what this would be good for.  I wish there was some way of sharing the decoding work...  Maybe we want a trivial decoder that just zero-pads everything?
Comment 20 Jenn Braithwaite 2010-11-22 16:40:55 PST
(In reply to comment #19)
> (From update of attachment 74600 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review
> 
> Is there an algorithm in the HTML5 spec for extracting the charset from the meta tag?  If so, we should switch to that rather than passing through a state that is almost, but not quite, identical to our historical behavior.

This code is a close match to the algorithm in the HTML5 spec (except I used a 2nd flag instead of a mode).  I can make it a more precise match to the spec. 

> 
> You say there's a 1% slowdown, but on what benchmark?  We should try the PLT.

This is a rough measurement on my machine, with the same processes active for all runs, to give us some ballpark for evaluating the time vs. parsing-precision trade-off.  Another trade-off is ease of maintainability with the HTMLTokenizer version.

> 
> >>> LayoutTests/fast/encoding/namespace-tolerance.html:2
> >>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
> >> 
> >> Aren't we supposed to detect the original charset according to HTML5?
> > 
> > HTML5 would be <meta charset=UTF-8>  Notice the "content=" difference.
> 
> That's an authoring requirement, but what are the user-agent requirements like?

content is ignored if got-pragma is false in the HTML5 spec algorithm.   http-equiv=Content-Type makes got-pragma true.
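
In code, the got-pragma rule boils down to roughly this (simplified, self-contained sketch; the names are mine, and the real algorithm matches ASCII-case-insensitively and parses the content value more carefully):

#include <string>
#include <utility>
#include <vector>

std::string charsetFromMeta(const std::vector<std::pair<std::string, std::string> >& attributes)
{
    bool gotPragma = false;  // saw http-equiv="Content-Type"
    bool needPragma = false; // charset came from content=, so a pragma is required
    std::string charset;

    for (size_t i = 0; i < attributes.size(); ++i) {
        const std::string& name = attributes[i].first;
        const std::string& value = attributes[i].second;
        if (name == "http-equiv" && value == "Content-Type")
            gotPragma = true;
        else if (name == "charset") {
            charset = value; // <meta charset=...> needs no pragma
            needPragma = false;
        } else if (name == "content" && charset.empty()) {
            std::string::size_type pos = value.find("charset=");
            if (pos != std::string::npos) {
                charset = value.substr(pos + 8);
                needPragma = true; // only honored together with got-pragma
            }
        }
    }
    return (needPragma && !gotPragma) ? std::string() : charset;
}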

> >>> WebCore/loader/TextResourceDecoder.cpp:609
> >>> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
> >> 
> >> I bet this is where your slowdown is coming from.
> > 
> > Is this necessary?  If so, is the slowdown worth the added parsing precision?
> 
> Well, somewhere you need to blow up the string into 16 bit characters.  We've talked about adding an 8 bit path through the parser, which is what this would be good for.  I wish there was some way of sharing the decoding work...  Maybe we want a trivial decoder that just zero-pads everything?

Hoping someone else will chime in here on the zero-padding idea.  I used ISO Latin-1 based on Adler's comment in this thread, but I'm open to changing that.
Comment 21 Adam Barth 2010-11-22 16:45:34 PST
High-level reactions:

1) This looks like the right approach.
2) Deleting all that custom parsing code is good times.
3) I'd like to try matching the HTML5 algorithm precisely.
4) I'd like to understand, and hopefully eliminate, any slowdown.
Comment 22 Alexey Proskuryakov 2010-11-22 22:18:45 PST
> 3) I'd like to try matching the HTML5 algorithm precisely.

The HTML5 algorithm kind of matches our old super-buggy code, for no apparent reason. See <http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628> for discussion.
Comment 23 Adam Barth 2010-11-22 22:36:33 PST
Sounds like we should convert this test suite to a layout test:
http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html
Comment 24 Ian 'Hixie' Hickson 2010-11-29 13:18:27 PST
Sorry if the reasons for the spec aren't apparent. There are two main reasons: making your life easier (generally it's easier if I spec what you do so you don't have to change it) and compatibility (there's no particularly compelling reason to implement this one way or another, so the way that browsers do it seems the safest option).

I'm happy to change the spec to whatever implementations end up doing. I recommend working with other browser vendors to come up with a proposal everyone is willing to implement; this issue has been escalated to the HTMLWG chairs who will shortly be asking for change proposals for it.
Comment 25 Henri Sivonen 2010-11-30 02:10:32 PST
Firefox trunk implements the HTML5 spec from before the spec paid attention to the http-equiv attribute, except that Firefox trunk inspects the first 1024 bytes instead of the first 512. This is to match WebKit; to work around Apache gzip brokenness on Dreamhost-hosted sites, which makes Gecko unable to reload Dreamhost-hosted content in mid-stream, so the prescan *must* succeed on Dreamhost; and to pass probability-based heuristic CJK detection test cases, now that the heuristic detector also runs on the first 1024 bytes instead of running on the whole document.

I'm now in the process of implementing the spec changes to make Gecko pay attention to the http-equiv attribute. This is 
https://bugzilla.mozilla.org/show_bug.cgi?id=594730

However, in the process, some spec bugs were found:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11428
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11426
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11411
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11412
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11414

I recommend implementing the spec as amended by the obvious fixes to the 5 spec bugs.

This means not stopping at </head>.

I see no practical value in implementing the change proposed in
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628
Comment 26 Alexey Proskuryakov 2010-11-30 08:42:42 PST
One thing I really don't want to implement in WebKit is restarting if <meta> is found by actual parsing, as opposed to charset prescan.
Comment 27 Jenn Braithwaite 2010-12-02 11:05:08 PST
Created attachment 75396 [details]
patch

This new version is now 1% faster than the pre-existing charset algorithm.  Performance was measured using the same methods mentioned earlier in the thread.
Comment 28 Alexey Proskuryakov 2010-12-02 11:23:58 PST
Comment on attachment 75396 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review

> LayoutTests/ChangeLog:11
> +        Added http-equiv attribute to meta tag in 2 existing tests.

So, did '<meta content="charset=UTF-8">' that we had in the tests not work in other browsers?
Comment 29 Jenn Braithwaite 2010-12-02 11:33:21 PST
(In reply to comment #28)
> (From update of attachment 75396 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review
> 
> > LayoutTests/ChangeLog:11
> > +        Added http-equiv attribute to meta tag in 2 existing tests.
> 
> So, did '<meta content="charset=UTF-8">' that we had in the tests not work in other browsers?

The spec now requires http-equiv before using the charset in the content attribute.  Henri noted that he's updating Firefox to pay attention to the http-equiv attribute.  IE8 requires http-equiv.
Comment 30 Adam Barth 2010-12-07 19:18:27 PST
Comment on attachment 75396 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review

Very nice patch.  I particularly like the ratio of testing to code change and the large amount of code removed.  A few nits below.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:43
> +    : m_tokenizer(HTMLTokenizer::create(false))

I wish I remembered what false does here.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:64
> +        pos += 7;

Ideally we'd have a constant for this value to connect it with the string:

const char* charsetString = "charset";
const size_t charsetLength = sizeof("charset") - 1;

Something like that.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:80
> +            quoteMark = value[pos++];

static_cast<char> + ASSERT that it's a 7 bit value would be nice here.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:108
> +        String attributeValue = String(iter->m_value.data(), iter->m_value.size());

You can just call the constructor instead of both the constructor and the copy constructor (i.e., remove the =).

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:113
> +        } else if (!charset.length()) {

charset.isEmpty()

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:166
> +            AtomicString tag(m_token.name().data(), m_token.name().size());

I'd rename tag => tagName, but it's not that important.
Comment 31 Jenn Braithwaite 2010-12-08 10:48:21 PST
Created attachment 75929 [details]
updated patch

> --- Comment #30 from Adam Barth <abarth@webkit.org>  2010-12-07 19:18:27 PST ---
> (From update of attachment 75396 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review
>
> Very nice patch.  I particularly like the ratio of testing to code change and the large amount of code removed.  A few nits below.
>
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:43
>> +    : m_tokenizer(HTMLTokenizer::create(false))
>
> I wish I remembered what false does here.

Added a comment.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:64
>> +        pos += 7;
>
> Ideally we'd have a constant for this value to connect it with the string:
>
> const char* charsetString = "charset";
> const size_t charsetLength = sizeof("charset") - 1;
>
> Something like that.

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:80
>> +            quoteMark = value[pos++];
>
> static_cast<char> + ASSERT that it's a 7 bit value would be nice here.

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:108
>> +        String attributeValue = String(iter->m_value.data(), iter->m_value.size());
>
> You can just call the constructor instead of both the constructor and the copy constructor (i.e., remove the =).

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:113
>> +        } else if (!charset.length()) {
>
> charset.isEmpty()

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:166
>> +            AtomicString tag(m_token.name().data(), m_token.name().size());
>
> I'd rename tag => tagName, but it's not that important.

Done.

Also, fixed a few minor typos in the tests and modified parser-tests.html to keep Hixie's original indenting/formatting.
Comment 32 Adam Barth 2010-12-08 19:01:45 PST
Comment on attachment 75929 [details]
updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75929&action=review

Thanks!

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:55
> +const char* charsetString = "charset";
> +const size_t charsetLength = sizeof("charset") - 1;

We should probably put these in an anonymous namespace so we don't leak the symbols (or maybe const stops that?)
Comment 33 Darin Adler 2010-12-08 19:03:57 PST
Comment on attachment 75929 [details]
updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75929&action=review

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:54
> +const char* charsetString = "charset";

To actually make this a constant, we need to put a const after the *. The const before the * means that the characters are const, not the pointer. Another option is to use const char[] charsetString.
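
Spelled out, just for illustration:

const char* s1 = "charset";        // chars are const; the pointer itself is not
const char* const s2 = "charset";  // both the chars and the pointer are const
const char s3[] = "charset";       // const array; there is no pointer to reassign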

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:55
>> +const size_t charsetLength = sizeof("charset") - 1;
> 
> We should probably put these in an anonymous namespace so we don't leak the symbols (or maybe const stops that?)

The old-school alternative to an anonymous namespace is to just put the keyword static on these. I’d suggest that.
Comment 34 Alexey Proskuryakov 2010-12-08 19:06:45 PST
Consts stop that.

Our normal style is to use "static const", but the difference is minimal - with static const, the compiler will complain about earlier declarations (e.g. in a header).
Comment 35 WebKit Commit Bot 2010-12-09 02:08:43 PST
Comment on attachment 75929 [details]
updated patch

Rejecting patch 75929 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 75929]" exit_code: 2
Last 500 characters of output:
lines).
Hunk #3 succeeded at 15646 (offset 419 lines).
Hunk #4 succeeded at 20495 (offset 641 lines).
Hunk #5 succeeded at 23224 (offset 726 lines).
patching file WebCore/html/parser/HTMLMetaCharsetParser.cpp
patching file WebCore/html/parser/HTMLMetaCharsetParser.h
patching file WebCore/loader/TextResourceDecoder.cpp
patching file WebCore/loader/TextResourceDecoder.h

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6964004
Comment 36 Jenn Braithwaite 2010-12-09 10:12:15 PST
Created attachment 76082 [details]
updated patch - static const change
Comment 37 WebKit Commit Bot 2010-12-09 20:09:00 PST
Comment on attachment 76082 [details]
updated patch - static const change

Rejecting patch 76082 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 76082]" exit_code: 2
Last 500 characters of output:
lines).
Hunk #3 succeeded at 15646 (offset 419 lines).
Hunk #4 succeeded at 20495 (offset 641 lines).
Hunk #5 succeeded at 23224 (offset 726 lines).
patching file WebCore/html/parser/HTMLMetaCharsetParser.cpp
patching file WebCore/html/parser/HTMLMetaCharsetParser.h
patching file WebCore/loader/TextResourceDecoder.cpp
patching file WebCore/loader/TextResourceDecoder.h

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6923028
Comment 38 Adam Barth 2010-12-09 20:15:36 PST
I'll merge this for you.
Comment 39 Adam Barth 2010-12-09 20:28:57 PST
Created attachment 76158 [details]
Patch for landing
Comment 40 WebKit Commit Bot 2010-12-09 22:27:06 PST
Comment on attachment 76158 [details]
Patch for landing

Rejecting patch 76158 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--non-interactive', 76158]" exit_code: 2
Last 500 characters of output:
ncoding/resources/122.html
patching file LayoutTests/fast/encoding/resources/123.html
patching file LayoutTests/fast/text/international/bidi-innertext.html
patching file LayoutTests/http/tests/misc/charset-sniffer-end-sniffing-expected.txt
patching file LayoutTests/http/tests/misc/charset-sniffer-end-sniffing.html
patching file LayoutTests/http/tests/misc/resources/charset-sniffer-end-sniffing.php

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6977017
Comment 41 Adam Barth 2010-12-10 00:32:54 PST
Created attachment 76171 [details]
failure

This test seems to cause this failure.
Comment 42 Adam Barth 2010-12-10 11:50:58 PST
Committed r73756: <http://trac.webkit.org/changeset/73756>
Comment 43 Jenn Braithwaite 2010-12-10 12:09:02 PST
(In reply to comment #41)
> Created an attachment (id=76171) [details]
> failure
> 
> This test seems to cause this failure.

Putting the solution on the record for when/if the test files are ever updated:

The fast/encoding/resources/[0-9]*.html test files contain unusual characters that got corrupted during the svn-apply process. They need to be landed manually.

Also, four of the tests are actually testing tabs. Do 'svn pset allow-tabs true' on those files to disable the precommit hook.