Bug 47397 - TextResourceDecoder::checkForHeadCharset can look way past the limit.
Summary: TextResourceDecoder::checkForHeadCharset can look way past the limit.
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Jenn Braithwaite
URL:
Keywords:
Depends on:
Blocks: 54582
Reported: 2010-10-07 20:36 PDT by Dmitry Titov
Modified: 2011-02-16 13:56 PST
CC List: 8 users

See Also:


Attachments
patch (work in progress) (5.85 KB, patch)
2010-11-12 16:02 PST, Jenn Braithwaite
no flags
patch - HTMLTokenizer version (work in progress) (35.12 KB, patch)
2010-11-22 14:45 PST, Jenn Braithwaite
no flags
patch (474.11 KB, patch)
2010-12-02 11:05 PST, Jenn Braithwaite
abarth: review+
abarth: commit-queue-
updated patch (474.39 KB, patch)
2010-12-08 10:48 PST, Jenn Braithwaite
abarth: review+
commit-queue: commit-queue-
updated patch - static const change (474.41 KB, patch)
2010-12-09 10:12 PST, Jenn Braithwaite
no flags
Patch for landing (482.70 KB, patch)
2010-12-09 20:28 PST, Adam Barth
commit-queue: commit-queue-
failure (55.98 KB, text/plain)
2010-12-10 00:32 PST, Adam Barth
no flags

Description Dmitry Titov 2010-10-07 20:36:22 PDT
TextResourceDecoder::checkForHeadCharset looks for a charset in the portion of the data received so far and does not feed the parser until it finds the encoding. The limit seems to be set to the end of the head section (before the body) or 1024 bytes, whichever comes last. However, the logic does not stop if one has HTML with just text in the body, like this:

<head></head><body>... lots of plain text ... </body>

In this case, the decoder will accumulate all the text, no matter how big, and only then feed it into the parser. It may scan it multiple times as well.
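
For illustration, the stopping rule is supposed to be roughly this (a made-up sketch, not the actual WebKit code):

static const size_t maxSniffBytes = 1024;

bool doneSniffing(size_t bytesSeen, bool pastHeadSection)
{
    // Stop only once we are past <head> AND past the 1024-byte window,
    // whichever comes last. The bug: if pastHeadSection never becomes
    // true, this never returns true and the decoder buffers everything.
    return pastHeadSection && bytesSeen >= maxSniffBytes;
}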

Test and patch coming. If for some reason we actually want this to work specifically like this, please voice concern.
Comment 1 Alexey Proskuryakov 2010-10-07 20:54:09 PDT
I think that this was sort of known, with the idea that looking a little more ahead (in the common case) wasn't a problem. But I don't think that's intentional behavior.
Comment 2 Alexey Proskuryakov 2010-10-08 14:02:50 PDT
Also, the thinking at some point was that we'd like to switch the charset sniffer to use the real parser, so we didn't care about imperfections in the existing code, as long as they didn't affect real sites.

Perhaps using the real parser is more practical now!
Comment 3 Dmitry Titov 2010-10-08 15:07:23 PDT
I agree that having a lot of text in the body w/o any other tags is a fringe case; I only hit it trying to write a test for another issue.

However, looking at this code in a debugger, I found another scenario where it does not behave as intended:

<script> if (foo < bar) bar = foo; </script>
<body> ... more stuff, including other tags but not <script>

This causes checkForHeadCharset to think it never leaves the head section, since when it finds "<" in the JS it starts looking for the closing ">" of what it thinks is a tag. So it skips "</script" and happily continues with enclosingTagName being "script" and no chance of resetting this state.

This seems to be a more probable case, and it will cause the whole page to accumulate before starting to parse it.

I'm still not sure what kind of test I can create for this that would not rely on a timeout.
Comment 4 Dmitry Titov 2010-10-08 15:16:20 PDT
(In reply to comment #2)
> Perhaps using the real parser is more practical now!

Hmm, it'd be cool. I'll look into it.
Comment 5 Alexey Proskuryakov 2010-10-08 15:17:26 PDT
If the sniffer thinks that it's in script, then the below test should fail, since we ignore tags in <script>. But it doesn't fail for me:

<script> < </script><meta charset=koi8-r><script>alert(document.charset)</script>
Comment 6 Adam Barth 2010-10-08 15:20:11 PDT
> Hmm, it'd be cool. I'll look into it.

If you look, there are a couple of examples of folks who use the HTMLTokenizer by itself.  It's designed to be straightforward to reuse without the rest of the parsing machinery.  The one trick might be handling both 16bit and 8bit input streams at the same time.
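
Roughly, standalone use might look like this (untested sketch; I'm guessing the exact signatures):

OwnPtr<HTMLTokenizer> tokenizer = HTMLTokenizer::create(false);
SegmentedString input;
HTMLToken token;

input.append(SegmentedString(decodedChunk)); // decodedChunk: a WTF::String
while (tokenizer->nextToken(input, token)) {
    if (token.type() == HTMLToken::StartTag) {
        // Inspect the tag name and attributes here.
    }
    token.clear();
}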
Comment 7 Dmitry Titov 2010-10-08 16:05:33 PDT
(In reply to comment #5)
> If the sniffer thinks that it's in script, then the below test should fail, since we ignore tags in <script>. But it doesn't fail for me:
> 
> <script> < </script><meta charset=koi8-r><script>alert(document.charset)</script>

This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section - so the detection of charset works fine. However, if there is no charset, the detection never stops and the detector accumulates the whole page (perhaps scanning through the whole of it multiple times) before giving it to the parser when the data load finishes.
Comment 8 Dmitry Titov 2010-10-08 16:11:47 PDT
(In reply to comment #6)
> > Hmm, it'd be cool. I'll look into it.
> 
> If you look, there are a couple of examples of folks who use the HTMLTokenizer by itself.  It's designed to be straightforward to reuse without the rest of the parsing machinery.  The one trick might be handling both 16bit and 8bit input streams at the same time.

Interesting. HTMLTokenizer works on a decoded stream and it's not clear how to do parsing before decoding, which is what we need to do here. Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.

Using the imperfect but all-in-one-function sniffer might be a better alternative, since it allows someone to inspect it more easily while keeping in mind the assumption that the input text is actually an undecoded stream. It also has some functionality that will still not be part of the HTMLTokenizer, like looking for the BOM, XML processing instructions, etc.

Still looking at HTMLTokenizer...
Comment 9 Alexey Proskuryakov 2010-10-08 16:17:37 PDT
> This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section

I guess I could just check for myself in a debugger, but I don't follow this explanation. The sniffer is supposed to ignore <meta> and other tags inside <script>, and you said that it never bails out of "in script" mode.
Comment 10 Dmitry Titov 2010-10-08 16:22:15 PDT
(In reply to comment #9)
> > This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section
> 
> I guess I could just check for myself in debugger, but I don't follow this explanation. The sniffer is supposed to ignore <meta> and other tags inside <script>, and you said that it never bails out of "in script" mode.

Sorry Alexey, I was thinking about my issue while answering :-). Well, I've just traced your example in the debugger and the code does skip the closing </script> tag and then processes the <meta> even while having a non-null enclosingTagName. So I'm not sure it actually skips tags inside <script>...
Comment 11 Adam Barth 2010-10-08 16:25:24 PDT
> Interesting. HTMLTokenizer works on a decoded stream and it's not clear how to do parsing before decoding, which is what we need to do here. Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.

If the width of the characters is ok, it should be fine.  By design, we don't support encodings that tokenize differently than UTF8 (e.g., UTF7).

The bigger problem, I'd expect, is that HTMLTokenizer expects a 16bit character but we might have UTF8 input.  You might be able to finesse that because the HTMLTokenizer only looks at one character at a time.  You could just zero-pad the 8bit characters into 16bit characters.
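
The zero-padding itself would be trivial, e.g. (made-up sketch in plain C++ rather than WTF types):

#include <cstddef>
#include <cstdint>
#include <vector>

// Widen each 8-bit byte into a 16-bit code unit by zero-extension. This
// is the same mapping as decoding the bytes as ISO Latin-1, which can
// never fail.
std::vector<uint16_t> zeroPadTo16Bit(const char* data, std::size_t length)
{
    std::vector<uint16_t> widened;
    widened.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
        widened.push_back(static_cast<uint8_t>(data[i])); // zero-extend byte
    return widened;
}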
Comment 12 Alexey Proskuryakov 2010-10-08 16:27:35 PDT
I see. We only ignore tags in <script> for the purpose of ending <head> section - for example, <script>"<body>"</script> won't make inHeadSection false.
Comment 13 Dmitry Titov 2010-10-08 16:34:42 PDT
(In reply to comment #12)
> I see. We only ignore tags in <script> for the purpose of ending <head> section - for example, <script>"<body>"</script> won't make inHeadSection false.

Curiously, even this detects the charset (incorrectly):
<script>foo='<meta charset=koi8-r>'</script><script>alert(document.charset)</script>

+1 for HTMLTokenizer :-)
Comment 14 Darin Adler 2010-10-08 17:33:43 PDT
(In reply to comment #8)
> Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.

You’d want to use something like ISO Latin-1, where it can’t spuriously fail to decode.
Comment 15 Jenn Braithwaite 2010-11-12 16:02:29 PST
Created attachment 73787 [details]
patch (work in progress)

Preliminary patch showing possible fixes to the charset sniffer to address two issues brought up by dimich. Includes tests for those fixes, plus tests to baseline the current charset sniffer behavior (quotes and meta-tag-in-body).

Next step: investigate HTMLTokenizer to decide if we should replace the charset sniffer code with it.
Comment 16 Jenn Braithwaite 2010-11-22 14:45:22 PST
Created attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

This patch implements charset encoding detection using HTMLTokenizer.  It also has additional tests, including those in the previous patch.

The version with the HTMLTokenizer is about 1% slower than the version in the previous patch, averaged across 5 runs of:
time WebkitBuild/Debug/DumpRenderTree `ls LayoutTests/fast/encoding/*.html`

However, the parsing is more precise in the HTMLTokenizer version (e.g., correct handling of quoted text and escaped characters), and it passes all tests. The version in the previous patch fails 4 of the new tests: quotes, escaped characters, mismatched end tag, and unexpected 'charset=' text in a meta tag.  "More precise" also means "stricter", so there may be other tests that need the same corrections as fast/encoding/namespace-tolerance.html (see patch).

Please take a look at this new patch and weigh in on whether we should switch to using a real parser versus more bug fixes to the existing code.
Comment 17 Adam Barth 2010-11-22 15:34:23 PST
Comment on attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review

> LayoutTests/fast/encoding/namespace-tolerance.html:2
> -<meta content="charset=UTF-8">
> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

Aren't we supposed to detect the original charset according to HTML5?

> WebCore/loader/TextResourceDecoder.cpp:575
> +    // the HTTP-EQUIV meta has no effect on XHTML

the => The

Also, please end sentences with a period.

> WebCore/loader/TextResourceDecoder.cpp:581
> +    // Create tokenizer with usePreHTML5ParserQuirks set to false to match the
> +    // behavior that was in place prior to switching to using the HTMLTokenizer
> +    // for parsing.

We don't really need this comment.

> WebCore/loader/TextResourceDecoder.cpp:583
> +    return checkForMetaCharset(data, len);

len => length.

> WebCore/loader/TextResourceDecoder.cpp:586
> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)

Where does this algorithm come from?  If it's from the HTML5 spec, we should put in links to the spec.  In general, this algorithm looks complicated with lots of Boolean flags.  Can we make it a state machine?  Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder?  Should that class go in WebCore/html/parser?

> WebCore/loader/TextResourceDecoder.cpp:609
> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));

I bet this is where your slowdown is coming from.

> WebCore/loader/TextResourceDecoder.cpp:623
> +                        m_enclosingTagName = titleTag.localName().impl();

Why not just store the whole Qualified name?

> WebCore/loader/TextResourceDecoder.cpp:636
> +                            String attributeValue = String(iter->m_value.data(), iter->m_value.size());

You can avoid this memcpy if you don't care about the attr name, but maybe that doesn't matter.

> WebCore/loader/TextResourceDecoder.h:104
> +    OwnPtr<HTMLTokenizer> m_tokenizer;
> +    OwnPtr<TextCodec> m_assumedCodec;
> +    SegmentedString m_input;
> +    HTMLToken m_token;
> +    AtomicStringImpl* m_enclosingTagName;
> +    bool m_inHeadSection;

Yeah, this block of state seems like it should be encapsulated into its own object.
Comment 18 Jenn Braithwaite 2010-11-22 15:54:09 PST
Comment on attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review

Thanks for taking a look so quickly.  Still waiting for a go/no-go on the HTMLTokenizer before I clean things up and make a proper patch.  What's your vote?

>> LayoutTests/fast/encoding/namespace-tolerance.html:2
>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
> 
> Aren't we supposed to detect the original charset according to HTML5?

HTML5 would be <meta charset=UTF-8>  Notice the "content=" difference.

>> WebCore/loader/TextResourceDecoder.cpp:586
>> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)
> 
> Where does this algorithm come from?  If it's from the HTML5 spec, we should put in links to the spec.  In general, this algorithm looks complicated with lots of Boolean flags.  Can we make it a state machine?  Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder?  Should that class go in WebCore/html/parser?

The logic is basically the existing algorithm, with parsing replaced by the tokenizer. A few bug fixes were made to the algorithm.  The Boolean flags are because http-equiv could come before/after content in a meta tag.  I also have a version where this is implemented in a separate class in WebCore/html/parser, but it seemed like overkill as this is the only use of it.  I'll revisit that version.

>> WebCore/loader/TextResourceDecoder.cpp:609
>> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
> 
> I bet this is where your slowdown is coming from.

Is this necessary?  If so, is the slowdown worth the added parsing precision?

>> WebCore/loader/TextResourceDecoder.cpp:623
>> +                        m_enclosingTagName = titleTag.localName().impl();
> 
> Why not just store the whole Qualified name?

This is from the existing code.  I'm guessing it's more efficient to use the ptr. (@ap: is that right?)
Comment 19 Adam Barth 2010-11-22 16:11:45 PST
Comment on attachment 74600 [details]
patch - HTMLTokenizer version (work in progress)

View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review

Is there an algorithm in the HTML5 spec for extracting the charset from the meta tag?  If so, we should switch to that rather than passing through a state that is almost, but not quite, identical to our historical behavior.

You say there's a 1% slowdown, but on what benchmark?  We should try the PLT.

>>> LayoutTests/fast/encoding/namespace-tolerance.html:2
>>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
>> 
>> Aren't we supposed to detect the original charset according to HTML5?
> 
> HTML5 would be <meta charset=UTF-8>  Notice the "content=" difference.

That's an authoring requirement, but what are the user-agent requirements like?

>>> WebCore/loader/TextResourceDecoder.cpp:586
>>> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)
>> 
>> Where does this algorithm come from?  If it's from the HTML5 spec, we should put in links to the spec.  In general, this algorithm looks complicated with lots of Boolean flags.  Can we make it a state machine?  Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder?  Should that class go in WebCore/html/parser?
> 
> The logic is basically the existing algorithm, with parsing replaced by the tokenizer. A few bug fixes were made to the algorithm.  The Boolean flags are because http-equiv could come before/after content in a meta tag.  I also have a version where this is implemented in a separate class in WebCore/html/parser, but it seemed like overkill as this is the only use of it.  I'll revisit that version.

The idea behind using separate logic is to reduce the complexity of the system, because this block of state has a very narrow API (given some bits, what's the charset?).

>>> WebCore/loader/TextResourceDecoder.cpp:609
>>> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
>> 
>> I bet this is where your slowdown is coming from.
> 
> Is this necessary?  If so, is the slowdown worth the added parsing precision?

Well, somewhere you need to blow up the string into 16 bit characters.  We've talked about adding an 8 bit path through the parser, which is what this would be good for.  I wish there was some way of sharing the decoding work...  Maybe we want a trivial decoder that just zero-pads everything?
Comment 20 Jenn Braithwaite 2010-11-22 16:40:55 PST
(In reply to comment #19)
> (From update of attachment 74600 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=74600&action=review
> 
> Is there an algorithm in the HTML5 spec for extracting the charset from the meta tag?  If so, we should switch to that rather than passing through a state that is almost, but not quite, identical to our historical behavior.

This code is a close match to the algorithm in the HTML5 spec (except I used a 2nd flag instead of a mode).  I can make it a more precise match to the spec. 

> 
> You say there's a 1% slowdown, but on what benchmark?  We should try the PLT.

This is a rough measurement on my machine, with the same processes active for all runs, to give us some ballpark for evaluating the time vs. parsing-precision trade-off.  Another trade-off is ease of maintainability with the HTMLTokenizer version.

> 
> >>> LayoutTests/fast/encoding/namespace-tolerance.html:2
> >>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
> >> 
> >> Aren't we supposed to detect the original charset according to HTML5?
> > 
> > HTML5 would be <meta charset=UTF-8>  Notice the "content=" difference.
> 
> That's an authoring requirement, but what are the user-agent requirements like?

content is ignored if got-pragma is false in the HTML5 spec algorithm.   http-equiv=Content-Type makes got-pragma true.
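
In code, the got-pragma rule boils down to roughly this (simplified, self-contained sketch; the names are mine, and the real algorithm matches ASCII-case-insensitively and parses the content value more carefully):

#include <string>
#include <utility>
#include <vector>

std::string charsetFromMeta(const std::vector<std::pair<std::string, std::string> >& attributes)
{
    bool gotPragma = false;  // saw http-equiv="Content-Type"
    bool needPragma = false; // charset came from content=, so a pragma is required
    std::string charset;

    for (size_t i = 0; i < attributes.size(); ++i) {
        const std::string& name = attributes[i].first;
        const std::string& value = attributes[i].second;
        if (name == "http-equiv" && value == "Content-Type")
            gotPragma = true;
        else if (name == "charset") {
            charset = value; // <meta charset=...> needs no pragma
            needPragma = false;
        } else if (name == "content" && charset.empty()) {
            std::string::size_type pos = value.find("charset=");
            if (pos != std::string::npos) {
                charset = value.substr(pos + 8);
                needPragma = true; // only honored together with got-pragma
            }
        }
    }
    return (needPragma && !gotPragma) ? std::string() : charset;
}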

> >>> WebCore/loader/TextResourceDecoder.cpp:609
> >>> +    m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
> >> 
> >> I bet this is where your slowdown is coming from.
> > 
> > Is this necessary?  If so, is the slowdown worth the added parsing precision?
> 
> Well, somewhere you need to blow up the string into 16 bit characters.  We've talked about adding an 8 bit path through the parser, which is what this would be good for.  I wish there was some way of sharing the decoding work...  Maybe we want a trivial decoder that just zero-pads everything?

Hoping someone else will chime in here on the zero-padding idea.  I used ISO Latin-1 based on Adler's comment in this thread, but I'm open to changing that.
Comment 21 Adam Barth 2010-11-22 16:45:34 PST
High-level reactions:

1) This looks like the right approach.
2) Deleting all that custom parsing code is good times.
3) I'd like to try matching the HTML5 algorithm precisely.
4) I'd like to understand, and hopefully eliminate, any slowdown.
Comment 22 Alexey Proskuryakov 2010-11-22 22:18:45 PST
> 3) I'd like to try matching the HTML5 algorithm precisely.

The HTML5 algorithm kind of matches our old super-buggy code, for no apparent reason. See <http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628> for discussion.
Comment 23 Adam Barth 2010-11-22 22:36:33 PST
Sounds like we should convert this test suite to a layout test:
http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html
Comment 24 Ian 'Hixie' Hickson 2010-11-29 13:18:27 PST
Sorry if the reasons for the spec aren't apparent. There are two main reasons: making your life easier (generally it's easier if I spec what you do so you don't have to change it) and compatibility (there's no particularly compelling reason to implement this one way or another, so the way that browsers do it seems the safest option).

I'm happy to change the spec to whatever implementations end up doing. I recommend working with other browser vendors to come up with a proposal everyone is willing to implement; this issue has been escalated to the HTMLWG chairs who will shortly be asking for change proposals for it.
Comment 25 Henri Sivonen 2010-11-30 02:10:32 PST
Firefox trunk implements the HTML5 spec from before the spec paid attention to the http-equiv attribute, except that Firefox trunk inspects the first 1024 bytes instead of the first 512. This is to match WebKit; to work around Apache gzip brokenness on Dreamhost-hosted sites, which makes Gecko unable to reload Dreamhost-hosted content in mid-stream, so the prescan *must* succeed on Dreamhost; and to pass probability-based heuristic CJK detection test cases, now that the heuristic detector also runs on the first 1024 bytes instead of running on the whole document.

I'm now in the process of implementing the spec changes to make Gecko pay attention to the http-equiv attribute. This is 
https://bugzilla.mozilla.org/show_bug.cgi?id=594730

However, in the process, some spec bugs were found:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11428
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11426
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11411
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11412
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11414

I recommend implementing the spec as amended by the obvious fixes to the 5 spec bugs.

This means not stopping at </head>.

I see no practical value in implementing the change proposed in
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628
Comment 26 Alexey Proskuryakov 2010-11-30 08:42:42 PST
One thing I really don't want to implement in WebKit is restarting if <meta> is found by actual parsing, as opposed to charset prescan.
Comment 27 Jenn Braithwaite 2010-12-02 11:05:08 PST
Created attachment 75396 [details]
patch

This new version is now 1% faster than the pre-existing charset algorithm.  Performance was measured using the same methods mentioned earlier in the thread.
Comment 28 Alexey Proskuryakov 2010-12-02 11:23:58 PST
Comment on attachment 75396 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review

> LayoutTests/ChangeLog:11
> +        Added http-equiv attribute to meta tag in 2 existing tests.

So, did '<meta content="charset=UTF-8">' that we had in the tests not work in other browsers?
Comment 29 Jenn Braithwaite 2010-12-02 11:33:21 PST
(In reply to comment #28)
> (From update of attachment 75396 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review
> 
> > LayoutTests/ChangeLog:11
> > +        Added http-equiv attribute to meta tag in 2 existing tests.
> 
> So, did '<meta content="charset=UTF-8">' that we had in the tests not work in other browsers?

The spec now requires http-equiv before using the charset in the content attribute.  Henri noted that he's updating Firefox to pay attention to the http-equiv attribute.  IE8 requires http-equiv.
Comment 30 Adam Barth 2010-12-07 19:18:27 PST
Comment on attachment 75396 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review

Very nice patch.  I particularly like the ratio of testing to code change and the large amount of code removed.  A few nits below.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:43
> +    : m_tokenizer(HTMLTokenizer::create(false))

I wish I remembered what false does here.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:64
> +        pos += 7;

Ideally we'd have a constant for this value to connect it with the string:

const char* charsetString = "charset";
const size_t charsetLength = sizeof("charset") - 1;

Something like that.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:80
> +            quoteMark = value[pos++];

static_cast<char> + ASSERT that it's a 7 bit value would be nice here.

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:108
> +        String attributeValue = String(iter->m_value.data(), iter->m_value.size());

You can just call the constructor instead of both the constructor and the copy constructor (i.e., remove the =).

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:113
> +        } else if (!charset.length()) {

charset.isEmpty()

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:166
> +            AtomicString tag(m_token.name().data(), m_token.name().size());

I'd rename tag => tagName, but it's not that important.
Comment 31 Jenn Braithwaite 2010-12-08 10:48:21 PST
Created attachment 75929 [details]
updated patch

> --- Comment #30 from Adam Barth <abarth@webkit.org>  2010-12-07 19:18:27 PST ---
> (From update of attachment 75396 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=75396&action=review
>
> Very nice patch.  I particularly like the ratio of testing to code change and the large amount of code removed.  A few nits below.
>
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:43
>> +    : m_tokenizer(HTMLTokenizer::create(false))
>
> I wish I remembered what false does here.

Added a comment.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:64
>> +        pos += 7;
>
> Ideally we'd have a constant for this value to connect it with the string:
>
> const char* charsetString = "charset";
> const size_t charsetLength = sizeof("charset") - 1;
>
> Something like that.

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:80
>> +            quoteMark = value[pos++];
>
> static_cast<char> + ASSERT that it's a 7 bit value would be nice here.

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:108
>> +        String attributeValue = String(iter->m_value.data(), iter->m_value.size());
>
> You can just call the constructor instead of both the constructor and the copy constructor (i.e., remove the =).

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:113
>> +        } else if (!charset.length()) {
>
> charset.isEmpty()

Done.

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:166
>> +            AtomicString tag(m_token.name().data(), m_token.name().size());
>
> I'd rename tag => tagName, but it's not that important.

Done.

Also, fixed a few minor typos in the tests and modified parser-tests.html to keep Hixie's original indenting/formatting.
Comment 32 Adam Barth 2010-12-08 19:01:45 PST
Comment on attachment 75929 [details]
updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75929&action=review

Thanks!

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:55
> +const char* charsetString = "charset";
> +const size_t charsetLength = sizeof("charset") - 1;

We should probably put these in an anonymous namespace so we don't leak the symbols (or maybe const stops that?)
Comment 33 Darin Adler 2010-12-08 19:03:57 PST
Comment on attachment 75929 [details]
updated patch

View in context: https://bugs.webkit.org/attachment.cgi?id=75929&action=review

> WebCore/html/parser/HTMLMetaCharsetParser.cpp:54
> +const char* charsetString = "charset";

To actually make this a constant, we need to put a const after the *. The const before the * means that the characters are const, not the pointer. Another option is to use const char[] charsetString.
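
Spelled out, just for illustration:

const char* s1 = "charset";        // chars are const; the pointer itself is not
const char* const s2 = "charset";  // both the chars and the pointer are const
const char s3[] = "charset";       // const array; there is no pointer to reassign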

>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:55
>> +const size_t charsetLength = sizeof("charset") - 1;
> 
> We should probably put these in an anonymous namespace so we don't leak the symbols (or maybe const stops that?)

The old-school alternative to an anonymous namespace is to just put the keyword static on these. I’d suggest that.
Comment 34 Alexey Proskuryakov 2010-12-08 19:06:45 PST
Consts stop that.

Our normal style is to use "static const", but the difference is minimal - with static const, the compiler will complain about earlier declarations (e.g. in a header).
Comment 35 WebKit Commit Bot 2010-12-09 02:08:43 PST
Comment on attachment 75929 [details]
updated patch

Rejecting patch 75929 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 75929]" exit_code: 2
Last 500 characters of output:
lines).
Hunk #3 succeeded at 15646 (offset 419 lines).
Hunk #4 succeeded at 20495 (offset 641 lines).
Hunk #5 succeeded at 23224 (offset 726 lines).
patching file WebCore/html/parser/HTMLMetaCharsetParser.cpp
patching file WebCore/html/parser/HTMLMetaCharsetParser.h
patching file WebCore/loader/TextResourceDecoder.cpp
patching file WebCore/loader/TextResourceDecoder.h

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6964004
Comment 36 Jenn Braithwaite 2010-12-09 10:12:15 PST
Created attachment 76082 [details]
updated patch - static const change
Comment 37 WebKit Commit Bot 2010-12-09 20:09:00 PST
Comment on attachment 76082 [details]
updated patch - static const change

Rejecting patch 76082 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 76082]" exit_code: 2
Last 500 characters of output:
lines).
Hunk #3 succeeded at 15646 (offset 419 lines).
Hunk #4 succeeded at 20495 (offset 641 lines).
Hunk #5 succeeded at 23224 (offset 726 lines).
patching file WebCore/html/parser/HTMLMetaCharsetParser.cpp
patching file WebCore/html/parser/HTMLMetaCharsetParser.h
patching file WebCore/loader/TextResourceDecoder.cpp
patching file WebCore/loader/TextResourceDecoder.h

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6923028
Comment 38 Adam Barth 2010-12-09 20:15:36 PST
I'll merge this for you.
Comment 39 Adam Barth 2010-12-09 20:28:57 PST
Created attachment 76158 [details]
Patch for landing
Comment 40 WebKit Commit Bot 2010-12-09 22:27:06 PST
Comment on attachment 76158 [details]
Patch for landing

Rejecting patch 76158 from commit-queue.

Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--non-interactive', 76158]" exit_code: 2
Last 500 characters of output:
ncoding/resources/122.html
patching file LayoutTests/fast/encoding/resources/123.html
patching file LayoutTests/fast/text/international/bidi-innertext.html
patching file LayoutTests/http/tests/misc/charset-sniffer-end-sniffing-expected.txt
patching file LayoutTests/http/tests/misc/charset-sniffer-end-sniffing.html
patching file LayoutTests/http/tests/misc/resources/charset-sniffer-end-sniffing.php

Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--force']" exit_code: 1

Full output: http://queues.webkit.org/results/6977017
Comment 41 Adam Barth 2010-12-10 00:32:54 PST
Created attachment 76171 [details]
failure

This test seems to cause this failure.
Comment 42 Adam Barth 2010-12-10 11:50:58 PST
Committed r73756: <http://trac.webkit.org/changeset/73756>
Comment 43 Jenn Braithwaite 2010-12-10 12:09:02 PST
(In reply to comment #41)
> Created an attachment (id=76171) [details]
> failure
> 
> This test seems to cause this failure.

Putting the solution on the record for when/if the test files are ever updated:

The fast/encoding/resources/[0-9]*.html test files contain unusual characters that got corrupted during the svn-apply process. They need to be landed manually.

Also, four of the tests are actually testing tabs. Do 'svn pset allow-tabs true' on those files to disable the precommit hook.