WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
Bug 47397
TextResourceDecoder::checkForHeadCharset can look way past the limit.
https://bugs.webkit.org/show_bug.cgi?id=47397
Summary
TextResourceDecoder::checkForHeadCharset can look way past the limit.
Dmitry Titov
Reported
2010-10-07 20:36:22 PDT
TextResourceDecoder::checkForHeadCharset looks for charset in the portion of data received and does not feed the parser until it finds the encoding. The limit seems to be set to the end of head section (before body) or 1024 bytes, whatever comes last. However, the logic does not stop if one has HTML with just text in the body, liek this: <head></head><body>... lots of plain text ... </body> In this case, the decoder will accumulate all the text, no matter how big, and only then feed it into parser. It may scan it multiple times as well. Test and patch coming. If for some reason we actually want this to work specifically like this, please voice concern.
Attachments
patch (work in progress)
(5.85 KB, patch)
2010-11-12 16:02 PST
,
Jenn Braithwaite
no flags
Details
Formatted Diff
Diff
patch - HTMLTokenizer version (work in progress)
(35.12 KB, patch)
2010-11-22 14:45 PST
,
Jenn Braithwaite
no flags
Details
Formatted Diff
Diff
patch
(474.11 KB, patch)
2010-12-02 11:05 PST
,
Jenn Braithwaite
abarth
: review+
abarth
: commit-queue-
Details
Formatted Diff
Diff
updated patch
(474.39 KB, patch)
2010-12-08 10:48 PST
,
Jenn Braithwaite
abarth
: review+
commit-queue
: commit-queue-
Details
Formatted Diff
Diff
updated patch - static const change
(474.41 KB, patch)
2010-12-09 10:12 PST
,
Jenn Braithwaite
no flags
Details
Formatted Diff
Diff
Patch for landing
(482.70 KB, patch)
2010-12-09 20:28 PST
,
Adam Barth
commit-queue
: commit-queue-
Details
Formatted Diff
Diff
failure
(55.98 KB, text/plain)
2010-12-10 00:32 PST
,
Adam Barth
no flags
Details
Show Obsolete
(5)
View All
Add attachment
proposed patch, testcase, etc.
Alexey Proskuryakov
Comment 1
2010-10-07 20:54:09 PDT
I think that this was sort of known, with the idea that looking a little (in the common case) more ahead wasn't a problem. But I don't think that's intentional behavior.
Alexey Proskuryakov
Comment 2
2010-10-08 14:02:50 PDT
Also, the thinking at some point was that we'd like to switch charset sniffer to use real parser, so we didn't care about imperfections in existing code, as long as that didn't affect real sites. Perhaps using the real parser is more practical now!
Dmitry Titov
Comment 3
2010-10-08 15:07:23 PDT
I agree having a lot of text in the body w/o any other tags is a fringe case, I only hit it trying to write a test for another issue. However, looking at this code in a debugger, I found that there is another scenario when it behaves not as intended: <script> if (foo < bar) bar = foo; </script> <body> ... more stuff, including other tags but not <script> This causes the checkForHeadCharset to think it never leaves head section, since when it find "<" in JS it starts looking for the closing ">" of what it thinks is a tag. So it skips "</script" and happily continues with enclosingTagName being "script" and no chances of resetting this state. This seems to be a more probable case, and it will cause the whole page to accumulate before starting to parse it. I'm still not sure what kind of test I can create for this, which would not rely on a timeout.
Dmitry Titov
Comment 4
2010-10-08 15:16:20 PDT
(In reply to
comment #2
)
> Perhaps using the real parser is more practical now!
Hmm, it'd be cool. I'll look into it.
Alexey Proskuryakov
Comment 5
2010-10-08 15:17:26 PDT
If the sniffer thinks that it's in script, then the below test should fail, since we ignore tags in <script>. But it doesn't fail for me: <script> < </script><meta charset=koi8-r><script>alert(document.charset)</script>
Adam Barth
Comment 6
2010-10-08 15:20:11 PDT
> Hmm, it'd be cool. I'll look into it.
If you look, there are a couple of examples of folks who use the HTMLTokenizer by itself. It's designed to be straightforward to reuse without the rest of the parsing machinery. The one trick might be handling both 16bit and 8bit input streams at the same time.
Dmitry Titov
Comment 7
2010-10-08 16:05:33 PDT
(In reply to
comment #5
)
> If the sniffer thinks that it's in script, then the below test should fail, since we ignore tags in <script>. But it doesn't fail for me: > > <script> < </script><meta charset=koi8-r><script>alert(document.charset)</script>
This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section - so the detection of charset works fine, however if there is no charset, the detection never stop and detector accumulates the whole page (scanning through whole of it multiple times perhaps) before giving it to parser when the data load finished.
Dmitry Titov
Comment 8
2010-10-08 16:11:47 PDT
(In reply to
comment #6
)
> > Hmm, it'd be cool. I'll look into it. > > If you look, there are a couple of examples of folks who use the HTMLTokenizer by itself. It's designed to be straightforward to reuse without the rest of the parsing machinery. The one trick might be handling both 16bit and 8bit input streams at the same time.
Interesting. HTMLTokenizer works on decoded stream and it's not clear how to do parsing before decoding, which is what we need to do here. Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly. Using imperfect but all-in-one-function sniffer might be a better alternative since it allows someone to inspect it easier while keeping in mind the assumption that the input text is actually an undecoded stream. It also has some functionality that will still not be a part of the HTMLTokenizer, like looking for BOM, XML processing instructions etc. Still looking at HTMLTokenizer...
Alexey Proskuryakov
Comment 9
2010-10-08 16:17:37 PDT
> This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section
I guess I could just check for myself in debugger, but I don't follow this explanation. The sniffer is supposed to ignore <meta> and other tags inside <script>, and you said that it never bails out of "in script" mode.
Dmitry Titov
Comment 10
2010-10-08 16:22:15 PDT
(In reply to
comment #9
)
> > This particular test will succeed because the net effect is that the sniffer forever thinks it's in head section > > I guess I could just check for myself in debugger, but I don't follow this explanation. The sniffer is supposed to ignore <meta> and other tags inside <script>, and you said that it never bails out of "in script" mode.
Sorry Alexey, I was thinking about my issue while answering :-). Well, I've just traced your example in the debugger and the code does skip closing </script> tag and then processes <meta> even while having non-null enclosingTagName. So I'm not sure it actually skips tags inside <script>...
Adam Barth
Comment 11
2010-10-08 16:25:24 PDT
> Interesting. HTMLTokenizer works on decoded stream and it's not clear how to do parsing before decoding, which is what we need to do here. Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.
If the width of the characters are ok, it should be fine. By design, we don't support encodings that tokenizer differently than UTF8 (e.g., UTF7). The bigger problem, I'd expect, is that HTMLTokenizer expects a 16bit character but we might have UTF8 input. You might be able to finesse that because the HTMLTokenizer only looks at one character at a time. You could just zero-pad the 8bit characters into 16bit characters.
Alexey Proskuryakov
Comment 12
2010-10-08 16:27:35 PDT
I see. We only ignore tags in <script> for the purpose of ending <head> section - for example, <script>"<body>"</script> won't make inHeadSection false.
Dmitry Titov
Comment 13
2010-10-08 16:34:42 PDT
(In reply to
comment #12
)
> I see. We only ignore tags in <script> for the purpose of ending <head> section - for example, <script>"<body>"</script> won't make inHeadSection false.
Curiously, even this detects the charset (incorrectly): <script>foo='<meta charset=koi8-r>'</script><script>alert(document.charset)</script> +1 for HTMLTokenizer :-)
Darin Adler
Comment 14
2010-10-08 17:33:43 PDT
(In reply to
comment #8
)
> Assuming some random encoding first (UTF8?) could perhaps lead to subtle bugs when decoded text could be interpreted incorrectly.
You’d want to use something like ISO Latin-1, where it can’t spuriously fail to decode.
Jenn Braithwaite
Comment 15
2010-11-12 16:02:29 PST
Created
attachment 73787
[details]
patch (work in progress) Preliminary patch showing possible fixes to the charset sniffer to address two issues brought up by dimich. Includes tests for those fixes, plus tests to baseline the current charset sniffer behavior (quotes and meta-tag-in-body). Next step: investigate HTMLTokenizer to decide if we should replace the charset sniffer code with it.
Jenn Braithwaite
Comment 16
2010-11-22 14:45:22 PST
Created
attachment 74600
[details]
patch - HTMLTokenizer version (work in progress) This patch implements charset encoding detection using HTMLTokenizer. It also has additional tests, including those in the previous patch. The version with the HTMLTokenizer is about 1% slower than the version in the previous patch, averaged across 5 runs of: time WebkitBuild/Debug/DumpRenderTree `ls LayoutTests/fast/encoding/*.html` However, the parsing is more precise in the HTMLTokenizer version, e.g. correct handling of quoted text and escaped characters, and passes all tests. The version in the previous patch fails 4 of the new tests: quotes, escaped characters, mismatched end tag, unexpected 'charset=' text in meta tag. "More precise" also means "stricter", so there may be other tests that need the same corrections as fast/encoding/namespace-tolerance.html (see patch). Please take a look a this new patch and weigh in on whether we should switch to using a real parser versus more bug fixes to the existing code.
Adam Barth
Comment 17
2010-11-22 15:34:23 PST
Comment on
attachment 74600
[details]
patch - HTMLTokenizer version (work in progress) View in context:
https://bugs.webkit.org/attachment.cgi?id=74600&action=review
> LayoutTests/fast/encoding/namespace-tolerance.html:2 > -<meta content="charset=UTF-8"> > +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Aren't we supposed to detect the original charset according to HTML5?
> WebCore/loader/TextResourceDecoder.cpp:575 > + // the HTTP-EQUIV meta has no effect on XHTML
the => The Also, please end sentences with a period.
> WebCore/loader/TextResourceDecoder.cpp:581 > + // Create tokenizer with usePreHTML5ParserQuirks set to false to match the > + // behavior that was in place prior to switching to using the HTMLTokenizer > + // for parsing.
We don't really need this comment.
> WebCore/loader/TextResourceDecoder.cpp:583 > + return checkForMetaCharset(data, len);
len => length.
> WebCore/loader/TextResourceDecoder.cpp:586 > +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len)
Where does this algorithm come from? If it's from the HTML5 spec, we should put in links to the spec. In general, this algorithm looks complicated with lots of Boolean flags. Can we make it a state machine? Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder? Should that class go in WebCore/html/parser?
> WebCore/loader/TextResourceDecoder.cpp:609 > + m_input.append(SegmentedString(m_assumedCodec->decode(data, len)));
I bet there is where your slowdown is coming from.
> WebCore/loader/TextResourceDecoder.cpp:623 > + m_enclosingTagName = titleTag.localName().impl();
Why not just store the whole Qualified name?
> WebCore/loader/TextResourceDecoder.cpp:636 > + String attributeValue = String(iter->m_value.data(), iter->m_value.size());
You can avoid this memcpy if you don't care about the attr name, but maybe that doesn't matter.
> WebCore/loader/TextResourceDecoder.h:104 > + OwnPtr<HTMLTokenizer> m_tokenizer; > + OwnPtr<TextCodec> m_assumedCodec; > + SegmentedString m_input; > + HTMLToken m_token; > + AtomicStringImpl* m_enclosingTagName; > + bool m_inHeadSection;
Yeah, this block of state seems like it should be encapsulated into its own object.
Jenn Braithwaite
Comment 18
2010-11-22 15:54:09 PST
Comment on
attachment 74600
[details]
patch - HTMLTokenizer version (work in progress) View in context:
https://bugs.webkit.org/attachment.cgi?id=74600&action=review
Thanks for taking a look so quickly. Still waiting for a go/no-go on the HTMLTokenizer before I clean things up and make a proper patch. What's your vote?
>> LayoutTests/fast/encoding/namespace-tolerance.html:2 >> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> > > Aren't we supposed to detect the original charset according to HTML5?
HTML5 would be <meta charset=UTF-8> Notice the "content=" difference.
>> WebCore/loader/TextResourceDecoder.cpp:586 >> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len) > > Where does this algorithm come from? If it's from the HTML5 spec, we should put in links to the spec. In general, this algorithm looks complicated with lots of Boolean flags. Can we make it a state machine? Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder? Should that class go in WebCore/html/parser?
The logic is basically the existing algorithm, with parsing replaced by the tokenizer. A few bug fixes were made to the algorithm. The Boolean flags are because http-equiv could come before/after content in a meta tag. I also have a version where this is implemented in a separate class in WebCore/html/parser, but it seemed like overkill as this is the only use of it. I'll revisit that version.
>> WebCore/loader/TextResourceDecoder.cpp:609 >> + m_input.append(SegmentedString(m_assumedCodec->decode(data, len))); > > I bet there is where your slowdown is coming from.
Is this necessary? If so, is the slowdown worth the added parsing precision?
>> WebCore/loader/TextResourceDecoder.cpp:623 >> + m_enclosingTagName = titleTag.localName().impl(); > > Why not just store the whole Qualified name?
This is from the existing code. I'm guessing it's more efficient to use the ptr. (@ap: is that right?)
Adam Barth
Comment 19
2010-11-22 16:11:45 PST
Comment on
attachment 74600
[details]
patch - HTMLTokenizer version (work in progress) View in context:
https://bugs.webkit.org/attachment.cgi?id=74600&action=review
Is there an algorithm in the HTML5 spec fo extracting the charset from the meta tag? If so, we should switch to that rather than passing though a state that almost, but not quite, identical to our historical behavior. You say there's a 1% slowdown, but on what benchmark? We should try the PLT.
>>> LayoutTests/fast/encoding/namespace-tolerance.html:2 >>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> >> >> Aren't we supposed to detect the original charset according to HTML5? > > HTML5 would be <meta charset=UTF-8> Notice the "content=" difference.
That's authoring requirement, but what are the user-agent requirements like?
>>> WebCore/loader/TextResourceDecoder.cpp:586 >>> +bool TextResourceDecoder::checkForMetaCharset(const char* data, size_t len) >> >> Where does this algorithm come from? If it's from the HTML5 spec, we should put in links to the spec. In general, this algorithm looks complicated with lots of Boolean flags. Can we make it a state machine? Should we break out all the logic related to scanning for the charset into its own class separate from a TextResourceDecoder? Should that class go in WebCore/html/parser? > > The logic is basically the existing algorithm, with parsing replaced by the tokenizer. A few bug fixes were made to the algorithm. The Boolean flags are because http-equiv could come before/after content in a meta tag. I also have a version where this is implemented in a separate class in WebCore/html/parser, but it seemed like overkill as this is the only use of it. I'll revisit that version.
The idea behind using a separate logic is to reduce the complexity of the system because this block of state has a very narrow API (given some bits, what't the charset).
>>> WebCore/loader/TextResourceDecoder.cpp:609 >>> + m_input.append(SegmentedString(m_assumedCodec->decode(data, len))); >> >> I bet there is where your slowdown is coming from. > > Is this necessary? If so, is the slowdown worth the added parsing precision?
Well, somewhere you need to blow up the string in 16 bit characters. We've talked about adding an 8 bit path through the parser, which is what this would be good for. I wish there was some way of sharing the decoding work... Maybe we want a trivial decoder that just zero-pads everything?
Jenn Braithwaite
Comment 20
2010-11-22 16:40:55 PST
(In reply to
comment #19
)
> (From update of
attachment 74600
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=74600&action=review
> > Is there an algorithm in the HTML5 spec fo extracting the charset from the meta tag? If so, we should switch to that rather than passing though a state that almost, but not quite, identical to our historical behavior.
This code is a close match to the algorithm in the HTML5 spec (except I used a 2nd flag instead of a mode). I can make it a more precise match to the spec.
> > You say there's a 1% slowdown, but on what benchmark? We should try the PLT.
This is a rough measurement on my machine, with the same processes active for all runs, to give us some ballpark for evaluating the time vs parsing-precision trade-off. Another trade off is ease of maintainability with the HTMLTokenizer version.
> > >>> LayoutTests/fast/encoding/namespace-tolerance.html:2 > >>> +<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> > >> > >> Aren't we supposed to detect the original charset according to HTML5? > > > > HTML5 would be <meta charset=UTF-8> Notice the "content=" difference. > > That's authoring requirement, but what are the user-agent requirements like?
content is ignored if got-pragma is false in the HTML5 spec algorithm. http-equiv=Content-Type makes got-pragma true.
> >>> WebCore/loader/TextResourceDecoder.cpp:609 > >>> + m_input.append(SegmentedString(m_assumedCodec->decode(data, len))); > >> > >> I bet there is where your slowdown is coming from. > > > > Is this necessary? If so, is the slowdown worth the added parsing precision? > > Well, somewhere you need to blow up the string in 16 bit characters. We've talked about adding an 8 bit path through the parser, which is what this would be good for. I wish there was some way of sharing the decoding work... Maybe we want a trivial decoder that just zero-pads everything?
Hoping someone else will chime in here on the zero-padding idea. I used ISO Latin-1 based on Adler's comment in this thread, but I'm open to changing that.
Adam Barth
Comment 21
2010-11-22 16:45:34 PST
High-level reactions: 1) This looks like the right approach. 2) Deleting all that custom parsing good is good times. 3) I'd like to try matching the HTML5 algorithm precisely. 4) I'd like to understand, and hopefully eliminate, any slowdown.
Alexey Proskuryakov
Comment 22
2010-11-22 22:18:45 PST
> 3) I'd like to try matching the HTML5 algorithm precisely.
The HTML5 algorithm kind of matches our old super-buggy code, for no apparent reason. See <
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628
> for discussion.
Adam Barth
Comment 23
2010-11-22 22:36:33 PST
Sounds like we should convert this test suite to a layout test:
http://www.hixie.ch/tests/adhoc/html/parsing/encoding/all.html
Ian 'Hixie' Hickson
Comment 24
2010-11-29 13:18:27 PST
Sorry if the reasons for the spec aren't apparent. There are two main reasons: making your life easier (generally it's easier if I spec what you do so you don't have to change it) and compatibility (there's no particularly compelling reason to implement this one way or another, so the way that browsers do it seems the safest option). I'm happy to change the spec to whatever implementations end up doing. I recommend working with other browser vendors to come up with a proposal everyone is willing to implement; this issue has been escalated to the HTMLWG chairs who will shortly be asking for change proposals for it.
Henri Sivonen
Comment 25
2010-11-30 02:10:32 PST
Firefox trunk implements the HTML5 spec from before the spec paid attention to the http-equiv attribute, expect Firefox trunk inspects the first 1024 bytes instead of the first 512 (to match WebKit, to work around Apache gzip brokenness on Dreamhost-hosted sites that makes Gecko unable to reload Dreamhost-hosted content in mid-stream so the prescan *must* succeed on Dreamhost and to pass probability-based heuristic CJK detection test cases now that the heuristic detector also runs on the first 1024 bytes instead of running on the whole document). I'm now in the process of implementing the spec changes to make Gecko pay attention to the http-equiv attribute. This is
https://bugzilla.mozilla.org/show_bug.cgi?id=594730
However, in the process, some spec bugs were found:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11428
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11426
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11411
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11412
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11414
I recommend implementing the spec as amended by the obvious fixes to the 5 spec bugs. This means not stopping at </head>. I see no practical value in implementing the change proposed in
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628
Alexey Proskuryakov
Comment 26
2010-11-30 08:42:42 PST
One thing I really don't want to implement in WebKit is restarting if <meta> is found by actual parsing, as opposed to charset prescan.
Jenn Braithwaite
Comment 27
2010-12-02 11:05:08 PST
Created
attachment 75396
[details]
patch This new version is now 1% faster than the pre-existing charset algorithm. Performance was measured using the same methods mentioned earlier in the thread.
Alexey Proskuryakov
Comment 28
2010-12-02 11:23:58 PST
Comment on
attachment 75396
[details]
patch View in context:
https://bugs.webkit.org/attachment.cgi?id=75396&action=review
> LayoutTests/ChangeLog:11 > + Added http-equiv attribute to meta tag in 2 existing tests.
So, did '<meta content="charset=UTF-8">' that we had in the tests not work in other browsers?
Jenn Braithwaite
Comment 29
2010-12-02 11:33:21 PST
(In reply to
comment #28
)
> (From update of
attachment 75396
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=75396&action=review
> > > LayoutTests/ChangeLog:11 > > + Added http-equiv attribute to meta tag in 2 existing tests. > > So, did '<meta content="charset=UTF-8">' that we had in the tests not work in other browsers?
The spec now requires http-equiv before using the charset in the content attribute. Henri noted that he's updating Firefox to pay attention to the http-equiv attribute. IE8 requires http-equiv.
Adam Barth
Comment 30
2010-12-07 19:18:27 PST
Comment on
attachment 75396
[details]
patch View in context:
https://bugs.webkit.org/attachment.cgi?id=75396&action=review
Very nice patch. I particularly like the ratio of testing to code change and the large amount of code removed. A few nits below.
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:43 > + : m_tokenizer(HTMLTokenizer::create(false))
I wish I remembered what false does here.
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:64 > + pos += 7;
Ideally we'd have a constant for this value to connect it with the string: const char* charsetString = "charset"; const size_t charsetLength = sizeof("charset") - 1; Something like that.
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:80 > + quoteMark = value[pos++];
static_cast<char> + ASSERT that it's a 7 bit value would be nice here.
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:108 > + String attributeValue = String(iter->m_value.data(), iter->m_value.size());
You can just call the constructor instead of both the constructor and the copy constructor (i.e, remove the = ).
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:113 > + } else if (!charset.length()) {
charset.isEmpty()
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:166 > + AtomicString tag(m_token.name().data(), m_token.name().size());
I'd rename tag => tagName, but it's not that important.
Jenn Braithwaite
Comment 31
2010-12-08 10:48:21 PST
Created
attachment 75929
[details]
updated patch
> ---
Comment #30
from Adam Barth <
abarth@webkit.org
> 2010-12-07 19:18:27 PST --- > (From update of
attachment 75396
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=75396&action=review
>
> Very nice patch. I particularly like the ratio of testing to code change and the large amount of code removed. A few nits below.
>
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:43 >> + : m_tokenizer(HTMLTokenizer::create(false))
>
> I wish I remembered what false does here.
Added a comment.
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:64 >> + pos += 7;
>
> Ideally we'd have a constant for this value to connect it with the string:
>
> const char* charsetString = "charset"; > const size_t charsetLength = sizeof("charset") - 1;
>
> Something like that.
Done.
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:80 >> + quoteMark = value[pos++];
>
> static_cast<char> + ASSERT that it's a 7 bit value would be nice here.
Done.
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:108 >> + String attributeValue = String(iter->m_value.data(), iter->m_value.size());
>
> You can just call the constructor instead of both the constructor and the copy constructor (i.e, remove the = ).
Done.
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:113 >> + } else if (!charset.length()) {
>
> charset.isEmpty()
Done.
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:166 >> + AtomicString tag(m_token.name().data(), m_token.name().size());
>
> I'd rename tag => tagName, but it's not that important.
Done. Also, fixed a few minor typos in the tests and modified parser-tests.html to keep the hixie's original indenting/formatting.
Adam Barth
Comment 32
2010-12-08 19:01:45 PST
Comment on
attachment 75929
[details]
updated patch View in context:
https://bugs.webkit.org/attachment.cgi?id=75929&action=review
Thanks!
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:55 > +const char* charsetString = "charset"; > +const size_t charsetLength = sizeof("charset") - 1;
We should probably put these in an anonymous namespace so we don't leak the symbols (or maybe const stops that?)
Darin Adler
Comment 33
2010-12-08 19:03:57 PST
Comment on
attachment 75929
[details]
updated patch View in context:
https://bugs.webkit.org/attachment.cgi?id=75929&action=review
> WebCore/html/parser/HTMLMetaCharsetParser.cpp:54 > +const char* charsetString = "charset";
To actually make this a constant, we need to put a const after the *. The const before the * means that the characters are const, not the pointer. Another option is to use const char[] charsetString.
>> WebCore/html/parser/HTMLMetaCharsetParser.cpp:55 >> +const size_t charsetLength = sizeof("charset") - 1; > > We should probably put these in an anonymous namespace so we don't leak the symbols (or maybe const stops that?)
The old school alternative to anonymous namespace is to just put the keyword static on these. I’d suggest that.
Alexey Proskuryakov
Comment 34
2010-12-08 19:06:45 PST
Consts stop that. Our normal style is to use "static const", but the difference is minimal - with static const, the compiler will complain about earlier declarations (e.g. in a header).
WebKit Commit Bot
Comment 35
2010-12-09 02:08:43 PST
Comment on
attachment 75929
[details]
updated patch Rejecting patch 75929 from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 75929]" exit_code: 2 Last 500 characters of output: lines). Hunk #3 succeeded at 15646 (offset 419 lines). Hunk #4 succeeded at 20495 (offset 641 lines). Hunk #5 succeeded at 23224 (offset 726 lines). patching file WebCore/html/parser/HTMLMetaCharsetParser.cpp patching file WebCore/html/parser/HTMLMetaCharsetParser.h patching file WebCore/loader/TextResourceDecoder.cpp patching file WebCore/loader/TextResourceDecoder.h Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1 Full output:
http://queues.webkit.org/results/6964004
Jenn Braithwaite
Comment 36
2010-12-09 10:12:15 PST
Created
attachment 76082
[details]
updated patch - static const change
WebKit Commit Bot
Comment 37
2010-12-09 20:09:00 PST
Comment on
attachment 76082
[details]
updated patch - static const change Rejecting patch 76082 from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--force-clean', '--non-interactive', 76082]" exit_code: 2 Last 500 characters of output: lines). Hunk #3 succeeded at 15646 (offset 419 lines). Hunk #4 succeeded at 20495 (offset 641 lines). Hunk #5 succeeded at 23224 (offset 726 lines). patching file WebCore/html/parser/HTMLMetaCharsetParser.cpp patching file WebCore/html/parser/HTMLMetaCharsetParser.h patching file WebCore/loader/TextResourceDecoder.cpp patching file WebCore/loader/TextResourceDecoder.h Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--reviewer', u'Adam Barth', u'--force']" exit_code: 1 Full output:
http://queues.webkit.org/results/6923028
Adam Barth
Comment 38
2010-12-09 20:15:36 PST
I'll merge this for you.
Adam Barth
Comment 39
2010-12-09 20:28:57 PST
Created
attachment 76158
[details]
Patch for landing
WebKit Commit Bot
Comment 40
2010-12-09 22:27:06 PST
Comment on
attachment 76158
[details]
Patch for landing Rejecting patch 76158 from commit-queue. Failed to run "['./WebKitTools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '--bot-id=eseidel-cq-sl', 'apply-attachment', '--non-interactive', 76158]" exit_code: 2 Last 500 characters of output: ncoding/resources/122.html patching file LayoutTests/fast/encoding/resources/123.html patching file LayoutTests/fast/text/international/bidi-innertext.html patching file LayoutTests/http/tests/misc/charset-sniffer-end-sniffing-expected.txt patching file LayoutTests/http/tests/misc/charset-sniffer-end-sniffing.html patching file LayoutTests/http/tests/misc/resources/charset-sniffer-end-sniffing.php Failed to run "[u'/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', u'--force']" exit_code: 1 Full output:
http://queues.webkit.org/results/6977017
Adam Barth
Comment 41
2010-12-10 00:32:54 PST
Created
attachment 76171
[details]
failure This test seems to cause this failure.
Adam Barth
Comment 42
2010-12-10 11:50:58 PST
Committed
r73756
: <
http://trac.webkit.org/changeset/73756
>
Jenn Braithwaite
Comment 43
2010-12-10 12:09:02 PST
(In reply to
comment #41
)
> Created an attachment (id=76171) [details] > failure > > This test seems to cause this failure.
Putting the solution on the record for when/if the test files are ever updated: The fast/encoding/resources/[0-9]*.html test files contain unusual characters that got corrupted during the svn-apply process. They need to be landed manually. Also, four of the tests are actually testing tabs. Do 'svn pset allow-tabs true' on those files to disable the precommit hook.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug