Bug 69083

Summary: Several CSS lexer rules don't match CSS 2.1 spec
Product: WebKit Reporter: Zoltan Herczeg <zherczeg@webkit.org>
Component: CSSAssignee: Nobody <webkit-unassigned@lists.webkit.org>
Status: NEW    
Severity: Normal CC: abecsi@webkit.org, ap@webkit.org, buildbot@hotmail.com, darin@apple.com, eric@webkit.org, hyatt@apple.com, koivisto@iki.fi, rniwa@webkit.org
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 70807, 70909, 71000, 71097, 72007, 72008, 72360, 74148, 74178, 74815, 76152    
Bug Blocks: 70107    
Attachments:
Description Flags
Flex killer test-case (should have yellow background)
none
Added a test for css number types, where 'px' is defined like \70\78, etc.
none
Added three tests for css lexer parsing problems. darin: review-, buildbot: commit‑queue-

Description From 2011-09-29 05:03:21 PST
I am playing with a hand written CSS lexer, and during that work I found a few bugs in the current flex rules in WebKit. I am not sure we prefer compatibility or standard compilance in such case, so I just tell you what I found, and let you decide what to do with them:

The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
The "wk" comes form css/tokenizer.flex

original:
nonascii    [\240-\377]
wk:
nonascii        [\200-\377]

They start nonascii from 160 not 128. Not sure why.

original:
string1        \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
wk:
string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"

Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.

original:
unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
wk:
unicode         \\{h}{1,6}[ \t\r\n\f]?

This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.

original:
{num}%            {return PERCENTAGE;}
wk:
{num}%+                 {yyTok = PERCENTAGE; return yyTok;}

Why do we allow multpile percent sign? Although we still treat them as one...
------- Comment #1 From 2011-09-29 05:30:11 PST -------
(In reply to comment #0)
> I am playing with a hand written CSS lexer, and during that work I found a few bugs in the current flex rules in WebKit. I am not sure we prefer compatibility or standard compilance in such case, so I just tell you what I found, and let you decide what to do with them:
> 
> The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
> The "wk" comes form css/tokenizer.flex
> 
> original:
> nonascii    [\240-\377]
> wk:
> nonascii        [\200-\377]
> 
> They start nonascii from 160 not 128. Not sure why.

I think this is just wrong in the spec, but there might be an explanation for it.

> 
> original:
> string1        \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
> wk:
> string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"
> 
> Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.
> 

This might be to make the parser simpler, not sure why the spec is that permissive.

> original:
> unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> wk:
> unicode         \\{h}{1,6}[ \t\r\n\f]?
> 
> This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.
> 

This feels like a real bug which could need a test case if possible

> original:
> {num}%            {return PERCENTAGE;}
> wk:
> {num}%+                 {yyTok = PERCENTAGE; return yyTok;}
> 
> Why do we allow multpile percent sign? Although we still treat them as one...

In this case the WebKit lexer only consumes all the garbage percentage signs which I think is not a big compatibility problem.

All this is really ancient code though.
------- Comment #2 From 2011-09-29 08:43:20 PST -------
A good way to investigate these concerns is to construct test cases demonstrating the issues and try them in other browsers.
------- Comment #3 From 2011-09-29 09:38:47 PST -------
(In reply to comment #0)
> The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
> The "wk" comes form css/tokenizer.flex
> 
> original:
> nonascii    [\240-\377]
> wk:
> nonascii        [\200-\377]
> 
> They start nonascii from 160 not 128. Not sure why.

The characters in the range U+0080-009F are considered “control characters” so I can imagine they might not be permitted in some contexts. But we need to be careful about what kind of text is passed in. If this tokenizer is used on UTF-8 text then it’s critical to allow 80-9F bytes.

> original:
> string1        \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
> wk:
> string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"
> 
> Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.

Another batch of what are considered control characters.

> original:
> unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> wk:
> unicode         \\{h}{1,6}[ \t\r\n\f]?
> 
> This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.

Seems likely to just be a bug.

> original:
> {num}%            {return PERCENTAGE;}
> wk:
> {num}%+                 {yyTok = PERCENTAGE; return yyTok;}
> 
> Why do we allow multpile percent sign? Although we still treat them as one...

Subversion indicates that I made this change 8 years ago in <http://trac.webkit.org/changeset/4059>.

The problem being solved was a table element had a content attribute width="90%%". Other browsers were treating this as 90%. So this wasn’t about CSS itself, but rather about parsing the value of a width content attribute.

Here’s what Hyatt advised me back in 2003 (things may have changed since then):

> (1) We could just patch the tokenizer of the CSS parser to make the percent token be %+ (i.e., 1 or more percent signs). 
> (2) The yacc grammar could be patched.
> (3) A simple function could be written that parses HTML widths without using the CSS parser at all, which is what we did for color.
>
> (3) is IMO where we want to go ultimately, but either (1) or (2) would do as a beta fix.

I think we are looking at three or four separate issues here, and we should use separate bugs for them. We’ll need to fix and test each separately. Perhaps the two control character issues could be grouped, though.

Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too.

When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite.
------- Comment #4 From 2011-09-29 10:52:24 PST -------
> Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too.
> 
> When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite.

Totally agree. The motivation of this bug is simply to raise the visibility of these differences and see that they have a known reason. But it looks like we need further investigation. Testing other browsers and creating regression tests could be a good task for our younger team members.

Rewriting the tokeinzer has an entiriely different motivation. According to our recent profile data the total time of the tokenizer is comparable to the total time of the CSS parser which was a surprise for us. We simply want to see how a hand written tokenizer performs compared to the generated one.
------- Comment #5 From 2011-09-29 13:03:04 PST -------
(In reply to comment #4)
> Rewriting the tokeinzer has an entiriely different motivation. According to our recent profile data the total time of the tokenizer is comparable to the total time of the CSS parser which was a surprise for us. We simply want to see how a hand written tokenizer performs compared to the generated one.

Sure, when we did the new HTML parser and the new JavaScript tokenizer both made things much faster.

And in both cases, the work began by adding far more test cases to make it safe to replace the component.

Unless you see your exercise as a purely scientific experiment, I suggest you beef up test cases as an early step of the rewrite. It’s typically a straightforward and relatively quick task.
------- Comment #6 From 2011-10-26 03:10:51 PST -------
Playing with the CSS tokenizer is quite funny. Although there are several bugs in the tokeinzer, they cannot be exploited because other mechanics hides them.

Yesterday I checked an URI parsing bug. Due to a bug in CSSParser::text(...) the slash-newlines sequences are converted to newline in the same way as '\a' converted to 'a', so they are treated as normal escape sequences. I thought I could exploit this, but I didn't work, since there is another funny CSS rule, which says chars < 32 must be skipped.

So www<tab>.<tab>google<tab>.<tab>com is the same as www.google.com since <tab> is equal to 9, which is less than 32, and discarded. However, www<space>.<space>google<space>.<space>com is different, since <space> is not less than 32. <newline> is equal to 10, so it is discarded as well.

Today I checked this bug:

> Another batch of what are considered control characters.
> 
> > original:
> > unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> > wk:
> > unicode         \\{h}{1,6}[ \t\r\n\f]?
> > 
> > This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.
> 
> Seems likely to just be a bug.

Guess what it does not appear as well. Why?

WebCore/xml/MarkupTokenizerBase.h InputStreamPreprocessor::peek(...) converts '\r\n' sequences to '\n'! This function is indirectly called from WebCore/html/parser/HTMLDocumentParser.cpp:267 through m_tokenizer->nextToken(...). Thus, the CSS parser never encounters a '\r\n'.

Although we could simplify the "nl \n|\r\n|\r|\f" rule and CSSParser::text(...), I don't want to do it since I hope they will be dropped in the future as the hand written parser is more than twice as fast.

But I want to add a newline related layout test.
------- Comment #7 From 2011-10-26 03:35:34 PST -------
Created an attachment (id=112477) [details]
Flex killer test-case (should have yellow background)

Just to advertise the advantages of hand written parser a little more, the attached test case does not work in WebKit, but works in FireFox, Opera and even with Internet Explorer 8. Furthermore it cannot be fixed as the bug is in the flex lexer generator :] (It cannot process the last token)

By the way WebKit uses a workaround for this in its internal style things: they terminated with an unnecessary extra space, which seemed stupid for me before, but now I understand everything.
------- Comment #8 From 2011-10-26 04:31:15 PST -------
I'm all for hand written parser. The only advertisement it really needs is that it is faster and at least as correct as the old one, without being too much less readable/hackable. Is there a bug tracking this work?
------- Comment #9 From 2011-10-26 04:33:26 PST -------
(In reply to comment #8)
> I'm all for hand written parser. The only advertisement it really needs is that it is faster and at least as correct as the old one, without being too much less readable/hackable. Is there a bug tracking this work?

https://bugs.webkit.org/show_bug.cgi?id=70107
------- Comment #10 From 2011-10-27 02:16:26 PST -------
This bug should be a master bug, adding the dependent bugs.

Thanks Darin for reviewing them!
------- Comment #11 From 2011-10-28 03:56:07 PST -------
There is one more difference worth mention:

According to the spec, you can use escape sequences in any CSS reserved words inlcuding number types, so "100px" and "100\70\78" has exactly the same meaning. This is supported by FireFox, Opera and Internet Explorer but not in WebKit and would be a complicated change with the current lexer. Can I postpone this change until the custom written parser lands?
------- Comment #12 From 2011-10-28 10:36:28 PST -------
(In reply to comment #11)
> Can I postpone this change until the custom written parser lands?

Sure, but don’t postpone writing a test, please.
------- Comment #13 From 2011-10-28 11:29:14 PST -------
> Sure, but don’t postpone writing a test, please.

I don't quite understand your request. Do you need a currently failing test or something else ...?
------- Comment #14 From 2011-10-28 11:51:40 PST -------
(In reply to comment #13)
> > Sure, but don’t postpone writing a test, please.
> 
> I don't quite understand your request. Do you need a currently failing test or something else ...?

Our tests are regression tests that help us notice changes in behavior. A test showing the current behavior, which characterize it either as failing or succeeding, would be great. The expectations would reflect the current behavior. And the expectations, at least, would change in the patch that fixes the bug.
------- Comment #15 From 2011-11-03 05:32:22 PST -------
> Our tests are regression tests that help us notice changes in behavior. A test showing the current behavior, which characterize it either as failing or succeeding, would be great. The expectations would reflect the current behavior. And the expectations, at least, would change in the patch that fixes the bug.

I gave this task to Szilard. He will submit patches soon.
------- Comment #16 From 2011-11-07 06:04:13 PST -------
Created an attachment (id=113854) [details]
Added a test for css number types, where 'px' is defined like \70\78, etc.

Added a test for css number types, where 'px' is defined like \70\78, etc.
------- Comment #17 From 2011-11-08 05:48:53 PST -------
Created an attachment (id=114052) [details]
Added three tests for css lexer parsing problems.


The first and second are testing the css number types parser, where 'px' defined like escape sequences (\70\78), etc.
The third is testing the css square braces block parser. The WebKit css lexer didn't pars correctly the css block.
Example:
1. <html>... <p style="font-size: 12px;">...</p>...</html> it's ok
2. <html>... <p style="{font-size: 12px;}">...</p>...</html> it's not.
------- Comment #18 From 2011-11-08 05:52:50 PST -------
Hey Szilard,

please open a separate bugzilla entry for each patches. Your second patch does not invalidate the first one. Moreover, this is a meta-bug not an entry for specific issue.
------- Comment #19 From 2011-11-21 04:27:54 PST -------
I found one more thing:

http://www.w3.org/TR/selectors/#structural-pseudos

says:

The following selectors are therefore equivalent:
bar:nth-child(1n+0)   /* represents all bar elements, specificity (0,1,1) */
bar:nth-child(n+0)    /* same */
bar:nth-child(n)      /* same */
bar                   /* same but lower specificity (0,0,1) */

However, in our tokenizer.flex:
{ident}                 {yyTok = IDENT; return yyTok;}
<nthchild>{nth}         {yyTok = NTH; return yyTok;}

Thus, identifier got higher priority than nth in <nthchild> mode, so 'n' is reported as IDENT not as NTH.

I think we should swap these rules. Do you agree?
------- Comment #20 From 2011-11-30 10:03:47 PST -------
(In reply to comment #19)
> However, in our tokenizer.flex:
> {ident}                 {yyTok = IDENT; return yyTok;}
> <nthchild>{nth}         {yyTok = NTH; return yyTok;}
> 
> Thus, identifier got higher priority than nth in <nthchild> mode, so 'n' is reported as IDENT not as NTH.
> 
> I think we should swap these rules.

If this is a bug you should be able to demonstrate it with a test case; you could then try it in other browsers to see if they match the spec.

Swapping the rules seems like a fine way to fix the bug if it works.
------- Comment #21 From 2012-09-13 09:44:09 PST -------
(From update of attachment 114052 [details])
I think you want dumpAsText for these (I see one of them is).

I assume other browsers pass these tests?
------- Comment #22 From 2013-01-17 03:43:12 PST -------
(From update of attachment 114052 [details])
Attachment 114052 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://queues.webkit.org/results/15937047

New failing tests:
fast/css/parsing-css-wrap.html
fast/css/parsing-css-number-types.html
fast/css/parsing-css-square-braces.html
------- Comment #23 From 2013-04-08 18:11:53 PST -------
(From update of attachment 114052 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=114052&action=review

> LayoutTests/fast/css/parsing-css-wrap-expected.txt:11
> +          text run at (0,0) width 205: "TEST DID NOT COMPLETE"

It would be much better to have a text result, even if this test won’t complete.