Bug 69083 - Several CSS lexer rules don't match CSS 2.1 spec
Summary: Several CSS lexer rules don't match CSS 2.1 spec
Status: RESOLVED CONFIGURATION CHANGED
Alias: None
Product: WebKit
Classification: Unclassified
Component: CSS (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords: BrowserCompat, InRadar
Depends on: 74148 70807 70909 71000 71097 72007 72008 72360 74178 74815 76152
Blocks: 70107
  Show dependency treegraph
 
Reported: 2011-09-29 05:03 PDT by Zoltan Herczeg
Modified: 2024-02-13 03:18 PST (History)
13 users (show)

See Also:


Attachments
Flex killer test-case (should have yellow background) (81 bytes, text/html)
2011-10-26 03:35 PDT, Zoltan Herczeg
no flags Details
Added a test for css number types, where 'px' is defined like \70\78, etc. (3.79 KB, patch)
2011-11-07 06:04 PST, Szilard Ledan
no flags Details | Formatted Diff | Diff
Added three tests for css lexer parsing problems. (9.80 KB, patch)
2011-11-08 05:48 PST, Szilard Ledan
darin: review-
buildbot: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zoltan Herczeg 2011-09-29 05:03:21 PDT
I am playing with a hand written CSS lexer, and during that work I found a few bugs in the current flex rules in WebKit. I am not sure we prefer compatibility or standard compilance in such case, so I just tell you what I found, and let you decide what to do with them:

The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
The "wk" comes form css/tokenizer.flex

original:
nonascii	[\240-\377]
wk:
nonascii        [\200-\377]

They start nonascii from 160 not 128. Not sure why.

original:
string1		\"([^\n\r\f\\"]|\\{nl}|{escape})*\"
wk:
string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"

Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.

original:
unicode		\\{h}{1,6}(\r\n|[ \t\r\n\f])?
wk:
unicode         \\{h}{1,6}[ \t\r\n\f]?

This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.

original:
{num}%			{return PERCENTAGE;}
wk:
{num}%+                 {yyTok = PERCENTAGE; return yyTok;}

Why do we allow multpile percent sign? Although we still treat them as one...
Comment 1 Andras Becsi 2011-09-29 05:30:11 PDT
(In reply to comment #0)
> I am playing with a hand written CSS lexer, and during that work I found a few bugs in the current flex rules in WebKit. I am not sure we prefer compatibility or standard compilance in such case, so I just tell you what I found, and let you decide what to do with them:
> 
> The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
> The "wk" comes form css/tokenizer.flex
> 
> original:
> nonascii    [\240-\377]
> wk:
> nonascii        [\200-\377]
> 
> They start nonascii from 160 not 128. Not sure why.

I think this is just wrong in the spec, but there might be an explanation for it.

> 
> original:
> string1        \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
> wk:
> string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"
> 
> Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.
> 

This might be to make the parser simpler, not sure why the spec is that permissive.

> original:
> unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> wk:
> unicode         \\{h}{1,6}[ \t\r\n\f]?
> 
> This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.
> 

This feels like a real bug which could need a test case if possible

> original:
> {num}%            {return PERCENTAGE;}
> wk:
> {num}%+                 {yyTok = PERCENTAGE; return yyTok;}
> 
> Why do we allow multpile percent sign? Although we still treat them as one...

In this case the WebKit lexer only consumes all the garbage percentage signs which I think is not a big compatibility problem.

All this is really ancient code though.
Comment 2 Darin Adler 2011-09-29 08:43:20 PDT
A good way to investigate these concerns is to construct test cases demonstrating the issues and try them in other browsers.
Comment 3 Darin Adler 2011-09-29 09:38:47 PDT
(In reply to comment #0)
> The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
> The "wk" comes form css/tokenizer.flex
> 
> original:
> nonascii    [\240-\377]
> wk:
> nonascii        [\200-\377]
> 
> They start nonascii from 160 not 128. Not sure why.

The characters in the range U+0080-009F are considered “control characters” so I can imagine they might not be permitted in some contexts. But we need to be careful about what kind of text is passed in. If this tokenizer is used on UTF-8 text then it’s critical to allow 80-9F bytes.

> original:
> string1        \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
> wk:
> string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"
> 
> Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.

Another batch of what are considered control characters.

> original:
> unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> wk:
> unicode         \\{h}{1,6}[ \t\r\n\f]?
> 
> This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.

Seems likely to just be a bug.

> original:
> {num}%            {return PERCENTAGE;}
> wk:
> {num}%+                 {yyTok = PERCENTAGE; return yyTok;}
> 
> Why do we allow multpile percent sign? Although we still treat them as one...

Subversion indicates that I made this change 8 years ago in <http://trac.webkit.org/changeset/4059>.

The problem being solved was a table element had a content attribute width="90%%". Other browsers were treating this as 90%. So this wasn’t about CSS itself, but rather about parsing the value of a width content attribute.

Here’s what Hyatt advised me back in 2003 (things may have changed since then):

> (1) We could just patch the tokenizer of the CSS parser to make the percent token be %+ (i.e., 1 or more percent signs). 
> (2) The yacc grammar could be patched.
> (3) A simple function could be written that parses HTML widths without using the CSS parser at all, which is what we did for color.
>
> (3) is IMO where we want to go ultimately, but either (1) or (2) would do as a beta fix.

I think we are looking at three or four separate issues here, and we should use separate bugs for them. We’ll need to fix and test each separately. Perhaps the two control character issues could be grouped, though.

Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too.

When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite.
Comment 4 Zoltan Herczeg 2011-09-29 10:52:24 PDT
> Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too.
> 
> When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite.

Totally agree. The motivation of this bug is simply to raise the visibility of these differences and see that they have a known reason. But it looks like we need further investigation. Testing other browsers and creating regression tests could be a good task for our younger team members.

Rewriting the tokeinzer has an entiriely different motivation. According to our recent profile data the total time of the tokenizer is comparable to the total time of the CSS parser which was a surprise for us. We simply want to see how a hand written tokenizer performs compared to the generated one.
Comment 5 Darin Adler 2011-09-29 13:03:04 PDT
(In reply to comment #4)
> Rewriting the tokeinzer has an entiriely different motivation. According to our recent profile data the total time of the tokenizer is comparable to the total time of the CSS parser which was a surprise for us. We simply want to see how a hand written tokenizer performs compared to the generated one.

Sure, when we did the new HTML parser and the new JavaScript tokenizer both made things much faster.

And in both cases, the work began by adding far more test cases to make it safe to replace the component.

Unless you see your exercise as a purely scientific experiment, I suggest you beef up test cases as an early step of the rewrite. It’s typically a straightforward and relatively quick task.
Comment 6 Zoltan Herczeg 2011-10-26 03:10:51 PDT
Playing with the CSS tokenizer is quite funny. Although there are several bugs in the tokeinzer, they cannot be exploited because other mechanics hides them.

Yesterday I checked an URI parsing bug. Due to a bug in CSSParser::text(...) the slash-newlines sequences are converted to newline in the same way as '\a' converted to 'a', so they are treated as normal escape sequences. I thought I could exploit this, but I didn't work, since there is another funny CSS rule, which says chars < 32 must be skipped.

So www<tab>.<tab>google<tab>.<tab>com is the same as www.google.com since <tab> is equal to 9, which is less than 32, and discarded. However, www<space>.<space>google<space>.<space>com is different, since <space> is not less than 32. <newline> is equal to 10, so it is discarded as well.

Today I checked this bug:

> Another batch of what are considered control characters.
> 
> > original:
> > unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> > wk:
> > unicode         \\{h}{1,6}[ \t\r\n\f]?
> > 
> > This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.
> 
> Seems likely to just be a bug.

Guess what it does not appear as well. Why?

WebCore/xml/MarkupTokenizerBase.h InputStreamPreprocessor::peek(...) converts '\r\n' sequences to '\n'! This function is indirectly called from WebCore/html/parser/HTMLDocumentParser.cpp:267 through m_tokenizer->nextToken(...). Thus, the CSS parser never encounters a '\r\n'.

Although we could simplify the "nl \n|\r\n|\r|\f" rule and CSSParser::text(...), I don't want to do it since I hope they will be dropped in the future as the hand written parser is more than twice as fast.

But I want to add a newline related layout test.
Comment 7 Zoltan Herczeg 2011-10-26 03:35:34 PDT
Created attachment 112477 [details]
Flex killer test-case (should have yellow background)

Just to advertise the advantages of hand written parser a little more, the attached test case does not work in WebKit, but works in FireFox, Opera and even with Internet Explorer 8. Furthermore it cannot be fixed as the bug is in the flex lexer generator :] (It cannot process the last token)

By the way WebKit uses a workaround for this in its internal style things: they terminated with an unnecessary extra space, which seemed stupid for me before, but now I understand everything.
Comment 8 Antti Koivisto 2011-10-26 04:31:15 PDT
I'm all for hand written parser. The only advertisement it really needs is that it is faster and at least as correct as the old one, without being too much less readable/hackable. Is there a bug tracking this work?
Comment 9 Zoltan Herczeg 2011-10-26 04:33:26 PDT
(In reply to comment #8)
> I'm all for hand written parser. The only advertisement it really needs is that it is faster and at least as correct as the old one, without being too much less readable/hackable. Is there a bug tracking this work?

https://bugs.webkit.org/show_bug.cgi?id=70107
Comment 10 Zoltan Herczeg 2011-10-27 02:16:26 PDT
This bug should be a master bug, adding the dependent bugs.

Thanks Darin for reviewing them!
Comment 11 Zoltan Herczeg 2011-10-28 03:56:07 PDT
There is one more difference worth mention:

According to the spec, you can use escape sequences in any CSS reserved words inlcuding number types, so "100px" and "100\70\78" has exactly the same meaning. This is supported by FireFox, Opera and Internet Explorer but not in WebKit and would be a complicated change with the current lexer. Can I postpone this change until the custom written parser lands?
Comment 12 Darin Adler 2011-10-28 10:36:28 PDT
(In reply to comment #11)
> Can I postpone this change until the custom written parser lands?

Sure, but don’t postpone writing a test, please.
Comment 13 Zoltan Herczeg 2011-10-28 11:29:14 PDT
> Sure, but don’t postpone writing a test, please.

I don't quite understand your request. Do you need a currently failing test or something else ...?
Comment 14 Darin Adler 2011-10-28 11:51:40 PDT
(In reply to comment #13)
> > Sure, but don’t postpone writing a test, please.
> 
> I don't quite understand your request. Do you need a currently failing test or something else ...?

Our tests are regression tests that help us notice changes in behavior. A test showing the current behavior, which characterize it either as failing or succeeding, would be great. The expectations would reflect the current behavior. And the expectations, at least, would change in the patch that fixes the bug.
Comment 15 Zoltan Herczeg 2011-11-03 05:32:22 PDT
> Our tests are regression tests that help us notice changes in behavior. A test showing the current behavior, which characterize it either as failing or succeeding, would be great. The expectations would reflect the current behavior. And the expectations, at least, would change in the patch that fixes the bug.

I gave this task to Szilard. He will submit patches soon.
Comment 16 Szilard Ledan 2011-11-07 06:04:13 PST
Created attachment 113854 [details]
Added a test for css number types, where 'px' is defined like \70\78, etc.

Added a test for css number types, where 'px' is defined like \70\78, etc.
Comment 17 Szilard Ledan 2011-11-08 05:48:53 PST
Created attachment 114052 [details]
Added three tests for css lexer parsing problems.


The first and second are testing the css number types parser, where 'px' defined like escape sequences (\70\78), etc.
The third is testing the css square braces block parser. The WebKit css lexer didn't pars correctly the css block.
Example:
1. <html>... <p style="font-size: 12px;">...</p>...</html> it's ok
2. <html>... <p style="{font-size: 12px;}">...</p>...</html> it's not.
Comment 18 Zoltan Herczeg 2011-11-08 05:52:50 PST
Hey Szilard,

please open a separate bugzilla entry for each patches. Your second patch does not invalidate the first one. Moreover, this is a meta-bug not an entry for specific issue.
Comment 19 Zoltan Herczeg 2011-11-21 04:27:54 PST
I found one more thing:

http://www.w3.org/TR/selectors/#structural-pseudos

says:

The following selectors are therefore equivalent:
bar:nth-child(1n+0)   /* represents all bar elements, specificity (0,1,1) */
bar:nth-child(n+0)    /* same */
bar:nth-child(n)      /* same */
bar                   /* same but lower specificity (0,0,1) */

However, in our tokenizer.flex:
{ident}                 {yyTok = IDENT; return yyTok;}
<nthchild>{nth}         {yyTok = NTH; return yyTok;}

Thus, identifier got higher priority than nth in <nthchild> mode, so 'n' is reported as IDENT not as NTH.

I think we should swap these rules. Do you agree?
Comment 20 Darin Adler 2011-11-30 10:03:47 PST
(In reply to comment #19)
> However, in our tokenizer.flex:
> {ident}                 {yyTok = IDENT; return yyTok;}
> <nthchild>{nth}         {yyTok = NTH; return yyTok;}
> 
> Thus, identifier got higher priority than nth in <nthchild> mode, so 'n' is reported as IDENT not as NTH.
> 
> I think we should swap these rules.

If this is a bug you should be able to demonstrate it with a test case; you could then try it in other browsers to see if they match the spec.

Swapping the rules seems like a fine way to fix the bug if it works.
Comment 21 Eric Seidel (no email) 2012-09-13 09:44:09 PDT
Comment on attachment 114052 [details]
Added three tests for css lexer parsing problems.

I think you want dumpAsText for these (I see one of them is).

I assume other browsers pass these tests?
Comment 22 Build Bot 2013-01-17 03:43:12 PST
Comment on attachment 114052 [details]
Added three tests for css lexer parsing problems.

Attachment 114052 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://queues.webkit.org/results/15937047

New failing tests:
fast/css/parsing-css-wrap.html
fast/css/parsing-css-number-types.html
fast/css/parsing-css-square-braces.html
Comment 23 Darin Adler 2013-04-08 18:11:53 PDT
Comment on attachment 114052 [details]
Added three tests for css lexer parsing problems.

View in context: https://bugs.webkit.org/attachment.cgi?id=114052&action=review

> LayoutTests/fast/css/parsing-css-wrap-expected.txt:11
> +          text run at (0,0) width 205: "TEST DID NOT COMPLETE"

It would be much better to have a text result, even if this test won’t complete.
Comment 24 Ahmad Saleem 2022-07-20 17:12:55 PDT
In attached test (Flex killer), Safari 15.6 on macOS 12.5 matches with other browsers (Chrome Canary 105 and Firefox Nightly 104).

But from the test case from the patch - there are following differences:

Test Case - https://jsfiddle.net/zryLu2pa/show

Safari 15.6 and Chrome Canary 105 fails but show following rules for stylesheet:

body { background: rgb(255, 255, 255); }

While Firefox Nightly 104 does:

body { background: rgb(255, 255, 255) none repeat scroll 0% 0%; }

__________________

If I am testing incorrectly, please ignore. Thanks!
Comment 25 Brent Fulgham 2022-07-20 17:14:56 PDT
I think if all three browser match behavior, we are probably "RESOLVE | CONFIG CHANGED".

Can you try the test case on Firefox or Chrome to see if they match?
Comment 26 Brent Fulgham 2022-07-20 17:15:25 PDT
My guess is that the 11-year-old test is no longer completely aligned with the modern spec.
Comment 27 Ahmad Saleem 2022-07-20 17:21:32 PDT
In one of the test cases, as mentioned in Comment 24 (refer to JSfiddle), Safari 15.5 output is different from Firefox but matches with Chrome. I am not sure on Web Spec so I can't comment whether Safari behavior is right or Firefox. Thanks!
Comment 28 Ahmad Saleem 2022-10-12 14:57:07 PDT
(In reply to Ahmad Saleem from comment #24)
> In attached test (Flex killer), Safari 15.6 on macOS 12.5 matches with other
> browsers (Chrome Canary 105 and Firefox Nightly 104).
> 
> But from the test case from the patch - there are following differences:
> 
> Test Case - https://jsfiddle.net/zryLu2pa/show
> 
> Safari 15.6 and Chrome Canary 105 fails but show following rules for
> stylesheet:
> 
> body { background: rgb(255, 255, 255); }
> 
> While Firefox Nightly 104 does:
> 
> body { background: rgb(255, 255, 255) none repeat scroll 0% 0%; }
> 
> __________________
> 
> If I am testing incorrectly, please ignore. Thanks!

Just to update:

Now Safari 16 and TP 155 show:

body { background-color: rgb(255, 255, 255); }

while Firefox Nightly 107 matched Chrome Canary 108 and shows:

body { background: rgb(255, 255, 255); }

and now difference is "background-color" vs "background".
Comment 29 Karl Dubost 2022-10-12 23:01:00 PDT
Probably partially a duplicate of Bug 185953 and Bug 245105
Comment 30 Ahmad Saleem 2023-06-08 06:15:57 PDT
(In reply to Ahmad Saleem from comment #28)
> (In reply to Ahmad Saleem from comment #24)
> > In attached test (Flex killer), Safari 15.6 on macOS 12.5 matches with other
> > browsers (Chrome Canary 105 and Firefox Nightly 104).
> > 
> > But from the test case from the patch - there are following differences:
> > 
> > Test Case - https://jsfiddle.net/zryLu2pa/show
> > 
> > Safari 15.6 and Chrome Canary 105 fails but show following rules for
> > stylesheet:
> > 
> > body { background: rgb(255, 255, 255); }
> > 
> > While Firefox Nightly 104 does:
> > 
> > body { background: rgb(255, 255, 255) none repeat scroll 0% 0%; }
> > 
> > __________________
> > 
> > If I am testing incorrectly, please ignore. Thanks!
> 
> Just to update:
> 
> Now Safari 16 and TP 155 show:
> 
> body { background-color: rgb(255, 255, 255); }
> 
> while Firefox Nightly 107 matched Chrome Canary 108 and shows:
> 
> body { background: rgb(255, 255, 255); }
> 
> and now difference is "background-color" vs "background".

All browsers (Chrome Canary 116, WebKit ToT, Firefox Nightly 116) show:

Test parsing of CSS number types.

FAILURE

Rules from the stylesheet:

body { background: rgb(255, 255, 255); }

Expected result:

#a { font-size: 16px; }
	#b { font-size: 16px; }
	#c { font-size: 16px; }
	#d { font-size: 16px; }
	#e { font-size: 1em; }
	#f { font-size: 1em; }
	#g { font-size: 1em; }
	#h { font-size: 1em; }
	#i { font-size: 12pt; }
Comment 31 Radar WebKit Bug Importer 2024-02-08 15:42:55 PST
<rdar://problem/122588295>
Comment 32 Anne van Kesteren 2024-02-13 03:18:26 PST
Thanks Ahmad, let's consider this closed then.