Bug 69083

Summary:

Several CSS lexer rules don't match CSS 2.1 spec

Product:

WebKit

Reporter:

Zoltan Herczeg <zherczeg>

Component:

CSS

Assignee:

Nobody <webkit-unassigned>

Status:

RESOLVED CONFIGURATION CHANGED

Severity:

Normal

CC:

abecsi, ahmad.saleem792, annevk, ap, bfulgham, buildbot, darin, eric, hyatt, karlcow, koivisto, rniwa, webkit-bug-importer

Priority:

Keywords:

BrowserCompat, InRadar

Version:

528+ (Nightly build)

Hardware:

Unspecified

OS:

Unspecified

See Also:

https://bugs.webkit.org/show_bug.cgi?id=185953

Bug Depends on:

74148, 70807, 70909, 71000, 71097, 72007, 72008, 72360, 74178, 74815, 76152

Bug Blocks:

70107

Attachments:

Description	Flags
Flex killer test-case (should have yellow background)	none
Added a test for css number types, where 'px' is defined like \70\78, etc.	none
Added three tests for css lexer parsing problems.	darin: review-, buildbot: commit-queue-

Zoltan Herczeg

Reported 2011-09-29 05:03:21 PDT

I am playing with a hand written CSS lexer, and during that work I found a few bugs in the current flex rules in WebKit. I am not sure we prefer compatibility or standard compilance in such case, so I just tell you what I found, and let you decide what to do with them: The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner" The "wk" comes form css/tokenizer.flex original: nonascii [\240-\377] wk: nonascii [\200-\377] They start nonascii from 160 not 128. Not sure why. original: string1 \"([^\n\r\f\\"]|\\{nl}|{escape})*\" wk: string1 \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\" Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them. original: unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])? wk: unicode \\{h}{1,6}[ \t\r\n\f]? This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK. original: {num}% {return PERCENTAGE;} wk: {num}%+ {yyTok = PERCENTAGE; return yyTok;} Why do we allow multpile percent sign? Although we still treat them as one...

Attachments
Flex killer test-case (should have yellow background) (81 bytes, text/html) 2011-10-26 03:35 PDT, Zoltan Herczeg	no flags	Details
Added a test for css number types, where 'px' is defined like \70\78, etc. (3.79 KB, patch) 2011-11-07 06:04 PST, Szilard Ledan	no flags	Details Formatted Diff Diff
Added three tests for css lexer parsing problems. (9.80 KB, patch) 2011-11-08 05:48 PST, Szilard Ledan	darin: review- buildbot: commit-queue-	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Andras Becsi

Comment 1 2011-09-29 05:30:11 PDT

(In reply to comment #0) > I am playing with a hand written CSS lexer, and during that work I found a few bugs in the current flex rules in WebKit. I am not sure we prefer compatibility or standard compilance in such case, so I just tell you what I found, and let you decide what to do with them: > > The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner" > The "wk" comes form css/tokenizer.flex > > original: > nonascii [\240-\377] > wk: > nonascii [\200-\377] > > They start nonascii from 160 not 128. Not sure why. I think this is just wrong in the spec, but there might be an explanation for it. > > original: > string1 \"([^\n\r\f\\"]|\\{nl}|{escape})*\" > wk: > string1 \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\" > > Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them. > This might be to make the parser simpler, not sure why the spec is that permissive. > original: > unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])? > wk: > unicode \\{h}{1,6}[ \t\r\n\f]? > > This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK. > This feels like a real bug which could need a test case if possible > original: > {num}% {return PERCENTAGE;} > wk: > {num}%+ {yyTok = PERCENTAGE; return yyTok;} > > Why do we allow multpile percent sign? Although we still treat them as one... In this case the WebKit lexer only consumes all the garbage percentage signs which I think is not a big compatibility problem. All this is really ancient code though.

Darin Adler

Comment 2 2011-09-29 08:43:20 PDT

A good way to investigate these concerns is to construct test cases demonstrating the issues and try them in other browsers.

Darin Adler

Comment 3 2011-09-29 09:38:47 PDT

(In reply to comment #0) > The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner" > The "wk" comes form css/tokenizer.flex > > original: > nonascii [\240-\377] > wk: > nonascii [\200-\377] > > They start nonascii from 160 not 128. Not sure why. The characters in the range U+0080-009F are considered “control characters” so I can imagine they might not be permitted in some contexts. But we need to be careful about what kind of text is passed in. If this tokenizer is used on UTF-8 text then it’s critical to allow 80-9F bytes. > original: > string1 \"([^\n\r\f\\"]|\\{nl}|{escape})*\" > wk: > string1 \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\" > > Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them. Another batch of what are considered control characters. > original: > unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])? > wk: > unicode \\{h}{1,6}[ \t\r\n\f]? > > This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK. Seems likely to just be a bug. > original: > {num}% {return PERCENTAGE;} > wk: > {num}%+ {yyTok = PERCENTAGE; return yyTok;} > > Why do we allow multpile percent sign? Although we still treat them as one... Subversion indicates that I made this change 8 years ago in <http://trac.webkit.org/changeset/4059>. The problem being solved was a table element had a content attribute width="90%%". Other browsers were treating this as 90%. So this wasn’t about CSS itself, but rather about parsing the value of a width content attribute. Here’s what Hyatt advised me back in 2003 (things may have changed since then): > (1) We could just patch the tokenizer of the CSS parser to make the percent token be %+ (i.e., 1 or more percent signs). > (2) The yacc grammar could be patched. > (3) A simple function could be written that parses HTML widths without using the CSS parser at all, which is what we did for color. > > (3) is IMO where we want to go ultimately, but either (1) or (2) would do as a beta fix. I think we are looking at three or four separate issues here, and we should use separate bugs for them. We’ll need to fix and test each separately. Perhaps the two control character issues could be grouped, though. Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too. When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite.

Zoltan Herczeg

Comment 4 2011-09-29 10:52:24 PDT

> Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too. > > When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite. Totally agree. The motivation of this bug is simply to raise the visibility of these differences and see that they have a known reason. But it looks like we need further investigation. Testing other browsers and creating regression tests could be a good task for our younger team members. Rewriting the tokeinzer has an entiriely different motivation. According to our recent profile data the total time of the tokenizer is comparable to the total time of the CSS parser which was a surprise for us. We simply want to see how a hand written tokenizer performs compared to the generated one.

Darin Adler

Comment 5 2011-09-29 13:03:04 PDT

(In reply to comment #4) > Rewriting the tokeinzer has an entiriely different motivation. According to our recent profile data the total time of the tokenizer is comparable to the total time of the CSS parser which was a surprise for us. We simply want to see how a hand written tokenizer performs compared to the generated one. Sure, when we did the new HTML parser and the new JavaScript tokenizer both made things much faster. And in both cases, the work began by adding far more test cases to make it safe to replace the component. Unless you see your exercise as a purely scientific experiment, I suggest you beef up test cases as an early step of the rewrite. It’s typically a straightforward and relatively quick task.

Zoltan Herczeg

Comment 6 2011-10-26 03:10:51 PDT

Playing with the CSS tokenizer is quite funny. Although there are several bugs in the tokeinzer, they cannot be exploited because other mechanics hides them. Yesterday I checked an URI parsing bug. Due to a bug in CSSParser::text(...) the slash-newlines sequences are converted to newline in the same way as '\a' converted to 'a', so they are treated as normal escape sequences. I thought I could exploit this, but I didn't work, since there is another funny CSS rule, which says chars < 32 must be skipped. So www<tab>.<tab>google<tab>.<tab>com is the same as www.google.com since <tab> is equal to 9, which is less than 32, and discarded. However, www<space>.<space>google<space>.<space>com is different, since <space> is not less than 32. <newline> is equal to 10, so it is discarded as well. Today I checked this bug: > Another batch of what are considered control characters. > > > original: > > unicode \\{h}{1,6}(\r\n|[ \t\r\n\f])? > > wk: > > unicode \\{h}{1,6}[ \t\r\n\f]? > > > > This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK. > > Seems likely to just be a bug. Guess what it does not appear as well. Why? WebCore/xml/MarkupTokenizerBase.h InputStreamPreprocessor::peek(...) converts '\r\n' sequences to '\n'! This function is indirectly called from WebCore/html/parser/HTMLDocumentParser.cpp:267 through m_tokenizer->nextToken(...). Thus, the CSS parser never encounters a '\r\n'. Although we could simplify the "nl \n|\r\n|\r|\f" rule and CSSParser::text(...), I don't want to do it since I hope they will be dropped in the future as the hand written parser is more than twice as fast. But I want to add a newline related layout test.

Zoltan Herczeg

Comment 7 2011-10-26 03:35:34 PDT

Created attachment 112477 [details] Flex killer test-case (should have yellow background) Just to advertise the advantages of hand written parser a little more, the attached test case does not work in WebKit, but works in FireFox, Opera and even with Internet Explorer 8. Furthermore it cannot be fixed as the bug is in the flex lexer generator :] (It cannot process the last token) By the way WebKit uses a workaround for this in its internal style things: they terminated with an unnecessary extra space, which seemed stupid for me before, but now I understand everything.

Antti Koivisto

Comment 8 2011-10-26 04:31:15 PDT

I'm all for hand written parser. The only advertisement it really needs is that it is faster and at least as correct as the old one, without being too much less readable/hackable. Is there a bug tracking this work?

Zoltan Herczeg

Comment 9 2011-10-26 04:33:26 PDT

(In reply to comment #8) > I'm all for hand written parser. The only advertisement it really needs is that it is faster and at least as correct as the old one, without being too much less readable/hackable. Is there a bug tracking this work? https://bugs.webkit.org/show_bug.cgi?id=70107

Zoltan Herczeg

Comment 10 2011-10-27 02:16:26 PDT

This bug should be a master bug, adding the dependent bugs. Thanks Darin for reviewing them!

Zoltan Herczeg

Comment 11 2011-10-28 03:56:07 PDT

There is one more difference worth mention: According to the spec, you can use escape sequences in any CSS reserved words inlcuding number types, so "100px" and "100\70\78" has exactly the same meaning. This is supported by FireFox, Opera and Internet Explorer but not in WebKit and would be a complicated change with the current lexer. Can I postpone this change until the custom written parser lands?

Darin Adler

Comment 12 2011-10-28 10:36:28 PDT

(In reply to comment #11) > Can I postpone this change until the custom written parser lands? Sure, but don’t postpone writing a test, please.

Zoltan Herczeg

Comment 13 2011-10-28 11:29:14 PDT

> Sure, but don’t postpone writing a test, please. I don't quite understand your request. Do you need a currently failing test or something else ...?

Darin Adler

Comment 14 2011-10-28 11:51:40 PDT

(In reply to comment #13) > > Sure, but don’t postpone writing a test, please. > > I don't quite understand your request. Do you need a currently failing test or something else ...? Our tests are regression tests that help us notice changes in behavior. A test showing the current behavior, which characterize it either as failing or succeeding, would be great. The expectations would reflect the current behavior. And the expectations, at least, would change in the patch that fixes the bug.

Zoltan Herczeg

Comment 15 2011-11-03 05:32:22 PDT

> Our tests are regression tests that help us notice changes in behavior. A test showing the current behavior, which characterize it either as failing or succeeding, would be great. The expectations would reflect the current behavior. And the expectations, at least, would change in the patch that fixes the bug. I gave this task to Szilard. He will submit patches soon.

Szilard Ledan

Comment 16 2011-11-07 06:04:13 PST

Created attachment 113854 [details] Added a test for css number types, where 'px' is defined like \70\78, etc. Added a test for css number types, where 'px' is defined like \70\78, etc.

Szilard Ledan

Comment 17 2011-11-08 05:48:53 PST

Created attachment 114052 [details] Added three tests for css lexer parsing problems. The first and second are testing the css number types parser, where 'px' defined like escape sequences (\70\78), etc. The third is testing the css square braces block parser. The WebKit css lexer didn't pars correctly the css block. Example: 1. <html>... <p style="font-size: 12px;">...</p>...</html> it's ok 2. <html>... <p style="{font-size: 12px;}">...</p>...</html> it's not.

Zoltan Herczeg

Comment 18 2011-11-08 05:52:50 PST

Hey Szilard, please open a separate bugzilla entry for each patches. Your second patch does not invalidate the first one. Moreover, this is a meta-bug not an entry for specific issue.

Zoltan Herczeg

Comment 19 2011-11-21 04:27:54 PST

I found one more thing: http://www.w3.org/TR/selectors/#structural-pseudos says: The following selectors are therefore equivalent: bar:nth-child(1n+0) /* represents all bar elements, specificity (0,1,1) */ bar:nth-child(n+0) /* same */ bar:nth-child(n) /* same */ bar /* same but lower specificity (0,0,1) */ However, in our tokenizer.flex: {ident} {yyTok = IDENT; return yyTok;} <nthchild>{nth} {yyTok = NTH; return yyTok;} Thus, identifier got higher priority than nth in <nthchild> mode, so 'n' is reported as IDENT not as NTH. I think we should swap these rules. Do you agree?

Darin Adler

Comment 20 2011-11-30 10:03:47 PST

(In reply to comment #19) > However, in our tokenizer.flex: > {ident} {yyTok = IDENT; return yyTok;} > <nthchild>{nth} {yyTok = NTH; return yyTok;} > > Thus, identifier got higher priority than nth in <nthchild> mode, so 'n' is reported as IDENT not as NTH. > > I think we should swap these rules. If this is a bug you should be able to demonstrate it with a test case; you could then try it in other browsers to see if they match the spec. Swapping the rules seems like a fine way to fix the bug if it works.

Eric Seidel (no email)

Comment 21 2012-09-13 09:44:09 PDT

Comment on attachment 114052 [details] Added three tests for css lexer parsing problems. I think you want dumpAsText for these (I see one of them is). I assume other browsers pass these tests?

Build Bot

Comment 22 2013-01-17 03:43:12 PST

Comment on attachment 114052 [details] Added three tests for css lexer parsing problems. Attachment 114052 [details] did not pass mac-wk2-ews (mac-wk2): Output: http://queues.webkit.org/results/15937047 New failing tests: fast/css/parsing-css-wrap.html fast/css/parsing-css-number-types.html fast/css/parsing-css-square-braces.html

Darin Adler

Comment 23 2013-04-08 18:11:53 PDT

Comment on attachment 114052 [details] Added three tests for css lexer parsing problems. View in context: https://bugs.webkit.org/attachment.cgi?id=114052&action=review > LayoutTests/fast/css/parsing-css-wrap-expected.txt:11 > + text run at (0,0) width 205: "TEST DID NOT COMPLETE" It would be much better to have a text result, even if this test won’t complete.

Ahmad Saleem

Comment 24 2022-07-20 17:12:55 PDT

In attached test (Flex killer), Safari 15.6 on macOS 12.5 matches with other browsers (Chrome Canary 105 and Firefox Nightly 104). But from the test case from the patch - there are following differences: Test Case - https://jsfiddle.net/zryLu2pa/show Safari 15.6 and Chrome Canary 105 fails but show following rules for stylesheet: body { background: rgb(255, 255, 255); } While Firefox Nightly 104 does: body { background: rgb(255, 255, 255) none repeat scroll 0% 0%; } __________________ If I am testing incorrectly, please ignore. Thanks!

Brent Fulgham

Comment 25 2022-07-20 17:14:56 PDT

I think if all three browser match behavior, we are probably "RESOLVE | CONFIG CHANGED". Can you try the test case on Firefox or Chrome to see if they match?

Brent Fulgham

Comment 26 2022-07-20 17:15:25 PDT

My guess is that the 11-year-old test is no longer completely aligned with the modern spec.

Ahmad Saleem

Comment 27 2022-07-20 17:21:32 PDT

In one of the test cases, as mentioned in Comment 24 (refer to JSfiddle), Safari 15.5 output is different from Firefox but matches with Chrome. I am not sure on Web Spec so I can't comment whether Safari behavior is right or Firefox. Thanks!

Ahmad Saleem

Comment 28 2022-10-12 14:57:07 PDT

(In reply to Ahmad Saleem from comment #24) > In attached test (Flex killer), Safari 15.6 on macOS 12.5 matches with other > browsers (Chrome Canary 105 and Firefox Nightly 104). > > But from the test case from the patch - there are following differences: > > Test Case - https://jsfiddle.net/zryLu2pa/show > > Safari 15.6 and Chrome Canary 105 fails but show following rules for > stylesheet: > > body { background: rgb(255, 255, 255); } > > While Firefox Nightly 104 does: > > body { background: rgb(255, 255, 255) none repeat scroll 0% 0%; } > > __________________ > > If I am testing incorrectly, please ignore. Thanks! Just to update: Now Safari 16 and TP 155 show: body { background-color: rgb(255, 255, 255); } while Firefox Nightly 107 matched Chrome Canary 108 and shows: body { background: rgb(255, 255, 255); } and now difference is "background-color" vs "background".

Karl Dubost

Comment 29 2022-10-12 23:01:00 PDT

Probably partially a duplicate of Bug 185953 and Bug 245105

Ahmad Saleem

Comment 30 2023-06-08 06:15:57 PDT

(In reply to Ahmad Saleem from comment #28) > (In reply to Ahmad Saleem from comment #24) > > In attached test (Flex killer), Safari 15.6 on macOS 12.5 matches with other > > browsers (Chrome Canary 105 and Firefox Nightly 104). > > > > But from the test case from the patch - there are following differences: > > > > Test Case - https://jsfiddle.net/zryLu2pa/show > > > > Safari 15.6 and Chrome Canary 105 fails but show following rules for > > stylesheet: > > > > body { background: rgb(255, 255, 255); } > > > > While Firefox Nightly 104 does: > > > > body { background: rgb(255, 255, 255) none repeat scroll 0% 0%; } > > > > __________________ > > > > If I am testing incorrectly, please ignore. Thanks! > > Just to update: > > Now Safari 16 and TP 155 show: > > body { background-color: rgb(255, 255, 255); } > > while Firefox Nightly 107 matched Chrome Canary 108 and shows: > > body { background: rgb(255, 255, 255); } > > and now difference is "background-color" vs "background". All browsers (Chrome Canary 116, WebKit ToT, Firefox Nightly 116) show: Test parsing of CSS number types. FAILURE Rules from the stylesheet: body { background: rgb(255, 255, 255); } Expected result: #a { font-size: 16px; } #b { font-size: 16px; } #c { font-size: 16px; } #d { font-size: 16px; } #e { font-size: 1em; } #f { font-size: 1em; } #g { font-size: 1em; } #h { font-size: 1em; } #i { font-size: 12pt; }

Radar WebKit Bug Importer

Comment 31 2024-02-08 15:42:55 PST

<rdar://problem/122588295>

Anne van Kesteren

Comment 32 2024-02-13 03:18:26 PST

Thanks Ahmad, let's consider this closed then.

Note You need to log in before you can comment on or make changes to this bug.