Bug 139014 - Implement parser for :lang pseudo class selector arguments that contain wildcard '*' subtags
Summary: Implement parser for :lang pseudo class selector arguments that contain wildc...
Status: REOPENED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Dhi Aurrahman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-23 03:53 PST by Dhi Aurrahman
Modified: 2014-12-07 22:32 PST (History)
5 users (show)

See Also:


Attachments
Patch (22.93 KB, patch)
2014-11-23 04:11 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff
Patch (22.86 KB, patch)
2014-11-23 20:43 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff
Patch (26.22 KB, patch)
2014-11-25 15:47 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff
Patch (11.18 KB, patch)
2014-12-01 21:58 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff
Patch (40.17 KB, patch)
2014-12-03 01:39 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff
Patch (40.17 KB, patch)
2014-12-03 02:02 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff
Patch (50.77 KB, patch)
2014-12-03 03:28 PST, Dhi Aurrahman
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dhi Aurrahman 2014-11-23 03:53:17 PST
Implement parser for :lang pseudo class selector arguments with wildcard '*' subtags
Comment 1 Dhi Aurrahman 2014-11-23 04:11:56 PST
Created attachment 242133 [details]
Patch
Comment 2 Dhi Aurrahman 2014-11-23 20:43:04 PST
Created attachment 242141 [details]
Patch
Comment 3 Dhi Aurrahman 2014-11-25 15:47:42 PST
Created attachment 242214 [details]
Patch
Comment 4 Benjamin Poulain 2014-11-30 18:31:58 PST
Comment on attachment 242214 [details]
Patch

I'll try to review this tomorrow, I am not doing too well.

Some comments form a quick reading:
-I don't think '*' is valid language range for CSS. The specification says that '*' can be the first character only when followed by an hyphen: "consist of an asterisk (* U+002A) immediately followed by an identifier beginning with an ASCII hyphen (U+002D)".
-The terminals LANGRANGE and IDENT alone have no action, that is suspicious but that may be correct.
-The production for the rule "lang_range" is more complex than I'd like. The more complex a production is, the harder it is to verify its correctness.

I'll need to look at the parser/lexer side of things to see if lang_range can be simplified. Hopefully I'll do that tomorrow.
Comment 5 Dhi Aurrahman 2014-11-30 19:19:47 PST
(In reply to comment #4)
> Comment on attachment 242214 [details]
> Patch
> 
> I'll try to review this tomorrow, I am not doing too well.
> 

Ah, get well soon :).

> Some comments form a quick reading:
> -I don't think '*' is valid language range for CSS. The specification says
> that '*' can be the first character only when followed by an hyphen:
> "consist of an asterisk (* U+002A) immediately followed by an identifier
> beginning with an ASCII hyphen (U+002D)".

Is valid language range for CSS != language range defined in http://www.ietf.org/rfc/rfc3066.txt (*) section 2.5? 

(*) In that rfc3066, the language range is denoted as: 
The following definition of language-range is derived from HTTP/1.1 [RFC 2616].

   language-range  = language-tag / "*"

That is, a language-range has the same syntax as a language-tag, or is the single character "*".

> -The terminals LANGRANGE and IDENT alone have no action, that is suspicious
> but that may be correct.
> -The production for the rule "lang_range" is more complex than I'd like. The
> more complex a production is, the harder it is to verify its correctness.

Yes it is, I tempted to use the `goto` statement, but I'm not sure if that path will lead us to simpler production.

> 
> I'll need to look at the parser/lexer side of things to see if lang_range
> can be simplified. Hopefully I'll do that tomorrow.

Thanks!
Comment 6 Benjamin Poulain 2014-12-01 13:05:45 PST
(In reply to comment #5)
> > Some comments form a quick reading:
> > -I don't think '*' is valid language range for CSS. The specification says
> > that '*' can be the first character only when followed by an hyphen:
> > "consist of an asterisk (* U+002A) immediately followed by an identifier
> > beginning with an ASCII hyphen (U+002D)".
> 
> Is valid language range for CSS != language range defined in
> http://www.ietf.org/rfc/rfc3066.txt (*) section 2.5? 
> 
> (*) In that rfc3066, the language range is denoted as: 
> The following definition of language-range is derived from HTTP/1.1 [RFC
> 2616].
> 
>    language-range  = language-tag / "*"
> 
> That is, a language-range has the same syntax as a language-tag, or is the
> single character "*".

It looks like the specification of CSS is more restrictive than the definition of range in rfc3066 (http://dev.w3.org/csswg/selectors4/#the-lang-pseudo).

As far as I know, the CSS working group has given no rationale for this.
Comment 7 Benjamin Poulain 2014-12-01 13:48:20 PST
Comment on attachment 242214 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=242214&action=review

Looking at the parser, the problem is that you are interpreting the range as defined by BCP47. Here we are in a much more restricted context. Since we are parsing a stylesheet, we are expected to parse using the rules of CSS, then try to follow BCP 47 as closely as possible for interpreting the range.

CSS Level 4 defines the valid input of :lang() as (http://dev.w3.org/csswg/selectors4/#the-lang-pseudo): "Each language range in :lang() must be a valid CSS identifier  [CSS21] or consist of an asterisk (* U+002A) immediately followed by an identifier beginning with an ASCII hyphen (U+002D) for the selector to be valid.".

This implies we should use all the rules of CSS Identifiers for interpreting the input. For example, with CSS, you can escape characters with \. A range with a star would be encoded in CSS as :lang(foo-\*-bar). You can even include parenthesis in the range by escaping them, something like :lang(foo-\(bar\)) probably already work today.

I think CharacterAsterisk + "currentCharacter<SrcCharacterType>() == '-'" is a good start to get the new token. But then I would try to take advantage of parseIdentifier() to get the remaining characters.

> Source/WebCore/css/CSSParser.cpp:11531
> +            String range;

Using String::append() does a new memory allocation for every call. When creating String from a series of parts, we use StringBuilder to reduce the allocation impact.
Comment 8 Benjamin Poulain 2014-12-01 13:48:51 PST
Sorry it took me so long to do the first round of review.
Comment 9 Dhi Aurrahman 2014-12-01 21:58:54 PST
Created attachment 242384 [details]
Patch
Comment 10 Benjamin Poulain 2014-12-02 10:51:33 PST
Comment on attachment 242384 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=242384&action=review

You removed a bunch of tests between the last two iterations. You had many great tests in the previous iteration, it is sad to lose them.

What I would do is:
-Create a new test parsing-css-lang.html
-Add all the tests you had previously, just change the expected results to your new parser.
-Duplicate some of the tests you had previously and use escape characters in there.

You can check the tests parsing-matches-X.html, parsing-not-X.html, parsing-nth-child-X.html. Those are extensive parsing tests.

Appart from that, the new grammar looks much simpler, I like it.

> Source/WebCore/css/CSSParser.cpp:11501
> +            result++;

I would do
    result = currentCharacter<SrcCharacterType>();
...it seems clearer.

> Source/WebCore/css/CSSParser.cpp:11502
> +            parseIdentifier(result, yylval->string, hasEscape);

I find it confusing to use yylval->string here, I would use a temporary variable, then yylval->string.init(parsedLangRange.toString()); later initialize the yylval.

I think you will hit an assertion if you use this code with "*--".
Comment 11 Dhi Aurrahman 2014-12-02 16:47:00 PST
I have a question, related to the spec and escaped '*'.

For example we have:

:lang(*-foo-\*), after parsing, it can be then serialized as :lang(*-foo-*), which has '*' that is not at the beginning of the language range. Should it be treated as invalid range? Since in the spec http://dev.w3.org/csswg/selectors4/#the-lang-pseudo EXAMPLE 31, there is a note:

"Note that asterisks are not allowed anywhere else in :lang()'s language range syntax: they only have meaning, and are therefore only allowed, at the beginning."

Thanks!
Comment 12 Benjamin Poulain 2014-12-02 17:36:47 PST
(In reply to comment #11)
> I have a question, related to the spec and escaped '*'.
> 
> For example we have:
> 
> :lang(*-foo-\*), after parsing, it can be then serialized as :lang(*-foo-*),
> which has '*' that is not at the beginning of the language range. Should it
> be treated as invalid range? Since in the spec
> http://dev.w3.org/csswg/selectors4/#the-lang-pseudo EXAMPLE 31, there is a
> note:
> 
> "Note that asterisks are not allowed anywhere else in :lang()'s language
> range syntax: they only have meaning, and are therefore only allowed, at the
> beginning."

That's a very good question. That line in Example 31 is ambiguous.

I think that sentence is wrong, or was referring to the raw input. In any case, I have asked the CSS Working Group to clarify: http://lists.w3.org/Archives/Public/www-style/2014Dec/0046.html
Comment 13 Dhi Aurrahman 2014-12-03 01:39:19 PST
Created attachment 242479 [details]
Patch
Comment 14 Dhi Aurrahman 2014-12-03 01:43:04 PST
(In reply to comment #12)
> (In reply to comment #11)
> That's a very good question. That line in Example 31 is ambiguous.
> 
> I think that sentence is wrong, or was referring to the raw input. In any
> case, I have asked the CSS Working Group to clarify:
> http://lists.w3.org/Archives/Public/www-style/2014Dec/0046.html

Thanks for posting the question to the www-style group.

I know we are still waiting for the clarification of http://lists.w3.org/Archives/Public/www-style/2014Dec/0046.html, but I guess with this #4 patch I'm trying to grab your idea, especially for the tests.
Comment 15 Dhi Aurrahman 2014-12-03 02:02:19 PST
Created attachment 242482 [details]
Patch
Comment 16 Dhi Aurrahman 2014-12-03 03:28:46 PST
Created attachment 242487 [details]
Patch
Comment 17 Benjamin Poulain 2014-12-03 18:15:44 PST
Comment on attachment 242487 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=242487&action=review

I like the patch and the test. If we do not get feedback from the CSS WG tomorrow, let's move forward and review it.

> LayoutTests/fast/css/css-selector-text-expected.txt:323
> +PASS parseThenSerializeRule(':lang(*-foo-\\0072 aisin) { }') is ':lang(*-foo-raisin) { }'
> +PASS parseThenSerializeRule(':lang(*-foo-\\0062 \\0061 r) { }') is ':lang(*-foo-bar) { }'

Interesting.
Comment 18 Benjamin Poulain 2014-12-05 20:28:39 PST
Comment on attachment 242487 [details]
Patch

The patch looks great. Very thorough testing, I like it.

Let's land it
Comment 19 WebKit Commit Bot 2014-12-05 21:08:42 PST
Comment on attachment 242487 [details]
Patch

Clearing flags on attachment: 242487

Committed r176902: <http://trac.webkit.org/changeset/176902>
Comment 20 WebKit Commit Bot 2014-12-05 21:08:46 PST
All reviewed patches have been landed.  Closing bug.
Comment 21 Mark Lam 2014-12-06 11:37:14 PST
This cause 2 crashes in the following tests due to assertion failures on debug builds:
fast/css/parsing-css-lang.html
fast/css/css-selector-text.html	

See:
https://build.webkit.org/builders/Apple%20Yosemite%20Debug%20WK2%20%28Tests%29/builds/806
https://build.webkit.org/results/Apple%20Yosemite%20Debug%20WK2%20(Tests)/r176902%20(806)/results.html

The tests have been skipped in r176908: <http://trac.webkit.org/r176908>.

Please un-skip them when the issue has been fixed.  Thanks.
Comment 22 Alexey Proskuryakov 2014-12-07 11:00:14 PST
Wouldn't it be more appropriate to roll out? It doesn't seem good to have these failures in the tree for any extended period of time (and it's already been more than a day).
Comment 23 Benjamin Poulain 2014-12-07 13:12:06 PST
(In reply to comment #22)
> Wouldn't it be more appropriate to roll out? It doesn't seem good to have
> these failures in the tree for any extended period of time (and it's already
> been more than a day).

I'll have a look. We can also revert without any impact, this is an experimental feature (CSS_SELECTORS_LEVEL4).
Comment 24 Benjamin Poulain 2014-12-07 22:32:37 PST
Patch for review: https://bugs.webkit.org/show_bug.cgi?id=139379