Bug 70107 - Custom written CSS lexer
Summary: Custom written CSS lexer
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: CSS (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Enhancement
Assignee: Nobody
URL:
Keywords:
Depends on: 69083
Blocks:
  Show dependency treegraph
 
Reported: 2011-10-14 06:35 PDT by Zoltan Herczeg
Modified: 2012-05-10 22:35 PDT (History)
21 users (show)

See Also:


Attachments
draft patch (52.40 KB, patch)
2011-10-14 06:36 PDT, Zoltan Herczeg
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
next draft (52.45 KB, patch)
2011-10-17 03:56 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
core patch (52.72 KB, patch)
2011-12-12 00:57 PST, Zoltan Herczeg
oliver: review-
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
CPU cycle measure (2.64 KB, patch)
2011-12-12 23:32 PST, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
core patch2 (55.29 KB, patch)
2011-12-13 01:35 PST, Zoltan Herczeg
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
full patch, build testing (76.65 KB, patch)
2011-12-20 03:20 PST, Zoltan Herczeg
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
updated draft patch (81.20 KB, patch)
2012-01-03 05:20 PST, Zoltan Herczeg
webkit.review.bot: commit-queue-
Details | Formatted Diff | Diff
release patch (84.87 KB, patch)
2012-01-05 01:27 PST, Zoltan Herczeg
oliver: review-
Details | Formatted Diff | Diff
next patch (78.82 KB, patch)
2012-01-11 04:47 PST, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
patch update (78.61 KB, patch)
2012-01-24 01:21 PST, Zoltan Herczeg
koivisto: review+
ap: commit-queue-
Details | Formatted Diff | Diff
patch with renamings (83.83 KB, patch)
2012-01-25 06:01 PST, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
parsing test (58.42 KB, application/octet-stream)
2012-01-30 05:12 PST, Zoltan Herczeg
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Zoltan Herczeg 2011-10-14 06:35:56 PDT
In the current state it is far from ready, I am just curious about EWS.

On methanol it is more than twice (~2.18x) as fast as the original.

The result of 3x2 runs in CPU cycles
Original:  1354030040 1377989672 1345539012
Optimized:  622771368  614838444  635476680
Comment 1 Zoltan Herczeg 2011-10-14 06:36:37 PDT
Created attachment 111009 [details]
draft patch
Comment 2 Zoltan Herczeg 2011-10-14 06:38:20 PDT
> On methanol it is more than twice (~2.18x) as fast as the original.

I mean the sum of CSSParser::lex(...) function calls
Comment 3 WebKit Review Bot 2011-10-14 06:38:38 PDT
Attachment 111009 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/css/CSSParser.cpp', u'Sourc..." exit_code: 1

Source/WebCore/css/CSSParser.cpp:6993:  Boolean expressions that span multiple lines should have their operators on the left side of the line instead of the right side.  [whitespace/operators] [4]
Source/WebCore/css/CSSParser.cpp:7686:  An else if statement should be written as an if statement when the prior "if" concludes with a return, break, continue or goto statement.  [readability/control_flow] [4]
Source/WebCore/css/CSSParser.cpp:7950:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7952:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953:  Semicolon defining empty statement for this loop. Use { } instead.  [whitespace/semicolon] [5]
Total errors found: 6 in 2 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 4 WebKit Review Bot 2011-10-14 08:05:18 PDT
Comment on attachment 111009 [details]
draft patch

Attachment 111009 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10054557

New failing tests:
inspector/styles/styles-source-offsets.html
inspector/elements/elements-panel-styles.html
inspector/styles/styles-new-API.html
inspector/styles/styles-source-lines.html
Comment 5 Zoltan Herczeg 2011-10-17 03:56:46 PDT
Created attachment 111241 [details]
next draft
Comment 6 Zoltan Herczeg 2011-10-17 03:59:40 PDT
Layout and style (except debug) errors are fixed.

The question now: what is the correct token style?
Comment 7 WebKit Review Bot 2011-10-17 03:59:57 PDT
Attachment 111241 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/css/CSSParser.cpp', u'Sourc..." exit_code: 1

Source/WebCore/css/CSSParser.cpp:7950:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7952:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953:  More than one command on the same line in if  [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953:  Semicolon defining empty statement for this loop. Use { } instead.  [whitespace/semicolon] [5]
Total errors found: 4 in 2 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 8 Zoltan Herczeg 2011-12-12 00:57:53 PST
Created attachment 118748 [details]
core patch

I think the tokens supposedly parsed by CSS are finally covered, and it is time for reviewing the custom written css tokenizer.

The patch is not ready, it has no ChangeLog, and tokenizer.flex is not removed form the trunk. But I would like to hear your opinion since I am sure you have several ideas to improve the patch.
Comment 9 WebKit Review Bot 2011-12-12 01:32:11 PST
Comment on attachment 118748 [details]
core patch

Attachment 118748 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10841018

New failing tests:
fast/regions/parsing-region-style-rule.html
Comment 10 Antti Koivisto 2011-12-12 09:36:44 PST
Someone who knows about parsers should take a look. Oliver perhaps?
Comment 11 Oliver Hunt 2011-12-12 13:09:50 PST
Comment on attachment 118748 [details]
core patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118748&action=review

There seems to be a lot of code in this that assumes null-termination of all strings passed to the lexer -- is this guaranteed?  For instance if i do var foo = "some big string"; someElement.style=foo.substring(2, 8);  (not sure if this exact statement is valid), we may return a StringImpl that is referencing the middle of another string so null termination won't be guaranteed.

At the very least i'd like assertions that m_data never exceeds the buffer length.

> Source/WebCore/css/CSSParser.cpp:7388
> +    return chr | 0x20;

Can you add an assertion that chr is in fact an ascii letter?

> Source/WebCore/css/CSSParser.cpp:7394
> +        || (data[0] == '\\' && isCSSEscape(data[1]));

How do we guarantee data is at least 2 characters long?

> Source/WebCore/css/CSSParser.cpp:7430
> +        } while (isASCIIHexDigit(*data) && --length);

I'd like assertions that data doesn't extend beyond the end of the buffer.

> Source/WebCore/css/CSSParser.cpp:7533
> +        if (UNLIKELY(*m_data == quote)) {
> +            // String parsing is done.
> +            ++m_data;
> +            return;
> +        }

What happens if we have an unterminated string? afaict we just walk off the end of the buffer.
Comment 12 Zoltan Herczeg 2011-12-12 13:25:49 PST
Hey Oliver,

actually the CSS parser working in an internal buffer, which is terminated by two '\0'-s, see the following code:

void CSSParser::setupParser(const char* prefix, const String& string, const char* suffix)
{
    int length = string.length() + strlen(prefix) + strlen(suffix) + 2;

    fastFree(m_data);
    m_data = static_cast<UChar*>(fastMalloc(length * sizeof(UChar)));
    for (unsigned i = 0; i < strlen(prefix); i++)
        m_data[i] = prefix[i];

    memcpy(m_data + strlen(prefix), string.characters(), string.length() * sizeof(UChar));

    unsigned start = strlen(prefix) + string.length();
    unsigned end = start + strlen(suffix);
    for (unsigned i = start; i < end; i++)
        m_data[i] = suffix[i - start];

    m_data[length - 1] = 0;
    m_data[length - 2] = 0;

    yy_hold_char = 0;
    yyleng = 0;
    yytext = yy_c_buf_p = m_data;
    yy_hold_char = *yy_c_buf_p;
    resetRuleBodyMarks();
}

Flex needs this, and I actually liked this approach, since it requires less buffer range checks (you know that any non \0 character must be followed at least two \0-s). CSS actually designed that way that '\0' is not part of any token.
Comment 13 Oliver Hunt 2011-12-12 13:28:17 PST
(In reply to comment #12)
> Hey Oliver,
> 
> actually the CSS parser working in an internal buffer, which is terminated by two '\0'-s, see the following code:
> 
> void CSSParser::setupParser(const char* prefix, const String& string, const char* suffix)
> {
>     int length = string.length() + strlen(prefix) + strlen(suffix) + 2;
> 
>     fastFree(m_data);
>     m_data = static_cast<UChar*>(fastMalloc(length * sizeof(UChar)));
>     for (unsigned i = 0; i < strlen(prefix); i++)
>         m_data[i] = prefix[i];
> 
>     memcpy(m_data + strlen(prefix), string.characters(), string.length() * sizeof(UChar));
> 
>     unsigned start = strlen(prefix) + string.length();
>     unsigned end = start + strlen(suffix);
>     for (unsigned i = start; i < end; i++)
>         m_data[i] = suffix[i - start];
> 
>     m_data[length - 1] = 0;
>     m_data[length - 2] = 0;
> 
>     yy_hold_char = 0;
>     yyleng = 0;
>     yytext = yy_c_buf_p = m_data;
>     yy_hold_char = *yy_c_buf_p;
>     resetRuleBodyMarks();
> }
> 
> Flex needs this, and I actually liked this approach, since it requires less buffer range checks (you know that any non \0 character must be followed at least two \0-s). CSS actually designed that way that '\0' is not part of any token.

On the downside it requires a copy of the css source -- is a well predicted branch really more expensive than a copy?
Comment 14 Oliver Hunt 2011-12-12 13:33:55 PST
Oh, and i still don't see how the null termination will stop the string parsing from running over the end of the buffer

(In reply to comment #12)
> Hey Oliver,
> 
> actually the CSS parser working in an internal buffer, which is terminated by two '\0'-s, see the following code:
> 
> void CSSParser::setupParser(const char* prefix, const String& string, const char* suffix)
> {
>     int length = string.length() + strlen(prefix) + strlen(suffix) + 2;
> 
>     fastFree(m_data);
>     m_data = static_cast<UChar*>(fastMalloc(length * sizeof(UChar)));
>     for (unsigned i = 0; i < strlen(prefix); i++)
>         m_data[i] = prefix[i];
> 
>     memcpy(m_data + strlen(prefix), string.characters(), string.length() * sizeof(UChar));
> 
>     unsigned start = strlen(prefix) + string.length();
>     unsigned end = start + strlen(suffix);
>     for (unsigned i = start; i < end; i++)
>         m_data[i] = suffix[i - start];
> 
>     m_data[length - 1] = 0;
>     m_data[length - 2] = 0;
> 
>     yy_hold_char = 0;
>     yyleng = 0;
>     yytext = yy_c_buf_p = m_data;
>     yy_hold_char = *yy_c_buf_p;
>     resetRuleBodyMarks();
> }
> 
> Flex needs this, and I actually liked this approach, since it requires less buffer range checks (you know that any non \0 character must be followed at least two \0-s). CSS actually designed that way that '\0' is not part of any token.
Comment 15 Andreas Kling 2011-12-12 14:49:58 PST
(In reply to comment #0)
> In the current state it is far from ready, I am just curious about EWS.
> 
> On methanol it is more than twice (~2.18x) as fast as the original.
> 
> The result of 3x2 runs in CPU cycles
> Original:  1354030040 1377989672 1345539012
> Optimized:  622771368  614838444  635476680

Are these callgrind CPU cycle counts, or measured using a hardware CPU profiler? What's the improvement in wall-time? I'm curious because I've seen some rather unbelievable improvement numbers in association with the "methanol" benchmark more than once before and I'm not convinced it can be trusted. (Examples: bug 66851, bug 64262, bug 63909)
Comment 16 Benjamin Poulain 2011-12-12 14:58:51 PST
> On methanol it is more than twice (~2.18x) as fast as the original.

I have seen multiple false or totally erroneous claims based on this "methanol" benchmark.

Could you please provide a cross-platform stress test (a webpage or just C++ calling the lexer) so others can verify the results?
Comment 17 Benjamin Poulain 2011-12-12 15:12:33 PST
By the way, I think you should announce the intent of changing the CSS lexer on WebKit-dev. Given the scope of the change, it would be polite to give visibility for every port.
Comment 18 Zoltan Herczeg 2011-12-12 23:32:21 PST
Created attachment 118964 [details]
CPU cycle measure

The attached patch works with Qt, gcc and x86, but it can be extended to other ports / CPUs if anyone needs it.

I measured the values on an Intel Xeon server, since the cycle counters of all cores are synchronized there. Therefore you don't need to worry about CPU affinity. On other machines you may experince "jumpings" as the kernel migrates the process, that is why I like the Xeon. I saw speedup between 2x - 2.5x on any pages, so you can use any page to measure it.

The reason of using cycle counters is based on that observation that the tokenizer called thousands of times, but a single run itself takes little time. Thus statistic based measurements are useless in our case.
Comment 19 Oliver Hunt 2011-12-12 23:42:38 PST
Based on other patches i've seen go through it might be worth checking against html5 spec load time (although i suspect that's mostly style calc/resolution rather than rule parsing).  Antti can probably give some info on how he's measuring.

Can you explain to me how the quoted string lexing doesn't run off the end of the buffer though?  I still can't see it.
Comment 20 Zoltan Herczeg 2011-12-12 23:49:47 PST
> Can you explain to me how the quoted string lexing doesn't run off the end of the buffer though?  I still can't see it.

I am on #webkit now, we can discuss there if you have a little time.
Comment 21 Zoltan Herczeg 2011-12-13 00:04:36 PST
> > Source/WebCore/css/CSSParser.cpp:7388
> > +    return chr | 0x20;
> 
> Can you add an assertion that chr is in fact an ascii letter?

It doesn't need to be one. I saw this trick in JavaScriptCore/wtf/ASCIICType.h

(chr | 0x20) == 'a' if and only if chr is 'A' or 'a'. Nothing else can satisfy this expression, and I use it for character comparison many times.

The only exception is the second isCaselessEqual, where I describe the rules applies to that function in the comment, and assertions were added to check the rules. Basically if constantString contains minus sign '-', the cssString must contains only cssLetter-s.

> 
> > Source/WebCore/css/CSSParser.cpp:7394
> > +        || (data[0] == '\\' && isCSSEscape(data[1]));
> 
> How do we guarantee data is at least 2 characters long?

C++ has a short circuit expression evaluation method, so data[0] == '\\' must be true before it checks data[1].

> Can you explain to me how the quoted string lexing doesn't run off the end of the buffer though?  I still can't see it.

You are right here, checkString must return with fail if it encounters a \0.
Comment 22 Oliver Hunt 2011-12-13 00:05:18 PST
Comment on attachment 118748 [details]
core patch

View in context: https://bugs.webkit.org/attachment.cgi?id=118748&action=review

> Source/WebCore/css/CSSParser.cpp:7460
> +        if (UNLIKELY(*data == quote)) {

We don't take this path if we've reached the end of the buffer as *data == 0

> Source/WebCore/css/CSSParser.cpp:7464
> +        if (UNLIKELY(*data < 14 && *data > 9 && *data != 11)) {

This is checking for *data == 10, 12 or 13 so we don't match 0 here.

> Source/WebCore/css/CSSParser.cpp:7469
> +        if (LIKELY(data[0] != '\\'))

data[0] is 0 (because we're at the end of the buffer) so we take this path, increment data, and go back to the head of the loop
Comment 23 Zoltan Herczeg 2011-12-13 00:08:24 PST
(In reply to comment #22)
> (From update of attachment 118748 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=118748&action=review
> 
> > Source/WebCore/css/CSSParser.cpp:7460
> > +        if (UNLIKELY(*data == quote)) {
> 
> We don't take this path if we've reached the end of the buffer as *data == 0
> 
> > Source/WebCore/css/CSSParser.cpp:7464
> > +        if (UNLIKELY(*data < 14 && *data > 9 && *data != 11)) {
> 
> This is checking for *data == 10, 12 or 13 so we don't match 0 here.
> 
> > Source/WebCore/css/CSSParser.cpp:7469
> > +        if (LIKELY(data[0] != '\\'))
> 
> data[0] is 0 (because we're at the end of the buffer) so we take this path, increment data, and go back to the head of the loop

I totally agree this one as I said above :) I fix it of course.
Comment 24 Oliver Hunt 2011-12-13 00:14:01 PST
> > > Source/WebCore/css/CSSParser.cpp:7394
> > > +        || (data[0] == '\\' && isCSSEscape(data[1]));
> > 
> > How do we guarantee data is at least 2 characters long?
> 
> C++ has a short circuit expression evaluation method, so data[0] == '\\' must be true before it checks data[1].
Oh right, and we guarantee null termination so even if i have a \ at the end of input this will work.

I don't really like all these dependencies on having 1+ null terminations on the input as it will make it harder to safely move to a non-copying lexer in future, but i recognize this is essentially existing behavior so you don't need to change it in this patch.
Comment 25 Zoltan Herczeg 2011-12-13 00:24:38 PST
> On the downside it requires a copy of the css source -- is a well predicted branch really more expensive than a copy?

This cannot be avoided, since the CSS tokenizer overwrites escape sequences to real characters: Identifier "x\61y" is changed to "xay" before further processing. So we can only avoid buffer dumplication if the input does not contain any escape sequences. We need to take care of prefix and postfix strings as well somehow. And the worst thing is we don't get the ownership of the input string (not even a RefPtr), so it can be freed during tokenizing. What do you think?
Comment 26 Zoltan Herczeg 2011-12-13 01:35:17 PST
Created attachment 118978 [details]
core patch2

Fixed the string parsing bug and added new comments.
Comment 27 WebKit Review Bot 2011-12-13 02:25:20 PST
Comment on attachment 118978 [details]
core patch2

Attachment 118978 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10844274

New failing tests:
fast/regions/parsing-region-style-rule.html
Comment 28 Oliver Hunt 2011-12-13 08:40:17 PST
(In reply to comment #25)
> > On the downside it requires a copy of the css source -- is a well predicted branch really more expensive than a copy?
> 
> This cannot be avoided, since the CSS tokenizer overwrites escape sequences to real characters: Identifier "x\61y" is changed to "xay" before further processing. So we can only avoid buffer dumplication if the input does not contain any escape sequences. We need to take care of prefix and postfix strings as well somehow. And the worst thing is we don't get the ownership of the input string (not even a RefPtr), so it can be freed during tokenizing. What do you think?

The same problem exists when parsing JS -- but copying all strings is a pessimisation, you should only need to copy when there are escape sequences, and they are very rare.  Once again: a change to make it non-copying doesn't need to be in this patch :D
Comment 29 Zoltan Herczeg 2011-12-13 13:50:07 PST
Thanks for the review Oliver. Is there anything else I should do with the core part?
Comment 30 Zoltan Herczeg 2011-12-20 03:20:05 PST
Created attachment 120002 [details]
full patch, build testing
Comment 31 WebKit Review Bot 2011-12-20 15:00:02 PST
Comment on attachment 120002 [details]
full patch, build testing

Attachment 120002 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/10992332

New failing tests:
fast/regions/parsing-region-style-rule.html
Comment 32 Benjamin Poulain 2011-12-27 03:39:49 PST
Comment on attachment 120002 [details]
full patch, build testing

View in context: https://bugs.webkit.org/attachment.cgi?id=120002&action=review

I just quickly skimmed over the path. Some comments bellow.
A general concern is lot of code have the assumption the input is ASCII. Is there no need to handle unicode (identifiers, space, digits, etc?).

> Source/WebCore/css/CSSParser.cpp:7446
> +    while (UNLIKELY(*data <= ' ' && (typesOfASCIICharacters[*data] == CharacterWhiteSpace)))

I am not sure the UNLIKELY() is a good idea here. A bunch of spaces grouped together seems very likely :)

> Source/WebCore/css/CSSParser.cpp:8032
> +    // This is not a real loop, just allows us to avoid gotos.
> +    while (true) {

Where would the Goto be? I only see one break at the end. The body is a switch().

> Source/WebCore/css/CSSParser.cpp:8098
> +            if (!isASCIIDigit(m_data[0]))
> +                break;

Isn't this redundant with the following code?
The code of CharacterNumber also handle the dot case.

> Source/WebCore/css/CSSParser.cpp:8102
> +            bool dotSeen = (m_token == '.');

I haven't read the purpose of this yet but that but I have only seen m_token used with named contants everywhere else.
Comment 33 Zoltan Herczeg 2012-01-02 02:19:36 PST
Hey Benjamin,

thank you for the review and Happy New Year!

> A general concern is lot of code have the assumption the input is ASCII. Is there no need to handle unicode (identifiers, space, digits, etc?).

Yes. In CSS everything > 127 is called "nonascii" and treated as part of an identifier.

> > Source/WebCore/css/CSSParser.cpp:8032
> > +    // This is not a real loop, just allows us to avoid gotos.
> > +    while (true) {
> 
> Where would the Goto be? I only see one break at the end. The body is a switch().

It is for /* */ comments. See "case CharacterSlash". Although that is the only place, so we could change that to a goto if you prefer that.

> 
> > Source/WebCore/css/CSSParser.cpp:8098
> > +            if (!isASCIIDigit(m_data[0]))
> > +                break;
> 
> Isn't this redundant with the following code?
> The code of CharacterNumber also handle the dot case.

A little more difficult. A single '.' is a dot token, while '.0' is a number (zero in this example).

> 
> > Source/WebCore/css/CSSParser.cpp:8102
> > +            bool dotSeen = (m_token == '.');
> 
> I haven't read the purpose of this yet but that but I have only seen m_token used with named contants everywhere else.

All m_token < 256 are single character tokens. CSSGrammar.y does not define named constant for dot but I can define one if you think it is better.

I'll fix the unlikely issue.
Comment 34 Antti Koivisto 2012-01-02 03:32:16 PST
(In reply to comment #33)
> > > Source/WebCore/css/CSSParser.cpp:8032
> > > +    // This is not a real loop, just allows us to avoid gotos.
> > > +    while (true) {
> > 
> > Where would the Goto be? I only see one break at the end. The body is a switch().
> 
> It is for /* */ comments. See "case CharacterSlash". Although that is the only place, so we could change that to a goto if you prefer that.

I don't think WebKit has any sort of strict anti-goto policy. If goto is the most natural way to write this then you should just use it.
Comment 35 Zoltan Herczeg 2012-01-03 05:20:44 PST
Created attachment 120935 [details]
updated draft patch

All requestes are fixed
Comment 36 WebKit Review Bot 2012-01-03 05:22:40 PST
Attachment 120935 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1

Source/WebCore/css/CSSParser.cpp:1424:  One space before end of line comments  [whitespace/comments] [5]
Source/WebCore/css/CSSParser.cpp:5185:  One space before end of line comments  [whitespace/comments] [5]
Total errors found: 2 in 2 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 37 WebKit Review Bot 2012-01-03 06:10:48 PST
Comment on attachment 120935 [details]
updated draft patch

Attachment 120935 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/11074262

New failing tests:
fast/regions/parsing-region-style-rule.html
Comment 38 Zoltan Herczeg 2012-01-04 05:47:12 PST
> New failing tests:
> fast/regions/parsing-region-style-rule.html

The bot always complain about this, but I couldn't find it in its output (and it works with Qt). Can I check this difference somehow?
Comment 39 Adam Barth 2012-01-04 07:42:59 PST
> The bot always complain about this, but I couldn't find it in its output (and it works with Qt). Can I check this difference somehow?

There's no way to read this information back from the bot currently.  (That's our number 1 feature request.)  If you like, you can try building the Chromium port locally and try running the test:

$ ./Tools/Scripts/update-webkit --chromium
$ ./Tools/Scripts/build-webkit --chromium

Otherwise, you can move forward with your patch, but it would be nice to keep an eye on build.webkit.org when landing to see if this problem manifests itself on the buildbots.
Comment 40 Zoltan Herczeg 2012-01-05 01:27:36 PST
Created attachment 121239 [details]
release patch

I think I fixed everything. Changelog added. Ready for review.
Comment 41 WebKit Review Bot 2012-01-05 01:29:02 PST
Attachment 121239 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'ChangeLog', u'Source/WebCore/CMakeLists.tx..." exit_code: 1

Source/WebCore/css/CSSParser.cpp:1424:  One space before end of line comments  [whitespace/comments] [5]
Total errors found: 1 in 4 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 42 Zoltan Herczeg 2012-01-05 23:51:31 PST
The patch now passes all tests, and I think it is ready for a review. Oliver, Benjamin what do you think of this patch?
Comment 43 Oliver Hunt 2012-01-06 13:36:21 PST
Comment on attachment 121239 [details]
release patch

View in context: https://bugs.webkit.org/attachment.cgi?id=121239&action=review

Okay, i'm doing a preliminary r- due to the large number of unnecessary formatting changes sorry.  A patch of this scale and complexity shouldn't have unnecessary no-op changes in it :-/

I'll continue with actual content review, but this should let you start working on the formatting stuff without waiting for me to finish this review

> Source/WebCore/css/CSSParser.cpp:263
> -    int length = string.length() + strlen(prefix) + strlen(suffix) + 2;
> +    int length = string.length() + strlen(prefix) + strlen(suffix) + 1;

Why have we dropped to only a single extra null terminator?  I recall (though haven't got that far in the patch) that we rely on two nulls at the end of the string

> Source/WebCore/css/CSSParser.cpp:925
> -        if (id == CSSValueStatic ||
> -             id == CSSValueRelative ||
> -             id == CSSValueAbsolute ||
> -             id == CSSValueFixed)
> +        if (id == CSSValueStatic
> +            || id == CSSValueRelative
> +            || id == CSSValueAbsolute
> +            || id == CSSValueFixed)

This change seems unnecessary formatting which i'd rather avoid in a patch of this size and complexity

> Source/WebCore/css/CSSParser.cpp:939
> -        if (id == CSSValueAuto ||
> -             id == CSSValueAlways ||
> -             id == CSSValueAvoid ||
> -             id == CSSValueLeft ||
> -             id == CSSValueRight)
> +        if (id == CSSValueAuto
> +            || id == CSSValueAlways
> +            || id == CSSValueAvoid
> +            || id == CSSValueLeft
> +            || id == CSSValueRight)

ditto

> Source/WebCore/css/CSSParser.cpp:952
> -        if (id == CSSValueShow ||
> -             id == CSSValueHide)
> +        if (id == CSSValueShow
> +            || id == CSSValueHide)

ditto

> Source/WebCore/css/CSSParser.cpp:965
> -        if (id == CSSValueNormal ||
> -            id == CSSValuePre ||
> -            id == CSSValuePreWrap ||
> -            id == CSSValuePreLine ||
> -            id == CSSValueNowrap)
> +        if (id == CSSValueNormal
> +            || id == CSSValuePre
> +            || id == CSSValuePreWrap
> +            || id == CSSValuePreLine
> +            || id == CSSValueNowrap)

ditto

>> Source/WebCore/css/CSSParser.cpp:1424
>> -     case CSSPropertyCounterReset:        // [ <identifier> <integer>? ]+ | none | inherit
>> +    case CSSPropertyCounterReset:        // [ <identifier> <integer>? ]+ | none | inherit
> 
> One space before end of line comments  [whitespace/comments] [5]

ditto

> Source/WebCore/css/CSSParser.cpp:2025
> -             validPrimitive = true;
> +            validPrimitive = true;

ditto

> Source/WebCore/css/CSSParser.cpp:2894
> -       return cssValuePool()->createIdentifierValue(id);
> +        return cssValuePool()->createIdentifierValue(id);

ditto

> Source/WebCore/css/CSSParser.cpp:5186
> -                              // it is a color, but a color isn't allowed at this point.
> +                          // it is a color, but a color isn't allowed at this point.

ditto

> Source/WebCore/css/CSSParser.cpp:5474
> -     return id == CSSValueStretch || id == CSSValueRepeat || id == CSSValueSpace || id == CSSValueRound;
> +    return id == CSSValueStretch || id == CSSValueRepeat || id == CSSValueSpace || id == CSSValueRound;

ditto

> Source/WebCore/css/CSSParser.cpp:5559
> -             m_left = m_cssValuePool->createValue(m_right->getDoubleValue(), (CSSPrimitiveValue::UnitTypes)m_right->primitiveType());
> +            m_left = m_cssValuePool->createValue(m_right->getDoubleValue(), (CSSPrimitiveValue::UnitTypes)m_right->primitiveType());

ditto
Comment 44 Oliver Hunt 2012-01-06 13:50:17 PST
Comment on attachment 121239 [details]
release patch

View in context: https://bugs.webkit.org/attachment.cgi?id=121239&action=review

> Source/WebCore/css/CSSParser.cpp:7494
> +        if (UNLIKELY(*data < 14 && (!*data || *data == 10 || (*data | 0x1) == 13))) {
> +            // String parsing is failed for character 0, 10, 12 or 13.
> +            return 0;
> +        }

yucky constants -- could you use '\n' etc? e.g. *data <= '\r' (i think?), etc

> Source/WebCore/css/CSSParser.cpp:7561
> +        ASSERT(*m_data >= 14 || (*m_data <= 9 && *m_data) || *m_data == 11);

more nasty constants
Comment 45 Zoltan Herczeg 2012-01-11 04:47:45 PST
Created attachment 122006 [details]
next patch

Style changes are moved to other patches and they are landed. Constants are fixed.

I think it is ready for the next review round.
Comment 46 Zoltan Herczeg 2012-01-11 04:50:39 PST
> > Source/WebCore/css/CSSParser.cpp:263
> > -    int length = string.length() + strlen(prefix) + strlen(suffix) + 2;
> > +    int length = string.length() + strlen(prefix) + strlen(suffix) + 1;
> 
> Why have we dropped to only a single extra null terminator?  I recall (though haven't got that far in the patch) that we rely on two nulls at the end of the string

The double zero terminator is only required by Flex. The attached code does not depend on the double terminator (thanks to C short circuit evaluation model).
Comment 47 Antti Koivisto 2012-01-11 05:05:22 PST
(In reply to comment #18)
> The reason of using cycle counters is based on that observation that the tokenizer called thousands of times, but a single run itself takes little time. Thus statistic based measurements are useless in our case.

This is not true. Sampling profiler works just fine in cases like this. If you are not seeing progression in something like Instruments or Shark then you are very likely not improving anything.
Comment 48 Zoltan Herczeg 2012-01-17 05:15:45 PST
> This is not true. Sampling profiler works just fine in cases like this. If you are not seeing progression in something like Instruments or Shark then you are very likely not improving anything.

I made some profiling on arm. I collected all CSS lexing contributor functions of "opreport -l libQtWebKit.so.4"

Was:
505       1.7080  WebCore::CSSParser::lex()
131       0.4431  WebCore::CSSParser::text(int*)
54        0.1826  WebCore::CSSParser::lex(void*)

Sum: ~700 sample

New:
190       0.6663  WebCore::CSSParser::lex(void*)
12        0.0421  WebCore::CSSParser::detectAtToken(int, bool)
6         0.0210  WebCore::CSSParser::detectNumberToken(unsigned short*, int)
6         0.0210  WebCore::CSSParser::detectNumberToken(unsigned short*, int)
6         0.0210  WebCore::CSSParser::parseIdentifier(unsigned short*&)
6         0.0210  WebCore::CSSParser::parseURI(unsigned short*&, unsigned short*&)
1         0.0035  WebCore::CSSParser::detectFunctionToken(int)

Sum: ~240 sample

So you are right, you can see that with any profiling tool.

Oliver could you review the patch?
Comment 49 Zoltan Herczeg 2012-01-19 14:30:27 PST
Any thoughts?
Comment 50 Zoltan Herczeg 2012-01-23 12:13:13 PST
(In reply to comment #49)
> Any thoughts?

There is quite a silence here. Can I interpret that as the patch is ready and no more changes are needed?
Comment 51 Oliver Hunt 2012-01-23 12:57:18 PST
(In reply to comment #50)
> (In reply to comment #49)
> > Any thoughts?
> 
> There is quite a silence here. Can I interpret that as the patch is ready and no more changes are needed?

Erk, sorry, i've had plague for the last few days, i'll have a quick look through again, but given i've done the last few rounds of review i'd like at least one other set of eyes to go over it as well -- maybe antti?
Comment 52 Oliver Hunt 2012-01-23 13:09:11 PST
Comment on attachment 122006 [details]
next patch

Okay I'm convinced this patch is sound, and would give it an r+, I'd like someone else to go through just to double check as I think i've been the only reviewer for the last few revs.  Antti art thou up for it?  I'll r+ tomorrow if no one objects (anyone knowing the area can r+ if they feel confident in it as well, in the knowledge that i think the patch is sane).  Would be nice to know what the raw perf improvement is if you throw a giant pile of css at this though.  eg. collect a few hundred meg of css files, concatenate them, and see what overall parsing time becomes.
Comment 53 Antti Koivisto 2012-01-23 14:34:59 PST
Sure, I'll check it out tomorrow.
Comment 54 Zoltan Herczeg 2012-01-24 01:21:45 PST
Created attachment 123705 [details]
patch update

The previous patch was quite old so I have updated to the latest revision and tested it on Qt. I also made a better ChangeLog.

No other changes.
Comment 55 Zoltan Herczeg 2012-01-24 01:22:35 PST
Hey Oliver, thanks for the review!
Comment 56 Antti Koivisto 2012-01-24 06:42:26 PST
Comment on attachment 123705 [details]
patch update

View in context: https://bugs.webkit.org/attachment.cgi?id=123705&action=review

Looks good to me. It would have been nice to see a performance test that shows the improvement (CSS parser is currently big enough CPU user that it should be very possible to construct a synthetic test that shows the progression). Setting to r+ based on Oliver's comment.

> Source/WebCore/css/CSSParser.cpp:7953
> +        if (hasEscape)
> +            return;
> +
> +        switch (length) {
> +        case 12:
> +            if (isCaselessEqual(name + 2, "ottom-left"))
> +                m_token = BOTTOMLEFT_SYM;
> +            return;
> +
> +        case 13:

I would prefer less empty lines, here and other places.
Comment 57 Alexey Proskuryakov 2012-01-24 13:04:10 PST
Comment on attachment 123705 [details]
patch update

View in context: https://bugs.webkit.org/attachment.cgi?id=123705&action=review

> Source/WebCore/css/CSSParser.cpp:181
> +static inline UChar lowercaseCharacter(UChar chr)

"chr"? Please don't abbreviate like this.

I suggest putting this in ASCIICType.h as toASCIILowerUnchecked().

> Source/WebCore/css/CSSParser.cpp:190
> +static inline bool isCaselessEqual(UChar cssCharacter, UChar character)

Ditto.

> Source/WebCore/css/CSSParser.cpp:7489
> +static inline bool isCSSLetter(UChar chr)

Please don't abbreviate "character".

> Source/WebCore/css/CSSParser.cpp:7491
> +    return chr >= 128 || typesOfASCIICharacters[chr] <= CharacterDash;

Even U+FFFD and unassigned code points are CSS letters?

> Source/WebCore/css/CSSParser.cpp:7494
> +static inline bool isCSSEscape(UChar chr)

Repeated several more times in the patch.

> Source/WebCore/css/CSSParser.cpp:7510
> +static inline bool isCaselessEqual(UChar* cssString, const char* constantString)

What's important to say in function name is that it's ASCII case insensitive comparison (which is different from Unicode one even for ASCII characters!), and that this depends on CSS syntax.

Otherwise, call sites get really confusing.

> Source/WebCore/css/CSSParser.cpp:7527
> +static UChar* checkEscape(UChar* data)

A check function can return false or true. What does this function do?

> Source/WebCore/css/CSSParser.cpp:7543
> +        if (isHTMLSpace(*data))

Are HTML and CSS whitespace definitions the same?

> Source/WebCore/css/CSSParser.cpp:7550
> +static inline UChar* skipWhiteSpaces(UChar* data)

Should be named "skipWhiteSpace" (there is even a CSS keyword skip-white-space, not to mention that this is how the function is named elsewhere in WebCore).

> Source/WebCore/css/CSSParser.cpp:7552
> +    while (*data <= ' ' && (typesOfASCIICharacters[*data] == CharacterWhiteSpace))

This is a yet another definition of whitespace, with all lower ASCII code being treated as such. If this is right, please explain in a comment.

> Source/WebCore/css/CSSParser.cpp:7565
> +inline UChar* CSSParser::checkString(UChar* data, UChar quote)

Please use a name that implies returning a UChar*, not a boolean.

> Source/WebCore/css/CSSParser.cpp:7619
> +inline bool CSSParser::parseIdentifier(UChar*& result)

Some added "parse" functions are void, but this returns a boolean. This is super confusing. Is there a way to have uniform interface for these? Or names that explain the difference?

> Source/WebCore/css/CSSParser.cpp:7775
> +inline void CSSParser::detectFunctionToken(int length)

Does this actually detect the token? Looks like this function is called after a function has been detected, not to detect it.

> Source/WebCore/css/CSSParser.h:383
> +    OwnArrayPtr<UChar> m_dataStart;
> +    UChar* m_data;

What is m_data, is it actually m_nextCharacter?
Comment 58 Zoltan Herczeg 2012-01-25 05:59:25 PST
AP ythanks for the in-depth review. I am agree all your name change suggestions.

> > Source/WebCore/css/CSSParser.cpp:7491
> > +    return chr >= 128 || typesOfASCIICharacters[chr] <= CharacterDash;
> 
> Even U+FFFD and unassigned code points are CSS letters?

No they are not letters. Characters > 127 are called non-ASCII in CSS, and they all treated as part of literals regardless what they are.

> > Source/WebCore/css/CSSParser.cpp:7543
> > +        if (isHTMLSpace(*data))
> 
> Are HTML and CSS whitespace definitions the same?

Yes. It is used several other places in the CSS parser.

> 
> > Source/WebCore/css/CSSParser.cpp:7552
> > +    while (*data <= ' ' && (typesOfASCIICharacters[*data] == CharacterWhiteSpace))
> 
> This is a yet another definition of whitespace, with all lower ASCII code being treated as such. If this is right, please explain in a comment.

I changed it to isHTMLSpace.

> > Source/WebCore/css/CSSParser.cpp:7619
> > +inline bool CSSParser::parseIdentifier(UChar*& result)
> 
> Some added "parse" functions are void, but this returns a boolean. This is super confusing. Is there a way to have uniform interface for these? Or names that explain the difference?

The return value is moved to bool& argument

> What is m_data, is it actually m_nextCharacter?

It is the currentCharacter more precisely. I renamed to that.
Comment 59 Zoltan Herczeg 2012-01-25 06:01:18 PST
Created attachment 123928 [details]
patch with renamings

AP, is this patch ok?
Comment 60 WebKit Review Bot 2012-01-25 06:03:16 PST
Attachment 123928 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'ChangeLog', u'Source/JavaScriptCore/Change..." exit_code: 1

Source/WebCore/css/CSSParser.cpp:7510:  Use 0 or null instead of NULL (even in *comments*).  [readability/null] [4]
Source/WebCore/css/CSSParser.cpp:7550:  Use 0 or null instead of NULL (even in *comments*).  [readability/null] [4]
Total errors found: 2 in 6 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 61 Antti Koivisto 2012-01-25 07:13:59 PST
(In reply to comment #57)
> > Source/WebCore/css/CSSParser.cpp:181
> > +static inline UChar lowercaseCharacter(UChar chr)
> 
> "chr"? Please don't abbreviate like this.

I disagree with this comment. In type of code where a few variable names are repeated a lot (like parsers) using shorter names improves readability. We use 'i' instead of 'index' as the standard loop variable name for the same reason.
Comment 62 Zoltan Herczeg 2012-01-28 04:55:21 PST
Alexey, do you have any more comments to the patch?
Comment 63 Oliver Hunt 2012-01-29 11:17:03 PST
Comment on attachment 123928 [details]
patch with renamings

Fix the style bot complaints and r=me, and apparently antti (from the prior one), and given the absence of response from ap i'll assume he's also okay with the current names :D
Comment 64 Zoltan Herczeg 2012-01-30 00:03:03 PST
Thank you for the reviews!

Fixed the style issues and landed as http://trac.webkit.org/changeset/106217
Comment 65 Simon Fraser (smfr) 2012-01-30 01:22:06 PST
Are there any pages where we think this is a measurable win, that we could use for perf testing?
Comment 66 Csaba Osztrogonác 2012-01-30 01:25:00 PST
Reopen, because it made css2.1/20110323/eof-002.htm flakey. Sometimes pass, sometimes crash, sometimes fail ... You can easily reproduce the bug with the following command:

$ Tools/Scripts/run-webkit-tests css2.1/20110323/eof-002.htm --iterations 100 --run-singly
Comment 67 Zoltan Herczeg 2012-01-30 03:20:10 PST
(In reply to comment #66)
> Reopen, because it made css2.1/20110323/eof-002.htm flakey. Sometimes pass, sometimes crash, sometimes fail ... You can easily reproduce the bug with the following command:

Thanks Csaba, valgrind also got it. Buildfix landed as http://trac.webkit.org/changeset/106227
Comment 68 Zoltan Herczeg 2012-01-30 05:12:06 PST
Created attachment 124528 [details]
parsing test

The patch contains the <style> rules from the most popular 25 web-sites (~270k). The test parses it 100 times, and display the time.
Comment 69 Zoltan Herczeg 2012-01-30 05:13:23 PST
> The patch contains the <style> rules from the most popular 25 web-sites (~270k). The test parses it 100 times, and display the time.

Oh this is not a patch just a simple page in a zipped file.