Summary: | Custom written CSS lexer | ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Zoltan Herczeg <zherczeg> | ||||||||||||||||||||||||||
Component: | CSS | Assignee: | Nobody <webkit-unassigned> | ||||||||||||||||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||||||||||||||||
Severity: | Enhancement | CC: | abarth, benjamin, dglazkov, donggwan.kim, eric.carlson, hausmann, kennyluck, kling, koivisto, macpherson, mathias, ojan, oliver, ossy, peter, psolanki, rakuco, simon.fraser, tonikitoo, vestbo, webkit.review.bot | ||||||||||||||||||||||||||
Priority: | P2 | ||||||||||||||||||||||||||||
Version: | 528+ (Nightly build) | ||||||||||||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||||||||||||
Bug Depends on: | 69083 | ||||||||||||||||||||||||||||
Bug Blocks: | |||||||||||||||||||||||||||||
Attachments: |
|
Description
Zoltan Herczeg
2011-10-14 06:35:56 PDT
Created attachment 111009 [details]
draft patch
> On methanol it is more than twice (~2.18x) as fast as the original.
I mean the sum of CSSParser::lex(...) function calls
Attachment 111009 [details] did not pass style-queue:
Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/css/CSSParser.cpp', u'Sourc..." exit_code: 1
Source/WebCore/css/CSSParser.cpp:6993: Boolean expressions that span multiple lines should have their operators on the left side of the line instead of the right side. [whitespace/operators] [4]
Source/WebCore/css/CSSParser.cpp:7686: An else if statement should be written as an if statement when the prior "if" concludes with a return, break, continue or goto statement. [readability/control_flow] [4]
Source/WebCore/css/CSSParser.cpp:7950: More than one command on the same line in if [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7952: More than one command on the same line in if [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953: More than one command on the same line in if [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953: Semicolon defining empty statement for this loop. Use { } instead. [whitespace/semicolon] [5]
Total errors found: 6 in 2 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 111009 [details] draft patch Attachment 111009 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10054557 New failing tests: inspector/styles/styles-source-offsets.html inspector/elements/elements-panel-styles.html inspector/styles/styles-new-API.html inspector/styles/styles-source-lines.html Created attachment 111241 [details]
next draft
Layout and style (except debug) errors are fixed. The question now: what is the correct token style? Attachment 111241 [details] did not pass style-queue:
Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/css/CSSParser.cpp', u'Sourc..." exit_code: 1
Source/WebCore/css/CSSParser.cpp:7950: More than one command on the same line in if [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7952: More than one command on the same line in if [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953: More than one command on the same line in if [whitespace/parens] [4]
Source/WebCore/css/CSSParser.cpp:7953: Semicolon defining empty statement for this loop. Use { } instead. [whitespace/semicolon] [5]
Total errors found: 4 in 2 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Created attachment 118748 [details]
core patch
I think the tokens supposedly parsed by CSS are finally covered, and it is time for reviewing the custom written css tokenizer.
The patch is not ready, it has no ChangeLog, and tokenizer.flex is not removed form the trunk. But I would like to hear your opinion since I am sure you have several ideas to improve the patch.
Comment on attachment 118748 [details] core patch Attachment 118748 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10841018 New failing tests: fast/regions/parsing-region-style-rule.html Someone who knows about parsers should take a look. Oliver perhaps? Comment on attachment 118748 [details] core patch View in context: https://bugs.webkit.org/attachment.cgi?id=118748&action=review There seems to be a lot of code in this that assumes null-termination of all strings passed to the lexer -- is this guaranteed? For instance if i do var foo = "some big string"; someElement.style=foo.substring(2, 8); (not sure if this exact statement is valid), we may return a StringImpl that is referencing the middle of another string so null termination won't be guaranteed. At the very least i'd like assertions that m_data never exceeds the buffer length. > Source/WebCore/css/CSSParser.cpp:7388 > + return chr | 0x20; Can you add an assertion that chr is in fact an ascii letter? > Source/WebCore/css/CSSParser.cpp:7394 > + || (data[0] == '\\' && isCSSEscape(data[1])); How do we guarantee data is at least 2 characters long? > Source/WebCore/css/CSSParser.cpp:7430 > + } while (isASCIIHexDigit(*data) && --length); I'd like assertions that data doesn't extend beyond the end of the buffer. > Source/WebCore/css/CSSParser.cpp:7533 > + if (UNLIKELY(*m_data == quote)) { > + // String parsing is done. > + ++m_data; > + return; > + } What happens if we have an unterminated string? afaict we just walk off the end of the buffer. Hey Oliver, actually the CSS parser working in an internal buffer, which is terminated by two '\0'-s, see the following code: void CSSParser::setupParser(const char* prefix, const String& string, const char* suffix) { int length = string.length() + strlen(prefix) + strlen(suffix) + 2; fastFree(m_data); m_data = static_cast<UChar*>(fastMalloc(length * sizeof(UChar))); for (unsigned i = 0; i < strlen(prefix); i++) m_data[i] = prefix[i]; memcpy(m_data + strlen(prefix), string.characters(), string.length() * sizeof(UChar)); unsigned start = strlen(prefix) + string.length(); unsigned end = start + strlen(suffix); for (unsigned i = start; i < end; i++) m_data[i] = suffix[i - start]; m_data[length - 1] = 0; m_data[length - 2] = 0; yy_hold_char = 0; yyleng = 0; yytext = yy_c_buf_p = m_data; yy_hold_char = *yy_c_buf_p; resetRuleBodyMarks(); } Flex needs this, and I actually liked this approach, since it requires less buffer range checks (you know that any non \0 character must be followed at least two \0-s). CSS actually designed that way that '\0' is not part of any token. (In reply to comment #12) > Hey Oliver, > > actually the CSS parser working in an internal buffer, which is terminated by two '\0'-s, see the following code: > > void CSSParser::setupParser(const char* prefix, const String& string, const char* suffix) > { > int length = string.length() + strlen(prefix) + strlen(suffix) + 2; > > fastFree(m_data); > m_data = static_cast<UChar*>(fastMalloc(length * sizeof(UChar))); > for (unsigned i = 0; i < strlen(prefix); i++) > m_data[i] = prefix[i]; > > memcpy(m_data + strlen(prefix), string.characters(), string.length() * sizeof(UChar)); > > unsigned start = strlen(prefix) + string.length(); > unsigned end = start + strlen(suffix); > for (unsigned i = start; i < end; i++) > m_data[i] = suffix[i - start]; > > m_data[length - 1] = 0; > m_data[length - 2] = 0; > > yy_hold_char = 0; > yyleng = 0; > yytext = yy_c_buf_p = m_data; > yy_hold_char = *yy_c_buf_p; > resetRuleBodyMarks(); > } > > Flex needs this, and I actually liked this approach, since it requires less buffer range checks (you know that any non \0 character must be followed at least two \0-s). CSS actually designed that way that '\0' is not part of any token. On the downside it requires a copy of the css source -- is a well predicted branch really more expensive than a copy? Oh, and i still don't see how the null termination will stop the string parsing from running over the end of the buffer (In reply to comment #12) > Hey Oliver, > > actually the CSS parser working in an internal buffer, which is terminated by two '\0'-s, see the following code: > > void CSSParser::setupParser(const char* prefix, const String& string, const char* suffix) > { > int length = string.length() + strlen(prefix) + strlen(suffix) + 2; > > fastFree(m_data); > m_data = static_cast<UChar*>(fastMalloc(length * sizeof(UChar))); > for (unsigned i = 0; i < strlen(prefix); i++) > m_data[i] = prefix[i]; > > memcpy(m_data + strlen(prefix), string.characters(), string.length() * sizeof(UChar)); > > unsigned start = strlen(prefix) + string.length(); > unsigned end = start + strlen(suffix); > for (unsigned i = start; i < end; i++) > m_data[i] = suffix[i - start]; > > m_data[length - 1] = 0; > m_data[length - 2] = 0; > > yy_hold_char = 0; > yyleng = 0; > yytext = yy_c_buf_p = m_data; > yy_hold_char = *yy_c_buf_p; > resetRuleBodyMarks(); > } > > Flex needs this, and I actually liked this approach, since it requires less buffer range checks (you know that any non \0 character must be followed at least two \0-s). CSS actually designed that way that '\0' is not part of any token. (In reply to comment #0) > In the current state it is far from ready, I am just curious about EWS. > > On methanol it is more than twice (~2.18x) as fast as the original. > > The result of 3x2 runs in CPU cycles > Original: 1354030040 1377989672 1345539012 > Optimized: 622771368 614838444 635476680 Are these callgrind CPU cycle counts, or measured using a hardware CPU profiler? What's the improvement in wall-time? I'm curious because I've seen some rather unbelievable improvement numbers in association with the "methanol" benchmark more than once before and I'm not convinced it can be trusted. (Examples: bug 66851, bug 64262, bug 63909) > On methanol it is more than twice (~2.18x) as fast as the original.
I have seen multiple false or totally erroneous claims based on this "methanol" benchmark.
Could you please provide a cross-platform stress test (a webpage or just C++ calling the lexer) so others can verify the results?
By the way, I think you should announce the intent of changing the CSS lexer on WebKit-dev. Given the scope of the change, it would be polite to give visibility for every port. Created attachment 118964 [details]
CPU cycle measure
The attached patch works with Qt, gcc and x86, but it can be extended to other ports / CPUs if anyone needs it.
I measured the values on an Intel Xeon server, since the cycle counters of all cores are synchronized there. Therefore you don't need to worry about CPU affinity. On other machines you may experince "jumpings" as the kernel migrates the process, that is why I like the Xeon. I saw speedup between 2x - 2.5x on any pages, so you can use any page to measure it.
The reason of using cycle counters is based on that observation that the tokenizer called thousands of times, but a single run itself takes little time. Thus statistic based measurements are useless in our case.
Based on other patches i've seen go through it might be worth checking against html5 spec load time (although i suspect that's mostly style calc/resolution rather than rule parsing). Antti can probably give some info on how he's measuring. Can you explain to me how the quoted string lexing doesn't run off the end of the buffer though? I still can't see it. > Can you explain to me how the quoted string lexing doesn't run off the end of the buffer though? I still can't see it.
I am on #webkit now, we can discuss there if you have a little time.
> > Source/WebCore/css/CSSParser.cpp:7388 > > + return chr | 0x20; > > Can you add an assertion that chr is in fact an ascii letter? It doesn't need to be one. I saw this trick in JavaScriptCore/wtf/ASCIICType.h (chr | 0x20) == 'a' if and only if chr is 'A' or 'a'. Nothing else can satisfy this expression, and I use it for character comparison many times. The only exception is the second isCaselessEqual, where I describe the rules applies to that function in the comment, and assertions were added to check the rules. Basically if constantString contains minus sign '-', the cssString must contains only cssLetter-s. > > > Source/WebCore/css/CSSParser.cpp:7394 > > + || (data[0] == '\\' && isCSSEscape(data[1])); > > How do we guarantee data is at least 2 characters long? C++ has a short circuit expression evaluation method, so data[0] == '\\' must be true before it checks data[1]. > Can you explain to me how the quoted string lexing doesn't run off the end of the buffer though? I still can't see it. You are right here, checkString must return with fail if it encounters a \0. Comment on attachment 118748 [details] core patch View in context: https://bugs.webkit.org/attachment.cgi?id=118748&action=review > Source/WebCore/css/CSSParser.cpp:7460 > + if (UNLIKELY(*data == quote)) { We don't take this path if we've reached the end of the buffer as *data == 0 > Source/WebCore/css/CSSParser.cpp:7464 > + if (UNLIKELY(*data < 14 && *data > 9 && *data != 11)) { This is checking for *data == 10, 12 or 13 so we don't match 0 here. > Source/WebCore/css/CSSParser.cpp:7469 > + if (LIKELY(data[0] != '\\')) data[0] is 0 (because we're at the end of the buffer) so we take this path, increment data, and go back to the head of the loop (In reply to comment #22) > (From update of attachment 118748 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=118748&action=review > > > Source/WebCore/css/CSSParser.cpp:7460 > > + if (UNLIKELY(*data == quote)) { > > We don't take this path if we've reached the end of the buffer as *data == 0 > > > Source/WebCore/css/CSSParser.cpp:7464 > > + if (UNLIKELY(*data < 14 && *data > 9 && *data != 11)) { > > This is checking for *data == 10, 12 or 13 so we don't match 0 here. > > > Source/WebCore/css/CSSParser.cpp:7469 > > + if (LIKELY(data[0] != '\\')) > > data[0] is 0 (because we're at the end of the buffer) so we take this path, increment data, and go back to the head of the loop I totally agree this one as I said above :) I fix it of course.
> > > Source/WebCore/css/CSSParser.cpp:7394
> > > + || (data[0] == '\\' && isCSSEscape(data[1]));
> >
> > How do we guarantee data is at least 2 characters long?
>
> C++ has a short circuit expression evaluation method, so data[0] == '\\' must be true before it checks data[1].
Oh right, and we guarantee null termination so even if i have a \ at the end of input this will work.
I don't really like all these dependencies on having 1+ null terminations on the input as it will make it harder to safely move to a non-copying lexer in future, but i recognize this is essentially existing behavior so you don't need to change it in this patch.
> On the downside it requires a copy of the css source -- is a well predicted branch really more expensive than a copy?
This cannot be avoided, since the CSS tokenizer overwrites escape sequences to real characters: Identifier "x\61y" is changed to "xay" before further processing. So we can only avoid buffer dumplication if the input does not contain any escape sequences. We need to take care of prefix and postfix strings as well somehow. And the worst thing is we don't get the ownership of the input string (not even a RefPtr), so it can be freed during tokenizing. What do you think?
Created attachment 118978 [details]
core patch2
Fixed the string parsing bug and added new comments.
Comment on attachment 118978 [details] core patch2 Attachment 118978 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10844274 New failing tests: fast/regions/parsing-region-style-rule.html (In reply to comment #25) > > On the downside it requires a copy of the css source -- is a well predicted branch really more expensive than a copy? > > This cannot be avoided, since the CSS tokenizer overwrites escape sequences to real characters: Identifier "x\61y" is changed to "xay" before further processing. So we can only avoid buffer dumplication if the input does not contain any escape sequences. We need to take care of prefix and postfix strings as well somehow. And the worst thing is we don't get the ownership of the input string (not even a RefPtr), so it can be freed during tokenizing. What do you think? The same problem exists when parsing JS -- but copying all strings is a pessimisation, you should only need to copy when there are escape sequences, and they are very rare. Once again: a change to make it non-copying doesn't need to be in this patch :D Thanks for the review Oliver. Is there anything else I should do with the core part? Created attachment 120002 [details]
full patch, build testing
Comment on attachment 120002 [details] full patch, build testing Attachment 120002 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/10992332 New failing tests: fast/regions/parsing-region-style-rule.html Comment on attachment 120002 [details] full patch, build testing View in context: https://bugs.webkit.org/attachment.cgi?id=120002&action=review I just quickly skimmed over the path. Some comments bellow. A general concern is lot of code have the assumption the input is ASCII. Is there no need to handle unicode (identifiers, space, digits, etc?). > Source/WebCore/css/CSSParser.cpp:7446 > + while (UNLIKELY(*data <= ' ' && (typesOfASCIICharacters[*data] == CharacterWhiteSpace))) I am not sure the UNLIKELY() is a good idea here. A bunch of spaces grouped together seems very likely :) > Source/WebCore/css/CSSParser.cpp:8032 > + // This is not a real loop, just allows us to avoid gotos. > + while (true) { Where would the Goto be? I only see one break at the end. The body is a switch(). > Source/WebCore/css/CSSParser.cpp:8098 > + if (!isASCIIDigit(m_data[0])) > + break; Isn't this redundant with the following code? The code of CharacterNumber also handle the dot case. > Source/WebCore/css/CSSParser.cpp:8102 > + bool dotSeen = (m_token == '.'); I haven't read the purpose of this yet but that but I have only seen m_token used with named contants everywhere else. Hey Benjamin, thank you for the review and Happy New Year! > A general concern is lot of code have the assumption the input is ASCII. Is there no need to handle unicode (identifiers, space, digits, etc?). Yes. In CSS everything > 127 is called "nonascii" and treated as part of an identifier. > > Source/WebCore/css/CSSParser.cpp:8032 > > + // This is not a real loop, just allows us to avoid gotos. > > + while (true) { > > Where would the Goto be? I only see one break at the end. The body is a switch(). It is for /* */ comments. See "case CharacterSlash". Although that is the only place, so we could change that to a goto if you prefer that. > > > Source/WebCore/css/CSSParser.cpp:8098 > > + if (!isASCIIDigit(m_data[0])) > > + break; > > Isn't this redundant with the following code? > The code of CharacterNumber also handle the dot case. A little more difficult. A single '.' is a dot token, while '.0' is a number (zero in this example). > > > Source/WebCore/css/CSSParser.cpp:8102 > > + bool dotSeen = (m_token == '.'); > > I haven't read the purpose of this yet but that but I have only seen m_token used with named contants everywhere else. All m_token < 256 are single character tokens. CSSGrammar.y does not define named constant for dot but I can define one if you think it is better. I'll fix the unlikely issue. (In reply to comment #33) > > > Source/WebCore/css/CSSParser.cpp:8032 > > > + // This is not a real loop, just allows us to avoid gotos. > > > + while (true) { > > > > Where would the Goto be? I only see one break at the end. The body is a switch(). > > It is for /* */ comments. See "case CharacterSlash". Although that is the only place, so we could change that to a goto if you prefer that. I don't think WebKit has any sort of strict anti-goto policy. If goto is the most natural way to write this then you should just use it. Created attachment 120935 [details]
updated draft patch
All requestes are fixed
Attachment 120935 [details] did not pass style-queue:
Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/CMakeLists.txt', u'Source/W..." exit_code: 1
Source/WebCore/css/CSSParser.cpp:1424: One space before end of line comments [whitespace/comments] [5]
Source/WebCore/css/CSSParser.cpp:5185: One space before end of line comments [whitespace/comments] [5]
Total errors found: 2 in 2 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 120935 [details] updated draft patch Attachment 120935 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/11074262 New failing tests: fast/regions/parsing-region-style-rule.html > New failing tests:
> fast/regions/parsing-region-style-rule.html
The bot always complain about this, but I couldn't find it in its output (and it works with Qt). Can I check this difference somehow?
> The bot always complain about this, but I couldn't find it in its output (and it works with Qt). Can I check this difference somehow?
There's no way to read this information back from the bot currently. (That's our number 1 feature request.) If you like, you can try building the Chromium port locally and try running the test:
$ ./Tools/Scripts/update-webkit --chromium
$ ./Tools/Scripts/build-webkit --chromium
Otherwise, you can move forward with your patch, but it would be nice to keep an eye on build.webkit.org when landing to see if this problem manifests itself on the buildbots.
Created attachment 121239 [details]
release patch
I think I fixed everything. Changelog added. Ready for review.
Attachment 121239 [details] did not pass style-queue:
Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'ChangeLog', u'Source/WebCore/CMakeLists.tx..." exit_code: 1
Source/WebCore/css/CSSParser.cpp:1424: One space before end of line comments [whitespace/comments] [5]
Total errors found: 1 in 4 files
If any of these errors are false positives, please file a bug against check-webkit-style.
The patch now passes all tests, and I think it is ready for a review. Oliver, Benjamin what do you think of this patch? Comment on attachment 121239 [details] release patch View in context: https://bugs.webkit.org/attachment.cgi?id=121239&action=review Okay, i'm doing a preliminary r- due to the large number of unnecessary formatting changes sorry. A patch of this scale and complexity shouldn't have unnecessary no-op changes in it :-/ I'll continue with actual content review, but this should let you start working on the formatting stuff without waiting for me to finish this review > Source/WebCore/css/CSSParser.cpp:263 > - int length = string.length() + strlen(prefix) + strlen(suffix) + 2; > + int length = string.length() + strlen(prefix) + strlen(suffix) + 1; Why have we dropped to only a single extra null terminator? I recall (though haven't got that far in the patch) that we rely on two nulls at the end of the string > Source/WebCore/css/CSSParser.cpp:925 > - if (id == CSSValueStatic || > - id == CSSValueRelative || > - id == CSSValueAbsolute || > - id == CSSValueFixed) > + if (id == CSSValueStatic > + || id == CSSValueRelative > + || id == CSSValueAbsolute > + || id == CSSValueFixed) This change seems unnecessary formatting which i'd rather avoid in a patch of this size and complexity > Source/WebCore/css/CSSParser.cpp:939 > - if (id == CSSValueAuto || > - id == CSSValueAlways || > - id == CSSValueAvoid || > - id == CSSValueLeft || > - id == CSSValueRight) > + if (id == CSSValueAuto > + || id == CSSValueAlways > + || id == CSSValueAvoid > + || id == CSSValueLeft > + || id == CSSValueRight) ditto > Source/WebCore/css/CSSParser.cpp:952 > - if (id == CSSValueShow || > - id == CSSValueHide) > + if (id == CSSValueShow > + || id == CSSValueHide) ditto > Source/WebCore/css/CSSParser.cpp:965 > - if (id == CSSValueNormal || > - id == CSSValuePre || > - id == CSSValuePreWrap || > - id == CSSValuePreLine || > - id == CSSValueNowrap) > + if (id == CSSValueNormal > + || id == CSSValuePre > + || id == CSSValuePreWrap > + || id == CSSValuePreLine > + || id == CSSValueNowrap) ditto >> Source/WebCore/css/CSSParser.cpp:1424 >> - case CSSPropertyCounterReset: // [ <identifier> <integer>? ]+ | none | inherit >> + case CSSPropertyCounterReset: // [ <identifier> <integer>? ]+ | none | inherit > > One space before end of line comments [whitespace/comments] [5] ditto > Source/WebCore/css/CSSParser.cpp:2025 > - validPrimitive = true; > + validPrimitive = true; ditto > Source/WebCore/css/CSSParser.cpp:2894 > - return cssValuePool()->createIdentifierValue(id); > + return cssValuePool()->createIdentifierValue(id); ditto > Source/WebCore/css/CSSParser.cpp:5186 > - // it is a color, but a color isn't allowed at this point. > + // it is a color, but a color isn't allowed at this point. ditto > Source/WebCore/css/CSSParser.cpp:5474 > - return id == CSSValueStretch || id == CSSValueRepeat || id == CSSValueSpace || id == CSSValueRound; > + return id == CSSValueStretch || id == CSSValueRepeat || id == CSSValueSpace || id == CSSValueRound; ditto > Source/WebCore/css/CSSParser.cpp:5559 > - m_left = m_cssValuePool->createValue(m_right->getDoubleValue(), (CSSPrimitiveValue::UnitTypes)m_right->primitiveType()); > + m_left = m_cssValuePool->createValue(m_right->getDoubleValue(), (CSSPrimitiveValue::UnitTypes)m_right->primitiveType()); ditto Comment on attachment 121239 [details] release patch View in context: https://bugs.webkit.org/attachment.cgi?id=121239&action=review > Source/WebCore/css/CSSParser.cpp:7494 > + if (UNLIKELY(*data < 14 && (!*data || *data == 10 || (*data | 0x1) == 13))) { > + // String parsing is failed for character 0, 10, 12 or 13. > + return 0; > + } yucky constants -- could you use '\n' etc? e.g. *data <= '\r' (i think?), etc > Source/WebCore/css/CSSParser.cpp:7561 > + ASSERT(*m_data >= 14 || (*m_data <= 9 && *m_data) || *m_data == 11); more nasty constants Created attachment 122006 [details]
next patch
Style changes are moved to other patches and they are landed. Constants are fixed.
I think it is ready for the next review round.
> > Source/WebCore/css/CSSParser.cpp:263
> > - int length = string.length() + strlen(prefix) + strlen(suffix) + 2;
> > + int length = string.length() + strlen(prefix) + strlen(suffix) + 1;
>
> Why have we dropped to only a single extra null terminator? I recall (though haven't got that far in the patch) that we rely on two nulls at the end of the string
The double zero terminator is only required by Flex. The attached code does not depend on the double terminator (thanks to C short circuit evaluation model).
(In reply to comment #18) > The reason of using cycle counters is based on that observation that the tokenizer called thousands of times, but a single run itself takes little time. Thus statistic based measurements are useless in our case. This is not true. Sampling profiler works just fine in cases like this. If you are not seeing progression in something like Instruments or Shark then you are very likely not improving anything. > This is not true. Sampling profiler works just fine in cases like this. If you are not seeing progression in something like Instruments or Shark then you are very likely not improving anything.
I made some profiling on arm. I collected all CSS lexing contributor functions of "opreport -l libQtWebKit.so.4"
Was:
505 1.7080 WebCore::CSSParser::lex()
131 0.4431 WebCore::CSSParser::text(int*)
54 0.1826 WebCore::CSSParser::lex(void*)
Sum: ~700 sample
New:
190 0.6663 WebCore::CSSParser::lex(void*)
12 0.0421 WebCore::CSSParser::detectAtToken(int, bool)
6 0.0210 WebCore::CSSParser::detectNumberToken(unsigned short*, int)
6 0.0210 WebCore::CSSParser::detectNumberToken(unsigned short*, int)
6 0.0210 WebCore::CSSParser::parseIdentifier(unsigned short*&)
6 0.0210 WebCore::CSSParser::parseURI(unsigned short*&, unsigned short*&)
1 0.0035 WebCore::CSSParser::detectFunctionToken(int)
Sum: ~240 sample
So you are right, you can see that with any profiling tool.
Oliver could you review the patch?
Any thoughts? (In reply to comment #49) > Any thoughts? There is quite a silence here. Can I interpret that as the patch is ready and no more changes are needed? (In reply to comment #50) > (In reply to comment #49) > > Any thoughts? > > There is quite a silence here. Can I interpret that as the patch is ready and no more changes are needed? Erk, sorry, i've had plague for the last few days, i'll have a quick look through again, but given i've done the last few rounds of review i'd like at least one other set of eyes to go over it as well -- maybe antti? Comment on attachment 122006 [details]
next patch
Okay I'm convinced this patch is sound, and would give it an r+, I'd like someone else to go through just to double check as I think i've been the only reviewer for the last few revs. Antti art thou up for it? I'll r+ tomorrow if no one objects (anyone knowing the area can r+ if they feel confident in it as well, in the knowledge that i think the patch is sane). Would be nice to know what the raw perf improvement is if you throw a giant pile of css at this though. eg. collect a few hundred meg of css files, concatenate them, and see what overall parsing time becomes.
Sure, I'll check it out tomorrow. Created attachment 123705 [details]
patch update
The previous patch was quite old so I have updated to the latest revision and tested it on Qt. I also made a better ChangeLog.
No other changes.
Hey Oliver, thanks for the review! Comment on attachment 123705 [details] patch update View in context: https://bugs.webkit.org/attachment.cgi?id=123705&action=review Looks good to me. It would have been nice to see a performance test that shows the improvement (CSS parser is currently big enough CPU user that it should be very possible to construct a synthetic test that shows the progression). Setting to r+ based on Oliver's comment. > Source/WebCore/css/CSSParser.cpp:7953 > + if (hasEscape) > + return; > + > + switch (length) { > + case 12: > + if (isCaselessEqual(name + 2, "ottom-left")) > + m_token = BOTTOMLEFT_SYM; > + return; > + > + case 13: I would prefer less empty lines, here and other places. Comment on attachment 123705 [details] patch update View in context: https://bugs.webkit.org/attachment.cgi?id=123705&action=review > Source/WebCore/css/CSSParser.cpp:181 > +static inline UChar lowercaseCharacter(UChar chr) "chr"? Please don't abbreviate like this. I suggest putting this in ASCIICType.h as toASCIILowerUnchecked(). > Source/WebCore/css/CSSParser.cpp:190 > +static inline bool isCaselessEqual(UChar cssCharacter, UChar character) Ditto. > Source/WebCore/css/CSSParser.cpp:7489 > +static inline bool isCSSLetter(UChar chr) Please don't abbreviate "character". > Source/WebCore/css/CSSParser.cpp:7491 > + return chr >= 128 || typesOfASCIICharacters[chr] <= CharacterDash; Even U+FFFD and unassigned code points are CSS letters? > Source/WebCore/css/CSSParser.cpp:7494 > +static inline bool isCSSEscape(UChar chr) Repeated several more times in the patch. > Source/WebCore/css/CSSParser.cpp:7510 > +static inline bool isCaselessEqual(UChar* cssString, const char* constantString) What's important to say in function name is that it's ASCII case insensitive comparison (which is different from Unicode one even for ASCII characters!), and that this depends on CSS syntax. Otherwise, call sites get really confusing. > Source/WebCore/css/CSSParser.cpp:7527 > +static UChar* checkEscape(UChar* data) A check function can return false or true. What does this function do? > Source/WebCore/css/CSSParser.cpp:7543 > + if (isHTMLSpace(*data)) Are HTML and CSS whitespace definitions the same? > Source/WebCore/css/CSSParser.cpp:7550 > +static inline UChar* skipWhiteSpaces(UChar* data) Should be named "skipWhiteSpace" (there is even a CSS keyword skip-white-space, not to mention that this is how the function is named elsewhere in WebCore). > Source/WebCore/css/CSSParser.cpp:7552 > + while (*data <= ' ' && (typesOfASCIICharacters[*data] == CharacterWhiteSpace)) This is a yet another definition of whitespace, with all lower ASCII code being treated as such. If this is right, please explain in a comment. > Source/WebCore/css/CSSParser.cpp:7565 > +inline UChar* CSSParser::checkString(UChar* data, UChar quote) Please use a name that implies returning a UChar*, not a boolean. > Source/WebCore/css/CSSParser.cpp:7619 > +inline bool CSSParser::parseIdentifier(UChar*& result) Some added "parse" functions are void, but this returns a boolean. This is super confusing. Is there a way to have uniform interface for these? Or names that explain the difference? > Source/WebCore/css/CSSParser.cpp:7775 > +inline void CSSParser::detectFunctionToken(int length) Does this actually detect the token? Looks like this function is called after a function has been detected, not to detect it. > Source/WebCore/css/CSSParser.h:383 > + OwnArrayPtr<UChar> m_dataStart; > + UChar* m_data; What is m_data, is it actually m_nextCharacter? AP ythanks for the in-depth review. I am agree all your name change suggestions. > > Source/WebCore/css/CSSParser.cpp:7491 > > + return chr >= 128 || typesOfASCIICharacters[chr] <= CharacterDash; > > Even U+FFFD and unassigned code points are CSS letters? No they are not letters. Characters > 127 are called non-ASCII in CSS, and they all treated as part of literals regardless what they are. > > Source/WebCore/css/CSSParser.cpp:7543 > > + if (isHTMLSpace(*data)) > > Are HTML and CSS whitespace definitions the same? Yes. It is used several other places in the CSS parser. > > > Source/WebCore/css/CSSParser.cpp:7552 > > + while (*data <= ' ' && (typesOfASCIICharacters[*data] == CharacterWhiteSpace)) > > This is a yet another definition of whitespace, with all lower ASCII code being treated as such. If this is right, please explain in a comment. I changed it to isHTMLSpace. > > Source/WebCore/css/CSSParser.cpp:7619 > > +inline bool CSSParser::parseIdentifier(UChar*& result) > > Some added "parse" functions are void, but this returns a boolean. This is super confusing. Is there a way to have uniform interface for these? Or names that explain the difference? The return value is moved to bool& argument > What is m_data, is it actually m_nextCharacter? It is the currentCharacter more precisely. I renamed to that. Created attachment 123928 [details]
patch with renamings
AP, is this patch ok?
Attachment 123928 [details] did not pass style-queue:
Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'ChangeLog', u'Source/JavaScriptCore/Change..." exit_code: 1
Source/WebCore/css/CSSParser.cpp:7510: Use 0 or null instead of NULL (even in *comments*). [readability/null] [4]
Source/WebCore/css/CSSParser.cpp:7550: Use 0 or null instead of NULL (even in *comments*). [readability/null] [4]
Total errors found: 2 in 6 files
If any of these errors are false positives, please file a bug against check-webkit-style.
(In reply to comment #57) > > Source/WebCore/css/CSSParser.cpp:181 > > +static inline UChar lowercaseCharacter(UChar chr) > > "chr"? Please don't abbreviate like this. I disagree with this comment. In type of code where a few variable names are repeated a lot (like parsers) using shorter names improves readability. We use 'i' instead of 'index' as the standard loop variable name for the same reason. Alexey, do you have any more comments to the patch? Comment on attachment 123928 [details]
patch with renamings
Fix the style bot complaints and r=me, and apparently antti (from the prior one), and given the absence of response from ap i'll assume he's also okay with the current names :D
Thank you for the reviews! Fixed the style issues and landed as http://trac.webkit.org/changeset/106217 Are there any pages where we think this is a measurable win, that we could use for perf testing? Reopen, because it made css2.1/20110323/eof-002.htm flakey. Sometimes pass, sometimes crash, sometimes fail ... You can easily reproduce the bug with the following command: $ Tools/Scripts/run-webkit-tests css2.1/20110323/eof-002.htm --iterations 100 --run-singly (In reply to comment #66) > Reopen, because it made css2.1/20110323/eof-002.htm flakey. Sometimes pass, sometimes crash, sometimes fail ... You can easily reproduce the bug with the following command: Thanks Csaba, valgrind also got it. Buildfix landed as http://trac.webkit.org/changeset/106227 Created attachment 124528 [details]
parsing test
The patch contains the <style> rules from the most popular 25 web-sites (~270k). The test parses it 100 times, and display the time.
> The patch contains the <style> rules from the most popular 25 web-sites (~270k). The test parses it 100 times, and display the time.
Oh this is not a patch just a simple page in a zipped file.
|