RESOLVED FIXED 41845
refactor identifier parsing in lexer
https://bugs.webkit.org/show_bug.cgi?id=41845
Summary refactor identifier parsing in lexer
Zoltan Herczeg
Reported 2010-07-08 03:05:30 PDT
SS --parse-only: no change TEST COMPARISON FROM TO DETAILS ============================================================================= ** TOTAL **: ?? 31.3ms +/- 1.9% 31.4ms +/- 1.6% not conclusive: might be *1.003x as slow* ============================================================================= jquery: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% 1.3.2: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% mootools: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% 1.2.2-core-nc: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% prototype: ?? 5.8ms +/- 5.2% 5.9ms +/- 3.8% not conclusive: might be *1.017x as slow* 1.6.0.3: ?? 5.8ms +/- 5.2% 5.9ms +/- 3.8% not conclusive: might be *1.017x as slow* concat: - 15.5ms +/- 2.4% 15.5ms +/- 2.4% jquery-mootools-prototype: - 15.5ms +/- 2.4% 15.5ms +/- 2.4% SS: no change as well
Attachments
patch (22.25 KB, patch)
2010-07-08 03:19 PDT, Zoltan Herczeg
no flags
updated patch (23.58 KB, patch)
2010-07-12 04:58 PDT, Zoltan Herczeg
no flags
style patch (18.46 KB, patch)
2010-07-13 05:55 PDT, Zoltan Herczeg
darin: review-
indentation patch (15.59 KB, patch)
2010-07-14 00:44 PDT, Zoltan Herczeg
no flags
the second part (12.30 KB, patch)
2010-07-14 02:22 PDT, Zoltan Herczeg
no flags
the second part again (12.24 KB, patch)
2010-07-15 02:17 PDT, Zoltan Herczeg
no flags
identifier parsing patch (7.00 KB, patch)
2010-08-05 06:28 PDT, Zoltan Herczeg
no flags
identifier parsing patch (v2) (7.23 KB, patch)
2010-08-05 06:52 PDT, Zoltan Herczeg
no flags
Zoltan Herczeg
Comment 1 2010-07-08 03:19:54 PDT
Zoltan Herczeg
Comment 2 2010-07-09 23:39:35 PDT
remove the r? from the patch until conflicts are resolved.
Zoltan Herczeg
Comment 3 2010-07-12 04:58:34 PDT
Created attachment 61209 [details] updated patch Although there is nothing new in the patch compared to the previous one (except the merge), it causes 8% slowdown on --parse-only now. Not worth to set r?
Zoltan Herczeg
Comment 4 2010-07-13 05:55:55 PDT
Created attachment 61361 [details] style patch
Darin Adler
Comment 5 2010-07-13 10:41:31 PDT
Comment on attachment 61361 [details] style patch Looks good. > + // The first free types are fixed, and also used for > + // identifying ascii alpha and alphanumeric characters I think you mean the first *three* types. This comment should have a period, and ASCII should be in all capitals. > // 128 ascii codes ASCII. > -static unsigned short AsciiCharacters[128] = { > +static unsigned short asciiCharacters[128] = { Can this be static const? The name of this array is not great. It's called asciiCharacters, but it's not an array of ASCII characters. It's actually an array of the character type of each ASCII character. So its name should be asciiCharacterTypes or typesOfASCIICharacters or something like that. Also, the CharacterTypes enum has a strange name. A value of this enum type is a single character type, so the type name should be CharacterType. Just as "int" is not named "ints". > static inline bool isIdentStart(int c) > { > - return isASCII(c) ? isASCIIAlpha(c) || c == '$' || c == '_' : isNonASCIIIdentStart(c); > + return isASCII(c) ? asciiCharacters[c] == CharacterAlpha : isNonASCIIIdentStart(c); > } CharacterAlpha seems like the wrong name for something that includes "$" and "_". Is this a term from the ECMAScript specification? If not, then I suggest that we don't call this CharacterAlpha, and instead call it CharacterIdentifierStart or something along those lines. > static inline bool isIdentPart(int c) > { > - return isASCII(c) ? isASCIIAlphanumeric(c) || c == '$' || c == '_' : isNonASCIIIdentPart(c); > + return isASCII(c) ? asciiCharacters[c] <= CharacterNumber : isNonASCIIIdentPart(c); > } Needs a comment to explain why "<=" is correct here. > + // temporary contains an ascii representation of the code > + // helps to avoid code duplication It's not clear to me what "temporary" is. Maybe you mean the local variable named "asciiCharacter". Comment should be sentence style. "ascii" should be "ASCII". I don't think this comment says anything more clearly than the code. Saying "avoid code duplication" is too vague to add much clarity. I suggest omitting this comment entirely. > + if (isNonASCIIIdentStart(m_current)) > + asciiCharacter = 'a'; This code, however, does need a comment. The comment needs to say why it's OK to set the local variable to the letter "a". Comments are most valuable when they explain "why" rather than "what" -- the code itself makes it clear "what". In this case, we set asciiCharacter to any legal identifier starting character so we can get the right result from the code below, but the actual character is not used for anything else. I don't think the code needs to work this confusing way. Instead the local variable could just be a CharacterTypes value. CharacterType type; if (LIKELY(isASCII(m_current)) type = typesOfASCIICharacters[m_current]; else { if (isNonASCIIIdentStart(m_current)) type = CharacterIdentifierStart; else type = CharacterLineTerminator; } switch (type) { That code seems clear enough that you don't even need comments for it, and no less efficient than the code we have in the patch. Can you separate the indentation change of the giant switch statement from the rest of the patch somehow? The diff is hard to read because of that. I'm going to say review- because I think this change is unnecessarily oblique and it's easy to make it better. You could also consider some of my naming suggestions to make the code easier to understand.
Zoltan Herczeg
Comment 6 2010-07-13 10:57:56 PDT
I like your comments. Really makes things better. > Can you separate the indentation change of the giant switch statement from the rest of the patch somehow? The diff is hard to read because of that. You mean create a patch, which change only the indentation, and explain in the ChangeLog that the followup patch will use this indentation change?
Zoltan Herczeg
Comment 7 2010-07-14 00:44:56 PDT
Created attachment 61478 [details] indentation patch
Zoltan Herczeg
Comment 8 2010-07-14 02:22:01 PDT
Created attachment 61492 [details] the second part
Darin Adler
Comment 9 2010-07-14 08:06:16 PDT
Comment on attachment 61492 [details] the second part > + (JSC::): You should delete tis line in the ChangeLog, created because of a bug in the prepare-ChangeLog script. > + // Character types are divided into two groups depending they can be part of an identifier or not. > + // Those, whose type value is less or equal than CharacterNumber, can be part of an identifier. > + // (see the CharacterType defintion for more details) A few small mistakes here. One is "depending they", which should be "depending on whether they". The two commas are not needed or helpful in the second sentence. The sentence in parentheses should have a capital letter and a period. Patch otherwise looks good.
WebKit Commit Bot
Comment 10 2010-07-14 08:21:14 PDT
Comment on attachment 61478 [details] indentation patch Clearing flags on attachment: 61478 Committed r63322: <http://trac.webkit.org/changeset/63322>
WebKit Commit Bot
Comment 11 2010-07-14 08:21:19 PDT
All reviewed patches have been landed. Closing bug.
Zoltan Herczeg
Comment 12 2010-07-15 02:17:18 PDT
Created attachment 61619 [details] the second part again
Darin Adler
Comment 13 2010-07-15 06:34:34 PDT
Comment on attachment 61619 [details] the second part again JavaScriptCore/parser/Lexer.cpp:353 + // part of an identifier. (See the CharacterType defintion for more details.) Typo here: "defintion". JavaScriptCore/parser/Lexer.cpp:465 + if (UNLIKELY(((static_cast<unsigned>(m_current) - 0xE) & 0x2000))) { The tricky expression here should be inside a named inline function to document what it does. Even if we don't want to do that, at the very least it needs a comment because its non-obvious. That’s not new in this patch, though.
Zoltan Herczeg
Comment 14 2010-07-15 07:24:06 PDT
"the second part again" Landed in http://trac.webkit.org/changeset/63423 The work is still not finished.
Zoltan Herczeg
Comment 15 2010-08-05 06:28:53 PDT
Created attachment 63582 [details] identifier parsing patch This patch reaches the speed of the old code with gotos. SunSpider --parse-only: TEST COMPARISON FROM TO DETAILS ============================================================================= ** TOTAL **: - 34.0ms +/- 2.2% 33.6ms +/- 1.5% ============================================================================= jquery: - 5.2ms +/- 5.8% 5.1ms +/- 4.4% 1.3.2: - 5.2ms +/- 5.8% 5.1ms +/- 4.4% mootools: - 5.7ms +/- 6.1% 5.4ms +/- 6.8% 1.2.2-core-nc: - 5.7ms +/- 6.1% 5.4ms +/- 6.8% prototype: - 6.5ms +/- 5.8% 6.4ms +/- 5.8% 1.6.0.3: - 6.5ms +/- 5.8% 6.4ms +/- 5.8% concat: ?? 16.6ms +/- 2.2% 16.7ms +/- 2.1% not conclusive: might be *1.006x as slow* jquery-mootools-prototype: ?? 16.6ms +/- 2.2% 16.7ms +/- 2.1% not conclusive: might be *1.006x as slow* Normal SunSpider reports no change.
Eric Seidel (no email)
Comment 16 2010-08-05 06:42:36 PDT
Zoltan Herczeg
Comment 17 2010-08-05 06:52:13 PDT
Created attachment 63583 [details] identifier parsing patch (v2) removed the unused variable
Darin Adler
Comment 18 2010-08-05 08:21:48 PDT
Comment on attachment 63583 [details] identifier parsing patch (v2) Thanks. This looks good. > +ALWAYS_INLINE JSTokenType Lexer::parseIdent(JSTokenData* lvalp, LexType lexType) Since this is identifier parsing, how about naming it parseIdentifier? I know the file sometimes does abbreviate to "ident", but I find the full word easier to read as I almost always do in code in general. > + while (true) { > + if (LIKELY(isIdentPart(m_current))) > + shift(); > + else if (LIKELY(m_current != '\\')) > + break; > + else { > + // \uXXXX unicode characters > + bufferRequired = true; > + if (identifierStart != currentCharacter()) > + m_buffer16.append(identifierStart, currentCharacter() - identifierStart); > + shift(); > + if (UNLIKELY(m_current != 'u')) > + return ERRORTOK; > + shift(); > + int character = getUnicodeCharacter(); > + if (UNLIKELY(character == -1)) > + return ERRORTOK; > + if (UNLIKELY(m_buffer16.size() ? !isIdentPart(character) : !isIdentStart(character))) > + return ERRORTOK; > + record16(character); > + identifierStart = currentCharacter(); > + } > + } I prefer the "early return" style, so if it gives you the same efficiency I'd like this to be written as follows: if (LIKELY(isIdentPart(m_current))) { shift(); continue; } if (LIKELY(m_current != '\\')) break; ... With the rest of the loop body not indented. > + // keywords are not checked if there was an \uXXXX in the identifier This is a confusing comment. I think what we mean to say is that keywords must not be recognized in such cases. Saying keywords "are not checked" leaves it ambiguous whether we are describing a bug or a feature. Also WebKit preferred style is to capitalize comments and end with a period. Especially a comment like this one that is a full sentence. > + // Continue to the next case Would be better to have a period and use the terminology "Fall through" as we do in other similar comments. > + ALWAYS_INLINE JSTokenType parseIdent(JSTokenData* lvalp, LexType lexType); No need for the argument names lvavlp and lexType here. The types of the arguments speak for themselves without argument names and in such cases we omit the argument names in WebKit code.
Zoltan Herczeg
Comment 19 2010-08-06 04:07:52 PDT
Did the requested changes and landed in http://trac.webkit.org/changeset/64827 Thank you Darin.
Note You need to log in before you can comment on or make changes to this bug.