41845 – refactor identifier parsing in lexer

RESOLVED FIXED 41845

refactor identifier parsing in lexer

https://bugs.webkit.org/show_bug.cgi?id=41845

Summary refactor identifier parsing in lexer

Zoltan Herczeg

Reported 2010-07-08 03:05:30 PDT

SS --parse-only: no change TEST COMPARISON FROM TO DETAILS ============================================================================= ** TOTAL **: ?? 31.3ms +/- 1.9% 31.4ms +/- 1.6% not conclusive: might be *1.003x as slow* ============================================================================= jquery: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% 1.3.2: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% mootools: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% 1.2.2-core-nc: - 5.0ms +/- 0.0% 5.0ms +/- 0.0% prototype: ?? 5.8ms +/- 5.2% 5.9ms +/- 3.8% not conclusive: might be *1.017x as slow* 1.6.0.3: ?? 5.8ms +/- 5.2% 5.9ms +/- 3.8% not conclusive: might be *1.017x as slow* concat: - 15.5ms +/- 2.4% 15.5ms +/- 2.4% jquery-mootools-prototype: - 15.5ms +/- 2.4% 15.5ms +/- 2.4% SS: no change as well

Attachments
patch (22.25 KB, patch) 2010-07-08 03:19 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
updated patch (23.58 KB, patch) 2010-07-12 04:58 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
style patch (18.46 KB, patch) 2010-07-13 05:55 PDT, Zoltan Herczeg	darin: review-	Details Formatted Diff Diff
indentation patch (15.59 KB, patch) 2010-07-14 00:44 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
the second part (12.30 KB, patch) 2010-07-14 02:22 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
the second part again (12.24 KB, patch) 2010-07-15 02:17 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
identifier parsing patch (7.00 KB, patch) 2010-08-05 06:28 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
identifier parsing patch (v2) (7.23 KB, patch) 2010-08-05 06:52 PDT, Zoltan Herczeg	no flags	Details Formatted Diff Diff
Show Obsolete (4) View All Add attachment proposed patch, testcase, etc.

Zoltan Herczeg

Comment 1 2010-07-08 03:19:54 PDT

Created attachment 60862 [details] patch

Zoltan Herczeg

Comment 2 2010-07-09 23:39:35 PDT

remove the r? from the patch until conflicts are resolved.

Zoltan Herczeg

Comment 3 2010-07-12 04:58:34 PDT

Created attachment 61209 [details] updated patch Although there is nothing new in the patch compared to the previous one (except the merge), it causes 8% slowdown on --parse-only now. Not worth to set r?

Zoltan Herczeg

Comment 4 2010-07-13 05:55:55 PDT

Created attachment 61361 [details] style patch

Darin Adler

Comment 5 2010-07-13 10:41:31 PDT

Comment on attachment 61361 [details] style patch Looks good. > + // The first free types are fixed, and also used for > + // identifying ascii alpha and alphanumeric characters I think you mean the first *three* types. This comment should have a period, and ASCII should be in all capitals. > // 128 ascii codes ASCII. > -static unsigned short AsciiCharacters[128] = { > +static unsigned short asciiCharacters[128] = { Can this be static const? The name of this array is not great. It's called asciiCharacters, but it's not an array of ASCII characters. It's actually an array of the character type of each ASCII character. So its name should be asciiCharacterTypes or typesOfASCIICharacters or something like that. Also, the CharacterTypes enum has a strange name. A value of this enum type is a single character type, so the type name should be CharacterType. Just as "int" is not named "ints". > static inline bool isIdentStart(int c) > { > - return isASCII(c) ? isASCIIAlpha(c) || c == '$' || c == '_' : isNonASCIIIdentStart(c); > + return isASCII(c) ? asciiCharacters[c] == CharacterAlpha : isNonASCIIIdentStart(c); > } CharacterAlpha seems like the wrong name for something that includes "$" and "_". Is this a term from the ECMAScript specification? If not, then I suggest that we don't call this CharacterAlpha, and instead call it CharacterIdentifierStart or something along those lines. > static inline bool isIdentPart(int c) > { > - return isASCII(c) ? isASCIIAlphanumeric(c) || c == '$' || c == '_' : isNonASCIIIdentPart(c); > + return isASCII(c) ? asciiCharacters[c] <= CharacterNumber : isNonASCIIIdentPart(c); > } Needs a comment to explain why "<=" is correct here. > + // temporary contains an ascii representation of the code > + // helps to avoid code duplication It's not clear to me what "temporary" is. Maybe you mean the local variable named "asciiCharacter". Comment should be sentence style. "ascii" should be "ASCII". I don't think this comment says anything more clearly than the code. Saying "avoid code duplication" is too vague to add much clarity. I suggest omitting this comment entirely. > + if (isNonASCIIIdentStart(m_current)) > + asciiCharacter = 'a'; This code, however, does need a comment. The comment needs to say why it's OK to set the local variable to the letter "a". Comments are most valuable when they explain "why" rather than "what" -- the code itself makes it clear "what". In this case, we set asciiCharacter to any legal identifier starting character so we can get the right result from the code below, but the actual character is not used for anything else. I don't think the code needs to work this confusing way. Instead the local variable could just be a CharacterTypes value. CharacterType type; if (LIKELY(isASCII(m_current)) type = typesOfASCIICharacters[m_current]; else { if (isNonASCIIIdentStart(m_current)) type = CharacterIdentifierStart; else type = CharacterLineTerminator; } switch (type) { That code seems clear enough that you don't even need comments for it, and no less efficient than the code we have in the patch. Can you separate the indentation change of the giant switch statement from the rest of the patch somehow? The diff is hard to read because of that. I'm going to say review- because I think this change is unnecessarily oblique and it's easy to make it better. You could also consider some of my naming suggestions to make the code easier to understand.

Zoltan Herczeg

Comment 6 2010-07-13 10:57:56 PDT

I like your comments. Really makes things better. > Can you separate the indentation change of the giant switch statement from the rest of the patch somehow? The diff is hard to read because of that. You mean create a patch, which change only the indentation, and explain in the ChangeLog that the followup patch will use this indentation change?

Zoltan Herczeg

Comment 7 2010-07-14 00:44:56 PDT

Created attachment 61478 [details] indentation patch

Zoltan Herczeg

Comment 8 2010-07-14 02:22:01 PDT

Created attachment 61492 [details] the second part

Darin Adler

Comment 9 2010-07-14 08:06:16 PDT

Comment on attachment 61492 [details] the second part > + (JSC::): You should delete tis line in the ChangeLog, created because of a bug in the prepare-ChangeLog script. > + // Character types are divided into two groups depending they can be part of an identifier or not. > + // Those, whose type value is less or equal than CharacterNumber, can be part of an identifier. > + // (see the CharacterType defintion for more details) A few small mistakes here. One is "depending they", which should be "depending on whether they". The two commas are not needed or helpful in the second sentence. The sentence in parentheses should have a capital letter and a period. Patch otherwise looks good.

WebKit Commit Bot

Comment 10 2010-07-14 08:21:14 PDT

Comment on attachment 61478 [details] indentation patch Clearing flags on attachment: 61478 Committed r63322: <http://trac.webkit.org/changeset/63322>

WebKit Commit Bot

Comment 11 2010-07-14 08:21:19 PDT

All reviewed patches have been landed. Closing bug.

Zoltan Herczeg

Comment 12 2010-07-15 02:17:18 PDT

Created attachment 61619 [details] the second part again

Darin Adler

Comment 13 2010-07-15 06:34:34 PDT

Comment on attachment 61619 [details] the second part again JavaScriptCore/parser/Lexer.cpp:353 + // part of an identifier. (See the CharacterType defintion for more details.) Typo here: "defintion". JavaScriptCore/parser/Lexer.cpp:465 + if (UNLIKELY(((static_cast<unsigned>(m_current) - 0xE) & 0x2000))) { The tricky expression here should be inside a named inline function to document what it does. Even if we don't want to do that, at the very least it needs a comment because its non-obvious. That’s not new in this patch, though.

Zoltan Herczeg

Comment 14 2010-07-15 07:24:06 PDT

"the second part again" Landed in http://trac.webkit.org/changeset/63423 The work is still not finished.

Zoltan Herczeg

Comment 15 2010-08-05 06:28:53 PDT

Created attachment 63582 [details] identifier parsing patch This patch reaches the speed of the old code with gotos. SunSpider --parse-only: TEST COMPARISON FROM TO DETAILS ============================================================================= ** TOTAL **: - 34.0ms +/- 2.2% 33.6ms +/- 1.5% ============================================================================= jquery: - 5.2ms +/- 5.8% 5.1ms +/- 4.4% 1.3.2: - 5.2ms +/- 5.8% 5.1ms +/- 4.4% mootools: - 5.7ms +/- 6.1% 5.4ms +/- 6.8% 1.2.2-core-nc: - 5.7ms +/- 6.1% 5.4ms +/- 6.8% prototype: - 6.5ms +/- 5.8% 6.4ms +/- 5.8% 1.6.0.3: - 6.5ms +/- 5.8% 6.4ms +/- 5.8% concat: ?? 16.6ms +/- 2.2% 16.7ms +/- 2.1% not conclusive: might be *1.006x as slow* jquery-mootools-prototype: ?? 16.6ms +/- 2.2% 16.7ms +/- 2.1% not conclusive: might be *1.006x as slow* Normal SunSpider reports no change.

Eric Seidel (no email)

Comment 16 2010-08-05 06:42:36 PDT

Attachment 63582 [details] did not build on mac: Build output: http://queues.webkit.org/results/3559933

Zoltan Herczeg

Comment 17 2010-08-05 06:52:13 PDT

Created attachment 63583 [details] identifier parsing patch (v2) removed the unused variable

Darin Adler

Comment 18 2010-08-05 08:21:48 PDT

Comment on attachment 63583 [details] identifier parsing patch (v2) Thanks. This looks good. > +ALWAYS_INLINE JSTokenType Lexer::parseIdent(JSTokenData* lvalp, LexType lexType) Since this is identifier parsing, how about naming it parseIdentifier? I know the file sometimes does abbreviate to "ident", but I find the full word easier to read as I almost always do in code in general. > + while (true) { > + if (LIKELY(isIdentPart(m_current))) > + shift(); > + else if (LIKELY(m_current != '\\')) > + break; > + else { > + // \uXXXX unicode characters > + bufferRequired = true; > + if (identifierStart != currentCharacter()) > + m_buffer16.append(identifierStart, currentCharacter() - identifierStart); > + shift(); > + if (UNLIKELY(m_current != 'u')) > + return ERRORTOK; > + shift(); > + int character = getUnicodeCharacter(); > + if (UNLIKELY(character == -1)) > + return ERRORTOK; > + if (UNLIKELY(m_buffer16.size() ? !isIdentPart(character) : !isIdentStart(character))) > + return ERRORTOK; > + record16(character); > + identifierStart = currentCharacter(); > + } > + } I prefer the "early return" style, so if it gives you the same efficiency I'd like this to be written as follows: if (LIKELY(isIdentPart(m_current))) { shift(); continue; } if (LIKELY(m_current != '\\')) break; ... With the rest of the loop body not indented. > + // keywords are not checked if there was an \uXXXX in the identifier This is a confusing comment. I think what we mean to say is that keywords must not be recognized in such cases. Saying keywords "are not checked" leaves it ambiguous whether we are describing a bug or a feature. Also WebKit preferred style is to capitalize comments and end with a period. Especially a comment like this one that is a full sentence. > + // Continue to the next case Would be better to have a period and use the terminology "Fall through" as we do in other similar comments. > + ALWAYS_INLINE JSTokenType parseIdent(JSTokenData* lvalp, LexType lexType); No need for the argument names lvavlp and lexType here. The types of the arguments speak for themselves without argument names and in such cases we omit the argument names in WebKit code.

Zoltan Herczeg

Comment 19 2010-08-06 04:07:52 PDT

Did the requested changes and landed in http://trac.webkit.org/changeset/64827 Thank you Darin.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware PC

OS OS X 10.5

Product WebKit

Component JavaScriptCore

Assignee

Nobody

Reported

2010-07-08 03:05 PDT

Modified

2010-08-06 04:08 PDT History

CC List

3 users Show

URL

Keywords

Depends on

Blocks