Bug 41845 - refactor identifier parsing in lexer
Summary: refactor identifier parsing in lexer
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-08 03:05 PDT by Zoltan Herczeg
Modified: 2010-08-06 04:08 PDT (History)
3 users (show)

See Also:


Attachments
patch (22.25 KB, patch)
2010-07-08 03:19 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
updated patch (23.58 KB, patch)
2010-07-12 04:58 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
style patch (18.46 KB, patch)
2010-07-13 05:55 PDT, Zoltan Herczeg
darin: review-
Details | Formatted Diff | Diff
indentation patch (15.59 KB, patch)
2010-07-14 00:44 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
the second part (12.30 KB, patch)
2010-07-14 02:22 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
the second part again (12.24 KB, patch)
2010-07-15 02:17 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
identifier parsing patch (7.00 KB, patch)
2010-08-05 06:28 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
identifier parsing patch (v2) (7.23 KB, patch)
2010-08-05 06:52 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zoltan Herczeg 2010-07-08 03:05:30 PDT
SS --parse-only: no change

TEST                           COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:                   ??                31.3ms +/- 1.9%   31.4ms +/- 1.6%     not conclusive: might be *1.003x as slow*

=============================================================================

  jquery:                      -                  5.0ms +/- 0.0%    5.0ms +/- 0.0%
    1.3.2:                     -                  5.0ms +/- 0.0%    5.0ms +/- 0.0%

  mootools:                    -                  5.0ms +/- 0.0%    5.0ms +/- 0.0%
    1.2.2-core-nc:             -                  5.0ms +/- 0.0%    5.0ms +/- 0.0%

  prototype:                   ??                 5.8ms +/- 5.2%    5.9ms +/- 3.8%     not conclusive: might be *1.017x as slow*
    1.6.0.3:                   ??                 5.8ms +/- 5.2%    5.9ms +/- 3.8%     not conclusive: might be *1.017x as slow*

  concat:                      -                 15.5ms +/- 2.4%   15.5ms +/- 2.4%
    jquery-mootools-prototype: -                 15.5ms +/- 2.4%   15.5ms +/- 2.4%

SS: no change as well
Comment 1 Zoltan Herczeg 2010-07-08 03:19:54 PDT
Created attachment 60862 [details]
patch
Comment 2 Zoltan Herczeg 2010-07-09 23:39:35 PDT
remove the r? from the patch until conflicts are resolved.
Comment 3 Zoltan Herczeg 2010-07-12 04:58:34 PDT
Created attachment 61209 [details]
updated patch

Although there is nothing new in the patch compared to the previous one (except the merge), it causes 8% slowdown on --parse-only now. Not worth to set r?
Comment 4 Zoltan Herczeg 2010-07-13 05:55:55 PDT
Created attachment 61361 [details]
style patch
Comment 5 Darin Adler 2010-07-13 10:41:31 PDT
Comment on attachment 61361 [details]
style patch

Looks good.

> +    // The first free types are fixed, and also used for
> +    // identifying ascii alpha and alphanumeric characters

I think you mean the first *three* types. This comment should have a period, and ASCII should be in all capitals.

>  // 128 ascii codes

ASCII.

> -static unsigned short AsciiCharacters[128] = {
> +static unsigned short asciiCharacters[128] = {

Can this be static const?

The name of this array is not great. It's called asciiCharacters, but it's not an array of ASCII characters. It's actually an array of the character type of each ASCII character. So its name should be asciiCharacterTypes or typesOfASCIICharacters or something like that.

Also, the CharacterTypes enum has a strange name. A value of this enum type is a single character type, so the type name should be CharacterType. Just as "int" is not named "ints".

>  static inline bool isIdentStart(int c)
>  {
> -    return isASCII(c) ? isASCIIAlpha(c) || c == '$' || c == '_' : isNonASCIIIdentStart(c);
> +    return isASCII(c) ? asciiCharacters[c] == CharacterAlpha : isNonASCIIIdentStart(c);
>  }

CharacterAlpha seems like the wrong name for something that includes "$" and "_". Is this a term from the ECMAScript specification? If not, then I suggest that we don't call this CharacterAlpha, and instead call it CharacterIdentifierStart or something along those lines.

>  static inline bool isIdentPart(int c)
>  {
> -    return isASCII(c) ? isASCIIAlphanumeric(c) || c == '$' || c == '_' : isNonASCIIIdentPart(c);
> +    return isASCII(c) ? asciiCharacters[c] <= CharacterNumber : isNonASCIIIdentPart(c);
>  }

Needs a comment to explain why "<=" is correct here.

> +    // temporary contains an ascii representation of the code
> +    // helps to avoid code duplication

It's not clear to me what "temporary" is. Maybe you mean the local variable named "asciiCharacter". Comment should be sentence style. "ascii" should be "ASCII".

I don't think this comment says anything more clearly than the code. Saying "avoid code duplication" is too vague to add much clarity. I suggest omitting this comment entirely.

> +        if (isNonASCIIIdentStart(m_current))
> +            asciiCharacter = 'a';

This code, however, does need a comment. The comment needs to say why it's OK to set the local variable to the letter "a". Comments are most valuable when they explain "why" rather than "what" -- the code itself makes it clear "what". In this case, we set asciiCharacter to any legal identifier starting character so we can get the right result from the code below, but the actual character is not used for anything else.

I don't think the code needs to work this confusing way. Instead the local variable could just be a CharacterTypes value.

    CharacterType type;
    if (LIKELY(isASCII(m_current))
        type = typesOfASCIICharacters[m_current];
    else {
        if (isNonASCIIIdentStart(m_current))
            type = CharacterIdentifierStart;
        else
            type = CharacterLineTerminator;
    }
    switch (type) {


That code seems clear enough that you don't even need comments for it, and no less efficient than the code we have in the patch.

Can you separate the indentation change of the giant switch statement from the rest of the patch somehow? The diff is hard to read because of that.

I'm going to say review- because I think this change is unnecessarily oblique and it's easy to make it better. You could also consider some of my naming suggestions to make the code easier to understand.
Comment 6 Zoltan Herczeg 2010-07-13 10:57:56 PDT
I like your comments. Really makes things better.

> Can you separate the indentation change of the giant switch statement from the rest of the patch somehow? The diff is hard to read because of that.

You mean create a patch, which change only the indentation, and explain in the ChangeLog that the followup patch will use this indentation change?
Comment 7 Zoltan Herczeg 2010-07-14 00:44:56 PDT
Created attachment 61478 [details]
indentation patch
Comment 8 Zoltan Herczeg 2010-07-14 02:22:01 PDT
Created attachment 61492 [details]
the second part
Comment 9 Darin Adler 2010-07-14 08:06:16 PDT
Comment on attachment 61492 [details]
the second part

> +        (JSC::):

You should delete tis line in the ChangeLog, created because of a bug in the prepare-ChangeLog script.

> +    // Character types are divided into two groups depending they can be part of an identifier or not.
> +    // Those, whose type value is less or equal than CharacterNumber, can be part of an identifier.
> +    // (see the CharacterType defintion for more details)

A few small mistakes here. One is "depending they", which should be "depending on whether they". The two commas are not needed or helpful in the second sentence. The sentence in parentheses should have a capital letter and a period.

Patch otherwise looks good.
Comment 10 WebKit Commit Bot 2010-07-14 08:21:14 PDT
Comment on attachment 61478 [details]
indentation patch

Clearing flags on attachment: 61478

Committed r63322: <http://trac.webkit.org/changeset/63322>
Comment 11 WebKit Commit Bot 2010-07-14 08:21:19 PDT
All reviewed patches have been landed.  Closing bug.
Comment 12 Zoltan Herczeg 2010-07-15 02:17:18 PDT
Created attachment 61619 [details]
the second part again
Comment 13 Darin Adler 2010-07-15 06:34:34 PDT
Comment on attachment 61619 [details]
the second part again

JavaScriptCore/parser/Lexer.cpp:353
 +      // part of an identifier. (See the CharacterType defintion for more details.)
Typo here: "defintion".

JavaScriptCore/parser/Lexer.cpp:465
 +          if (UNLIKELY(((static_cast<unsigned>(m_current) - 0xE) & 0x2000))) {
The tricky expression here should be inside a named inline function to document what it does. Even if we don't want to do that, at the very least it needs a comment because its non-obvious. That’s not new in this patch, though.
Comment 14 Zoltan Herczeg 2010-07-15 07:24:06 PDT
"the second part again" Landed in http://trac.webkit.org/changeset/63423

The work is still not finished.
Comment 15 Zoltan Herczeg 2010-08-05 06:28:53 PDT
Created attachment 63582 [details]
identifier parsing patch

This patch reaches the speed of the old code with gotos.

SunSpider --parse-only:
TEST                           COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:                   -                 34.0ms +/- 2.2%   33.6ms +/- 1.5%

=============================================================================

  jquery:                      -                  5.2ms +/- 5.8%    5.1ms +/- 4.4%
    1.3.2:                     -                  5.2ms +/- 5.8%    5.1ms +/- 4.4%

  mootools:                    -                  5.7ms +/- 6.1%    5.4ms +/- 6.8%
    1.2.2-core-nc:             -                  5.7ms +/- 6.1%    5.4ms +/- 6.8%

  prototype:                   -                  6.5ms +/- 5.8%    6.4ms +/- 5.8%
    1.6.0.3:                   -                  6.5ms +/- 5.8%    6.4ms +/- 5.8%

  concat:                      ??                16.6ms +/- 2.2%   16.7ms +/- 2.1%     not conclusive: might be *1.006x as slow*
    jquery-mootools-prototype: ??                16.6ms +/- 2.2%   16.7ms +/- 2.1%     not conclusive: might be *1.006x as slow*

Normal SunSpider reports no change.
Comment 16 Eric Seidel (no email) 2010-08-05 06:42:36 PDT
Attachment 63582 [details] did not build on mac:
Build output: http://queues.webkit.org/results/3559933
Comment 17 Zoltan Herczeg 2010-08-05 06:52:13 PDT
Created attachment 63583 [details]
 identifier parsing patch (v2)

removed the unused variable
Comment 18 Darin Adler 2010-08-05 08:21:48 PDT
Comment on attachment 63583 [details]
 identifier parsing patch (v2)

Thanks. This looks good.

> +ALWAYS_INLINE JSTokenType Lexer::parseIdent(JSTokenData* lvalp, LexType lexType)

Since this is identifier parsing, how about naming it parseIdentifier? I know the file sometimes does abbreviate to "ident", but I find the full word easier to read as I almost always do in code in general.

> +    while (true) {
> +        if (LIKELY(isIdentPart(m_current)))
> +            shift();
> +        else if (LIKELY(m_current != '\\'))
> +            break;
> +        else {
> +            // \uXXXX unicode characters
> +            bufferRequired = true;
> +            if (identifierStart != currentCharacter())
> +                m_buffer16.append(identifierStart, currentCharacter() - identifierStart);
> +            shift();
> +            if (UNLIKELY(m_current != 'u'))
> +                return ERRORTOK;
> +            shift();
> +            int character = getUnicodeCharacter();
> +            if (UNLIKELY(character == -1))
> +                return ERRORTOK;
> +            if (UNLIKELY(m_buffer16.size() ? !isIdentPart(character) : !isIdentStart(character)))
> +                return ERRORTOK;
> +            record16(character);
> +            identifierStart = currentCharacter();
> +        }
> +    }

I prefer the "early return" style, so if it gives you the same efficiency I'd like this to be written as follows:

    if (LIKELY(isIdentPart(m_current))) {
        shift();
        continue;
    }
    if (LIKELY(m_current != '\\'))
        break;
    ...

With the rest of the loop body not indented.

> +        // keywords are not checked if there was an \uXXXX in the identifier

This is a confusing comment. I think what we mean to say is that keywords must not be recognized in such cases. Saying keywords "are not checked" leaves it ambiguous whether we are describing a bug or a feature.

Also WebKit preferred style is to capitalize comments and end with a period. Especially a comment like this one that is a full sentence.

> +        // Continue to the next case

Would be better to have a period and use the terminology "Fall through" as we do in other similar comments.

> +        ALWAYS_INLINE JSTokenType parseIdent(JSTokenData* lvalp, LexType lexType);

No need for the argument names lvavlp and lexType here. The types of the arguments speak for themselves without argument names and in such cases we omit the argument names in WebKit code.
Comment 19 Zoltan Herczeg 2010-08-06 04:07:52 PDT
Did the requested changes and landed in http://trac.webkit.org/changeset/64827

Thank you Darin.