Bug 40048

Summary: Make HTML5Lexer go fast
Product: WebKit Reporter: Adam Barth <abarth>
Component: New BugsAssignee: Adam Barth <abarth>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, eric, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Other   
OS: OS X 10.5   
Bug Depends on:    
Bug Blocks: 39259    
Attachments:
Description Flags
work in progress
none
more work in progress
none
Patch
none
Diff w/o changing indent
none
Patch eric: review+

Description Adam Barth 2010-06-02 02:33:36 PDT
Make HTML5Lexer go fast
Comment 1 Adam Barth 2010-06-02 02:35:16 PDT
Created attachment 57638 [details]
work in progress
Comment 2 Adam Barth 2010-06-02 21:43:03 PDT
Created attachment 57733 [details]
more work in progress
Comment 3 Adam Barth 2010-06-03 00:08:01 PDT
Looks like ~1% speedup.  I was hoping for more.  I still think it's worth doing.
Comment 4 Adam Barth 2010-06-03 00:50:31 PDT
Created attachment 57742 [details]
Patch
Comment 5 WebKit Review Bot 2010-06-03 00:52:07 PDT
Attachment 57742 [details] did not pass style-queue:

Failed to run "['WebKitTools/Scripts/check-webkit-style', '--no-squash']" exit_code: 1
WebCore/html/HTML5Lexer.cpp:357:  Non-label code inside switch statements should be indented.  [whitespace/indent] [4]
Total errors found: 1 in 3 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 6 WebKit Review Bot 2010-06-03 01:35:58 PDT
Attachment 57742 [details] did not build on win:
Build output: http://webkit-commit-queue.appspot.com/results/3045014
Comment 7 Eric Seidel (no email) 2010-06-04 00:33:34 PDT
Comment on attachment 57742 [details]
Patch

I can't really review a change this large in depth.  But I'll give some comments and we should talk more in person.

WebCore/html/HTML5Lexer.cpp:346
 +  #define BEGIN_STATE_MAP()                                                  \
Why does this need to be a macro?

WebCore/html/HTML5Lexer.cpp:353
 +  #define END_STATE() ASSERT_NOT_REACHED(); break;
I think instead of using END_STATE we should just use blocks { }

WebCore/html/HTML5Lexer.cpp:420
 +          if (cc == '&')
If we didn't change the indent in this change it would be easier to read.

WebCore/html/HTML5Lexer.cpp:428
 +              ADVANCE_TO(TagOpenState);
Seems we could do all the ADVANCE_TO search/replace in a separate change to minimize this diff.

I think this is great.  It just needs to be done in smaller pieces if you want more substantial comments.

I like the idea of using macros.  I like the idea of using gotos.
Comment 8 Eric Seidel (no email) 2010-06-04 00:52:26 PDT
Created attachment 57850 [details]
Diff w/o changing indent
Comment 9 Adam Barth 2010-06-04 09:32:47 PDT
> WebCore/html/HTML5Lexer.cpp:346
>  +  #define BEGIN_STATE_MAP()                                                  \
> Why does this need to be a macro?

I just thought it was more balanced, but we don't need to use a macro here.

> WebCore/html/HTML5Lexer.cpp:353
>  +  #define END_STATE() ASSERT_NOT_REACHED(); break;
> I think instead of using END_STATE we should just use blocks { }

Using blocks seems independent of whether we use END_STATE.  We still want to assert that we don't ever fall through.

> WebCore/html/HTML5Lexer.cpp:420
>  +          if (cc == '&')
> If we didn't change the indent in this change it would be easier to read.

Thanks for uploading a diff with a different indent.

> WebCore/html/HTML5Lexer.cpp:428
>  +              ADVANCE_TO(TagOpenState);
> Seems we could do all the ADVANCE_TO search/replace in a separate change to minimize this diff.

ok.
Comment 10 Adam Barth 2010-06-04 12:06:29 PDT
Committed r60694: <http://trac.webkit.org/changeset/60694>
Comment 11 Adam Barth 2010-06-04 12:19:53 PDT
Committed r60696: <http://trac.webkit.org/changeset/60696>
Comment 12 Adam Barth 2010-06-04 12:30:22 PDT
Committed r60697: <http://trac.webkit.org/changeset/60697>
Comment 13 Adam Barth 2010-06-04 12:38:45 PDT
Committed r60699: <http://trac.webkit.org/changeset/60699>
Comment 14 Adam Barth 2010-06-04 12:44:38 PDT
Committed r60701: <http://trac.webkit.org/changeset/60701>
Comment 15 Adam Barth 2010-06-04 12:54:45 PDT
Committed r60702: <http://trac.webkit.org/changeset/60702>
Comment 16 Adam Barth 2010-06-04 13:00:54 PDT
Committed r60703: <http://trac.webkit.org/changeset/60703>
Comment 17 Adam Barth 2010-06-04 13:06:33 PDT
Committed r60704: <http://trac.webkit.org/changeset/60704>
Comment 18 Adam Barth 2010-06-04 14:56:54 PDT
Committed r60711: <http://trac.webkit.org/changeset/60711>
Comment 19 Adam Barth 2010-06-04 15:12:51 PDT
Committed r60712: <http://trac.webkit.org/changeset/60712>
Comment 20 Adam Barth 2010-06-04 15:24:54 PDT
Created attachment 57921 [details]
Patch
Comment 21 Adam Barth 2010-06-04 15:25:45 PDT
Ok, after landing like a billion wind-up patches, this patch should hopefully be reviewable now.
Comment 24 Eric Seidel (no email) 2010-06-04 22:24:50 PDT
Comment on attachment 57921 [details]
Patch

WebCore/html/HTML5Lexer.cpp:323
 +  #define RECONSUME_IN(StateName)                                            \
Intentionally changing which column the \ is in?

WebCore/html/HTML5Lexer.cpp:327
 +          goto StateName;                                                    \
And the case of StateName?

WebCore/html/HTML5Lexer.cpp:334
 +          if (source.isEmpty())                                              \
Why does this check go after the advance and not before? (aka, after processing)?

WebCore/html/HTML5Lexer.cpp:399
 +      if (source.isEmpty())
You write this check in two places.  Should it be a macro?
Comment 25 Eric Seidel (no email) 2010-06-04 22:26:28 PDT
Comment on attachment 57921 [details]
Patch

Would like an answer to the "why does the isEmpty check go there" question before landing, but in general looks good!.
Comment 26 Adam Barth 2010-06-04 23:29:40 PDT
> WebCore/html/HTML5Lexer.cpp:323
>  +  #define RECONSUME_IN(StateName)                                            \
> Intentionally changing which column the \ is in?

Now, that's an artifact of merging back in code from the ASAD patch.

> WebCore/html/HTML5Lexer.cpp:327
>  +          goto StateName;                                                    \
> And the case of StateName?

Same here.  I'll change this one back.

> WebCore/html/HTML5Lexer.cpp:334
>  +          if (source.isEmpty())                                              \
> Why does this check go after the advance and not before? (aka, after processing)?

We process the lookahead character and then advance when we're done "consuming" it.  If we've just advanced over the last character of the string, there's no lookahead character and nothing left to do, we so we return.

The spec has a mix of when it decides to consume the lookahead character.  I went with this design because it's what the PreloadScanner does.  Now that we're using gotos instead of a jump table, we can do this verbatim the way the spec works, if you like, which would mean having a CONSUME macro at the beginning of most (but not all) states.

I don't think there's a big difference one way or the other.  We might go with whatever is faster.

> WebCore/html/HTML5Lexer.cpp:399
>  +      if (source.isEmpty())
> You write this check in two places.  Should it be a macro?

Maybe.  It tried to keep the macros under control, but we can proliferate them more if you like.
Comment 27 Adam Barth 2010-06-04 23:58:08 PDT
Committed r60739: <http://trac.webkit.org/changeset/60739>