Bug 142977 - Add case insensitive checks to DFA bytecode
Summary: Add case insensitive checks to DFA bytecode
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Alex Christensen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-23 13:15 PDT by Alex Christensen
Modified: 2015-03-25 16:06 PDT (History)
1 user (show)

See Also:


Attachments
Patch (15.14 KB, patch)
2015-03-23 13:20 PDT, Alex Christensen
no flags Details | Formatted Diff | Diff
Patch (16.15 KB, patch)
2015-03-25 11:17 PDT, Alex Christensen
benjamin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Christensen 2015-03-23 13:15:29 PDT
Many of the DFANodes have transitions that are case-insensitive, and most of the transitions are text.  Making case-insensitive checks would reduce the byte code size by up to 50%.
Comment 1 Alex Christensen 2015-03-23 13:20:40 PDT
Created attachment 249255 [details]
Patch
Comment 2 Alex Christensen 2015-03-23 17:32:53 PDT
In a large-ish real-life-ish test case, this reduced the byte code size from 93925673 to 61499201, so a memory reduction of about 30%.
Comment 3 Benjamin Poulain 2015-03-23 21:39:52 PDT
(In reply to comment #2)
> In a large-ish real-life-ish test case, this reduced the byte code size from
> 93925673 to 61499201, so a memory reduction of about 30%.

Awesome!
Comment 4 Benjamin Poulain 2015-03-23 21:54:59 PDT
Comment on attachment 249255 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=249255&action=review

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:-82
> -    ASSERT_WITH_MESSAGE(lowValue != highValue, "A single value check should be emitted for single values.");

This is odd. Is the assert no longer valid?

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:117
> +    for (uint16_t i = 0; i < 128; i++) {

uint16_t here.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:128
> +    uint8_t rangeMin;
> +    bool hasRangeMin = false;
> +    for (uint8_t i = 0; i < 128; i++) {

uint8_t here

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:136
> +                if (rangeMin >= 'A' && rangeMax <= 'Z') {

Note that in WebKit, toASCIILower() is faster than toASCIIUpper(), the convention is to use lowercase for the reference.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:138
> +                    for (uint16_t rangeIndex = rangeMin; rangeIndex <= rangeMax; rangeIndex++) {

Back to uint16_t here

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:139
> +                        if (destinations[rangeMin] != destinations[rangeIndex + 'a' - 'A']) {

You can just use toASCIILower() on the rangeIndex to get its lowercase value.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:150
> +                        destinations[rangeIndex + 'a' - 'A'] = noDestination;

ditto.

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:168
> -        compileCheckForRange(rangeMin, 127, rangeDestination);
> +        ranges.append(Range(rangeMin, 127, destinations[rangeMin], true));

It is really weird to have a loop tail completely different from the loop body.
I think a comment with your rationale would be useful here.
Comment 5 Alex Christensen 2015-03-25 11:15:34 PDT
Comment on attachment 249255 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=249255&action=review

I'll upload another patch for re-reviewing a few things.

>> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:117
>> +    for (uint16_t i = 0; i < 128; i++) {
> 
> uint16_t here.

The keys of node.transitions are uint16_ts.

>> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:128
>> +    for (uint8_t i = 0; i < 128; i++) {
> 
> uint8_t here

These could be chars, but I thought it looks confusing to rely on char overflow:
for (char i = 0; i > 0; ++i)

>> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:136
>> +                if (rangeMin >= 'A' && rangeMax <= 'Z') {
> 
> Note that in WebKit, toASCIILower() is faster than toASCIIUpper(), the convention is to use lowercase for the reference.

Is that because char toASCIILower uses a lookup table?  That eliminates any conditional checks, and the table would be well-cached in the CPU.  Cool!
I'll use toASCIILower in the interpreter, but when compiling it would actually be faster to just add an immediate value of 32 rather than lookup a value in the table.  I'll use those functions to be consistent, though, because the performance gains would not be significant here, but it might be worth looking into in some places.

>> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:138
>> +                    for (uint16_t rangeIndex = rangeMin; rangeIndex <= rangeMax; rangeIndex++) {
> 
> Back to uint16_t here

Unintentional.  Should be uint8_t.

>> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:168
>> +        ranges.append(Range(rangeMin, 127, destinations[rangeMin], true));
> 
> It is really weird to have a loop tail completely different from the loop body.
> I think a comment with your rationale would be useful here.

It's *your* rationale.  You wrote this in the first place.
Comment 6 Alex Christensen 2015-03-25 11:17:25 PDT
Created attachment 249416 [details]
Patch
Comment 7 Benjamin Poulain 2015-03-25 16:02:22 PDT
Comment on attachment 249416 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=249416&action=review

> Source/WebCore/contentextensions/DFABytecodeCompiler.cpp:171
> +        // Ranges are appended after passing the end of them.
> +        // If a range goes to 127, we will have an uncommitted rangeMin because the loop does not check 128.

Oh, ok, that's not what I meant :)

What I meant is explaining why you don't need to check for caseInsensitive ranges in this case.
Comment 8 Alex Christensen 2015-03-25 16:06:39 PDT
http://trac.webkit.org/changeset/181980