Bug 142998 - Make URL filter patterns matching consistent and add a simple canonicalization step
Summary: Make URL filter patterns matching consistent and add a simple canonicalizatio...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Benjamin Poulain
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-23 21:30 PDT by Benjamin Poulain
Modified: 2015-03-24 16:25 PDT (History)
2 users (show)

See Also:


Attachments
Patch (28.66 KB, patch)
2015-03-23 21:37 PDT, Benjamin Poulain
no flags Details | Formatted Diff | Diff
Patch (31.73 KB, patch)
2015-03-24 14:25 PDT, Benjamin Poulain
achristensen: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Benjamin Poulain 2015-03-23 21:30:48 PDT
Make URL filter patterns matching consistent and add a simple canonicalization step
Comment 1 Benjamin Poulain 2015-03-23 21:37:50 PDT
Created attachment 249311 [details]
Patch
Comment 2 Alex Christensen 2015-03-24 10:53:30 PDT
Comment on attachment 249311 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=249311&action=review

Hooray! Fewer DFA nodes!  This looks good, but there are a few things that need to be addressed.

> Source/WebCore/contentextensions/URLFilterParser.cpp:260
> +            // ".*" is the only simple term matching any string.
> +            return isUniversalTransition() && m_quantifier == AtomQuantifier::ZeroOrMore;

You forgot about our good friend .?

> Source/WebCore/contentextensions/URLFilterParser.cpp:273
> +                // -(.)*, (.+)*, (.?)* etc.

Put some parsing test cases in for these.
testPatternStatus("(.?)*", ContentExtensions::URLFilterParser::ParseStatus::MatchesEverything);
etc.

> Source/WebCore/contentextensions/URLFilterParser.cpp:768
> +        Term canonicalDotStart(Term::UniversalTransition);

canonicalDotStar

> Source/WebCore/contentextensions/URLFilterParser.cpp:774
> +            bool isAfterDotStart = false;

isAfterDotStar

> Source/WebCore/contentextensions/URLFilterParser.cpp:777
> +                    m_sunkTerms.remove(termIndex);

This is removing from the middle of a vector, which could be slow.  I think we should use a Deque for m_sunkTerms, only remove the first term while it matches everything, then prepend the canonicalDotStar if we removed anything.

> Source/WebCore/contentextensions/URLFilterParser.cpp:795
> +        if (m_sunkTerms.size() > 2 && m_sunkTerms.last().isEndOfLineAssertion() && m_sunkTerms[m_sunkTerms.size() - 1].isKnownToMatchAnyString())

m_sunkTerms[m_sunkTerms.size() - 2].isKnownToMatchAnyString(), right?

> Tools/TestWebKitAPI/Tests/WebCore/ContentExtensions.cpp:348
> +    const char* termsKnownToMatchAnythingFilter = "[{\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre1.*post1$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre2(.*)post2$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre3(.*)?post3$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre4(.*)+post4$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre5(.*)*post5$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre6(.)*post6$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre7(.+)*post7$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre8(.?)*post8$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre9(.+)?post9$\"}}, {\"action\":{\"type\":\"block\"},\"trigger\":{\"url-filter\":\"^pre9(.?)+post9$\"}}]";

Put each action on its own line here and elsewhere.  This will match the other tests.
Also, you have two pre9/post9 pairs.  Replace one with pre0/post0 and add a test for it.

> Tools/TestWebKitAPI/Tests/WebCore/ContentExtensions.cpp:355
> +    testRequest(backend, mainDocumentRequest("pre1://webkit.org/post1"), { ContentExtensions::ActionType::BlockLoad });

Put some failures in like pre1://webkit.org/post2

> Tools/TestWebKitAPI/Tests/WebCore/ContentExtensions.cpp:416
> +    // The patten must finish with cat, with any number of of

any number of s's following it.
Comment 3 Benjamin Poulain 2015-03-24 14:23:27 PDT
Comment on attachment 249311 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=249311&action=review

>> Source/WebCore/contentextensions/URLFilterParser.cpp:260
>> +            return isUniversalTransition() && m_quantifier == AtomQuantifier::ZeroOrMore;
> 
> You forgot about our good friend .?

I think you are mixing this with matchesAtLeastOneCharacter(). ".?" does not match any string. It only match a string of size 0 or 1.

>> Source/WebCore/contentextensions/URLFilterParser.cpp:777
>> +                    m_sunkTerms.remove(termIndex);
> 
> This is removing from the middle of a vector, which could be slow.  I think we should use a Deque for m_sunkTerms, only remove the first term while it matches everything, then prepend the canonicalDotStar if we removed anything.

I don't think it is worth speculating the performance of this yet. This should be dwarfed by any operations of the DFA graph since it is so much bigger.
Comment 4 Benjamin Poulain 2015-03-24 14:25:21 PDT
Created attachment 249353 [details]
Patch
Comment 5 Alex Christensen 2015-03-24 16:00:11 PDT
Comment on attachment 249353 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=249353&action=review

> Tools/TestWebKitAPI/Tests/WebCore/ContentExtensions.cpp:458
> +    // The patten must finish with cat, with any number of 's' following it, but no other character.

pattern
Comment 6 Benjamin Poulain 2015-03-24 16:25:15 PDT
Committed r181917: <http://trac.webkit.org/changeset/181917>