Bug 25552 (H5F-PatternAttr)

Summary: Support for HTML5 Forms "pattern" attribute
Product: WebKit Reporter: Michelangelo De Simone <michelangelo>
Component: FormsAssignee: Michelangelo De Simone <michelangelo>
Status: RESOLVED FIXED    
Severity: Enhancement CC: darin, mike, pkasting, tkent
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#attr-input-pattern
Bug Depends on: 19562    
Bug Blocks: 19264    
Attachments:
Description Flags
Proposed patch
darin: review-
Proposed patch, rev. 2
darin: review+
Proposed patch, rev. 2b
darin: review+
Proposed Patch, rev. 2c darin: review+

Description Michelangelo De Simone 2009-05-04 13:27:51 PDT
WebKit should support this attribute as per WHATWG specs.
Comment 1 Michelangelo De Simone 2009-07-20 16:29:14 PDT
Created attachment 33122 [details]
Proposed patch
Comment 2 Peter Kasting 2009-07-20 16:40:27 PDT
(In reply to comment #1)
> Created an attachment (id=33122) [details]
> Proposed patch

As mentioned on IRC, before this lands you should probably add tests that set patterns with mismatched ()s (e.g. "...) ... (..." and trailing \s, in order to demonstrate that the implementation doesn't actually prepend "^(?:" and append "$)" as mentioned on the spec.
Comment 3 Darin Adler 2009-07-20 16:55:24 PDT
Comment on attachment 33122 [details]
Proposed patch

> +            RegularExpression patternRegExp(pattern(), TextCaseSensitive);
> +            return patternRegExp.match(value()) != 0;

If you are trying to determine if the regular expression matches the entire string you need to do more work. You need to create a test where there is a regular expression that matches the beginning of a string but there are extra characters in the value. I believe that test would fail.

This code also recompiles the regular expression pattern every time you do the match. I suppose that's intentional. If you remove pattern() as I request below, then you'll instead do getAttribute(patternAttr) and put it into a local variable of type const AtomicString& at the top of the function.

> +String HTMLInputElement::pattern() const
> +{
> +    return getAttribute(patternAttr);
> +}

For new code, we try to avoid having this kind of function. Instead we put [Reflect] in the .idl file and let the function be automatically generated. If you were going to write it by hand there are some issues about using String vs. AtomicString.

> +void HTMLInputElement::setPattern(const String& p)
> +{
> +    if (p.isEmpty()) {
> +        int exccode;
> +        removeAttribute(patternAttr, exccode);
> +    } else
> +        setAttribute(patternAttr, p);
> +}

Are you sure this does the right thing for the empty string? I recommend just using [Reflect] in the .idl file and not writing custom code for this. I think it's unlikely this has a rule different from all other attributes.

I'd like to see some tests about how this works with all the special characters for regular expressions, especially ones like "$" and "^".

review- mainly because of the "matches the entire string" issue.
Comment 4 Michelangelo De Simone 2009-07-22 12:01:01 PDT
(In reply to comment #3)

> If you are trying to determine if the regular expression matches the entire
> string you need to do more work. You need to create a test where there is a
> regular expression that matches the beginning of a string but there are extra
> characters in the value. I believe that test would fail.

I've been testing this approach:
            if (value().isEmpty())
                return true; // <- patternMismatch = true if there's a pattern but no value.

            RegularExpression patternRegExp(String("^(?:") + l_pattern + ")$", TextCaseSensitive);
            return patternRegExp.match(value()) != 0;

As far as I undestood RegularExpression::match returns 0 if its parameter matches the whole pattern, -1 otherwise. Is that correct?

> This code also recompiles the regular expression pattern every time you do the
> match. I suppose that's intentional. If you remove pattern() as I request

Yes, honestly I don't like much the way I've been using RegularExpression; I'd like to allocate-use -deallocate it on the heap, is there any reasonable approach to this?

> For new code, we try to avoid having this kind of function. Instead we put
> [Reflect] in the .idl file and let the function be automatically generated. If

Ok.

> I'd like to see some tests about how this works with all the special characters
> for regular expressions, especially ones like "$" and "^".

I'd appreciate if you could provide some regular expression as reference, this patch is quite "peculiar" so it shall be tested deeply.
Comment 5 Darin Adler 2009-07-22 12:39:53 PDT
(In reply to comment #4)
> As far as I undestood RegularExpression::match returns 0 if its parameter
> matches the whole pattern, -1 otherwise. Is that correct?

No. The match function returns the position within its parameter that the regular expression matches. So if it matches at the beginning of the parameter it returns 0, and if it matches 1 character into the parameter it returns 1.

It does not check if the regular expression matches the entire parameter, which is what you want here.

> > This code also recompiles the regular expression pattern every time you do the
> > match. I suppose that's intentional. If you remove pattern() as I request
> 
> Yes, honestly I don't like much the way I've been using RegularExpression; I'd
> like to allocate-use -deallocate it on the heap, is there any reasonable
> approach to this?

I don’t think the heap would be better than the stack for this.

> > I'd like to see some tests about how this works with all the special characters
> > for regular expressions, especially ones like "$" and "^".
> 
> I'd appreciate if you could provide some regular expression as reference, this
> patch is quite "peculiar" so it shall be tested deeply.

For the substring issue, if the string is "ab" then you should have a regular expression "a" which should not match, a regular expression "b" which should not match and a regular expression "ab" that should match. The expression "^ab" should also match, but "^a" should not nor should "^b", and similarly "ab$" should match, etc.

There are tons of special characters in regular expressions, so providing examples that test all the various special characters is not something I can easily do. The ECMAScript standard <http://www.ecma-international.org/publications/standards/Ecma-262.htm> lists the special characters you can use in regular expressions.
Comment 6 Michelangelo De Simone 2009-07-23 11:36:25 PDT
(In reply to comment #5)

> No. The match function returns the position within its parameter that the
> regular expression matches. So if it matches at the beginning of the parameter
> it returns 0, and if it matches 1 character into the parameter it returns 1.
> 
> It does not check if the regular expression matches the entire parameter, which
> is what you want here.

I've modified it in this way:

            RegularExpression patternRegExp(String("^(?:" + l_pattern + ")$"), TextCaseSensitive);
            int matchLength = 0;
            int matchOffset = patternRegExp.match(value(), 0, &matchLength);
            bool fullMatch = (matchOffset == 0 && matchLength == (int)value().length());

            return !fullMatch;

I guess this should solve the issue: ^(?: and )$ force the string to be matched entirely; in the return statement the value() should match from its beginning and for its whole length.

Anyway I'm writing more tests to ensure this is right.
Comment 7 Peter Kasting 2009-07-23 11:41:29 PDT
(In reply to comment #6)
>             bool fullMatch = (matchOffset == 0 && matchLength ==
> (int)value().length());
> 
>             return !fullMatch;

Two preemptive nits:
* Go ahead and return the comparison result directory, you don't need a temp.
* Use static_cast<int>() instead of (int).
Comment 8 Michelangelo De Simone 2009-07-23 11:43:19 PDT
(In reply to comment #7)

> Two preemptive nits:
> * Go ahead and return the comparison result directory, you don't need a temp.
> * Use static_cast<int>() instead of (int).

Absolutely, they were just for my debug purposes...:-)
Comment 9 Darin Adler 2009-07-23 14:14:15 PDT
(In reply to comment #6)
> I've modified it in this way:
> 
>             RegularExpression patternRegExp(String("^(?:" + l_pattern + ")$"),
> TextCaseSensitive);
>             int matchLength = 0;
>             int matchOffset = patternRegExp.match(value(), 0, &matchLength);
>             bool fullMatch = (matchOffset == 0 && matchLength ==
> (int)value().length());
> 
>             return !fullMatch;
> 
> I guess this should solve the issue: ^(?: and )$ force the string to be matched
> entirely; in the return statement the value() should match from its beginning
> and for its whole length.

Will this work properly for strings that have "^" or "$" or "(" or ")" characters in them? I think the fullMatch test alone probably solves the problem without the change to the regular expression change.

The typecast to int is unfortunate. I suggest putting value().length() into a local variable of type int to avoid the cast.
Comment 10 Peter Kasting 2009-07-23 14:21:14 PDT
(In reply to comment #9)
> Will this work properly for strings that have "^" or "$" or "(" or ")"
> characters in them? I think the fullMatch test alone probably solves the
> problem without the change to the regular expression change.

I agree. Prepending/appending to the supplied pattern won't fly (discussed w/Hixie on IRC the other day, I'm surprised this has reappeared; consider expressions with a trailing \, or that look like "...) ... (...").

> The typecast to int is unfortunate. I suggest putting value().length() into a
> local variable of type int to avoid the cast.

Both options cast.  Making the cast explicit seems better to me;

int foo = (some size_t);
if (foo == (some int)) ...

...is more subtle than:

if (static_cast<int>(foo) == (some int))

...if there is ever a potential problem (e.g. overflow).
Comment 11 Darin Adler 2009-07-23 14:59:32 PDT
(In reply to comment #10)
> Both options cast.  Making the cast explicit seems better to me

I don't agree that those are both the same.

WebKit has lots of the implicit integer type conversions using assignment, and those only work for integer conversion. On the other hand, the (int) syntax can convert anything into an int.

If we had a numeric-only cast, then I would agree that the both options were equivalent.
Comment 12 Michelangelo De Simone 2009-07-23 15:11:12 PDT
(In reply to comment #11)

> If we had a numeric-only cast, then I would agree that the both options were
> equivalent.

This is the snippet of the code I'm gonna attach to this bug:

            if (value().isEmpty())
                return true;

            RegularExpression patternRegExp(l_pattern, TextCaseSensitive);
            int matchLength = 0;
            int valueLength = value().length();
            int matchOffset = patternRegExp.match(value(), 0, &matchLength);

            return matchOffset != 0 || matchLength != valueLength;

On http://paste.lisp.org/display/84099 I've temporarily pasted some new negative and positive test to check the code against special chars.

Waiting for comments.
Comment 13 Darin Adler 2009-07-23 15:13:09 PDT
Looks to me like you're on the right track now.
Comment 14 Peter Kasting 2009-07-23 15:16:33 PDT
(In reply to comment #11)
> (In reply to comment #10)
> I don't agree that those are both the same.
> 
> WebKit has lots of the implicit integer type conversions using assignment, and
> those only work for integer conversion. On the other hand, the (int) syntax can
> convert anything into an int.

I explicitly suggested static_cast<int>(), which is much narrower than (int).  I agree that (int), or reinterpret_cast<int>(), would be poor choices.

My memory of the ANSI spec is not good enough to recall whether static_cast<>() allows (without warning) conversions that initialization does not allow.  I don't think it does, but I could be wrong.
Comment 15 Darin Adler 2009-07-23 15:17:41 PDT
(In reply to comment #14)
> My memory of the ANSI spec is not good enough to recall whether
> static_cast<>()  allows (without warning) conversions that initialization
> does not allow. I don't think it does, but I could be wrong.

It does.

For example, a cast from void* to char*.
Comment 16 Peter Kasting 2009-07-23 15:21:56 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > My memory of the ANSI spec is not good enough to recall whether
> > static_cast<>()  allows (without warning) conversions that initialization
> > does not allow. I don't think it does, but I could be wrong.
> 
> It does.
> 
> For example, a cast from void* to char*.

Sorry; I was thinking only of static_cast<int>().  I think that one ought to be OK?

It's clearly not a huge deal; I'm mostly discussing in hopes of determining the "canonical best way" so that I do the right thing in my own future patches.  I tend to eschew temporaries and make all type conversions as explicit as possible, so this would be a notable change in direction for me.

(Sorry, Michelangelo, for hijacking your bug :) )
Comment 17 Michelangelo De Simone 2009-07-23 16:11:06 PDT
Created attachment 33389 [details]
Proposed patch, rev. 2
Comment 18 Darin Adler 2009-07-23 16:22:08 PDT
Comment on attachment 33389 [details]
Proposed patch, rev. 2

> +    const AtomicString& l_pattern = getAttribute(patternAttr);

What's the "l_" prefix here? We don't use underscores in variable names. You can just name this local variable pattern, I believe. It will shadow the pattern() function, but that should be fine; you can always use this->pattern() if you need to get at the pattern function.

> +    if (l_pattern.isEmpty() || value().isEmpty())

This function calls value(), a virtual function, three times. I would suggest doing this at the start of the function:

    String value = this->value();

And then use the local value throughout.

I'm going to say r=me, but we probably do want to fix the l_pattern variable name.
Comment 19 Michelangelo De Simone 2009-07-23 16:49:49 PDT
Created attachment 33393 [details]
Proposed patch, rev. 2b
Comment 20 Darin Adler 2009-07-23 17:01:35 PDT
Comment on attachment 33393 [details]
Proposed patch, rev. 2b

r=me

I did have a few more thoughts, but probably not ones that have to be resolved during review.

The code to get the pattern and value could moved inside the switch so we wouldn't waste any time getting the pattern or value for input types that don't support pattern. You might also be able to remove one or both special cases, since I think compiling an empty regular expression will fail and return false.

Is it really true that an empty value can never mismatch?
Comment 21 Peter Kasting 2009-07-23 17:11:00 PDT
(In reply to comment #20)
> (From update of attachment 33393 [details])
> Is it really true that an empty value can never mismatch?

Yes.  I thought this was wacky at first too.  The relevant text in HTML5 4.10.4.2.8 is:

"If the element's value is not the empty string, and ...(more conditions)..., then the element is suffering from a pattern mismatch."

I think the use case is: As specified, pattern allows you to check that input in optional fields is valid.  If you want to enforce that a field is filled, you can also specify "required".  This way, you can have "Enter your city and state, OR your zip" form fields, and check the zip with pattern.
Comment 22 Michelangelo De Simone 2009-07-23 17:14:26 PDT
(In reply to comment #20)

> The code to get the pattern and value could moved inside the switch so we
> wouldn't waste any time getting the pattern or value for input types that don't
> support pattern. You might also be able to remove one or both special cases,
> since I think compiling an empty regular expression will fail and return false.

This was deliberate, I tought that most of the times developers ask for patternMismatch (through valid()) when no pattern attribute is specified, I tried to avoid entering into the "switch" for its higher overhead.

On the other hand I should have coded something such as:

if ((pattern.isEmpty() || value.isEmpty() || inputType == HIDDEN || inputType == FILE...etc...) return false;

But that would have been too much a shame.:-) Anyway, you tell me... I'm available to move these checks down in the switch/case if you think that it would be wiser.

> Is it really true that an empty value can never mismatch?

I know, it's strange, but it's OK. Peter and I checked this reading specs.
Comment 23 Michelangelo De Simone 2009-07-23 17:15:48 PDT
(In reply to comment #22)

> On the other hand I should have coded something such as:

s/should/could
Comment 24 Darin Adler 2009-07-23 17:16:05 PDT
Sounds like we're good to go then.
Comment 25 Darin Adler 2009-07-23 17:18:00 PDT
(In reply to comment #16)
> Sorry; I was thinking only of static_cast<int>().  I think that one ought to be
> OK?

Peter, I was working on a reply to this, with some examples and comments about what my style preference is and rationale for sometimes preferring to have local variables, and accidentally lost it. I'll probably get back to this soon, but I didn't want to just leave you hanging.
Comment 26 Peter Kasting 2009-07-23 17:26:25 PDT
(In reply to comment #25)
> (In reply to comment #16)
> > Sorry; I was thinking only of static_cast<int>().  I think that one ought to be
> > OK?
> 
> Peter, I was working on a reply to this, with some examples and comments about
> what my style preference is and rationale for sometimes preferring to have
> local variables, and accidentally lost it. I'll probably get back to this soon,
> but I didn't want to just leave you hanging.

No worries :)

Out of curiosity I asked a fewer of the Chromium developers (a tech lead and an ex-chief architect of something or other) what their preferred style would be, and they suggested using static_cast<>().  One commented that the presence of (ugly) static_casts could serve as eventual impetus to go modify the code design to do less casting, which I hadn't thought of.  However, I doubt that the world will end either way people write the code; it's a pretty minor distinction.

Feel free to toss something up on the WK style guide or on IRC anytime!
Comment 27 Michelangelo De Simone 2009-07-23 17:28:33 PDT
Created attachment 33395 [details]
Proposed Patch, rev. 2c
Comment 28 Michelangelo De Simone 2009-07-25 10:02:39 PDT
Does the patch 2c conform with Darin's comments?
Comment 29 Darin Adler 2009-07-25 19:56:42 PDT
Comment on attachment 33395 [details]
Proposed Patch, rev. 2c

r=me
Comment 30 Peter Kasting 2009-07-27 13:22:42 PDT
Fixed in r46423.