Bug 27456

Summary: Add <input type=url> validation support
Product: WebKit Reporter: Peter Kasting <pkasting>
Component: FormsAssignee: Michelangelo De Simone <michelangelo>
Status: RESOLVED FIXED    
Severity: Enhancement CC: ap, commit-queue, darin, ddkilzer, ian, laszlo.gombos, mike, pkasting, tkent
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 19264, 29235, 29913    
Attachments:
Description Flags
Just a "demo"
none
Patch v1
none
Patch v1, rebased
none
Patch v1, rebased
none
Patch v1
none
Patch v1 none

Description Peter Kasting 2009-07-20 12:26:32 PDT
See spec section 4.10.4.1.4.

Basically, the only real difference between a URL input and a normal single-line text input is that if the value is not a valid absolute URL, validityState's typeMismatch() method should return true.
Comment 1 Peter Kasting 2009-07-21 10:24:45 PDT
Note: The underlying "URL" input type and associated parser changes were made on bug 25554, so this is only about validation.
Comment 2 Michelangelo De Simone 2009-08-04 16:30:58 PDT
In order to check for value's validity I've been using KURL::isValid(), however there's a failing assertion in KURL::KURL(String&), called by this code:

    if (inputType() == URL && !value().isEmpty()) {
        KURL kurl(value());

        return !kurl.isValid();
    }

The assertion in KURL::KURL(String&) fails due to the fact that the input parameter should be already a valid url.

Any idea/suggestion about that?
Comment 3 Peter Kasting 2009-08-04 16:39:37 PDT
I'm not sure you can directly convert the user's input to a URL type without performing fixup.  For example, we probably want to allow "www.apple.com", but technically this needs to be fixed to "http://www.apple.com/".

I know how to do fixup in the Chromium code but I'm not sure where the right place in WebKit is.  There has to be one since we have to handle a lot of URLs in webpages already...
Comment 4 Michelangelo De Simone 2009-08-04 16:41:15 PDT
Do you think that using a regexp to check it out would be wiser and/or more efficient?
Every other indication you can provide is welcome.
Comment 5 Peter Kasting 2009-08-04 16:42:08 PDT
I'm pretty sure you do not want to validate this via a regexp.
Comment 6 Darin Adler 2009-08-04 17:25:12 PDT
(In reply to comment #2)
> In order to check for value's validity I've been using KURL::isValid(), however
> there's a failing assertion in KURL::KURL(String&), called by this code:
> 
>     if (inputType() == URL && !value().isEmpty()) {
>         KURL kurl(value());
> 
>         return !kurl.isValid();
>     }
> 
> The assertion in KURL::KURL(String&) fails due to the fact that the input
> parameter should be already a valid url.

The single-argument constructor to KURL is only for use for strings that are already-encoded ASCII-only valid absolute URLs. So you can't just pass an arbitrary string to it. The multiple argument constructors are for use to turn strings from web pages into URLs, usually called through the Document::completeURL function rather than directly.

Alexey Proskuryakov is the person who invented this rule.
Comment 7 Darin Adler 2009-08-04 17:25:45 PDT
(In reply to comment #3)
> I'm not sure you can directly convert the user's input to a URL type without
> performing fixup. For example, we probably want to allow "www.apple.com", but
> technically this needs to be fixed to "http://www.apple.com/".
> 
> I know how to do fixup in the Chromium code but I'm not sure where the right
> place in WebKit is. There has to be one since we have to handle a lot of URLs
> in webpages already.

This operation you're calling fixup is something that web browsers do in their page address fields, but WebKit never does this kind of operation on a URL.
Comment 8 Peter Kasting 2009-08-04 17:33:49 PDT
(In reply to comment #7)
> (In reply to comment #3)
> > I know how to do fixup in the Chromium code but I'm not sure where the right
> > place in WebKit is. There has to be one since we have to handle a lot of URLs
> > in webpages already.
> 
> This operation you're calling fixup is something that web browsers do in their
> page address fields, but WebKit never does this kind of operation on a URL.

Hmm.  While the spec allows the UA to restrict user input to valid absolute URLs (so people can't type "www.apple.com"), it also allows the UA to do this kind of conversion, which I think provides a much better user experience.  But if there is no code in WebCore to do this, that's problematic.  I suppose I could port over the code from Chromium, but it's a noticeable chunk of code (roughly one whole file worth, I think) and I'd want to be sure that's what we want to do first.

I suppose another possibility, since in theory it might be nice to suggest URLs from history for these fields, is that we could somehow plumb the user's value up to the embedder, and let them worry about providing both a "fixed up" result and (maybe in the future someday) suggestions.

Thoughts on the best route forward?  I am happy to port some code from Chromium if it's desirable.
Comment 9 Michelangelo De Simone 2009-08-04 18:02:09 PDT
(In reply to comment #8)

> Hmm.  While the spec allows the UA to restrict user input to valid absolute
> URLs (so people can't type "www.apple.com"), it also allows the UA to do this
> kind of conversion, which I think provides a much better user experience.  But

I've been testing with this:

bool HTMLInputElement::typeMismatch() const
{
    if (inputType() == URL && !value().isEmpty()) {
        KURL kurl(KURL(), value());

        return !kurl.isValid();;
    }
    
    return false;
}

It matches correctly all the "absolute urls" (eg: http://www.something.com:PORT/else), not short urls such as "www.something.com". Remembering that we're just speaking about static validation and that the constraint is "while the value of the element is not a valid absolute URL, the element is suffering from a type mismatch" this would be somehow accepted.

What do you guys think?
Comment 10 Peter Kasting 2009-08-04 18:05:40 PDT
(In reply to comment #9)
> It matches correctly all the "absolute urls" (eg:
> http://www.something.com:PORT/else), not short urls such as
> "www.something.com". Remembering that we're just speaking about static
> validation and that the constraint is "while the value of the element is not a
> valid absolute URL, the element is suffering from a type mismatch" this would
> be somehow accepted.

I repeat what I said in comment 8: this is allowed by the spec but IMO would be a poor user experience (especially once we hook the bits up that prevent form submission for forms with errors; then users will have to know precisely how to write a valid URL, which I think is borderline unacceptable).  I would prefer to see something that allowed more flexibility on the user's part, but I'm not sure what the best implementation method for that would be.
Comment 11 Michelangelo De Simone 2009-08-04 19:00:34 PDT
(In reply to comment #10)

> I repeat what I said in comment 8: this is allowed by the spec but IMO would be
> a poor user experience (especially once we hook the bits up that prevent form
> submission for forms with errors; then users will have to know precisely how to
> write a valid URL, which I think is borderline unacceptable).  I would prefer
> to see something that allowed more flexibility on the user's part, but I'm not
> sure what the best implementation method for that would be.

I agree with you on the user experience issue, we could address it in this way: if the user starts typing something that is not "http://" we prepend it automatically to the value; obviously this applies only on the UI side: values set from DOM are parsed as they are.
Comment 12 Peter Kasting 2009-08-04 22:41:45 PDT
(In reply to comment #11)
> if the user starts typing something that is not "http://" we prepend it
> automatically to the value

If we're going to fix up the user's input to make it a valid URL, we might as well really do it; see http://src.chromium.org/viewvc/chrome/trunk/src/chrome/browser/net/url_fixer_upper.cc?view=markup for an idea of what that means.

Looking at that code myself, it'd be nice to avoid adding it to WebCore if we thought it was just as acceptable to call out to the embedder to do this for us.
Comment 13 Michelangelo De Simone 2009-08-06 20:18:56 PDT
Created attachment 34243 [details]
Just a "demo"

The attached code (only) validates http/https urls which have at least a 1-char long hostname: if the validation is negative (no error) it updates the value with a "pretty url". It's just to understand whether or not the direction is correct.

tkent suggested to look at http://www.w3.org/html/wg/href/draft.html for the [WEBADDRESS] spec.
Comment 14 Peter Kasting 2009-08-07 13:16:24 PDT
Comment on attachment 34243 [details]
Just a "demo"

> +    switch (input->inputType()) {
> +        case HTMLInputElement::TEXT:
> +        case HTMLInputElement::SEARCH:
> +        case HTMLInputElement::TELEPHONE:
> +        case HTMLInputElement::EMAIL:
> +        case HTMLInputElement::PASSWORD:
> +        case HTMLInputElement::NUMBER:
> +        case HTMLInputElement::FILE:
> +        case HTMLInputElement::CHECKBOX:
> +        case HTMLInputElement::RADIO:
> +        case HTMLInputElement::HIDDEN:
> +        case HTMLInputElement::RANGE:
> +        case HTMLInputElement::SUBMIT:
> +        case HTMLInputElement::IMAGE:
> +        case HTMLInputElement::RESET:
> +        case HTMLInputElement::BUTTON:
> +        case HTMLInputElement::ISINDEX:
> +            return false;
> +        case HTMLInputElement::URL:

Doing a giant switch for one case seems lame to me.  Why not just "if (input->inputType() != HTMLInputElement::URL) return false;"?

Adding validation of email later won't present much of a challenge to that code.  And it will mean you don't have to be super-indented.

> +            if (kurl.isValid()
> +                && !kurl.host().isEmpty()
> +                && (kurl.protocolIs("http") || kurl.protocolIs("https"))) {

Why check the protocol?

> +                input->setValue(kurl.prettyURL());
> +                input->setNeedsStyleRecalc();

I don't think you should ever screw with the value displayed to the user.  I think that means for URL types, you'll need to have two variables, one for the user-visible value, and one for the underlying value that script sees.
Comment 15 Kent Tamura 2009-08-11 17:29:57 PDT
(In reply to comment #14)
> I don't think you should ever screw with the value displayed to the user.  I
> think that means for URL types, you'll need to have two variables, one for the
> user-visible value, and one for the underlying value that script sees.

I agree that modifying the value inside typeMismatch() is not good.

We already have two storages for the "value"; a "value" HTML attribute and InputElementData:m_value, and they have different roles.  Introducing another storage seems messy.  If we'd like to support for fixup such as "webkit.org" -> "http://webkit.org", I recommend
 - storing user-input string to InputElementData:m_value as ever
 - HTMLInputElement::value(), typeMatch(), appendFormData() (and others?) make a fixup value from InputElementData:m_value, use it, and don't store it
Comment 16 Peter Kasting 2009-08-11 18:27:19 PDT
(In reply to comment #15)
> We already have two storages for the "value"; a "value" HTML attribute and
> InputElementData:m_value, and they have different roles.  Introducing another
> storage seems messy.  If we'd like to support for fixup such as "webkit.org" ->
> "http://webkit.org", I recommend
>  - storing user-input string to InputElementData:m_value as ever
>  - HTMLInputElement::value(), typeMatch(), appendFormData() (and others?) make
> a fixup value from InputElementData:m_value, use it, and don't store it

The only concern I have about this is speed.  Needing to parse, fix up, etc. on every access to the DOM value can be costly.  Maybe there's a way we can add a subclass somewhere so we can split m_value into m_rawValue and m_cookedValue only for inputs of type URL?  That way at least we'd avoid paying the memory cost on every input element...

Interested in Darin Adler's feedback.
Comment 17 Michelangelo De Simone 2009-08-17 16:00:07 PDT
Created attachment 34996 [details]
Patch v1
Comment 18 Peter Kasting 2009-08-17 16:08:58 PDT
Comment on attachment 34996 [details]
Patch v1

> diff --git a/LayoutTests/fast/forms/ValidityState-typeMismatch-001.html b/LayoutTests/fast/forms/ValidityState-typeMismatch-001.html
> +<script>
> +description("This test performs a negative check on type=url input elements.");

It's a negative check of typeMismatch, but a positive check of type=url validating.  Maybe "...a positive check of type=url validation."

> diff --git a/LayoutTests/fast/forms/ValidityState-typeMismatch-002.html b/LayoutTests/fast/forms/ValidityState-typeMismatch-002.html
> +<input name="victim" type="url" value="www.google.com" />
> +<input name="victim" type="url" value="127.0.0.1" />
> +<input name="victim" type="url" value=".com" />
> +<input name="victim" type="url" value="http://www.g**gle.com" />
> +<input name="victim" type="url" value="http://www.google.com:aaaa" />
> +<input name="victim" type="url" value="http:/www.google.com" />
> +<input name="victim" type="url" value="http:/www.google.com" />

The line above is a duplicate of the one above it, and should be removed (and expected results updated).

> +description("This test performs a positive check on type=url input elements.");

Same comment as above, but reversed.

> diff --git a/WebCore/html/ValidityState.cpp b/WebCore/html/ValidityState.cpp
> +bool ValidityState::typeMismatch()
> +{
> +    if (!control()->hasTagName(inputTag))
> +        return false;
> +
> +    HTMLInputElement* input = static_cast<HTMLInputElement*>(control());
> +    String value = input->value();
> +    
> +    if (value.isEmpty())
> +        return false;
> +
> +    if (input->inputType() == HTMLInputElement::URL) {
> +        KURL kurl(KURL(), value);
> +        
> +        return !kurl.isValid() || kurl.host().isEmpty();

Nit: No blank line in the middle of that conditional.

You don't have any layout tests that test URLs with a non-empty path.  This seems important.

You don't do any fixup (at all) of the user's input, not even adding a scheme, or escaping spaces in the path.  Is this what we want?  I'm concerned about the UX.
Comment 19 Michelangelo De Simone 2009-08-17 16:35:22 PDT
(In reply to comment #18)

> You don't do any fixup (at all) of the user's input, not even adding a scheme,

IMO this should be left to embedder, once type=url has its own UI widget.

> or escaping spaces in the path.

KURL::parse does.
Comment 20 Michelangelo De Simone 2009-08-17 16:39:04 PDT
(In reply to comment #19)

> > or escaping spaces in the path.
> KURL::parse does.

Correction: "general escaping" is managed by KURL::parse, I tought to escape/strip spaces from value but the only sanitiziation allowed by specs is for CR/LF which is already wired into HTMLInputElement.
Comment 21 Peter Kasting 2009-08-17 16:44:24 PDT
(In reply to comment #19)
> (In reply to comment #18)
> 
> > You don't do any fixup (at all) of the user's input, not even adding a scheme,
> 
> IMO this should be left to embedder, once type=url has its own UI widget.

There are no plans to grant this a special widget, AFAIK.

> > or escaping spaces in the path.
> 
> KURL::parse does.

...but the value being provided via DOM access is the original value the user types in, not the result of parse().  See comments 14 - 16...
Comment 22 Michelangelo De Simone 2009-08-18 06:40:51 PDT
(In reply to comment #21)

> There are no plans to grant this a special widget, AFAIK.

I'd say, why not give it a tought?:) However, we can address these two issues separately: one patch for static validation and another one for the value fixup (and PERHAPS a more specialized behaviour for type=url input elements), what do you think?
Comment 23 Peter Kasting 2009-08-18 10:03:43 PDT
(In reply to comment #22)
> one patch for static validation and another one for the value fixup
> (and PERHAPS a more specialized behaviour for type=url input elements), what do
> you think?

The only worry I have with this plan is about cases when KURL internally does some fixup work (like escaping spaces in the path) and reports "valid", but the value we supply to the DOM is the "invalid" version.  I'm not sure what the spec says about how this functions, but that seems like it could potentially cause problems.

CCing Hixie.
Comment 24 Peter Kasting 2009-08-20 22:46:47 PDT
The following comments were lost due to database corruption:

--- Comment #24 from Michelangelo De Simone <micdesim@gmail.com>  2009-08-20 15:17:05 PDT ---
(In reply to comment #23)

> The only worry I have with this plan is about cases when KURL internally does
> some fixup work (like escaping spaces in the path) and reports "valid", but the

While waiting for Hixie's opinion, what would you suggest instead of KURL?
Having an ex-novo method to check for urls' validity could be quite prone to
errors.

****

I replied with a comment 25 like:

I'm not suggesting something instead of KURL-- I'm suggesting that we may need to take the resulting "canonical" (or whatever you'd call it) URL string from inside KURL and supply that back to callers.

****

--- Comment #26 from Michelangelo De Simone <micdesim@gmail.com>  2009-08-20 15:49:14 PDT ---
(In reply to comment #25)

> I'm not suggesting something instead of KURL-- I'm suggesting that we may need
> to take the resulting "canonical" (or whatever you'd call it) URL string from
> inside KURL and supply that back to callers.

Let's put all the pieces together:

Mere validation: KURL.isValid is ok; returns correctly if the URI is a valid
URL (meaning it must have a scheme, otherwise it is obviously invalid). Assume
this "fixed" for now.

Fixup: KURL.prettyUrl could be used but (for the reason I've written above) it
doesn't fixup urls without scheme so the work to prepend a default "http://" to
the value is needed anyway.

Possible solution for the fixup issue: check if value has a scheme (KURL could
be of help), if NO scheme is found (and, therefore, the value is INVALID)
prepend a default scheme ("http://"?) and finally check for value's validity
with kurl::isvalid().

If this "possible solution" is sound: when should it be triggered? On the first
char input by the user (setting value from DOM)?

****

I replied with a comment 27 like:

The worry I have is that KURL may do fixup that makes it report "valid", but if we don't give that string to callers, we could give them something that isn't technically valid.

I also gave several possibilities for how to implement validation and supplying values back to callers.  Michelangelo can supply the details of this comment, I don't have them in my inbox.
Comment 25 Michelangelo De Simone 2009-08-21 13:51:03 PDT
Here follows the comment #27 from Peter Kasting.

--- Comment #27 from Peter Kasting <pkasting@google.com>  2009-08-20 15:59:29 PDT ---
(In reply to comment #26)
> Mere validation: KURL.isValid is ok; returns correctly if the URI is a valid
> URL (meaning it must have a scheme, otherwise it is obviously invalid). Assume
> this "fixed" for now.

The issue I'm trying to get at is that if KURL says the URL is valid, you'll
pass validation, but if you return the raw user input to callers, the actual
string you're returning may not be a valid URL (due to minor stuff like not
converting spaces in the path into %20).  This might screw up somebody's
script.  On the other hand maybe it's not worse than the no-validation we're
doing today.

> If this "possible solution" is sound: when should it be triggered? On the first
> char input by the user (setting value from DOM)?

You can do it one of several ways.

* You can avoid adding any member variables, and always do whatever minimal
"fix up" you're going to do when someone asks what the value is, or asks if the
field is valid.  Adds no variables, but has the highest runtime cost.

* You can hold a member variable on the element that's the "fixed up" version
out of KURL, supply this to anyone who asks, and update it every time the
element's value is mutated.  Saves runtime cost at the expense of some
complexity and a new member.

* You can add both a variable to hold the "fixed up" value and a bool to say
whether the value is dirty, set the bool anytime the value is modified, and do
fixup any time someone asks for the value and the bool is set.  Most complex,
but lowest runtime cost.

Or, we could decide that we should do no fixup at all (legal per the spec, may
be a poor UX), or that we'll do some minimal fixes (i.e. what KURL does
automatically) but are OK sending the unmodified user string to callers
(probably not legal per the spec), and do none of these.

I can't make this decision.
Comment 26 Kent Tamura 2009-09-10 18:40:32 PDT
I think we may commit this patch without any fix-up, and discuss how to fix up and make another patch for fix-up.
Smaller patch is better.
Comment 27 Michelangelo De Simone 2009-09-10 23:48:59 PDT
(In reply to comment #26)
> I think we may commit this patch without any fix-up, and discuss how to fix up
> and make another patch for fix-up.
> Smaller patch is better.

I'll rebase the patch in hours.
Comment 28 Michelangelo De Simone 2009-09-11 09:12:25 PDT
Created attachment 39440 [details]
Patch v1, rebased
Comment 29 Michelangelo De Simone 2009-09-11 09:13:47 PDT
Created attachment 39441 [details]
Patch v1, rebased
Comment 30 Michelangelo De Simone 2009-09-11 14:19:07 PDT
Created attachment 39474 [details]
Patch v1
Comment 31 Michelangelo De Simone 2009-09-11 14:31:30 PDT
Created attachment 39475 [details]
Patch v1
Comment 32 WebKit Commit Bot 2009-09-11 15:01:11 PDT
Comment on attachment 39475 [details]
Patch v1

Clearing flags on attachment: 39475

Committed r48318: <http://trac.webkit.org/changeset/48318>
Comment 33 WebKit Commit Bot 2009-09-11 15:01:17 PDT
All reviewed patches have been landed.  Closing bug.
Comment 34 Kent Tamura 2009-09-13 18:07:53 PDT
I have made a bug entry for fix up: Bug#29235