Bug 36291 - placeholder text should be stripped from line breaks
Summary: placeholder text should be stripped from line breaks
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Forms (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Nobody
URL: http://dev.w3.org/html5/spec/forms.ht...
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-18 04:23 PDT by Mounir Lamouri
Modified: 2010-05-12 08:57 PDT (History)
4 users (show)

See Also:


Attachments
Proposed patch (31.52 KB, patch)
2010-05-05 02:59 PDT, Kent Tamura
no flags Details | Formatted Diff | Diff
Proposed patch (rev.2) (32.47 KB, patch)
2010-05-07 09:59 PDT, Kent Tamura
no flags Details | Formatted Diff | Diff
Proposed patch (rev.3) (32.47 KB, patch)
2010-05-07 10:03 PDT, Kent Tamura
no flags Details | Formatted Diff | Diff
Proposed patch (rev.4) (32.18 KB, patch)
2010-05-10 06:20 PDT, Kent Tamura
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mounir Lamouri 2010-03-18 04:23:32 PDT
According to the specs, the placeholder should be shown after having been stripped from line breaks [1]. At the moment, WebKit doesn't look to do that.

A simple test:
<input placeholder="first part&#10 &#13second part"/>
You should see "first part second part" instead of "first part".

[1] "User agents should present this hint to the user, after having stripped line breaks from it, when the element's value is the empty string and the control is not focused (e.g. by displaying it inside a blank unfocused control).", see $URL
Comment 1 Kent Tamura 2010-05-05 02:59:34 PDT
Created attachment 55102 [details]
Proposed patch
Comment 2 Darin Adler 2010-05-06 18:20:51 PDT
Comment on attachment 55102 [details]
Proposed patch

> +    const AtomicString& attributeValue = getAttribute(placeholderAttr);
> +    if (attributeValue.isEmpty())
> +        return String();
> +    // According to the HTML5 specification, we need to remove CR and LF from
> +    // the attribute value.
> +    Vector<UChar> stripped;
> +    unsigned length = attributeValue.length();
> +    stripped.reserveCapacity(length);
> +    for (unsigned i = 0; i < length; ++i) {
> +        UChar character = attributeValue[i];
> +        if (character == newlineCharacter || character == carriageReturn)
> +            continue;
> +        stripped.append(character);
> +    }
> +    return String(stripped.data(), stripped.size());

Since most strings don't have line breaks, it would be good to have a more efficient code path for the common case. A simple way to do that is to simply search for newline character and carriage returns first with the contains function and add an early return.

Further, we can make this code faster by removing one allocation by using:

    Vector<UChar, 256> stripped;

The storage will be allocated on the stack instead of the heap if it fits in 256 characters. Another possibility, if you decide not to do this and to stick with Vector<UChar>, is to use String::adopt instead, which uses the same pointer. The resulting string will use a bit more storage, but we avoid copying all the characters.

> +        && !strippedPlaceholder().isEmpty();

Seems wasteful to calculate the stripped placeholder just to check if it's empty. If this is at all hot I suggest using a separate function that does the far more efficient emptiness test, stopping on the first character that is not a newline or carriage return.

>      virtual ~HTMLTextFormControlElement();
>      virtual void dispatchFocusEvent();
>      virtual void dispatchBlurEvent();
> +    String strippedPlaceholder() const;

I think this function should be in its own paragraph, not in the same one as dispatchFocusEvent and dispatchBlurEvent.

>  const UChar blackSquare = 0x25A0;
>  const UChar bullet = 0x2022;
> +const UChar carriageReturn = 0x000D;
>  const UChar ethiopicPrefaceColon = 0x1366;
>  const UChar hebrewPunctuationGeresh = 0x05F3;

The characters in CharacterNames.h are supposed to use the names from the Unicode standard. So the name should be lineFeed, not newlineCharacter. The one you are adding has the right name.

> +        // node() should be an HTMLInputElement. WMLInputElement doesn't support placeholder.
> +        ASSERT(node()->isHTMLElement());

This is not the right assertion. You need to assert this is an HTMLInputElement since that's what you are casting to.

    ASSERT(node()->hasTagName(inputTag));

review- because I think you should do at least some of the things I mentioned above
Comment 3 Kent Tamura 2010-05-07 09:59:27 PDT
Created attachment 55390 [details]
Proposed patch (rev.2)
Comment 4 Kent Tamura 2010-05-07 10:03:30 PDT
Created attachment 55391 [details]
Proposed patch (rev.3)
Comment 5 Kent Tamura 2010-05-07 10:09:57 PDT
Thank you for reviewing!

(In reply to comment #2)
> (From update of attachment 55102 [details])
> > +    const AtomicString& attributeValue = getAttribute(placeholderAttr);
> > +    if (attributeValue.isEmpty())
> > +        return String();
> > +    // According to the HTML5 specification, we need to remove CR and LF from
> > +    // the attribute value.
> > +    Vector<UChar> stripped;
> > +    unsigned length = attributeValue.length();
> > +    stripped.reserveCapacity(length);
> > +    for (unsigned i = 0; i < length; ++i) {
> > +        UChar character = attributeValue[i];
> > +        if (character == newlineCharacter || character == carriageReturn)
> > +            continue;
> > +        stripped.append(character);
> > +    }
> > +    return String(stripped.data(), stripped.size());
> 
> Since most strings don't have line breaks, it would be good to have a more
> efficient code path for the common case. A simple way to do that is to simply
> search for newline character and carriage returns first with the contains
> function and add an early return.

Ok, i added code to check CR/LF existence before copying.

> Further, we can make this code faster by removing one allocation by using:
> 
>     Vector<UChar, 256> stripped;
> 
> The storage will be allocated on the stack instead of the heap if it fits in
> 256 characters. Another possibility, if you decide not to do this and to stick
> with Vector<UChar>, is to use String::adopt instead, which uses the same
> pointer. The resulting string will use a bit more storage, but we avoid copying
> all the characters.

I chose String::adopt().

> > +        && !strippedPlaceholder().isEmpty();
> 
> Seems wasteful to calculate the stripped placeholder just to check if it's
> empty. If this is at all hot I suggest using a separate function that does the
> far more efficient emptiness test, stopping on the first character that is not
> a newline or carriage return.

Ok, I added new function; isPlaceholderEmpty().

> >      virtual ~HTMLTextFormControlElement();
> >      virtual void dispatchFocusEvent();
> >      virtual void dispatchBlurEvent();
> > +    String strippedPlaceholder() const;
> 
> I think this function should be in its own paragraph, not in the same one as
> dispatchFocusEvent and dispatchBlurEvent.

Added an empty line.

> >  const UChar blackSquare = 0x25A0;
> >  const UChar bullet = 0x2022;
> > +const UChar carriageReturn = 0x000D;
> >  const UChar ethiopicPrefaceColon = 0x1366;
> >  const UChar hebrewPunctuationGeresh = 0x05F3;
> 
> The characters in CharacterNames.h are supposed to use the names from the
> Unicode standard. So the name should be lineFeed, not newlineCharacter. The one
> you are adding has the right name.

No.  The right name for U+000D is CARRIAGE RETURN according to the Unicode standard.
LINE FEED is U+000A.  Note that we already have a wrong name 'newlineCharacter' for U+000A in CharacterNames.h.

> > +        // node() should be an HTMLInputElement. WMLInputElement doesn't support placeholder.
> > +        ASSERT(node()->isHTMLElement());
> 
> This is not the right assertion. You need to assert this is an HTMLInputElement
> since that's what you are casting to.

No.
 - WMLInputElement also has inputTag.
 - HTMLInputElement represents <isindex> too, and we support placeholder for it (See LayoutTests/fast/forms/isindex-placeholder.html)
Comment 6 Kent Tamura 2010-05-07 10:18:38 PDT
(In reply to comment #5)
> > > +        ASSERT(node()->isHTMLElement());
> > 
> > This is not the right assertion. You need to assert this is an HTMLInputElement
> > since that's what you are casting to.
> 
> No.
>  - WMLInputElement also has inputTag.
>  - HTMLInputElement represents <isindex> too, and we support placeholder for it
> (See LayoutTests/fast/forms/isindex-placeholder.html)

I meant "No" for ASSERT(node()->hasTagName(inputTag)).
I don't think ASSERT(hasTagName(inputTag) || hasTagName(isindexTag)) should be added to  here because we assume isHTMLElement() == HTMLInputElement in RenderTextControlSingleLine.
Comment 7 Darin Adler 2010-05-07 18:10:23 PDT
(In reply to comment #6)
> >  - WMLInputElement also has inputTag.

That's WMLNames::inputTag, which is not the same as HTMLNames::inputTag.

> >  - HTMLInputElement represents <isindex> too, and we support placeholder for it

Good point.

> we assume isHTMLElement() == HTMLInputElement in RenderTextControlSingleLine.

OK.

But generally speaking if we're about to typecast something, we want an assertion to state that the typecast is OK. And isHTMLElement() does not check that for HTMLInputElement*!

ValidityState::typeMismatch is one of many places that uses hasTagName(inputTag) as a guard for a cast to HTMLInputElement*. It's almost certainly OK that site returns false for <isindex>, though, so it's probably OK.

Maybe we need to put an isHTMLInputElement function somewhere so we can use it in this assertion. We might also want to audit all the call sites checking inputTag to see if any of them need to cover isindexTag too.
Comment 8 Darin Adler 2010-05-07 18:11:20 PDT
(In reply to comment #5)
> > >  const UChar blackSquare = 0x25A0;
> > >  const UChar bullet = 0x2022;
> > > +const UChar carriageReturn = 0x000D;
> > >  const UChar ethiopicPrefaceColon = 0x1366;
> > >  const UChar hebrewPunctuationGeresh = 0x05F3;
> > 
> > The characters in CharacterNames.h are supposed to use the names from the
> > Unicode standard. So the name should be lineFeed, not newlineCharacter. The one
> > you are adding has the right name.
> 
> No.  The right name for U+000D is CARRIAGE RETURN according to the Unicode
> standard.
> LINE FEED is U+000A.  Note that we already have a wrong name 'newlineCharacter'
> for U+000A in CharacterNames.h.

Yes, my comment was about the existing wrong name, not about your new correct name.
Comment 9 Darin Adler 2010-05-07 18:21:31 PDT
Comment on attachment 55391 [details]
Proposed patch (rev.3)

I looked in HTML5 and could not find any text saying that we need to remove CR and LF from attribute values. What I saw was text telling HTML5 content authors they must not include CR and LF in their attribute values, but nothing that tells HTML5 browser implementations what to do.

> +<head>
> +<link rel="stylesheet" href="../../../../fast/js/resources/js-test-style.css"
> +</head>

This is strange. Why do you have an unterminated <link> element here? I think you should just leave out this <head> entirely. Do you need this style sheet at all?

> +static bool isNewLineCharacter(UChar ch) { return ch == newlineCharacter || ch == carriageReturn; }
> +
> +static bool isNotNewLineCharacter(UChar ch) { return ch != newlineCharacter && ch != carriageReturn; }

Why these particular function names? Are these two characters called "new line characters"? Is that terminology from the HTML5 specification? I'd prefer a different name given that one of these characters is called "newlineCharacter". The function name seems too close to that.

> +    if (attributeValue.string().find(isNewLineCharacter) == -1)
> +        return attributeValue;

I'm sure it's much more efficient to call contains on a UChar twice than to call the version of find that takes a character-matching function once. I suggest using contains.

> +bool HTMLTextFormControlElement::isPlaceholderEmpty() const
> +{
> +    const AtomicString& attributeValue = getAttribute(placeholderAttr);
> +    if (attributeValue.isEmpty())
> +        return true;
> +    return attributeValue.string().find(isNotNewLineCharacter) == -1;
> +}

I don't think you need a special case for the empty string here. The find function is already quite fast for empty strings, and will give the correct result.

> +        // node() should be an HTMLInputElement. WMLInputElement doesn't support placeholder.

The phrase "should be" is not strong enough. I would say "must be".
Comment 10 Kent Tamura 2010-05-10 06:20:11 PDT
Created attachment 55547 [details]
Proposed patch (rev.4)
Comment 11 Kent Tamura 2010-05-10 06:25:51 PDT
Thank you for reviewing!

(In reply to comment #9)
> I looked in HTML5 and could not find any text saying that we need to remove CR and LF from attribute values. What I saw was text telling HTML5 content authors they must not include CR and LF in their attribute values, but nothing that tells HTML5 browser implementations what to do.

Here it is:
http://www.whatwg.org/specs/web-apps/current-work/multipage/common-input-element-attributes.html#the-placeholder-attribute
| User agents should present this hint to the user,
| after having stripped line breaks from it, when the
| element's value is the empty string and the control
| is not focused (e.g. by displaying it inside a blank
| unfocused control).


> > +<head>
> > +<link rel="stylesheet" href="../../../../fast/js/resources/js-test-style.css"
> > +</head>
> 
> This is strange. Why do you have an unterminated <link> element here? I think you should just leave out this <head> entirely. Do you need this style sheet at all?

You're right.  This is not needed at all.  Removed.


> > +static bool isNewLineCharacter(UChar ch) { return ch == newlineCharacter || ch == carriageReturn; }
> > +
> > +static bool isNotNewLineCharacter(UChar ch) { return ch != newlineCharacter && ch != carriageReturn; }
> 
> Why these particular function names? Are these two characters called "new line characters"? Is that terminology from the HTML5 specification? I'd prefer a different name given that one of these characters is called "newlineCharacter". The function name seems too close to that.

My naming was wrong. The specification calls them "line breaks".  So I should name them is(Not)LineBreak().


> > +    if (attributeValue.string().find(isNewLineCharacter) == -1)
> > +        return attributeValue;
> 
> I'm sure it's much more efficient to call contains on a UChar twice than to call the version of find that takes a character-matching function once. I suggest using contains.

Ok, I changed it to two contains() calls.


> > +bool HTMLTextFormControlElement::isPlaceholderEmpty() const
> > +{
> > +    const AtomicString& attributeValue = getAttribute(placeholderAttr);
> > +    if (attributeValue.isEmpty())
> > +        return true;
> > +    return attributeValue.string().find(isNotNewLineCharacter) == -1;
> > +}
> 
> I don't think you need a special case for the empty string here. The find function is already quite fast for empty strings, and will give the correct result.

Removed attributeValue.isEmpty().


> > +        // node() should be an HTMLInputElement. WMLInputElement doesn't support placeholder.
> 
> The phrase "should be" is not strong enough. I would say "must be".

Changed to "must be".
Comment 12 Kent Tamura 2010-05-12 08:57:34 PDT
Comment on attachment 55547 [details]
Proposed patch (rev.4)

Clearing flags on attachment: 55547

Committed r59236: <http://trac.webkit.org/changeset/59236>
Comment 13 Kent Tamura 2010-05-12 08:57:44 PDT
All reviewed patches have been landed.  Closing bug.