Bug 120030 - input/textarea: Count text length for maxLength check with the standard way
Summary: input/textarea: Count text length for maxLength check with the standard way
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: Forms
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Myles C. Maxfield
URL:
Keywords: BlinkMergeCandidate, InRadar
Duplicates: 229553
Depends on:
Blocks:
 
Reported: 2013-08-19 15:39 PDT by Ryosuke Niwa
Modified: 2021-08-27 16:54 PDT (History)
CC List: 9 users

See Also:


Attachments

Description Ryosuke Niwa 2013-08-19 15:39:05 PDT
Consider merging https://chromium.googlesource.com/chromium/blink/+/07f11c2650bffcdf07b2e55b50fef917940366a1

We counted user-input text in grapheme cluster units, i.e. a letter plus
combining characters is counted as '1,' and a surrogate pair is counted
as '1.' According to the standard and other browsers, we should count
them in UTF-16 code units.
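The three counting schemes at issue give different lengths for the same visible text. A minimal Python sketch (the grapheme-cluster counter here is an approximation that only handles trailing combining marks, not full UAX #29 segmentation):

```python
import unicodedata

def utf16_units(s):
    # Length in UTF-16 code units -- what the standard's maxlength counts.
    return len(s.encode("utf-16-le")) // 2

def approx_grapheme_clusters(s):
    # Rough cluster count: a base character plus any trailing combining
    # marks counts as 1. Real segmentation follows UAX #29; this sketch
    # only covers the combining-mark case discussed in this bug.
    return sum(1 for ch in s if not unicodedata.combining(ch))

decomposed = "e\u0301"   # 'e' followed by a combining acute accent
clef = "\U0001D11E"      # a non-BMP character (needs a surrogate pair)

# code points / UTF-16 units / grapheme clusters
print(len(decomposed), utf16_units(decomposed), approx_grapheme_clusters(decomposed))  # 2 2 1
print(len(clef), utf16_units(clef), approx_grapheme_clusters(clef))                    # 1 2 1
```

WebKit's historical behavior corresponds to the last column; the spec's maxlength corresponds to the middle one.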
Comment 1 Ryosuke Niwa 2013-08-19 15:39:29 PDT
Darin, do you know why we do what we currently do? (i.e. count grapheme clusters).
Comment 2 Ryosuke Niwa 2013-08-19 16:43:22 PDT
This behavior was implemented in https://bugs.webkit.org/show_bug.cgi?id=7622 following https://bugs.webkit.org/show_bug.cgi?id=6987#c11:

 Comment #11 From Darin Adler 2006-03-05 15:49:06 PST
(From update of attachment 6878 [details])
One major difference between this maxLength implementation and the one I did in KWQTextField is that this one limits you to a certain number of UTF-16 characters. But the one in KWQTextField limits you to a certain number of "composed character sequences". That means that an e with an umlaut over it counts as 1 character even though it can be two Unicode characters in a row (the e followed by the non-spacing umlaut) and a single Japanese character that is outside the "BMP" that requires two UTF-16 codes (a "surrogate pair") to encode also counts as a single character.

The code that deals with this in KWQTextField is _KWQ_numComposedCharacterSequences and _KWQ_truncateToNumComposedCharacterSequences:.

We will need to replicate this, although I guess it's fine not to at first, but I'd like to see another bug about that.

To tell if a character is half of a surrogate pair, you use macros in <unicode/utf16.h>, such as U16_LENGTH or U16_IS_LEAD. To tell if the character is going to combine with the one before it is more difficult. There's code in CoreFoundation that does this analysis and I presume there's some way to do it with ICU, but I don't know what that is.

In addition to determining such things, code will have to be careful not to do math on the length of strings, since composing means that "length of A plus length of B" is not necessarily the same as "length of A plus B".
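The surrogate checks Darin refers to are simple range tests. A minimal Python rendering of what the ICU macros in <unicode/utf16.h> compute (the names mirror the real macros, but this is a sketch, not the ICU implementation):

```python
def u16_is_lead(unit):
    # What U16_IS_LEAD checks: lead (high) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF

def u16_is_trail(unit):
    # Trail (low) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF

def u16_length(code_point):
    # What U16_LENGTH computes: code points above U+FFFF take two UTF-16
    # code units (a surrogate pair); everything else takes one.
    return 2 if code_point > 0xFFFF else 1

# U+1D11E encodes as the surrogate pair 0xD834 0xDD1E.
assert u16_is_lead(0xD834) and u16_is_trail(0xDD1E)
assert u16_length(0x1D11E) == 2 and u16_length(0x00E9) == 1
```

Detecting whether a character combines with the one before it is, as the comment says, harder; today that is grapheme cluster segmentation per UAX #29.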
Comment 3 Ryosuke Niwa 2013-08-19 16:49:47 PDT
I've started a thread on whatwg to see if we can change the specification: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2013-August/040505.html
Comment 4 Darin Adler 2013-08-20 09:28:51 PDT
(In reply to comment #1)
> Darin, do you know why we do what we currently do? (i.e. count grapheme clusters).

The original reason was that I understood maxlength as trying to limit to number of characters that fit physically into a fixed size text field. When a font was fixed width or nearly so, it made sense to count the U character and umlaut character that combined with it as a single one rather than letting the internal representation affect the length. It’s particularly bad because decomposed and precomposed strings that look identical have different lengths! And it seems super silly to have non-BMP characters count as two characters!

If maxlength is instead present to try to avoid violating a server protocol, then I could understand it being limit on the number of UTF-16 units instead. But then why not UTF-8 units?

Maybe wide characters are a concern too. Anyway, we should match the standard, but it’s nuts that the standard does not consider actual end user concepts of string length and instead concentrates on internals in this way.
Comment 5 Alexey Proskuryakov 2014-01-08 19:16:11 PST
<rdar://problem/15776076>
Comment 6 Domenic Denicola 2016-07-07 13:16:20 PDT
FYI we are considering changing the standard here to count code points instead of code units at least. The thinking is that's a good balance between what a developer might expect (since code points are somewhat first-class in JS with ES6, and are what pattern="*{2}" will restrict to) and what a user might expect.

We also are hoping to change to count linebreaks as one character instead of 2.

Thoughts and feedback welcome at https://github.com/whatwg/html/pull/1517
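The proposed counting can be illustrated in Python, whose `len` already counts code points. The CRLF normalization step is an assumption about how the proposal would measure textarea values, sketched here rather than taken from the spec text:

```python
def code_points(s):
    return len(s)  # Python strings are sequences of code points

def utf16_units(s):
    return len(s.encode("utf-16-le")) // 2

emoji = "\U0001F600"
print(code_points(emoji), utf16_units(emoji))  # 1 2

# Under the proposal, a line break in a textarea value counts as one
# character: normalize CRLF (and bare CR) to LF before measuring.
raw = "a\r\nb"
normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
print(code_points(normalized))  # 3
```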
Comment 7 Alexey Proskuryakov 2021-08-26 17:11:31 PDT
*** Bug 229553 has been marked as a duplicate of this bug. ***
Comment 8 Myles C. Maxfield 2021-08-26 20:32:58 PDT
I think it isn’t unreasonable to revisit this, given the current browser landscape. Maxlength in text areas doesn’t seem like a competitive advantage.
Comment 9 Alexey Proskuryakov 2021-08-27 09:21:02 PDT
WebKit behavior makes at least some sense in comparison with other browsers, where it is outright nonsensical, and our behavior does not cause any compatibility fallout. So I'm not sure what the reason to change it would be.

Perhaps a reasonable step would be to deprecate maxlength and to remove it from all browser engines, as everyone appears to be struggling to find any meaning for it.
Comment 10 Ryosuke Niwa 2021-08-27 12:38:40 PDT
https://github.com/whatwg/html/issues/1467 has a number of other suggestions, like allowing websites to specify which counting method to use. I tend to agree that using UTF-16 code units for maxlength doesn't make much sense in any of the use cases mentioned. The only argument for it is to match the JS string API, and that's not a good one because it very easily results in end-user confusion between a single accented character and a base character plus combining accent pair, let alone emojis and languages which use code points that involve surrogate pairs as well as single code units in UTF-16.
Comment 11 Myles C. Maxfield 2021-08-27 16:09:25 PDT
My comment regarding “competitive advantage” is really about standardization. This area doesn’t seem like it’s worth willfully violating the spec, and other browsers, about. If we think our implementation is better, we should pursue changing the spec to match us, so other browsers will get the better behavior too. Regardless of what the policy is, this seems like the kind of thing that browser divergence is worse than any individual policy. It’s hard to imagine that people are choosing which browser they use based on this issue.
Comment 12 Ryosuke Niwa 2021-08-27 16:54:16 PDT
(In reply to Myles C. Maxfield from comment #11)
> My comment regarding “competitive advantage” is really about
> standardization. This area doesn’t seem like it’s worth willfully violating
> the spec, and other browsers, about.

I think Alexey, at least, is arguing precisely that it is? It's probably very confusing for users to be able to type in ä but not a followed by the combining umlaut character ( ̈).

> If we think our implementation is better, we should pursue changing the spec to match us, so other browsers will get the better behavior too.

We did. That's what https://github.com/whatwg/html/issues/1467 is, and it's still open today, although it doesn't seem like Gecko and Blink are willing to change at this point.

> Regardless of what the policy is, this seems like the kind of thing that browser divergence is worse than any individual policy.

Probably is.

> It’s hard to imagine that people are choosing which browser they use based on this issue.

They may not be. But they might think Safari / WebKit introduced a new regression if we implemented what the spec says (i.e. what Gecko & Blink do today).