Bug 120030 - input/textarea: Count text length for maxLength check with the standard way
Summary: input/textarea: Count text length for maxLength check with the standard way
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: Forms
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Myles C. Maxfield
URL:
Keywords: BlinkMergeCandidate, InRadar
Duplicates: 229553
Depends on:
Blocks:
 
Reported: 2013-08-19 15:39 PDT by Ryosuke Niwa
Modified: 2021-08-27 16:54 PDT (History)
CC List: 9 users

See Also:


Attachments

Description Ryosuke Niwa 2013-08-19 15:39:05 PDT
Consider merging https://chromium.googlesource.com/chromium/blink/+/07f11c2650bffcdf07b2e55b50fef917940366a1

We counted user-input text in grapheme cluster units, i.e. a letter plus
combining characters is counted as '1,' and a surrogate pair is counted
as '1.' According to the standard and other browsers, we should count
them in UTF-16 code units.
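The three counting schemes at issue give different lengths for the same visible text. A minimal Python sketch (the grapheme-cluster counter here is an approximation that only handles trailing combining marks, not full UAX #29 segmentation):

```python
import unicodedata

def utf16_units(s):
    # Length in UTF-16 code units -- what the standard's maxlength counts.
    return len(s.encode("utf-16-le")) // 2

def approx_grapheme_clusters(s):
    # Rough cluster count: a base character plus any trailing combining
    # marks counts as 1. Real segmentation follows UAX #29; this sketch
    # only covers the combining-mark case discussed in this bug.
    return sum(1 for ch in s if not unicodedata.combining(ch))

decomposed = "e\u0301"   # 'e' followed by a combining acute accent
clef = "\U0001D11E"      # a non-BMP character (needs a surrogate pair)

# code points / UTF-16 units / grapheme clusters
print(len(decomposed), utf16_units(decomposed), approx_grapheme_clusters(decomposed))  # 2 2 1
print(len(clef), utf16_units(clef), approx_grapheme_clusters(clef))                    # 1 2 1
```

WebKit's historical behavior corresponds to the last column; the spec's maxlength corresponds to the middle one.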
Comment 1 Ryosuke Niwa 2013-08-19 15:39:29 PDT
Darin, do you know why we do what we currently do? (i.e. count grapheme clusters).
Comment 2 Ryosuke Niwa 2013-08-19 16:43:22 PDT
This behavior was implemented in https://bugs.webkit.org/show_bug.cgi?id=7622 following https://bugs.webkit.org/show_bug.cgi?id=6987#c11:

 Comment #11 From Darin Adler 2006-03-05 15:49:06 PST
(From update of attachment 6878 [details])
One major difference between this maxLength implementation and the one I did in KWQTextField is that this one limits you to a certain number of UTF-16 characters. But the one in KWQTextField limits you to a certain number of "composed character sequences". That means that an e with an umlaut over it counts as 1 character even though it can be two Unicode characters in a row (the e followed by the non-spacing umlaut) and a single Japanese character that is outside the "BMP" that requires two UTF-16 codes (a "surrogate pair") to encode also counts as a single character.

The code that deals with this in KWQTextField is _KWQ_numComposedCharacterSequences and _KWQ_truncateToNumComposedCharacterSequences:.

We will need to replicate this, although I guess it's fine not to at first, but I'd like to see another bug about that.

To tell if a character is half of a surrogate pair, you use macros in <unicode/utf16.h>, such as U16_LENGTH or U16_IS_LEAD. To tell if the character is going to combine with the one before it is more difficult. There's code in CoreFoundation that does this analysis and I presume there's some way to do it with ICU, but I don't know what that is.

In addition to determining such things, code will have to be careful not to do math on the length of strings, since composing means that "length of A plus length of B" is not necessarily the same as "length of A plus B".
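The surrogate checks Darin refers to are simple range tests. A minimal Python rendering of what the ICU macros in <unicode/utf16.h> compute (the names mirror the real macros, but this is a sketch, not the ICU implementation):

```python
def u16_is_lead(unit):
    # What U16_IS_LEAD checks: lead (high) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF

def u16_is_trail(unit):
    # Trail (low) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF

def u16_length(code_point):
    # What U16_LENGTH computes: code points above U+FFFF take two UTF-16
    # code units (a surrogate pair); everything else takes one.
    return 2 if code_point > 0xFFFF else 1

# U+1D11E encodes as the surrogate pair 0xD834 0xDD1E.
assert u16_is_lead(0xD834) and u16_is_trail(0xDD1E)
assert u16_length(0x1D11E) == 2 and u16_length(0x00E9) == 1
```

Detecting whether a character combines with the one before it is, as the comment says, harder; today that is grapheme cluster segmentation per UAX #29.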
Comment 3 Ryosuke Niwa 2013-08-19 16:49:47 PDT
I've started a thread on whatwg to see if we can change the specification: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2013-August/040505.html
Comment 4 Darin Adler 2013-08-20 09:28:51 PDT
(In reply to comment #1)
> Darin, do you know why we do what we currently do? (i.e. count grapheme clusters).

The original reason was that I understood maxlength as trying to limit to number of characters that fit physically into a fixed size text field. When a font was fixed width or nearly so, it made sense to count the U character and umlaut character that combined with it as a single one rather than letting the internal representation affect the length. It’s particularly bad because decomposed and precomposed strings that look identical have different lengths! And it seems super silly to have non-BMP characters count as two characters!

If maxlength is instead present to try to avoid violating a server protocol, then I could understand it being limit on the number of UTF-16 units instead. But then why not UTF-8 units?

Maybe wide characters are a concern too. Anyway, we should match the standard, but it’s nuts that the standard does not consider actual end user concepts of string length and instead concentrates on internals in this way.
Comment 5 Alexey Proskuryakov 2014-01-08 19:16:11 PST
<rdar://problem/15776076>
Comment 6 Domenic Denicola 2016-07-07 13:16:20 PDT
FYI we are considering changing the standard here to count code points instead of code units at least. The thinking is that's a good balance between what a developer might expect (since code points are somewhat first-class in JS with ES6, and are what pattern="*{2}" will restrict to) and what a user might expect.

We also are hoping to change to count linebreaks as one character instead of 2.

Thoughts and feedback welcome at https://github.com/whatwg/html/pull/1517
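The proposed counting can be illustrated in Python, whose `len` already counts code points. The CRLF normalization step is an assumption about how the proposal would measure textarea values, sketched here rather than taken from the spec text:

```python
def code_points(s):
    return len(s)  # Python strings are sequences of code points

def utf16_units(s):
    return len(s.encode("utf-16-le")) // 2

emoji = "\U0001F600"
print(code_points(emoji), utf16_units(emoji))  # 1 2

# Under the proposal, a line break in a textarea value counts as one
# character: normalize CRLF (and bare CR) to LF before measuring.
raw = "a\r\nb"
normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
print(code_points(normalized))  # 3
```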
Comment 7 Alexey Proskuryakov 2021-08-26 17:11:31 PDT
*** Bug 229553 has been marked as a duplicate of this bug. ***
Comment 8 Myles C. Maxfield 2021-08-26 20:32:58 PDT
I think it isn’t unreasonable to revisit this, given the current browser landscape. Maxlength in text areas doesn’t seem like a competitive advantage.
Comment 9 Alexey Proskuryakov 2021-08-27 09:21:02 PDT
WebKit behavior makes at least some sense in comparison with other browsers, where it is outright nonsensical, and our behavior does not cause any compatibility fallout. So I'm not sure what the reason to change it would be.

Perhaps a reasonable step would be to deprecate maxlength and to remove it from all browser engines, as everyone appears to be struggling to find any meaning for it.
Comment 10 Ryosuke Niwa 2021-08-27 12:38:40 PDT
https://github.com/whatwg/html/issues/1467 has a number of other suggestions, like allowing websites to specify which counting method to use. I tend to agree that using UTF-16 code units for maxlength doesn't make much sense in any of the use cases mentioned. The only argument for it is to match the JS string API, and that's not a good one because it very easily results in end-user confusion between a single accented character and a base character plus combining accent pair, let alone emojis and languages which use code points that involve surrogate pairs as well as single code units in UTF-16.
Comment 11 Myles C. Maxfield 2021-08-27 16:09:25 PDT
My comment regarding “competitive advantage” is really about standardization. This area doesn’t seem like it’s worth willfully violating the spec, and other browsers, about. If we think our implementation is better, we should pursue changing the spec to match us, so other browsers will get the better behavior too. Regardless of what the policy is, this seems like the kind of thing that browser divergence is worse than any individual policy. It’s hard to imagine that people are choosing which browser they use based on this issue.
Comment 12 Ryosuke Niwa 2021-08-27 16:54:16 PDT
(In reply to Myles C. Maxfield from comment #11)
> My comment regarding “competitive advantage” is really about
> standardization. This area doesn’t seem like it’s worth willfully violating
> the spec, and other browsers, about.

I think Alexey, at least, is arguing precisely that it is? It's probably very confusing for users to be able to type in ä but not a followed by the combining umlaut character ( ̈).

> If we think our implementation is better, we should pursue changing the spec to match us, so other browsers will get the better behavior too.

We did. That's what https://github.com/whatwg/html/issues/1467 is, and it's still open today, although it doesn't seem like Gecko and Blink are willing to change at this point.

> Regardless of what the policy is, this seems like the kind of thing that browser divergence is worse than any individual policy.

Probably is.

> It’s hard to imagine that people are choosing which browser they use based on this issue.

They may not be. But they might think Safari / WebKit introduced a new regression if we implemented what the spec says (i.e. what Gecko & Blink do today).