NEW 255467
isASCIISpace vs isHTMLSpace
https://bugs.webkit.org/show_bug.cgi?id=255467
Summary isASCIISpace vs isHTMLSpace
Anne van Kesteren
Reported 2023-04-14 13:05:05 PDT
In WebKit isASCIISpace includes U+000B, which is almost always wrong for web platform purposes. In fact, I'm not sure where U+000B is considered whitespace in the web platform. Pretty much everything I'm familiar with uses https://infra.spec.whatwg.org/#ascii-whitespace or a subset thereof (e.g., HTTP uses a subset). Yet given isASCIISpace is a rather attractive name we use it in a number of places that likely want isHTMLSpace instead. E.g., it appears our CSP code does this wrong. Curious if people could give the historical context here and whether we should maybe consider deprecating isASCIISpace.
Attachments
Darin Adler
Comment 1 2023-04-14 18:10:06 PDT
As documented in the header (there's a comment right at the top explaining this), functions in ASCIICType.h started as non-locale-sensitive versions of the functions in <ctype.h>, safe to use even if setLocale("UTF-8") has been called. So isASCIISpace does exactly what the C library isspace function does in the default C locale, and does not innovate or get adapted to the needs of the web platform. Later when we found that HTTP and HTML used different space definitions (each different from the other and both *subsets* of C's isspace) we added new functions rather than changing the behavior of the underlying one. It’s undeniable that it’s tempting to call isASCIISpace when the HTML documentation says "ASCII whitespace" since sadly HTML uses that name for its own ASCII whitespace subset! One property of isASCIISpace is that it has the same definition as Unicode whitespace, and returns the same thing as ICU’s u_isspace would when the argument is an ASCII character. And it returns true for all ASCII characters that are have the Unicode White_Space property. It would be OK to remove the isASCIISpace function if nothing in WebKit needs it. And leave a comment behind explaining why it does not exist. But first, we must change all the callers to not use it! That might take a while. Another possibility would be to rename isASCIISpace in a way that tries to make it clear it should rarely be used. There is precedent for this in the project. I don’t immediately have an idea for the name. Maybe isLegacyASCIISpace? Maybe isASCIIUnicodeSpace? Maybe isASCIISpaceButCallIsHTMLSpaceInstead. There are other quite a few other functions that call isASCIISpace that are almost never correct for the web platform because they use treat U+000B as whitespace. This is a complicated subject because we have at least 5 interesting definitions of whitespace: - Unicode whitespace (u_isspace) - ASCII whitespace (isASCII), same as Unicode whitespace but only for ASCII characters - HTML’s "ASCII whitespace" <https://infra.spec.whatwg.org/#ascii-whitespace> (isHTMLSpace) - HTTP’s whitespace <https://www.rfc-editor.org/rfc/rfc9110.html#name-whitespace> (isHTTPSpace) - CSS’s document white space characters <https://www.w3.org/TR/css-text-3/#white-space> Given this, it’s hard to write functions that handle spaces correctly without parameterizing them on which definition of whitespace to use. The following all currently use isASCIISpace and so at least some callers probably get incorrect behavior. String/StringImpl::stripWhiteSpace String/StringImpl::simplifyWhiteSpace StringView::stripWhiteSpace isSpaceOrNewline in StringImpl.h isNotSpaceOrNewline in StringImpl.h isNotASCIISpace in ParsingUtilities.h parseInteger parseIntegerAllowingTrailingJunk charactersToDouble charactersToFloat
Anne van Kesteren
Comment 2 2023-04-21 06:22:53 PDT
Thank you, that and bug 184529 comment 11 which I discovered recently have helped me understand the status quo a whole lot better. I might slowly try to chip away at investigating the callers as time allows.
Radar WebKit Bug Importer
Comment 3 2023-04-21 13:06:22 PDT
Anne van Kesteren
Comment 4 2024-03-13 04:49:13 PDT
*** Bug 38850 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.