Bug 8738 - Text should be always normalized to NFC
Summary: Text should be always normalized to NFC
Status: RESOLVED CONFIGURATION CHANGED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM
Version: 420+
Hardware: Mac OS X 10.4
Importance: P3 Normal
Assignee: Nobody
URL: http://www.w3.org/TR/charmod-norm/#C302
Keywords:
Duplicates: 13150
Depends on:
Blocks:
 
Reported: 2006-05-04 13:01 PDT by Alexey Proskuryakov
Modified: 2022-07-20 18:12 PDT

See Also:


Attachments
test case (620 bytes, text/html)
2006-05-04 13:04 PDT, Alexey Proskuryakov
Safari 15.5 matches other browsers (709.94 KB, image/png)
2022-07-19 10:50 PDT, Ahmad Saleem

Description Alexey Proskuryakov 2006-05-04 13:01:18 PDT
From Character Model for the World Wide Web 1.0: Normalization (W3C Working Draft 27 October 2005):

---------------------------------------------------
C302  A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.


C303 A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

EXAMPLE: If the 'z' is deleted from the (normalized) string cz̧ (where '̧' represents a combining cedilla, U+0327), normalization is necessary to turn the denormalized result ç into the properly normalized ç. If the software that deletes the 'z' later uses the string in a normalization-sensitive operation, it needs to normalize the string before this operation to ensure correctness; otherwise, normalization may be deferred until the data is exposed. Analogous cases exist for insertion and concatenation (e.g. xf:concat(xf:substring('cz̧', 1, 1), xf:substring('cz̧', 3, 1)) in XQuery [XQuery Operators]).

NOTE: Software that denormalizes a string such as in the deletion example above does not need to perform a potentially expensive re-normalization of the whole string to ensure that the string is normalized. It is sufficient to go back to the last non-composing character and re-normalize forward to the next non-composing character; if the string was normalized before the denormalizing operation, it will now be re-normalized.
---------------------------------------------------
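For illustration only, the C303 deletion example can be reproduced with Python's standard unicodedata module. This is a minimal sketch, not WebKit code; the helper name delete_and_renormalize is hypothetical, and it re-normalizes the whole string rather than only the affected run described in the NOTE.

```python
import unicodedata

COMBINING_CEDILLA = "\u0327"

def delete_and_renormalize(text: str, index: int) -> str:
    """Delete one code point, then behave "as if normalization took place"
    after the modification by re-normalizing the result to NFC.
    (Hypothetical helper; normalizes the whole string for simplicity.)"""
    edited = text[:index] + text[index + 1:]
    return unicodedata.normalize("NFC", edited)

# 'c', 'z', U+0327: already in NFC, since no precomposed z-with-cedilla exists.
original = "cz" + COMBINING_CEDILLA
assert unicodedata.normalize("NFC", original) == original

# Deleting 'z' without normalizing leaves the denormalized pair 'c' + U+0327.
denormalized = original[:1] + original[2:]
assert denormalized == "c" + COMBINING_CEDILLA

# Re-normalizing composes the pair into U+00E7 (c with cedilla).
result = delete_and_renormalize(original, 1)
assert result == "\u00e7"
```

A string built this way compares unequal to "\u00e7" until it is normalized, which is the kind of normalization-sensitive failure C302 and C303 are meant to prevent.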

WebKit doesn't perform any Unicode normalization (and I'm going to file a separate bug about text that WebKit produces).
Comment 1 Alexey Proskuryakov 2006-05-04 13:04:28 PDT
Created attachment 8110 [details]
test case

Firefox fails pretty miserably - it cannot even display static decomposed Unicode in this test. I assume this may be related to previous versions of this draft reportedly postulating that Web content should always be in NFC, and consumers needn't check for this.
Comment 2 Alexey Proskuryakov 2006-05-10 06:01:40 PDT
(In reply to comment #1)
> Firefox fails pretty miserably

Mac Firefox 1.5, that is - Windows Firefox 1.0 gives the same result as Safari.
Comment 3 Frank Yung-Fong Tang 2006-06-17 03:20:25 PDT
see 9483 and 9482 and mozilla bug 341854
Comment 4 Alexey Proskuryakov 2007-03-21 23:23:08 PDT
*** Bug 13150 has been marked as a duplicate of this bug. ***
Comment 5 Robert Burns 2007-03-22 03:16:07 PDT
*** Bug 13150 has been marked as a duplicate of this bug. ***
Comment 6 Robert Burns 2007-03-23 21:06:17 PDT
Regarding bug 

Relevant to this bug, there has been some discussion on bug 13150. Keep in mind that there are several strategies for normalization, as outlined at:

<http://www.unicode.org/unicode/reports/tr15/#Canonical_Equivalence>

There it says:

"Strategy A (where each component ensures "that each system component respects canonical equivalence.") is the most robust, but may be less efficient."

Not only is this the most robust, but it strikes me that this would be the "Apple/WebKit/KDE" way.

This would rely on low-level text-handling classes that treat in-memory strings and substrings as canonically equivalent where appropriate, without serializing or deserializing normalized-form strings. An approach like this is probably already required for Normalization Form Compatibility Decomposition. Other similar measures are also required for other "decompositions" and relations between characters (uppercase, lowercase, ...).

So I think WebKit should follow this approach (and perhaps much of this relies on the text system that the system is built on and maybe WebKit is accessing the text system just right):

• Strategy 'A' cited above. In other words, don't change the stored or input text, but instead process canonical normalization along with compatibility normalization and other string processing issues independently of the stored/input text.
• For web editing, input should respect input characters (whether those are compatibility characters or canonical-equivalent characters)
• When input is not explicitly a compatibility character, the core (non-compatibility) Unicode character should be used
• When glyphs exist for canonical-equivalent characters (and don't exist for the stored or input character), the view should render the canonical-equivalent character's glyph
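As a rough illustration of Strategy A, the sketch below compares strings by canonical equivalence while leaving the stored text untouched. It uses Python's standard unicodedata module; the function name canonically_equal is an assumption made for this example, and a real text system would likely avoid normalizing whole strings on every comparison.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence by normalizing
    temporary copies to NFD. Neither input is modified or re-stored.
    (Hypothetical helper for illustration.)"""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

stored = "c\u0327"   # decomposed form, kept exactly as it was input
query = "\u00e7"     # precomposed form typed by a user

# A plain code point comparison treats the two as different strings...
assert stored != query
# ...while a canonical-equivalence-aware comparison treats them as the same text.
assert canonically_equal(stored, query)
# The stored text is unchanged by the comparison.
assert stored == "c\u0327"
```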

Again, I'm not sure how much of this is handled by the text system and how much WebKit handles on its own, but these issues should be discussed, understood and considered when addressing this bug.
Comment 7 Robert Burns 2007-03-23 21:37:19 PDT
(In reply to my own comment #6 to clarify some things)
fix typo:
> Relevant to this discussion there has been some discussion on bug 13150. Keep

fixed some typos (in all caps)
> (and perhaps much of this relies
> on the text system that WEBKIT is BUILT on and maybe WebKit is NOT accessing
> the text system just right):

Some examples for my last two bullet points:
> • When input is not explicitly a compatibility character, the core
> (non-compatibility) unicode character should be used

This relates more to the input manager, but again, it's worth mentioning here for clarification purposes. For example, U+03BC should be used as the canonical character in preference to U+00B5, the non-canonical compatibility character. The compatibility characters are there for legacy reasons and are 'discouraged' (the Unicode Standard's word) for newly entered text. Neither the canonical-equivalent characters nor the compatibility characters are deprecated (meaning "strongly discouraged") by the Unicode Standard. Rather:

• compatibility characters are discouraged for newly created text
• compatibility characters should be preserved if lossless round-trip translations are expected to occur
• canonical-equivalent characters are not deprecated (there was some confusion on bug 13150). Canonical-equivalent characters are "canon" because they are NOT deprecated: they are canonically equivalent. The algorithm for normalization in 

<http://www.unicode.org/unicode/reports/tr15/#Canonical_Equivalence>

is non-normative. The conformance chapter is normative. That means for singletons (like the U+2329/U+3008 example below) one could translate the string from one canonical equivalent to the other or vice versa, as long as the system is internally consistent. Neither of those canonical-equivalent characters is deprecated or discouraged. The Unicode requirements for the algorithm require that any newly added canonical-equivalent character is the one used in the algorithm's description.
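To make the distinction concrete, here is a small sketch using Python's standard unicodedata module. It shows that U+00B5 has only a compatibility mapping to U+03BC (so NFC leaves it alone and NFKC replaces it), while U+2329 is a canonical singleton that NFC maps to U+3008. The helper name mapping_kind is an assumption made for this example.

```python
import unicodedata

MICRO_SIGN = "\u00b5"            # compatibility character
GREEK_SMALL_MU = "\u03bc"
LEFT_POINTING_ANGLE = "\u2329"   # canonical singleton
LEFT_ANGLE = "\u3008"

def mapping_kind(ch: str) -> str:
    """Classify a character's decomposition mapping as 'canonical',
    'compatibility', or 'none'. (Hypothetical helper.)"""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings are tagged, e.g. "<compat> 03BC".
    return "compatibility" if mapping.startswith("<") else "canonical"

# Canonical normalization preserves the compatibility character...
assert unicodedata.normalize("NFC", MICRO_SIGN) == MICRO_SIGN
# ...and only compatibility normalization folds it into U+03BC.
assert unicodedata.normalize("NFKC", MICRO_SIGN) == GREEK_SMALL_MU

# The singleton is replaced even by canonical normalization.
assert unicodedata.normalize("NFC", LEFT_POINTING_ANGLE) == LEFT_ANGLE

assert mapping_kind(MICRO_SIGN) == "compatibility"
assert mapping_kind(LEFT_POINTING_ANGLE) == "canonical"
```

This is why a blanket NFC pass preserves U+00B5 for round-tripping but does not preserve U+2329.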

> • When glyphs exist for canonical-equivalent characters (and don't exist for
> the stored or input character), the view should render the canonical-equivalent
> character's glyph

For example, if someone inputs, or the stored deserialized string contains, U+2329 (left-pointing angle bracket), which is canonical-equivalent to U+3008 (left angle bracket), and no system font has a glyph for U+2329, then turn to U+3008 as a fallback for glyphs. This might even be handled font by font as WebKit moves through each font in the CSS declaration. There are no semantic differences between these two characters; however, the font glyph differences should be respected whenever possible.
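The fallback described above could look roughly like the following sketch. It is only an illustration under stated assumptions: font coverage is modelled as a plain set of characters (font_coverage is a hypothetical stand-in for a real font lookup), the function names are invented for this example, and only equivalents reachable through NFC/NFD of a single character are tried, so it would find U+3008 for U+2329 but not the reverse.

```python
import unicodedata
from typing import Optional, Set

def glyph_candidates(ch: str) -> list:
    """Return ch followed by single-character canonical equivalents
    obtainable via normalization. (Hypothetical helper.)"""
    candidates = [ch]
    for form in ("NFC", "NFD"):
        alt = unicodedata.normalize(form, ch)
        if len(alt) == 1 and alt not in candidates:
            candidates.append(alt)
    return candidates

def pick_glyph_char(ch: str, font_coverage: Set[str]) -> Optional[str]:
    """Prefer the stored character's own glyph; otherwise fall back to a
    canonical equivalent that the (hypothetical) font covers."""
    for candidate in glyph_candidates(ch):
        if candidate in font_coverage:
            return candidate
    return None

# A font covering only U+3008: fall back to the canonical equivalent.
assert pick_glyph_char("\u2329", {"\u3008"}) == "\u3008"
# A font covering both: respect the stored character's own glyph.
assert pick_glyph_char("\u2329", {"\u2329", "\u3008"}) == "\u2329"
# A font covering neither: no candidate, so move on to the next font.
assert pick_glyph_char("\u2329", set()) is None
```

Calling pick_glyph_char once per font in the CSS font list would give the font-by-font behavior mentioned above.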

I'd be happy to clarify further if anyone has any questions on this research.
Comment 8 Ahmad Saleem 2022-07-19 10:50:31 PDT
Created attachment 461013 [details]
Safari 15.5 matches other browsers

I am unable to reproduce this bug using the attached test case in Safari 15.5 on macOS 12.4; it matches all other browsers, as can be seen in the attached screenshot. I think it was fixed along the way. Can this be marked as "RESOLVED CONFIGURATION CHANGED"? Thanks!

If I am testing incorrectly, please test accordingly. Thanks!
Comment 9 Ryosuke Niwa 2022-07-19 11:36:12 PDT
It does seem like this is a config changed. Thanks for testing.
Comment 10 Alexey Proskuryakov 2022-07-20 17:21:32 PDT
I don't think that WebKit behavior is quite right in this general area.

It is exceptionally tricky, as we need to consider end to end behavior on macOS and on Windows, with the differences coming from input method and file system API behaviors. Sometimes, matching end to end behavior of Chrome on Windows can mean diverging from Mac Chrome behavior on minimized test cases.

This document is still a WG note, but it has continued to be steadily updated - the latest revision is from 2021. We need to look into what it says now, and work towards converging.

I don't know if keeping this bug open is helpful for actually getting that done. What do others think?
Comment 11 Darin Adler 2022-07-20 18:12:32 PDT
I think focused bugs about particular workflows that this affects are more useful to our project than a bug that talks about a general principle. We should take our understanding about what might be wrong here and turn it into a small set of test cases that can demonstrate what goes wrong. Some of those tests might be able to go into Web Platform Tests, but many might be too much about system interaction to fit that well. However, I think we’ll get better results with bugs about symptoms rather than using a bug to track "match a specification", in this case at least.