13150 – Character entity references should produce same result as numeric

Robert Burns

Reported 2007-03-21 15:57:26 PDT

Currently, the HTML named character entities &lang; and &rang; produce different results than the numeric equivalents &#9001; (HEX: &#x2329;) and &#9002; (HEX: &#x232A;). Also the direct inclusion of the character matches the numeric references (i.e., the numeric references use the system fonts, while the named is either mapped to the wrong character or includes some fallback mechanism). For consistency’s sake these named references should map to the right number. If one uses fallback, so should the other (and the direct character as well if possible).

Robert Burns

Comment 1 2007-03-21 16:22:10 PDT

Created attachment 13752: This shows the character entity referenced that's not supported.
Compare this to the URL provided. With the URL, I get the glyph appearing with the named character reference, but not the others. However, when the page is loaded locally, I don't get glyphs at all.

Alexey Proskuryakov

Comment 2 2007-03-21 23:23:08 PDT

This change was intentional, and was a result of a conflict between different specs. In Unicode, code points U+2329 and U+232A are deprecated in favor of U+3008 and U+3009, respectively. According to <http://www.w3.org/TR/charmod-norm/>, text on the Web should be NFC-normalized, and following this requirement is impossible if we map &lang; to U+2329 (deprecated characters cannot appear in normalized text). So, the current situation is that we let conforming documents remain conforming by applying an updated mapping for &rang; and &lang;, but we don't yet fix documents that explicitly contain denormalized text. Bug 8738 is tracking this issue. *** This bug has been marked as a duplicate of 8738 ***
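
The conflict described in this comment can be reproduced outside WebKit. The following is a minimal sketch using only Python's standard library (it is not WebKit code): `html.entities.name2codepoint` holds the HTML 4 entity table, and `unicodedata.normalize` applies NFC using whatever Unicode version the Python build ships with. The helper name `nfc_survives` is made up for illustration.

```python
import unicodedata
from html.entities import name2codepoint  # HTML 4 named entity table


def nfc_survives(ch: str) -> bool:
    """Return True if the character is left unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", ch) == ch


for name in ("lang", "rang"):
    html4_char = chr(name2codepoint[name])          # code point the HTML 4 DTD assigns
    normalized = unicodedata.normalize("NFC", html4_char)
    print(
        f"&{name}; -> U+{ord(html4_char):04X}, "
        f"NFC -> U+{ord(normalized):04X}, "
        f"survives NFC: {nfc_survives(html4_char)}"
    )
```

If the HTML 4 code points do not survive NFC, a document that uses these entities cannot be both HTML-4-mapped and NFC-normalized, which is the trade-off this comment describes.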

Robert Burns

Comment 3 2007-03-22 00:30:35 PDT

(In reply to comment #2)
> specs. In Unicode, code points U+2329 and U+232A are deprecated in favor of
> U+3008 and U+3009, respectively …
> So, the current situation is that we let conforming documents remain conforming
> by applying an updated mapping for &rang; and &lang;, …
I don't believe that U+2329 and U+232A have been deprecated. If anything, I think it’s U+3008 and U+3009 that are deprecated. Since those appear in the CJK punctuation, I believe those are compatibility characters. In contrast, U+2329 and U+232A appear in the math category of the UCS. There are many named character references that are compatibility (deprecated) characters, but those aren't. The reason I noticed these is because the Mac OS system fonts do not support this character (but not because it’s deprecated). I'm trying to find a list I put together of deprecated (compatibility and otherwise) characters. I don't recall these ones being on it. So I think this is a separate bug from the normalized text bug (though I'm happy to be cc'd on that one).

Robert Burns

Comment 4 2007-03-22 00:31:25 PDT

I'm going to reopen this bug until we can get confirmation of which characters are the compatibility (deprecated) characters.

Robert Burns

Comment 5 2007-03-22 00:39:15 PDT

After glancing at the W3C page again (<http://www.w3.org/TR/charmod-norm/>) I recall that the compatibility characters are merely discouraged and not part of normalized text. Also, XML 1.1 is relevant to normalization (since it opens it up to many more UCS characters), but that probably belongs on bug 8738.

Robert Burns

Comment 6 2007-03-22 01:10:50 PDT

Just as an example, the pound sterling symbol is in the basic Latin 1 set from ISO. It is a compatibility character at U+00A3. It is canonically identical to U+20A4 (₤). However, for normalization these characters should always be treated as identical. The named character reference &pound; need not be mapped to U+20A4 for normalization (as far as I understand it). However, that might be an easy start. There are many HTML named character references (like &pound;) that map to these discouraged compatibility characters. However, &lang; and &rang; are not among them. These are characters that are simply not supported by the standard Mac OS fonts (I should probably file a separate bug on that, since it would be a simple fix to many of the fonts). Right now &lang; and &rang; are being mapped from the non-discouraged character to the discouraged compatibility characters (though this does at least give us a glyph under the default install, it's a kludgy way to do it).

Alexey Proskuryakov

Comment 7 2007-03-22 01:42:33 PDT

This has nothing to do with characters being "compatibility" or not. U+2329 has a canonical decomposition that consists of a single code point, which means that it cannot appear in any normalized text (NFKD, NFKC, NFD or NFC), and thus it's effectively deprecated. U+00A3 does not have any decompositions, canonical or otherwise, as of Unicode 4.0.1. I still believe that this issue is completely covered by bug 8738.
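
The two claims in this comment can be checked against the Unicode Character Database. This is a rough sketch with Python's `unicodedata` module, which reflects the Unicode version bundled with the interpreter rather than Unicode 4.0.1, so treat the result as indicative only. The function name `report` is an illustrative choice.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def report(ch: str):
    """Return the raw decomposition mapping and the result of each normalization form."""
    decomposition = unicodedata.decomposition(ch) or "(none)"
    results = {form: unicodedata.normalize(form, ch) for form in FORMS}
    return decomposition, results


for ch in ("\u2329", "\u232a", "\u00a3"):
    decomposition, results = report(ch)
    # Each of these characters normalizes to a single code point, so ord() is safe here.
    rendered = ", ".join(f"{form}=U+{ord(text):04X}" for form, text in results.items())
    print(f"U+{ord(ch):04X}: decomposition={decomposition}; {rendered}")
```

A character whose four normalized forms all differ from the input cannot appear in any normalized text, which is the sense of "effectively deprecated" used above.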

Robert Burns

Comment 8 2007-03-22 02:34:36 PDT

(In reply to comment #7)
> U+2329 has a canonical decomposition that consists of a single code point,
I'm not doubting you, but could you cite where you're finding that U+2329 has a canonical decomposition that consists of a single code point? My understanding is that this isn't even a decomposable character. If it's not a decomposable character, then the only issue (though not applicable to normalized text) is whether either of these are compatibility characters. If they are, then I think one could make a case for ignoring the HTML recommendation in favor of following Unicode / XML, etc.

Robert Burns

Comment 9 2007-03-22 03:16:07 PDT

(In reply to comment #8)
> (In reply to comment #7)
> > U+2329 has a canonical decomposition that consists of a single code point,
>
> I'm not doubting you, but could you cite where you're finding that U+2329 has a
> canonical decomposition that consists of a single code point?
I found it. First, the exclusion table for normalization: <http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt>. Second, the Unicode database, which shows the canonical equivalence: <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>. So changing back to duplicate because normalization would definitely treat these uniformly. I'm surprised that Unicode didn't treat it in the opposite direction: with U+2329 and U+232A used for normalization and U+3008 and U+3009 excluded as singletons. Or, for that matter, why are these involved in normalization at all, since both are singletons?
*** This bug has been marked as a duplicate of 8738 ***

Alexey Proskuryakov

Comment 10 2007-03-22 03:36:47 PDT

I am not aware of the rationale the Unicode Consortium had for this decision, unfortunately. BTW, I highly recommend UnicodeChecker <http://earthlingsoft.net/UnicodeChecker/index.html> for examining Unicode character properties.

Robert Burns

Comment 11 2007-03-22 11:49:30 PDT

OK, I've looked into this some more and I think this bug should be reopened. These decompositions do not imply deprecation. These normalizations are only for string processing and shouldn't affect display. Above all, the character should be displayed as the author entered it, using the glyph designated by the font for that character and not for the character from the normalized string (this is distinct from SGML/XML normalization, if I understand those correctly).
A few examples:
U+2124 (ℤ) Double-struck Capital Z has a decomposition of U+005A (Z) Latin Capital Letter Z
U+2126 (Ω) Ohm sign has a decomposition of U+03A9 (Ω) Greek Capital Letter Omega
In both of these cases, strings that differ only by the decomposition should be treated as identical strings for sorting and searching. However, the display of these strings should always use the glyphs from the font for the original (pre-normalized) characters. If not, then the double-struck capital letter Z would never be displayed double-struck. So it's the same for the left and right angle brackets. The normalization should not affect the rendering. So I think this is a separate issue from bug 8738. Bug 8738 is appropriately about comparing 2 strings for length and retrieving a composed character from an index in the string. It's all about string comparison, not about rendering the string. Again, I think this should be reopened. (Thanks for the tip on UnicodeChecker, BTW.)
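
The separation argued for in this comment (normalize for comparison, leave the stored text alone) might look like the following. This is only a sketch of the commenter's proposal in Python, not a description of what WebKit does; `canonically_equal` is a hypothetical helper name.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under NFC without modifying either of them."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


stored = "\u2329x\u232a"   # text as the author entered it
query = "\u3008x\u3009"    # the same text with the other pair of brackets

print(canonically_equal(stored, query))                 # comparison uses normalized copies
print([f"U+{ord(c):04X}" for c in stored])              # stored code points are untouched
```

Because normalization is applied to temporary copies, the code points handed to the renderer stay whatever the author wrote.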

Alexey Proskuryakov

Comment 12 2007-03-22 12:05:22 PDT

Please note that these are examples of compatibility decomposition, which is entirely different from the canonical one. Compatibility decompositions do not affect normalization to NFC or NFD. It's definitely a bug in OS X fonts that these characters do not have identical glyphs (I did file a Radar report a while ago). I would actually suggest contacting HTML WG and/or WHATWG on the subject of updating the mapping for these entities - I think this would make more sense long-term than trying to fight the Unicode spec. In fact, I intend to do this myself at some point.
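
The canonical versus compatibility distinction drawn in this comment is visible in the decomposition field of UnicodeData.txt: compatibility mappings carry a tag in angle brackets, canonical mappings do not. A small sketch, assuming Python's bundled Unicode data and a made-up helper name `decomposition_kind`:

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition mapping as canonical, compatibility, or none."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # A leading tag such as <font> or <compat> marks a compatibility mapping.
    return "compatibility" if mapping.startswith("<") else "canonical"


for ch in ("\u2124", "\u2329"):
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(
        f"U+{ord(ch):04X}: {decomposition_kind(ch)}, "
        f"changed by NFC: {nfc != ch}, changed by NFKC: {nfkc != ch}"
    )
```

Only the canonical kind is touched by NFC and NFD, which is why the angle brackets are affected while the double-struck Z is not.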

Robert Burns

Comment 13 2007-03-23 14:53:43 PDT

(In reply to comment #2)
> This change was intentional, and was a result of a conflict between different
> specs. In Unicode, code points U+2329 and U+232A are deprecated in favor of
> U+3008 and U+3009, respectively.
I've been researching this further and I think it’s incorrect to treat the canonical equivalent decomposable singletons as deprecated. The compatibility characters are discouraged (those that have a decomposition with the keyword <compat>) but not the canonical equivalent decomposable characters. As far as the relevance to this bug, it means that there is no conflict between the W3C and the Unicode spec. However, there's also probably nothing wrong with translating the &lang; and &rang; character entities into U+3008 and U+3009 respectively. If these had the keyword <compat>, then Unicode prohibits changing the meaning of the text, which NFKC normalization does (that is, it does change the meaning of the text). NFC normalization does not change the meaning of the text.
From the Unicode Standard, chapter 3, section 3.6 (D21) <http://www.unicode.org/book/ch03.pdf>, the following definition: "Compatibility characters are included in the Unicode Standard to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data."
There is nothing about canonical equivalent characters being deprecated or discouraged. In fact, it looks to me that the distinction between compat and canonical is that the canonical characters are not part of the discouraged legacy compatibility characters. I also read this to imply that the normalization should only happen in memory. That is to say, the NFC normalization could be serialized after completion, but it need not. In many ways I think it would be better not to serialize the normalization.
From the same URL (C9): "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
and "Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them."
I read this (along with the W3C document on normalization) to say that WebKit should normalize strings to NFC for in-memory processing of strings. There are other issues that are not implied by this.
• The issue over whether an editable WebView should serialize substituted canonical-equivalent characters is not clear from my reading of the Unicode Standard.
• Also, whether WebKit should render glyphs for a canonical-equivalent or the original is not clear either. I would say from a reading of this conformance chapter that it might be best for WebKit to render the glyphs from a font from either the stored string code point or any canonical-equivalent code point for which there was a glyph (as a fallback mechanism).
Anyway, the compatibility characters are discouraged (though I'm not sure if the term deprecated applies). The canonical-equivalent characters are not deprecated or discouraged (as far as I can tell). Only ten code points have been deprecated (from <http://www.unicode.org/Public/UNIDATA/PropList.txt>):
0340..0341 ; Deprecated # Mn [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
17A3 ; Deprecated # Lo KHMER INDEPENDENT VOWEL QAQ
17D3 ; Deprecated # Mn KHMER SIGN BATHAMASAT
206A..206F ; Deprecated # Cf [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
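
The count of ten follows directly from the four PropList.txt entries quoted at the end of this comment. As a quick sanity check, here is a sketch that parses just that excerpt (copied from the comment, not fetched from unicode.org) and expands the ranges; the function name is illustrative.

```python
PROPLIST_EXCERPT = """\
0340..0341 ; Deprecated # Mn [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
17A3 ; Deprecated # Lo KHMER INDEPENDENT VOWEL QAQ
17D3 ; Deprecated # Mn KHMER SIGN BATHAMASAT
206A..206F ; Deprecated # Cf [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
"""


def deprecated_code_points(text: str) -> set:
    """Expand 'XXXX' and 'XXXX..YYYY' entries tagged Deprecated into a set of ints."""
    points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()      # drop the trailing comment
        if not data:
            continue
        span, prop = (field.strip() for field in data.split(";"))
        if prop != "Deprecated":
            continue
        first, _, last = span.partition("..")
        points.update(range(int(first, 16), int(last or first, 16) + 1))
    return points


deprecated = deprecated_code_points(PROPLIST_EXCERPT)
print(len(deprecated))
print(0x2329 in deprecated, 0x3008 in deprecated)
```

Neither pair of angle brackets appears in the excerpt, which is the point being made about the formal Deprecated property.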

Alexey Proskuryakov

Comment 14 2007-03-24 01:00:22 PDT

Talking about characters with singleton decompositions being deprecated, I agree that there is no formal official provision saying this. However, the fact that such characters cannot appear in any normalized text speaks for itself. See, for example, the definition of "singleton decomposition" at <http://safari.oreilly.com/0201700522/idd1e35240>. I am not completely sure what you are suggesting for this bug. Should we return to the HTML mapping for &rang; and &lang;? Then they won't render with default OS X fonts, and any text using them would become denormalized. Neither is a positive outcome. And one can even argue that we do not violate the HTML spec by changing the mapping, see <http://www.unicode.org/faq/char_combmark.html#8>. So, I do not think that we are going to change this back. Your comment has some interesting ideas about in-memory processing. It would be really great if you could verify those against <http://www.w3.org/TR/charmod-norm/>, and file separate bugs (preferably with tests) for cases where we do not conform.

Robert Burns

Comment 15 2007-03-24 10:42:30 PDT

(In reply to comment #14)
> Talking about characters with singleton decompositions being deprecated, I
> agree that there is no formal official provision saying this. However, that
> fact that such characters can not appear in any normalized text speaks for
> itself. See, for example, the definition of "singleton decomposition" at
> <http://safari.oreilly.com/0201700522/idd1e35240>.
I'm saying that I think that's a misconception about canonical equivalent characters. It applies more to compatibility characters (where the Unicode Standard says this explicitly).
> I am not completely sure what you are suggesting for this bug. Should we return
> to the HTML mapping for &rang; and &lang;? Then they won't render with default
> OS X fonts, and any text using them would become denormalized. Neither is a
> positive outcome.
I'm not sure anything needs to be done about this bug (I did leave it closed). But I think that eventually (when the late normalization and perhaps glyph issues are dealt with), WebKit should not change characters input by a user or read from storage in a lossy way, and this bug should then be reopened (or marked as dependent on bug 8738).
> Your comment has some interesting ideas about in-memory processing. It would be
> really great if you could verify those against
> <http://www.w3.org/TR/charmod-norm/>, and file separate bugs (preferably with
> tests) for cases where we do not conform.
Well, I think bug 8738 covers things for now (or needs to be fixed first). Keep in mind that there are two types of normalization from the W3C: late and early. Given the nature of WebKit (as often a read-only user agent), it would be best if WebKit processed strings using late normalization and using full NFKC late normalization. Early normalization only arises in a role as a content creation user agent. As a content creation user agent (through HTML editing), I'm not sure WebKit should adopt early normalization by default.
As a developer using WebKit in this way, I can use HTMLTidy to do an early normalization of my text if the user, or I as the developer, want that. Without normalization flags set, I want WebKit to load the text as is and accept input (say from the Mac OS X character palette) without arbitrarily changing the characters to their canonical equivalents. Since WebKit needs to do late normalization anyway (in its read-only mode), there's no reason to force early normalization on users. As I've suggested on bug 8738, I think WebKit should use canonical equivalent characters for fallback glyphs. So with proper late normalization implemented, the lack of a glyph for &rang; and &lang; would not even be an issue. Earlier you suggested not going against Unicode on these canonical mappings of lang and rang. However, Unicode has a problem in going against the vast majority of authors and content creators who, when looking for a left angle bracket fence or right angle bracket fence, will look in the mathematical symbols category (and not the CJK punctuation category) and find the character and the glyph appearance they're looking for. Changing this on the author is not appropriate except for fallback. I think the Unicode Standard got these backwards when they added these characters in 1.1. However, that only becomes an issue if we interpret (like O'Reilly) that canonical-equivalent singletons with decomposition mappings are deprecated. Obviously if they're canonical equivalents then U+2329 being equivalent to U+3008 means that U+3008 is equivalent to U+2329 too. They can't both be deprecated singletons. The compatibility characters are actually "discouraged" by the Unicode Standard though not "strongly discouraged" (i.e., "deprecated"). The canonical equivalents are not discouraged at all by the Unicode Standard. However, despite compatibility characters being discouraged, early NFKC normalization is not promoted by the W3C because NFKC normalization is semantically lossy.
However, both NFKC and NFC normalizations are presentation (glyph) lossy. This is a lot of writing on this esoteric issue. For the most part, I don't think early normalization is a problem except for the glyph presentation issue. Font makers are often trying to make more glyphs available to content creators than the Unicode Standard has provided character mappings for (user selected and somewhat semantic glyph variants). The attempt to deprecate canonical equivalents only contributes to that problem, since font makers could render the glyph difference through author selection of canonical equivalent characters.
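
The fallback idea proposed in this comment (keep the author's code point, and only borrow a canonical equivalent's glyph when the font lacks the original) could be sketched roughly as below. Everything here is hypothetical: `glyph_code_point` and the `font_coverage` set stand in for a real font lookup, and the sketch only falls back in one direction, toward the NFC form. It is not how WebKit selects glyphs.

```python
import unicodedata


def glyph_code_point(ch: str, font_coverage: set):
    """Pick the code point whose glyph should be drawn for ch, or None if no glyph exists.

    font_coverage is an assumed set of characters the font can draw.
    """
    candidates = [ch, unicodedata.normalize("NFC", ch)]
    for candidate in candidates:
        # Only single-code-point equivalents can stand in for a single glyph here.
        if len(candidate) == 1 and candidate in font_coverage:
            return candidate
    return None


cjk_only_font = {"\u3008", "\u3009"}                     # hypothetical coverage
math_font = {"\u2329", "\u232a", "\u3008", "\u3009"}     # hypothetical coverage

for font in (cjk_only_font, math_font):
    chosen = glyph_code_point("\u2329", font)
    print(f"U+{ord(chosen):04X}" if chosen else "no glyph")
```

With this arrangement the stored text is never rewritten; only the glyph lookup consults the canonical equivalent, and a font that does cover the original character is always preferred.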

Attachments
This shows the character entity referenced that's not supported. (2.24 KB, text/html) 2007-03-21 16:22 PDT, Robert Burns