Bug 8769 - TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC
: TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC
Status: RESOLVED FIXED
: WebKit
WebCore Misc.
: 420+
: Macintosh Mac OS X 10.4
: P2 Enhancement
Assigned To:
: http://www.w3.org/TR/charmod-norm
:
:
:
  Show dependency treegraph
 
Reported: 2006-05-07 04:05 PST by
Modified: 2006-06-19 13:46 PST (History)


Attachments
proposed patch (6.47 KB, patch)
2006-05-07 04:06 PST, Alexey Proskuryakov
no flags Review Patch | Details | Formatted Diff | Diff
update a failing test (2.08 KB, patch)
2006-05-07 14:29 PST, Alexey Proskuryakov
no flags Review Patch | Details | Formatted Diff | Diff
patch for ICU branch (2.06 KB, patch)
2006-05-08 01:51 PST, Alexey Proskuryakov
darin: review+
Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2006-05-07 04:05:26 PST
Since the CFString branch seems to stay for a while, here are some improvements to its fromUnicode().

- properly combine surrogate pairs coming from CFString;
- normalize the string to NFC (canonically composed form).

I don't think any browser actually does the latter, but this is obviously more important on Mac OS X than on Windows. If Mail.app uses this code, this may correct problems with sending files having accented characters in their names to Windows recipients.

form-data-encoding.html has been modified to cover more Latin-1 cases, and another case has been added for UTF-8.
------- Comment #1 From 2006-05-07 04:06:23 PST -------
Created an attachment (id=8140) [details]
proposed patch
------- Comment #2 From 2006-05-07 10:39:47 PST -------
(From update of attachment 8140 [details])
Looks fine.

Using more ICU in cases like this leads to code that I think is more readable. For example:

    - Instead of unsigned int, ICU has the UChar32 type.
    - Instead of (badChar & 0xfc00) == 0xd800, ICU has the U16_IS_LEAD macro.
    - Instead of ((low & 0xfc00) == 0xdc00), ICU has the U16_IS_TRAIL macro.
    - instead of:

+                    badChar <<= 10;
+                    badChar += low;
+                    badChar += 0x10000 - (0xd800 << 10) - 0xdc00;

    ICU has the U16_GET_SUPPLEMENTARY macro.

There are other advantages to ICU too. One of the worst things about this function is the number of times it copies the string. It's particularly unfortunate that we create a brand new CGString by calling DeprecatedString::getCFString and then immdiately turn around and call CFStringCreateMutableCopy.

If this function was hot at all, we could take advantage of the unorm_quickCheck call in ICU which will quickly determine if there's a possibility normalization is needed; in the common case we won't have to make a copy.

But I suppose we can do some of those sorts of things when we make a non-Mac-specific version of the function.

I suppose that ultimately that's what I would have liked to see. A non-Macintosh-specific ICU-based version.

r=me on this anyway
------- Comment #3 From 2006-05-07 11:01:02 PST -------
> But I suppose we can do some of those sorts of things when we make a
> non-Mac-specific version of the function.

Yes, I didn't want to create an unholy mess of technologies in the Mac-specific code path. The ICU-based fromUnicode() will obviously need to use these ICU macros.

Landed, r14217.
------- Comment #4 From 2006-05-07 13:57:54 PST -------
This has caused fast/parser/entities-in-xhtml.xhtml to fail - turns out 'rang' and 'lang' entities map to characters that are canonically equivalent to U+3009 and U+3008, respectively. Sorry - this test always fails for me because of font differences :-(.
------- Comment #5 From 2006-05-07 14:02:55 PST -------
(From update of attachment 8140 [details])
Clearing the review flag, so that this doesn't show up in the commit queue.

I think the change in test results is OK, but I don't understand why only the rightmost column has changed - will investigate.
------- Comment #6 From 2006-05-07 14:29:49 PST -------
Created an attachment (id=8154) [details]
update a failing test

I'm suggesting to update the test case to the new results. Since we're supposed to normalize input to NFC anyway, the change is a slight improvement (also, standard Mac fonts don't have glyphs for U+2328 or U+2329, but do have glyphs for the equivalents).

OTOH, processing of entities becomes pretty inconsistent now, and that's ugly. An alternative fix would be to modify getXHTMLEntity() not to do normalization.
------- Comment #7 From 2006-05-07 18:41:47 PST -------
(From update of attachment 8154 [details])
OK. Seems like there are some loose ends here, but fine to land this.
------- Comment #8 From 2006-05-07 22:41:13 PST -------
(In reply to comment #7)
> OK. Seems like there are some loose ends here, but fine to land this.

I'd really like to see this patch apply the same kind of fix to TextEncoding.cpp, since that's a cross-platform implementation that will replace the Mac-specific one once we deal with the few encodings that are not supported in ICU.
------- Comment #9 From 2006-05-08 00:09:36 PST -------
(From update of attachment 8154 [details])
>I'd really like to see this patch apply the same kind of fix to
TextEncoding.cpp

OK. Since this is basically the same bug, and the fix won't have a separate test case, I'll attach it here. Landed the test case fix, cleared the review flag.
------- Comment #10 From 2006-05-08 01:51:06 PST -------
Created an attachment (id=8164) [details]
patch for ICU branch

The ICU branch already had non-BMP entities properly supported, but the normalization behavior was even worse - it didn't try to convert to legacy encodings such as Latin-1 nearly as hard as CFString tries. E.g., decomposed "c with cedilla" was encoded as c followed with an entity for cedilla, even though Latin-1 has a code for this (composed) character.
------- Comment #11 From 2006-05-08 11:59:51 PST -------
(From update of attachment 8164 [details])
r=me
------- Comment #12 From 2006-06-19 13:45:38 PST -------
*** Bug 9483 has been marked as a duplicate of this bug. ***
------- Comment #13 From 2006-06-19 13:46:05 PST -------
*** Bug 9482 has been marked as a duplicate of this bug. ***