Bug 8769 - TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC
: TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC
Status: RESOLVED FIXED
Product: WebKit
Classification: Unclassified
Component: WebCore Misc.
: 420+
: Macintosh Mac OS X 10.4
: P2 Enhancement
Assigned To: Alexey Proskuryakov
http://www.w3.org/TR/charmod-norm
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2006-05-07 04:05 PDT by Alexey Proskuryakov
Modified: 2006-06-19 13:46 PDT (History)
1 user (show)

See Also:


Attachments
proposed patch (6.47 KB, patch)
2006-05-07 04:06 PDT, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
update a failing test (2.08 KB, patch)
2006-05-07 14:29 PDT, Alexey Proskuryakov
no flags Details | Formatted Diff | Diff
patch for ICU branch (2.06 KB, patch)
2006-05-08 01:51 PDT, Alexey Proskuryakov
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2006-05-07 04:05:26 PDT
Since the CFString branch seems to stay for a while, here are some improvements to its fromUnicode().

- properly combine surrogate pairs coming from CFString;
- normalize the string to NFC (canonically composed form).

I don't think any browser actually does the latter, but this is obviously more important on Mac OS X than on Windows. If Mail.app uses this code, this may correct problems with sending files having accented characters in their names to Windows recipients.

form-data-encoding.html has been modified to cover more Latin-1 cases, and another case has been added for UTF-8.
Comment 1 Alexey Proskuryakov 2006-05-07 04:06:23 PDT
Created attachment 8140 [details]
proposed patch
Comment 2 Darin Adler 2006-05-07 10:39:47 PDT
Comment on attachment 8140 [details]
proposed patch

Looks fine.

Using more ICU in cases like this leads to code that I think is more readable. For example:

    - Instead of unsigned int, ICU has the UChar32 type.
    - Instead of (badChar & 0xfc00) == 0xd800, ICU has the U16_IS_LEAD macro.
    - Instead of ((low & 0xfc00) == 0xdc00), ICU has the U16_IS_TRAIL macro.
    - instead of:

+                    badChar <<= 10;
+                    badChar += low;
+                    badChar += 0x10000 - (0xd800 << 10) - 0xdc00;

    ICU has the U16_GET_SUPPLEMENTARY macro.

There are other advantages to ICU too. One of the worst things about this function is the number of times it copies the string. It's particularly unfortunate that we create a brand new CGString by calling DeprecatedString::getCFString and then immdiately turn around and call CFStringCreateMutableCopy.

If this function was hot at all, we could take advantage of the unorm_quickCheck call in ICU which will quickly determine if there's a possibility normalization is needed; in the common case we won't have to make a copy.

But I suppose we can do some of those sorts of things when we make a non-Mac-specific version of the function.

I suppose that ultimately that's what I would have liked to see. A non-Macintosh-specific ICU-based version.

r=me on this anyway
Comment 3 Alexey Proskuryakov 2006-05-07 11:01:02 PDT
> But I suppose we can do some of those sorts of things when we make a
> non-Mac-specific version of the function.

Yes, I didn't want to create an unholy mess of technologies in the Mac-specific code path. The ICU-based fromUnicode() will obviously need to use these ICU macros.

Landed, r14217.
Comment 4 Alexey Proskuryakov 2006-05-07 13:57:54 PDT
This has caused fast/parser/entities-in-xhtml.xhtml to fail - turns out 'rang' and 'lang' entities map to characters that are canonically equivalent to U+3009 and U+3008, respectively. Sorry - this test always fails for me because of font differences :-(.
Comment 5 Alexey Proskuryakov 2006-05-07 14:02:55 PDT
Comment on attachment 8140 [details]
proposed patch

Clearing the review flag, so that this doesn't show up in the commit queue.

I think the change in test results is OK, but I don't understand why only the rightmost column has changed - will investigate.
Comment 6 Alexey Proskuryakov 2006-05-07 14:29:49 PDT
Created attachment 8154 [details]
update a failing test

I'm suggesting to update the test case to the new results. Since we're supposed to normalize input to NFC anyway, the change is a slight improvement (also, standard Mac fonts don't have glyphs for U+2328 or U+2329, but do have glyphs for the equivalents).

OTOH, processing of entities becomes pretty inconsistent now, and that's ugly. An alternative fix would be to modify getXHTMLEntity() not to do normalization.
Comment 7 Darin Adler 2006-05-07 18:41:47 PDT
Comment on attachment 8154 [details]
update a failing test

OK. Seems like there are some loose ends here, but fine to land this.
Comment 8 Darin Adler 2006-05-07 22:41:13 PDT
(In reply to comment #7)
> OK. Seems like there are some loose ends here, but fine to land this.

I'd really like to see this patch apply the same kind of fix to TextEncoding.cpp, since that's a cross-platform implementation that will replace the Mac-specific one once we deal with the few encodings that are not supported in ICU.
Comment 9 Alexey Proskuryakov 2006-05-08 00:09:36 PDT
Comment on attachment 8154 [details]
update a failing test

>I'd really like to see this patch apply the same kind of fix to
TextEncoding.cpp

OK. Since this is basically the same bug, and the fix won't have a separate test case, I'll attach it here. Landed the test case fix, cleared the review flag.
Comment 10 Alexey Proskuryakov 2006-05-08 01:51:06 PDT
Created attachment 8164 [details]
patch for ICU branch

The ICU branch already had non-BMP entities properly supported, but the normalization behavior was even worse - it didn't try to convert to legacy encodings such as Latin-1 nearly as hard as CFString tries. E.g., decomposed "c with cedilla" was encoded as c followed with an entity for cedilla, even though Latin-1 has a code for this (composed) character.
Comment 11 Darin Adler 2006-05-08 11:59:51 PDT
Comment on attachment 8164 [details]
patch for ICU branch

r=me
Comment 12 Darin Adler 2006-06-19 13:45:38 PDT
*** Bug 9483 has been marked as a duplicate of this bug. ***
Comment 13 Darin Adler 2006-06-19 13:46:05 PDT
*** Bug 9482 has been marked as a duplicate of this bug. ***