8769 – TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC

RESOLVED FIXED 8769

TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC

https://bugs.webkit.org/show_bug.cgi?id=8769

Summary TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC

Alexey Proskuryakov

Reported 2006-05-07 04:05:26 PDT

Since the CFString branch seems to stay for a while, here are some improvements to its fromUnicode(). - properly combine surrogate pairs coming from CFString; - normalize the string to NFC (canonically composed form). I don't think any browser actually does the latter, but this is obviously more important on Mac OS X than on Windows. If Mail.app uses this code, this may correct problems with sending files having accented characters in their names to Windows recipients. form-data-encoding.html has been modified to cover more Latin-1 cases, and another case has been added for UTF-8.

Attachments
proposed patch (6.47 KB, patch) 2006-05-07 04:06 PDT, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
update a failing test (2.08 KB, patch) 2006-05-07 14:29 PDT, Alexey Proskuryakov	no flags	Details Formatted Diff Diff
patch for ICU branch (2.06 KB, patch) 2006-05-08 01:51 PDT, Alexey Proskuryakov	darin: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2006-05-07 04:06:23 PDT

Created attachment 8140 [details] proposed patch

Darin Adler

Comment 2 2006-05-07 10:39:47 PDT

Comment on attachment 8140 [details] proposed patch Looks fine. Using more ICU in cases like this leads to code that I think is more readable. For example: - Instead of unsigned int, ICU has the UChar32 type. - Instead of (badChar & 0xfc00) == 0xd800, ICU has the U16_IS_LEAD macro. - Instead of ((low & 0xfc00) == 0xdc00), ICU has the U16_IS_TRAIL macro. - instead of: + badChar <<= 10; + badChar += low; + badChar += 0x10000 - (0xd800 << 10) - 0xdc00; ICU has the U16_GET_SUPPLEMENTARY macro. There are other advantages to ICU too. One of the worst things about this function is the number of times it copies the string. It's particularly unfortunate that we create a brand new CGString by calling DeprecatedString::getCFString and then immdiately turn around and call CFStringCreateMutableCopy. If this function was hot at all, we could take advantage of the unorm_quickCheck call in ICU which will quickly determine if there's a possibility normalization is needed; in the common case we won't have to make a copy. But I suppose we can do some of those sorts of things when we make a non-Mac-specific version of the function. I suppose that ultimately that's what I would have liked to see. A non-Macintosh-specific ICU-based version. r=me on this anyway

Alexey Proskuryakov

Comment 3 2006-05-07 11:01:02 PDT

> But I suppose we can do some of those sorts of things when we make a > non-Mac-specific version of the function. Yes, I didn't want to create an unholy mess of technologies in the Mac-specific code path. The ICU-based fromUnicode() will obviously need to use these ICU macros. Landed, r14217.

Alexey Proskuryakov

Comment 4 2006-05-07 13:57:54 PDT

This has caused fast/parser/entities-in-xhtml.xhtml to fail - turns out 'rang' and 'lang' entities map to characters that are canonically equivalent to U+3009 and U+3008, respectively. Sorry - this test always fails for me because of font differences :-(.

Alexey Proskuryakov

Comment 5 2006-05-07 14:02:55 PDT

Comment on attachment 8140 [details] proposed patch Clearing the review flag, so that this doesn't show up in the commit queue. I think the change in test results is OK, but I don't understand why only the rightmost column has changed - will investigate.

Alexey Proskuryakov

Comment 6 2006-05-07 14:29:49 PDT

Created attachment 8154 [details] update a failing test I'm suggesting to update the test case to the new results. Since we're supposed to normalize input to NFC anyway, the change is a slight improvement (also, standard Mac fonts don't have glyphs for U+2328 or U+2329, but do have glyphs for the equivalents). OTOH, processing of entities becomes pretty inconsistent now, and that's ugly. An alternative fix would be to modify getXHTMLEntity() not to do normalization.

Darin Adler

Comment 7 2006-05-07 18:41:47 PDT

Comment on attachment 8154 [details] update a failing test OK. Seems like there are some loose ends here, but fine to land this.

Darin Adler

Comment 8 2006-05-07 22:41:13 PDT

(In reply to comment #7) > OK. Seems like there are some loose ends here, but fine to land this. I'd really like to see this patch apply the same kind of fix to TextEncoding.cpp, since that's a cross-platform implementation that will replace the Mac-specific one once we deal with the few encodings that are not supported in ICU.

Alexey Proskuryakov

Comment 9 2006-05-08 00:09:36 PDT

Comment on attachment 8154 [details] update a failing test >I'd really like to see this patch apply the same kind of fix to TextEncoding.cpp OK. Since this is basically the same bug, and the fix won't have a separate test case, I'll attach it here. Landed the test case fix, cleared the review flag.

Alexey Proskuryakov

Comment 10 2006-05-08 01:51:06 PDT

Created attachment 8164 [details] patch for ICU branch The ICU branch already had non-BMP entities properly supported, but the normalization behavior was even worse - it didn't try to convert to legacy encodings such as Latin-1 nearly as hard as CFString tries. E.g., decomposed "c with cedilla" was encoded as c followed with an entity for cedilla, even though Latin-1 has a code for this (composed) character.

Darin Adler

Comment 11 2006-05-08 11:59:51 PDT

Comment on attachment 8164 [details] patch for ICU branch r=me

Darin Adler

Comment 12 2006-06-19 13:45:38 PDT

*** Bug 9483 has been marked as a duplicate of this bug. ***

Darin Adler

Comment 13 2006-06-19 13:46:05 PDT

*** Bug 9482 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Enhancement

Classification Unclassified

Version 420+

Hardware Mac

OS OS X 10.4

Product WebKit

Component WebCore Misc.

Assignee

Alexey Proskuryakov

Reported

2006-05-07 04:05 PDT

Modified

2006-06-19 13:46 PDT History

CC List

1 user Show

URL

http://www.w3.org/TR/charmod-norm

Keywords

Duplicates (2)

9482 9483 View as bug list

Depends on

Blocks