RESOLVED FIXED 8769
TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC
https://bugs.webkit.org/show_bug.cgi?id=8769
Summary TextEncoding::fromUnicode() - support non-BMP characters and convert to NFC
Alexey Proskuryakov
Reported 2006-05-07 04:05:26 PDT
Since the CFString branch seems to stay for a while, here are some improvements to its fromUnicode(). - properly combine surrogate pairs coming from CFString; - normalize the string to NFC (canonically composed form). I don't think any browser actually does the latter, but this is obviously more important on Mac OS X than on Windows. If Mail.app uses this code, this may correct problems with sending files having accented characters in their names to Windows recipients. form-data-encoding.html has been modified to cover more Latin-1 cases, and another case has been added for UTF-8.
Attachments
proposed patch (6.47 KB, patch)
2006-05-07 04:06 PDT, Alexey Proskuryakov
no flags
update a failing test (2.08 KB, patch)
2006-05-07 14:29 PDT, Alexey Proskuryakov
no flags
patch for ICU branch (2.06 KB, patch)
2006-05-08 01:51 PDT, Alexey Proskuryakov
darin: review+
Alexey Proskuryakov
Comment 1 2006-05-07 04:06:23 PDT
Created attachment 8140 [details] proposed patch
Darin Adler
Comment 2 2006-05-07 10:39:47 PDT
Comment on attachment 8140 [details] proposed patch Looks fine. Using more ICU in cases like this leads to code that I think is more readable. For example: - Instead of unsigned int, ICU has the UChar32 type. - Instead of (badChar & 0xfc00) == 0xd800, ICU has the U16_IS_LEAD macro. - Instead of ((low & 0xfc00) == 0xdc00), ICU has the U16_IS_TRAIL macro. - instead of: + badChar <<= 10; + badChar += low; + badChar += 0x10000 - (0xd800 << 10) - 0xdc00; ICU has the U16_GET_SUPPLEMENTARY macro. There are other advantages to ICU too. One of the worst things about this function is the number of times it copies the string. It's particularly unfortunate that we create a brand new CGString by calling DeprecatedString::getCFString and then immdiately turn around and call CFStringCreateMutableCopy. If this function was hot at all, we could take advantage of the unorm_quickCheck call in ICU which will quickly determine if there's a possibility normalization is needed; in the common case we won't have to make a copy. But I suppose we can do some of those sorts of things when we make a non-Mac-specific version of the function. I suppose that ultimately that's what I would have liked to see. A non-Macintosh-specific ICU-based version. r=me on this anyway
Alexey Proskuryakov
Comment 3 2006-05-07 11:01:02 PDT
> But I suppose we can do some of those sorts of things when we make a > non-Mac-specific version of the function. Yes, I didn't want to create an unholy mess of technologies in the Mac-specific code path. The ICU-based fromUnicode() will obviously need to use these ICU macros. Landed, r14217.
Alexey Proskuryakov
Comment 4 2006-05-07 13:57:54 PDT
This has caused fast/parser/entities-in-xhtml.xhtml to fail - turns out 'rang' and 'lang' entities map to characters that are canonically equivalent to U+3009 and U+3008, respectively. Sorry - this test always fails for me because of font differences :-(.
Alexey Proskuryakov
Comment 5 2006-05-07 14:02:55 PDT
Comment on attachment 8140 [details] proposed patch Clearing the review flag, so that this doesn't show up in the commit queue. I think the change in test results is OK, but I don't understand why only the rightmost column has changed - will investigate.
Alexey Proskuryakov
Comment 6 2006-05-07 14:29:49 PDT
Created attachment 8154 [details] update a failing test I'm suggesting to update the test case to the new results. Since we're supposed to normalize input to NFC anyway, the change is a slight improvement (also, standard Mac fonts don't have glyphs for U+2328 or U+2329, but do have glyphs for the equivalents). OTOH, processing of entities becomes pretty inconsistent now, and that's ugly. An alternative fix would be to modify getXHTMLEntity() not to do normalization.
Darin Adler
Comment 7 2006-05-07 18:41:47 PDT
Comment on attachment 8154 [details] update a failing test OK. Seems like there are some loose ends here, but fine to land this.
Darin Adler
Comment 8 2006-05-07 22:41:13 PDT
(In reply to comment #7) > OK. Seems like there are some loose ends here, but fine to land this. I'd really like to see this patch apply the same kind of fix to TextEncoding.cpp, since that's a cross-platform implementation that will replace the Mac-specific one once we deal with the few encodings that are not supported in ICU.
Alexey Proskuryakov
Comment 9 2006-05-08 00:09:36 PDT
Comment on attachment 8154 [details] update a failing test >I'd really like to see this patch apply the same kind of fix to TextEncoding.cpp OK. Since this is basically the same bug, and the fix won't have a separate test case, I'll attach it here. Landed the test case fix, cleared the review flag.
Alexey Proskuryakov
Comment 10 2006-05-08 01:51:06 PDT
Created attachment 8164 [details] patch for ICU branch The ICU branch already had non-BMP entities properly supported, but the normalization behavior was even worse - it didn't try to convert to legacy encodings such as Latin-1 nearly as hard as CFString tries. E.g., decomposed "c with cedilla" was encoded as c followed with an entity for cedilla, even though Latin-1 has a code for this (composed) character.
Darin Adler
Comment 11 2006-05-08 11:59:51 PDT
Comment on attachment 8164 [details] patch for ICU branch r=me
Darin Adler
Comment 12 2006-06-19 13:45:38 PDT
*** Bug 9483 has been marked as a duplicate of this bug. ***
Darin Adler
Comment 13 2006-06-19 13:46:05 PDT
*** Bug 9482 has been marked as a duplicate of this bug. ***
Note You need to log in before you can comment on or make changes to this bug.