5232 – Rewrite QTextCodec::fromUnicode to use ICU

RESOLVED FIXED 5232

Rewrite QTextCodec::fromUnicode to use ICU

https://bugs.webkit.org/show_bug.cgi?id=5232

Summary Rewrite QTextCodec::fromUnicode to use ICU

Alexey Proskuryakov

Reported 2005-10-02 08:39:09 PDT

After a fix to bug 4821 is committed, try to advance its solution: 1) Use ICU instead of CFString. 2) Support surrogate pairs correctly. 3) If converting to UTF-8, make precomposed Unicode. 4) Verify handling of reverse solidus and tilde when converting to Japanese or Korean encodings.

Attachments
test results for yen, won and tilde (487.01 KB, application/zip) 2005-12-04 10:39 PST, Alexey Proskuryakov	no flags	Details
proposed patch (19.62 KB, patch) 2005-12-11 11:09 PST, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
revised patch (19.88 KB, patch) 2005-12-11 22:02 PST, Alexey Proskuryakov	darin: review-	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Alexey Proskuryakov

Comment 1 2005-12-04 10:39:39 PST

Created attachment 4938 [details] test results for yen, won and tilde Well, verifying the handling of the questionable characters appears troublesome - when rendering, MSIE, FF and Safari give very different results, and even different versions of Firefox do (tested with 1.5/Mac and 1.0.4/Linux). I think I'll just try to keep their encoding unchanged, as much as possible.

Alexey Proskuryakov

Comment 2 2005-12-11 11:09:54 PST

Created attachment 5042 [details] proposed patch - Rewrote fromUnicode to use ICU - ...which nicely fixed surrogate pair handling - ... and changed the results for Japanese encodings. Please see the comments in layout tests for test matrices. The change in ISO-2022-JP looks especially suspicious. Please see <http://dev.icu-project.org/cgi-bin/icu-bugs/returned?id=1252> and <http://dev.icu-project.org/cgi-bin/icu-bugs/returned?id=1287>. I cannot determine which behavior is correct. I decided not to do the conversion to precomposed Unicode here. It's unclear if it should be done at this level at all, and the scope of the proposed patch is already fairly large.

Darin Adler

Comment 3 2005-12-11 17:04:24 PST

Comment on attachment 5042 [details] proposed patch Looks OK. It's unfortunate that the loop uses QCString::append because that makes the loop n^2 in the size of the string. Each conversion buffer has its size recalculated using strlen() and worse, the base QCString has its size recalculated using strlen() each time QCString::append is called. A more efficient version would use QCString::resize and keep track of the length of the result. The original CFString version had this problem too, so I won't do a review- on this just because of it, but it's worth fixing. I'm concerned about the handling of U+005C in ISO-2022-JP, but it looks like it's no worse after this change than before. What are we going to do about character sets not yet supported by ICU, such as KOI8-U (see bug 4195)? Setting review- until I have the answer to at least this last question.

Alexey Proskuryakov

Comment 4 2005-12-11 22:02:35 PST

Created attachment 5047 [details] revised patch Uses QByteArray::resize() to avoid multiple strlens. Not sure if it's practical, because the buffer size is 16K, but it makes the code only marginally more complex, so sounds like a good change. As for KOI8-U, one solution is to keep ICU changes out of mainline until Leopard. Alternatively, it's possible to add support for this encoding via ucnv_openPackage(). Depends on whether there are more missing encodings, I guess.

Darin Adler

Comment 5 2005-12-11 22:28:22 PST

Comment on attachment 5047 [details] revised patch Looking great! If we want to keep this out of "mainline" then we can't check into tip of tree. We need TOT to be something suitable for shipping before Leopard -- it's not OK to knowingly regress something in a way that's only fixed by Leopard. There's no need to call QByteArray::resize directly -- QCString::resize doesn't do any call to strlen, and in fact it sets the terminating 0 byte correctly, so this: static_cast<QByteArray&>(result).resize(resultLength + chunkLength + 1); memcpy(result.data() + resultLength, buffer, chunkLength); result.data()[resultLength + chunkLength] = '\0'; can just be: result.resize(resultLength + chunkLength + 1); memcpy(result.data() + resultLength, buffer, chunkLength); I think we really need to consider what to do about the character sets that ICU can't handle and CF can. First step I suppose would be to inventory which character sets those are. I think that the call to ucnv_setSubstChars needs its own ASSERT about the error code returned.

Alexey Proskuryakov

Comment 6 2005-12-12 11:44:58 PST

QCString::resize() is private, this is why I had to use QByteArray::resize(). Actually, if ucnv_setSubstChars() returns an error (which is hardly realistic), it will be reported by an existing assert() below, because of ICU's error handling mechanism. Going to investigate what other encodings besides KOI8-U are broken...

Alexey Proskuryakov

Comment 7 2006-02-28 00:46:48 PST

Similar code has been just landed by Maciej for Win32. FWIW, it uses QCString::append(), not resize(). Japanese backslash handling is also different. Since KOI8-U is by far not the only encoding missing from ICU, further architectural fixes may be needed to use both ICU and CFString on Mac OS X.

Alexey Proskuryakov

Comment 8 2006-07-14 23:58:27 PDT

Everything here has been addressed over time (with a possible exception of reverse solidus and tilde handling, which will need to be revisited if any bugs are reported).

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 420+

Hardware Mac

OS OS X 10.4

Product WebKit

Component DOM

Assignee

Nobody

Reported

2005-10-02 08:39 PDT

Modified

2006-07-14 23:58 PDT History

CC List

2 users Show

URL

Keywords

Depends on

4195 4821

Blocks

Dependencies

tree graph