WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
5232
Rewrite QTextCodec::fromUnicode to use ICU
https://bugs.webkit.org/show_bug.cgi?id=5232
Summary
Rewrite QTextCodec::fromUnicode to use ICU
Alexey Proskuryakov
Reported
2005-10-02 08:39:09 PDT
After a fix to
bug 4821
is committed, try to advance its solution: 1) Use ICU instead of CFString. 2) Support surrogate pairs correctly. 3) If converting to UTF-8, make precomposed Unicode. 4) Verify handling of reverse solidus and tilde when converting to Japanese or Korean encodings.
Attachments
test results for yen, won and tilde
(487.01 KB, application/zip)
2005-12-04 10:39 PST
,
Alexey Proskuryakov
no flags
Details
proposed patch
(19.62 KB, patch)
2005-12-11 11:09 PST
,
Alexey Proskuryakov
darin
: review-
Details
Formatted Diff
Diff
revised patch
(19.88 KB, patch)
2005-12-11 22:02 PST
,
Alexey Proskuryakov
darin
: review-
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
Alexey Proskuryakov
Comment 1
2005-12-04 10:39:39 PST
Created
attachment 4938
[details]
test results for yen, won and tilde Well, verifying the handling of the questionable characters appears troublesome - when rendering, MSIE, FF and Safari give very different results, and even different versions of Firefox do (tested with 1.5/Mac and 1.0.4/Linux). I think I'll just try to keep their encoding unchanged, as much as possible.
Alexey Proskuryakov
Comment 2
2005-12-11 11:09:54 PST
Created
attachment 5042
[details]
proposed patch - Rewrote fromUnicode to use ICU - ...which nicely fixed surrogate pair handling - ... and changed the results for Japanese encodings. Please see the comments in layout tests for test matrices. The change in ISO-2022-JP looks especially suspicious. Please see <
http://dev.icu-project.org/cgi-bin/icu-bugs/returned?id=1252
> and <
http://dev.icu-project.org/cgi-bin/icu-bugs/returned?id=1287
>. I cannot determine which behavior is correct. I decided not to do the conversion to precomposed Unicode here. It's unclear if it should be done at this level at all, and the scope of the proposed patch is already fairly large.
Darin Adler
Comment 3
2005-12-11 17:04:24 PST
Comment on
attachment 5042
[details]
proposed patch Looks OK. It's unfortunate that the loop uses QCString::append because that makes the loop n^2 in the size of the string. Each conversion buffer has its size recalculated using strlen() and worse, the base QCString has its size recalculated using strlen() each time QCString::append is called. A more efficient version would use QCString::resize and keep track of the length of the result. The original CFString version had this problem too, so I won't do a review- on this just because of it, but it's worth fixing. I'm concerned about the handling of U+005C in ISO-2022-JP, but it looks like it's no worse after this change than before. What are we going to do about character sets not yet supported by ICU, such as KOI8-U (see
bug 4195
)? Setting review- until I have the answer to at least this last question.
Alexey Proskuryakov
Comment 4
2005-12-11 22:02:35 PST
Created
attachment 5047
[details]
revised patch Uses QByteArray::resize() to avoid multiple strlens. Not sure if it's practical, because the buffer size is 16K, but it makes the code only marginally more complex, so sounds like a good change. As for KOI8-U, one solution is to keep ICU changes out of mainline until Leopard. Alternatively, it's possible to add support for this encoding via ucnv_openPackage(). Depends on whether there are more missing encodings, I guess.
Darin Adler
Comment 5
2005-12-11 22:28:22 PST
Comment on
attachment 5047
[details]
revised patch Looking great! If we want to keep this out of "mainline" then we can't check into tip of tree. We need TOT to be something suitable for shipping before Leopard -- it's not OK to knowingly regress something in a way that's only fixed by Leopard. There's no need to call QByteArray::resize directly -- QCString::resize doesn't do any call to strlen, and in fact it sets the terminating 0 byte correctly, so this: static_cast<QByteArray&>(result).resize(resultLength + chunkLength + 1); memcpy(result.data() + resultLength, buffer, chunkLength); result.data()[resultLength + chunkLength] = '\0'; can just be: result.resize(resultLength + chunkLength + 1); memcpy(result.data() + resultLength, buffer, chunkLength); I think we really need to consider what to do about the character sets that ICU can't handle and CF can. First step I suppose would be to inventory which character sets those are. I think that the call to ucnv_setSubstChars needs its own ASSERT about the error code returned.
Alexey Proskuryakov
Comment 6
2005-12-12 11:44:58 PST
QCString::resize() is private, this is why I had to use QByteArray::resize(). Actually, if ucnv_setSubstChars() returns an error (which is hardly realistic), it will be reported by an existing assert() below, because of ICU's error handling mechanism. Going to investigate what other encodings besides KOI8-U are broken...
Alexey Proskuryakov
Comment 7
2006-02-28 00:46:48 PST
Similar code has been just landed by Maciej for Win32. FWIW, it uses QCString::append(), not resize(). Japanese backslash handling is also different. Since KOI8-U is by far not the only encoding missing from ICU, further architectural fixes may be needed to use both ICU and CFString on Mac OS X.
Alexey Proskuryakov
Comment 8
2006-07-14 23:58:27 PDT
Everything here has been addressed over time (with a possible exception of reverse solidus and tilde handling, which will need to be revisited if any bugs are reported).
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug