WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
7582
c_utility.cpp contains CFString OS X platform-dependent code; should use ICU
https://bugs.webkit.org/show_bug.cgi?id=7582
Summary
c_utility.cpp contains CFString OS X platform-dependent code; should use ICU
David Carson
Reported
2006-03-03 15:52:38 PST
c_utility.cpp is using CFString to convert from UTF8/Latin1 to UTF16.
Attachments
proposed patch
(105.65 KB, patch)
2006-03-03 15:55 PST
,
David Carson
darin
: review-
Details
Formatted Diff
Diff
new patch with changes and ChangeLog
(105.69 KB, patch)
2006-03-05 19:47 PST
,
David Carson
darin
: review+
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
David Carson
Comment 1
2006-03-03 15:55:55 PST
Created
attachment 6832
[details]
proposed patch Proposed patch makes use of libicu rather than platform APIs.
Alexey Proskuryakov
Comment 2
2006-03-04 02:08:28 PST
According to the ICU converter explorer <
http://www-950.ibm.com/software/globalization/icu/demo/converters?conv=latin1&s=ALL
>, "latin1" actually has holes - will this work correctly for such? Perhaps we should use sample code from <
http://www.unicode.org/Public/PROGRAMS/CVTUTF/
> to convert between UTF encodings, rather than adjust the ICU results. IMO, this would improve both performance and maintainability of the code. Also, what if the string passed from the plugin already had a replacement character in it?
Darin Adler
Comment 3
2006-03-04 12:08:53 PST
Comment on
attachment 6832
[details]
proposed patch Excellent thing to do -- no reason for this to use CoreFoundation just to convert to UTF-8. Ideally we would svn copy uenum.h, ucnv.h, and ucnv_err.h from WebCore instead of checking new copies in. But that's pretty unimportant. + *UTF16Length = UTF8Length+1; Our normal coding style would have spaces around the + operator here. + *UTF16Chars = NULL; Our coding style is to use 0 for this sort of thing instead of NULL. + UConverter* conv = ucnv_open( "utf8", &status); + conv = ucnv_open( "latin1", &status); There's a space before "utf8" and "latin1" there. + ucnv_toUChars(conv, *UTF16Chars, *UTF16Length, UTF8Chars, + UTF8Length, &status); Seems short enough to fit on one line. + if (U_SUCCESS(status)) { + for (unsigned int i=0; i<*UTF16Length && U_SUCCESS(status); i++) { + if ((*UTF16Chars)[i] == 0xFFFD) { + status = U_INVALID_CHAR_FOUND; + } + } + } Our coding style doesn't use spaces around single-character things, so these braces should be removed. Also, we use "unsigned" instead of "unsigned int" and put spaces around things like the = in "i = 0" and the < in "i < *UTF16Length". Also seems unnecessary to have the if (U_SUCCESS) around the thing -- the loop already takes care of that. For conversion from UTF-8 to UTF-16 and from Latin-1 to UTF-16 there are much more efficient ways than calling ucnv -- on the other hand ucnv is at least as good as calling CF, so that seems fine. UString::UTF8String has UTF-16 -> UTF-8 code and decode() in function.cpp has a function that converts from UTF-8 to UTF-16, but I think it's better to not make more of these. The conversion from Latin-1 to UTF-16 is literally just turning unsigned chars into unsigned shorts, so it's a shame to call the library for that case. But I'm probably "optimizing for the wrong thing"; seems unnecessary to write custom code for something that's unlikely to be performance-critical. Overall, the change looks great. How did you test it? Is there a way we can test this in one of the layout tests, perhaps with our test plug-in? Marking review-, not for the minor style issues above, but because we need a change log and if possible a test to land along with this.
Darin Adler
Comment 4
2006-03-04 12:16:10 PST
Comment on
attachment 6832
[details]
proposed patch I just read Alexey's comments, he makes some good points: 1) Are we sure that Latin-1 has no holes in ICU? We need to make sure of this. These holes have caused actual incompatibilities with plug-ins recently, so it's not just a theoretical issue. 2) It's not a good strategy to post-process to look for the replacement character as a way to detect failure. I'm not sure that I agree that writing custom code would increase maintainability. But I agree that if this code ends up being hot, it's going to be pretty slow to do a ucnv_open every time. We've seen that be a problem before. Given that, I think we should consider custom code, especially for the Latin-1 to UTF-16 case. As far as the UTF-8 to UTF-16 case, I'm not quite as sure that we need custom code. But it would be nice to not have to forbid the replacement character. If we stick with ICU, we can use ucnv_setFromUCallBack to control ICU's behavior. But since the UTF-8 to UTF-16 conversion is pretty simple, maybe we should do it ourselves as Alexey suggests instead of calling the general character conversion library.
David Carson
Comment 5
2006-03-05 19:47:08 PST
Created
attachment 6886
[details]
new patch with changes and ChangeLog
Darin Adler
Comment 6
2006-03-05 21:19:23 PST
Comment on
attachment 6886
[details]
new patch with changes and ChangeLog Looks good, r=me. We could simplify this by doing the malloc outside either if statement, but it's good as is. At some point we should test that ucnv_open is fast enough when used over and over again like this.
Alexey Proskuryakov
Comment 7
2006-03-12 11:01:46 PST
Comment on
attachment 6886
[details]
new patch with changes and ChangeLog This doesn't set the result length correctly;
bug 7708
has a fix.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug