7582 – c_utility.cpp contains CFString OS X platform-dependent code; should use ICU

RESOLVED FIXED 7582

c_utility.cpp contains CFString OS X platform-dependent code; should use ICU

https://bugs.webkit.org/show_bug.cgi?id=7582

Summary c_utility.cpp contains CFString OS X platform-dependent code; should use ICU

David Carson

Reported 2006-03-03 15:52:38 PST

c_utility.cpp is using CFString to convert from UTF8/Latin1 to UTF16.

Attachments
proposed patch (105.65 KB, patch) 2006-03-03 15:55 PST, David Carson	darin: review-	Details Formatted Diff Diff
new patch with changes and ChangeLog (105.69 KB, patch) 2006-03-05 19:47 PST, David Carson	darin: review+	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

David Carson

Comment 1 2006-03-03 15:55:55 PST

Created attachment 6832 [details] proposed patch Proposed patch makes use of libicu rather than platform APIs.

Alexey Proskuryakov

Comment 2 2006-03-04 02:08:28 PST

According to the ICU converter explorer <http://www-950.ibm.com/software/globalization/icu/demo/converters?conv=latin1&s=ALL>, "latin1" actually has holes - will this work correctly for such? Perhaps we should use sample code from <http://www.unicode.org/Public/PROGRAMS/CVTUTF/> to convert between UTF encodings, rather than adjust the ICU results. IMO, this would improve both performance and maintainability of the code. Also, what if the string passed from the plugin already had a replacement character in it?

Darin Adler

Comment 3 2006-03-04 12:08:53 PST

Comment on attachment 6832 [details] proposed patch Excellent thing to do -- no reason for this to use CoreFoundation just to convert to UTF-8. Ideally we would svn copy uenum.h, ucnv.h, and ucnv_err.h from WebCore instead of checking new copies in. But that's pretty unimportant. + *UTF16Length = UTF8Length+1; Our normal coding style would have spaces around the + operator here. + *UTF16Chars = NULL; Our coding style is to use 0 for this sort of thing instead of NULL. + UConverter* conv = ucnv_open( "utf8", &status); + conv = ucnv_open( "latin1", &status); There's a space before "utf8" and "latin1" there. + ucnv_toUChars(conv, *UTF16Chars, *UTF16Length, UTF8Chars, + UTF8Length, &status); Seems short enough to fit on one line. + if (U_SUCCESS(status)) { + for (unsigned int i=0; i<*UTF16Length && U_SUCCESS(status); i++) { + if ((*UTF16Chars)[i] == 0xFFFD) { + status = U_INVALID_CHAR_FOUND; + } + } + } Our coding style doesn't use spaces around single-character things, so these braces should be removed. Also, we use "unsigned" instead of "unsigned int" and put spaces around things like the = in "i = 0" and the < in "i < *UTF16Length". Also seems unnecessary to have the if (U_SUCCESS) around the thing -- the loop already takes care of that. For conversion from UTF-8 to UTF-16 and from Latin-1 to UTF-16 there are much more efficient ways than calling ucnv -- on the other hand ucnv is at least as good as calling CF, so that seems fine. UString::UTF8String has UTF-16 -> UTF-8 code and decode() in function.cpp has a function that converts from UTF-8 to UTF-16, but I think it's better to not make more of these. The conversion from Latin-1 to UTF-16 is literally just turning unsigned chars into unsigned shorts, so it's a shame to call the library for that case. But I'm probably "optimizing for the wrong thing"; seems unnecessary to write custom code for something that's unlikely to be performance-critical. Overall, the change looks great. How did you test it? Is there a way we can test this in one of the layout tests, perhaps with our test plug-in? Marking review-, not for the minor style issues above, but because we need a change log and if possible a test to land along with this.

Darin Adler

Comment 4 2006-03-04 12:16:10 PST

Comment on attachment 6832 [details] proposed patch I just read Alexey's comments, he makes some good points: 1) Are we sure that Latin-1 has no holes in ICU? We need to make sure of this. These holes have caused actual incompatibilities with plug-ins recently, so it's not just a theoretical issue. 2) It's not a good strategy to post-process to look for the replacement character as a way to detect failure. I'm not sure that I agree that writing custom code would increase maintainability. But I agree that if this code ends up being hot, it's going to be pretty slow to do a ucnv_open every time. We've seen that be a problem before. Given that, I think we should consider custom code, especially for the Latin-1 to UTF-16 case. As far as the UTF-8 to UTF-16 case, I'm not quite as sure that we need custom code. But it would be nice to not have to forbid the replacement character. If we stick with ICU, we can use ucnv_setFromUCallBack to control ICU's behavior. But since the UTF-8 to UTF-16 conversion is pretty simple, maybe we should do it ourselves as Alexey suggests instead of calling the general character conversion library.

David Carson

Comment 5 2006-03-05 19:47:08 PST

Created attachment 6886 [details] new patch with changes and ChangeLog

Darin Adler

Comment 6 2006-03-05 21:19:23 PST

Comment on attachment 6886 [details] new patch with changes and ChangeLog Looks good, r=me. We could simplify this by doing the malloc outside either if statement, but it's good as is. At some point we should test that ucnv_open is fast enough when used over and over again like this.

Alexey Proskuryakov

Comment 7 2006-03-12 11:01:46 PST

Comment on attachment 6886 [details] new patch with changes and ChangeLog This doesn't set the result length correctly; bug 7708 has a fix.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 420+

Hardware Mac

OS OS X 10.4

Product WebKit

Component JavaScriptCore

Assignee

Nobody

Reported

2006-03-03 15:52 PST

Modified

2006-03-12 11:01 PST History

CC List

1 user Show

URL

Keywords

Depends on

Blocks