Bug 7582 - c_utility.cpp contains CFString OS X platform-dependent code; should use ICU
Summary: c_utility.cpp contains CFString OS X platform-dependent code; should use ICU
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-03-03 15:52 PST by David Carson
Modified: 2006-03-12 11:01 PST (History)
1 user (show)

See Also:


Attachments
proposed patch (105.65 KB, patch)
2006-03-03 15:55 PST, David Carson
darin: review-
Details | Formatted Diff | Diff
new patch with changes and ChangeLog (105.69 KB, patch)
2006-03-05 19:47 PST, David Carson
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description David Carson 2006-03-03 15:52:38 PST
c_utility.cpp is using CFString to convert from UTF8/Latin1 to UTF16.
Comment 1 David Carson 2006-03-03 15:55:55 PST
Created attachment 6832 [details]
proposed patch

Proposed patch makes use of libicu rather than platform APIs.
Comment 2 Alexey Proskuryakov 2006-03-04 02:08:28 PST
According to the ICU converter explorer <http://www-950.ibm.com/software/globalization/icu/demo/converters?conv=latin1&s=ALL>, "latin1" actually has holes - will this work correctly for such?

Perhaps we should use sample code from <http://www.unicode.org/Public/PROGRAMS/CVTUTF/> to convert between UTF encodings, rather than adjust the ICU results. IMO, this would improve both performance and maintainability of the code. Also, what if the string passed from the plugin already had a replacement character in it?
Comment 3 Darin Adler 2006-03-04 12:08:53 PST
Comment on attachment 6832 [details]
proposed patch

Excellent thing to do -- no reason for this to use CoreFoundation just to convert to UTF-8.

Ideally we would svn copy uenum.h, ucnv.h, and ucnv_err.h from WebCore instead of checking new copies in. But that's pretty unimportant.

+    *UTF16Length = UTF8Length+1; 

Our normal coding style would have spaces around the + operator here.

+    *UTF16Chars = NULL;

Our coding style is to use 0 for this sort of thing instead of NULL.

+    UConverter* conv = ucnv_open( "utf8", &status);
+        conv = ucnv_open( "latin1", &status);

There's a space before "utf8" and "latin1" there.

+        ucnv_toUChars(conv, *UTF16Chars, *UTF16Length, UTF8Chars,
+                                 UTF8Length, &status); 

Seems short enough to fit on one line.

+        if (U_SUCCESS(status)) {
+            for (unsigned int i=0; i<*UTF16Length && U_SUCCESS(status); i++) {
+                if ((*UTF16Chars)[i] == 0xFFFD) {
+                    status = U_INVALID_CHAR_FOUND;
+                }
+            }
+        }

Our coding style doesn't use spaces around single-character things, so these braces should be removed. Also, we use "unsigned" instead of "unsigned int" and put spaces around things like the = in "i = 0" and the <  in "i < *UTF16Length". Also seems unnecessary to have the if (U_SUCCESS) around the thing -- the loop already takes care of that.

For conversion from UTF-8 to UTF-16 and from Latin-1 to UTF-16 there are much more efficient ways than calling ucnv -- on the other hand ucnv is at least as good as calling CF, so that seems fine. UString::UTF8String has UTF-16 -> UTF-8 code and decode() in function.cpp has a function that converts from UTF-8 to UTF-16, but I think it's better to not make more of these. The conversion from Latin-1 to UTF-16 is literally just turning unsigned chars into unsigned shorts, so it's a shame to call the library for that case. But I'm probably "optimizing for the wrong thing"; seems unnecessary to write custom code for something that's unlikely to be performance-critical.

Overall, the change looks great. How did you test it? Is there a way we can test this in one of the layout tests, perhaps with our test plug-in?

Marking review-, not for the minor style issues above, but because we need a change log and if possible a test to land along with this.
Comment 4 Darin Adler 2006-03-04 12:16:10 PST
Comment on attachment 6832 [details]
proposed patch

I just read Alexey's comments, he makes some good points:

    1) Are we sure that Latin-1 has no holes in ICU? We need to make sure of this. These holes have caused actual incompatibilities with plug-ins recently, so it's not just a theoretical issue.
    2) It's not a good strategy to post-process to look for the replacement character as a way to detect failure.

I'm not sure that I agree that writing custom code would increase maintainability. But I agree that if this code ends up being hot, it's going to be pretty slow to do a ucnv_open every time. We've seen that be a problem before.

Given that, I think we should consider custom code, especially for the Latin-1 to UTF-16 case.

As far as the UTF-8 to UTF-16 case, I'm not quite as sure that we need custom code. But it would be nice to not have to forbid the replacement character. If we stick with ICU, we can use ucnv_setFromUCallBack to control ICU's behavior. But since the UTF-8 to UTF-16 conversion is pretty simple, maybe we should do it ourselves as Alexey suggests instead of calling the general character conversion library.
Comment 5 David Carson 2006-03-05 19:47:08 PST
Created attachment 6886 [details]
new patch with changes and ChangeLog
Comment 6 Darin Adler 2006-03-05 21:19:23 PST
Comment on attachment 6886 [details]
new patch with changes and ChangeLog

Looks good, r=me.

We could simplify this by doing the malloc outside either if statement, but it's good as is.

At some point we should test that ucnv_open is fast enough when used over and over again like this.
Comment 7 Alexey Proskuryakov 2006-03-12 11:01:46 PST
Comment on attachment 6886 [details]
new patch with changes and ChangeLog

This doesn't set the result length correctly; bug 7708 has a fix.