15914 – [GTK] Implement Unicode functionality using GLib

RESOLVED FIXED15914

[GTK] Implement Unicode functionality using GLib

https://bugs.webkit.org/show_bug.cgi?id=15914

Summary [GTK] Implement Unicode functionality using GLib

Alp Toker

Reported 2007-11-08 22:29:23 PST

GLib's Unicode manipulation API: http://library.gnome.org/devel/glib/stable/glib-Unicode-Manipulation.html The old gtk-webcore project does this to a small extent and is a good starting place for inspiration. This could be a fairly easy way to shave off 10M shared library memory use due to ICU. I haven't investigated how complete the coverage is though.

Attachments
Preliminary patch removing ICU dependency (44.29 KB, patch) 2008-02-01 14:00 PST, Jürg Billeter	no flags	Details Formatted Diff Diff
Updated patch (44.50 KB, patch) 2008-04-25 14:42 PDT, Jürg Billeter	no flags	Details Formatted Diff Diff
Modified the Juergbi's patch for umemcasecmp function (44.57 KB, patch) 2008-06-04 22:59 PDT, Naiem	no flags	Details Formatted Diff Diff
Patch updated for trunk revision 35916 (46.59 KB, patch) 2008-08-26 09:03 PDT, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Overview and Comparison of Text Codecs in Safari Win/Mac and libiconv (3.33 KB, text/plain) 2008-08-27 08:01 PDT, Dominik Röttsches (drott)	no flags	Details
patch updated for probing for iconv encodings, updated to revision 36025 (63.92 KB, patch) 2008-09-02 13:04 PDT, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes (63.93 KB, patch) 2008-09-04 06:24 PDT, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Replacing ICU with glib (67.69 KB, patch) 2008-09-08 10:08 PDT, Dominik Röttsches (drott)	eric: review-	Details Formatted Diff Diff
Replacing ICU with glib (75.30 KB, patch) 2008-11-20 08:11 PST, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Replacing ICU with glib (75.71 KB, patch) 2008-11-21 04:38 PST, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Replacing ICU with glib, IDN support (78.07 KB, patch) 2008-12-01 05:54 PST, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Fast/Encoding Test Results for GTK GLib-Unicode Release Build (25.91 KB, application/zip) 2008-12-01 05:58 PST, Dominik Röttsches (drott)	no flags	Details
Fast/Encoding Test Results for GTK ICU-Unicode Release Build (10.76 KB, application/zip) 2008-12-01 05:59 PST, Dominik Röttsches (drott)	no flags	Details
1/4 - Moving WTF Unicode backend to GLib (38.66 KB, patch) 2009-01-14 08:51 PST, Dominik Röttsches (drott)	darin: review-	Details Formatted Diff Diff
2/4 - Moving Text Codecs to GLib (36.73 KB, patch) 2009-01-14 08:52 PST, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
1/4 - Moving WTF Unicode backend to GLib (v2) (30.56 KB, patch) 2009-01-16 07:46 PST, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Sunspider Results Comparing icu against GLib Unicode Backend (3.38 KB, text/plain) 2009-01-16 07:49 PST, Dominik Röttsches (drott)	no flags	Details
2/4 - Moving Text Codecs to GLib (v2) (33.60 KB, patch) 2009-10-19 11:24 PDT, Dominik Röttsches (drott)	gustavo: review-	Details Formatted Diff Diff
2/4 - Moving Text Codecs to GLib (v3) (32.97 KB, patch) 2009-10-26 07:39 PDT, Dominik Röttsches (drott)	gustavo: review-	Details Formatted Diff Diff
2/4 - Moving Text Codecs to GLib (v4) (32.97 KB, patch) 2009-10-30 06:35 PDT, Dominik Röttsches (drott)	commit-queue: commit-queue-	Details Formatted Diff Diff
2/4 - Moving Text Codecs to GLib (v5) (32.52 KB, patch) 2009-11-04 01:33 PST, Dominik Röttsches (drott)	no flags	Details Formatted Diff Diff
Show Obsolete (15) View All Add attachment proposed patch, testcase, etc.

Xan Lopez

Comment 1 2007-11-21 12:07:18 PST

Can you quickly point where is the code that should be modified?

Alp Toker

Comment 2 2008-01-06 07:31:17 PST

JavaScriptCore/wtf/unicode/qt4/UnicodeQt4.h is a good place to start porting, though we probably don't want to inline some of the larger functions here.

Jürg Billeter

Comment 3 2008-02-01 14:00:13 PST

Created attachment 18852 [details] Preliminary patch removing ICU dependency I've replaced the ICU dependency by the unicode functionality of GLib, Pango, and fribidi. The attached patch is experimental. It's probably quite a bit slower as I had to add many conversions between UTF-8 and UTF-16. TextBoundariesGtk.cpp implementation is still missing, so at least text entry won't work. Pango already contains a copy of fribidi, however, the Pango API doesn't provide access to Bidi Type information. That's why I had to add a direct dependency on fribidi. Maybe this could be eliminated by extending the API in a future version of Pango.

Jürg Billeter

Comment 4 2008-02-09 08:20:27 PST

I've filed a Pango enhancement request to avoid the fribidi dependency: http://bugzilla.gnome.org/show_bug.cgi?id=515432

Alp Toker

Comment 5 2008-03-09 18:09:57 PDT

r30913 now conditionalises ICU use in the GTK+ port.

Jürg Billeter

Comment 6 2008-04-25 14:42:23 PDT

Created attachment 20824 [details] Updated patch I've updated the patch to not depend on fribidi anymore using the new function available in pango 1.21.0. I've also implemented TextBoundariesGtk.cpp, so selection in text input elements basically work again. The patch still needs work, though. It doesn't register all available encodings and the semantics of the text break iterator implementation doesn't seem to completely match the ICU iterator.

Naiem

Comment 7 2008-06-04 22:59:20 PDT

Created attachment 21500 [details] Modified the Juergbi's patch for umemcasecmp function Modified couple of lines in earlier patch in umemcasecmp function for sending the string length.

Dominik Röttsches (drott)

Comment 8 2008-06-09 08:13:54 PDT

In r33380 (http://trac.webkit.org/changeset/33380) the appendOmittingBOM function was removed. So this patch can't be cleanly applied to tip of trunk currently (appendOmittingBOM is used for example in TextCodecGtk::decode()). The solution I came up with so far was to explicitly specify UTF-16LE as the target encoding in g_iconv_open() of TextCodecGtk::createIConvDecoder(). Currently I only used LE, but Glib's byte-order macros could be used to determine the correct target encoding including its byte order at compile time. This stops libiconv from generating BOMs for the result of the conversion. Consequently, there's no need for removing the BOMs when appending them to the Vector<UChar>, allowing us to just use result.append(buffer, count / sizeof(UChar)). What is your opinion on this approach? Do you think that's the right direction to go or do you see any risks?

Mike Hommey

Comment 9 2008-06-09 08:51:14 PDT

> This stops libiconv from generating BOMs for the result of the conversion. Keep in mind that while it won't *add* a BOM, it will *keep* existing BOMs (zero-width no-break space to be precise) if the original stream contains them, wherever they are located (start of the stream or not)

Daniel Laird

Comment 10 2008-06-30 12:45:36 PDT

I have downloaded r34824 and tried the patch. I have updated it so that it can be applied except I am new to this and have no idea about the now missing appendOmittingBOM. For now to get webkit building and hopefully running on GTK-DirectFB I have just commented it out. However I was wonderig what progress had been made with this patch? If I get it running on my platform I will be happy to try out updated patches. For now it is just compiling and then when I run it crashes so ill sort that first but the idea of removing ICU support is great hope you can finish what you started. dan

Manoj

Comment 11 2008-08-25 02:34:22 PDT

(In reply to comment #10) > I have downloaded r34824 and tried the patch. I have updated it so that it can > be applied except I am new to this and have no idea about the now missing > appendOmittingBOM. For now to get webkit building and hopefully running on > GTK-DirectFB I have just commented it out. However I was wonderig what > progress had been made with this patch? > If I get it running on my platform I will be happy to try out updated patches. > For now it is just compiling and then when I run it crashes so ill sort that > first but the idea of removing ICU support is great hope you can finish what > you started. > dan > (In reply to comment #10) > I have downloaded r34824 and tried the patch. I have updated it so that it can > be applied except I am new to this and have no idea about the now missing > appendOmittingBOM. For now to get webkit building and hopefully running on > GTK-DirectFB I have just commented it out. However I was wonderig what > progress had been made with this patch? > If I get it running on my platform I will be happy to try out updated patches. > For now it is just compiling and then when I run it crashes so ill sort that > first but the idea of removing ICU support is great hope you can finish what > you started. > dan > I amtrying to remove ICU lib from webkit dep. I have applied patch to r35806 version. Commented out appendOmittingBOM from TextCodecGtk::decode of TextCodecGtk.cpp. When i try to lauch GtkLauncher, there is not text on my GtkLauncher, though i have set google.com as url. If any one suggests where is problem. I could try to resolve it.

Manoj

Comment 12 2008-08-26 03:29:11 PDT

Replacing ICU lib with Glib is a great idea. Are there anyone success with Unicode support with Glib instead of ICU lib ? I have applied the Glib patch for Unicode, But there is no text display on my browser(gtklauncher).

Dominik Röttsches (drott)

Comment 13 2008-08-26 03:38:43 PDT

(In reply to comment #11) > I amtrying to remove ICU lib from webkit dep. > > I have applied patch to r35806 version. > Commented out appendOmittingBOM from TextCodecGtk::decode of TextCodecGtk.cpp. > When i try to lauch GtkLauncher, there is not text on my GtkLauncher, though i > have set google.com as url. > If any one suggests where is problem. I could try to resolve it. Yes, it does work, I can confirm that - although the patch attachment here is not up to date with trunk. Look at my comment #8. Commenting out appendOmittingBOM is not sufficient for getting it to work. The comment should give you pointers on how to resolve this. I contacted Jürg via Email to get more information for setting up a development environment that supports pango 1.21 (which he used to get rid of a fribidi dependency) - no reply set. I was trying to work on an updated patch that is compatible with current trunk.

Dominik Röttsches (drott)

Comment 14 2008-08-26 06:46:28 PDT

Now, using an Ubuntu 8.04 plus some *-dev packages from Ubuntu 8.10 I was able to set up an environment that has libpango-dev with at a 1.21.x version number.

Dominik Röttsches (drott)

Comment 15 2008-08-26 08:41:34 PDT

(In reply to comment #6) > The patch still needs work, though. It doesn't register all available encodings [...] Regarding listing the encodings, I'm currently considering two alternatives: 1) Probe g_iconv_open with all the supported encodings listed in the libiconv's list of supported encodings (http://www.gnu.org/software/libiconv/). For each probe, if successful, register the respective encoding with webcore. This avoids a direct dependency to libiconv but isn't very elegant. 2) Introduce a dependency to libiconv and use its iconvlist() function directly. This function takes a callback as a parameter and calls it for each of the available encodings. Introducing this dependency doesn't necessarily mean much extra trouble, as http://library.gnome.org/devel/glib/stable/glib-building.html#dependencies lists libiconv as a dependency of glib on many platforms. Any other suggestions? Is it acceptable to introduce the libiconv dependency in the --with-unicode-backend=glib case?

Alexey Proskuryakov

Comment 16 2008-08-26 08:55:59 PDT

Another possible approach would be to redesign WebCore text conversion classes so that they wouldn't need to enumerate all supported encodings. We've discussed this briefly before, but at that point, all conversion libraries we needed did provide encoding enumeration APIs.

Daniel Laird

Comment 17 2008-08-26 08:57:08 PDT

I would think that a dependency on libiconv is ok as this is commonly a dependency of glib. My only concern would be to make sure that libiconv cross-compiles better than icu otherwise we have substituted one bad package for another.

Dominik Röttsches (drott)

Comment 18 2008-08-26 09:03:46 PDT

Created attachment 23001 [details] Patch updated for trunk revision 35916 Now explicitly specifying byte order of UTF-16 target encoding based on glib macros.

Christian Dywan

Comment 19 2008-08-26 10:04:23 PDT

(In reply to comment #17) > I would think that a dependency on libiconv is ok as this is commonly a > dependency of glib. My only concern would be to make sure that libiconv > cross-compiles better than icu otherwise we have substituted one bad package > for another. Note that Glib on Win32 is built without iconv these days, so it is only required for unix platforms. In that respect avoiding the dependency would be preferrable.

Dominik Röttsches (drott)

Comment 20 2008-08-27 08:01:57 PDT

Created attachment 23031 [details] Overview and Comparison of Text Codecs in Safari Win/Mac and libiconv Libiconv becomes a less favorable choice for enumerating the available encodings given the fact that glib does not rely on libiconv on all platforms but uses the standard libc iconv on a number of Unices or win32 functions on Windows. I have created an overview of Safari Win & Mac encodings compared to the ones supported by libiconv (and most likely supported by standard libc iconv, too, as from what I understand, these implementations are often very similar). This table shows that almost all Safari encodings can be covered by using these encoding names. The table does not yet list all the aliases for each canonical encoding. For a start, a list of typical MIME encoding names could be extracted from: http://www.iana.org/assignments/character-sets. The canonical encoding names in the libiconv column of this table could be used for probing whether glib's g_iconv_open supports this encoding on a particular platform. Otherwise, the only way out would be to redesign WebCore's text codecs part, as Alexey pointed out in comment #16.

Manoj

Comment 21 2008-08-27 21:18:24 PDT

Applied patch to 35917. But no luck, still GtkLauncher is without any text. I am trying to test it on ARM Little Endian.

Dominik Röttsches (drott)

Comment 22 2008-09-02 13:04:11 PDT

Created attachment 23124 [details] patch updated for probing for iconv encodings, updated to revision 36025 I updated Jürg's patch for probing the list of codecs that are supported by iconv (as stated on the libiconv web page). TIS-620, WINDOWS-1251, EUC-KR, WINDOWS-1253 added manually. This patch now works without a dependency to libiconv, but relies on hard-coded codec tables. It could serve as an interim solution until it's decided to change the text codecs architecture.

Alexey Proskuryakov

Comment 23 2008-09-02 14:25:30 PDT

Comment on attachment 23124 [details] patch updated for probing for iconv encodings, updated to revision 36025 When asking for review, please mark patches as review?, not review+ (which means that a review has been granted).

Dominik Röttsches (drott)

Comment 24 2008-09-02 14:30:04 PDT

Sorry, of course.

Mark Rowe (bdash)

Comment 25 2008-09-03 16:55:51 PDT

There appear to be some encoding issues with this patch. Several non-ASCII characters in peoples names in ChangeLog entries and copyright lines appear to have been somewhat mangled.

monil

Comment 26 2008-09-04 05:08:59 PDT

I have used patch-36025 for Webkit version-36053 (using glib). I tried to open few html pages and some url using GtkLauncer on fedora-8, but facing following problems: 1) html page with english language is opening fine. 2) html page with korean or japanese language(utf-16) not showing any text. 3) tried to open url like google.com, these pages are not coming up.i put some debug statements in code so find out when launcher try to access the .js pages it hung there itself and not proceeding further.

Dominik Röttsches (drott)

Comment 27 2008-09-04 06:24:46 PDT

Created attachment 23167 [details] replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes (In reply to comment #25) The UTF-8 troubles in the ChangeLogs are now hopefully fixed. In addition, I tried being more coding style compliant.

Dominik Röttsches (drott)

Comment 28 2008-09-04 06:57:38 PDT

(In reply to comment #26) > 2) html page with korean or japanese language(utf-16) not showing any text. It would be helpful if you could post the URLs you're having problems with. In my opinion, UTF-16 source-encoded web pages are really rare as using this encoding inflates the content size. I tried some pages as found on the Alexa.com top 100 pages by country or language. For Korea, for example http://www.dreamwiz.com/, http://www.gmarket.co.kr/ - These ones are EUC-KR encoded. They are okay from an encoding point of view, however I cannot verify rendering completely as I don't have Korean fonts on my system. For Japanese I found that certain pages are in EUC-JP which might not be available. It depends on whether your system's iconv supports EUC-JP. My installation of iconv for example cannot instantiate a codec for this encoding. If you do a debug build, you will be able to see if there are problems instantiating a certain encoding by console error messages like: "EUC-JP not decodable => not available!" or "EUC-JP not encodable => not available!". > 3) tried to open url like google.com, these pages are not coming up.i put some > debug statements in code so find out when launcher try to access the .js pages > it hung there itself and not proceeding further. google.com, google.co.kr, google.co.jp display fine here. It would be helpful if you could provide more details and specify the exact URLs or procedure that lead to the problems you are observing. Have you tried the same in a built of the same revision but without applying the patch? - Just compare whether it's a new issue introduced by this page or one that might have existed previously.

Mike Hommey

Comment 29 2008-09-04 10:19:52 PDT

(In reply to comment #26) > 2) html page with korean or japanese language(utf-16) not showing any text. This might be bug #16792

Dominik Röttsches (drott)

Comment 30 2008-09-05 06:07:51 PDT

Comment on attachment 23167 [details] replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes Found an issue with this patch. It doesn't handle partially transmitted multibyte characters correctly. I'm looking into it, removing r?.

Alexey Proskuryakov

Comment 31 2008-09-05 08:51:17 PDT

A TextCodec implementation is supposed to keep partial bytes in it internal state, unless decode() is called with flush argument. In the latter case, an incomplete character is an error. This is handled by ICU internally, so TextCodecICU doesn't need any such logic (but TextCodecUTF16 has it).

Dominik Röttsches (drott)

Comment 32 2008-09-08 10:08:22 PDT

Created attachment 23263 [details] Replacing ICU with glib Patch updated for 36268, and: * now supports partially transmitted multibyte characters. Thank you, Alexey, for the clarification. * changed semantics of textBreakLast and textBreakFirst to look for characters not breaks.

monil

Comment 33 2008-09-18 07:29:01 PDT

Its strange, but i tried on diffrent versions of webkit by applying patches(glib) on that, still iam unable to open any url on gtklauncher. Following are the version which right now iam using: webkit->r36477 glib-patch->Patch updated for 36268 pango->pango-1.21.4 glib->glib-2.17.6 iconv->libiconv-1.12 I tried to open following url : www.google.co.in www.google.co.kr www.daum.net www.ana.co.jp none of the above url opened in GtkLuancher. Is some modification or some other packages are required? Also all these url's opened well with ICU. if some more information required please tell.

Eric Seidel (no email)

Comment 34 2008-09-23 04:34:39 PDT

Comment on attachment 23263 [details] Replacing ICU with glib There are a bunch of webkit-style violations in this patch, which make it difficult to r+. It concerns me that you'd need to copy all those macros from the ICU headers. Does GLIB not provide any equivalents? Things like where * and & go, relative to the variable name. Single line if/else statements don't use {} Use of c-style casts (Foo*)foo instead of static_cast<Foo*>(foo) Each variable gets its own declaration line, no int foo, bar; 4-space indents 0 instead of NULL when referring to pointers in c++ code true and false instead of TRUE and FALSE when dealing with c++ "bool" types Header #define names should be identical to the header, like #define UnicodeGLib_h Anything copied from Glib should really be in its own header file, IMO. Separating things out into their own file makes it easier to update them if they ever change upstream, or for platforms which have the necessary headers to not use our local copies of that data. Probably the same goes of the ICU macros. UnicodeMacrosFromICU.h would be one possible name. :) WebKit source code files are generally UT8, I'm not sure what encoding you're editing with: + * Copyright (C) 2008 Jürg Billeter <j@bitron.ch> There is a lot of code in UnicodeGLib.h, much of that could be moved into a corresponding .cpp file. :( Even though this code touches Gtk, it would be best if we could avoid manual-menory management. Manual memory management is *evil* (IMO) and gets you into all sorts of trouble. I think there was a GOwnPtr landed recently, no? That would know how to call g_free on its contents when it went out of scope? If so, one could use that instead of all the manual g_free calls. Thanks for the patch, but I think we need another round of cleanup to this one.

Eric Seidel (no email)

Comment 35 2008-09-23 04:35:57 PDT

If you haven't found it already, here is the WebKit style guide: http://webkit.org/coding/coding-style.html it's linked from the contribution guidelines: http://webkit.org/coding/contributing.html

Dominik Röttsches (drott)

Comment 36 2008-09-23 04:41:39 PDT

(In reply to comment #34) Thanks for your remarks, Eric. I will have a closer look at it once I find the time, maybe Jürg can support in this if he's available. Regarding the GOwnPtr, I will try to incorporate it. I saw the discussions but it wasn't landed at the time.

Miguel L. Turbón

Comment 37 2008-10-01 12:23:32 PDT

I think there is a bug in the patch in toLower, toUpper from UnicodeGlib.h (used in StringPrototype.cpp). Maybe a *error = FALSE is missing in this functions at the end.

Kalle Vahlman

Comment 38 2008-11-12 10:21:12 PST

I tried this patch, surfed through some pages with UTF-8 and didn't see any errors. All of these worked ok for me (well, except the bits I don't have font support for.. not many of those though): http://www.ltg.ed.ac.uk/~richard/unicode-sample.html http://www.ltg.ed.ac.uk/~richard/unicode-sample-3-2.html http://www.madore.org/~david/misc/unitest/ http://www.borut.com/library/s_10646t.htm It'd be great to get this cleaned up and in, dragging extra 14 MegaBytes of unneeded (and pretty annoyingly built) libraries seems silly, specially in mobile devices context.

Dominik Röttsches (drott)

Comment 39 2008-11-20 08:11:30 PST

Created attachment 25314 [details] Replacing ICU with glib Another round, now updated to apply cleanly to 38619: * lots of coding style fixes, with special focus on the ones Eric pointed out * tried to use GOwnPtr instead of manual memory management * externalized GLib casefolding table and macros copied from ICU to separate files * Changes to GNUMakefile.am in JSC depending on choice of unicode backend This is a preliminary version. Before asking for a review I need to do some more testing on pages with exotic encodings. Kalle, if you have some time, could have a go with this one? Any feedback is welcome.

Dominik Röttsches (drott)

Comment 40 2008-11-21 04:38:24 PST

Created attachment 25348 [details] Replacing ICU with glib In addition to the previous changes: * sawError argument to the decode() function now (hopefully) correctly used. * In response to Miguel L. Turbón's comment, initializing error out-argument to false for safety.

Holger Freyther

Comment 41 2008-11-21 18:46:04 PST

Does this pass the "fast/encoding" tests? You can run them with ./WebKitTools/Scripts/run-webkit-tests --gtk (--debug|--release) fast/encoding.

Dominik Röttsches (drott)

Comment 42 2008-11-24 05:47:55 PST

(In reply to comment #41) > Does this pass the "fast/encoding" tests? You can run them with > ./WebKitTools/Scripts/run-webkit-tests --gtk (--debug|--release) fast/encoding. Unfortunately not (yet). I've started looking into it. What's the expected outcome? Would this patch have to pass all of them? 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical name for the encoding which seems different in ICU and GLib (for example windows-1256 != CP1256). Would it be acceptable to modify the test cases here, e.g. doing the actual result check with a more tolerant RegExp in JS and print PASS/FAIL? Also, some of the tests specifically checking the decoding result for some exotic encodings will not pass because libiconv underlying to the GLib routines does not support all of them. How can we account for that?

Alexey Proskuryakov

Comment 43 2008-11-24 06:55:27 PST

(In reply to comment #42) > 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical > name for the encoding which seems different in ICU and GLib (for example > windows-1256 != CP1256). Would it be acceptable to modify the test cases here, > e.g. doing the actual result check with a more tolerant RegExp in JS and print > PASS/FAIL? We need to decide on a case by case basis. In particular, windows-1256 is a registered IANA name, unlike CP1256, so changing from former to latter is not an improvement. In fact, some cpXXX names do not even match windows-XXX ones with the same number (e.g. windows-949 vs. cp949). I don't have an opinion on whether the issues you described should be dealt with now, or left for later.

Darin Adler

Comment 44 2008-11-24 09:35:08 PST

(In reply to comment #42) > Unfortunately not (yet). I've started looking into it. What's the expected > outcome? Would this patch have to pass all of them? The regression tests tell you what behavior you're changing. Policy for what can be right and wrong in a given port is a different question. We have a mechanism for having port-specific results for various tests, so you can have expected failures in some ports rather than others. > 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical > name for the encoding which seems different in ICU and GLib (for example > windows-1256 != CP1256). Would it be acceptable to modify the test cases here, > e.g. doing the actual result check with a more tolerant RegExp in JS and print > PASS/FAIL? The names actually used on the web are the ones that matter. Also, consistent availability of the same names for encoding across platforms. But this is not yet extensively tested -- the test coverage is sparse. The text encoding libraries often don't match what's needed on the web very closely. Back early in the WebKit project, when we used the Mac-specific "Text Encoding Converter" library we had a long list of encoding names that had to be added explicitly; we didn't even try to use the names in the library itself. When we switched to the ICU library, we discovered that it had a much better list of names in the library, so the list of encoding names that need to be hardcoded in WebKit itself for ICU is a lot shorter. I imagine that GLib omits many encoding names that are needed for practical compatibility with the web, so you'll need code to deal with this. To see how this is implemented for ICU, look at the TextCodecICU::registerExtendedEncodingNames function. To see the remnants of the former mechanism used in older versions of WebKit, look at the TextCodecMac::registerEncodingNames function and at the character-sets.txt, mac-encodings.txt, and make-charset-table.pl files in the platform/text/mac directory. > Also, some of the tests specifically checking the decoding result for some > exotic encodings will not pass because libiconv underlying to the GLib routines > does not support all of them. How can we account for that? Those test failures are specific examples of how a version of WebKit using the GLib functions would be deficient compared to versions based on other text decoding libraries. There are at least three things we can do about this: 1) Put expected test results in a platform-specific test result directory, reflecting the known shortcomings of the GLib-based support. 2) Supplement the GLib decoding with some other mechanism so we can support more encodings without pulling in an entire library, such as ICU. 3) Take the existing tests and break them into multiple tests to separate "important" from "exotic" encodings, so that the tests that cover important encodings don't have to have platform-specific results. However, this will require discussion of what encodings are important enough.

Kalle Vahlman

Comment 45 2008-11-24 12:27:12 PST

(In reply to comment #39) > Kalle, if you have some time, could have a go with this one? Any feedback is > welcome. I took it for a spin, and visually it worked for me :) Valgrind disagreed though, there were two leaks introduced. Fortunately fixing those was trivial: --- a/WebCore/platform/graphics/gtk/FontGtk.cpp +++ b/WebCore/platform/graphics/gtk/FontGtk.cpp @@ -139,6 +139,7 @@ static gchar* convertUniCharToUTF8(const UChar* characters, gint length, int fro pos += start; len -= start; } + g_free(utf8); return g_string_free(ret, FALSE); } The contents of utf8 is appended to a gstring which copies the data, so the original needs to be freed. --- a/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp +++ b/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp @@ -67,6 +67,8 @@ static TextBreakIterator* setUpIterator(bool& createdIterator, TextBreakIterator iterator->m_type = type; iterator->m_length = length; + if (createdIterator) + g_free(iterator->m_log_attrs); iterator->m_log_attrs = g_new0(PangoLogAttr, length + 1); iterator->m_index = -1; pango_get_log_attrs(utf8.get(), utf8len, -1, 0, iterator->m_log_attrs, length + 1); Here the PangoLogAttr array was leaked for each setup after the first one. Since this seems to be done a lot, I wonder if using the g_slice API would be a good idea performance-wise... That would require profiling to prove it's hurting though.

Dominik Röttsches (drott)

Comment 46 2008-11-26 06:04:19 PST

(In reply to comment #45) > (In reply to comment #39) > I took it for a spin, and visually it worked for me :) Thanks for your feedback and taking a look, Kalle! > Valgrind disagreed though, there were two leaks introduced. Fortunately fixing > those was trivial: > > --- a/WebCore/platform/graphics/gtk/FontGtk.cpp > +++ b/WebCore/platform/graphics/gtk/FontGtk.cpp > @@ -139,6 +139,7 @@ static gchar* convertUniCharToUTF8(const UChar* characters, > gint length, int fro > pos += start; > len -= start; > } > + g_free(utf8); > return g_string_free(ret, FALSE); > } > > The contents of utf8 is appended to a gstring which copies the data, so the > original needs to be freed. Interesting that you found this one. The patch actually doesn't touch FontGtk.cpp AFAICS. So this leak must have been there before. I think it would make sense to file this one as a separate bug. FontGtk.cpp looks like it would profit from being reworked using GOwnPtr, so another option would be to just add this problem to bug 21594. > Here the PangoLogAttr array was leaked for each setup after the first one. > Since this seems to be done a lot, I wonder if using the g_slice API would be a > good idea performance-wise... That would require profiling to prove it's > hurting though. So far, I just incorporated your fix proposal.

Dominik Röttsches (drott)

Comment 47 2008-12-01 05:54:44 PST

Created attachment 25627 [details] Replacing ICU with glib, IDN support New iteration: * incorporated one of the leak fixes from Kalle for TextBreakIteratorGtk.cpp * added IDN support using libidn (autoconf check included) (this is required to pass /fast/encoding/url-hostname-non-ascii.html) * reordered some encoding names, added some aliases and changed some letter cases in order to pass the testcases that check for specific encoding names Now passes a lot more of the test cases, the remaining failing ones are mostly GBK and Hebrew encoding cases, which I am unable to test on my machine as my libiconv doesn't have them.

Dominik Röttsches (drott)

Comment 48 2008-12-01 05:58:21 PST

Created attachment 25628 [details] Fast/Encoding Test Results for GTK GLib-Unicode Release Build

Dominik Röttsches (drott)

Comment 49 2008-12-01 05:59:17 PST

Created attachment 25629 [details] Fast/Encoding Test Results for GTK ICU-Unicode Release Build For reference, the results for the ICU-backend release build.

Vitaliy Pronkin

Comment 50 2009-01-03 06:32:53 PST

I compiled this for N810/Maemo and it hangs in g_convert_with_iconv when passed length == 0. returning empty string in this case helps

Darin Adler

Comment 51 2009-01-10 15:57:52 PST

Comment on attachment 25627 [details] Replacing ICU with glib, IDN support Do all these changes have to go in one big patch? It seems that the codec changes are entirely separate from the other changes. By making the patch big, you make it a lot harder to review, so you'll end up waiting longer for someone to have time to review the whole thing. Roughly speaking, it looks good, but I don't see any reason to tie the codec changes with the others.

Alexander Butenko

Comment 52 2009-01-11 15:58:49 PST

i was using this patch for a while and it works perfect with english, spanish and russian languages. I dont know any other languages, so cant test more.

Dominik Röttsches (drott)

Comment 53 2009-01-12 03:23:56 PST

(In reply to comment #51) > Roughly speaking, it looks good, but I don't see any reason to tie the codec > changes with the others. Darin, first of all, thanks for the positive feedback. I am not sure if I understand how you see the codec changes separate from the rest of the patch. Could you explain in a little more detail? TextCodecGtk.cpp and TextCodecGtk.h are new in this patch, not modifying old code and essential to the overall functioning of this patch. On the one hand, on their own in a separate patch they wouldn't add anything functional and on the other hand, the changes in TextEncodingRegistry.cpp for example wouldn't work without these new files. Maybe my Changelog entry + TextCodecGtk now supports partially transmitted multibyte characters. is misleading. TextCodecGtk was only existing in prior versions of the patch, it didn't exist in the trunk before. So from my point view the patch cannot be split into codec changes / other changes - but I might have misunderstood your comment.

Kalle Vahlman

Comment 54 2009-01-12 04:31:13 PST

(In reply to comment #53) > (In reply to comment #51) > > Roughly speaking, it looks good, but I don't see any reason to tie the codec > > changes with the others. > > Darin, first of all, thanks for the positive feedback. > I am not sure if I understand how you see the codec changes separate from the > rest of the patch. Could you explain in a little more detail? He probably means that the changes could be split to allow individual review, not the whole 78kB of it :) On another issue, I guess related to comment #50, there's something that introduces a seemingly infinite loop with the glib code. I haven't been able to make real sense of it, but at least it's easy to reproduce: 1. Load google.com 2. type something to the search field (I think it needs to be > somecount chars, but not sure. 1234567890 should work) 3. go back to the beginning of the line and try typing additional text there 4. consistently, glib backend goes in to a busyloop while icu does not I've tried to print some debugs from the iterator constructor, but haven't seen anything exceptionally funny in there (but I admit that I'm not terribly well aware of what is actually happening there...).

Darin Adler

Comment 55 2009-01-12 09:08:00 PST

(In reply to comment #53) > I am not sure if I understand how you see the codec changes separate from the > rest of the patch. Could you explain in a little more detail? > TextCodecGtk.cpp and TextCodecGtk.h are new in this patch, not modifying old > code and essential to the overall functioning of this patch. On the one hand, > on their own in a separate patch they wouldn't add anything functional and on > the other hand, the changes in TextEncodingRegistry.cpp for example wouldn't > work without these new files. I mean that the TextCodecGtk and TextEncodingRegistry changes together can stand alone, and don't rely on the WTF changes. I see nothing where it requires the WTF changes. Similarly, the IDN changes don't have to go with either the codec changes or the WTF changes. Same for the text break iterator changes. These seem like they could be done in at least 4 separate pieces, each easier to review, and each helpful in making progress.

Daniel Laird

Comment 56 2009-01-13 02:51:55 PST

Can I ask if libiconv is a definate requirement now? I have libglib without libiconv support and when I run with an older version of this patch and webkit it crashes. If I build glib with libiconv=gnu option then webkit runs. However to get this to work for me requires some hacking so ideally I would not have a requirement to build libiconv. Hope someone can give a yes/no answer Cheers Dan

Dominik Röttsches (drott)

Comment 57 2009-01-13 03:26:09 PST

libiconv is not a direct requirement of this patch. The patch in its current form requires GLib and libidn. It needs at least GLib functions g_convert_with_iconv, g_iconv_open, g_iconv_close to be available and functional. So the patch indirectly depends on what your GLib's using to implement those functions.

Kalle Vahlman

Comment 58 2009-01-13 03:34:17 PST

(In reply to comment #54) > On another issue, I guess related to comment #50, there's something that > introduces a seemingly infinite loop with the glib code. I haven't been able to > make real sense of it, but at least it's easy to reproduce: > > 1. Load google.com > 2. type something to the search field (I think it needs to be > somecount > chars, but not sure. 1234567890 should work) > 3. go back to the beginning of the line and try typing additional text there > 4. consistently, glib backend goes in to a busyloop while icu does not Yesterday I took time to debug this further and found the following (excerpt from mail conversation): I did some digging with gdb and found that the busyloop revolves inside VisiblePosition::leftVisuallyDistinctCandidate() in WebCore/editing/VisiblePosition.cpp. There's two while(true) loops there and the inner one never exits. After a while of head-scratching, comparison with ICU implementation and reading ICU reference, I came to the conclusion that the following should be the correct fix: @@ -197,7 +197,7 @@ int textBreakPrevious(TextBreakIterator* bi) return i; } } - return textBreakFirst(bi); + return TextBreakDone; } The icu docs for ubrk_preceding() say: "The value returned is always smaller than offset, or UBRK_DONE." but the glib implementation was never returning DONE (which is -1) even for 0 since it calls the BreakFirst which (as the BreakLast comments also state) always returns the first character of the text. Also note that BreakNext also returns DONE instead of the last character already. --- Dominik agreed and will incorporate this into the next petch/patches.

Dominik Röttsches (drott)

Comment 59 2009-01-14 08:49:45 PST

Thanks to Darin's suggestions, for splitting the patch into smaller pieces, my plan is as follows. I will create 4 patches, the first one for moving WTF Unicode to GLib, the second one for moving Text Codecs from ICU to GLib, the third for switching TextBreakIterator to GLib and the fourth one for adding IDN support. For the first two modificaitons, I have created two mostly independent patches which can be reviewed separately. Unfortunately, they overlap in changes for /configure.ac and /GNUMakefile.am. This is because I introduced a compile flag for compiling a hybrid WebKit version that has a mixed ICU/GLib Unicode backend. Patch 2/4 can be applied after 1/4 if the changes to configure.ac and GNUMakefile.am and JavaScriptCore/wtf/unicode/Unicode.h are ignored/skipped. I can quickly clean up the second patch in this regard once the first is landed. My plan is that once these two would be landed, I will provide 3/4 and 4/4 containing the TextBreakIterator change and the IDN support - applying cleanly to a future SVN revision that contains 1&2. If anyone has a suggestion for improving or simplifying this staged process, please let me know. Also, any testing/feedback on the two separated patches is very welcome.

Dominik Röttsches (drott)

Comment 60 2009-01-14 08:51:34 PST

Created attachment 26711 [details] 1/4 - Moving WTF Unicode backend to GLib

Dominik Röttsches (drott)

Comment 61 2009-01-14 08:52:19 PST

Created attachment 26712 [details] 2/4 - Moving Text Codecs to GLib

Darin Adler

Comment 62 2009-01-15 10:26:01 PST

Comment on attachment 26711 [details] 1/4 - Moving WTF Unicode backend to GLib > Index: JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h > =================================================================== > --- JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h (revision 0) > +++ JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h (revision 0) > @@ -0,0 +1,200 @@ > +/* > + * Copyright (C) 2006 George Staikos <staikos@kde.org> > + * Copyright (C) 2006 Alexey Proskuryakov <ap@nypop.com> > + * Copyright (C) 2007 Apple Computer, Inc. All rights reserved. > + * Copyright (C) 2008 Jürg Billeter <j@bitron.ch> > + * Copyright (C) 2008 Dominik Röttsches <dominik.roettsches@access-company.com> Is this file really needed? I don't think the copyrights are right on this file. > +} casefold_table[] = { This is not the normal naming for identifiers in WebKit. This would typically be inside the WTF::Unicode namespace and be named something like caseFoldTable or CaseFoldTable. > +#include "CasefoldTableFromGLib.h" Do we really need to compile in a copy of the table into each source file? That seems like a very bad idea. Since the case folding table is declared static, it will have internal linkage so each source file that uses this header will get a separate copy of the table! > + inline UChar32 foldCase(UChar32 ch) These functions are pretty long. I don't think it makes sense to inline them. > + inline int umemcasecmp(const UChar* a, const UChar* b, int len) > + { > + GOwnPtr<char> utf8a; > + GOwnPtr<char> utf8b; > + > + utf8a.set(g_utf16_to_utf8(a, len, 0, 0, 0)); > + utf8b.set(g_utf16_to_utf8(b, len, 0, 0, 0)); > + > + GOwnPtr<char> foldedA; > + GOwnPtr<char> foldedB; > + > + foldedA.set(g_utf8_casefold(utf8a.get(), -1)); > + foldedB.set(g_utf8_casefold(utf8b.get(), -1)); > + > + int result = g_utf8_collate(foldedA.get(), foldedB.get()); > + > + return result; > + } Converting to UTF-8 just to do the case folding is going to be very inefficient. This is going to result in a quite-slow GTK port if it's used anywhere. In general we had to work hard to eliminate the conversion from UTF-16 and UTF-8 and back that used to be present in WebKit and JavaScriptCore in its core algorithms, and move that conversion to the external API. I think you need to do some performance tests if you're going to introduce memory allocation and UTF-8 conversion into these algorithms, unless it's OK to have the performance of the GTK port suffer. > +#ifndef UnicodeGLibTypes_h > +#define UnicodeGLibTypes_h > + > +typedef uint16_t UChar; > +typedef int32_t UChar32; > + > +#endif Seems a little excessive to have a separate header file for these, but it might be OK. I'm going to say review- because of some of the issues with the case folding table and the performance of the UTF-8 conversion.

Dominik Röttsches (drott)

Comment 63 2009-01-16 07:46:07 PST

Created attachment 26793 [details] 1/4 - Moving WTF Unicode backend to GLib (v2) (In reply to comment #62) Darin, thanks for taking the time to review the patch. Here's another iteration that tries to address your comments. > (From update of attachment 26711 [details] [review]) > > Index: JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h > > =================================================================== > > [...] > > Is this file really needed? I got rid of it at the price of a single-character ucs4->utf8->casefold->ucs4 conversion. See below regarding the conversion costs. > > +} casefold_table[] = { > This is not the normal naming for identifiers in WebKit. This would typically > be inside the WTF::Unicode namespace and be named something like caseFoldTable > or CaseFoldTable. gone. > > +#include "CasefoldTableFromGLib.h" > Do we really need to compile in a copy of the table into each source file? gone as well. > > + inline UChar32 foldCase(UChar32 ch) > > These functions are pretty long. I don't think it makes sense to inline them. I disabled inlining for the longer ones and moved them to a new file UnicodeGLib.cpp > > + inline int umemcasecmp(const UChar* a, const UChar* b, int len) > Converting to UTF-8 just to do the case folding is going to be very > inefficient. This is going to result in a quite-slow GTK port if it's used > anywhere. > > In general we had to work hard to eliminate the conversion from UTF-16 and > UTF-8 and back that used to be present in WebKit and JavaScriptCore in its core > algorithms, and move that conversion to the external API. I think you need to > do some performance tests if you're going to introduce memory allocation and > UTF-8 conversion into these algorithms, unless it's OK to have the performance > of the GTK port suffer. Looking at the public GLib Unicode API (http://library.gnome.org/devel/glib/stable/glib-Unicode-Manipulation.html) it seems difficult to call the more interesting unicode functionality with utf-16 strings directly - casefolding, collation, normalization all require utf-8. Please see below for my my view regarding these costs. > > +#ifndef UnicodeGLibTypes_h > > +#define UnicodeGLibTypes_h > > + > > +typedef uint16_t UChar; > > +typedef int32_t UChar32; > > + > > +#endif > > Seems a little excessive to have a separate header file for these, but it might > be OK. Pulled that into UnicodeGLib.h Also added a FIXME in umemcasecmp that discusses the discrepancy to the icu implementation. > I'm going to say review- because of some of the issues with the case folding > table and the performance of the UTF-8 conversion. While I am aware it's probably not a very appropriate benchmark, for a first and very rough idea, I compared the ICU build against the GLib build running sunspider. You can see a perfomance regression of 2-3% in most of the string tests. I will attach these results. My understanding is that this patch helps to reduce ROM footprint when deploying on an embedded platform by eliminating the ICU dependency, saving approximately 10MB (as Alp reports). Currently, the glib backend is optional, icu would remain default. So in my opinion, the benefit of this patch is that it gives a choice to integrators between sacrificing a little performance in favor of the binary reduction or package ICU onto their targets. Future changes in GLib could lead to removing the utf-16 to utf-8 conversions eventually.

Dominik Röttsches (drott)

Comment 64 2009-01-16 07:49:42 PST

Created attachment 26794 [details] Sunspider Results Comparing icu against GLib Unicode Backend

Mikkel Kruse Johnsen

Comment 65 2009-02-08 07:34:39 PST

When will the two missing patches be published ? I need them to cross compile WebKit for windows (no ICU). Would love to test them.

Dominik Röttsches (drott)

Comment 66 2009-02-09 01:20:40 PST

(In reply to comment #65) > When will the two missing patches be published ? As I mentioned in comment #59, the plan is to publish them as soon as 1 & 2 are landed. There is not really much new in patch 3 & 4 to come, the code is already available in patch "Replacing ICU with glib, IDN support" from 1st of December. It's just split into 4 parts now. HTH.

Mikkel Kruse Johnsen

Comment 67 2009-04-08 06:54:45 PDT

How do we deal with this error, what could replace these functions ? ../WebCore/editing/visible_units.cpp: In function 'int WebCore::firstNonComplexContextLineBreak(const UChar*, int)': ../WebCore/editing/visible_units.cpp:49: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope ../WebCore/editing/visible_units.cpp: In function 'int WebCore::lastNonComplexContextLineBreak(const UChar*, int)': ../WebCore/editing/visible_units.cpp:58: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope ../WebCore/editing/visible_units.cpp: In function 'WebCore::VisiblePosition WebCore::previousBoundary(const WebCore::VisiblePosition&, unsigned int (*)(const UChar*, unsigned int, unsigned int))': ../WebCore/editing/visible_units.cpp:89: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope ../WebCore/editing/visible_units.cpp: In function 'WebCore::VisiblePosition WebCore::nextBoundary(const WebCore::VisiblePosition&, unsigned int (*)(const UChar*, unsigned int, unsigned int))': ../WebCore/editing/visible_units.cpp:175: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope make[1]: *** [WebCore/editing/libWebCore_la-visible_units.lo] Error 1 make[1]: Leaving directory `/home/mkj/projects/XCare/build/webkit-1.1.4/WebKitBuild' make: *** [all] Error 2

Mikkel Kruse Johnsen

Comment 68 2009-04-09 04:48:57 PDT

Just added to "JavaScriptCore/wtf/unicode/glib/UnicodeGLib.h" inline bool hasLineBreakingPropertyComplexContext(UChar32 c) { // FIXME return false; }

Eric Seidel (no email)

Comment 69 2009-05-21 18:45:26 PDT

Comment on attachment 26712 [details] 2/4 - Moving Text Codecs to GLib No comments have been made in over a month. Marking this r- to remove it from the review queue. Please email one of the gtk reviewers directly to get comments. Feel free to post for re-review if you have had contact with a reviewer and believe this will be reviewed in a timely manner.

Eric Seidel (no email)

Comment 70 2009-05-21 18:45:39 PDT

Comment on attachment 26793 [details] 1/4 - Moving WTF Unicode backend to GLib (v2) No comments have been made in over a month. Marking this r- to remove it from the review queue. Please email one of the gtk reviewers directly to get comments. Feel free to post for re-review if you have had contact with a reviewer and believe this will be reviewed in a timely manner.

Maciej Stachowiak

Comment 71 2009-05-21 22:31:43 PDT

Comment on attachment 26712 [details] 2/4 - Moving Text Codecs to GLib Reflagging for review; this still needs substantive review, ideally from a Gtk reviewer.

Maciej Stachowiak

Comment 72 2009-05-21 22:31:51 PDT

Comment on attachment 26793 [details] 1/4 - Moving WTF Unicode backend to GLib (v2) Reflagging for review; this still needs substantive review, ideally from a Gtk reviewer.

Gustavo Noronha (kov)

Comment 73 2009-05-22 10:23:50 PDT

Comment on attachment 26712 [details] 2/4 - Moving Text Codecs to GLib This seems to be an earlier attempt, that has been superseded by https://bugs.webkit.org/attachment.cgi?id=26793, right? Clearing flag.

Gustavo Noronha (kov)

Comment 74 2009-05-22 10:56:23 PDT

Comment on attachment 26793 [details] 1/4 - Moving WTF Unicode backend to GLib (v2) > + https://bugs.webkit.org/show_bug.cgi?id=15914 > + [GTK] Implement Unicode functionality using GLib I have gone through all the comments now, and have verified that most comments have been addressed. I think the best way to move forward is to land the patch in its current state, before it gets old enough that too much has changed, now that you have made it smaller and self-contained. I have given it another pass, and found only indentation problems, which I fixed (see UnicodeGLib.h for reference, so that you can get it right for the next patches), and two missing function implementations (including the one found by Mikkel), which I added. I also fixed header inclusion on top of UnicodeGLib.h. I have tested that the ICU backend keeps working, and passing tests. This, along with the facts that icu keeps being the default, and that your patch touches little cross platform code makes me believe that we're ready for landing it now.

Gustavo Noronha (kov)

Comment 75 2009-05-22 11:08:58 PDT

Comment on attachment 26793 [details] 1/4 - Moving WTF Unicode backend to GLib (v2) Landed as r44050.

Dominik Röttsches (drott)

Comment 76 2009-05-25 05:12:06 PDT

(In reply to comment #73) > (From update of attachment 26712 [details] [review]) > This seems to be an earlier attempt, that has been superseded by > https://bugs.webkit.org/attachment.cgi?id=26793, right? Clearing flag. 26712 belonged to an earlier set (26711 and 26712). I hope I will soon find the time to provide an updated version of patch 2/4 for the Text Codecs. Thanks for landing!

Gustavo Noronha (kov)

Comment 77 2009-10-16 05:31:17 PDT

Ping. Do you plan to keep going on this one, or should we revert the original patch?

Christian Dywan

Comment 78 2009-10-16 10:35:04 PDT

I would love to see this progress, if it weren't for the bleeding edge Glib requirement making it unusable on my N900 :-/

Xan Lopez

Comment 79 2009-10-16 15:49:23 PDT

(In reply to comment #78) > I would love to see this progress, if it weren't for the bleeding edge Glib > requirement making it unusable on my N900 :-/ Upgrading the glib in the system should be trivial, I'm sure at worst there's a handful of patches applied to it.

Christian Dywan

Comment 80 2009-10-16 17:02:42 PDT

Emphasis on "should". Glib can't (easily) be replaced because several system applications explicitly require the version that is ships with.

Xan Lopez

Comment 81 2009-10-16 17:05:54 PDT

(In reply to comment #80) > Emphasis on "should". Glib can't (easily) be replaced because several system > applications explicitly require the version that is ships with. Upgrade your glib and then just change the version number back to the previous one. Stupid workarounds for stupid applications (and good God, file bugs).

Dominik Röttsches (drott)

Comment 82 2009-10-19 01:58:40 PDT

(In reply to comment #77) > Ping. Do you plan to keep going on this one, or should we revert the original > patch? Yes, I will provide updated patches until the middle of this week.

Dominik Röttsches (drott)

Comment 83 2009-10-19 11:24:06 PDT

Created attachment 41434 [details] 2/4 - Moving Text Codecs to GLib (v2) New version of patch 2/4 - applies cleanly against r49769 Moving TextCodec to GLib and removing respective HYBRID macro usages.

Eric Seidel (no email)

Comment 84 2009-10-19 14:08:57 PDT

Doing more than one patch per bug, is generally a recipe for slow reviews. In this case, I'm scared to review this because I don't have time to read all 83 comments before doing my review. :(

Dominik Röttsches (drott)

Comment 85 2009-10-19 14:48:26 PDT

(In reply to comment #84) > Doing more than one patch per bug, is generally a recipe for slow reviews. In > this case, I'm scared to review this because I don't have time to read all 83 > comments before doing my review. :( Well, in comment #51 Darin asked me to split it into smaller portions, that's why there's more than one patch. Maybe Gustavo can help reviewing this as he reviewed 1/4?

Gustavo Noronha (kov)

Comment 86 2009-10-25 11:54:12 PDT

Comment on attachment 41434 [details] 2/4 - Moving Text Codecs to GLib (v2) > @@ -114,6 +117,20 @@ CString TextEncoding::encode(const UChar > QString str(reinterpret_cast<const QChar*>(characters), length); > str = str.normalized(QString::NormalizationForm_C); > return newTextCodec(*this)->encode(reinterpret_cast<const UChar *>(str.utf16()), str.length(), handling); > +#elif USE(GLIB_UNICODE) > + GOwnPtr<char> utf8source; > + utf8source.set(g_utf16_to_utf8(characters, length, 0, 0, 0)); > + > + GOwnPtr<char> utf8normalized; > + utf8normalized.set(g_utf8_normalize(utf8source.get(), -1, G_NORMALIZE_NFC)); > + > + long utf16length; > + GOwnPtr<UChar> utf16normalized; > + utf16normalized.set(g_utf8_to_utf16(utf8normalized.get(), -1, 0, &utf16length, 0)); Naming of the variables is a bit off our style. UTF8Normalized is more like it, for instance. > + CString result = newTextCodec(*this)->encode(utf16normalized.get(), utf16length, handling); > + > + return result; No need for a variable, just return the newTextCodec call directly. > +#include "gtk/TextCodecGtk.h" > +#endif > #if PLATFORM(WINCE) && !PLATFORM(QT) > #include "TextCodecWince.h" > #endif > > + > using namespace WTF; Why this space? > +namespace WebCore { > + > + > + Too many blank lines to my taste =). > +// list of text encodings and their aliases > +// the first entry/"alias" is the canonical name, each alias list must be terminated by a 0 Comments should in general be full sentences, starting with a capital letter, and ending in a period. > +// Unicode > +TextCodecGtk::codecAliasList TextCodecGtk::m_codecAliases_UTF_8 = { "UTF-8", 0}; You're missing a space before } on all of these. I don't think we have a style rule for this, but it looks inconsistent with the fact that you have one at the start =). > +static PassOwnPtr<TextCodec> newTextCodecGtk(const TextEncoding& encoding, const void*) > +{ > +#ifndef NDEBUG > + // too noisy if each UTF-8 codec creation is logged > + // if (strcmp(encoding.name(), "UTF-8")) > + // LOG(TextConversion, "creating new text codec for encoding name %s", encoding.name()); > +#endif This is dead code. Either let it log stuff, or remove the code altogether, please. I think it may make sense to log in a debug build, but up to you. > + for (unsigned int i = 0; i < listLength; ++i) { > + codecAliasList *codecAliases = static_cast<codecAliasList*>(encodingList[i]); > + int codecCount = -1; > + const gchar *canonicalName, *currentAlias; > + bool canonicalAvailable = true; > + > + // for each of the alias lists: > + // bail out if canonical is not available, > + // then continue testing each alias, > + // if available, add it to the list, mapping it to its canonical name > + > + while ((currentAlias = (*codecAliases)[++codecCount]) && canonicalAvailable) { > + bool currentAvailable = isEncodingAvailable(currentAlias); > + > + if (codecCount == 0 && currentAvailable) { > + canonicalName = currentAlias; > + canonicalAvailable = true; > + LOG_VERBOSE(TextConversion, "registering encoding canonical %s", canonicalName); > + registrar(canonicalName, canonicalName); > + } else if (codecCount == 0 && !currentAvailable) > + canonicalAvailable = false; This feels a bit convoluted. Have you considered moving these two condition checks before the while loop? There's no need to check them on every loop run, since they are only useful in the first one. > +TextCodecGtk::~TextCodecGtk() > +{ > + releaseIConv(); > +} > + > +void TextCodecGtk::releaseIConv() const > +{ > + if (m_iconvDecoder != reinterpret_cast<GIConv>(-1)) { > + g_iconv_close(m_iconvDecoder); > + m_iconvDecoder = reinterpret_cast<GIConv>(-1); > + } > + if (m_iconvEncoder != reinterpret_cast<GIConv>(-1)) { > + g_iconv_close(m_iconvEncoder); > + m_iconvEncoder = reinterpret_cast<GIConv>(-1); > + } > +} Why is this a separate function? Do you call releaseIConv anywhere else? Why not put this code inside the destructor, directly? > + if (err) { > + LOG_ERROR("GIConv conversion error"); Would it be useful to log the error message here? > + CString result = CString(buffer.get(), count); > + > + return result; No need for a variable here. > +#ifndef TextCodecGTK_h > +#define TextCodecGTK_h > + > +#include <glib.h> > + > +#include "TextCodec.h" > +#include "TextEncoding.h" No need for that blank line between the includes. Another round! =)

Dominik Röttsches (drott)

Comment 87 2009-10-26 07:39:01 PDT

Created attachment 41866 [details] 2/4 - Moving Text Codecs to GLib (v3) (In reply to comment #86) Thanks a lot for your review, Gustavo! > (From update of attachment 41434 [details]) > > @@ -114,6 +117,20 @@ CString TextEncoding::encode(const UChar > > [...] > > + GOwnPtr<char> utf8source; > > [...] > > + GOwnPtr<char> utf8normalized; > > [...] > > + GOwnPtr<UChar> utf16normalized; > > Naming of the variables is a bit off our style. UTF8Normalized is more like it, > for instance. If I understood the style guide correctly, it should be CamelCase starting with a lowercase letter for variables, also lowercasing acronyms. so I changed it to utf8Source, utf8Normalized etc. Hope that's fine now. > > + CString result = newTextCodec(*this)->encode(utf16normalized.get(), utf16length, handling); > > + > > + return result; > > No need for a variable, just return the newTextCodec call directly. Done. > > +#include "gtk/TextCodecGtk.h" > > +#endif > > #if PLATFORM(WINCE) && !PLATFORM(QT) > > #include "TextCodecWince.h" > > #endif > > > > + > > using namespace WTF; > > Why this space? Removed. > > +namespace WebCore { > > + > > + > > + > > Too many blank lines to my taste =). Removed. > > +// list of text encodings and their aliases > > +// the first entry/"alias" is the canonical name, each alias list must be terminated by a 0 > > Comments should in general be full sentences, starting with a capital letter, > and ending in a period. Rephrased this comment as well as the Changelog a little bit. > > +// Unicode > > +TextCodecGtk::codecAliasList TextCodecGtk::m_codecAliases_UTF_8 = { "UTF-8", 0}; > > You're missing a space before } on all of these. I don't think we have a style > rule for this, but it looks inconsistent with the fact that you have one at the > start =). Changed to match. > > +static PassOwnPtr<TextCodec> newTextCodecGtk(const TextEncoding& encoding, const void*) > > +{ > > +#ifndef NDEBUG > > + // too noisy if each UTF-8 codec creation is logged > > + // if (strcmp(encoding.name(), "UTF-8")) > > + // LOG(TextConversion, "creating new text codec for encoding name %s", encoding.name()); > > +#endif > > This is dead code. Either let it log stuff, or remove the code altogether, > please. I think it may make sense to log in a debug build, but up to you. Removed the log statement here and as well as in isEncodingAvailable. Replaced those by more accurate logging in registerEncodingNames. > > + for (unsigned int i = 0; i < listLength; ++i) { > > + codecAliasList *codecAliases = static_cast<codecAliasList*>(encodingList[i]); > > + int codecCount = -1; > > + const gchar *canonicalName, *currentAlias; > > + bool canonicalAvailable = true; > > + > > + // for each of the alias lists: > > + // bail out if canonical is not available, > > + // then continue testing each alias, > > + // if available, add it to the list, mapping it to its canonical name > > + > > + while ((currentAlias = (*codecAliases)[++codecCount]) && canonicalAvailable) { > > + bool currentAvailable = isEncodingAvailable(currentAlias); > > + > > + if (codecCount == 0 && currentAvailable) { > > + canonicalName = currentAlias; > > + canonicalAvailable = true; > > + LOG_VERBOSE(TextConversion, "registering encoding canonical %s", canonicalName); > > + registrar(canonicalName, canonicalName); > > + } else if (codecCount == 0 && !currentAvailable) > > + canonicalAvailable = false; > > This feels a bit convoluted. Have you considered moving these two condition > checks before the while loop? There's no need to check them on every loop run, > since they are only useful in the first one. Yes, thanks for the remark. I tried to unclutter this and indeed it should be much easier to read now. > > +TextCodecGtk::~TextCodecGtk() > > +{ > > + releaseIConv(); > > +} > > + > > +void TextCodecGtk::releaseIConv() const > > +{ > > + if (m_iconvDecoder != reinterpret_cast<GIConv>(-1)) { > > + g_iconv_close(m_iconvDecoder); > > + m_iconvDecoder = reinterpret_cast<GIConv>(-1); > > + } > > + if (m_iconvEncoder != reinterpret_cast<GIConv>(-1)) { > > + g_iconv_close(m_iconvEncoder); > > + m_iconvEncoder = reinterpret_cast<GIConv>(-1); > > + } > > +} > > Why is this a separate function? Do you call releaseIConv anywhere else? Why > not put this code inside the destructor, directly? "Inlined". > > + if (err) { > > + LOG_ERROR("GIConv conversion error"); > > Would it be useful to log the error message here? Printing code and error message now in encode() and decode(). > > + CString result = CString(buffer.get(), count); > > + > > + return result; > > No need for a variable here. Removed. > > +#ifndef TextCodecGTK_h > > +#define TextCodecGTK_h > > + > > +#include <glib.h> > > + > > +#include "TextCodec.h" > > +#include "TextEncoding.h" > > No need for that blank line between the includes. Removed. In addition, due to the log error output I changed the canonical name for Mac Central Europe Encoding from MACCENTRALEUROPE to MAC-CENTRALEUROPE and EUC_JP and EUC_CN to EUC-JP and EUC-CN, so that they work with GLib/iconv. > Another round! =) Thanks again for your review.

Gustavo Noronha (kov)

Comment 88 2009-10-28 06:20:55 PDT

Comment on attachment 41866 [details] 2/4 - Moving Text Codecs to GLib (v3) > +#elif USE(GLIB_UNICODE) > + GOwnPtr<char> utf8Source; > + utf8Source.set(g_utf16_to_utf8(characters, length, 0, 0, 0)); > + > + GOwnPtr<char> utf8Normalized; > + utf8Normalized.set(g_utf8_normalize(utf8Source.get(), -1, G_NORMALIZE_NFC)); > + > + long utf16Length; > + GOwnPtr<UChar> utf16Normalized; > + utf16Normalized.set(g_utf8_to_utf16(utf8Normalized.get(), -1, 0, &utf16Length, 0)); > + > + return newTextCodec(*this)->encode(utf16Normalized.get(), utf16Length, handling); Regarding this, there's an exception to the camel case rule for acronyms, and indeed all usages of "utf" in this file seem to be upper-cased, so I think we should strive for consistency. This seems to be the only problem I am able to spot by now. Sorry for being nit-picky, but please upload a patch with this fixed so that we can use the commit queue =). Thanks!

Dominik Röttsches (drott)

Comment 89 2009-10-30 06:35:18 PDT

Created attachment 42210 [details] 2/4 - Moving Text Codecs to GLib (v4) (In reply to comment #88) > Regarding this, there's an exception to the camel case rule for acronyms, and > indeed all usages of "utf" in this file seem to be upper-cased, so I think we > should strive for consistency. This seems to be the only problem I am able to > spot by now. Sorry for being nit-picky, but please upload a patch with this > fixed so that we can use the commit queue =). Thanks! Sure, here it is. Thanks for reviewing.

Gustavo Noronha (kov)

Comment 90 2009-11-03 14:30:14 PST

Comment on attachment 42210 [details] 2/4 - Moving Text Codecs to GLib (v4) Thanks.

WebKit Commit Bot

Comment 91 2009-11-03 14:59:11 PST

Comment on attachment 42210 [details] 2/4 - Moving Text Codecs to GLib (v4) Rejecting patch 42210 from commit-queue. Failed to run "['/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', '--reviewer', 'Gustavo Noronha', '--force']" exit_code: 1 Last 500 characters of output: stry.cpp |=================================================================== |--- wkTrunkGlib_2_4_Codec/WebCore/platform/text/TextEncodingRegistry.cpp (revision 50327) |+++ wkTrunkGlib_2_4_Codec/WebCore/platform/text/TextEncodingRegistry.cpp (working copy) -------------------------- No file to patch. Skipping patch. 5 out of 5 hunks ignored patching file wkTrunkGlib_2_4_Codec/WebCore/platform/text/gtk/TextCodecGtk.cpp patching file wkTrunkGlib_2_4_Codec/WebCore/platform/text/gtk/TextCodecGtk.h

Eric Seidel (no email)

Comment 92 2009-11-03 15:01:01 PST

Looks like the patch is rooted incorrectly. It's possible that svn-create-patch crawled up to the wrong repository root. If you believe that it's a problem with svn-create-patch, please file a bug. :)

Dominik Röttsches (drott)

Comment 93 2009-11-04 01:33:25 PST

Created attachment 42465 [details] 2/4 - Moving Text Codecs to GLib (v5) (In reply to comment #92) > Looks like the patch is rooted incorrectly. It's possible that > svn-create-patch crawled up to the wrong repository root. If you believe that > it's a problem with svn-create-patch, please file a bug. :) Argh, yes, there was an orphaned .svn folder one level up, looks like svn-create-patch only stopped at this one. Here's a correctly rooted version of the patch, otherwise no changes.

Eric Seidel (no email)

Comment 94 2009-11-04 01:41:16 PST

Comment on attachment 42465 [details] 2/4 - Moving Text Codecs to GLib (v5) If you believe svn-create-patch to have been in error in its handling of this patch creation please do file a bug. :) But yes this patch looks good. Since all patches require both an r+ and a cq+ for the commit-queue to land, I'm adding an r+ here too, mostly based on gns's previous review.

WebKit Commit Bot

Comment 95 2009-11-04 02:28:11 PST

Comment on attachment 42465 [details] 2/4 - Moving Text Codecs to GLib (v5) Clearing flags on attachment: 42465 Committed r50508: <http://trac.webkit.org/changeset/50508>

WebKit Commit Bot

Comment 96 2009-11-04 02:28:30 PST

All reviewed patches have been landed. Closing bug.

Christian Dywan

Comment 97 2009-11-05 20:00:57 PST

Patches 3 and 4 are still pending, hence re-opening.

Eric Seidel (no email)

Comment 98 2009-11-06 11:30:45 PST

Comment on attachment 42210 [details] 2/4 - Moving Text Codecs to GLib (v4) Clearing r+ on this obsolete patch.

Eric Seidel (no email)

Comment 99 2009-11-06 11:31:26 PST

(In reply to comment #97) > Patches 3 and 4 are still pending, hence re-opening. They're not marked for review? Should they be? I really think we should close this bug and start a new ones for any remaining changes.

Dominik Röttsches (drott)

Comment 100 2009-11-13 05:45:52 PST

(In reply to comment #99) > (In reply to comment #97) > > Patches 3 and 4 are still pending, hence re-opening. > > They're not marked for review? Should they be? No, nothing to be marked for review here. > I really think we should close this bug and start a new ones for any remaining > changes. I agree, I'm creating those and adding them as dependencies. So, there won't be any new patches on this one.

Dominik Röttsches (drott)

Comment 101 2010-01-28 01:21:30 PST

The patch for sub-bug 31470 was landed, so this one here can be closed. I don't seem to have the rights for it, could somebody help me? I created a new issue for streamlining the differences in behavior between ICU and GLib unicode backend: bug 34247.

Gustavo Noronha (kov)

Comment 102 2010-01-28 05:14:26 PST

there you go, thanks!

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Enhancement

Classification Unclassified

Version 528+ (Nightly build)

Hardware All

OS All

Product WebKit

Component New Bugs

Assignee

Nobody

Reported

2007-11-08 22:29 PST

Modified

2011-05-24 19:01 PDT History

CC List

26 users Show

URL

Keywords Gtk

Depends on

31468 31469 31470

Blocks

Dependencies

tree graph

ap
christian
christophe.public
commit-queue
cosimoc
darin
david.krause
d-r
eric
erik-webkit
fsulima
grzegorz.dabrowski
gustavo
j
laszlo.gombos
marco.barisione
mh+webkit
mikkel
mlturbon
naiem.shaik
slomo
s.mathur
tofu.linden
tonikitoo
xan.lopez
zuh