Bug 15914 - [GTK] Implement Unicode functionality using GLib
Summary: [GTK] Implement Unicode functionality using GLib
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Enhancement
Assignee: Nobody
URL:
Keywords: Gtk
Depends on: 31468 31469 31470
Blocks:
  Show dependency treegraph
 
Reported: 2007-11-08 22:29 PST by Alp Toker
Modified: 2011-05-24 19:01 PDT (History)
26 users (show)

See Also:


Attachments
Preliminary patch removing ICU dependency (44.29 KB, patch)
2008-02-01 14:00 PST, Jürg Billeter
no flags Details | Formatted Diff | Diff
Updated patch (44.50 KB, patch)
2008-04-25 14:42 PDT, Jürg Billeter
no flags Details | Formatted Diff | Diff
Modified the Juergbi's patch for umemcasecmp function (44.57 KB, patch)
2008-06-04 22:59 PDT, Naiem
no flags Details | Formatted Diff | Diff
Patch updated for trunk revision 35916 (46.59 KB, patch)
2008-08-26 09:03 PDT, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
Overview and Comparison of Text Codecs in Safari Win/Mac and libiconv (3.33 KB, text/plain)
2008-08-27 08:01 PDT, Dominik Röttsches (drott)
no flags Details
patch updated for probing for iconv encodings, updated to revision 36025 (63.92 KB, patch)
2008-09-02 13:04 PDT, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes (63.93 KB, patch)
2008-09-04 06:24 PDT, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
Replacing ICU with glib (67.69 KB, patch)
2008-09-08 10:08 PDT, Dominik Röttsches (drott)
eric: review-
Details | Formatted Diff | Diff
Replacing ICU with glib (75.30 KB, patch)
2008-11-20 08:11 PST, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
Replacing ICU with glib (75.71 KB, patch)
2008-11-21 04:38 PST, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
Replacing ICU with glib, IDN support (78.07 KB, patch)
2008-12-01 05:54 PST, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
Fast/Encoding Test Results for GTK GLib-Unicode Release Build (25.91 KB, application/zip)
2008-12-01 05:58 PST, Dominik Röttsches (drott)
no flags Details
Fast/Encoding Test Results for GTK ICU-Unicode Release Build (10.76 KB, application/zip)
2008-12-01 05:59 PST, Dominik Röttsches (drott)
no flags Details
1/4 - Moving WTF Unicode backend to GLib (38.66 KB, patch)
2009-01-14 08:51 PST, Dominik Röttsches (drott)
darin: review-
Details | Formatted Diff | Diff
2/4 - Moving Text Codecs to GLib (36.73 KB, patch)
2009-01-14 08:52 PST, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
1/4 - Moving WTF Unicode backend to GLib (v2) (30.56 KB, patch)
2009-01-16 07:46 PST, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff
Sunspider Results Comparing icu against GLib Unicode Backend (3.38 KB, text/plain)
2009-01-16 07:49 PST, Dominik Röttsches (drott)
no flags Details
2/4 - Moving Text Codecs to GLib (v2) (33.60 KB, patch)
2009-10-19 11:24 PDT, Dominik Röttsches (drott)
gustavo: review-
Details | Formatted Diff | Diff
2/4 - Moving Text Codecs to GLib (v3) (32.97 KB, patch)
2009-10-26 07:39 PDT, Dominik Röttsches (drott)
gustavo: review-
Details | Formatted Diff | Diff
2/4 - Moving Text Codecs to GLib (v4) (32.97 KB, patch)
2009-10-30 06:35 PDT, Dominik Röttsches (drott)
commit-queue: commit-queue-
Details | Formatted Diff | Diff
2/4 - Moving Text Codecs to GLib (v5) (32.52 KB, patch)
2009-11-04 01:33 PST, Dominik Röttsches (drott)
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alp Toker 2007-11-08 22:29:23 PST
GLib's Unicode manipulation API:

http://library.gnome.org/devel/glib/stable/glib-Unicode-Manipulation.html

The old gtk-webcore project does this to a small extent and is a good starting place for inspiration.

This could be a fairly easy way to shave off 10M shared library memory use due to ICU. I haven't investigated how complete the coverage is though.
Comment 1 Xan Lopez 2007-11-21 12:07:18 PST
Can you quickly point where is the code that should be modified?
Comment 2 Alp Toker 2008-01-06 07:31:17 PST
JavaScriptCore/wtf/unicode/qt4/UnicodeQt4.h is a good place to start porting, though we probably don't want to inline some of the larger functions here.
Comment 3 Jürg Billeter 2008-02-01 14:00:13 PST
Created attachment 18852 [details]
Preliminary patch removing ICU dependency

I've replaced the ICU dependency by the unicode functionality of GLib, Pango, and fribidi. The attached patch is experimental. It's probably quite a bit slower as I had to add many conversions between UTF-8 and UTF-16. TextBoundariesGtk.cpp implementation is still missing, so at least text entry won't work.

Pango already contains a copy of fribidi, however, the Pango API doesn't provide access to Bidi Type information. That's why I had to add a direct dependency on fribidi. Maybe this could be eliminated by extending the API in a future version of Pango.
Comment 4 Jürg Billeter 2008-02-09 08:20:27 PST
I've filed a Pango enhancement request to avoid the fribidi dependency: http://bugzilla.gnome.org/show_bug.cgi?id=515432
Comment 5 Alp Toker 2008-03-09 18:09:57 PDT
r30913 now conditionalises ICU use in the GTK+ port.
Comment 6 Jürg Billeter 2008-04-25 14:42:23 PDT
Created attachment 20824 [details]
Updated patch

I've updated the patch to not depend on fribidi anymore using the new function available in pango 1.21.0. I've also implemented TextBoundariesGtk.cpp, so selection in text input elements basically work again.

The patch still needs work, though. It doesn't register all available encodings and the semantics of the text break iterator implementation doesn't seem to completely match the ICU iterator.
Comment 7 Naiem 2008-06-04 22:59:20 PDT
Created attachment 21500 [details]
Modified the Juergbi's patch for umemcasecmp function

Modified couple of lines in earlier patch in umemcasecmp function for sending the string length.
Comment 8 Dominik Röttsches (drott) 2008-06-09 08:13:54 PDT
In r33380 (http://trac.webkit.org/changeset/33380) the appendOmittingBOM function was removed. So this patch can't be cleanly applied to tip of trunk currently (appendOmittingBOM is used for example in TextCodecGtk::decode()). The solution I came up with so far was to explicitly specify UTF-16LE as the target encoding in g_iconv_open() of TextCodecGtk::createIConvDecoder(). Currently I only used LE, but Glib's byte-order macros could be used to determine the correct target encoding including its byte order at compile time.

This stops libiconv from generating BOMs for the result of the conversion. Consequently, there's no need for removing the BOMs when appending them to the Vector<UChar>, allowing us to just use result.append(buffer, count / sizeof(UChar)). 

What is your opinion on this approach? Do you think that's the right direction to go or do you see any risks? 
Comment 9 Mike Hommey 2008-06-09 08:51:14 PDT
> This stops libiconv from generating BOMs for the result of the conversion.

Keep in mind that while it won't *add* a BOM, it will *keep* existing BOMs (zero-width no-break space to be precise) if the original stream contains them, wherever they are located (start of the stream or not)
Comment 10 Daniel Laird 2008-06-30 12:45:36 PDT
I have downloaded r34824 and tried the patch.  I have updated it so that it can be applied except I am new to this and have no idea about the now missing appendOmittingBOM.  For now to get webkit building and hopefully running on GTK-DirectFB I have just commented it out.  However I was wonderig what progress had been made with this patch?
If I get it running on my platform I will be happy to try out updated patches.
For now it is just compiling and then when I run it crashes so ill sort that first but the idea of removing ICU support is great hope you can finish what you started.
dan
Comment 11 Manoj 2008-08-25 02:34:22 PDT
(In reply to comment #10)
> I have downloaded r34824 and tried the patch.  I have updated it so that it can
> be applied except I am new to this and have no idea about the now missing
> appendOmittingBOM.  For now to get webkit building and hopefully running on
> GTK-DirectFB I have just commented it out.  However I was wonderig what
> progress had been made with this patch?
> If I get it running on my platform I will be happy to try out updated patches.
> For now it is just compiling and then when I run it crashes so ill sort that
> first but the idea of removing ICU support is great hope you can finish what
> you started.
> dan
> 

(In reply to comment #10)
> I have downloaded r34824 and tried the patch.  I have updated it so that it can
> be applied except I am new to this and have no idea about the now missing
> appendOmittingBOM.  For now to get webkit building and hopefully running on
> GTK-DirectFB I have just commented it out.  However I was wonderig what
> progress had been made with this patch?
> If I get it running on my platform I will be happy to try out updated patches.
> For now it is just compiling and then when I run it crashes so ill sort that
> first but the idea of removing ICU support is great hope you can finish what
> you started.
> dan
> 

I amtrying to remove ICU lib from webkit dep.

I have applied patch to r35806 version.
Commented out appendOmittingBOM from TextCodecGtk::decode of TextCodecGtk.cpp.
When i try to lauch GtkLauncher, there is not text on my GtkLauncher, though i have set google.com as url.
If any one suggests where is problem. I could try to resolve it. 
Comment 12 Manoj 2008-08-26 03:29:11 PDT
Replacing ICU lib with Glib is a great idea.

Are there anyone success with Unicode support with Glib instead of ICU lib ?
I have applied the Glib patch for Unicode, But there is no text display on my browser(gtklauncher). 
Comment 13 Dominik Röttsches (drott) 2008-08-26 03:38:43 PDT
(In reply to comment #11)

> I amtrying to remove ICU lib from webkit dep.
> 
> I have applied patch to r35806 version.
> Commented out appendOmittingBOM from TextCodecGtk::decode of TextCodecGtk.cpp.
> When i try to lauch GtkLauncher, there is not text on my GtkLauncher, though i
> have set google.com as url.
> If any one suggests where is problem. I could try to resolve it. 

Yes, it does work, I can confirm that - although the patch attachment here is not up to date with trunk. Look at my comment #8. Commenting out appendOmittingBOM is not sufficient for getting it to work. The comment should give you pointers on how to resolve this. 

I contacted Jürg via Email to get more information for setting up a development environment that supports pango 1.21 (which he used to get rid of a fribidi dependency) - no reply set. I was trying to work on an updated patch that is compatible with current trunk.
Comment 14 Dominik Röttsches (drott) 2008-08-26 06:46:28 PDT
Now, using an Ubuntu 8.04 plus some *-dev packages from Ubuntu 8.10 I was able to set up an environment that has libpango-dev with at a 1.21.x version number. 
Comment 15 Dominik Röttsches (drott) 2008-08-26 08:41:34 PDT
(In reply to comment #6)

> The patch still needs work, though. It doesn't register all available encodings [...]

Regarding listing the encodings, I'm currently considering two alternatives:

1) Probe g_iconv_open with all the supported encodings listed in the libiconv's list of supported encodings (http://www.gnu.org/software/libiconv/). For each probe, if successful, register the respective encoding with webcore. This avoids a direct dependency to libiconv but isn't very elegant. 

2) Introduce a dependency to libiconv and use its iconvlist() function directly. This function takes a callback as a parameter and calls it for each of the available encodings. Introducing this dependency doesn't necessarily mean much extra trouble, as http://library.gnome.org/devel/glib/stable/glib-building.html#dependencies lists libiconv as a dependency of glib on many platforms.

Any other suggestions? Is it acceptable to introduce the libiconv dependency in the --with-unicode-backend=glib case?
Comment 16 Alexey Proskuryakov 2008-08-26 08:55:59 PDT
Another possible approach would be to redesign WebCore text conversion classes so that they wouldn't need to enumerate all supported encodings. We've discussed this briefly before, but at that point, all conversion libraries we needed did provide encoding enumeration APIs.
Comment 17 Daniel Laird 2008-08-26 08:57:08 PDT
I would think that a dependency on libiconv is ok as this is commonly a dependency of glib.  My only concern would be to make sure that libiconv cross-compiles better than icu otherwise we have substituted one bad package for another.

Comment 18 Dominik Röttsches (drott) 2008-08-26 09:03:46 PDT
Created attachment 23001 [details]
Patch updated for trunk revision 35916

Now explicitly specifying byte order of UTF-16 target encoding based on glib macros.
Comment 19 Christian Dywan 2008-08-26 10:04:23 PDT
(In reply to comment #17)
> I would think that a dependency on libiconv is ok as this is commonly a
> dependency of glib.  My only concern would be to make sure that libiconv
> cross-compiles better than icu otherwise we have substituted one bad package
> for another.

Note that Glib on Win32 is built without iconv these days, so it is only required for unix platforms. In that respect avoiding the dependency would be preferrable.
Comment 20 Dominik Röttsches (drott) 2008-08-27 08:01:57 PDT
Created attachment 23031 [details]
Overview and Comparison of Text Codecs in Safari Win/Mac and libiconv

Libiconv becomes a less favorable choice for enumerating the available encodings given the fact that glib does not rely on libiconv on all platforms but uses the standard libc iconv on a number of Unices or win32 functions on Windows.

I have created an overview of Safari Win & Mac encodings compared to the ones supported by libiconv (and most likely supported by standard libc iconv, too, as from what I understand, these implementations are often very similar). This table shows that almost all Safari encodings can be covered by using these encoding names.

The table does not yet list all the aliases for each canonical encoding. For a start, a list of typical MIME encoding names could be extracted from: http://www.iana.org/assignments/character-sets.

The canonical encoding names in the libiconv column of this table could be used for probing whether glib's g_iconv_open supports this encoding on a particular platform. 

Otherwise, the only way out would be to redesign WebCore's text codecs part, as Alexey pointed out in comment #16.
Comment 21 Manoj 2008-08-27 21:18:24 PDT
Applied patch to 35917. But no luck, still GtkLauncher is without any text.
I am trying to test it on ARM Little Endian.
Comment 22 Dominik Röttsches (drott) 2008-09-02 13:04:11 PDT
Created attachment 23124 [details]
patch updated for probing for iconv encodings, updated to revision 36025

I updated Jürg's patch for probing the list of codecs that are supported by iconv (as stated on the libiconv web page). TIS-620, WINDOWS-1251, EUC-KR, WINDOWS-1253 added manually. This patch now works without a dependency to libiconv, but relies on hard-coded codec tables. It could serve as an interim solution until it's decided to change the text codecs architecture.
Comment 23 Alexey Proskuryakov 2008-09-02 14:25:30 PDT
Comment on attachment 23124 [details]
patch updated for probing for iconv encodings, updated to revision 36025

When asking for review, please mark patches as review?, not review+ (which means that a review has been granted).
Comment 24 Dominik Röttsches (drott) 2008-09-02 14:30:04 PDT
Sorry, of course.
Comment 25 Mark Rowe (bdash) 2008-09-03 16:55:51 PDT
There appear to be some encoding issues with this patch.  Several non-ASCII characters in peoples names in ChangeLog entries and copyright lines appear to have been somewhat mangled.
Comment 26 monil 2008-09-04 05:08:59 PDT
I have used patch-36025 for Webkit version-36053 (using glib). I tried to open few  html pages and some url using GtkLauncer on fedora-8, but facing following problems:
1) html page with english language is opening fine.
2) html page with korean or japanese language(utf-16) not showing any text. 
3) tried to open url like google.com, these pages are not coming up.i put some debug statements in code so find out when launcher try to access the .js pages it hung there itself and not proceeding further.

Comment 27 Dominik Röttsches (drott) 2008-09-04 06:24:46 PDT
Created attachment 23167 [details]
replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes

(In reply to comment #25)
The UTF-8 troubles in the ChangeLogs are now hopefully fixed.
In addition, I tried being more coding style compliant.
Comment 28 Dominik Röttsches (drott) 2008-09-04 06:57:38 PDT
(In reply to comment #26)

> 2) html page with korean or japanese language(utf-16) not showing any text. 

It would be helpful if you could post the URLs you're having problems with. In my opinion, UTF-16 source-encoded web pages are really rare as using this encoding inflates the content size. 

I tried some pages as found on the Alexa.com top 100 pages by country or language.
For Korea, for example http://www.dreamwiz.com/, http://www.gmarket.co.kr/ - These ones are EUC-KR encoded. They are okay from an encoding point of view, however I cannot verify rendering completely as I don't have Korean fonts on my system.
For Japanese I found that certain pages are in EUC-JP which might not be available. It depends on whether your system's iconv supports EUC-JP. My installation of iconv for example cannot instantiate a codec for this encoding. If you do a debug build, you will be able to see if there are problems instantiating a certain encoding by console error messages like: "EUC-JP not decodable => not available!" or "EUC-JP not encodable => not available!". 

> 3) tried to open url like google.com, these pages are not coming up.i put some
> debug statements in code so find out when launcher try to access the .js pages
> it hung there itself and not proceeding further.

google.com, google.co.kr, google.co.jp display fine here. It would be helpful if you could provide more details and specify the exact URLs or procedure that lead to the problems you are observing.

Have you tried the same in a built of the same revision but without applying the patch? - Just compare whether it's a new issue introduced by this page or one that might have existed previously.
Comment 29 Mike Hommey 2008-09-04 10:19:52 PDT
(In reply to comment #26)
> 2) html page with korean or japanese language(utf-16) not showing any text. 

This might be bug #16792
Comment 30 Dominik Röttsches (drott) 2008-09-05 06:07:51 PDT
Comment on attachment 23167 [details]
replacing icu with glib, updated for 36087, ChangeLogs now UTF-8, coding style fixes

Found an issue with this patch. It doesn't handle partially transmitted multibyte characters correctly. I'm looking into it, removing r?.
Comment 31 Alexey Proskuryakov 2008-09-05 08:51:17 PDT
A TextCodec implementation is supposed to keep partial bytes in it internal state, unless decode() is called with flush argument. In the latter case, an incomplete character is an error. This is handled by ICU internally, so TextCodecICU doesn't need any such logic (but TextCodecUTF16 has it).
Comment 32 Dominik Röttsches (drott) 2008-09-08 10:08:22 PDT
Created attachment 23263 [details]
Replacing ICU with glib

Patch updated for 36268, and:
* now supports partially transmitted multibyte characters. Thank you, Alexey, for the clarification.
* changed semantics of textBreakLast and textBreakFirst to look for characters not breaks.
Comment 33 monil 2008-09-18 07:29:01 PDT
Its strange, but i tried on diffrent versions of webkit by applying patches(glib) on that, still iam unable to open any url on gtklauncher. Following are the version which right now iam using:
webkit->r36477
glib-patch->Patch updated for 36268
pango->pango-1.21.4
glib->glib-2.17.6
iconv->libiconv-1.12

I tried to open following url :
www.google.co.in
www.google.co.kr
www.daum.net
www.ana.co.jp

none of the above url opened in GtkLuancher.
Is some modification or some other packages are required?
Also all these url's opened well with ICU.
if some more information required please tell.
Comment 34 Eric Seidel (no email) 2008-09-23 04:34:39 PDT
Comment on attachment 23263 [details]
Replacing ICU with glib

There are a bunch of webkit-style violations in this patch, which make it difficult to r+.

It concerns me that you'd need to copy all those macros from the ICU headers.  Does GLIB not provide any equivalents?

Things like where * and & go, relative to the variable name.
Single line if/else statements don't use {}
Use of c-style casts (Foo*)foo instead of static_cast<Foo*>(foo)
Each variable gets its own declaration line, no int foo, bar;
4-space indents
0 instead of NULL when referring to pointers in c++ code
true and false instead of TRUE and FALSE when dealing with c++ "bool" types
Header #define names should be identical to the header, like #define UnicodeGLib_h

Anything copied from Glib should really be in its own header file, IMO.  Separating things out into their own file makes it easier to update them if they ever change upstream, or for platforms which have the necessary headers to not use our local copies of that data.

Probably the same goes of the ICU macros. UnicodeMacrosFromICU.h would be one possible name. :)

WebKit source code files are generally UT8, I'm not sure what encoding you're editing with:
+ * Copyright (C) 2008 Jürg Billeter <j@bitron.ch>


There is a lot of code in UnicodeGLib.h, much of that could be moved into a corresponding .cpp file.

:(  Even though this code touches Gtk, it would be best if we could avoid manual-menory management.  Manual memory management is *evil* (IMO) and gets you into all sorts of trouble.  I think there was a GOwnPtr landed recently, no?  That would know how to call g_free on its contents when it went out of scope?  If so, one could use that instead of all the manual g_free calls.

Thanks for the patch, but I think we need another round of cleanup to this one.
Comment 35 Eric Seidel (no email) 2008-09-23 04:35:57 PDT
If you haven't found it already, here is the WebKit style guide:
http://webkit.org/coding/coding-style.html
it's linked from the contribution guidelines:
http://webkit.org/coding/contributing.html
Comment 36 Dominik Röttsches (drott) 2008-09-23 04:41:39 PDT
(In reply to comment #34)

Thanks for your remarks, Eric. I will have a closer look at it once I find the time, maybe Jürg can support in this if he's available. 

Regarding the GOwnPtr, I will try to incorporate it. I saw the discussions but it wasn't landed at the time. 
Comment 37 Miguel L. Turbón 2008-10-01 12:23:32 PDT
I think there is a bug in the patch in toLower, toUpper from UnicodeGlib.h (used in StringPrototype.cpp). Maybe a *error = FALSE is missing in this functions at the end.
Comment 38 Kalle Vahlman 2008-11-12 10:21:12 PST
I tried this patch, surfed through some pages with UTF-8 and didn't see any errors.

All of these worked ok for me (well, except the bits I don't have font support for.. not many of those though):

http://www.ltg.ed.ac.uk/~richard/unicode-sample.html
http://www.ltg.ed.ac.uk/~richard/unicode-sample-3-2.html
http://www.madore.org/~david/misc/unitest/
http://www.borut.com/library/s_10646t.htm

It'd be great to get this cleaned up and in, dragging extra 14 MegaBytes of unneeded (and pretty annoyingly built) libraries seems silly, specially in mobile devices context.
Comment 39 Dominik Röttsches (drott) 2008-11-20 08:11:30 PST
Created attachment 25314 [details]
Replacing ICU with glib

Another round, now updated to apply cleanly to 38619:
* lots of coding style fixes, with special focus on the ones Eric pointed out
* tried to use GOwnPtr instead of manual memory management
* externalized GLib casefolding table and macros copied from ICU to separate files
* Changes to GNUMakefile.am in JSC depending on choice of unicode backend

This is a preliminary version. Before asking for a review I need to do some more testing on pages with exotic encodings.

Kalle, if you have some time, could have a go with this one? Any feedback is welcome.
Comment 40 Dominik Röttsches (drott) 2008-11-21 04:38:24 PST
Created attachment 25348 [details]
Replacing ICU with glib

In addition to the previous changes:
* sawError argument to the decode() function now (hopefully) correctly used.
* In response to Miguel L. Turbón's comment, initializing error out-argument to false for safety.
Comment 41 Holger Freyther 2008-11-21 18:46:04 PST
Does this pass the "fast/encoding" tests? You can run them with ./WebKitTools/Scripts/run-webkit-tests --gtk (--debug|--release) fast/encoding.
Comment 42 Dominik Röttsches (drott) 2008-11-24 05:47:55 PST
(In reply to comment #41)
> Does this pass the "fast/encoding" tests? You can run them with
> ./WebKitTools/Scripts/run-webkit-tests --gtk (--debug|--release) fast/encoding.

Unfortunately not (yet). I've started looking into it. What's the expected outcome? Would this patch have to pass all of them? 

2 or 3 test cases (admittedly the easier ones) depend on a certain canonical name for the encoding which seems different in ICU and GLib (for example windows-1256 != CP1256). Would it be acceptable to modify the test cases here, e.g. doing the actual result check with a more tolerant RegExp in JS and print PASS/FAIL? 

Also, some of the tests specifically checking the decoding result for some exotic encodings will not pass because libiconv underlying to the GLib routines does not support all of them. How can we account for that?
Comment 43 Alexey Proskuryakov 2008-11-24 06:55:27 PST
(In reply to comment #42)
> 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical
> name for the encoding which seems different in ICU and GLib (for example
> windows-1256 != CP1256). Would it be acceptable to modify the test cases here,
> e.g. doing the actual result check with a more tolerant RegExp in JS and print
> PASS/FAIL? 

We need to decide on a case by case basis. In particular, windows-1256 is a registered IANA name, unlike CP1256, so changing from former to latter is not an improvement. In fact, some cpXXX names do not even match windows-XXX ones with the same number (e.g. windows-949 vs. cp949).

I don't have an opinion on whether the issues you described should be dealt with now, or left for later.
Comment 44 Darin Adler 2008-11-24 09:35:08 PST
(In reply to comment #42)
> Unfortunately not (yet). I've started looking into it. What's the expected
> outcome? Would this patch have to pass all of them?

The regression tests tell you what behavior you're changing. Policy for what can be right and wrong in a given port is a different question. We have a mechanism for having port-specific results for various tests, so you can have expected failures in some ports rather than others.

> 2 or 3 test cases (admittedly the easier ones) depend on a certain canonical
> name for the encoding which seems different in ICU and GLib (for example
> windows-1256 != CP1256). Would it be acceptable to modify the test cases here,
> e.g. doing the actual result check with a more tolerant RegExp in JS and print
> PASS/FAIL?

The names actually used on the web are the ones that matter. Also, consistent availability of the same names for encoding across platforms. But this is not yet extensively tested -- the test coverage is sparse.

The text encoding libraries often don't match what's needed on the web very closely. Back early in the WebKit project, when we used the Mac-specific "Text Encoding Converter" library we had a long list of encoding names that had to be added explicitly; we didn't even try to use the names in the library itself. When we switched to the ICU library, we discovered that it had a much better list of names in the library, so the list of encoding names that need to be hardcoded in WebKit itself for ICU is a lot shorter. I imagine that GLib omits many encoding names that are needed for practical compatibility with the web, so you'll need code to deal with this.

To see how this is implemented for ICU, look at the TextCodecICU::registerExtendedEncodingNames function.

To see the remnants of the former mechanism used in older versions of WebKit, look at the TextCodecMac::registerEncodingNames function and at the character-sets.txt, mac-encodings.txt, and make-charset-table.pl files in the platform/text/mac directory.

> Also, some of the tests specifically checking the decoding result for some
> exotic encodings will not pass because libiconv underlying to the GLib routines
> does not support all of them. How can we account for that?

Those test failures are specific examples of how a version of WebKit using the GLib functions would be deficient compared to versions based on other text decoding libraries. There are at least three things we can do about this:

    1) Put expected test results in a platform-specific test result directory, reflecting the known shortcomings of the GLib-based support.

    2) Supplement the GLib decoding with some other mechanism so we can support more encodings without pulling in an entire library, such as ICU.

    3) Take the existing tests and break them into multiple tests to separate "important" from "exotic" encodings, so that the tests that cover important encodings don't have to have platform-specific results. However, this will require discussion of what encodings are important enough.
Comment 45 Kalle Vahlman 2008-11-24 12:27:12 PST
(In reply to comment #39)
> Kalle, if you have some time, could have a go with this one? Any feedback is
> welcome. 

I took it for a spin, and visually it worked for me :)

Valgrind disagreed though, there were two leaks introduced. Fortunately fixing those was trivial:

--- a/WebCore/platform/graphics/gtk/FontGtk.cpp
+++ b/WebCore/platform/graphics/gtk/FontGtk.cpp
@@ -139,6 +139,7 @@ static gchar* convertUniCharToUTF8(const UChar* characters, gint length, int fro
         pos += start;
         len -= start;
     }
+    g_free(utf8);
     return g_string_free(ret, FALSE);
 }

The contents of utf8 is appended to a gstring which copies the data, so the original needs to be freed.
 
--- a/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp
+++ b/WebCore/platform/text/gtk/TextBreakIteratorGtk.cpp
@@ -67,6 +67,8 @@ static TextBreakIterator* setUpIterator(bool& createdIterator, TextBreakIterator
 
     iterator->m_type = type;
     iterator->m_length = length;
+    if (createdIterator)
+       g_free(iterator->m_log_attrs);
     iterator->m_log_attrs = g_new0(PangoLogAttr, length + 1);
     iterator->m_index = -1;
     pango_get_log_attrs(utf8.get(), utf8len, -1, 0, iterator->m_log_attrs, length + 1);

Here the PangoLogAttr array was leaked for each setup after the first one. Since this seems to be done a lot, I wonder if using the g_slice API would be a good idea performance-wise... That would require profiling to prove it's hurting though.
Comment 46 Dominik Röttsches (drott) 2008-11-26 06:04:19 PST
(In reply to comment #45)
> (In reply to comment #39)
> I took it for a spin, and visually it worked for me :)

Thanks for your feedback and taking a look, Kalle!

> Valgrind disagreed though, there were two leaks introduced. Fortunately fixing
> those was trivial:
> 
> --- a/WebCore/platform/graphics/gtk/FontGtk.cpp
> +++ b/WebCore/platform/graphics/gtk/FontGtk.cpp
> @@ -139,6 +139,7 @@ static gchar* convertUniCharToUTF8(const UChar* characters,
> gint length, int fro
>          pos += start;
>          len -= start;
>      }
> +    g_free(utf8);
>      return g_string_free(ret, FALSE);
>  }
> 
> The contents of utf8 is appended to a gstring which copies the data, so the
> original needs to be freed.

Interesting that you found this one. The patch actually doesn't touch FontGtk.cpp AFAICS. So this leak must have been there before. I think it would make sense to file this one as a separate bug. FontGtk.cpp looks like it would profit from being reworked using GOwnPtr, so another option would be to just add this problem to bug 21594. 


> Here the PangoLogAttr array was leaked for each setup after the first one.
> Since this seems to be done a lot, I wonder if using the g_slice API would be a
> good idea performance-wise... That would require profiling to prove it's
> hurting though.

So far, I just incorporated your fix proposal. 
Comment 47 Dominik Röttsches (drott) 2008-12-01 05:54:44 PST
Created attachment 25627 [details]
Replacing ICU with glib, IDN support

New iteration:
* incorporated one of the leak fixes from Kalle for TextBreakIteratorGtk.cpp
* added IDN support using libidn (autoconf check included)
  (this is required to pass /fast/encoding/url-hostname-non-ascii.html)
* reordered some encoding names, added some aliases and changed some letter cases in order to pass the testcases that check for specific encoding names

Now passes a lot more of the test cases, the remaining failing ones are mostly GBK and Hebrew encoding cases, which I am unable to test on my machine as my libiconv doesn't have them.
Comment 48 Dominik Röttsches (drott) 2008-12-01 05:58:21 PST
Created attachment 25628 [details]
Fast/Encoding Test Results for GTK GLib-Unicode Release Build
Comment 49 Dominik Röttsches (drott) 2008-12-01 05:59:17 PST
Created attachment 25629 [details]
Fast/Encoding Test Results for GTK ICU-Unicode Release Build

For reference, the results for the ICU-backend release build.
Comment 50 Vitaliy Pronkin 2009-01-03 06:32:53 PST
I compiled this for N810/Maemo and it hangs in g_convert_with_iconv when passed length == 0. returning empty string in this case helps

Comment 51 Darin Adler 2009-01-10 15:57:52 PST
Comment on attachment 25627 [details]
Replacing ICU with glib, IDN support

Do all these changes have to go in one big patch? It seems that the codec changes are entirely separate from the other changes. By making the patch big, you make it a lot harder to review, so you'll end up waiting longer for someone to have time to review the whole thing.

Roughly speaking, it looks good, but I don't see any reason to tie the codec changes with the others.
Comment 52 Alexander Butenko 2009-01-11 15:58:49 PST
i was using this patch for a while and it works perfect with english, spanish and russian languages. I dont know any other languages, so cant test more.
Comment 53 Dominik Röttsches (drott) 2009-01-12 03:23:56 PST
(In reply to comment #51)
> Roughly speaking, it looks good, but I don't see any reason to tie the codec
> changes with the others.

Darin, first of all, thanks for the positive feedback. 
I am not sure if I understand how you see the codec changes separate from the rest of the patch. Could you explain in a little more detail? 
TextCodecGtk.cpp and TextCodecGtk.h are new in this patch, not modifying old code and essential to the overall functioning of this patch. On the one hand, on their own in a separate patch they wouldn't add anything functional and on the other hand, the changes in TextEncodingRegistry.cpp for example wouldn't work without these new files. 
Maybe my Changelog entry
+        TextCodecGtk now supports partially transmitted multibyte characters.
is misleading. TextCodecGtk was only existing in prior versions of the patch, it didn't exist in the trunk before. 

So from my point view the patch cannot be split into codec changes / other changes - but I might have misunderstood your comment.
Comment 54 Kalle Vahlman 2009-01-12 04:31:13 PST
(In reply to comment #53)
> (In reply to comment #51)
> > Roughly speaking, it looks good, but I don't see any reason to tie the codec
> > changes with the others.
> 
> Darin, first of all, thanks for the positive feedback. 
> I am not sure if I understand how you see the codec changes separate from the
> rest of the patch. Could you explain in a little more detail? 

He probably means that the changes could be split to allow individual review, not the whole 78kB of it :)

On another issue, I guess related to comment #50, there's something that introduces a seemingly infinite loop with the glib code. I haven't been able to make real sense of it, but at least it's easy to reproduce:

1. Load google.com
2. type something to the search field (I think it needs to be > somecount chars, but not sure. 1234567890 should work)
3. go back to the beginning of the line and try typing additional text there
4. consistently, glib backend goes in to a busyloop while icu does not

I've tried to print some debugs from the iterator constructor, but haven't seen anything exceptionally funny in there (but I admit that I'm not terribly well aware of what is actually happening there...).
Comment 55 Darin Adler 2009-01-12 09:08:00 PST
(In reply to comment #53)
> I am not sure if I understand how you see the codec changes separate from the
> rest of the patch. Could you explain in a little more detail? 
> TextCodecGtk.cpp and TextCodecGtk.h are new in this patch, not modifying old
> code and essential to the overall functioning of this patch. On the one hand,
> on their own in a separate patch they wouldn't add anything functional and on
> the other hand, the changes in TextEncodingRegistry.cpp for example wouldn't
> work without these new files. 

I mean that the TextCodecGtk and TextEncodingRegistry changes together can stand alone, and don't rely on the WTF changes. I see nothing where it requires the WTF changes.

Similarly, the IDN changes don't have to go with either the codec changes or the WTF changes.

Same for the text break iterator changes.

These seem like they could be done in at least 4 separate pieces, each easier to review, and each helpful in making progress.
Comment 56 Daniel Laird 2009-01-13 02:51:55 PST
Can I ask if libiconv is a definate requirement now?
I have libglib without libiconv support and when I run with an older version of this patch and webkit it crashes.
If I build glib with libiconv=gnu option then webkit runs.  However to get this to work for me requires some hacking so ideally I would not have a requirement to build libiconv.
Hope someone can give a yes/no answer 
Cheers
Dan
Comment 57 Dominik Röttsches (drott) 2009-01-13 03:26:09 PST
libiconv is not a direct requirement of this patch. The patch in its current form requires GLib and libidn. It needs at least GLib functions g_convert_with_iconv, g_iconv_open, g_iconv_close to be available and functional. So the patch indirectly depends on what your GLib's using to implement those functions. 
Comment 58 Kalle Vahlman 2009-01-13 03:34:17 PST
(In reply to comment #54)
> On another issue, I guess related to comment #50, there's something that
> introduces a seemingly infinite loop with the glib code. I haven't been able to
> make real sense of it, but at least it's easy to reproduce:
>
> 1. Load google.com
> 2. type something to the search field (I think it needs to be > somecount
> chars, but not sure. 1234567890 should work)
> 3. go back to the beginning of the line and try typing additional text there
> 4. consistently, glib backend goes in to a busyloop while icu does not

Yesterday I took time to debug this further and found the following (excerpt from mail conversation):

I did some digging with gdb and found that the busyloop
revolves inside VisiblePosition::leftVisuallyDistinctCandidate() in
WebCore/editing/VisiblePosition.cpp. There's two while(true) loops
there and the inner one never exits. After a while of
head-scratching, comparison with ICU implementation and reading ICU
reference, I came to the conclusion that the following should be the
correct fix:

@@ -197,7 +197,7 @@ int textBreakPrevious(TextBreakIterator* bi)
            return i;
        }
    }
-    return textBreakFirst(bi);
+    return TextBreakDone;
 }

The icu docs for ubrk_preceding() say:

 "The value returned is always smaller than offset, or UBRK_DONE."

but the glib implementation was never returning DONE (which is -1)
even for 0 since it calls the BreakFirst which (as the BreakLast
comments also state) always returns the first character of the text.
Also note that BreakNext also returns DONE instead of the last
character already.

---

Dominik agreed and will incorporate this into the next petch/patches.
Comment 59 Dominik Röttsches (drott) 2009-01-14 08:49:45 PST
Thanks to Darin's suggestions, for splitting the patch into smaller pieces, my plan is as follows. I will create 4 patches, the first one for moving WTF Unicode to GLib, the second one for moving Text Codecs from ICU to GLib, the third for switching TextBreakIterator to GLib and the fourth one for adding IDN support. 

For the first two modificaitons, I have created two mostly independent patches which can be reviewed separately. Unfortunately, they overlap in changes for /configure.ac and /GNUMakefile.am. This is because I introduced a compile flag for compiling a hybrid WebKit version that has a mixed ICU/GLib Unicode backend. 

Patch 2/4 can be applied after 1/4 if the changes to configure.ac and GNUMakefile.am and JavaScriptCore/wtf/unicode/Unicode.h are ignored/skipped.

I can quickly clean up the second patch in this regard once the first is landed.

My plan is that once these two would be landed, I will provide 3/4 and 4/4 containing the TextBreakIterator change and the IDN support - applying cleanly to a future SVN revision that contains 1&2. 

If anyone has a suggestion for improving or simplifying this staged process, please let me know. Also, any testing/feedback on the two separated patches is very welcome.
Comment 60 Dominik Röttsches (drott) 2009-01-14 08:51:34 PST
Created attachment 26711 [details]
1/4 - Moving WTF Unicode backend to GLib
Comment 61 Dominik Röttsches (drott) 2009-01-14 08:52:19 PST
Created attachment 26712 [details]
2/4 - Moving Text Codecs to GLib
Comment 62 Darin Adler 2009-01-15 10:26:01 PST
Comment on attachment 26711 [details]
1/4 - Moving WTF Unicode backend to GLib

> Index: JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h
> ===================================================================
> --- JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h	(revision 0)
> +++ JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h	(revision 0)
> @@ -0,0 +1,200 @@
> +/*
> + *  Copyright (C) 2006 George Staikos <staikos@kde.org>
> + *  Copyright (C) 2006 Alexey Proskuryakov <ap@nypop.com>
> + *  Copyright (C) 2007 Apple Computer, Inc. All rights reserved.
> + *  Copyright (C) 2008 Jürg Billeter <j@bitron.ch>
> + *  Copyright (C) 2008 Dominik Röttsches <dominik.roettsches@access-company.com>

Is this file really needed?

I don't think the copyrights are right on this file.

> +} casefold_table[] = {

This is not the normal naming for identifiers in WebKit. This would typically be inside the WTF::Unicode namespace and be named something like caseFoldTable or CaseFoldTable.

> +#include "CasefoldTableFromGLib.h"

Do we really need to compile in a copy of the table into each source file? That seems like a very bad idea. Since the case folding table is declared static, it will have internal linkage so each source file that uses this header will get a separate copy of the table!

> +        inline UChar32 foldCase(UChar32 ch)

These functions are pretty long. I don't think it makes sense to inline them.

> +        inline int umemcasecmp(const UChar* a, const UChar* b, int len)
> +        {
> +            GOwnPtr<char> utf8a;
> +            GOwnPtr<char> utf8b;
> +
> +            utf8a.set(g_utf16_to_utf8(a, len, 0, 0, 0));
> +            utf8b.set(g_utf16_to_utf8(b, len, 0, 0, 0));
> +
> +            GOwnPtr<char> foldedA;
> +            GOwnPtr<char> foldedB;
> +
> +            foldedA.set(g_utf8_casefold(utf8a.get(), -1));
> +            foldedB.set(g_utf8_casefold(utf8b.get(), -1));
> +
> +            int result = g_utf8_collate(foldedA.get(), foldedB.get());
> +
> +            return result;
> +        }

Converting to UTF-8 just to do the case folding is going to be very inefficient. This is going to result in a quite-slow GTK port if it's used anywhere.

In general we had to work hard to eliminate the conversion from UTF-16 and UTF-8 and back that used to be present in WebKit and JavaScriptCore in its core algorithms, and move that conversion to the external API. I think you need to do some performance tests if you're going to introduce memory allocation and UTF-8 conversion into these algorithms, unless it's OK to have the performance of the GTK port suffer.

> +#ifndef UnicodeGLibTypes_h
> +#define UnicodeGLibTypes_h
> +
> +typedef uint16_t UChar;
> +typedef int32_t UChar32;
> +
> +#endif

Seems a little excessive to have a separate header file for these, but it might be OK.

I'm going to say review- because of some of the issues with the case folding table and the performance of the UTF-8 conversion.
Comment 63 Dominik Röttsches (drott) 2009-01-16 07:46:07 PST
Created attachment 26793 [details]
1/4 - Moving WTF Unicode backend to GLib (v2)

(In reply to comment #62)

Darin, thanks for taking the time to review the patch. Here's another iteration that tries to address your comments. 

> (From update of attachment 26711 [details] [review])
> > Index: JavaScriptCore/wtf/unicode/glib/CasefoldTableFromGLib.h
> > ===================================================================
> > [...]
> 
> Is this file really needed?

I got rid of it at the price of a single-character ucs4->utf8->casefold->ucs4 conversion. See below regarding the conversion costs.

> > +} casefold_table[] = {
> This is not the normal naming for identifiers in WebKit. This would typically
> be inside the WTF::Unicode namespace and be named something like caseFoldTable
> or CaseFoldTable.

gone.

> > +#include "CasefoldTableFromGLib.h"
> Do we really need to compile in a copy of the table into each source file? 

gone as well.

> > +        inline UChar32 foldCase(UChar32 ch)
> 
> These functions are pretty long. I don't think it makes sense to inline them.

I disabled inlining for the longer ones and moved them to a new file UnicodeGLib.cpp

> > +        inline int umemcasecmp(const UChar* a, const UChar* b, int len)

> Converting to UTF-8 just to do the case folding is going to be very
> inefficient. This is going to result in a quite-slow GTK port if it's used
> anywhere.
> 
> In general we had to work hard to eliminate the conversion from UTF-16 and
> UTF-8 and back that used to be present in WebKit and JavaScriptCore in its core
> algorithms, and move that conversion to the external API. I think you need to
> do some performance tests if you're going to introduce memory allocation and
> UTF-8 conversion into these algorithms, unless it's OK to have the performance
> of the GTK port suffer.

Looking at the public GLib Unicode API (http://library.gnome.org/devel/glib/stable/glib-Unicode-Manipulation.html) it seems difficult to call the more interesting unicode functionality with utf-16 strings directly - casefolding, collation, normalization all require utf-8. Please see below for my my view regarding these costs.

> > +#ifndef UnicodeGLibTypes_h
> > +#define UnicodeGLibTypes_h
> > +
> > +typedef uint16_t UChar;
> > +typedef int32_t UChar32;
> > +
> > +#endif
> 
> Seems a little excessive to have a separate header file for these, but it might
> be OK.

Pulled that into UnicodeGLib.h

Also added a FIXME in umemcasecmp that discusses the discrepancy to the icu implementation.

> I'm going to say review- because of some of the issues with the case folding
> table and the performance of the UTF-8 conversion.

While I am aware it's probably not a very appropriate benchmark, for a first and very rough idea, I compared the ICU build against the GLib build running sunspider. You can see a perfomance regression of 2-3% in most of the string tests. I will attach these results.

My understanding is that this patch helps to reduce ROM footprint when deploying on an embedded platform by eliminating the ICU dependency, saving  approximately 10MB (as Alp reports). Currently, the glib backend is optional, icu would remain default. So in my opinion, the benefit of this patch is that it gives a choice to integrators between sacrificing a little performance in favor of the binary reduction or package ICU onto their targets. Future changes in GLib could lead to removing the utf-16 to utf-8 conversions eventually.
Comment 64 Dominik Röttsches (drott) 2009-01-16 07:49:42 PST
Created attachment 26794 [details]
Sunspider Results Comparing icu against GLib Unicode Backend
Comment 65 Mikkel Kruse Johnsen 2009-02-08 07:34:39 PST
When will the two missing patches be published ?

I need them to cross compile WebKit for windows (no ICU).

Would love to test them.
Comment 66 Dominik Röttsches (drott) 2009-02-09 01:20:40 PST
(In reply to comment #65)
> When will the two missing patches be published ?

As I mentioned in comment #59, the plan is to publish them as soon as 1 & 2 are landed. There is not really much new in patch 3 & 4 to come, the code is already available in patch "Replacing ICU with glib, IDN support" from 1st of December. It's just split into 4 parts now. HTH.
Comment 67 Mikkel Kruse Johnsen 2009-04-08 06:54:45 PDT
How do we deal with this error, what could replace these functions ?


../WebCore/editing/visible_units.cpp: In function 'int WebCore::firstNonComplexContextLineBreak(const UChar*, int)':
../WebCore/editing/visible_units.cpp:49: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope
../WebCore/editing/visible_units.cpp: In function 'int WebCore::lastNonComplexContextLineBreak(const UChar*, int)':
../WebCore/editing/visible_units.cpp:58: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope
../WebCore/editing/visible_units.cpp: In function 'WebCore::VisiblePosition WebCore::previousBoundary(const WebCore::VisiblePosition&, unsigned int (*)(const UChar*, unsigned int, unsigned int))':
../WebCore/editing/visible_units.cpp:89: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope
../WebCore/editing/visible_units.cpp: In function 'WebCore::VisiblePosition WebCore::nextBoundary(const WebCore::VisiblePosition&, unsigned int (*)(const UChar*, unsigned int, unsigned int))':
../WebCore/editing/visible_units.cpp:175: error: 'hasLineBreakingPropertyComplexContext' was not declared in this scope
make[1]: *** [WebCore/editing/libWebCore_la-visible_units.lo] Error 1
make[1]: Leaving directory `/home/mkj/projects/XCare/build/webkit-1.1.4/WebKitBuild'
make: *** [all] Error 2
Comment 68 Mikkel Kruse Johnsen 2009-04-09 04:48:57 PDT
Just added to "JavaScriptCore/wtf/unicode/glib/UnicodeGLib.h"

        inline bool hasLineBreakingPropertyComplexContext(UChar32 c)
        {
            // FIXME
            return false;
        }
Comment 69 Eric Seidel (no email) 2009-05-21 18:45:26 PDT
Comment on attachment 26712 [details]
2/4 - Moving Text Codecs to GLib

No comments have been made in over a month.  Marking this r- to remove it from the review queue.  Please email one of the gtk reviewers directly to get comments.

Feel free to post for re-review if you have had contact with a reviewer and believe this will be reviewed in a timely manner.
Comment 70 Eric Seidel (no email) 2009-05-21 18:45:39 PDT
Comment on attachment 26793 [details]
1/4 - Moving WTF Unicode backend to GLib (v2)

No comments have been made in over a month.  Marking this r- to remove it from the review queue.  Please email one of the gtk reviewers directly to get comments.

Feel free to post for re-review if you have had contact with a reviewer and believe this will be reviewed in a timely manner.
Comment 71 Maciej Stachowiak 2009-05-21 22:31:43 PDT
Comment on attachment 26712 [details]
2/4 - Moving Text Codecs to GLib

Reflagging for review; this still needs substantive review, ideally from a Gtk reviewer.
Comment 72 Maciej Stachowiak 2009-05-21 22:31:51 PDT
Comment on attachment 26793 [details]
1/4 - Moving WTF Unicode backend to GLib (v2)

Reflagging for review; this still needs substantive review, ideally from a Gtk reviewer.
Comment 73 Gustavo Noronha (kov) 2009-05-22 10:23:50 PDT
Comment on attachment 26712 [details]
2/4 - Moving Text Codecs to GLib

This seems to be an earlier attempt, that has been superseded by https://bugs.webkit.org/attachment.cgi?id=26793, right? Clearing flag.
Comment 74 Gustavo Noronha (kov) 2009-05-22 10:56:23 PDT
Comment on attachment 26793 [details]
1/4 - Moving WTF Unicode backend to GLib (v2)

> +        https://bugs.webkit.org/show_bug.cgi?id=15914
> +        [GTK] Implement Unicode functionality using GLib

I have gone through all the comments now, and have verified that most comments have been addressed. I think the best way to move forward is to land the patch in its current state, before it gets old enough that too much has changed, now that you have made it smaller and self-contained. I have given it another pass, and found only indentation problems, which I fixed (see UnicodeGLib.h for reference, so that you can get it right for the next patches), and two missing function implementations (including the one found by Mikkel), which I added. I also fixed header inclusion on top of UnicodeGLib.h.

I have tested that the ICU backend keeps working, and passing tests. This, along with the facts that icu keeps being the default, and that your patch touches little cross platform code makes me believe that we're ready for landing it now.
Comment 75 Gustavo Noronha (kov) 2009-05-22 11:08:58 PDT
Comment on attachment 26793 [details]
1/4 - Moving WTF Unicode backend to GLib (v2)

Landed as r44050.
Comment 76 Dominik Röttsches (drott) 2009-05-25 05:12:06 PDT
(In reply to comment #73)
> (From update of attachment 26712 [details] [review])
> This seems to be an earlier attempt, that has been superseded by
> https://bugs.webkit.org/attachment.cgi?id=26793, right? Clearing flag.

26712 belonged to an earlier set (26711 and 26712). I hope I will soon find the time to provide an updated version of patch 2/4 for the Text Codecs.

Thanks for landing!

Comment 77 Gustavo Noronha (kov) 2009-10-16 05:31:17 PDT
Ping. Do you plan to keep going on this one, or should we revert the original patch?
Comment 78 Christian Dywan 2009-10-16 10:35:04 PDT
I would love to see this progress, if it weren't for the bleeding edge Glib requirement making it unusable on my N900 :-/
Comment 79 Xan Lopez 2009-10-16 15:49:23 PDT
(In reply to comment #78)
> I would love to see this progress, if it weren't for the bleeding edge Glib
> requirement making it unusable on my N900 :-/

Upgrading the glib in the system should be trivial, I'm sure at worst there's a handful of patches applied to it.
Comment 80 Christian Dywan 2009-10-16 17:02:42 PDT
Emphasis on "should". Glib can't (easily) be replaced because several system applications explicitly require the version that is ships with.
Comment 81 Xan Lopez 2009-10-16 17:05:54 PDT
(In reply to comment #80)
> Emphasis on "should". Glib can't (easily) be replaced because several system
> applications explicitly require the version that is ships with.

Upgrade your glib and then just change the version number back to the previous one. Stupid workarounds for stupid applications (and good God, file bugs).
Comment 82 Dominik Röttsches (drott) 2009-10-19 01:58:40 PDT
(In reply to comment #77)
> Ping. Do you plan to keep going on this one, or should we revert the original
> patch?

Yes, I will provide updated patches until the middle of this week.
Comment 83 Dominik Röttsches (drott) 2009-10-19 11:24:06 PDT
Created attachment 41434 [details]
2/4 - Moving Text Codecs to GLib (v2)

New version of patch 2/4 - applies cleanly against r49769

Moving TextCodec to GLib and removing respective HYBRID macro usages.
Comment 84 Eric Seidel (no email) 2009-10-19 14:08:57 PDT
Doing more than one patch per bug, is generally a recipe for slow reviews.  In this case, I'm scared to review this because I don't have time to read all 83 comments before doing my review. :(
Comment 85 Dominik Röttsches (drott) 2009-10-19 14:48:26 PDT
(In reply to comment #84)
> Doing more than one patch per bug, is generally a recipe for slow reviews.  In
> this case, I'm scared to review this because I don't have time to read all 83
> comments before doing my review. :(

Well, in comment #51 Darin asked me to split it into smaller portions, that's why there's more than one patch. Maybe Gustavo can help reviewing this as he reviewed 1/4?
Comment 86 Gustavo Noronha (kov) 2009-10-25 11:54:12 PDT
Comment on attachment 41434 [details]
2/4 - Moving Text Codecs to GLib (v2)

> @@ -114,6 +117,20 @@ CString TextEncoding::encode(const UChar
>      QString str(reinterpret_cast<const QChar*>(characters), length);
>      str = str.normalized(QString::NormalizationForm_C);
>      return newTextCodec(*this)->encode(reinterpret_cast<const UChar *>(str.utf16()), str.length(), handling);
> +#elif USE(GLIB_UNICODE)
> +    GOwnPtr<char> utf8source;
> +    utf8source.set(g_utf16_to_utf8(characters, length, 0, 0, 0));
> +
> +    GOwnPtr<char> utf8normalized;
> +    utf8normalized.set(g_utf8_normalize(utf8source.get(), -1, G_NORMALIZE_NFC));
> +
> +    long utf16length;
> +    GOwnPtr<UChar> utf16normalized;
> +    utf16normalized.set(g_utf8_to_utf16(utf8normalized.get(), -1, 0, &utf16length, 0));

Naming of the variables is a bit off our style. UTF8Normalized is more like it, for instance.

> +    CString result = newTextCodec(*this)->encode(utf16normalized.get(), utf16length, handling);
> +
> +    return result;

No need for a variable, just return the newTextCodec call directly.

> +#include "gtk/TextCodecGtk.h"
> +#endif
>  #if PLATFORM(WINCE) && !PLATFORM(QT)
>  #include "TextCodecWince.h"
>  #endif
>  
> +
>  using namespace WTF;

Why this space?

> +namespace WebCore {
> +
> +
> +

Too many blank lines to my taste =).

> +// list of text encodings and their aliases
> +// the first entry/"alias" is the canonical name, each alias list must be terminated by a 0

Comments should in general be full sentences, starting with a capital letter, and ending in a period.

> +// Unicode
> +TextCodecGtk::codecAliasList TextCodecGtk::m_codecAliases_UTF_8            = { "UTF-8", 0}; 

You're missing a space before } on all of these. I don't think we have a style rule for this, but it looks inconsistent with the fact that you have one at the start =).

> +static PassOwnPtr<TextCodec> newTextCodecGtk(const TextEncoding& encoding, const void*)
> +{
> +#ifndef NDEBUG
> +    // too noisy if each UTF-8 codec creation is logged
> +    // if (strcmp(encoding.name(), "UTF-8"))
> +    //     LOG(TextConversion, "creating new text codec for encoding name %s", encoding.name());
> +#endif

This is dead code. Either let it log stuff, or remove the code altogether, please. I think it may make sense to log in a debug build, but up to you.

> +    for (unsigned int i = 0; i < listLength; ++i) {
> +        codecAliasList *codecAliases = static_cast<codecAliasList*>(encodingList[i]);
> +        int codecCount = -1;
> +        const gchar *canonicalName, *currentAlias;
> +        bool canonicalAvailable = true;
> +        
> +        // for each of the alias lists:
> +        // bail out if canonical is not available,
> +        // then continue testing each alias, 
> +        // if available, add it to the list, mapping it to its canonical name
> +
> +        while ((currentAlias = (*codecAliases)[++codecCount]) && canonicalAvailable) {
> +            bool currentAvailable = isEncodingAvailable(currentAlias);
> +
> +            if (codecCount == 0 && currentAvailable) {
> +                canonicalName = currentAlias;
> +                canonicalAvailable = true;
> +                LOG_VERBOSE(TextConversion, "registering encoding canonical %s", canonicalName);
> +                registrar(canonicalName, canonicalName);
> +            } else if (codecCount == 0 && !currentAvailable)
> +                canonicalAvailable = false;

This feels a bit convoluted. Have you considered moving these two condition checks before the while loop? There's no need to check them on every loop run, since they are only useful in the first one.

> +TextCodecGtk::~TextCodecGtk()
> +{
> +    releaseIConv();
> +}
> +
> +void TextCodecGtk::releaseIConv() const
> +{
> +    if (m_iconvDecoder != reinterpret_cast<GIConv>(-1)) {
> +        g_iconv_close(m_iconvDecoder);
> +        m_iconvDecoder = reinterpret_cast<GIConv>(-1);
> +    }
> +    if (m_iconvEncoder != reinterpret_cast<GIConv>(-1)) {
> +        g_iconv_close(m_iconvEncoder);
> +        m_iconvEncoder = reinterpret_cast<GIConv>(-1);
> +    }
> +}

Why is this a separate function? Do you call releaseIConv anywhere else? Why not put this code inside the destructor, directly?

> +    if (err) {
> +        LOG_ERROR("GIConv conversion error");

Would it be useful to log the error message here?

> +    CString result = CString(buffer.get(), count);
> +
> +    return result;

No need for a variable here.
> +#ifndef TextCodecGTK_h
> +#define TextCodecGTK_h
> +
> +#include <glib.h>
> +
> +#include "TextCodec.h"
> +#include "TextEncoding.h"

No need for that blank line between the includes.

Another round! =)
Comment 87 Dominik Röttsches (drott) 2009-10-26 07:39:01 PDT
Created attachment 41866 [details]
2/4 - Moving Text Codecs to GLib (v3)

(In reply to comment #86)

Thanks a lot for your review, Gustavo!

> (From update of attachment 41434 [details])
> > @@ -114,6 +117,20 @@ CString TextEncoding::encode(const UChar

> > [...]
> > +    GOwnPtr<char> utf8source;
> > [...]
> > +    GOwnPtr<char> utf8normalized;
> > [...]
> > +    GOwnPtr<UChar> utf16normalized;
> 
> Naming of the variables is a bit off our style. UTF8Normalized is more like it,
> for instance.

If I understood the style guide correctly, it should be CamelCase starting with a lowercase letter for variables, also lowercasing acronyms. so I changed it to utf8Source, utf8Normalized etc. Hope that's fine now. 

> > +    CString result = newTextCodec(*this)->encode(utf16normalized.get(), utf16length, handling);
> > +
> > +    return result;
> 
> No need for a variable, just return the newTextCodec call directly.

Done.

> > +#include "gtk/TextCodecGtk.h"
> > +#endif
> >  #if PLATFORM(WINCE) && !PLATFORM(QT)
> >  #include "TextCodecWince.h"
> >  #endif
> >  
> > +
> >  using namespace WTF;
> 
> Why this space?

Removed.

> > +namespace WebCore {
> > +
> > +
> > +
> 
> Too many blank lines to my taste =).

Removed.

> > +// list of text encodings and their aliases
> > +// the first entry/"alias" is the canonical name, each alias list must be terminated by a 0
> 
> Comments should in general be full sentences, starting with a capital letter,
> and ending in a period.

Rephrased this comment as well as the Changelog a little bit.

> > +// Unicode
> > +TextCodecGtk::codecAliasList TextCodecGtk::m_codecAliases_UTF_8            = { "UTF-8", 0}; 
> 
> You're missing a space before } on all of these. I don't think we have a style
> rule for this, but it looks inconsistent with the fact that you have one at the
> start =).

Changed to match.

> > +static PassOwnPtr<TextCodec> newTextCodecGtk(const TextEncoding& encoding, const void*)
> > +{
> > +#ifndef NDEBUG
> > +    // too noisy if each UTF-8 codec creation is logged
> > +    // if (strcmp(encoding.name(), "UTF-8"))
> > +    //     LOG(TextConversion, "creating new text codec for encoding name %s", encoding.name());
> > +#endif
> 
> This is dead code. Either let it log stuff, or remove the code altogether,
> please. I think it may make sense to log in a debug build, but up to you.

Removed the log statement here and as well as in isEncodingAvailable. Replaced those by more accurate logging in registerEncodingNames.

> > +    for (unsigned int i = 0; i < listLength; ++i) {
> > +        codecAliasList *codecAliases = static_cast<codecAliasList*>(encodingList[i]);
> > +        int codecCount = -1;
> > +        const gchar *canonicalName, *currentAlias;
> > +        bool canonicalAvailable = true;
> > +        
> > +        // for each of the alias lists:
> > +        // bail out if canonical is not available,
> > +        // then continue testing each alias, 
> > +        // if available, add it to the list, mapping it to its canonical name
> > +
> > +        while ((currentAlias = (*codecAliases)[++codecCount]) && canonicalAvailable) {
> > +            bool currentAvailable = isEncodingAvailable(currentAlias);
> > +
> > +            if (codecCount == 0 && currentAvailable) {
> > +                canonicalName = currentAlias;
> > +                canonicalAvailable = true;
> > +                LOG_VERBOSE(TextConversion, "registering encoding canonical %s", canonicalName);
> > +                registrar(canonicalName, canonicalName);
> > +            } else if (codecCount == 0 && !currentAvailable)
> > +                canonicalAvailable = false;
> 
> This feels a bit convoluted. Have you considered moving these two condition
> checks before the while loop? There's no need to check them on every loop run,
> since they are only useful in the first one.

Yes, thanks for the remark. I tried to unclutter this and indeed it should be much easier to read now.

> > +TextCodecGtk::~TextCodecGtk()
> > +{
> > +    releaseIConv();
> > +}
> > +
> > +void TextCodecGtk::releaseIConv() const
> > +{
> > +    if (m_iconvDecoder != reinterpret_cast<GIConv>(-1)) {
> > +        g_iconv_close(m_iconvDecoder);
> > +        m_iconvDecoder = reinterpret_cast<GIConv>(-1);
> > +    }
> > +    if (m_iconvEncoder != reinterpret_cast<GIConv>(-1)) {
> > +        g_iconv_close(m_iconvEncoder);
> > +        m_iconvEncoder = reinterpret_cast<GIConv>(-1);
> > +    }
> > +}
> 
> Why is this a separate function? Do you call releaseIConv anywhere else? Why
> not put this code inside the destructor, directly?

"Inlined".

> > +    if (err) {
> > +        LOG_ERROR("GIConv conversion error");
> 
> Would it be useful to log the error message here?

Printing code and error message now in encode() and decode().

> > +    CString result = CString(buffer.get(), count);
> > +
> > +    return result;
> 
> No need for a variable here.

Removed.


> > +#ifndef TextCodecGTK_h
> > +#define TextCodecGTK_h
> > +
> > +#include <glib.h>
> > +
> > +#include "TextCodec.h"
> > +#include "TextEncoding.h"
> 
> No need for that blank line between the includes.

Removed.

In addition, due to the log error output I changed the canonical name for Mac Central Europe Encoding from MACCENTRALEUROPE to MAC-CENTRALEUROPE and EUC_JP and EUC_CN to EUC-JP and EUC-CN, so that they work with GLib/iconv.

> Another round! =)

Thanks again for your review.
Comment 88 Gustavo Noronha (kov) 2009-10-28 06:20:55 PDT
Comment on attachment 41866 [details]
2/4 - Moving Text Codecs to GLib (v3)

> +#elif USE(GLIB_UNICODE)
> +    GOwnPtr<char> utf8Source;
> +    utf8Source.set(g_utf16_to_utf8(characters, length, 0, 0, 0));
> +
> +    GOwnPtr<char> utf8Normalized;
> +    utf8Normalized.set(g_utf8_normalize(utf8Source.get(), -1, G_NORMALIZE_NFC));
> +
> +    long utf16Length;
> +    GOwnPtr<UChar> utf16Normalized;
> +    utf16Normalized.set(g_utf8_to_utf16(utf8Normalized.get(), -1, 0, &utf16Length, 0));
> +
> +    return newTextCodec(*this)->encode(utf16Normalized.get(), utf16Length, handling);


Regarding this, there's an exception to the camel case rule for acronyms, and indeed all usages of "utf" in this file seem to be upper-cased, so I think we should strive for consistency. This seems to be the only problem I am able to spot by now. Sorry for being nit-picky, but please upload a patch with this fixed so that we can use the commit queue =). Thanks!
Comment 89 Dominik Röttsches (drott) 2009-10-30 06:35:18 PDT
Created attachment 42210 [details]
2/4 - Moving Text Codecs to GLib (v4)

(In reply to comment #88)

> Regarding this, there's an exception to the camel case rule for acronyms, and
> indeed all usages of "utf" in this file seem to be upper-cased, so I think we
> should strive for consistency. This seems to be the only problem I am able to
> spot by now. Sorry for being nit-picky, but please upload a patch with this
> fixed so that we can use the commit queue =). Thanks!

Sure, here it is. Thanks for reviewing.
Comment 90 Gustavo Noronha (kov) 2009-11-03 14:30:14 PST
Comment on attachment 42210 [details]
2/4 - Moving Text Codecs to GLib (v4)

Thanks.
Comment 91 WebKit Commit Bot 2009-11-03 14:59:11 PST
Comment on attachment 42210 [details]
2/4 - Moving Text Codecs to GLib (v4)

Rejecting patch 42210 from commit-queue.

Failed to run "['/Users/eseidel/Projects/CommitQueue/WebKitTools/Scripts/svn-apply', '--reviewer', 'Gustavo Noronha', '--force']" exit_code: 1
Last 500 characters of output:
stry.cpp
|===================================================================
|--- wkTrunkGlib_2_4_Codec/WebCore/platform/text/TextEncodingRegistry.cpp	(revision 50327)
|+++ wkTrunkGlib_2_4_Codec/WebCore/platform/text/TextEncodingRegistry.cpp	(working copy)
--------------------------
No file to patch.  Skipping patch.
5 out of 5 hunks ignored
patching file wkTrunkGlib_2_4_Codec/WebCore/platform/text/gtk/TextCodecGtk.cpp
patching file wkTrunkGlib_2_4_Codec/WebCore/platform/text/gtk/TextCodecGtk.h
Comment 92 Eric Seidel (no email) 2009-11-03 15:01:01 PST
Looks like the patch is rooted incorrectly.  It's possible that svn-create-patch crawled up to the wrong repository root. If you believe that it's a problem with svn-create-patch, please file a bug. :)
Comment 93 Dominik Röttsches (drott) 2009-11-04 01:33:25 PST
Created attachment 42465 [details]
2/4 - Moving Text Codecs to GLib (v5)

(In reply to comment #92)

> Looks like the patch is rooted incorrectly.  It's possible that
> svn-create-patch crawled up to the wrong repository root. If you believe that
> it's a problem with svn-create-patch, please file a bug. :)

Argh, yes, there was an orphaned .svn folder one level up, looks like svn-create-patch only stopped at this one. Here's a correctly rooted version of the patch, otherwise no changes.
Comment 94 Eric Seidel (no email) 2009-11-04 01:41:16 PST
Comment on attachment 42465 [details]
2/4 - Moving Text Codecs to GLib (v5)

If you believe svn-create-patch to have been in error in its handling of this patch creation please do file a bug. :)  But yes this patch looks good.  Since all patches require both an r+ and a cq+ for the commit-queue to land, I'm adding an r+ here too, mostly based on gns's previous review.
Comment 95 WebKit Commit Bot 2009-11-04 02:28:11 PST
Comment on attachment 42465 [details]
2/4 - Moving Text Codecs to GLib (v5)

Clearing flags on attachment: 42465

Committed r50508: <http://trac.webkit.org/changeset/50508>
Comment 96 WebKit Commit Bot 2009-11-04 02:28:30 PST
All reviewed patches have been landed.  Closing bug.
Comment 97 Christian Dywan 2009-11-05 20:00:57 PST
Patches 3 and 4 are still pending, hence re-opening.
Comment 98 Eric Seidel (no email) 2009-11-06 11:30:45 PST
Comment on attachment 42210 [details]
2/4 - Moving Text Codecs to GLib (v4)

Clearing r+ on this obsolete patch.
Comment 99 Eric Seidel (no email) 2009-11-06 11:31:26 PST
(In reply to comment #97)
> Patches 3 and 4 are still pending, hence re-opening.

They're not marked for review?  Should they be?

I really think we should close this bug and start a new ones for any remaining changes.
Comment 100 Dominik Röttsches (drott) 2009-11-13 05:45:52 PST
(In reply to comment #99)
> (In reply to comment #97)
> > Patches 3 and 4 are still pending, hence re-opening.
> 
> They're not marked for review?  Should they be?
No, nothing to be marked for review here. 

> I really think we should close this bug and start a new ones for any remaining
> changes.

I agree, I'm creating those and adding them as dependencies. So, there won't be any new patches on this one.
Comment 101 Dominik Röttsches (drott) 2010-01-28 01:21:30 PST
The patch for sub-bug 31470 was landed, so this one here can be closed. I don't seem to have the rights for it, could somebody help me?

I created a new issue for streamlining the differences in behavior between ICU and GLib unicode backend: bug 34247.
Comment 102 Gustavo Noronha (kov) 2010-01-28 05:14:26 PST
there you go, thanks!