Bug 60786

Summary: [Qt] Better complex language support: Switch to ICU library (libicu)
Product: WebKit Reporter: Siddharth Mathur <s.mathur>
Component: TextAssignee: Siddharth Mathur <s.mathur>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, benjamin, commit-queue, cshu, darin, hausmann, jesus, kbalazs, kenneth, kling, laszlo.gombos, ossy, webkit.review.bot, yael
Priority: P2 Keywords: LayoutTestFailure, Qt, QtTriaged
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 50245, 16981, 39852, 59187, 76821    
Attachments:
Description Flags
Experimental patch for others to spin a build
none
updated patch
none
Better patch
none
Patch
none
ready for review
none
Rebaselined patch
none
the perfect patch (pun intended)
benjamin: review-
Benchmark test to time layout
none
review comments incorporated none

Description Siddharth Mathur 2011-05-13 12:25:36 PDT
As Qt WebKit based applications and browsers reach more corners of the globe (which speak "complex"/non-European scripts), 
the large number of text encoding and international domain names (IDN) related failing tests in Qt's Skipped list are painful to watch. 

I did a quick switch from QT4_UNICODE -> ICU_UNICODE and turned on CONFIG+=text_breaking_with_icu to maximize use of the ICU library, and the following list of Skipped tests ..

fast/encoding/mailto-always-utf-8.html
fast/encoding/yentest.html
fast/encoding/yentest2.html
fast/encoding/char-encoding.html
fast/encoding/idn-security.html
fast/encoding/xmacroman-encoding-test.html
fast/encoding/GBK/EUC-CN.html
fast/encoding/GBK/chinese.html
fast/encoding/GBK/cn-gb.html
fast/encoding/GBK/csgb2312.html
fast/encoding/GBK/csgb231280.html
fast/encoding/GBK/gb2312.html
fast/encoding/GBK/gb_2312-80.html
fast/encoding/GBK/gbk.html
fast/encoding/GBK/iso-ir-58.html
fast/encoding/GBK/x-euc-cn.html
fast/encoding/GBK/x-gbk.html
fast/encoding/char-encoding-mac.html
fast/encoding/hebrew/8859-8-e.html
fast/encoding/hebrew/8859-8-i.html
fast/encoding/hebrew/logical.html
fast/encoding/char-decoding.html
fast/encoding/frame-default-enc.html
fast/encoding/invalid-xml.html
fast/encoding/char-decoding-mac.html
fast/encoding/hebrew/csISO88598I.html
fast/encoding/denormalised-voiced-japanese-chars.html
fast/encoding/invalid-UTF-8.html


... shrinks to: 


fast/encoding/denormalised-voiced-japanese-chars.html	expected	actual	diff	pretty diff
fast/encoding/idn-security.html	expected	actual	diff	pretty diff
fast/encoding/invalid-xml.html	expected	actual	diff	pretty diff
fast/encoding/mailto-always-utf-8.html

I am also adding some bugs in the "Blocks" list which I think are solvable by this migration. Some may be dups of the above test failures.
Comment 1 Chang Shu 2011-05-13 12:28:41 PDT
This should also help some editing tests.
Comment 2 Siddharth Mathur 2011-05-13 12:47:10 PDT
Created attachment 93495 [details]
Experimental patch for others to spin a build
Comment 3 Benjamin Poulain 2011-05-15 07:23:52 PDT
Uh. You might want to talk with the release guys...Because last few dozen of times this was suggested,  shipping ICU with Qt was totally out of the question.
Comment 4 Siddharth Mathur 2011-05-23 13:16:20 PDT
Created attachment 94470 [details]
updated patch

This experimental patch: 
- removes the "use ICU for text breaking only" option
- consolidates all Qt WebKit use of ICU behind the existing existing USE(ICU_UNICODE) flag
- adds a "use_system_icu" qmake CONFIG parameter to override default text handling (Qt's Unicode) 
- adds a new TextBreakIteratorInternalICUQt.cpp similar to other ports which use ICU (and move out a function from TextBreakIteratorQt.cpp) 
- guards TextCodecQt.cpp to be only compiled for USE(QT4_UNICODE) configuration

Pass CONFIG+=use_system_icu to build-webkit to use this.
Comment 5 Siddharth Mathur 2011-05-25 11:40:57 PDT
Created attachment 94818 [details]
Better patch
Comment 6 Siddharth Mathur 2011-05-25 13:58:17 PDT
Created attachment 94851 [details]
Patch
Comment 7 Siddharth Mathur 2011-05-26 07:47:39 PDT
Created attachment 94971 [details]
ready for review

Please note that the default is still the "use Qt's unicode" . I am not proposing to change that in this patch.
Comment 8 Siddharth Mathur 2011-05-26 09:10:02 PDT
Created attachment 94985 [details]
Rebaselined patch
Comment 9 WebKit Review Bot 2011-05-26 09:12:29 PDT
Attachment 94985 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/JavaScriptCore/ChangeLog', u'Source..." exit_code: 1

Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:39:  Extra space between return and cachedSystemLocale  [whitespace/declaration] [3]
Total errors found: 1 in 10 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 10 Siddharth Mathur 2011-05-26 09:42:18 PDT
Created attachment 94990 [details]
the perfect patch (pun intended)
Comment 11 Siddharth Mathur 2011-05-26 13:48:49 PDT
Created attachment 95030 [details]
Benchmark test to time layout

Also attached is a speed test I wrote for timing layout of a Russian and a Hindi all-text page. When run 100 times (tst_painting complexScript -iterations 100), the benchmark shows almost no timing differences between the QT4_UNICODE and ICU_UNICODE builds of QtWebKit. 

Here is what is being timed. Is this correct? 

QBENCHMARK {
        QWebView* view = new QWebView;
        QWebFrame* frame = view->page()->mainFrame();
        frame->setContent(data);
        // Layout if one is required.
        frame->toPlainText();
        delete view;
}
Comment 12 Benjamin Poulain 2011-05-29 08:16:47 PDT
Comment on attachment 94990 [details]
the perfect patch (pun intended) 

View in context: https://bugs.webkit.org/attachment.cgi?id=94990&action=review

> Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:30
> +Q_GLOBAL_STATIC_WITH_INITIALIZER(QByteArray, cachedSystemLocale, {*x = QLocale::system().name().toAscii();})

toAscii() is almost never correct in library context. It is not correct here.
The function return the string encoded with the default encoding, not the ascii encoding. You should use toLatin1() like the original code.

This caching is a change of behavior since we no longer use the system local when it changes. So I become curious: how important is the gain?

> Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:39
> +    return cachedSystemLocale()->data(); 
> +}
> +
> +const char* currentTextBreakLocaleID()
> +{
> +    return cachedSystemLocale()->data(); 

You should use constData, otherwise the implementation have to detach() if needed.
Comment 13 Siddharth Mathur 2011-05-31 15:08:11 PDT
(In reply to comment #12)
> (From update of attachment 94990 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=94990&action=review
> 
> > Source/WebCore/platform/text/qt/TextBreakIteratorInternalICUQt.cpp:30
> > +Q_GLOBAL_STATIC_WITH_INITIALIZER(QByteArray, cachedSystemLocale, {*x = QLocale::system().name().toAscii();})
> 
> toAscii() is almost never correct in library context. It is not correct here.
> The function return the string encoded with the default encoding, not the ascii encoding. You should use toLatin1() like the original code.
> 

Thanks for the review. I went back to Latin1() in upcoming patch. 

> This caching is a change of behavior since we no longer use the system local when it changes. So I become curious: how important is the gain?

I am not shooting for any gains here at all. I noted that the old code passes a pointer managed by a transient stack object to the caller, hence causing a invalid read in my Valgrind tests, so I changed it to use a global static so that a valid memory location is always passed around. And Q_GLOBAL_STATIC_WITH_INITIALIZER() provides a clean way to do that. 

> > +
> > +const char* currentTextBreakLocaleID()
> > +{
> > +    return cachedSystemLocale()->data(); 
> 
> You should use constData, otherwise the implementation have to detach() if needed.

Fixed to use constData() in the upcoming patch. Thanks!
Comment 14 Siddharth Mathur 2011-05-31 15:09:02 PDT
Created attachment 95491 [details]
review comments incorporated
Comment 15 Simon Hausmann 2011-05-31 15:34:19 PDT
I can see the motivation of using ICU, but I'd like to raise the point that this introduces fragmentation between different platforms on what I would call rather basic behaviour.

Note that in the past we have done changes in Qt to move it closer to ICU behavior, for example commit e6ac173991223dbf3b1b6f7213550ebca4608cb6 in Qt 4.7.
I'm pretty sure that Lars would welcome similar changes in Qt to match the behaviour, given WebKit's dominant role in content rendering these days and our desire to make Qt applications and QtWebKit match other browsers where it makes sense.

I don't think we should blindly switch without knowing the exact impact on behaviour and how we can maximize the deployment/availability of ICU for people using QtWebKit.

Imagine bug reports coming in about line breaking and us having to guess if QtWebKit was built with or without ICU support. I think that can become a maintenance burden.
Comment 16 Simon Hausmann 2011-05-31 15:35:44 PDT
On the topic of locales, please also see recent improvements in QLocale. Perhaps they will have a positive impact on the layout test results (I see locale specific stuff in the patch)

http://labs.qt.nokia.com/2011/01/03/qlocale-its-about-time-and-dates-and-languages-and/
Comment 17 Simon Hausmann 2011-05-31 15:42:17 PDT
(In reply to comment #16)
> On the topic of locales, please also see recent improvements in QLocale. Perhaps they will have a positive impact on the layout test results (I see locale specific stuff in the patch)
> 
> http://labs.qt.nokia.com/2011/01/03/qlocale-its-about-time-and-dates-and-languages-and/

Sorry, that link wasn't correct. However there has been a fair amount of work in QLocale in Qt 4.8, including ICU support for collation/toLower/toUpper.

It would be good to know the exact shortcomings in Qt that explain the individual test failures we see.
Comment 18 Siddharth Mathur 2011-06-01 10:09:36 PDT
Simon, thanks for your comments :)

A minor note that the above patch does _not_ switch the default build option Qt WebKit, but provides the ability to do so with an CONFIG flag. 

I agree that as far as Qt WebKit as a portable framework is concerned, I would personally consider it a nightmare if I were to triage and fix lots of bug reports from multiple OSes using different configurations. However, my admittedly limited experience with Qt WebKit bug inflow suggests that only a small number of bugs originate from Windows or Linux QtWebKit API users who are _also_ looking for functional exactness between OSes. For example, a KDE developer may wish for quality on desktop Linux with standards support on par with Chrome, Safari etc. I don't think think he/she compares a QtWebKit Windows build with a QtWebKit Linux build? My understanding is that both desktop Linux and Windows users of Qt and Qt WebKit ship their own shared objects with custom bug fixes applied and build flags toggled, so they accept the cost of being "on their own". Nokia's mobile end-users and developers follow a different trajectory as Nokia internally absorbs the majority of costs of maturing Meego or Symbian QtWebKit binaries that are pre-loaded on devices. To simplify matters, Symbian is certainly becoming a less important platform for future trunk deliveries. 

On the other hands, the following costs _are_ significant IMHO: 
- cost to current Qt WebKit developers of maintaining a list of dozens of Skipped tests for basic functionality. 
- cost to Apple, Google and other non Qt contributors of keeping Qt bots happy
- cost to Nokia of shipping frameworks or browsers in 2011 that can't render Thai, Arabic, Hebrew or Hindi correctly. 
- cost to mankind of using a web that doesn't speak their language correctly. 

More specifically to your points: 
a) Is this the Qt tracker bug for QLocale and collation work that you were looking for? http://bugreports.qt.nokia.com/browse/QTBUG-16178
(collation isn't done yet, and lot of the other changes aren't important for the failures in our Skipped list) 

b) The IDN related failures (idn-security.html) are due to ICANN specified blacklisted glyphs and characters which cannot be in IDN domain names to prevent confusion or phishing attacks. This is already in ICU, but missing in Qurl::toAce(). Note this IDN/QUrl problems are separate from QtCore's Unicode and Locale related issues, but a (lazy man's) solution happens to be the same. 

I haven't done a root cause analysis of the dozens of failures in the current configuration, but I thought that the presence of objective data from peer-reviewed Layout Tests cases and long-open bugs would help.
Comment 19 Benjamin Poulain 2011-06-03 07:13:21 PDT
Comment on attachment 95491 [details]
review comments incorporated

Ok, let's give this a try.
Could you look on the wiki if this flag can be documented anywhere?
Comment 20 WebKit Commit Bot 2011-06-03 07:47:16 PDT
Comment on attachment 95491 [details]
review comments incorporated

Clearing flags on attachment: 95491

Committed r88016: <http://trac.webkit.org/changeset/88016>
Comment 21 WebKit Commit Bot 2011-06-03 07:47:24 PDT
All reviewed patches have been landed.  Closing bug.
Comment 22 Laszlo Gombos 2011-06-27 13:31:05 PDT
Re-purpose this bug to track the progress of switch to ICU and to track the impacted LayoutTests.

The following tests are also passing with ICU turned on:
sputnik/Unicode/Unicode_510/S15.5.4.16_A1.html
sputnik/Unicode/Unicode_510/S15.5.4.18_A1.html
sputnik/Unicode/Unicode_510/S7.6_A1.1_T1.html
sputnik/Unicode/Unicode_510/S7.6_A1.1_T2.html
sputnik/Unicode/Unicode_510/S7.6_A1.1_T4.html
sputnik/Unicode/Unicode_510/S7.6_A2.2_T1.html
sputnik/Unicode/Unicode_510/S7.6_A2.2_T2.html
sputnik/Unicode/Unicode_510/S7.6_A2.3.html

sputnik/Unicode/Unicode_510/S7.6_A5.2_T1.html
sputnik/Unicode/Unicode_510/S7.6_A5.2_T2.html
sputnik/Unicode/Unicode_510/S7.6_A5.2_T4.html
sputnik/Unicode/Unicode_510/S7.6_A5.2_T7.html
sputnik/Unicode/Unicode_510/S7.6_A5.2_T8.html
sputnik/Unicode/Unicode_510/S7.6_A5.2_T9.html
Comment 23 Jesus Sanchez-Palencia 2012-01-26 09:32:48 PST
(In reply to comment #22)
> Re-purpose this bug to track the progress of switch to ICU and to track the impacted LayoutTests.


Closing this bug after http://trac.webkit.org/changeset/105997 .