Bug 16981

Summary: Thai word break support for platforms which don't use ICU
Product: WebKit Reporter: Pattara Kiatisevi <pattara>
Component: Layout and RenderingAssignee: Nobody <webkit-unassigned>
Status: RESOLVED WONTFIX    
Severity: Normal CC: alp, ap, hausmann, lars.knoll, mitz
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: Linux   
URL: http://linux.thai.net/viewvc/viewvc.cgi/software/webkit/webkit-0svn25144/webkit_wordbreak.diff?revision=1.4&view=markup
Bug Depends on: 60786    
Bug Blocks:    
Attachments:
Description Flags
Patch to enable Thai wordbreaking in WebKit eric: review-

Description Pattara Kiatisevi 2008-01-22 21:02:56 PST
In Thai, a sentence is composed of several words without space in between like in English. Therefore it is neccessary for the software to know word boundary locations so that it can wrap the line properly (i.e., wrap between words, not in the middle of the word).

This patch adds Thai word break support for WebKit using libthai library (http://sourceforge.net/projects/libthai). The algorithm/code is very similar to the one in KDE's KHTML.

As a Thai, I think it's best to link WebKit with libthai since the compile time so that webkit will always come with full Thai support, but I understand the rest of the world who don't read Thai might not want to install libthai library when they don't really use it. Therefore the patch checks to load libthai dynamically (using dlopen) if libthai is not linked.

http://linux.thai.net/viewvc/viewvc.cgi/software/webkit/webkit-0svn25144/webkit_wordbreak.diff?revision=1.4&view=markup
Comment 1 Pattara Kiatisevi 2008-01-22 21:04:32 PST
Created attachment 18616 [details]
Patch to enable Thai wordbreaking in WebKit

Tested with Ubuntu libqtwebkit0d package version 0~svn27674-1~gutsy1
Comment 2 Alexey Proskuryakov 2008-01-23 00:16:09 PST
I'm not sure about the scope of this bug. Does this work on Mac/Windows WebKit already, thanks to ICU? If not, could you please provide a test case showing the problems?
Comment 3 mitz 2008-01-23 00:26:35 PST
(In reply to comment #2)
> I'm not sure about the scope of this bug. Does this work on Mac/Windows WebKit
> already, thanks to ICU?

On Mac OS X WebKit uses UCCreateTextBreakLocator precisely because the version of ICU that ships on the system does not support Thai (see bug 12175).
Comment 4 Pattara Kiatisevi 2008-01-23 03:44:22 PST
Test page:

http://www.mm.co.th/~pattara/thai/wbrtest-short.html

In a working system, sentences should be wrapped at the boundaries of words, not in between.

But I think in MAC and Windows, if it doesn't work already, the solution should be different as those systems already have API for Thai word breaking, e.g., ATSUI, UniScribe, we should just use them, no need to use libthai)?
Comment 5 Alexey Proskuryakov 2008-01-23 04:15:31 PST
To have this patch reviewed, please mark it review? by clicking Edit link to the right of the patch.

There are a number of coding style issues that would definitely make it rejected though, so you may want to read <http://webkit.org/coding/contributing.html>. It's also OK to request review if you'd like to get feedback on the substance of the patch first, and fix style later.

Sounds like this fix will be very important for platforms that can not use existing Thai support from ICU - thank you very much for working on it!
Comment 6 Pattara Kiatisevi 2008-01-23 10:12:56 PST
I think ICU is another possible option.

I saw <a href=http://bugs.webkit.org/show_bug.cgi?id=4628>Bug 4628</a>. We could add a check -- if LINUX then use ICU for Thai word breaking instead of Carbon?

But then we need an explicit dependency between ICU and WebKit (in Ubuntu where I tested there seems none though). The above patch will look for libthai and use it dynamicially if exists, without forcing users to install it. For Thais, compile-time linking is of course better but for the rest non-Thai-speaking users, this dynamic method may save their bandwidth and disk space. 
Comment 7 Dave Hyatt 2008-01-23 12:17:37 PST
The patch mentions "libkhtml."  Was this code lifted from KHTML and moved into WebKit?  If so, please make sure to include the appropriate copyright attributions at the top of break_lines.cpp.

Comment 8 Pattara Kiatisevi 2008-01-23 19:00:46 PST
It was lifted from KHTML and it was us who submitted to KHTML. Anyway, will add attributions as you suggested.
Comment 9 Eric Seidel (no email) 2008-01-26 14:18:22 PST
Comment on attachment 18616 [details]
Patch to enable Thai wordbreaking in WebKit

First, let me thank you VERY MUCH for taking the time to contribute to WebKit.  We strongly value new contributions, and I would like to encourage you to contribute more in the future.

That said, we're not going to take this patch.  More information below:


Ok, so it looks like ICU does support Thai breaking.  Safari 3.0.4 on Mac OS X already gets this test case correct:
http://www.mm.co.th/~pattara/thai/wbrtest-short.html
(as far as I can tell)

So this bug only applies to platforms which are not able to use ICU.  Both WebKitWinApple and WebKitMac use ICU.  I'm not certain about WebKitGtk or WebKitQt.

In order for this bug to move forward, we would need an example of a platform combination under active development which did not support proper Thai word breaks.

In that case, we would need to re-write this patch to apply to the TextIterator platform abstraction model (which looks to be possible.

This patch has a raft of style issues, as previously mentioned, as well as memory leaks (the entire thai cached leaks).

I'm marking this r-, and I recommend with close the bug as "INVALID" unless we can come up with a paltform combination which doesn't already have thai word break support.
Comment 10 Eric Seidel (no email) 2008-01-26 14:22:00 PST
I meant "TextBreakIterator" when I said TextIterator.

The files in question can be found under:
WebCore/platform/text
Comment 11 mitz 2008-01-26 14:26:38 PST
(In reply to comment #9)
> Ok, so it looks like ICU does support Thai breaking.  Safari 3.0.4 on Mac OS X
> already gets this test case correct:
> http://www.mm.co.th/~pattara/thai/wbrtest-short.html
> (as far as I can tell)

The Mac OS X port of WebKit does not use ICU for word-breaking because the version of ICU that ships with Tiger does not support Thai. Instead, WebKit uses Carbon's Unicode Utilities on both Tiger and Leopard.
Comment 12 Pattara Kiatisevi 2008-01-26 22:17:56 PST
(In reply to comment #9)
> (From update of attachment 18616 [details] [edit])
> First, let me thank you VERY MUCH for taking the time to contribute to WebKit. 
> We strongly value new contributions, and I would like to encourage you to
> contribute more in the future.
> That said, we're not going to take this patch.  More information below:
> Ok, so it looks like ICU does support Thai breaking.  Safari 3.0.4 on Mac OS X
> already gets this test case correct:
> http://www.mm.co.th/~pattara/thai/wbrtest-short.html
> (as far as I can tell)
> So this bug only applies to platforms which are not able to use ICU.  Both
> WebKitWinApple and WebKitMac use ICU.  I'm not certain about WebKitGtk or
> WebKitQt.

Safari on Mac works ok (thanks to Carbon). But as far as we have tested, Thai word breaking is NOT working in WebKitQt on Linux. 

As I mentioned in #6, maybe the problem is better solved using ICU provided that WebKit and ICU is already connected. But as I see on my Linux box that the WebKitQt package doesn't depend on libicu. So we guess maybe WebKit is not using libicu and hence propose this patch which uses libthai as a possible solution for word breaking, similar to in KHTML and Firefox. The bottom line is just to have propert Thai word breaking in WebKit. Any solution is actually fine.

Please let us know if we can be of any further help regarding this (testing, etc..). 
Comment 13 Alexey Proskuryakov 2008-01-26 23:30:11 PST
Many non-Apple ports do not want to use ICU (because it's too large), so I think that this work is of good value. CC'ing some people that would know for sure.
Comment 14 Simon Hausmann 2009-03-04 04:18:50 PST
I think for non-ICU builds, such as the Qt one, the TextBreakIterator should continue to work correctly for Thai. The TextBreakIterator for the Qt build uses QTextBoundaryFinder. That's the class that may need some additional fixing for Thai. I'm working with Pattara on that :-)
Comment 15 Simon Hausmann 2012-08-02 04:17:20 PDT
For the Qt port we have chosen to use ICU, which I think means now that all ports are using ICU and this bug can be closed :)