Bug 37543 - Need more sophisticated line breaking rule for CJK about quotation marks
Status: UNCONFIRMED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Nobody
Reported: 2010-04-13 19:23 PDT by Xianzhu Wang
Modified: 2010-12-30 01:12 PST
Description Xianzhu Wang 2010-04-13 19:23:14 PDT
Currently, all quotation marks in CJK text are treated as prohibiting line breaks both before and after them. This is not unacceptable, but sophisticated CJK text layout software should also consider whether a quotation mark is opening or closing and apply different line breaking rules accordingly.

For example, the following Chinese text

一二三四五六七八九“甲乙丙丁”

is displayed in a container wide enough to hold 10 Chinese characters.

In current WebKit, the above text will be displayed as:

一二三四五六七八
九“甲乙丙丁”

while all word-processing software and other browsers (IE, Firefox) will display the text as:

一二三四五六七八九
“甲乙丙丁”

which makes better use of the container space and looks more natural to Chinese readers.
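
The difference between the two layouts can be reproduced with a toy greedy line breaker. This is only a sketch to illustrate the behavior, not WebKit's or Firefox's actual code: the names `can_break` and `wrap` are hypothetical, only the curly quotation marks are handled, and every character is assumed to occupy one full-width cell.

```python
OPEN_QUOTES = {"\u201c", "\u2018"}   # left double / left single quotation mark
CLOSE_QUOTES = {"\u201d", "\u2019"}  # right double / right single quotation mark
ALL_QUOTES = OPEN_QUOTES | CLOSE_QUOTES


def can_break(before, after, direction_aware):
    """Return True if a break is allowed between two adjacent characters."""
    if direction_aware:
        # Opening quotes forbid a break after them,
        # closing quotes forbid a break before them.
        return before not in OPEN_QUOTES and after not in CLOSE_QUOTES
    # Direction-blind rule: no break on either side of any quotation mark.
    return before not in ALL_QUOTES and after not in ALL_QUOTES


def wrap(text, width, direction_aware):
    """Greedy wrap: fill each line as far as the break rules allow."""
    lines = []
    start = 0
    while len(text) - start > width:
        # Try the latest possible break first, then back off.
        cut = start + width
        while cut > start and not can_break(text[cut - 1], text[cut], direction_aware):
            cut -= 1
        if cut == start:
            # No legal break in this line: fall back to an emergency break.
            cut = start + width
        lines.append(text[start:cut])
        start = cut
    lines.append(text[start:])
    return lines


sample = "一二三四五六七八九“甲乙丙丁”"
print(wrap(sample, 10, direction_aware=False))  # quotes block breaks on both sides
print(wrap(sample, 10, direction_aware=True))   # opening/closing treated differently
```

With the direction-blind rule the first line ends after 八, as in current WebKit; with the direction-aware rule the break falls between 九 and the opening quotation mark, as in the other browsers.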

Firefox implements an algorithm (https://wiki.mozilla.org/Gecko:Line_Breaking, http://mxr.mozilla.org/mozilla1.9.2/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp) that conforms to the JIS X 4051 standard (Formatting rules for Japanese documents), which also applies to Chinese and Korean documents.

I'd like to add JIS X 4051 support to WebKit. What are the rules about importing source code under other licenses?
Comment 1 Alexey Proskuryakov 2010-04-14 14:36:39 PDT
Licensing requirements are available on the patch submission page. Generally, we can only take code that is BSD or LGPL-licensed.

We normally follow the Unicode line breaking algorithm <http://unicode.org/reports/tr14/>. While deviations from it are acceptable, they certainly need to be explained in detail. One question to answer is: why doesn't the Unicode algorithm implement this?
Comment 2 Xianzhu Wang 2010-04-14 23:18:34 PDT
Although UAX #14 says that by default quotation marks "act like they are both opening and closing", and thus prohibit line breaks both before and after them, it also includes a note: "If language information is available, it can be used to determine which character is used as the opening quote and which as the closing quote. ... the quotation marks could be tailored to either OP or CL depending on their actual usage."

Mozilla bug https://bugzilla.mozilla.org/show_bug.cgi?id=450088 contains detailed discussions about the quotation mark line breaking issue.

Mozilla's rules for quotation marks are as follows (not including the quotation marks that already have the OP or CL line breaking property):

1. The following left quotation marks are treated as opening punctuation (prohibiting a break after them) in all language contexts:

201B;QU # SINGLE HIGH-REVERSED-9 QUOTATION MARK
201F;QU # DOUBLE HIGH-REVERSED-9 QUOTATION MARK

2. The following left quotation marks are treated as opening punctuation (prohibiting a break after them) in CJK contexts (where the next character is a CJK character):

00AB;QU # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
2018;QU # LEFT SINGLE QUOTATION MARK
201C;QU # LEFT DOUBLE QUOTATION MARK

3. The following right quotation marks are treated as closing punctuation (prohibiting a break before them) in all language contexts:

2019;QU # RIGHT SINGLE QUOTATION MARK
201D;QU # RIGHT DOUBLE QUOTATION MARK

I think rule 1 above should be merged into rule 2, because 201B and 201F have the same semantics as 2018 and 201C, respectively.
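
As a rough illustration, the three rules listed above can be written as a small classifier that tailors a QU character to OP or CL. This is a sketch under stated assumptions, not Mozilla's implementation: `tailored_class` and `is_cjk` are hypothetical names, and the CJK test is a simplified range check chosen only for the example.

```python
# Rule 1: opening punctuation in every language context.
ALWAYS_OPEN = {0x201B, 0x201F}
# Rule 2: opening punctuation only when the next character is CJK.
OPEN_BEFORE_CJK = {0x00AB, 0x2018, 0x201C}
# Rule 3: closing punctuation in every language context.
ALWAYS_CLOSE = {0x2019, 0x201D}

# Assumption: a crude stand-in for "is a CJK character".
CJK_RANGES = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
]


def is_cjk(ch):
    """Return True if ch falls in one of the simplified CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tailored_class(ch, next_ch=None):
    """Return 'OP', 'CL', or 'QU' (untailored default) for a quotation mark.

    Intended only for characters whose default line breaking class is QU.
    """
    cp = ord(ch)
    if cp in ALWAYS_OPEN:
        return "OP"
    if cp in OPEN_BEFORE_CJK and next_ch is not None and is_cjk(next_ch):
        return "OP"
    if cp in ALWAYS_CLOSE:
        return "CL"
    return "QU"


print(tailored_class("\u201c", "甲"))  # left double quote before a CJK character
print(tailored_class("\u201c", "a"))   # same quote before a Latin letter
print(tailored_class("\u201d"))        # right double quote
```

In this sketch, merging rule 1 into rule 2 would amount to moving 201B and 201F from `ALWAYS_OPEN` into `OPEN_BEFORE_CJK`.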

I think the solution might be either pushing ICU to add this functionality or implementing an extra layer on top of ICU in WebKit code.