Bug 3963 - Trailing space included in line if next line begins with non-Latin-1 character
Summary: Trailing space included in line if next line begins with non-Latin-1 character
Status: VERIFIED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Layout and Rendering (show other bugs)
Version: 412
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Dave Hyatt
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-07-12 07:39 PDT by mitz
Modified: 2005-08-27 12:48 PDT (History)
1 user (show)

See Also:


Attachments
Testcase (2.34 KB, text/html)
2005-07-12 07:39 PDT, mitz
no flags Details
Extend Latin-1 behavior to the general case (778 bytes, patch)
2005-08-21 07:59 PDT, mitz
hyatt: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description mitz 2005-07-12 07:39:03 PDT
Summary: when a line break occurs at a space, that space should not be rendered on either line. 
However, when the character following the space is not in Latin-1, WebKit renders the space at the end 
of the first line.

To reproduce: open the testcase in Safari.

Expected: "Lorem" to be flush against the right edge in each of the three blue boxes, like in the green 
box.

Actual: In the first blue box, there is an underlined space between "Lorem" and the right edge of the 
box, like in the red box. The other two blue boxes render as expected.

Analysis: The problem is in khtml::isBreakable(). "A B" can be broken before the space as well as before 
the B. Contrary to what the comment in the code says, isBreakable considers it breakable only before 
the space. Which is OK. The problem is that with characters outside Latin-1, isBreakable invokes 
UCFindTextBreak, which actually says that the break can happen only before the B.

I'm not sure how to fix it. Changing break_lines.cpp:108 to return true if end == pos + 1 actually fixes 
the testcase, but I'm not sure it's a complete solution.

Anyway, the comment should be changed to reflect exactly what isBreakable() does.
Comment 1 mitz 2005-07-12 07:39:41 PDT
Created attachment 2925 [details]
Testcase
Comment 2 Oliver Hunt 2005-07-13 23:16:00 PDT
Reproducable on ToT -- good testcase
Comment 3 Justin Garcia 2005-08-05 13:58:21 PDT
Looking into this.  Assigning this to myself.
Comment 4 Justin Garcia 2005-08-15 11:32:26 PDT
Haven't had a chance to look at this yet.  Unassigning.  Mitz?
Comment 5 mitz 2005-08-15 12:01:49 PDT
(In reply to comment #4)
> Haven't had a chance to look at this yet.  Unassigning.  Mitz?

The first thing I want to do is to not call isBreakable() from findNextLineBreak(), and instead call 
UCFindTextBreak() only as necessary (i.e. call it once and not call it again before we reach the position it 
gave us). This will eliminate the O(n^2) bit, and by moving all the relevant logic into findNextLineBreak() 
will hopefully make it easier to move on with fixing the actual bug.
Comment 6 Dave Hyatt 2005-08-21 01:37:12 PDT
I think a simpler option would be to just fix isBreakable.  I think there are ways you can query for 
breakability without having to "search" for the position of the next text break.
Comment 7 mitz 2005-08-21 04:02:14 PDT
(In reply to comment #6)
> I think a simpler option would be to just fix isBreakable.  I think there are ways you can query for 
> breakability without having to "search" for the position of the next text break.

Sure. The only way I know of is calling UCFindTextBreak with textLength equal to startOffset+1. I don't 
think ICU has anything better to offer.

Anyway, I'm not sure what the expected behavior of isBreakable is. What it currently does in the Latin-1 
case works, but doesn't fit any definition of line-breakability -- in particular not UCFindTextBreak's -- 
other than "what findNextLineBreak expects". Can that definition be extended to cover the general case?
Comment 8 mitz 2005-08-21 07:59:21 PDT
Created attachment 3491 [details]
Extend Latin-1 behavior to the general case
Comment 9 mitz 2005-08-21 08:06:07 PDT
Comment on attachment 3491 [details]
Extend Latin-1 behavior to the general case

This patch fixes this bug and doesn't affect any of the layout tests. If it is
landed, then I'll open a separate bug about the O(n^2) complexity, and yet
another bug about many breaking opportunities we miss in Latin-1, e.g. breaking
before a left parenthesis.
Comment 10 Darin Adler 2005-08-21 09:09:49 PDT
Side comment (not about the patch). I think we'd like to switch to ICU for finding the breaks rather than 
using Carbon, unless there's a reason not to. Currently the only reason not to that I know of is a concern 
that Thai line breaking is not implemented yet in ICU.
Comment 11 Alexey Proskuryakov 2005-08-24 11:21:59 PDT
(In reply to comment #10)
> Side comment (not about the patch). I think we'd like to switch to ICU for finding the breaks rather 
than using Carbon, unless there's a reason not to.

  A reason that you may or may not find important... We (Mac developers outside Apple) are strongly 
advised not to use ICU in our projects for various reasons (which probably don't apply to WebKit). Still, I 
find "eating your own dog food" a good thing to do in this case - after all, WebKit has no special text 
processing requirements, and should be served by CFString well. If something is missing, chances are 
that we all would benefit from it being added to CFString.

  Moving from Unicode Utilities to an open source alternative, be it ICU or CFString, sounds great to me.
Comment 12 Darin Adler 2005-08-24 13:27:30 PDT
The CFString APIs require that your text be in a CFString. That's not the case in WebKit, so I don't think 
those APIs are going to be suitable for this work.
Comment 13 Alexey Proskuryakov 2005-08-24 21:52:35 PDT
(In reply to comment #12)

Yes, ICU works on buffers of UChars - but that's pretty easy to emulate with CFString:
CFStringRef str = CFStringCreateWithCharactersNoCopy(0, unichars, numChars, kCFAllocatorNull);
// do whatever you need; there's also CFStringCreateMutableWithExternalCharactersNoCopy()
CFRelease(str);

Given the complexity of algorithms involved, these additional function calls shouldn't be much of an 
overhead (especially since WebKit's native format isn't always buffers of UChars).

A nice thing about CFString is that it works with multiple encodings natively, without necessarily 
converting them into Unicode:
CFStringRef str = CFStringCreateWithCStringNoCopy(..., kSomeEncoding).
This transparently offers performance benefits in cases when the encoding is a simple one - try to 
match that with ICU! ;)
Comment 14 mitz 2005-08-24 22:50:25 PDT
Created bug 4628 - Use ICU instead of Carbon in khtml::isBreakable() 
Comment 15 Dave Hyatt 2005-08-26 17:04:45 PDT
Comment on attachment 3491 [details]
Extend Latin-1 behavior to the general case

r=me