Bug 3963

Summary: Trailing space included in line if next line begins with non-Latin-1 character
Product: WebKit Reporter: mitz
Component: Layout and RenderingAssignee: Dave Hyatt <hyatt>
Status: VERIFIED FIXED    
Severity: Normal CC: ap
Priority: P2    
Version: 412   
Hardware: Mac   
OS: OS X 10.4   
Attachments:
Description Flags
Testcase
none
Extend Latin-1 behavior to the general case hyatt: review+

mitz
Reported 2005-07-12 07:39:03 PDT
Summary: when a line break occurs at a space, that space should not be rendered on either line. However, when the character following the space is not in Latin-1, WebKit renders the space at the end of the first line. To reproduce: open the testcase in Safari. Expected: "Lorem" to be flush against the right edge in each of the three blue boxes, like in the green box. Actual: In the first blue box, there is an underlined space between "Lorem" and the right edge of the box, like in the red box. The other two blue boxes render as expected. Analysis: The problem is in khtml::isBreakable(). "A B" can be broken before the space as well as before the B. Contrary to what the comment in the code says, isBreakable considers it breakable only before the space. Which is OK. The problem is that with characters outside Latin-1, isBreakable invokes UCFindTextBreak, which actually says that the break can happen only before the B. I'm not sure how to fix it. Changing break_lines.cpp:108 to return true if end == pos + 1 actually fixes the testcase, but I'm not sure it's a complete solution. Anyway, the comment should be changed to reflect exactly what isBreakable() does.
Attachments
Testcase (2.34 KB, text/html)
2005-07-12 07:39 PDT, mitz
no flags
Extend Latin-1 behavior to the general case (778 bytes, patch)
2005-08-21 07:59 PDT, mitz
hyatt: review+
mitz
Comment 1 2005-07-12 07:39:41 PDT
Created attachment 2925 [details] Testcase
Oliver Hunt
Comment 2 2005-07-13 23:16:00 PDT
Reproducable on ToT -- good testcase
Justin Garcia
Comment 3 2005-08-05 13:58:21 PDT
Looking into this. Assigning this to myself.
Justin Garcia
Comment 4 2005-08-15 11:32:26 PDT
Haven't had a chance to look at this yet. Unassigning. Mitz?
mitz
Comment 5 2005-08-15 12:01:49 PDT
(In reply to comment #4) > Haven't had a chance to look at this yet. Unassigning. Mitz? The first thing I want to do is to not call isBreakable() from findNextLineBreak(), and instead call UCFindTextBreak() only as necessary (i.e. call it once and not call it again before we reach the position it gave us). This will eliminate the O(n^2) bit, and by moving all the relevant logic into findNextLineBreak() will hopefully make it easier to move on with fixing the actual bug.
Dave Hyatt
Comment 6 2005-08-21 01:37:12 PDT
I think a simpler option would be to just fix isBreakable. I think there are ways you can query for breakability without having to "search" for the position of the next text break.
mitz
Comment 7 2005-08-21 04:02:14 PDT
(In reply to comment #6) > I think a simpler option would be to just fix isBreakable. I think there are ways you can query for > breakability without having to "search" for the position of the next text break. Sure. The only way I know of is calling UCFindTextBreak with textLength equal to startOffset+1. I don't think ICU has anything better to offer. Anyway, I'm not sure what the expected behavior of isBreakable is. What it currently does in the Latin-1 case works, but doesn't fit any definition of line-breakability -- in particular not UCFindTextBreak's -- other than "what findNextLineBreak expects". Can that definition be extended to cover the general case?
mitz
Comment 8 2005-08-21 07:59:21 PDT
Created attachment 3491 [details] Extend Latin-1 behavior to the general case
mitz
Comment 9 2005-08-21 08:06:07 PDT
Comment on attachment 3491 [details] Extend Latin-1 behavior to the general case This patch fixes this bug and doesn't affect any of the layout tests. If it is landed, then I'll open a separate bug about the O(n^2) complexity, and yet another bug about many breaking opportunities we miss in Latin-1, e.g. breaking before a left parenthesis.
Darin Adler
Comment 10 2005-08-21 09:09:49 PDT
Side comment (not about the patch). I think we'd like to switch to ICU for finding the breaks rather than using Carbon, unless there's a reason not to. Currently the only reason not to that I know of is a concern that Thai line breaking is not implemented yet in ICU.
Alexey Proskuryakov
Comment 11 2005-08-24 11:21:59 PDT
(In reply to comment #10) > Side comment (not about the patch). I think we'd like to switch to ICU for finding the breaks rather than using Carbon, unless there's a reason not to. A reason that you may or may not find important... We (Mac developers outside Apple) are strongly advised not to use ICU in our projects for various reasons (which probably don't apply to WebKit). Still, I find "eating your own dog food" a good thing to do in this case - after all, WebKit has no special text processing requirements, and should be served by CFString well. If something is missing, chances are that we all would benefit from it being added to CFString. Moving from Unicode Utilities to an open source alternative, be it ICU or CFString, sounds great to me.
Darin Adler
Comment 12 2005-08-24 13:27:30 PDT
The CFString APIs require that your text be in a CFString. That's not the case in WebKit, so I don't think those APIs are going to be suitable for this work.
Alexey Proskuryakov
Comment 13 2005-08-24 21:52:35 PDT
(In reply to comment #12) Yes, ICU works on buffers of UChars - but that's pretty easy to emulate with CFString: CFStringRef str = CFStringCreateWithCharactersNoCopy(0, unichars, numChars, kCFAllocatorNull); // do whatever you need; there's also CFStringCreateMutableWithExternalCharactersNoCopy() CFRelease(str); Given the complexity of algorithms involved, these additional function calls shouldn't be much of an overhead (especially since WebKit's native format isn't always buffers of UChars). A nice thing about CFString is that it works with multiple encodings natively, without necessarily converting them into Unicode: CFStringRef str = CFStringCreateWithCStringNoCopy(..., kSomeEncoding). This transparently offers performance benefits in cases when the encoding is a simple one - try to match that with ICU! ;)
mitz
Comment 14 2005-08-24 22:50:25 PDT
Created bug 4628 - Use ICU instead of Carbon in khtml::isBreakable()
Dave Hyatt
Comment 15 2005-08-26 17:04:45 PDT
Comment on attachment 3491 [details] Extend Latin-1 behavior to the general case r=me
Note You need to log in before you can comment on or make changes to this bug.