Bug 3963

Summary:

Trailing space included in line if next line begins with non-Latin-1 character

Product:

WebKit

Reporter:

mitz

Component:

Layout and Rendering

Assignee:

Dave Hyatt <hyatt>

Status:

VERIFIED FIXED

Severity:

Normal

CC:

Priority:

Version:

412

Hardware:

Mac

OS:

OS X 10.4

Attachments:

Description	Flags
Testcase	none
Extend Latin-1 behavior to the general case	hyatt: review+

mitz

Reported 2005-07-12 07:39:03 PDT

Summary: when a line break occurs at a space, that space should not be rendered on either line. However, when the character following the space is not in Latin-1, WebKit renders the space at the end of the first line. To reproduce: open the testcase in Safari. Expected: "Lorem" to be flush against the right edge in each of the three blue boxes, like in the green box. Actual: In the first blue box, there is an underlined space between "Lorem" and the right edge of the box, like in the red box. The other two blue boxes render as expected. Analysis: The problem is in khtml::isBreakable(). "A B" can be broken before the space as well as before the B. Contrary to what the comment in the code says, isBreakable considers it breakable only before the space. Which is OK. The problem is that with characters outside Latin-1, isBreakable invokes UCFindTextBreak, which actually says that the break can happen only before the B. I'm not sure how to fix it. Changing break_lines.cpp:108 to return true if end == pos + 1 actually fixes the testcase, but I'm not sure it's a complete solution. Anyway, the comment should be changed to reflect exactly what isBreakable() does.

Attachments
Testcase (2.34 KB, text/html) 2005-07-12 07:39 PDT, mitz	no flags	Details
Extend Latin-1 behavior to the general case (778 bytes, patch) 2005-08-21 07:59 PDT, mitz	hyatt: review+	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.

mitz

Comment 1 2005-07-12 07:39:41 PDT

Created attachment 2925 [details] Testcase

Oliver Hunt

Comment 2 2005-07-13 23:16:00 PDT

Reproducable on ToT -- good testcase

Justin Garcia

Comment 3 2005-08-05 13:58:21 PDT

Looking into this. Assigning this to myself.

Justin Garcia

Comment 4 2005-08-15 11:32:26 PDT

Haven't had a chance to look at this yet. Unassigning. Mitz?

mitz

Comment 5 2005-08-15 12:01:49 PDT

(In reply to comment #4) > Haven't had a chance to look at this yet. Unassigning. Mitz? The first thing I want to do is to not call isBreakable() from findNextLineBreak(), and instead call UCFindTextBreak() only as necessary (i.e. call it once and not call it again before we reach the position it gave us). This will eliminate the O(n^2) bit, and by moving all the relevant logic into findNextLineBreak() will hopefully make it easier to move on with fixing the actual bug.

Dave Hyatt

Comment 6 2005-08-21 01:37:12 PDT

I think a simpler option would be to just fix isBreakable. I think there are ways you can query for breakability without having to "search" for the position of the next text break.

mitz

Comment 7 2005-08-21 04:02:14 PDT

(In reply to comment #6) > I think a simpler option would be to just fix isBreakable. I think there are ways you can query for > breakability without having to "search" for the position of the next text break. Sure. The only way I know of is calling UCFindTextBreak with textLength equal to startOffset+1. I don't think ICU has anything better to offer. Anyway, I'm not sure what the expected behavior of isBreakable is. What it currently does in the Latin-1 case works, but doesn't fit any definition of line-breakability -- in particular not UCFindTextBreak's -- other than "what findNextLineBreak expects". Can that definition be extended to cover the general case?

mitz

Comment 8 2005-08-21 07:59:21 PDT

Created attachment 3491 [details] Extend Latin-1 behavior to the general case

mitz

Comment 9 2005-08-21 08:06:07 PDT

Comment on attachment 3491 [details] Extend Latin-1 behavior to the general case This patch fixes this bug and doesn't affect any of the layout tests. If it is landed, then I'll open a separate bug about the O(n^2) complexity, and yet another bug about many breaking opportunities we miss in Latin-1, e.g. breaking before a left parenthesis.

Darin Adler

Comment 10 2005-08-21 09:09:49 PDT

Side comment (not about the patch). I think we'd like to switch to ICU for finding the breaks rather than using Carbon, unless there's a reason not to. Currently the only reason not to that I know of is a concern that Thai line breaking is not implemented yet in ICU.

Alexey Proskuryakov

Comment 11 2005-08-24 11:21:59 PDT

(In reply to comment #10) > Side comment (not about the patch). I think we'd like to switch to ICU for finding the breaks rather than using Carbon, unless there's a reason not to. A reason that you may or may not find important... We (Mac developers outside Apple) are strongly advised not to use ICU in our projects for various reasons (which probably don't apply to WebKit). Still, I find "eating your own dog food" a good thing to do in this case - after all, WebKit has no special text processing requirements, and should be served by CFString well. If something is missing, chances are that we all would benefit from it being added to CFString. Moving from Unicode Utilities to an open source alternative, be it ICU or CFString, sounds great to me.

Darin Adler

Comment 12 2005-08-24 13:27:30 PDT

The CFString APIs require that your text be in a CFString. That's not the case in WebKit, so I don't think those APIs are going to be suitable for this work.

Alexey Proskuryakov

Comment 13 2005-08-24 21:52:35 PDT

(In reply to comment #12) Yes, ICU works on buffers of UChars - but that's pretty easy to emulate with CFString: CFStringRef str = CFStringCreateWithCharactersNoCopy(0, unichars, numChars, kCFAllocatorNull); // do whatever you need; there's also CFStringCreateMutableWithExternalCharactersNoCopy() CFRelease(str); Given the complexity of algorithms involved, these additional function calls shouldn't be much of an overhead (especially since WebKit's native format isn't always buffers of UChars). A nice thing about CFString is that it works with multiple encodings natively, without necessarily converting them into Unicode: CFStringRef str = CFStringCreateWithCStringNoCopy(..., kSomeEncoding). This transparently offers performance benefits in cases when the encoding is a simple one - try to match that with ICU! ;)

mitz

Comment 14 2005-08-24 22:50:25 PDT

Created bug 4628 - Use ICU instead of Carbon in khtml::isBreakable()

Dave Hyatt

Comment 15 2005-08-26 17:04:45 PDT

Comment on attachment 3491 [details] Extend Latin-1 behavior to the general case r=me

Note You need to log in before you can comment on or make changes to this bug.