Bug 23124 - RTL: Directionality always reset on hard line break
Summary: RTL: Directionality always reset on hard line break
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Layout and Rendering (show other bugs)
Version: 528+ (Nightly build)
Hardware: Mac OS X 10.5
: P2 Normal
Assignee: Nobody
URL:
Keywords: GoogleBug
Depends on:
Blocks: 50910
  Show dependency treegraph
 
Reported: 2009-01-05 13:47 PST by Jeremy Moskovich
Modified: 2011-03-30 14:41 PDT (History)
14 users (show)

See Also:


Attachments
Test case (1.92 KB, text/html)
2009-01-05 13:48 PST, Jeremy Moskovich
no flags Details
Test Case (169 bytes, text/html)
2009-06-04 13:19 PDT, Jeremy Moskovich
no flags Details
TestCase rendering in IE,WebKit,FF & Opera (97.09 KB, image/png)
2009-06-04 13:24 PDT, Jeremy Moskovich
no flags Details
Only reset directionality if caused by Unicode Control Char (5.31 KB, patch)
2009-06-04 16:54 PDT, Jeremy Moskovich
no flags Details | Formatted Diff | Diff
Only reset directionality if caused by Unicode Control Char (5.27 KB, patch)
2009-06-08 08:02 PDT, Jeremy Moskovich
no flags Details | Formatted Diff | Diff
Added more thorough layout test (30.19 KB, patch)
2009-06-17 19:22 PDT, Jeremy Moskovich
mitz: review-
Details | Formatted Diff | Diff
Patch (139.10 KB, patch)
2011-03-29 07:54 PDT, Levi Weintraub
no flags Details | Formatted Diff | Diff
Patch (139.72 KB, patch)
2011-03-29 10:04 PDT, Levi Weintraub
eric: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jeremy Moskovich 2009-01-05 13:47:18 PST
Filed in Chrome as: http://code.google.com/p/chromium/issues/detail?id=5959

Reproducible in WebKit trunk and Safari 3.1.2.

From the bug report:
"Review the attached reduced test case and notice the following issues -
- In non-paragraphic elements (I guess), the directionality is not maintained 
beyond the first element, it goes back to the paragraphic inheritance or 
something.
- The spacing between the lines of the non-paragraphic elements is not equal 
between one another (line or element). It is inconsistently changing. 
Sometimes the first line has an almost non existence bottom spacing, 
sometimes it is a 2px bottom spacing or so.
"
Comment 1 Jeremy Moskovich 2009-01-05 13:48:02 PST
Created attachment 26439 [details]
Test case
Comment 2 mitz 2009-01-08 07:38:37 PST
This seems to be about hard line breaks. A reduction with fewer elements:

data:text/html,<span dir="rtl">&#x05d0;.<br>&#x05d0.</span>
Comment 3 mitz 2009-01-11 11:20:14 PST
I suspect that this behavior is related to the special treatment of hard line breaks in BidiResolver::createBidiRunsForLine():

                // A deviation from the Unicode Bidi Algorithm in order to match
                // Mac OS X text and WinIE: a hard line break resets bidi state.

Perhaps the behavior of IE should be studied more carefully and the above should be refined. Maybe it should only affect embedding using Unicode control characters, and not embedding using CSS properties.
Comment 4 Yair Yogev 2009-05-14 03:58:53 PDT
this is not only a mac issue of course
Comment 5 Jeremy Moskovich 2009-06-04 13:19:10 PDT
Created attachment 30956 [details]
Test Case

Tests case of both dir=rtl & unicode dir override chars.
Comment 6 Jeremy Moskovich 2009-06-04 13:24:26 PDT
Created attachment 30959 [details]
TestCase rendering in IE,WebKit,FF & Opera

IE: Only resets directionality in the case of Unicode override chars.
FF,Opera: Never reset directionality.
WebKit: Resets directionality in both cases.

If we want to match IE here, looks like the right thing to do is to change the current behavior to not reset directionality in the case of css specified directionality per mitz's comment.
Comment 7 Jeremy Moskovich 2009-06-04 16:54:06 PDT
Created attachment 30972 [details]
Only reset directionality if caused by Unicode Control Char
Comment 8 Eric Seidel (no email) 2009-06-06 01:24:59 PDT
Comment on attachment 30972 [details]
Only reset directionality if caused by Unicode Control Char

WebKit uses CamelCase for variable names.

Mitz should review this for content though.  You probably want to upload a new patch in the meanwhile so he doesn't have to think about the style issues.

Also:
\ No newline at end of file
Comment 9 Jeremy Moskovich 2009-06-08 08:02:02 PDT
Created attachment 31050 [details]
Only reset directionality if caused by Unicode Control Char

Fixed var naming + Newline.
Comment 10 mitz 2009-06-15 10:50:45 PDT
Comment on attachment 31050 [details]
Only reset directionality if caused by Unicode Control Char

> +        Direction current_context_direction = context()->dir();

WebKit variable naming style is to use camelCase.

Is it correct to pop out of all explicit embedding levels, which can be a mixture of style-originated and control-character-originated, just because the topmost was started by a control character? I think the test should cover such cases.
Comment 11 Jeremy Moskovich 2009-06-17 18:42:31 PDT
When creating test cases for this bug I came across the following case:

&#x202e;
&#x05d0;.<br>
&#x05D0;.<br>
<span dir="rtl"></span>
&#x202c;

WebKit display this as:
.a
.a

While IE displays this as:
.a
a.

Removing the span causes WebKit to match IE.
Comment 12 Jeremy Moskovich 2009-06-17 19:22:34 PDT
Created attachment 31485 [details]
Added more thorough layout test

* layout test added to cover the case of top level unicode control char.
* fixed style issues.
* Added comment.
* Included layout test results.

Mitz: Are there any other tests you'd like me to add here?  If so I'd be grateful if you could mention them explicitly.

This patch changes the output of a couple of layout tests but they all still appear to pass.
Rendering of the new "complex" layout test is identical to IE except for the strange behavior I highlighted in my previous comment (i.e. it looks different for reasons unrelated to this patch).
Comment 13 Jeremy Moskovich 2009-07-04 23:36:20 PDT
mitz: Could you take another look please?  If you have any other concerns or there's anything else you'd like me to check please let me know.
Comment 14 mitz 2009-07-10 00:12:28 PDT
How should a case like this behave?

<span dir="rtl">A &#x202E; B<span dir="ltr">C <br> D </span> E &#x202C; F</span>

Namely, what should the context stack look like after the <br>, after the inner span is closed, and after the PDF?
Comment 15 mitz 2009-08-07 09:53:32 PDT
Comment on attachment 31485 [details]
Added more thorough layout test

I don’t think I can review this patch without knowing the expected behavior for nested cases like the ones I’ve mentioned before.
Comment 16 Jeremy Moskovich 2009-08-07 09:56:30 PDT
I haven't been able to work on this for a while, will get back to you with that information as soon as I can.

Thanks!
Comment 17 Jeremy Moskovich 2009-08-27 08:32:32 PDT
Adding GoogleBug keyword since this affects:
http://www.google.co.il/m/search?q=%D7%AA%D7%9C-%D7%90%D7%91%D7%99%D7%91

The issues is with the way the URL's are displayed on the page.
Comment 18 Aharon (Vladimir) Lanin 2009-08-30 02:03:03 PDT
A few thoughts about this issue.

- The most important test case for the effect of <br> on the Unicode BiDi Alogorithm is as follows:

1. Start with &#x05d0;.
2. &#x05d1; is the next step.

The period on the first line is should come out to the right of the aleph and the "2." on the second line should come out to the left of the bet.

That is where Firefox and Opera go wrong.

The second most important test case is that in <span dir=rtl>&#x05d0;.<br>&#x05d0;.</span>, the two lines should comeout the same (with the period to the left of the aleph).

This is where WebKit is currently going wrong.

The issue of the effect of <br> on embedding levels started via Unicode BiDi formatting characters like LRE is of secondary importance, since these really aren't supposed to be used in HTML documents except where mark-up can't be used.

As far as I am concerned, the only legitimate reason for these characters to appear in HTML documents otherwise is as a result of including plain text in it. One HTML-escapes the plain text, of course, and translates newlines to <br>s - but what is one supposed to do with the LREs etc.? What usually happens is that the LREs are ignored, i.e. left in the HTML document as is. It is good in such cases for the <br>s (that used to be newlines) to close them, as they had done in the original plain-text context, where a newline is a paragraph break.

- However <br> behaves in this whole matter, <hr> and embedded block elements (e.g. <div></div>) should behave the same.

- The Unicode BiDi algorithm unfortunately does not deal with a good parrallel to <br>. The obvious candidate would, of course, be paragraph breaks, except that paragraph breaks are supposed to terminate all embedding levels, and it makes no sense for <br> to terminate the effect of the dir attributes of its ancestors. It is the lack of a good parallel that caused all the browsers to do such different things.

- Given that the Unicode BiDi Algorithm gives no clearcut answer on what the effect of <br> should be, and that IE, Firefox, and WebKit currently all do different things, I do not think that WebKit needs to do exactly what IE or Firefox does (unless that's what makes sense).

- IE's approach of digging into the stack and killing the embedding levels started with LRE, RLE, LRO,and RLO, but not those started by HTML elements within the scope of these characters and their matching PDFs gives me the willies. I would like to offer an alternative: close embedding levels up to, but not including the topmost one started by mark-up. That way, we don't have to do surgery on the stack, but we still close LREs started in included plain text. Here is a test case:

<div dir=ltr>
==>
&#x202B;
=>>
<span dir=ltr>
==>
&#x202B;
=>>
&#x202A;
==>
<br>
==>
&#x202C;
==>
&#x202C;
==>
</span>
=>>
&#x202C;
==>
</div>

The result should be that all the ==> arrows point left, while all the =>> arrows point right.
Comment 19 Simon Fraser (smfr) 2011-01-12 21:43:28 PST
http://www.w3.org/TR/html4/struct/text.html#edef-BR
Comment 20 Xiaomei Ji 2011-01-13 10:27:11 PST
Aharon proposed to treat <br> as bidi paragraph separator to HTML5.

Following is his proposal:
http://www.w3.org/TR/html-bidi/#br-as-separator
(Amendament: Although IE7 continued to treat <br> as a bidi separator, IE8 and and IE9, when in standards mode, treat <br> as bidi whitespace, as per the HTML4 spec.)

And he filed the following bug to HTML5:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=10828

The proposal was adopted in HTML5 which defines <br> as a bidi paragraph separator.
http://dev.w3.org/html5/spec/Overview.html#the-br-element
Comment 21 Ian 'Hixie' Hickson 2011-01-13 11:40:54 PST
Just as a heads-up, the spec might change again.
Comment 22 Eric Seidel (no email) 2011-03-27 08:13:43 PDT
(In reply to comment #3)
> I suspect that this behavior is related to the special treatment of hard line breaks in BidiResolver::createBidiRunsForLine():
> 
>                 // A deviation from the Unicode Bidi Algorithm in order to match
>                 // Mac OS X text and WinIE: a hard line break resets bidi state.

For those following along at home, this comment can be found at:
http://trac.webkit.org/browser/trunk/Source/WebCore/platform/text/BidiResolver.h#L560
Comment 23 Levi Weintraub 2011-03-29 07:54:56 PDT
Created attachment 87328 [details]
Patch
Comment 24 Eric Seidel (no email) 2011-03-29 08:19:33 PDT
Comment on attachment 87328 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=87328&action=review

I think this is good,b ut needs another round of cleanup.

> Source/WebCore/platform/text/BidiContext.cpp:62
> +static inline PassRefPtr<BidiContext> clone(BidiContext* context, BidiContext* parent)

I would separate this into two functions. One which is a copy() function on the BidiContext, and a second where you set the level by calcuating a new level in a local helper fuction.

> Source/WebCore/platform/text/BidiContext.cpp:69
> +        // Go to the least greater odd integer
> +        newLevel += 1;
> +        newLevel |= 1;

Just put these into helper functions.  nextGreaterOdd()

> Source/WebCore/platform/text/BidiContext.cpp:78
> +// The BidiContext stack must be immutable -- they're re-used for repainting -- so

They're re-used when we re-layout after dom modification/editing.

> Source/WebCore/platform/text/BidiContext.cpp:80
> +PassRefPtr<BidiContext> BidiContext::clearUnicodeEmbeddingContexts()

copyStackRemovingUnicodeEmbeddingContexts?

> Source/WebCore/platform/text/BidiContext.cpp:82
> +    Vector<BidiContext*> contexts;

Either this should have inline capacity of 64 (to avoid ever mallocing) or you shoudl do this in two passes.  One which makes a clone of the list w/o unicodeEmbeddings, and a second walk which fixes the levels.

I think I'm slightly more in favor of doing two passes.  But it's really up to you.  A linked list walk is also expensive, but probably less expensive than a malloc.  Although we can avoid the malloc with the inline capcity.  so again, up to you.

> Source/WebCore/platform/text/BidiContext.h:33
> +enum BidiEmbeddingClass {

This seems confusing to me.  Maybe BidiSource?  ContextSource?  Donno.  But currently confusing.

> Source/WebCore/platform/text/BidiContext.h:34
> +    DOMEmbedding,

could be FromDOM and FromUnicode?  BidiEmbeddingSource?  maybe?

> Source/WebCore/platform/text/BidiContext.h:65
> +    BidiEmbeddingClass m_embeddingClass : 1;

There may be some concern about makign sure this is unsigned on MSVC.  I dont' rmember the exact rules.  but enums + bitfields + msvc = confusion.

> Source/WebCore/platform/text/BidiResolver.h:80
> +    BidiEmbedding(WTF::Unicode::Direction direction, BidiEmbeddingClass embeddingClass = UnicodeEmbedding)

I'm not sure you want a default argumetn value here.

> Source/WebCore/platform/text/BidiResolver.h:82
> +    : m_direction(direction)
> +    , m_embeddingClass(embeddingClass)

indent.

> Source/WebCore/platform/text/BidiResolver.h:610
> +            embed(BidiEmbedding(dirCurrent));

This can be implicitly constructed, no?  I geuss it's better to be clear.

> Source/WebCore/rendering/InlineIterator.h:161
> +                resolver->embed(BidiEmbedding(d, DOMEmbedding));

I think it's cleaner to make embed just take two params, and ahve it construct the Embedding thing.
Comment 25 Nikolas Zimmermann 2011-03-29 08:22:05 PDT
Comment on attachment 87328 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=87328&action=review

> Source/WebCore/platform/text/BidiContext.h:35
> +    DOMEmbedding,
> +    UnicodeEmbedding

I think this is confusing. DOMEmbedding sounds like it would only come from DOM (<span dir="...) attributes, but inline styles would affect this as well (<span style="..").
Please correct me if I'm wrong.
Comment 26 Levi Weintraub 2011-03-29 08:48:47 PDT
Comment on attachment 87328 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=87328&action=review

Thanks for the review. I'll correct your points :)

>> Source/WebCore/platform/text/BidiContext.cpp:82
>> +    Vector<BidiContext*> contexts;
> 
> Either this should have inline capacity of 64 (to avoid ever mallocing) or you shoudl do this in two passes.  One which makes a clone of the list w/o unicodeEmbeddings, and a second walk which fixes the levels.
> 
> I think I'm slightly more in favor of doing two passes.  But it's really up to you.  A linked list walk is also expensive, but probably less expensive than a malloc.  Although we can avoid the malloc with the inline capcity.  so again, up to you.

I'd love to do a 2-pass solution, but I don't believe we can do this properly, since each context derives its level from those below it, and we need a parallel data structure to walk from hte bottom to the top.
Comment 27 Levi Weintraub 2011-03-29 08:58:20 PDT
Comment on attachment 87328 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=87328&action=review

>> Source/WebCore/platform/text/BidiContext.h:35
>> +    UnicodeEmbedding
> 
> I think this is confusing. DOMEmbedding sounds like it would only come from DOM (<span dir="...) attributes, but inline styles would affect this as well (<span style="..").
> Please correct me if I'm wrong.

You're right that style-based directions would also be classified here as originating from the "DOM." Perhaps FromUnicode and FromStyleOrDOM?
Comment 28 Eric Seidel (no email) 2011-03-29 09:01:29 PDT
You could jsut do FromStyle (or FromDOM).  DOM just gets mapped into Style anyway.
Comment 29 Nikolas Zimmermann 2011-03-29 09:09:45 PDT
(In reply to comment #28)
> You could jsut do FromStyle (or FromDOM).  DOM just gets mapped into Style anyway.
FromStyle / FromUnicode is fine. But I still find it confusing, I propose:
ExternalEmbedding / IntrinsicEmbedding, where external is any DOM attribute or CSS style attribute, either defined on the element or any of its parents (when inheritance applies) and intrinsic denotes that the directional information is included within the text.
Comment 30 Levi Weintraub 2011-03-29 10:04:41 PDT
Created attachment 87354 [details]
Patch
Comment 31 Eric Seidel (no email) 2011-03-29 10:47:23 PDT
Comment on attachment 87354 [details]
Patch

This looks great to me.  You might want to give mitz overnight to see if he has any thoughts.
Comment 32 Levi Weintraub 2011-03-30 05:49:00 PDT
Committed r82424: <http://trac.webkit.org/changeset/82424>
Comment 33 WebKit Review Bot 2011-03-30 14:41:02 PDT
http://trac.webkit.org/changeset/82424 might have broken GTK Linux 32-bit Debug