Bug 21312

Summary: text-transform: lowercase is not lang-dependent (Turkish languages : tr,az)
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: CSSAssignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: ajlyon, ap, benjamin, cbguder, cmarcelo, commit-queue, darin, eflews.bot, eric, esprehn+autocc, glenn, gyuyoung.kim, ian, isherman, kondapallykalyan, mitz, nickshanks, philn, rniwa, webkit.bugzilla, xan.lopez
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://www.w3.org/International/tests/sec-text-transform-1
Bug Depends on: 67586    
Bug Blocks:    
Attachments:
Description Flags
tentative 1st step toward a fix
darin: review-
pilot patch (to work with/test the patch for bug 16801)
none
Patch
none
Patch rniwa: review+

Description Jungshik Shin 2008-10-02 10:50:50 PDT
1. Go to the URL above and scroll down the Turkish section

Expected: U+0049 should be transformed to U+0131 (lowercase letter dotless I). U+0130 (latin capital letter dotted I) should be transformed to U+0069 (lowercase I = i).

Actual: U+0049 and U+0130 are transformed to U+0069 and <U+0069, U+0307> ('i' followed by a combining mark dot above). 

This is a subset of the problem (not honoring 'lang'/'xml:lang' in CSS).
Comment 1 Alexey Proskuryakov 2008-10-02 11:05:48 PDT
<rdar://problem/3611341>
Comment 2 Jungshik Shin 2009-03-06 09:18:30 PST
Created attachment 28363 [details]
tentative 1st step toward a fix

I realized that Eric added the code language 'calcuation' for pseudo-lang selector support (so my previous comment about the lack of lang-dependent behavior is wrong). To refer to the value of lang in text transform (RenderText), I moved the code to Node. 

The second part will be to make upper/lower/capitalize lang-dependent (using ICU for ports that use it and by other means elsewhere)

I also want to use this for bug 10874 (and bug 18085) as well.  

Before going further, I'd like to know whether I'm on the right track. So, it'd be nice if any of you can take a quick look. Thanks !
Comment 3 Jungshik Shin 2009-03-09 10:41:01 PDT
Comment on attachment 28363 [details]
tentative 1st step toward a fix

I don't expect r+ for this patch (this patch even has '#if 0-enclosed code'), but I like to get some (potentially harsh) feedback to move things forward because the patch is also necessary for lang-based font selection bugs (mentioned earlier).
Comment 4 Jungshik Shin 2009-03-09 10:49:24 PDT
Comment on attachment 28363 [details]
tentative 1st step toward a fix

I realized that bug 16801 deals with this issue and uploaded this path there.
Comment 5 Darin Adler 2009-03-09 10:52:26 PDT
Comment on attachment 28363 [details]
tentative 1st step toward a fix

There are 3 things done here:

    1) Move the language-determining code for the style system from CSSStyleSelector into a member function in the Node class.
    2) Adding a cache in every Node object to store that language code.
    3) Changes to the logic of the language-determining code.

I can't easily comment on much of (3), because the patch doesn't show the change from the old to the new due to the use of #if 0.

There are some issues with (1):

    A) The style system only needs the language for elements, not other nodes, I would have the code in Element, not Node. But we won't want to dedicate 4 bytes in every Element to caching the language for that element. See below. In general, we need to keep the number of function members in these basic classes to a minimum, so it's unfortunate that the code is moving here.

    B) Breaking the language-computation out into a separate function is fine. That could be done as a first step of refactoring inside the CSSStyleSelector class. I don't think the function needs to be a member of the Node classes.

The biggest problems are with (2):

    C) There's no way we can make all nodes 4 bytes bigger just to make it faster to retrieve the value of the lang attribute, so that aspect of this patch is unacceptable. There are techniques we could use, such as putting the cached language into the rare data or using an external hash table that could keep the cost down for typical elements.

    D) There's no code to invalidate the cached language when attributes change or when the DOM tree structure changes. For example, if you store the value of an attribute such as langAttr, you then need code to deal with the case where that attribute changes. This becomes even more complex when you cache the value of the lang attribute of a parent. Then that parent's attribute changes and somehow the cached value needs to be invalidated for all the descendants of that parent. The cost of a cache is the invalidation cost, and since there's no invalidation here it's unclear we could afford it.

One issue with (3):

    E) It's generally much better to use iteration to walk through the parent list than to use recursion; we don't want an algorithm that uses machine stack proportional to the depth of the DOM tree if we can avoid it.
Comment 6 Jungshik Shin 2009-03-09 12:01:12 PDT
Thanks a lot for your very extensive review comment. In bug 16801 comment 3, Dave mentioned that lang should be in the style. Your comment on RareData led me to find that ElementRareData has access to computedStyle. So, if lang becomes available in the style, perhaps, we can just access it via ElementRareData. Then, I think there's no need to add it to ElementRareData (let alone Element or Node). 
 
Comment 7 Avram Lyon 2010-03-18 23:10:04 PDT
Before this lands, please note that there are other affected languages. The Latin scripts for Crimean Tatar (crh), Volga Tatar (tt), and Bashkir (ba) all use the Turkish/Azerbaijani-style i/İ and ı/I pairings. All three have both Cyrillic and Latin scripts, and only the latter is affected, so perhaps this would require the use of script variants (e.g., tt-Latn), but Azerbaijani also has a well-represented Cyrillic script. I would like to see the new text-transform behavior apply to tt, crh, and ba as well.
Comment 8 Jungshik Shin 2011-06-24 10:49:02 PDT
Ironically, Element.cpp has computeInheritedLanguage almost identical to my patch here with the same shortcomings :-) It's added by isherman on 2010-09-02. ( http://trac.webkit.org/changeset/66647 )


Anyway, I'll try to use it for text-transform. 

I'm also looking into the possibility of mapping 'lang' to '-webkit-locale' in HTMLElement.cpp (the same way HTML attribute dir is mapped to CSS property direction. I don't understand why we introduced '-webkit-locale' when we have 'lang' to determine what locale to use for  hyphenation, line breaking, text-transform,  font selection, etc.  

However, in retrospect, introducing '-webkit-locale' seems to be a brilliant move because that way, Darin's comment in this bug is almost automatically addressed. 

Perhaps, I'll do that in bug 16801.
Comment 9 Ilya Sherman 2011-06-24 14:28:03 PDT
(In reply to comment #8)
> Ironically, Element.cpp has computeInheritedLanguage almost identical to my patch here with the same shortcomings :-) It's added by isherman on 2010-09-02. ( http://trac.webkit.org/changeset/66647 )

Does my patch really have the same issues (I assume you're referring to comment #5)?  It does not add any caching, so it looks to me like it satisfies Darin's concerns.  If not, though, we should probably fix it :-)
Comment 10 Jungshik Shin 2011-06-24 14:57:34 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > Ironically, Element.cpp has computeInheritedLanguage almost identical to my patch here with the same shortcomings :-) It's added by isherman on 2010-09-02. ( http://trac.webkit.org/changeset/66647 )
> 
> Does my patch really have the same issues (I assume you're referring to comment #5)?  It does not add any caching, so it looks to me like it satisfies Darin's concerns.  If not, though, we should probably fix it :-)

Sorry I missed two differences (recursion vs iteration and caching). I'll try to map 'lang' (HTML attribute) to '-webkit-locale' (CSS property), which may speed things up for font-fallback handing patch (in bug 10874).
Comment 11 Jungshik Shin 2011-07-19 15:58:11 PDT
Created attachment 101403 [details]
pilot patch (to work with/test the patch for bug 16801) 

This is not meant to be for review or anything. It's just to test the patch for bug 16801.  

When bug 16801 is fixed, perhaps I have to add a version of String::makeUpper() and String::makeLower() that accept a locale as an input.   

With the patch for bug 16801 and this patch, it works correctly for Turkish tests at http://www.w3.org/International/tests/tests-html-css/list-text-transform
Comment 12 Darin Adler 2013-10-04 22:05:47 PDT
Created attachment 213441 [details]
Patch
Comment 13 Ryosuke Niwa 2013-10-04 22:20:41 PDT
Comment on attachment 213441 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=213441&action=review

> Source/WTF/wtf/text/StringImpl.cpp:570
> +#if !USE(ICU_UNICODE)
> +    UNUSED_PARAM(localeIdentifier);
> +    return lower();
> +#else

Can we move this stuff into wtf/unicode?
We can just leave the non-ICU version fallback to the regular lower().

> Source/WTF/wtf/text/StringImpl.cpp:572
> +    if (!(localeIdentifier == "tr" || localeIdentifier == "az"))

It seems somewhat fragile to hard-code these two languages.

> Source/WTF/wtf/text/StringImpl.cpp:587
> +    u_strToLower(data16, realLength, source16, length, "tr", &status);

The documentation says &status cannot indicate a failure. So maybe we need to reset the value here?
But we're probably not getting the right real length if there was a failure anyway so I'm not sure if that's what we're supposed to do.
Comment 14 Ryosuke Niwa 2013-10-04 22:25:27 PDT
Comment on attachment 213441 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=213441&action=review

> Source/WTF/wtf/text/StringImpl.cpp:584
> +    if (U_SUCCESS(status) && realLength == length)

Is it correct to hard-code "tr" here?  Perhaps this deserves a comment.
Comment 15 EFL EWS Bot 2013-10-04 22:29:21 PDT
Comment on attachment 213441 [details]
Patch

Attachment 213441 [details] did not pass efl-ews (efl):
Output: http://webkit-queues.appspot.com/results/3334026
Comment 16 mitz 2013-10-04 22:38:46 PDT
Comment on attachment 213441 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=213441&action=review

>> Source/WTF/wtf/text/StringImpl.cpp:572
>> +    if (!(localeIdentifier == "tr" || localeIdentifier == "az"))
> 
> It seems somewhat fragile to hard-code these two languages.

Is it sufficient to check for these two strings? Shouldn’t localeIdentifier values such as "tr-TR" or even “tR-Tr” be handled too?
Comment 17 Darin Adler 2013-10-04 22:54:13 PDT
Created attachment 213442 [details]
Patch
Comment 18 Darin Adler 2013-10-04 22:56:05 PDT
(In reply to comment #16)
> Shouldn’t localeIdentifier values such as "tr-TR" or even “tR-Tr” be handled too?

Possibly.

If so, then we need test coverage for that.

And if true, then it seems wrong for us to use the type AtomicString for localeIdentifier values in RenderStyle, since equality comparison is an inappropriate way to interpret them.
Comment 19 Ryosuke Niwa 2013-10-04 23:02:26 PDT
Comment on attachment 213442 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=213442&action=review

> Source/WTF/wtf/text/StringImpl.cpp:577
> +    if (!(localeIdentifier == "tr" || localeIdentifier == "az"))
> +        return lower();

We should do case-insensitive comparison according to BCP47 format as specified in
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-lang-and-xml:lang-attributes

We can add then add test cases for TR-TR, etc...
Comment 20 Darin Adler 2013-10-04 23:37:10 PDT
Committed r156948: <http://trac.webkit.org/changeset/156948>