Bug 21312

Summary: text-transform: lowercase is not lang-dependent (Turkish languages : tr,az)
Product: WebKit Reporter: Jungshik Shin <jshin>
Component: CSSAssignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: ajlyon, ap, benjamin, cbguder, cmarcelo, commit-queue, darin, eflews.bot, eric, esprehn+autocc, glenn, gyuyoung.kim, ian, isherman, kondapallykalyan, mitz, nickshanks, philn, rniwa, webkit.bugzilla, xan.lopez
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
URL: http://www.w3.org/International/tests/sec-text-transform-1
Bug Depends on: 67586    
Bug Blocks:    
Attachments:
Description Flags
tentative 1st step toward a fix
darin: review-
pilot patch (to work with/test the patch for bug 16801)
none
Patch
none
Patch rniwa: review+

Jungshik Shin
Reported 2008-10-02 10:50:50 PDT
1. Go to the URL above and scroll down the Turkish section Expected: U+0049 should be transformed to U+0131 (lowercase letter dotless I). U+0130 (latin capital letter dotted I) should be transformed to U+0069 (lowercase I = i). Actual: U+0049 and U+0130 are transformed to U+0069 and <U+0069, U+0307> ('i' followed by a combining mark dot above). This is a subset of the problem (not honoring 'lang'/'xml:lang' in CSS).
Attachments
tentative 1st step toward a fix (2.60 KB, patch)
2009-03-06 09:18 PST, Jungshik Shin
darin: review-
pilot patch (to work with/test the patch for bug 16801) (2.74 KB, patch)
2011-07-19 15:58 PDT, Jungshik Shin
no flags
Patch (14.15 KB, patch)
2013-10-04 22:05 PDT, Darin Adler
no flags
Patch (26.49 KB, patch)
2013-10-04 22:54 PDT, Darin Adler
rniwa: review+
Alexey Proskuryakov
Comment 1 2008-10-02 11:05:48 PDT
Jungshik Shin
Comment 2 2009-03-06 09:18:30 PST
Created attachment 28363 [details] tentative 1st step toward a fix I realized that Eric added the code language 'calcuation' for pseudo-lang selector support (so my previous comment about the lack of lang-dependent behavior is wrong). To refer to the value of lang in text transform (RenderText), I moved the code to Node. The second part will be to make upper/lower/capitalize lang-dependent (using ICU for ports that use it and by other means elsewhere) I also want to use this for bug 10874 (and bug 18085) as well. Before going further, I'd like to know whether I'm on the right track. So, it'd be nice if any of you can take a quick look. Thanks !
Jungshik Shin
Comment 3 2009-03-09 10:41:01 PDT
Comment on attachment 28363 [details] tentative 1st step toward a fix I don't expect r+ for this patch (this patch even has '#if 0-enclosed code'), but I like to get some (potentially harsh) feedback to move things forward because the patch is also necessary for lang-based font selection bugs (mentioned earlier).
Jungshik Shin
Comment 4 2009-03-09 10:49:24 PDT
Comment on attachment 28363 [details] tentative 1st step toward a fix I realized that bug 16801 deals with this issue and uploaded this path there.
Darin Adler
Comment 5 2009-03-09 10:52:26 PDT
Comment on attachment 28363 [details] tentative 1st step toward a fix There are 3 things done here: 1) Move the language-determining code for the style system from CSSStyleSelector into a member function in the Node class. 2) Adding a cache in every Node object to store that language code. 3) Changes to the logic of the language-determining code. I can't easily comment on much of (3), because the patch doesn't show the change from the old to the new due to the use of #if 0. There are some issues with (1): A) The style system only needs the language for elements, not other nodes, I would have the code in Element, not Node. But we won't want to dedicate 4 bytes in every Element to caching the language for that element. See below. In general, we need to keep the number of function members in these basic classes to a minimum, so it's unfortunate that the code is moving here. B) Breaking the language-computation out into a separate function is fine. That could be done as a first step of refactoring inside the CSSStyleSelector class. I don't think the function needs to be a member of the Node classes. The biggest problems are with (2): C) There's no way we can make all nodes 4 bytes bigger just to make it faster to retrieve the value of the lang attribute, so that aspect of this patch is unacceptable. There are techniques we could use, such as putting the cached language into the rare data or using an external hash table that could keep the cost down for typical elements. D) There's no code to invalidate the cached language when attributes change or when the DOM tree structure changes. For example, if you store the value of an attribute such as langAttr, you then need code to deal with the case where that attribute changes. This becomes even more complex when you cache the value of the lang attribute of a parent. Then that parent's attribute changes and somehow the cached value needs to be invalidated for all the descendants of that parent. The cost of a cache is the invalidation cost, and since there's no invalidation here it's unclear we could afford it. One issue with (3): E) It's generally much better to use iteration to walk through the parent list than to use recursion; we don't want an algorithm that uses machine stack proportional to the depth of the DOM tree if we can avoid it.
Jungshik Shin
Comment 6 2009-03-09 12:01:12 PDT
Thanks a lot for your very extensive review comment. In bug 16801 comment 3, Dave mentioned that lang should be in the style. Your comment on RareData led me to find that ElementRareData has access to computedStyle. So, if lang becomes available in the style, perhaps, we can just access it via ElementRareData. Then, I think there's no need to add it to ElementRareData (let alone Element or Node).
Avram Lyon
Comment 7 2010-03-18 23:10:04 PDT
Before this lands, please note that there are other affected languages. The Latin scripts for Crimean Tatar (crh), Volga Tatar (tt), and Bashkir (ba) all use the Turkish/Azerbaijani-style i/İ and ı/I pairings. All three have both Cyrillic and Latin scripts, and only the latter is affected, so perhaps this would require the use of script variants (e.g., tt-Latn), but Azerbaijani also has a well-represented Cyrillic script. I would like to see the new text-transform behavior apply to tt, crh, and ba as well.
Jungshik Shin
Comment 8 2011-06-24 10:49:02 PDT
Ironically, Element.cpp has computeInheritedLanguage almost identical to my patch here with the same shortcomings :-) It's added by isherman on 2010-09-02. ( http://trac.webkit.org/changeset/66647 ) Anyway, I'll try to use it for text-transform. I'm also looking into the possibility of mapping 'lang' to '-webkit-locale' in HTMLElement.cpp (the same way HTML attribute dir is mapped to CSS property direction. I don't understand why we introduced '-webkit-locale' when we have 'lang' to determine what locale to use for hyphenation, line breaking, text-transform, font selection, etc. However, in retrospect, introducing '-webkit-locale' seems to be a brilliant move because that way, Darin's comment in this bug is almost automatically addressed. Perhaps, I'll do that in bug 16801.
Ilya Sherman
Comment 9 2011-06-24 14:28:03 PDT
(In reply to comment #8) > Ironically, Element.cpp has computeInheritedLanguage almost identical to my patch here with the same shortcomings :-) It's added by isherman on 2010-09-02. ( http://trac.webkit.org/changeset/66647 ) Does my patch really have the same issues (I assume you're referring to comment #5)? It does not add any caching, so it looks to me like it satisfies Darin's concerns. If not, though, we should probably fix it :-)
Jungshik Shin
Comment 10 2011-06-24 14:57:34 PDT
(In reply to comment #9) > (In reply to comment #8) > > Ironically, Element.cpp has computeInheritedLanguage almost identical to my patch here with the same shortcomings :-) It's added by isherman on 2010-09-02. ( http://trac.webkit.org/changeset/66647 ) > > Does my patch really have the same issues (I assume you're referring to comment #5)? It does not add any caching, so it looks to me like it satisfies Darin's concerns. If not, though, we should probably fix it :-) Sorry I missed two differences (recursion vs iteration and caching). I'll try to map 'lang' (HTML attribute) to '-webkit-locale' (CSS property), which may speed things up for font-fallback handing patch (in bug 10874).
Jungshik Shin
Comment 11 2011-07-19 15:58:11 PDT
Created attachment 101403 [details] pilot patch (to work with/test the patch for bug 16801) This is not meant to be for review or anything. It's just to test the patch for bug 16801. When bug 16801 is fixed, perhaps I have to add a version of String::makeUpper() and String::makeLower() that accept a locale as an input. With the patch for bug 16801 and this patch, it works correctly for Turkish tests at http://www.w3.org/International/tests/tests-html-css/list-text-transform
Darin Adler
Comment 12 2013-10-04 22:05:47 PDT
Ryosuke Niwa
Comment 13 2013-10-04 22:20:41 PDT
Comment on attachment 213441 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=213441&action=review > Source/WTF/wtf/text/StringImpl.cpp:570 > +#if !USE(ICU_UNICODE) > + UNUSED_PARAM(localeIdentifier); > + return lower(); > +#else Can we move this stuff into wtf/unicode? We can just leave the non-ICU version fallback to the regular lower(). > Source/WTF/wtf/text/StringImpl.cpp:572 > + if (!(localeIdentifier == "tr" || localeIdentifier == "az")) It seems somewhat fragile to hard-code these two languages. > Source/WTF/wtf/text/StringImpl.cpp:587 > + u_strToLower(data16, realLength, source16, length, "tr", &status); The documentation says &status cannot indicate a failure. So maybe we need to reset the value here? But we're probably not getting the right real length if there was a failure anyway so I'm not sure if that's what we're supposed to do.
Ryosuke Niwa
Comment 14 2013-10-04 22:25:27 PDT
Comment on attachment 213441 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=213441&action=review > Source/WTF/wtf/text/StringImpl.cpp:584 > + if (U_SUCCESS(status) && realLength == length) Is it correct to hard-code "tr" here? Perhaps this deserves a comment.
EFL EWS Bot
Comment 15 2013-10-04 22:29:21 PDT
mitz
Comment 16 2013-10-04 22:38:46 PDT
Comment on attachment 213441 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=213441&action=review >> Source/WTF/wtf/text/StringImpl.cpp:572 >> + if (!(localeIdentifier == "tr" || localeIdentifier == "az")) > > It seems somewhat fragile to hard-code these two languages. Is it sufficient to check for these two strings? Shouldn’t localeIdentifier values such as "tr-TR" or even “tR-Tr” be handled too?
Darin Adler
Comment 17 2013-10-04 22:54:13 PDT
Darin Adler
Comment 18 2013-10-04 22:56:05 PDT
(In reply to comment #16) > Shouldn’t localeIdentifier values such as "tr-TR" or even “tR-Tr” be handled too? Possibly. If so, then we need test coverage for that. And if true, then it seems wrong for us to use the type AtomicString for localeIdentifier values in RenderStyle, since equality comparison is an inappropriate way to interpret them.
Ryosuke Niwa
Comment 19 2013-10-04 23:02:26 PDT
Comment on attachment 213442 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=213442&action=review > Source/WTF/wtf/text/StringImpl.cpp:577 > + if (!(localeIdentifier == "tr" || localeIdentifier == "az")) > + return lower(); We should do case-insensitive comparison according to BCP47 format as specified in http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-lang-and-xml:lang-attributes We can add then add test cases for TR-TR, etc...
Darin Adler
Comment 20 2013-10-04 23:37:10 PDT
Note You need to log in before you can comment on or make changes to this bug.