Bug 147310 - [GTK] [EFL] Hyphenation can never work in practice due to requirements on lang tags
Summary: [GTK] [EFL] Hyphenation can never work in practice due to requirements on lan...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebCore Misc. (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC Linux
: P2 Normal
Assignee: Martin Robinson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-07-26 13:14 PDT by Michael Catanzaro
Modified: 2016-01-14 09:32 PST (History)
4 users (show)

See Also:


Attachments
Patch (77.90 KB, patch)
2016-01-13 16:24 PST, Martin Robinson
no flags Details | Formatted Diff | Diff
Patch (77.86 KB, patch)
2016-01-13 16:26 PST, Martin Robinson
no flags Details | Formatted Diff | Diff
Patch (77.61 KB, patch)
2016-01-14 08:16 PST, Martin Robinson
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Catanzaro 2015-07-26 13:14:27 PDT
There is another reason besides the vendor prefix that our hyphenation support does not work on any site. WebKit requires an appropriate language declaration in order to hyphenate text. For example, this works in WebKit:

<html lang="en_US">

But that doesn't work in Firefox. Firefox requires either of:

<html lang="en-US">
<html lang="en">

Neither of which work in WebKit!

Note that en_US matches the dictionary name hyph_en_US.dic, but it is not a valid lang tag according to any reference I could find (e.g. [1] [2]), so I don't see why it should work. en and en-US are both valid lang tags. en-US should definitely work; it would be nice to make en work as well, picking some default country code for each language if possible, since that is what Firefox does [3].

But all of our layout tests explicitly test to make sure en_US works and that en does not work (but does not check en-US), e.g. fast/text/hyphenate-locale.html... (what does fast stand for?). They are using the -webkit-locale tag rather than the lang tag, though; presumably those are not equivalent.

I will pile on to this bug: the lang tag should be case-insensitive according to [1], but I believe we treat it as case-sensitive.

[1] http://tools.ietf.org/html/bcp47
[2] http://www.w3.org/International/articles/language-tags/
[3] https://developer.mozilla.org/en-US/docs/Web/CSS/hyphens
Comment 1 Michael Catanzaro 2015-08-27 10:54:44 PDT
Martin, did you get a chance to look at this? We were going to mention hyphenation as a new feature in 2.10, but can't really, due to this....
Comment 2 Martin Robinson 2016-01-13 16:24:07 PST
Created attachment 268911 [details]
Patch
Comment 3 Martin Robinson 2016-01-13 16:26:05 PST
Created attachment 268912 [details]
Patch
Comment 4 Martin Robinson 2016-01-13 16:27:36 PST
mcatanzaro or cgarcia, would one of you be willing to review this?
Comment 5 Michael Catanzaro 2016-01-13 18:22:51 PST
Comment on attachment 268912 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=268912&action=review

Thanks for working on this! Maybe if we can also unprefix this, then we will start seeing some nice hyphenation on the web....

> Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:65
> +    auto addToLocaleMap = [&](AtomicString locale, const String& filePath)

Why a local lambda instead of a normal function? Is it just a convenience so you don't have to pass availableLocales each time? It's fine I suppose.

> Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:67
> +        HashMap<AtomicString, Vector<String>>::iterator i = availableLocales.find(locale);

I'd use auto here.

> Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:79
> +        addToLocaleMap(AtomicString(locale), filePath);

Why are we supporting the underscores in locale names? I could not find any evidence that we are supposed to accept these, or that any other engine is. The advantage of not supporting them is to avoid tricking a developer into thinking it might work in other browsers just because it works in WebKitGTK+. So I would support hyphens only, unless you've found a reference that suggests we should be accepting underscores.

> Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:87
> +        if (dividerPosition != notFound) {

Seems that en is going to be mapped to some random English dictionary, rather than en-US, as I might expect. But I think this is fine, because all the other dictionaries are symlinks anyway, and in the off chance that some language ever has region-specific hyphenation dictionaries, it's probably not going to be a very big deal....

> LayoutTests/ChangeLog:8
> +        Update some baselines and add a GTK+ specific test for locale variations.

I don't think this should be a GTK-specific test. Of course the expected results will need to be generated for each platform, but the general idea that these locales should be treated identically should be tested on all ports. So I would move this from platform/gtk/fast/text up to fast/text, upload the patch again, and get Mac results from the bot that will complain when you do so. I'd mark it as an expected failure for platforms that don't behave properly. I think the test would pass on platforms that lack hyphenation support, but if not, it can be skipped.
Comment 6 Martin Robinson 2016-01-13 20:20:51 PST
(In reply to comment #5)
> Comment on attachment 268912 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=268912&action=review
> 
> Thanks for working on this! Maybe if we can also unprefix this, then we will
> start seeing some nice hyphenation on the web....
> 
> > Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:65
> > +    auto addToLocaleMap = [&](AtomicString locale, const String& filePath)
> 
> Why a local lambda instead of a normal function? Is it just a convenience so
> you don't have to pass availableLocales each time? It's fine I suppose.

Yes, and additionally I don't really want to make it available for callers outside this function.
 
> > Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:67
> > +        HashMap<AtomicString, Vector<String>>::iterator i = availableLocales.find(locale);
> 
> I'd use auto here.

Sure.
 
> > Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:79
> > +        addToLocaleMap(AtomicString(locale), filePath);
> 
> Why are we supporting the underscores in locale names? I could not find any
> evidence that we are supposed to accept these, or that any other engine is.
> The advantage of not supporting them is to avoid tricking a developer into
> thinking it might work in other browsers just because it works in
> WebKitGTK+. So I would support hyphens only, unless you've found a reference
> that suggests we should be accepting underscores.

Underscores are used in fast/text/hyphenate-locale.html originally introduced for the Mac port. The test results indicate that hyphenation is supported for this locale as well. I think we should continue to support underscores for this reason.

> > Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:87
> > +        if (dividerPosition != notFound) {
> 
> Seems that en is going to be mapped to some random English dictionary,
> rather than en-US, as I might expect. But I think this is fine, because all
> the other dictionaries are symlinks anyway, and in the off chance that some
> language ever has region-specific hyphenation dictionaries, it's probably
> not going to be a very big deal....

What happens is that we now check every dictionary with the "en" prefix. So "en" will check "en_US" and also "en_GB." There is probably a better approach, but this was the one that I felt had the greatest chance of finding a proper hyphenation. For instance, the GB dictionary that we use for testing does not properly hyphenate "throughout."
 
> > LayoutTests/ChangeLog:8
> > +        Update some baselines and add a GTK+ specific test for locale variations.
> 
> I don't think this should be a GTK-specific test. Of course the expected
> results will need to be generated for each platform, but the general idea
> that these locales should be treated identically should be tested on all
> ports. So I would move this from platform/gtk/fast/text up to fast/text,
> upload the patch again, and get Mac results from the bot that will complain
> when you do so. I'd mark it as an expected failure for platforms that don't
> behave properly. I think the test would pass on platforms that lack
> hyphenation support, but if not, it can be skipped.

The reason that I added a platform specific test is that I'm not sure if the spec requires that these locales are supported or indeed if the Mac results will be the same. Since I want to use a reference test I would have to mark the test as failing on Mac, which may not be appropriate since I'm not sure if this is a behavior that they want to ensure. What I can do is contact the author of the Mac hyphenation support and see if they are interested in sharing this test. If so we can move it later to a platform-independent directory. Does that seem like a reasonable compromise?
Comment 7 Michael Catanzaro 2016-01-13 20:33:34 PST
(In reply to comment #6) 
> Underscores are used in fast/text/hyphenate-locale.html originally
> introduced for the Mac port. The test results indicate that hyphenation is
> supported for this locale as well. I think we should continue to support
> underscores for this reason.

OK then.

> What happens is that we now check every dictionary with the "en" prefix. So
> "en" will check "en_US" and also "en_GB." There is probably a better
> approach, but this was the one that I felt had the greatest chance of
> finding a proper hyphenation. For instance, the GB dictionary that we use
> for testing does not properly hyphenate "throughout."

OK then. Though I'm a bit surprised that you have an en_GB hyphenation dictionary that is separate from en_US. On Fedora hyph_en_GB.dic is a symlink to hyph_en_US.dic.

> The reason that I added a platform specific test is that I'm not sure if the
> spec requires that these locales are supported or indeed if the Mac results
> will be the same. Since I want to use a reference test I would have to mark
> the test as failing on Mac, which may not be appropriate since I'm not sure
> if this is a behavior that they want to ensure. What I can do is contact the
> author of the Mac hyphenation support and see if they are interested in
> sharing this test. If so we can move it later to a platform-independent
> directory. Does that seem like a reasonable compromise?

Yes!
Comment 8 Carlos Garcia Campos 2016-01-13 23:25:17 PST
Comment on attachment 268912 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=268912&action=review

> Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:74
> +        HashMap<AtomicString, Vector<String>>::iterator i = availableLocales.find(locale);
> +        if (i != availableLocales.end()) {
> +            i->value.append(filePath);
> +            return;
> +        }
> +
> +        Vector<String> filePaths = { filePath };
> +        availableLocales.set(locale, WTFMove(filePaths));

I think all this could be simplified as:

availableLocales.add(locale, Vector<String>()).iterator->value.append(filePath);

> Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:238
> +    ASSERT(availableLocales().contains(localeIdentifier));
> +    String locale = AtomicString(localeIdentifier.string().convertToASCIILowercase());

I'm confused here, locales are now always converted to ascii lower case before being added to the map, could this assert fail then? since you are converting to ascii lower case after the assert.
Comment 9 Martin Robinson 2016-01-14 07:55:53 PST
(In reply to comment #8)
> Comment on attachment 268912 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=268912&action=review
> 
> > Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:74
> > +        HashMap<AtomicString, Vector<String>>::iterator i = availableLocales.find(locale);
> > +        if (i != availableLocales.end()) {
> > +            i->value.append(filePath);
> > +            return;
> > +        }
> > +
> > +        Vector<String> filePaths = { filePath };
> > +        availableLocales.set(locale, WTFMove(filePaths));
> 
> I think all this could be simplified as:
> 
> availableLocales.add(locale,
> Vector<String>()).iterator->value.append(filePath);

Cool! I wasn't aware of the add() method. 

> > Source/WebCore/platform/text/hyphen/HyphenationLibHyphen.cpp:238
> > +    ASSERT(availableLocales().contains(localeIdentifier));
> > +    String locale = AtomicString(localeIdentifier.string().convertToASCIILowercase());
> 
> I'm confused here, locales are now always converted to ascii lower case
> before being added to the map, could this assert fail then? since you are
> converting to ascii lower case after the assert.

I think the assertion is simply wrong here. I'll fix when I land. Nice catch.
Comment 10 Martin Robinson 2016-01-14 08:16:49 PST
Created attachment 268968 [details]
Patch
Comment 11 WebKit Commit Bot 2016-01-14 09:32:38 PST
Comment on attachment 268968 [details]
Patch

Clearing flags on attachment: 268968

Committed r195058: <http://trac.webkit.org/changeset/195058>
Comment 12 WebKit Commit Bot 2016-01-14 09:32:41 PST
All reviewed patches have been landed.  Closing bug.