Bug 122261 - AX: WebKit should avoid exposing lang=en defined on root <html> because it's frequently incorrect
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Accessibility
Version: 528+ (Nightly build)
Hardware: All All
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2013-10-02 23:31 PDT by James Craig
Modified: 2013-10-23 19:17 PDT
Description James Craig 2013-10-02 23:31:00 PDT
AX: WebKit should avoid exposing lang="en" defined on the root <html> element because it's frequently incorrect, potentially an artifact of buggy authoring tools that assume everyone speaks English. WebKit could use some heuristics to determine the accuracy of these cases, but I'd propose:

1. Expose lang or xml:lang on <html> or <body> *only* when it does not contain the "en" language code.
2. Expose all values of lang or xml:lang on descendant elements of <body>.
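
The two rules above could be sketched roughly as follows. This is an illustrative Python sketch, not WebKit code: the function name `exposed_language` and its parameters are hypothetical, and it assumes that "contains the en language code" means the primary subtag is `en`, so a value like `en-US` is also suppressed on <html> and <body>. It also treats every element other than <html> and <body> as a descendant of <body>.

```python
from typing import Optional

# Elements whose lang="en" is treated as untrustworthy (rule 1).
ROOT_ELEMENTS = {"html", "body"}


def exposed_language(tag_name: str, lang: Optional[str]) -> Optional[str]:
    """Return the language to expose to accessibility, or None to suppress it.

    Hypothetical helper: tag_name is the element name, and lang is the
    value of its lang (or xml:lang) attribute, if any.
    """
    if lang is None:
        return None
    lang = lang.strip()
    if not lang:
        return None

    # Assumption: compare only the primary subtag, case-insensitively.
    primary = lang.replace("_", "-").split("-")[0].lower()

    # Rule 1: on <html> or <body>, suppress English values.
    if tag_name.lower() in ROOT_ELEMENTS and primary == "en":
        return None

    # Rule 2: every other value, and every other element, is exposed as-is.
    return lang
```

Under these assumptions, `exposed_language("html", "en")` yields no language, while `exposed_language("html", "ja")` and `exposed_language("p", "en")` pass the attribute value through unchanged.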

Examples:
http://cnnespanol.cnn.com (lists lang="en" on <html> element; should be "es")
http://www.amazon.de (lists lang="en" on <html> element; should be "de")

Other sites do get it right, for example, the localized version of facebook.com lists the correct language code.
Comment 1 Radar WebKit Bug Importer 2013-10-02 23:31:32 PDT
<rdar://problem/15139499>
Comment 2 James Craig 2013-10-02 23:38:34 PDT
Clarifying: "WebKit could use some heuristics to determine accuracy of these cases (where lang=en is accurate)." I'm suggesting some heuristics to determine whether the content language really is English, and if so, only then expose the "en" value for AXLanguage.

Anecdotal evidence indicates this isn't a problem for pages with a non-English value defined. For example, http://www.softbank.jp/ gets it right (lang="ja").
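
No specific heuristic is named here. Purely as an assumption, one simple shape it could take is a function-word check: count how many words in the page text are very common English words, and trust lang="en" only above some ratio. The word list, the 0.15 threshold, and the name `looks_english` below are all hypothetical, and a real heuristic would need far more care.

```python
import re

# Hypothetical, deliberately tiny list of common English function words.
ENGLISH_STOPWORDS = frozenset({
    "the", "and", "of", "to", "in", "is", "that",
    "for", "with", "on", "this", "are", "it", "as",
})


def looks_english(text: str, threshold: float = 0.15) -> bool:
    """Guess whether text is English from its share of common English words.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    # Split into runs of Unicode letters so accented words stay whole.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold
```

With a check like this, "en" would be exposed for AXLanguage only when the function returns true for the page's text content.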
Comment 3 Alexey Proskuryakov 2013-10-03 10:42:17 PDT
I don't think that we should limit this to AX - if we can decide that the lang attribute is incorrect, then it probably shouldn't be used for style calculations either.
Comment 4 chris fleizach 2013-10-03 10:52:40 PDT
(In reply to comment #3)
> I don't think that we should limit this to AX - if we can decide that the lang attribute is incorrect, then it probably shouldn't be used for style calculations either.

I don't know if this is a good idea. We'd essentially be saying a portion of defined HTML should be ignored because authors are doing it wrong. That could be true of so many different things.
Comment 5 Darin Adler 2013-10-03 12:38:13 PDT
We have had to do things like this in the past. When an HTML feature exists, but is ignored, for a long time, we often have to continue ignoring it when we implement features that rely on it. It was harmless to have the language set wrong for such a long time that we can’t be pedantic about honoring it. This has been true, for example, for encoding. But there are many other examples.

I think that if we do this we should propose making the improved heuristic a future part of HTML standards, not just a WebKit-specific rule.

I agree with Alexey that this is not an accessibility-specific issue, so I would not want the heuristic in the accessibility code.
Comment 6 James Craig 2013-10-04 00:37:26 PDT
(In reply to comment #4)

> I don't know if this is a good idea. We'd essentially be saying a portion of defined HTML should be ignored because authors are doing it wrong. That could be true of so many different things.

We do this to differentiate layout tables from data tables and it works very well. We have heuristics to weed out spacer images, too. I’d argue we need to do this in more places. Lists for example: bug 122320.
Comment 7 chris fleizach 2013-10-04 00:40:36 PDT
(In reply to comment #6)
> (In reply to comment #4)
> 
> > I don't know if this is a good idea. We'd essentially be saying a portion of defined HTML should be ignored because authors are doing it wrong. That could be true of so many different things.
> 
> We do this to differentiate layout tables from data tables and it works very well. We have heuristics to weed out spacer images, too. I’d argue we need to do this in more places. Lists for example: bug 122320.

I don't think these examples are equivalent. What you're proposing with this change would make it impossible for non-English speakers to go to a webpage with lang="en" and hear English spoken automatically.
Comment 8 James Craig 2013-10-04 00:41:37 PDT
(In reply to comment #5)

> I agree with Alexey that this is not an accessibility-specific issue, so I would not want the heuristic in the accessibility code.

Okay, let’s clone this one for the heuristics portion of it, and just do the simple solution above for accessibility, where we ignore EN defined on <html> due to the authoring tool errors.
Comment 9 James Craig 2013-10-04 00:46:39 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #4)
> > 
> > > I don't know if this is a good idea. We'd essentially be saying a portion of defined HTML should be ignored because authors are doing it wrong. That could be true of so many different things.
> > 
> > We do this to differentiate layout tables from data tables and it works very well. We have heuristics to weed out spacer images, too. I’d argue we need to do this in more places. Lists for example: bug 122320.
> 
> I don't think these examples are equivalent. What you're proposing with this change would make it impossible for non-English speakers to go to a webpage with lang="en" and hear English spoken automatically.

Currently, non-English-speaking screen reader users go to sites like the ones above and hear the *wrong* language voice, then have to override it manually. I’m suggesting we fix that bug, based on the evidence that many of these sites do the wrong thing on the <html> element. We could leave the <body> alone as a workaround, since we haven’t seen it in error in as many places.
Comment 10 James Craig 2013-10-04 00:48:33 PDT
Non-English-speaking users can still switch their language rotor and hear the current site in English, if they determine it actually is in English.