Bug 48225

Summary: Language attribute validation
Product: WebKit Reporter: Leandro Graciá Gil <leandrogracia>
Component: DOMAssignee: Leandro Graciá Gil <leandrogracia>
Status: RESOLVED WONTFIX    
Severity: Normal CC: ap, jorlow, satish
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch ap: review-

Description Leandro Graciá Gil 2010-10-25 02:32:15 PDT
The contents of the language attribute are expected to follow the BCP47 specification (http://www.rfc-editor.org/rfc/bcp/bcp47.txt) however no validation at all is performed on it. This patch introduces a brief validation to filter the language attribute contents to [A-Za-z0-9-]* with the purpose of making it safe to use it, for example, as a parameter in URLs. Please note that this doesn't guarantee that the attribute contains a completely valid language tag according to BCP47 as this is not the purpose of this patch.
Comment 1 Leandro Graciá Gil 2010-10-25 02:56:30 PDT
Created attachment 71724 [details]
Patch
Comment 2 Satish Sampath 2010-10-25 03:03:02 PDT
Comment on attachment 71724 [details]
Patch

Since this is a standalone patch, can you add layout tests to check for valid and invalid language codes?
Comment 3 Alexey Proskuryakov 2010-10-25 08:26:41 PDT
Anyone is welcome to review patches, but only reviewers should actually mark them r+ or r-.

I agree that this needs a test.
Comment 4 Leandro Graciá Gil 2010-10-26 07:02:54 PDT
Created attachment 71873 [details]
Patch
Comment 5 Leandro Graciá Gil 2010-10-26 07:17:32 PDT
Created attachment 71878 [details]
Patch
Comment 6 Jeremy Orlow 2010-10-26 07:28:56 PDT
Comment on attachment 71873 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=71873&action=review

> LayoutTests/fast/dom/Element/language-valid.html:13
> +    <div id="console"></div>

put this under the <p>'s.  Otherwise it's kind of confusing to read the output.

> WebCore/ChangeLog:12
> +        https://bugs.webkit.org/show_bug.cgi?id=48225

move this under the first line (before the blank new line)

> WebCore/dom/Element.cpp:1508
> +    if (!value.isNull() && !value.isEmpty()) {

IsEmpty is a superset of isNull.  When in doubt, please look these things up so that code isn't more complex than it needs to be.

> WebCore/dom/Element.cpp:1510
> +        CString asciiOnly = value.string().ascii();

It seems as though we should be able to do this check without converting the whole string to ascii first.  Maybe this isn't a performance concern and thus keeping the code simple is best, but it seems as though you should be able to inspect the data directly.
Comment 7 Satish Sampath 2010-10-26 07:30:57 PDT
Comment on attachment 71878 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=71878&action=review

> LayoutTests/fast/dom/Element/language-valid.html:26
> +        jsTestIsAsync = true;

Since this test is not really async (no handlers, everything done sequentially when the page loads), I think you can remove this line and the call to 'finishJSTest()' below.
Comment 8 Leandro Graciá Gil 2010-10-26 07:36:25 PDT
Created attachment 71881 [details]
Patch
Comment 9 Leandro Graciá Gil 2010-10-26 07:42:40 PDT
(In reply to comment #7)
> (From update of attachment 71878 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=71878&action=review
> 
> > LayoutTests/fast/dom/Element/language-valid.html:26
> > +        jsTestIsAsync = true;
> 
> Since this test is not really async (no handlers, everything done sequentially when the page loads), I think you can remove this line and the call to 'finishJSTest()' below.

Done. Going to apply Jeremy's updates now.
Comment 10 Leandro Graciá Gil 2010-10-26 07:45:48 PDT
(In reply to comment #6)
> (From update of attachment 71873 [details])
> > WebCore/dom/Element.cpp:1510
> > +        CString asciiOnly = value.string().ascii();
> 
> It seems as though we should be able to do this check without converting the whole string to ascii first.  Maybe this isn't a performance concern and thus keeping the code simple is best, but it seems as though you should be able to inspect the data directly.

There's another reason of doing this, and it's that ascii deals with the UTF16 encoding and converts all non-ascii characters to the question mark symbol. That way it avoids any nasty problems coming from the representation of non-ascii characters.
Comment 11 Leandro Graciá Gil 2010-10-26 09:20:47 PDT
Created attachment 71900 [details]
Patch
Comment 12 Jeremy Orlow 2010-10-26 09:27:40 PDT
Comment on attachment 71900 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=71900&action=review

Does anyone else have any thoughts?

> LayoutTests/fast/dom/Element/language-valid.html:15
> +    <p id="description"></p>

I think description is still best on top

> WebCore/dom/Element.cpp:1509
> +        // Only A-Za-z0-9 and '-' are allowed in the language tag (see BCP 47)

period at end

> WebCore/dom/Element.cpp:1510
> +        size_t length = value.length();

not sure if there's any value to this being broken out.  Maybe just replace length with it?
Comment 13 Leandro Graciá Gil 2010-10-26 09:40:51 PDT
Created attachment 71901 [details]
Patch
Comment 14 Jeremy Orlow 2010-10-26 09:42:40 PDT
Comment on attachment 71901 [details]
Patch

LGTM.  Ask me to r+ it tomorrow if no one else raises an issue in the mean time.
Comment 15 Leandro Graciá Gil 2010-10-26 09:43:54 PDT
(In reply to comment #14)
> (From update of attachment 71901 [details])
> LGTM.  Ask me to r+ it tomorrow if no one else raises an issue in the mean time.

Ok. Thank you.
Comment 16 Alexey Proskuryakov 2010-10-26 10:07:57 PDT
Comment on attachment 71901 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=71901&action=review

> LayoutTests/ChangeLog:9
> +        The contents of the language attribute are expected to follow the BCP 47
> +        specification. However no validation at all is performed on it. This patch

What are the specs governing this? In HTML5, I could find that in two places:
- authoring requirements, which are irrelevant to browsers;
- non-normative list of attributes, saying that value of hreflang, lang, and srclang attributes is a "valid BCP 47 language tag".

So, I don't see that requirement specified anywhere. And Firefox 3.6 fails the test case included in this patch. Does IE pass it?

> LayoutTests/fast/dom/Element/language-valid.html:9
> +        p:lang(en) { text-transform: uppercase; }
> +        p:lang(iην41iδ_ã»4g) { text-transform: uppercase; }

We'd need tests for more forbidden characters - ideally, for every forbidden ASCII character, like en/GB.

Please use red/green color in results for easier visual evaluation <http://www.w3.org/Style/CSS/Test/guidelines.html#color>.

> WebCore/dom/Element.cpp:1513
> +            if (!isASCIIAlphanumeric(c) && c != '-')
> +                return AtomicString();

Is an invalid value equivalent to lang="", or should it be ignored?
Comment 17 Leandro Graciá Gil 2010-10-26 10:20:20 PDT
(In reply to comment #16)
> (From update of attachment 71901 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=71901&action=review
> 
> > LayoutTests/ChangeLog:9
> > +        The contents of the language attribute are expected to follow the BCP 47
> > +        specification. However no validation at all is performed on it. This patch
> 
> What are the specs governing this? In HTML5, I could find that in two places:
> - authoring requirements, which are irrelevant to browsers;
> - non-normative list of attributes, saying that value of hreflang, lang, and srclang attributes is a "valid BCP 47 language tag".
> 
> So, I don't see that requirement specified anywhere. And Firefox 3.6 fails the test case included in this patch. Does IE pass it?
> 

I was referring to the non-normative attributes. This is not required to be implemented, but thought that it should be good to have some kind of validation. If you think it's not necessary at all we can drop this issue.
Comment 18 Alexey Proskuryakov 2010-10-26 10:24:17 PDT
We certainly should be very suspicious of requirements not mandated by specs, especially if no other browser implements such.
Comment 19 Leandro Graciá Gil 2010-10-26 10:44:53 PDT
(In reply to comment #18)
> We certainly should be very suspicious of requirements not mandated by specs, especially if no other browser implements such.

Agreed. Just tested in IE and it doesn't seem to implement it either. Let's drop this. Thank you for your time and sorry.
Comment 20 Leandro Graciá Gil 2010-10-26 10:45:41 PDT
Please mark it as a WONTFIX as I can't do it by myself.
Comment 21 Jeremy Orlow 2010-10-26 10:48:44 PDT
Sounds like a plan.

P.S. Sorry for not checking whether this was normative text or what other browsers will do.  Will try to do that in such reviews in the future!
Comment 22 Alexey Proskuryakov 2010-10-26 10:56:53 PDT
You may want to report this as a bug in HTML5 draft, as the non-normative description is misleading.
Comment 23 Alexey Proskuryakov 2010-10-26 10:58:26 PDT
We could also land a regression test verifying that non-ASCII lang values actually work. I think that would be fairly valuable.
Comment 24 Leandro Graciá Gil 2010-10-27 02:12:23 PDT
(In reply to comment #23)
> We could also land a regression test verifying that non-ASCII lang values actually work. I think that would be fairly valuable.

Actually, I found while developing the layout test that if the language tag in the p:lang() contained an @ then it was not properly assigned. I'm not sure if this is a CSS syntax problem, though. I got no complains in the test results. It did work however with many invalid characters like Greek and Japanese letters.

An example would be:

 p:lang(doesnt@work) { text-transform: uppercase; }

 ...

  <p id="invalid" lang="doesn@work">This line won't be in uppercase at all. Not because of the patch but because the CSS style doesn't seem to be properly associated by the lang attribute.</p>