If a "Content-Language" header is included in the HTTP response, the value is not recorded in the Document. Only http-equiv values from the meta tags are recognized. Also, this information is not exported through WebDocument for outsiders to read. Patch forthcoming... -- Brian
Created attachment 166305 [details] Patch
Please wait for approval from abarth@webkit.org, dglazkov@chromium.org, fishd@chromium.org, jamesr@chromium.org or tkent@chromium.org before submitting, as this patch contains changes to the Chromium public API. See also https://trac.webkit.org/wiki/ChromiumWebKitAPI.
Attachment 166305 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/loader/DocumentWriter.cpp:124: One line control clauses should not use braces. [whitespace/braces] [4] Source/WebCore/ChangeLog:9: Line contains tab character. [whitespace/tab] [5] Source/WebCore/ChangeLog:10: Line contains tab character. [whitespace/tab] [5] Source/WebCore/ChangeLog:11: Line contains tab character. [whitespace/tab] [5] Source/WebCore/ChangeLog:12: Line contains tab character. [whitespace/tab] [5] Total errors found: 5 in 7 files If any of these errors are false positives, please file a bug against check-webkit-style.
Created attachment 166307 [details] Patch
Attachment 166307 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/loader/DocumentWriter.cpp:124: One line control clauses should not use braces. [whitespace/braces] [4] Total errors found: 1 in 7 files If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 166307 [details] Patch Teaching DocumentWriter about ResourceResponse seems undesirable to me, as it currently doesn't have any real concept of how the data is delivered, just that there are bits that need to go into a Document. Perhaps we could just send this data directly from DocumentLoader to Document, similarly to the setBaseURLOveride() call in DocumentLoader::commitData()?
There's one "style violation" but I'd like to keep it as it indicates that this is where more such extractions can be done in the future.
> There's one "style violation" but I'd like to keep it as it indicates that this is where more such extractions can be done in the future. If people need to add more statements to the body of the if statement, they're smart enough to add the braces themselves. :)
Comment on attachment 166307 [details] Patch This is not the correct way to plumb this information into Document. There are a bunch of other examples of Document learning information from HTTP header fields. They might provide useful templates for this sort of change.
Please consider separating cross-platform and Chromium changes into two patches.
Nate, I wasn't looking at it as "how it was delivered" but rather "what was delivered". Header contents are (sort of) the "beginning" of a web page and so it seemed reasonable to pass it to "begin" in order to include that information. I suppose that the DocumentWriter may be able to extract the embedded Document pointer (since they're both part of WebCore) from the WebDocument reference it has and inject the ContentLanguage value directly. Would you prefer that? -- Brian
DocumentWriter should not be involved in this operation. Please see Comment #9 for advice about how to implement this patch. This is not the first HTTP header that Documents are interested in learning about.
Adam, if you would provide a direct pointer to where it is being done, I would appreciate it. I looked for quite some time but could not find such examples. Thanks. -- Brian
> Adam, if you would provide a direct pointer to where it is being done, I would appreciate it. I looked for quite some time but could not find such examples. I grepped Source/WebCore for "httpHeaderField" and I found a bunch of calls in FrameLoader::didBeginDocument. That's what I'd try first.
Created attachment 166538 [details] Patch
Attachment 166538 [details] did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WebCore/ChangeLog', u'Source/WebCor..." exit_code: 1 Source/WebCore/ChangeLog:9: Line contains tab character. [whitespace/tab] [5] Total errors found: 1 in 2 files If any of these errors are false positives, please file a bug against check-webkit-style.
Thanks for the pointer, Adam. I think this is more what you're looking for. Oops... Forgot to add a test... Will do that tomorrow.
Comment on attachment 166538 [details] Patch Yeah, that looks much better. You've still got a style issue (a tab snuck unto your ChangeLog entry). Also, we'll need a test.
I'm looking at how to write a test for this... I don't see any existing tests for this area of the code. Am I just missing it?
The easiest way to make a test is to take any existing test for Content-Language in META tag, copy it to http/tests/misc, and convert that to a PHP file that prints this HTTP header.
Ah, an end-to-end test rather than a unit-test on that specific block of code. Okay, I'll try that.
Once you get this running, please add a number of tests that check parsing correctness (e.g. what if there are multiple values? What if there are disallowed characters?)
Created attachment 166881 [details] Patch
Alexey, this is injecting the value just as the http-equiv meta tag would do. Tests for that seem easier and faster as it doesn't need php or an http server to execute. I haven't yet found how .WebkitLocale gets set so don't know if there's any validation at all. Meanwhile... Is this test okay? -- Brian
Comment on attachment 166881 [details] Patch Does this test fail without this patch?
> Alexey, this is injecting the value just as the http-equiv meta tag would do. Tests for that seem easier and faster as it doesn't need php or an http server to execute. I do not understand what you are saying. The test included in the latest patch is a PHP script. Also, please do add tests for invalid and multiple values, as suggested previously.
Alexey, I'm saying that the tests you're requesting (invalid and multiple values) apply equally to an "http-equiv" for "content-language" as both that and this header copy the value to the same place in the Document data structure. However, I can write tests for http-equiv=content-language under fast/css entirely as HTML without need for PHP and a webserver to process it. That will make the tests faster and still yield the same amount of testing. Of course, I'm not yet even sure if there _is_ any parsing to test. Still looking...
> Alexey, I'm saying that the tests you're requesting (invalid and multiple values) apply equally to an "http-equiv" for "content-language" as both that and this header copy the value to the same place in the Document data structure. I don't know why this should be the case. These are governed by entirely separate specifications. Code paths for handling these are also not necessarily the same in browsers.
Adam, yes, the test fails without the patch. The value returned is "auto" if no HTTP header or http-equiv meta tag is present.
Alexey, I don't follow. "http-equiv" is a way of simulated the presence of a proper HTTP header. Why would the code paths (or tests) be different?
> "http-equiv" is a way of simulated the presence of a proper HTTP header. This is the original idea, but in reality, there is no formal connection between http-equiv and HTTP headers syntax. Specifically, rules for parsing HTTP headers are defined in RFC 2616, while HTML has its own rules for Content-Language: <http://www.whatwg.org/specs/web-apps/current-work/#attr-meta-http-equiv-content-language>. Oh, we also need a test for what happens when there are both an http-equiv and an HTTP header field, and they disagree. HTML spec currently says that http-equiv wins.
Comment on attachment 166881 [details] Patch r- to get this out of review queue. We clearly need better test coverage.
Alexey, good point about the header/equiv. The http-equiv wins (I knew that) but I'll create a test for that as well. As I read the docs at that link, it sounds like processing of the actual header should only set the document()->contentLanguage() if the value contains only one "intended language". Do you read that as well? Or would you prefer that I just drop this so that there is zero overlap between the HTTP header and http-equiv for Content-Language? -- Brian
Alexey, what do you see as the correct way to go forward with this? Only set the document()->contentLanguage() if the value contains only one "intended language" or just drop this so that there is zero overlap between the HTTP header and http-equiv for Content-Language?
This primarily depends on what you need this for. As far as heuristics for glyph selection and charset guessing are concerned, using just the first language is fine. Also, I don't know usage patterns on actual web sites.
Currently the http-equiv for Content Language code in WebKit just uses the verbatim content string to set the "pragma-set default language". So even a comma-separated list like "a, b" is taken as the language "a, b". I had some plans to improve this parsing but haven't changed anything yet, see bug 77724. See also bug 76701, comment 30 for some explanation as to why the spec is the way it is.
*** Bug 76892 has been marked as a duplicate of this bug. ***
_My_ interest in this is Chrome's auto-translation feature. It can try to guess the language but it's (presumably) more accurate to use what is explicitly declared. For that purpose, both HTTP header and an http-equiv tag, despite their somewhat different definitions, accomplish the same thing.
This patch does fix https://bugs.webkit.org/show_bug.cgi?id=76892. The w3.org test then passes "green". http://www.w3.org/International/tests/html-css/generate?format=h5&test=language-declarations-003
Created attachment 167799 [details] Patch
Latest patch extracts only the first language from the header to substitute for an http-equiv tag. Added comments with details and more tests.
Comment on attachment 167799 [details] Patch Attachment 167799 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/14228714
Comment on attachment 167799 [details] Patch Attachment 167799 [details] did not pass qt-ews (qt): Output: http://queues.webkit.org/results/14209996
Comment on attachment 167799 [details] Patch Attachment 167799 [details] did not pass qt-wk2-ews (qt): Output: http://queues.webkit.org/results/14217838
Created attachment 167809 [details] Patch
Comment on attachment 167809 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=167809&action=review Thank you for updating the patch. I think that it's quite close now. I have a number of nitpicks below, so another iteration would be beneficial. > Source/WebCore/loader/FrameLoader.cpp:678 > + // HTTP-header and http-equiv meta tag do not have exactly the same > + // meaning. The former is the "intended audience" and the latter > + // is the language of the document and must contain only a single > + // entry. If we detect the former with multiple entries separated > + // by commas, we take only the first one. Alternatively we could > + // ignore an HTTP Content-Language header with multiple entries. > + // http://tools.ietf.org/html/rfc2616#page-118 > + // http://www.whatwg.org/specs/web-apps/current-work/#attr-meta-http-equiv-content-language I would drop this whole comment. It is fairly obvious what's being done here, and if details were really important, one could always do an svn blame and find this bug. > Source/WebCore/loader/FrameLoader.cpp:679 > + size_t comma = headerContentLanguage.find(','); The variable holds comma position, not the comma itself, so it should be named accordingly. > Source/WebCore/loader/FrameLoader.cpp:683 > + m_frame->document()->setContentLanguage(headerContentLanguage); So we'll be setting content language to empty string if header field value were ",foo". This seems incorrect, please fix and add a regression test. > LayoutTests/http/tests/misc/extract-http-content-language-expected.txt:4 > +zh-CN > +ar It's not great that these tests don't have explicit pass/fail output. If one opens the test in browser, how would they know if the test passed? > LayoutTests/http/tests/misc/extract-http-content-language-multiple.php:2 > + header("Content-Language: fr, fi "); Could you try something a little more complicated, like maybe "fr \t , fi "?
Comment on attachment 167809 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=167809&action=review >> Source/WebCore/loader/FrameLoader.cpp:679 >> + size_t comma = headerContentLanguage.find(','); > > The variable holds comma position, not the comma itself, so it should be named accordingly. I had it named "comma_index" but got a style violation. Would "commaindex" be okay? >> Source/WebCore/loader/FrameLoader.cpp:683 >> + m_frame->document()->setContentLanguage(headerContentLanguage); > > So we'll be setting content language to empty string if header field value were ",foo". This seems incorrect, please fix and add a regression test. My thinking was that if the header was malformed then it's better not to put any additional effort into trying to guess what was intended because it's just as likely to guess wrong. >> LayoutTests/http/tests/misc/extract-http-content-language-expected.txt:4 >> +ar > > It's not great that these tests don't have explicit pass/fail output. If one opens the test in browser, how would they know if the test passed? And then just have "pass" in the -expected.txt file?
> I had it named "comma_index" but got a style violation. Would "commaindex" be okay? commaIndex or commaPosition would be fine. > > So we'll be setting content language to empty string if header field value were ",foo". This seems incorrect, please fix and add a regression test. > > My thinking was that if the header was malformed then it's better not to put any additional effort into trying to guess what was intended because it's just as likely to guess wrong. A Web browser typically deals with malformed input, and it's important to deal with it gracefully. This is less about guessing than about maintaining internal invariants - and expecting that a content language member variable holds either a non-empty string or a null string is reasonable. > And then just have "pass" in the -expected.txt file? That would work failure case, it's desirable to have more information about what went wrong.
That would work. In failure case, it's desirable to have more information about what went wrong.
Created attachment 168006 [details] Patch
It turns out that some processing has already been done on the header value so ",foo" will be just "foo" by the time it gets to this function. It's probably not possible to get an empty string (the header gets removed completely) but I've put the test in anyways so that it won't set an empty value.
Comment on attachment 168006 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=168006&action=review > Source/WebCore/loader/FrameLoader.cpp:673 > + if (commaIndex != notFound) I don't think that this check is necessary - truncate will do nothing if the index is outside bounds. > LayoutTests/http/tests/misc/extract-http-content-language-malformed.php:20 > + if (language == expect) > + element.innerText = 'Pass. Got "' + expect + '" ' + msg + '.'; > + else > + element.innerText = 'Fail! Expected "' + expect + '" ' + msg + '; got "' + language + '".'; This would be much better if existing testing support was used (shouldBe coming from js-test-pre.js).
Created attachment 168050 [details] Patch
Comment on attachment 168050 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=168050&action=review Thank you for switching to js-test-pre.js. > LayoutTests/http/tests/misc/extract-http-content-language.php:17 > + if (window.testRunner) > + testRunner.dumpAsText(); These can be removed now - js-test-pre.js does it automatically. > LayoutTests/http/tests/misc/extract-http-content-language.php:22 > + debug('==> All done...'); This is not needed either - there is a TEST COMPLETE message.
>> LayoutTests/http/tests/misc/extract-http-content-language.php:22 >> + debug('==> All done...'); > >This is not needed either - there is a TEST COMPLETE message. Yes, but there is a "PASS successfullyParsed is true" message (output from the "js-test-post.js" script) which appears to be part of the previous test if there is no separator message.
Comment on attachment 168050 [details] Patch Attachment 168050 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/14254413 New failing tests: http/tests/misc/extract-http-content-language-malformed.php
So... Is it acceptable to leave the redundant debug("all done") call in order to separate the extraneous "PASS successfullyParsed is true" message from the actual tests?
You still have r+, so yes, it's acceptable. I don't think there's value in that, but it's not important.
Okay. I've got a new CL that remove the unnecessary calls to testRunner.dumpAsText() and then I think we're all good!
Created attachment 168478 [details] Patch
Did you find out why extract-http-content-language-malformed.php failed on mac-ews?
No, I haven't. When I looked at the link provided, there's no mention of that test. It lists 9 tests that failed but none of them are "extract-http-content-language-malformed.php". I wanted to see if it failed again with this updated patch to see if it was just a flaky test. If not, perhaps the "mac" version is not cleaning the header value as other builds do; perhaps it's not stripping the initial comma which would mean a different result in the test code. If it persists, I'll have to get someone here with a Mac to try my patch and see what's going on.
> If not, perhaps the "mac" version is not cleaning the header value as other builds do I think that only chromium and mac run tests, other EWS bots just build.
Interesting. I didn't know that. I'm building on Windows and running the new tests locally there with no problems. But the guy next to me has a Mac so I'll just ask him to build and try it. I'll also verify it on my Linux machine.
Comment on attachment 168478 [details] Patch Attachment 168478 [details] did not pass mac-ews (mac): Output: http://queues.webkit.org/results/14298167 New failing tests: http/tests/misc/extract-http-content-language-malformed.php
Ah, so the issue is that the "mac" test is a Safari test and apparently Safari sanitizes the HTTP headers differently than Chrome. Specifically, Chrome turns a value of ",foo" into just "foo" while Safari either removes the header completely or (more likely) leaves it as-is. The end result is a language of "foo" known to Chrome but "auto" known to Safari. Should I update the test to detect the browser type and look for different results for the two?
No, there shouldn't be browser sniffing in the test. Expected results should match what's correct behavior in our best judgement, and there should be platform specific results from platforms that get it wrong.
I was thinking the same. Given that there's no absolute defined behavior in the case of malformed input, I'll just modify the test to accept both logical outcomes: it makes a reasonable guess or refuses to make any guess at all.
Created attachment 168973 [details] Patch
Comment on attachment 168973 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=168973&action=review > Source/WebCore/loader/FrameLoader.cpp:673 > + headerContentLanguage.truncate(commaIndex); // notFound == -1 == don't truncate This comment is somewhat misleading - notFound is static_cast<size_t>(-1), not -1. I'd either omit the comment, or say something like "won't truncate if comma was not found". > Source/WebCore/loader/FrameLoader.cpp:674 > + headerContentLanguage = headerContentLanguage.stripWhiteSpace(isHTMLSpace); Haven't noticed this before. Why HTML space? We're not dealing with HTML here. > LayoutTests/http/tests/misc/extract-http-content-language.php:7 > +<script src="../../js-test-resources/js-test-pre.js"></script> A trivial nit - no need for relative path in HTTP tests. I think that "/js-test-resources/js-test-pre.js" would work just as well.
Comment on attachment 168973 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=168973&action=review >> Source/WebCore/loader/FrameLoader.cpp:674 >> + headerContentLanguage = headerContentLanguage.stripWhiteSpace(isHTMLSpace); > > Haven't noticed this before. Why HTML space? We're not dealing with HTML here. The default match does not include \t as whitespace. HTMLspace does, as well as \r and \n which don't really apply here since header processing would have taken those. >> LayoutTests/http/tests/misc/extract-http-content-language.php:7 >> +<script src="../../js-test-resources/js-test-pre.js"></script> > > A trivial nit - no need for relative path in HTTP tests. I think that "/js-test-resources/js-test-pre.js" would work just as well. All the other tests in this directory use the ../.. prefix; I just copied them.
So... Is this good, then? All tests pass and I've given reasoning for the minor comments.
Yes, the patch has an r+. I think that my comments still stand, but addressing them is not necessary to have this landed. If you need someone to approve this for commit queue, please set the cq? flag.
Comment on attachment 168973 [details] Patch Clearing flags on attachment: 168973 Committed r131794: <http://trac.webkit.org/changeset/131794>
All reviewed patches have been landed. Closing bug.
I want to thank everyone, especially Alexey for helping me with this, my first ever real WebKit change. It has been one of the most frustrating, aggravating, discouraging, and painful experiences of my life. :-) I also learned a lot. Thanks for your patience! -- Brian
We aim to please? Hopefully the next one will go smoother. :)
Why did this change set a bunch of files as executable?!
It did?!? I didn't notice that in the diff. Damn. There was a conflict between CgyWin and Xemacs that was causing edited files to receive permissions of 000 which was then converted to 755 whenever something was accessing the file. I thought I had corrected that before the last patch was submitted for review. I can submit a fix tomorrow.
> Why did this change set a bunch of files as executable?! I overlooked HTMLParserIdioms.h in the patch, but are there any others?
Looks like someone else has already corrected this. Sorry 'bout that.