32937 – LayoutTests/fast/encoding/invalid-UTF-8.html

Bug 32937 - LayoutTests/fast/encoding/invalid-UTF-8.html

Summary: LayoutTests/fast/encoding/invalid-UTF-8.html

Status:	RESOLVED FIXED

Alias:	None

Product:	WebKit
Classification:	Unclassified
Component:	WebKit Qt (show other bugs)
Version:	528+ (Nightly build)
Hardware:	All All

Importance:	P2 Normal
Assignee:	Nobody

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-12-25 11:42 PST by Robert Hogan
Modified:	2010-02-22 14:18 PST (History)
CC List:	2 users (show)

See Also:

Attachments
Patch (23.32 KB, patch) 2010-02-14 04:47 PST, Robert Hogan	ap: review-	Details \| Formatted Diff \| Diff
Updated Patch (22.99 KB, patch) 2010-02-21 13:28 PST, Robert Hogan	no flags	Details \| Formatted Diff \| Diff
Updated Patch (22.91 KB, patch) 2010-02-22 11:03 PST, Robert Hogan	ap: review+	Details \| Formatted Diff \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Robert Hogan 2009-12-25 11:42:32 PST

Mac, Firefox and Chromium (and presumably other decoders) represent the unicode string in this test as:

|d1 82 | b5 | d1 | d1 82 | 20   | f0 90 80 | f0 80 f0 90 | 90 |
|  T   | ?  | ?  |  T    | [\s] |     ?    |       ?     | ?  |

While Qt's QTextCodec decoder (which sits in Qt not WebKit) represents it as:

|d1 82 | b5 | d1 | d1 82 | 20   |f0|90|80|f0|80|f0|90|90|
|  T   | ?  | ?  |  T    | [\s] |? |? |? |? |? |? |? |? |

In other words the Mac and other decoders represent a failed encoding sequence with a single replacement character, but Qt displays each byte in the failed encoding sequence as a single replacement character.

I think Qt might be in the wrong here.

Comment 1 Robert Hogan 2009-12-25 11:43:39 PST

Opened Qt bug at http://bugreports.qt.nokia.com/browse/QTBUG-7032

Comment 2 Robert Hogan 2010-02-14 04:47:52 PST

Created attachment 48718 [details]
Patch

Response from Qt at http://bugreports.qt.nokia.com/browse/QTBUG-7032 is:

"The decoding is not wrong. It's perfectly valid to write anything when the input is invalid."

So change test to a textdump (no need for font metrics in this test) and add Qt-specific results.

Comment 3 Alexey Proskuryakov 2010-02-14 12:28:27 PST

> "The decoding is not wrong. It's perfectly valid to write anything when the
> input is invalid."

I don't have a reference handy, but I'm fairly sure that's incorrect. Will try to look it up later.

Comment 4 Darin Adler 2010-02-15 09:32:38 PST

On thing the Unicode Standard 5.2 says about this is that these ill-formed sequences must not be interpreted as characters. The examples given of legitimate handling are: Signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD. It's not clear if representing the code unit with a "?" character is acceptable, nor is it clear if representing ill-formed sequences with a run of "?" characters that happens to match the sequence length is acceptable.

Comment 5 Alexey Proskuryakov 2010-02-15 10:17:05 PST

Comment on attachment 48718 [details]
Patch

Turns out that per the Unicode spec, both results are allowed, although the current results match the recommended ones. See "Constraints on Conversion Processes" paragraph in <http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>.

> Mac, Firefox and Chromium (and presumably other decoders) represent the unicode
> string in this test as:
> |d1 82 | b5 | d1 | d1 82 | 20   | f0 90 80 | f0 80 f0 90 | 90 |
> |  T   | ?  | ?  |  T    | [\s] |     ?    |       ?     | ?  |

I don't think this is quite accurate - it seems that the invalid subsequences are b5, d1, f09080, f080, and f09090. At least, that's what they should be per the spec recommendation.

> test itself to a dumpAsText test since font metrics are not required to determine
> if it has passed or not.

This test was originally added for bug 8972, "REGRESSION: invalid UTF-8 sequences are not displayed". One needs pixel results to tell whether characters are displayed!

We actually want to test two things here. First, it's that U+FFFD substitution characters are actually displayed. Second, compliance with Unicode spec recommended handling shouldn't regress on platforms that are compliant now.

Only the former subtest needs pixel results. One way to make it text-only would be to have a separate container with text "тт ", and compare its rendered width to actual one.

-<p>The output should be: "т??т ???" (with black diamonds in place of question marks).</p>
+<p>The output on Mac should be: "т??т ???" (with black diamonds in place of question marks).<br>
+The output on Qt should be: "т??т ????????" (with black diamonds in place of question marks).<br>
+(See https://bugs.webkit.org/show_bug.cgi?id=32937 for the reason Qt is different.)</p>

There is no need to single out Qt here. It would be better to just make test results more self-explanatory, separating the "must" and "ideally should" subtests.

I think it's fine to have Qt-specific results for the test, and it would be nice to make the test text-only. But some clarification would be useful to avoid future confusion, so please consider my comments above. Marking r- to get this out of review queue.

Comment 6 Alexey Proskuryakov 2010-02-15 10:29:53 PST

> It's not clear if representing the code unit with a "?" character is acceptable

I'd say it is definitely not acceptable for HTML/JavaScript/CSS, since the '?' character does not signal an error there. Even setting aside the possibility for circumventing security checks, having plain question marks in content would be confusing to human readers.

Comment 7 Robert Hogan 2010-02-17 11:01:40 PST

(In reply to comment #5)
> We actually want to test two things here. First, it's that U+FFFD substitution
> characters are actually displayed. Second, compliance with Unicode spec
> recommended handling shouldn't regress on platforms that are compliant now.
> 
> Only the former subtest needs pixel results. One way to make it text-only would
> be to have a separate container with text "тт ", and compare its rendered width
> to actual one.
> 

Do you mean something like:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>
/* Force font fallback on Windows to match Mac OS X */
@font-face {
    font-family: 'times';
    src: local('Lucida Grande');
    unicode-range: U+FFFD;
}
</style>
</head>
<body>
<p>This tests the rendering of invalid UTF-8 sequences.</p>
The following should be: "т??т ???" or "т??т ????????" (with black diamonds in place of question marks):<br>
<div id="target" style="font-size: 12px;">т��т ��������</div>
<br>It should be wider than this:<br>
<div id="reference" style="font-size: 12px;">тт </div>
<p id="result">Test did not run</p>
<script>
    if (window.layoutTestController)
        layoutTestController.dumpAsText();

    var target = document.getElementById("target");
    var textNode = target.firstChild;
    var range = document.createRange();
    range.setStart(textNode, 0);
    range.setEnd(textNode, 8); // Minimum valid length of decoded output is 8 characters.
    var width = range.getClientRects()[0].width;

    var reference = document.getElementById("reference");
    textNode = reference.firstChild;
    range.setStart(textNode, 0);
    range.setEnd(textNode, 3); 
    var referenceWidth = range.getClientRects()[0].width;

    document.getElementById("result").innerText = width > referenceWidth ? "PASS " : "FAIL: width (" + width + ") was less than or equal to " + referenceWidth;

</script>
</body>
</html>

Which will have expected results (in Qt) of:

This tests the rendering of invalid UTF-8 sequences.

The following should be: "Ñ‚??Ñ‚ ???" or "Ñ‚??Ñ‚ ????????" (with black diamonds in place of question marks):
Ñ‚ï¿½ï¿½Ñ‚ ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½

It should be wider than this:
Ñ‚Ñ‚
PASS

and in other platforms of:

This tests the rendering of invalid UTF-8 sequences.

The following should be: "Ñ‚??Ñ‚ ???" or "Ñ‚??Ñ‚ ????????" (with black diamonds in place of question marks):
Ñ‚ï¿½ï¿½Ñ‚ ï¿½ï¿½ï

It should be wider than this:
Ñ‚Ñ‚
PASS

Comment 8 Alexey Proskuryakov 2010-02-17 11:10:45 PST

I'd have done it a little differently - making two separate tests, one with a single ill-formed sequence, and another with what we have now. The former needs to be a pixel test (but it will be independent of decoder differences), while the latter can be text only.

But this seems fine, too.

Comment 9 Robert Hogan 2010-02-21 13:28:14 PST

Created attachment 49167 [details]
Updated Patch

I've split it into two as you suggested. I've also removed the checksum and png for the original invalid-UTF-8.html test as I don't have a mac to run it here. I'm not confident that the pixel results for the first test will be the same on all platforms - would it be better for you to create mac specific results before committing (and remove the platform-independent results for invalid-UTF-8.html)?

Comment 10 Robert Hogan 2010-02-22 11:03:39 PST

Created attachment 49228 [details]
Updated Patch

There was an error in the platform-independent expected results.

Comment 11 Alexey Proskuryakov 2010-02-22 14:02:49 PST

Comment on attachment 49228 [details]
Updated Patch

r=me. I'd have added a link to this bug in invalid-UTF-8-2.html, but one can always get it from svn history, if really interested.

Comment 12 Alexey Proskuryakov 2010-02-22 14:18:47 PST

Grrr, this patch has no ChangeLog!

Committed <http://trac.webkit.org/changeset/55103>.