Bug 54444 - Correct handling of end of buffer partial sequence in UTF-8 and UTF-16 decoders when flushing with zero length
Summary: Correct handling of end of buffer partial sequence in UTF-8 and UTF-16 decode...
Status: ASSIGNED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Darin Adler
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-15 03:08 PST by Antti Koivisto
Modified: 2012-06-12 08:05 PDT (History)
9 users (show)

See Also:


Attachments
Patch (34.43 KB, patch)
2012-06-11 19:23 PDT, Darin Adler
ap: review+
buildbot: commit-queue-
Details | Formatted Diff | Diff
Archive of layout-test-results from ec2-cr-linux-01 (540.81 KB, application/zip)
2012-06-11 23:29 PDT, WebKit Review Bot
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Antti Koivisto 2011-02-15 03:08:48 PST
http://trac.webkit.org/changeset/78541 added a zero length check to fix crashing bug 54418. This is not strictly correct as far as I understand as it should also set the error flag and possibly generate replacement characters.
Comment 1 Darin Adler 2011-02-15 08:51:34 PST
Thanks for adding the check Antti.

I’ll tackle this soon.
Comment 2 Alexey Proskuryakov 2011-02-15 08:52:21 PST
As mentioned in bug 54418, TextCodecUTF16 has the same problem.
Comment 3 Darin Adler 2011-02-15 08:56:29 PST
TextCodecUTF16 has a couple other mistakes as well. The main thing I need here is a good way to test codecs, maybe by using XMLHttpRequest or something. A way to feed in a specific number of bytes at a time. And also a way to test both the “stop on error” and “use replacement character” code paths. Then we can construct a great regression test and then fix these problems.

I spotted another problem in the UTF8-codec that I would like to fix at the same time; if a partial sequence is invalid, then the remainder of that is left as a partial sequence. But the remainder could contain plain ASCII characters, and the code does not handle that path correctly.
Comment 4 Darin Adler 2011-02-22 17:26:06 PST
I have this fixed for the UTF-8 codec now, but my attempt to make a test case for this was unsuccessful.
Comment 5 Darin Adler 2012-06-11 19:23:16 PDT
Created attachment 146994 [details]
Patch
Comment 6 Build Bot 2012-06-11 20:02:50 PDT
Comment on attachment 146994 [details]
Patch

Attachment 146994 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/12949153
Comment 7 Alexey Proskuryakov 2012-06-11 22:49:08 PDT
Comment on attachment 146994 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=146994&action=review

Please fix build issues found by EWS.

> Source/WebCore/platform/text/TextCodecUTF16.cpp:77
> +        if (m_haveBufferedByte && flush) {
> +            sawError = true;
> +            if (!stopOnError)
> +                return String(&replacementCharacter, 1);
> +        }
> +        return String();        

I only had a very brief look at other decoders, but it seems like we may need to reset to clean state when flush == true.

> Source/WebCore/platform/text/TextCodecUTF16.cpp:128
> +        if (flush) {
> +            sawError = true;
> +            if (!stopOnError) {
> +                size_t numCharsInBuffer = q - buffer.characters();
> +                if (numCharsInBuffer == buffer.length()) {
> +                    buffer.resize(numCharsInBuffer + 1);
> +                    q = buffer.characters() + numCharsInBuffer;
> +                }
> +                *q++ = replacementCharacter;
> +            }

Ditto.

> Source/WebCore/testing/Internals.h:51
> -class Internals : public RefCounted<Internals>,
> -                  public FrameDestructionObserver {
> +class Internals : public RefCounted<Internals>, private FrameDestructionObserver {

Someone else is working on removing inheritance from FrameDestructionObserver now.

> LayoutTests/fast/encoding/incremental-decoding.html:220
> +                debug("Test requires WebKit internals object");

This is helpful for other browser vendors trying to open the test, but necessarily for people opening it in a WebKit based browser.

Also, I'm surprised that the main branch is an else.
Comment 8 Alexey Proskuryakov 2012-06-11 22:49:40 PDT
*not* necessarily.
Comment 9 WebKit Review Bot 2012-06-11 23:29:50 PDT
Comment on attachment 146994 [details]
Patch

Attachment 146994 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/12937813

New failing tests:
fast/parser/test-unicode-characters-in-attribute-name.html
Comment 10 WebKit Review Bot 2012-06-11 23:29:54 PDT
Created attachment 147017 [details]
Archive of layout-test-results from ec2-cr-linux-01

The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: ec2-cr-linux-01  Port: <class 'webkitpy.common.config.ports.ChromiumXVFBPort'>  Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
Comment 11 Darin Adler 2012-06-12 08:05:31 PDT
(In reply to comment #7)
> fast/parser/test-unicode-characters-in-attribute-name.html

I had overlooked this test failure, and it’s interesting.

The test file has an extra byte at the end; it’s a newline character, but it’s a single byte. The old codec would ignore and drop that last byte. The new codec changes it into U+FFFD. I’m curious what other web browsers do with this stray byte.

We should probably fix that test case and add coverage for this case to char-decoding.html by passing in a single byte.