Summary: | Correct handling of end of buffer partial sequence in UTF-8 and UTF-16 decoders when flushing with zero length | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Antti Koivisto <koivisto> | ||||||
Component: | Text | Assignee: | Darin Adler <darin> | ||||||
Status: | ASSIGNED --- | ||||||||
Severity: | Normal | CC: | annevk, ap, darin, dglazkov, gustavo, jnd, mario, philn, webkit.review.bot, xan.lopez | ||||||
Priority: | P2 | ||||||||
Version: | 528+ (Nightly build) | ||||||||
Hardware: | All | ||||||||
OS: | All | ||||||||
Bug Depends on: | |||||||||
Bug Blocks: | 179303 | ||||||||
Attachments: |
|
Description
Antti Koivisto
2011-02-15 03:08:48 PST
Thanks for adding the check Antti. I’ll tackle this soon. As mentioned in bug 54418, TextCodecUTF16 has the same problem. TextCodecUTF16 has a couple other mistakes as well. The main thing I need here is a good way to test codecs, maybe by using XMLHttpRequest or something. A way to feed in a specific number of bytes at a time. And also a way to test both the “stop on error” and “use replacement character” code paths. Then we can construct a great regression test and then fix these problems. I spotted another problem in the UTF8-codec that I would like to fix at the same time; if a partial sequence is invalid, then the remainder of that is left as a partial sequence. But the remainder could contain plain ASCII characters, and the code does not handle that path correctly. I have this fixed for the UTF-8 codec now, but my attempt to make a test case for this was unsuccessful. Created attachment 146994 [details]
Patch
Comment on attachment 146994 [details] Patch Attachment 146994 [details] did not pass win-ews (win): Output: http://queues.webkit.org/results/12949153 Comment on attachment 146994 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=146994&action=review Please fix build issues found by EWS. > Source/WebCore/platform/text/TextCodecUTF16.cpp:77 > + if (m_haveBufferedByte && flush) { > + sawError = true; > + if (!stopOnError) > + return String(&replacementCharacter, 1); > + } > + return String(); I only had a very brief look at other decoders, but it seems like we may need to reset to clean state when flush == true. > Source/WebCore/platform/text/TextCodecUTF16.cpp:128 > + if (flush) { > + sawError = true; > + if (!stopOnError) { > + size_t numCharsInBuffer = q - buffer.characters(); > + if (numCharsInBuffer == buffer.length()) { > + buffer.resize(numCharsInBuffer + 1); > + q = buffer.characters() + numCharsInBuffer; > + } > + *q++ = replacementCharacter; > + } Ditto. > Source/WebCore/testing/Internals.h:51 > -class Internals : public RefCounted<Internals>, > - public FrameDestructionObserver { > +class Internals : public RefCounted<Internals>, private FrameDestructionObserver { Someone else is working on removing inheritance from FrameDestructionObserver now. > LayoutTests/fast/encoding/incremental-decoding.html:220 > + debug("Test requires WebKit internals object"); This is helpful for other browser vendors trying to open the test, but necessarily for people opening it in a WebKit based browser. Also, I'm surprised that the main branch is an else. *not* necessarily. Comment on attachment 146994 [details] Patch Attachment 146994 [details] did not pass chromium-ews (chromium-xvfb): Output: http://queues.webkit.org/results/12937813 New failing tests: fast/parser/test-unicode-characters-in-attribute-name.html Created attachment 147017 [details]
Archive of layout-test-results from ec2-cr-linux-01
The attached test failures were seen while running run-webkit-tests on the chromium-ews.
Bot: ec2-cr-linux-01 Port: <class 'webkitpy.common.config.ports.ChromiumXVFBPort'> Platform: Linux-2.6.35-28-virtual-x86_64-with-Ubuntu-10.10-maverick
(In reply to comment #7) > fast/parser/test-unicode-characters-in-attribute-name.html I had overlooked this test failure, and it’s interesting. The test file has an extra byte at the end; it’s a newline character, but it’s a single byte. The old codec would ignore and drop that last byte. The new codec changes it into U+FFFD. I’m curious what other web browsers do with this stray byte. We should probably fix that test case and add coverage for this case to char-decoding.html by passing in a single byte. |