Bug 52236

Summary: Strip NUL character when copying text on Windows
Product: WebKit Reporter: Erik Arvidsson <arv>
Component: HTML EditingAssignee: Tony Chang <tony>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, commit-queue, enrica, eric, mjs, rniwa, tony
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: Windows 7   
Attachments:
Description Flags
Test case
none
Shows that NUL is stripped through the parser
none
Patch
none
Patch for landing
none
Patch
none
Patch for landing none

Description Erik Arvidsson 2011-01-11 12:21:16 PST
Windows does not allow NUL characters on the clipboard since they are used to terminate the clipboard data.

We should strip NUL characters from the text before sending the text to the windows clipboard.
Comment 1 Erik Arvidsson 2011-01-11 12:32:43 PST
Created attachment 78579 [details]
Test case
Comment 2 Tony Chang 2011-01-11 14:00:18 PST
Some testing:

On Chromium Linux, Chromium Win, Safari Win, we copy up to the NUL character (i.e., the copied text is truncated).

On Safari Mac and Chromium Mac, we ignore the NUL (when you paste, you get text without the NUL character).

On IE9, Firefox Win and Firefox Linux, the DOM has the unicode replacement character.

On Firefox Mac, the NUL is converted into a unicode replacement character (U+FFFD) in the pasted text.  Actually, it may already be replaced in the DOM but perhaps there's no glyph for it.
Comment 3 Ryosuke Niwa 2011-01-11 14:02:29 PST
(In reply to comment #2)
> On IE9, Firefox Win and Firefox Linux, the DOM has the unicode replacement character.

We should most certainly replace with the unicode replace character.
Comment 4 Tony Chang 2011-01-11 14:18:38 PST
(In reply to comment #3)
> (In reply to comment #2)
> > On IE9, Firefox Win and Firefox Linux, the DOM has the unicode replacement character.
> 
> We should most certainly replace with the unicode replace character.

Yup, that's what the HTML5 spec says:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#plaintext-state
Comment 5 Tony Chang 2011-01-11 14:25:18 PST
Actually, arv says that if he changes the test case to use innerHTML, the NUL character is stripped (he'll upload a test case).

Seems like we need to have setting textContent go through the HTML parser and the HTML parser needs to insert the replacement character.
Comment 6 Adam Barth 2011-01-11 14:35:18 PST
> Seems like we need to have setting textContent go through the HTML parser and the HTML parser needs to insert the replacement character.

That would be surprising.  textContent is a DOM-level concept.  It should ignore anything HTML specific.  That said, stranger things have happened.
Comment 7 Erik Arvidsson 2011-01-11 14:37:35 PST
Created attachment 78601 [details]
Shows that NUL is stripped through the parser
Comment 8 Alexey Proskuryakov 2011-01-11 14:43:02 PST
HTML5 goes to great lengths trying to avoid having null characters in Unicode strings, but I don't think that it generally makes sense. Replacing null with U+FFFD during parsing is particularly weird, because one can always use DOM manipulation to insert a null (as it's done in attached test case), so there is no guarantee of "no null in DOM" anyway.

When copying, we should just make sure that the user gets expected results. As WebKit doesn't display nulls in content, they should be removed in copied text, not replaced with U+FFFD.

> Seems like we need to have setting textContent go through the HTML parser and the HTML parser needs to insert the replacement character.

I don't think that it's the right direction. There are many ways to insert a null via DOM manipulation, and even HTML5 doesn't prohibit those.
Comment 9 Erik Arvidsson 2011-01-11 15:03:23 PST
(In reply to comment #8)
> When copying, we should just make sure that the user gets expected results. As WebKit doesn't display nulls in content, they should be removed in copied text, not replaced with U+FFFD.

Agreed. That was how this started before it got side tracked...
Comment 10 Tony Chang 2011-01-11 15:05:53 PST
(In reply to comment #9)
> (In reply to comment #8)
> > When copying, we should just make sure that the user gets expected results. As WebKit doesn't display nulls in content, they should be removed in copied text, not replaced with U+FFFD.
> 
> Agreed. That was how this started before it got side tracked...

OK, patch coming for Chromium.  I don't have a windows box handy, so Safari Win will come later.
Comment 11 Ryosuke Niwa 2011-01-11 15:09:47 PST
(In reply to comment #10)
> OK, patch coming for Chromium.  I don't have a windows box handy, so Safari Win will come later.

Why don't we do this when we're serializing DOM instead of doing it in each platform?  It's a simple modification to MarkupAccumulator.
Comment 12 Tony Chang 2011-01-13 14:30:34 PST
Created attachment 78856 [details]
Patch
Comment 13 Tony Chang 2011-01-13 14:32:14 PST
(In reply to comment #12)
> Created an attachment (id=78856) [details]
> Patch

I talked to Ryosuke offline and the code for putting plain text on the clipboard doesn't go through MarkupAccumulator, it goes thorough Editor::selectedText().  Adding the code there fixes Chromium Linux, but I need to do some more testing on a Windows machine to see if we need to make a similar fix in MarkupAccumulator (for the rich text case).
Comment 14 Ryosuke Niwa 2011-01-13 15:07:55 PST
(In reply to comment #13)
> (In reply to comment #12)
> > Created an attachment (id=78856) [details] [details]
> > Patch
> 
> I talked to Ryosuke offline and the code for putting plain text on the clipboard doesn't go through MarkupAccumulator, it goes thorough Editor::selectedText().  Adding the code there fixes Chromium Linux, but I need to do some more testing on a Windows machine to see if we need to make a similar fix in MarkupAccumulator (for the rich text case).

Per discussion in person, Tony doesn't have access to a Windows machine, so I'll build & test his patch on my Windows machine.
Comment 15 Ryosuke Niwa 2011-01-13 19:27:51 PST
(In reply to comment #14)
> Per discussion in person, Tony doesn't have access to a Windows machine, so I'll build & test his patch on my Windows machine.

I confirmed that the bug does not produce on Windows with Tony's patch.
Comment 16 Ryosuke Niwa 2011-01-13 19:39:03 PST
Comment on attachment 78856 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=78856&action=review

r=me provided the answers to the following two questions are both yes.

> LayoutTests/editing/pasteboard/copy-null-characters.html:27
> +    var source = document.getElementById("source");
> +    var textWithNull = "Copy\0 paste me";
> +    source.textContent = textWithNull;
> +    sel.setPosition(source, 0);
> +    document.execCommand("SelectAll");
> +    document.execCommand("Copy");
> +    // Remove the source text so we don't end up with a null character in the
> +    // expected output file.
> +    source.parentNode.removeChild(source);
> +
> +    var destinationRichText = document.getElementById("destination-rich-text");
> +    sel.setPosition(destinationRichText, 0);
> +    document.execCommand("Paste");
> +
> +    var destinationPlainText = document.getElementById("destination-plain-text");
> +    destinationPlainText.focus();
> +    document.execCommand("Paste");

Could you also add a test case where the copied contents are richly formatted?

> Source/WebCore/platform/mac/PasteboardMac.mm:194
> -        String text = selectedRange->text();
> +        String text = frame->editor()->selectedText();

If I'm understanding this right, you'd expect expect Range::text() to include null char because it's a DOM API. Am I right?
Comment 17 Tony Chang 2011-01-14 14:35:09 PST
Created attachment 79001 [details]
Patch for landing
Comment 18 Tony Chang 2011-01-14 14:36:36 PST
(In reply to comment #16)
> Could you also add a test case where the copied contents are richly formatted?

Done.

> > Source/WebCore/platform/mac/PasteboardMac.mm:194
> > -        String text = selectedRange->text();
> > +        String text = frame->editor()->selectedText();
> 
> If I'm understanding this right, you'd expect expect Range::text() to include null char because it's a DOM API. Am I right?

Yes, that's right.  This has the side benefit that now all platforms use the same code path to get the plain text that is placed on the clipboard/pasteboard.
Comment 19 Alexey Proskuryakov 2011-01-14 14:56:20 PST
Comment on attachment 79001 [details]
Patch for landing

+static String removeNULLCharacters(const String& str)
+{
+    String copy = str;
+    return copy.removeCharacters(&isNULLCharacter);
+}

Wow, that's a pretty slow way to remove nulls. Are we sure that we want to land it?
Comment 20 Alexey Proskuryakov 2011-01-14 15:02:56 PST
User-initiated copying may not be the hottest code, but one function call per character is still a lot.
Comment 21 WebKit Commit Bot 2011-01-14 15:06:09 PST
The commit-queue encountered the following flaky tests while processing attachment 79001 [details]:

http/tests/appcache/update-cache.html bug 52483 (author: ap@webkit.org)
The commit-queue is continuing to process your patch.
Comment 22 Tony Chang 2011-01-14 15:11:26 PST
(In reply to comment #20)
> User-initiated copying may not be the hottest code, but one function call per character is still a lot.

I suspect the compiler would inline the function, but if you prefer, I can update the patch to use StringImpl::replace(UChar, StringImpl*).
Comment 23 Tony Chang 2011-01-14 15:13:54 PST
(In reply to comment #22)
> (In reply to comment #20)
> > User-initiated copying may not be the hottest code, but one function call per character is still a lot.
> 
> I suspect the compiler would inline the function, but if you prefer, I can update the patch to use StringImpl::replace(UChar, StringImpl*).

I could also use StringImpl::createStrippingNullCharactersSlowCase.
Comment 24 Alexey Proskuryakov 2011-01-14 15:35:20 PST
> I suspect the compiler would inline the function, but if you prefer, I can update the patch to use StringImpl::replace(UChar, StringImpl*).

The compiler cannot inline the function, because String::removeCharacters() is not inline itself. The function's definition is not visible in StringImpl.cpp (which is in JavaScriptCore, so even cross-unit inlining wouldn't work).
Comment 25 Maciej Stachowiak 2011-01-14 15:48:30 PST
Better way to do it (untested):

static String removeNULLCharacters(const String& str)
{
    return str.replace(0, "");
}
Comment 26 Ryosuke Niwa 2011-01-14 15:51:47 PST
(In reply to comment #19)
> (From update of attachment 79001 [details])
> +static String removeNULLCharacters(const String& str)
> +{
> +    String copy = str;
> +    return copy.removeCharacters(&isNULLCharacter);
> +}
> 
> Wow, that's a pretty slow way to remove nulls. Are we sure that we want to land it?

Oops, I missed that.  Let's fix this before landing.
Comment 27 Tony Chang 2011-01-14 16:11:14 PST
Created attachment 79021 [details]
Patch
Comment 28 Alexey Proskuryakov 2011-01-14 16:34:31 PST
Comment on attachment 79021 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=79021&action=review

> Source/WebCore/editing/Editor.cpp:133
> +static void removeNULLCharacters(String& str)
> +{
> +    str.replace(0, "");
> +}

I didn't comment about the function name, because I assumed that you'd just inline str.replace(0, "") at call site. But NULL isn't great, s it usually refers to C null pointer macro, which is not a single byte, and which we don't use anyway.
Comment 29 Tony Chang 2011-01-14 17:26:06 PST
Created attachment 79039 [details]
Patch for landing
Comment 30 WebKit Commit Bot 2011-01-14 18:06:23 PST
Comment on attachment 79039 [details]
Patch for landing

Clearing flags on attachment: 79039

Committed r75861: <http://trac.webkit.org/changeset/75861>
Comment 31 WebKit Commit Bot 2011-01-14 18:06:31 PST
All reviewed patches have been landed.  Closing bug.