Bug 211498

Summary: Cut and paste from Google Doc to Notes in several (non-Latin) languages doesn't work
Product: WebKit Reporter: Wenson Hsieh <wenson_hsieh>
Component: HTML EditingAssignee: Wenson Hsieh <wenson_hsieh>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, bdakin, cdumez, darin, esprehn+autocc, ews-watchlist, kangil.han, megan_gardner, mifenton, thorton, webkit-bug-importer, wenson_hsieh
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Address review feedback
none
Address review feedback (2) none

Description Wenson Hsieh 2020-05-05 23:37:24 PDT
<rdar://problem/56675345>
Comment 1 Wenson Hsieh 2020-05-06 00:09:34 PDT
Created attachment 398598 [details]
Patch
Comment 2 Wenson Hsieh 2020-05-06 00:20:15 PDT
Comment on attachment 398598 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398598&action=review

> Source/WebCore/ChangeLog:22
> +        Later, when pasting in Notes or TextEdit, both apps use `-[NSAttributedString initWithData:...:]` to convert the
> +        HTML data on the pasteboard into an NSAttributedString. This takes the NSPasteboard's HTML data (a blob of
> +        `NSData`) and synchronously loads it in a new legacy WebKit view by calling `-[WebFrame
> +        loadData:MIMEType:textEncodingName:baseURL:]`, passing in `nil` as the text encoding name. Since WebKit is only
> +        given a blob of data and no particular encoding, we fall back to default Latin-1 encoding, which produces
> +        gibberish for CJK text.

An aside: I explored a solution that involves adjusting UIFoundation’s NSHTMLReader to use a default encoding of UTF-8 in its WebView, but ended up not going down this route because (1) defaulting to UTF-8 seemed largely arbitrary when all we’re given is a chunk of NSData, and (2) a fix there wouldn’t help fix the bug in macOS 10.14 and 10.15.
Comment 3 Darin Adler 2020-05-06 08:02:23 PDT
Comment on attachment 398598 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398598&action=review

> Source/WebCore/ChangeLog:26
> +        To fix this, we preserve any meta tag with a valid charset that the page attempts to write to the clipboard when
> +        sanitizing markup. This matches behavior in both Chrome and Firefox; this fix is also effective on shipping and
> +        older versions of macOS.

I am not happy with this solution.

Preserving the character set in the fragment doesn’t make sense when putting markup in a WTF::String, which contains UTF-16 characters. The character set would be determined when the WTF::String is converted back to bytes. The original character set that was used to decode the fragment won’t be correct. This only will do the right thing when the character set happens to be UTF-8.

I would like to see where this WTF::String is turned into UTF-8 bytes. That is where we should be adding the character set.

Or if we add a character set in markup code long before converting the WTF::String to UTF-8, we should do it whether or not the incoming markup has a meta charset in it, based on the presence of non-ASCII characters, and always write the character set "UTF-8" since that’s what we will be serializing the WTF::String to.
Comment 4 Darin Adler 2020-05-06 08:04:59 PDT
Comment on attachment 398598 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398598&action=review

> Tools/TestWebKitAPI/Tests/WebKitCocoa/CopyHTML.mm:112
> +        "    e.clipboardData.setData('text/html', `<head><meta charset='utf-8'></head><body><span style='color: red;'>æå«è¬ææ</span></body>`);"

A non-UTF-8 test case would likely prove that this code does the wrong thing.

> Tools/TestWebKitAPI/Tests/WebKitCocoa/CopyHTML.mm:125
> +    EXPECT_TRUE([copiedMarkup containsString:@"<meta charset=\"UTF-8\">"]);

This is unnecessary behavior.

> Tools/TestWebKitAPI/Tests/WebKitCocoa/CopyHTML.mm:134
> +    EXPECT_WK_STREQ("æå«è¬ææ", [attributedString string]);

This is a good test of the bug, but we should cover a non-UTF-8 encoding.
Comment 5 Darin Adler 2020-05-06 08:07:08 PDT
Comment on attachment 398598 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398598&action=review

>> Tools/TestWebKitAPI/Tests/WebKitCocoa/CopyHTML.mm:125
>> +    EXPECT_TRUE([copiedMarkup containsString:@"<meta charset=\"UTF-8\">"]);
> 
> This is unnecessary behavior.

Let me clarify. This is strange, *possibly* unnecessary, behavior, but it’s possibly helpful for compatibility if there are mistakes in how apps decode pasteboard strings where they convert an NSString to UTF-8.
Comment 6 Wenson Hsieh 2020-05-06 08:09:27 PDT
(In reply to Darin Adler from comment #3)
> Comment on attachment 398598 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=398598&action=review
> 
> > Source/WebCore/ChangeLog:26
> > +        To fix this, we preserve any meta tag with a valid charset that the page attempts to write to the clipboard when
> > +        sanitizing markup. This matches behavior in both Chrome and Firefox; this fix is also effective on shipping and
> > +        older versions of macOS.
> 
> I am not happy with this solution.
> 
> Preserving the character set in the fragment doesn’t make sense when putting
> markup in a WTF::String, which contains UTF-16 characters. The character set
> would be determined when the WTF::String is converted back to bytes. The
> original character set that was used to decode the fragment won’t be
> correct. This only will do the right thing when the character set happens to
> be UTF-8.
> 
> I would like to see where this WTF::String is turned into UTF-8 bytes. That
> is where we should be adding the character set.
> 
> Or if we add a character set in markup code long before converting the
> WTF::String to UTF-8, we should do it whether or not the incoming markup has
> a meta charset in it, based on the presence of non-ASCII characters, and
> always write the character set "UTF-8" since that’s what we will be
> serializing the WTF::String to.

The pasteboard string is being converted into UTF-8 bytes in AppKit, when it calls into CoreFoundation’s CFPasteboardCreateDataForString while serializing the string to the pasteboard.

At our layer, we take the sanitized markup string and call -[NSPasteboard setString:forType:] to write it to the pasteboard.
Comment 7 Darin Adler 2020-05-06 08:12:46 PDT
(In reply to Wenson Hsieh from comment #6)
> The pasteboard string is being converted into UTF-8 bytes in AppKit, when it
> calls into CoreFoundation’s CFPasteboardCreateDataForString while
> serializing the string to the pasteboard.

OK, then this is a workaround for that. Since AppKit always converts to UTF-8, then we need to add a meta tag that sets the charset to UTF-8 if there are non-ASCII characters present. And make sure it overrides any other misleading conflicting character set indicators, such as other meta tags inserted explicitly by websites, by either removing them or considering "priority order". For economy we can leave one out if the characters are all ASCII.
Comment 8 Darin Adler 2020-05-06 08:34:00 PDT
Two further thoughts now that I know the conversion to UTF-8 is done by CFPasteboardCreateDataForString:

I looked at the Apple internal source code for that function and it has a comment saying the format is “currently” UTF-8. Someone should let the Foundation team know that it’s not OK for them to change to something else since we are all hard-coding UTF-8 into our HTML markup!

CFPasteboardCreateDataForString uses CFStringCreateExternalRepresentation, so it seems like it would be inserting a BOM. The presence of a UTF-8 BOM character at the start of the data buffer *should* cause us to correctly decode as UTF-8 rather than Windows Latin-1 and *ignore* the presence or absence of a <meta> tag, overriding it with the stronger evidence of the BOM. Not sure why that doesn’t kick in and prevent this bug.
Comment 9 Alexey Proskuryakov 2020-05-06 09:28:46 PDT
Inspecting pasteboard contents after running the test manually in Safari, I see three flavors:

- public.html
- Apple HTML Pasteboard Type
- com.apple.WebKit.custom-pasteboard-data

The latter is a binary format that's presumably not used by any other code, so it's not relevant.

The two HTML flavors are identical, and contain UTF-8 fragment without a BOM. It may or may not be an accurate assumption that consumers handle these identically though! I don't think that stepping through what happens when pasting into TextEdit is necessarily sufficient.

I guess one ugly workaround could be to encode all text using HTML entities, that way it's all ASCII. 

But also, since AppKit always converts to UTF-8, shouldn't it just pass the encoding to -[WebFrame loadData:MIMEType:textEncodingName:baseURL:]?
Comment 10 Darin Adler 2020-05-06 09:47:22 PDT
(In reply to Alexey Proskuryakov from comment #9)
> But also, since AppKit always converts to UTF-8, shouldn't it just pass the
> encoding to -[WebFrame loadData:MIMEType:textEncodingName:baseURL:]?

That might work if AppKit is always used to decode the HTML. I was assuming that some apps might not use AppKit to decode and parse.
Comment 11 Wenson Hsieh 2020-05-06 09:49:52 PDT
(In reply to Darin Adler from comment #10)
> (In reply to Alexey Proskuryakov from comment #9)
> > But also, since AppKit always converts to UTF-8, shouldn't it just pass the
> > encoding to -[WebFrame loadData:MIMEType:textEncodingName:baseURL:]?
> 
> That might work if AppKit is always used to decode the HTML. I was assuming
> that some apps might not use AppKit to decode and parse.

And also, this impacts any client or app that tries to load markup as NSData from the pasteboard using -loadData:…: (or similar).
Comment 12 Wenson Hsieh 2020-05-06 09:54:01 PDT
Created attachment 398630 [details]
Patch
Comment 13 Darin Adler 2020-05-06 10:07:23 PDT
Comment on attachment 398630 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398630&action=review

Looks good to me.

> Tools/TestWebKitAPI/Tests/WebKitCocoa/CopyHTML.mm:144
> +#if PLATFORM(MAC)
> +            NSColor *color = (NSColor *)value;
> +#else
> +            UIColor *color = (UIColor *)value;
> +#endif

We need a CocoaColor typedef so we can write code like this without #if. I think David Kilzer added CocoaImage recently.
Comment 14 Darin Adler 2020-05-06 10:09:43 PDT
Comment on attachment 398630 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398630&action=review

> Source/WebCore/editing/markup.cpp:970
> +    if (!result.isAllASCII())
> +        builder.appendLiteral("<meta charset=\"UTF-8\">");

This is expedient, but I want to reiterate that this is working around specific problems in code in the Cocoa family platforms, and this doesn’t seem like the perfect place to do it, especially unconditionally.
Comment 15 Wenson Hsieh 2020-05-06 10:26:53 PDT
Comment on attachment 398630 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398630&action=review

>> Source/WebCore/editing/markup.cpp:970
>> +        builder.appendLiteral("<meta charset=\"UTF-8\">");
> 
> This is expedient, but I want to reiterate that this is working around specific problems in code in the Cocoa family platforms, and this doesn’t seem like the perfect place to do it, especially unconditionally.

Good point; I’ll limit this to Cocoa, then.

What do you think about adding something like:

static bool shouldAppendMetaCharsetWhenSanitizingMarkup()
{
#If PLATFORM(COCOA)
    return true;
#else
    return false;
#endif
}

…above sanitizedMarkupForFragmentInDocument, and consulting it here? I could otherwise call out to the client layer via an EditorClient hook, but it seems like that might be overkill.

>> Tools/TestWebKitAPI/Tests/WebKitCocoa/CopyHTML.mm:144
>> +#endif
> 
> We need a CocoaColor typedef so we can write code like this without #if. I think David Kilzer added CocoaImage recently.

I see! Yes, I think that would be a nice refactor. I noticed that CocoaImage.h is in WebKit; we could add a CocoaColor.h alongside it, but since this is in TestWebKitAPI, I’ll need to either mark it as an SPI header, or tweak HEADER_SEARCH_PATHS in the TestWebKitAPI project to include this directory using a relative path. We do a similar thing in the WebKitTestRunner project, for header files in Source/WebKit/Platform/spi/ios.

I prefer the latter, since the only “client” of this SPI header (if we were to expose it) would be our own testing infrastructure.
Comment 16 Alexey Proskuryakov 2020-05-06 10:30:17 PDT
What are the apps that don't use AppKit, and are broken? I tested a couple Adobe and Microsoft applications, and those didn't accept these pasteboard flavors at all. Neither did BBEdit.

Do we have any examples of apps using -loadData? That doesn't seem like an obvious idea for fragments.
Comment 17 Wenson Hsieh 2020-05-06 11:10:22 PDT
Created attachment 398637 [details]
Address review feedback
Comment 18 Darin Adler 2020-05-06 11:22:10 PDT
(In reply to Wenson Hsieh from comment #15)
> Good point; I’ll limit this to Cocoa, then.

That’s not what I was driving at. We predict that Cocoa will convert to UTF-8 so we protect ourselves by adding a meta element. This is about the *prediction* that some code *later* will convert to UTF-8.

> What do you think about adding something like:
> 
> static bool shouldAppendMetaCharsetWhenSanitizingMarkup()
> {
> #If PLATFORM(COCOA)
>     return true;
> #else
>     return false;
> #endif
> }
> 
> …above sanitizedMarkupForFragmentInDocument, and consulting it here? I could
> otherwise call out to the client layer via an EditorClient hook, but it
> seems like that might be overkill.

I don’t think this helps much. Seems like it would likely not do any harm elsewhere.

What seems “opportunistic” and possibly wrong here is the use of the “sanitize” function to do this. It’s not really logical. It seems too early to predict how the markup would be used. Also doesn’t seem guaranteed that all content that ends up on the pasteboard passes through this function.

At the very least I think there should be a function explaining why this is done here. I would want the comment to be brief, but there are 3 elements I think are non-obvious:

1) We predict someone might encode this as UTF-8, and add this meta element does no harm when this is encoded as UTF-16.

2) We know this is needed on Cocoa platforms; the platforms convert UTF-16 strings to UTF-8 automatically as part of pasteboard handling.

3) This is a good place to do this for the cut and copy to pasteboard operation because ... (compelling reason here).

> I see! Yes, I think that would be a nice refactor. I noticed that
> CocoaImage.h is in WebKit; we could add a CocoaColor.h alongside it, but
> since this is in TestWebKitAPI, I’ll need to either mark it as an SPI
> header, or tweak HEADER_SEARCH_PATHS in the TestWebKitAPI project to include
> this directory using a relative path. We do a similar thing in the
> WebKitTestRunner project, for header files in Source/WebKit/Platform/spi/ios.
> 
> I prefer the latter, since the only “client” of this SPI header (if we were
> to expose it) would be our own testing infrastructure.

I believe that right now we have a lot of Private headers just for testing. I didn’t know about this "testing-only" option. Would be nice to eliminate "using Private but really only for testing" headers and have them use this mechanism instead.
Comment 19 Darin Adler 2020-05-06 11:22:36 PDT
I think a comment is helpful. Not sure the function is needed.
Comment 20 Wenson Hsieh 2020-05-06 12:00:10 PDT
(In reply to Darin Adler from comment #18)
> (In reply to Wenson Hsieh from comment #15)
> > Good point; I’ll limit this to Cocoa, then.
> 
> That’s not what I was driving at. We predict that Cocoa will convert to
> UTF-8 so we protect ourselves by adding a meta element. This is about the
> *prediction* that some code *later* will convert to UTF-8.

Okay, I think I see…

It seems we’ll only want to apply this hack when writing markup strings then, and not when we’re sanitizing natively written data that we want to expose to bindings.
> 
> > What do you think about adding something like:
> > 
> > static bool shouldAppendMetaCharsetWhenSanitizingMarkup()
> > {
> > #If PLATFORM(COCOA)
> >     return true;
> > #else
> >     return false;
> > #endif
> > }
> > 
> > …above sanitizedMarkupForFragmentInDocument, and consulting it here? I could
> > otherwise call out to the client layer via an EditorClient hook, but it
> > seems like that might be overkill.
> 
> I don’t think this helps much. Seems like it would likely not do any harm
> elsewhere.
> 
> What seems “opportunistic” and possibly wrong here is the use of the
> “sanitize” function to do this. It’s not really logical. It seems too early
> to predict how the markup would be used. Also doesn’t seem guaranteed that
> all content that ends up on the pasteboard passes through this function.
> 
> At the very least I think there should be a function explaining why this is
> done here. I would want the comment to be brief, but there are 3 elements I
> think are non-obvious:
> 
> 1) We predict someone might encode this as UTF-8, and add this meta element
> does no harm when this is encoded as UTF-16.
> 
> 2) We know this is needed on Cocoa platforms; the platforms convert UTF-16
> strings to UTF-8 automatically as part of pasteboard handling.
> 
> 3) This is a good place to do this for the cut and copy to pasteboard
> operation because ... (compelling reason here).
> 
> > I see! Yes, I think that would be a nice refactor. I noticed that
> > CocoaImage.h is in WebKit; we could add a CocoaColor.h alongside it, but
> > since this is in TestWebKitAPI, I’ll need to either mark it as an SPI
> > header, or tweak HEADER_SEARCH_PATHS in the TestWebKitAPI project to include
> > this directory using a relative path. We do a similar thing in the
> > WebKitTestRunner project, for header files in Source/WebKit/Platform/spi/ios.
> > 
> > I prefer the latter, since the only “client” of this SPI header (if we were
> > to expose it) would be our own testing infrastructure.
> 
> I believe that right now we have a lot of Private headers just for testing.
> I didn’t know about this "testing-only" option. Would be nice to eliminate
> "using Private but really only for testing" headers and have them use this
> mechanism instead.

(In reply to Darin Adler from comment #18)
> (In reply to Wenson Hsieh from comment #15)
> > Good point; I’ll limit this to Cocoa, then.
> 
> That’s not what I was driving at. We predict that Cocoa will convert to
> UTF-8 so we protect ourselves by adding a meta element. This is about the
> *prediction* that some code *later* will convert to UTF-8.
> 
> > What do you think about adding something like:
> > 
> > static bool shouldAppendMetaCharsetWhenSanitizingMarkup()
> > {
> > #If PLATFORM(COCOA)
> >     return true;
> > #else
> >     return false;
> > #endif
> > }
> > 
> > …above sanitizedMarkupForFragmentInDocument, and consulting it here? I could
> > otherwise call out to the client layer via an EditorClient hook, but it
> > seems like that might be overkill.
> 
> I don’t think this helps much. Seems like it would likely not do any harm
> elsewhere.
> 
> What seems “opportunistic” and possibly wrong here is the use of the
> “sanitize” function to do this. It’s not really logical. It seems too early
> to predict how the markup would be used. Also doesn’t seem guaranteed that
> all content that ends up on the pasteboard passes through this function.
> 
> At the very least I think there should be a function explaining why this is
> done here. I would want the comment to be brief, but there are 3 elements I
> think are non-obvious:
> 
> 1) We predict someone might encode this as UTF-8, and add this meta element
> does no harm when this is encoded as UTF-16.
> 
> 2) We know this is needed on Cocoa platforms; the platforms convert UTF-16
> strings to UTF-8 automatically as part of pasteboard handling.
> 
> 3) This is a good place to do this for the cut and copy to pasteboard
> operation because ... (compelling reason here).
> 
> > I see! Yes, I think that would be a nice refactor. I noticed that
> > CocoaImage.h is in WebKit; we could add a CocoaColor.h alongside it, but
> > since this is in TestWebKitAPI, I’ll need to either mark it as an SPI
> > header, or tweak HEADER_SEARCH_PATHS in the TestWebKitAPI project to include
> > this directory using a relative path. We do a similar thing in the
> > WebKitTestRunner project, for header files in Source/WebKit/Platform/spi/ios.
> > 
> > I prefer the latter, since the only “client” of this SPI header (if we were
> > to expose it) would be our own testing infrastructure.
> 
> I believe that right now we have a lot of Private headers just for testing.
> I didn’t know about this "testing-only" option. Would be nice to eliminate
> "using Private but really only for testing" headers and have them use this
> mechanism instead.
Comment 21 Wenson Hsieh 2020-05-06 12:10:12 PDT
Created attachment 398643 [details]
Address review feedback (2)
Comment 22 Darin Adler 2020-05-06 12:25:33 PDT
Comment on attachment 398643 [details]
Address review feedback (2)

Not thrilled that this makes things more complicated. Kind of liked just doing it unconditionally. Not sure how we know we found all the places that AddMetaCharsetIfNeeded is needed. Something about the AddMetaCharsetIfNeeded name doesn’t seem perfect. It seems to concentrate too much on "which tag we are adding" as opposed to "make string that will work if converted to UTF-8". r=me as is, though
Comment 23 Wenson Hsieh 2020-05-06 12:36:58 PDT
(In reply to Darin Adler from comment #22)
> Comment on attachment 398643 [details]
> Address review feedback (2)
> 
> Not thrilled that this makes things more complicated. Kind of liked just
> doing it unconditionally. Not sure how we know we found all the places that
> AddMetaCharsetIfNeeded is needed. Something about the AddMetaCharsetIfNeeded
> name doesn’t seem perfect. It seems to concentrate too much on "which tag we
> are adding" as opposed to "make string that will work if converted to
> UTF-8". r=me as is, though

From code inspection, these are the only two places in WebKit where we take markup strings provided directly to us from the page through DOM API, and sanitize it with the purpose of writing to the system clipboard.

I will use <https://bugs.webkit.org/show_bug.cgi?id=211524> to track finding a better name for this enum (or possibly replacing it altogether).
Comment 24 Darin Adler 2020-05-06 13:02:33 PDT
(In reply to Wenson Hsieh from comment #23)
> From code inspection, these are the only two places in WebKit where we take
> markup strings provided directly to us from the page through DOM API, and
> sanitize it with the purpose of writing to the system clipboard.

What limits this to markup strings provided by the page? Seems like we could have the same kind of problem when serializing markup from the DOM as part of a copy operation.
Comment 25 EWS 2020-05-06 13:02:43 PDT
Committed r261247: <https://trac.webkit.org/changeset/261247>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 398643 [details].
Comment 26 Wenson Hsieh 2020-05-06 13:26:24 PDT
(In reply to Darin Adler from comment #24)
> (In reply to Wenson Hsieh from comment #23)
> > From code inspection, these are the only two places in WebKit where we take
> > markup strings provided directly to us from the page through DOM API, and
> > sanitize it with the purpose of writing to the system clipboard.
> 
> What limits this to markup strings provided by the page? Seems like we could
> have the same kind of problem when serializing markup from the DOM as part
> of a copy operation.

When copying from the DOM (e.g. selected text), we also write RTF and web archive data to the pasteboard, which don’t have the same problem. Both Notes and TextEdit prefer these representations over markup. It is true, however, that if a client were to still grab the HTML data and try to load it in a web page, it would be broken due to lack of encoding information.

This sounds like a good reason to push the `meta charset` hack down to `serializePreservingVisualAppearanceInternal`, and add a test for this case (where the user is copying selected text).
Comment 27 Darin Adler 2020-05-06 13:49:32 PDT
(In reply to Wenson Hsieh from comment #26)
> This sounds like a good reason to push the `meta charset` hack down to
> `serializePreservingVisualAppearanceInternal`, and add a test for this case
> (where the user is copying selected text).

Yes, that’s the kind of follow through I was thinking about!
Comment 28 Wenson Hsieh 2020-05-06 14:02:23 PDT
(In reply to Darin Adler from comment #27)
> (In reply to Wenson Hsieh from comment #26)
> > This sounds like a good reason to push the `meta charset` hack down to
> > `serializePreservingVisualAppearanceInternal`, and add a test for this case
> > (where the user is copying selected text).
> 
> Yes, that’s the kind of follow through I was thinking about!

I will follow up on this soon, in https://bugs.webkit.org/show_bug.cgi?id=211524!