Bug 26791

Summary: [Gtk] Paste of rich text from firefox results garbled markup
Product: WebKit Reporter: Jiahua Huang <jhuangjiahua>
Component: WebKitGTKAssignee: Nobody <webkit-unassigned>
Status: CLOSED FIXED    
Severity: Normal CC: darin, eric, jhuangjiahua, jmalonzo
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
Fix problem with UTF-16 clipboard pasted
none
update patch to r45340
none
Fix the length size of GtkSelectionData from Pidgin and FireFox
eric: review-
use TextResourceDecoder instead of simple detectTextEncoding
none
change mimeType to decrease the cost
none
add a manual test in WebCore/manual-tests/gtk jmalonzo: review+

Jiahua Huang
Reported 2009-06-29 00:57:44 PDT
WebkitGtk HTML Editing not accustomed to UTF-16 clipboard. When paste rich text to editablearea, (like gmail Rich formatting editor, or any application use WebkitGtk as html editor) WebkitGtk assumed that SelectionData from gtk clipboard is always UTF-8 encoding. But ,in fact, some applications like FireFox put SelectionData for clipboard use UTF-16 encoding, so rich text markup garbled. To reproduce: 0. Launch FireFox to browse http://www.apple.com 1. Copy that text or picture from FireFox page 2. Launch a WebkitGtk browser (Midori or GtkLauncher) and open http://www.mozilla.org/editor/midasdemo/ , it can see a rich formatting editor 3. Paste in the rich formatting editor Results: pasted contents of editor is garbled markup
Attachments
Fix problem with UTF-16 clipboard pasted (1.51 KB, patch)
2009-06-29 01:04 PDT, Jiahua Huang
no flags
update patch to r45340 (1.58 KB, patch)
2009-06-29 10:35 PDT, Jiahua Huang
no flags
Fix the length size of GtkSelectionData from Pidgin and FireFox (1.56 KB, patch)
2009-06-30 09:42 PDT, Jiahua Huang
eric: review-
use TextResourceDecoder instead of simple detectTextEncoding (1.61 KB, patch)
2009-07-03 08:33 PDT, Jiahua Huang
no flags
change mimeType to decrease the cost (1.65 KB, patch)
2009-07-03 09:59 PDT, Jiahua Huang
no flags
add a manual test in WebCore/manual-tests/gtk (3.31 KB, patch)
2009-07-04 16:20 PDT, Jiahua Huang
jmalonzo: review+
Jiahua Huang
Comment 1 2009-06-29 01:04:06 PDT
Created attachment 31999 [details] Fix problem with UTF-16 clipboard pasted
Jiahua Huang
Comment 2 2009-06-29 10:35:00 PDT
Created attachment 32013 [details] update patch to r45340
Eric Seidel (no email)
Comment 3 2009-06-30 00:40:38 PDT
Comment on attachment 32013 [details] update patch to r45340 Can there be other encodings on the gtk pasteboard? Should we be using TextDecoder instead here? This is also really lame that this isn't standardized in some way...
Jiahua Huang
Comment 4 2009-06-30 01:14:28 PDT
(In reply to comment #3) > (From update of attachment 32013 [details] [review]) > Can there be other encodings on the gtk pasteboard? No, only UTF-8 and "Unicode" (UTF-16LE, UTF-16BE) now. General gtk apps use UTF-8, and others use Unicode. > Should we be using > TextDecoder instead here? This is also really lame that this isn't > standardized in some way... Thanks for your comments. They were£¬It was enough for UTF-8 and Unicode.
Eric Seidel (no email)
Comment 5 2009-06-30 02:53:10 PDT
Comment on attachment 32013 [details] update patch to r45340 Won't this still end up with the UTF16 BOM char pasted into your web page? WebKit is smart enough to ignore BOMs in the middle of pages, but it seems silly to paste them in.
Jiahua Huang
Comment 6 2009-06-30 03:34:52 PDT
(In reply to comment #5) > (From update of attachment 32013 [details] [review]) > Won't this still end up with the UTF16 BOM char pasted into your web page? > WebKit is smart enough to ignore BOMs in the middle of pages, but it seems > silly to paste them in. > Hi, Eric, The problem of garbled is not on the BOMs, it is that String::fromUTF8() the code in WebCore/platform/text/String.cpp: > String String::fromUTF8(const char* string, size_t size) > > String String::fromUTF8(const char* string, size_t size) > { > if (!string) > return String(); > return UTF8Encoding().decode(string, size); > } the string must be legal UTF-8 encoding chars, and the size must be right length. When gtk.Clipbord put a UTF-16 data, the string is not legal UTF-8 encoding chars but UChar, and the length muse be Halved. So, the old code will wrong for encoding (multiple decoding) and size, the wrong of size (double the right length) will make markups confused. So, I maked this changes: > - String html = String::fromUTF8(reinterpret_cast<gchar*>(data->data), data->length * data->format / 8); > + String html; > + if (reinterpret_cast<UChar*>(data->data)[0] == 0xFEFF > + || reinterpret_cast<UChar*>(data->data)[0] == 0xFFFE) > + html = String(reinterpret_cast<UChar*>(data->data), data->length * data->format / 16); > + else > + html = String::fromUTF8(reinterpret_cast<gchar*>(data->data), data->length * data->format / 8); Let it use String::String() instead of String::fromUTF8 when gtk.Clipbord put a UTF-16 data (BOMs used to discern)
Jiahua Huang
Comment 7 2009-06-30 03:38:06 PDT
(In reply to comment #6) > Let it use String::String() instead of String::fromUTF8 > when gtk.Clipbord put a UTF-16 data (BOMs used to discern) > and Halved the length size ("data->length * data->format / 16" instead of "data->length * data->format / 8")
Jiahua Huang
Comment 8 2009-06-30 09:42:13 PDT
Created attachment 32072 [details] Fix the length size of GtkSelectionData from Pidgin and FireFox UTF-16 (with BOMs) GtkSelectionData must a halfed length, but some apps wrong for format size, in which case, I use "data->length / 2" to fix it.
Jiahua Huang
Comment 9 2009-06-30 10:24:47 PDT
Comment on attachment 32072 [details] Fix the length size of GtkSelectionData from Pidgin and FireFox UTF-16 (with BOMs) GtkSelectionData must a halfed length, but some apps wrong for format size, in which case, I use "data->length / 2" to fix it.
Eric Seidel (no email)
Comment 10 2009-06-30 10:59:59 PDT
Comment on attachment 32072 [details] Fix the length size of GtkSelectionData from Pidgin and FireFox It seems like this would do a better job: TextEncoding clipboardEncoding; if (detectTextEncoding(data->data, data->len, "utf8", &clipboardEncoding)) html = clipboardEncoding.decode(data->data, data->len); This also needs a manual test in WebCore/manual-tests/gtk
Jiahua Huang
Comment 11 2009-06-30 11:19:39 PDT
(In reply to comment #10) > (From update of attachment 32072 [details] [review]) > It seems like this would do a better job: > Thanks, I appreciate your comments. I'm trying to add a manual-tests.
Jiahua Huang
Comment 12 2009-06-30 11:30:32 PDT
(In reply to comment #10) > (From update of attachment 32072 [details] [review]) > It seems like this would do a better job: > Sorry, I tried this code, but it didn't work. > String html; > TextEncoding clipboardEncoding; > if (detectTextEncoding(reinterpret_cast<char*>(data->data), data->length, "utf8", &clipboardEncoding)) > html = clipboardEncoding.decode(reinterpret_cast<char*>(data->data), data->length); I'm trying to find way
Jiahua Huang
Comment 13 2009-06-30 11:34:06 PDT
(In reply to comment #12) > Sorry, > I tried this code, but it didn't work. > It can paste text from FireFox, other WebkitGtk Page, Pidgin MessageView, but formats have been lost&#12290; It seems detectTextEncoding always return False, so it fallback to PlainText.
Eric Seidel (no email)
Comment 14 2009-06-30 11:34:58 PDT
(In reply to comment #12) > Sorry, > I tried this code, but it didn't work. > > > String html; > > TextEncoding clipboardEncoding; > > if (detectTextEncoding(reinterpret_cast<char*>(data->data), data->length, "utf8", &clipboardEncoding)) > > html = clipboardEncoding.decode(reinterpret_cast<char*>(data->data), data->length); Does "UTF-8" work? I admit, I haven't used detectTextEncoding much.
Eric Seidel (no email)
Comment 15 2009-06-30 11:36:08 PDT
Oh, it's also possible that detectTextEncoding isn't wired up for the gtk port. TextEncodingDetectorICU.cpp is the file I would expect gtk to be using.
Jiahua Huang
Comment 16 2009-06-30 11:42:01 PDT
(In reply to comment #14) > Does "UTF-8" work? I admit, I haven't used detectTextEncoding much. > no :(
Jiahua Huang
Comment 17 2009-06-30 11:46:39 PDT
(In reply to comment #15) > Oh, it's also possible that detectTextEncoding isn't wired up for the gtk port. > TextEncodingDetectorICU.cpp is the file I would expect gtk to be using. > configure of WebkitGtk has this Optional: > --with-unicode-backend=[icu/glib] > Select Unicode backend (WARNING: the glib-based > backend is slow, and incomplete) default=icu I selected the default icu, and I had add > #include "TextEncodingDetector.h" in WebCore/platform/gtk/PasteboardGtk.cpp
Eric Seidel (no email)
Comment 18 2009-06-30 12:14:13 PDT
I would encourage you to set a breakpoint in detectTextEncoding, because I believe your use-case should work fine if the gtk build has encoding detection wired up correctly.
Jiahua Huang
Comment 19 2009-07-03 08:21:46 PDT
(In reply to comment #18) > I would encourage you to set a breakpoint in detectTextEncoding, because I > believe your use-case should work fine if the gtk build has encoding detection > wired up correctly. Hi, I use TextResourceDecoder instead of simple detectTextEncoding. this's the code: > String html; > html = TextResourceDecoder::create("text/html", "UTF-8", true)->decode(reinterpret_cast<char*>(data->data), data->length); it works fine.
Jiahua Huang
Comment 20 2009-07-03 08:33:33 PDT
Created attachment 32238 [details] use TextResourceDecoder instead of simple detectTextEncoding change to this code: > String html; > RefPtr<TextResourceDecoder> decoder = TextResourceDecoder::create("text/html", "UTF-8", true); > html = decoder->decode(reinterpret_cast<char*>(data->data), data->length); it works. GtkSelectionData (html) from GTK clipboard can be some forms: 1. From other WebkitGtk page "text/html", 8bits, utf-8, no BOMs 2. From FireFox page "text/html", 8bits (wrong for firefox), utf-16, BOMs 3. From Pidgin messageview "text/html", 16bits, BOMs 4. From other Gtk App who use "text/html" when copy-paste "text/html", 8bits, utf-8, no BOMs ... it use TextResourceDecoder to auto detect encoding.
Jiahua Huang
Comment 21 2009-07-03 09:59:46 PDT
Created attachment 32239 [details] change mimeType to decrease the cost Use "text/plain" mimeType instead of "text/html" in TextResourceDecoder, it will decrease the checking cost.
Jan Alonzo
Comment 22 2009-07-04 14:55:43 PDT
(In reply to comment #21) > Created an attachment (id=32239) [details] > change mimeType to decrease the cost > > Use "text/plain" mimeType instead of "text/html" in TextResourceDecoder, > it will decrease the checking cost. Hi Jiahua. Are you able to add manual-test for this? That seems to be a missing piece in your patch. Thanks!
Jiahua Huang
Comment 23 2009-07-04 16:20:41 PDT
Created attachment 32264 [details] add a manual test in WebCore/manual-tests/gtk (In reply to comment #22) > Hi Jiahua. Are you able to add manual-test for this? That seems to be a missing > piece in your patch. > > Thanks! Hi Jan, WebCore/manual-tests/gtk/paste-richtext-from-firefox.html has been added.
Jan Alonzo
Comment 24 2009-07-07 05:17:10 PDT
Comment on attachment 32264 [details] add a manual test in WebCore/manual-tests/gtk > - String html = String::fromUTF8(reinterpret_cast<gchar*>(data->data), data->length * data->format / 8); > + String html; > + RefPtr<TextResourceDecoder> decoder = TextResourceDecoder::create("text/plain", "UTF-8", true); > + html = decoder->decode(reinterpret_cast<char*>(data->data), data->length); > + html += decoder->flush(); > gtk_selection_data_free(data); Declaration and the first assignment can be in the same line here. Looks fine and the patch works great! r=me.
Jiahua Huang
Comment 25 2009-07-07 05:44:29 PDT
(In reply to comment #24) > (From update of attachment 32264 [details]) > > - String html = String::fromUTF8(reinterpret_cast<gchar*>(data->data), data->length * data->format / 8); > > + String html; > > + RefPtr<TextResourceDecoder> decoder = TextResourceDecoder::create("text/plain", "UTF-8", true); > > + html = decoder->decode(reinterpret_cast<char*>(data->data), data->length); > > + html += decoder->flush(); > > gtk_selection_data_free(data); > > Declaration and the first assignment can be in the same line here. > > Looks fine and the patch works great! r=me. Thank you, Jan, I appreciate your comments. I observ'd also, when I select some html content, and Copy them use Ctrl-C, the Pasteboard/GtkClipboard can't handle the "text/html" target. I found that the "text/html" target only in X clipboard (GDK_SELECTION_CLIPBOARD), but not Desktop clipboard (GDK_SELECTION_PRIMARY). So we can't use Ctrl-C to Copy rich text from WebkitGtk. I use this Change: > Index: WebCore/platform/gtk/PasteboardGtk.cpp > =================================================================== > --- WebCore/platform/gtk/PasteboardGtk.cpp (revision 45546) > +++ WebCore/platform/gtk/PasteboardGtk.cpp (working copy) > @@ -37,8 +38,8 @@ namespace WebCore { > /* FIXME: we must get rid of this and use the enum in webkitwebview.h someway */ > typedef enum > { > - WEBKIT_WEB_VIEW_TARGET_INFO_HTML = - 1, > - WEBKIT_WEB_VIEW_TARGET_INFO_TEXT = - 2 > + WEBKIT_WEB_VIEW_TARGET_INFO_HTML, > + WEBKIT_WEB_VIEW_TARGET_INFO_TEXT > } WebKitWebViewTargetInfo; > it works. it can use Ctrl-C to Copy rich text from WebkitGtk page now. but I'm not sure if it will be a new patch. Jan, should fix it when landing, or should I create a new patch with ChangeLog, or open a new Bug? Regards
Jan Alonzo
Comment 26 2009-07-07 05:59:05 PDT
> > Looks fine and the patch works great! r=me. Landed in r45592. > > Index: WebCore/platform/gtk/PasteboardGtk.cpp > > =================================================================== > > --- WebCore/platform/gtk/PasteboardGtk.cpp (revision 45546) > > +++ WebCore/platform/gtk/PasteboardGtk.cpp (working copy) > > @@ -37,8 +38,8 @@ namespace WebCore { > > /* FIXME: we must get rid of this and use the enum in webkitwebview.h someway */ > > typedef enum > > { > > - WEBKIT_WEB_VIEW_TARGET_INFO_HTML = - 1, > > - WEBKIT_WEB_VIEW_TARGET_INFO_TEXT = - 2 > > + WEBKIT_WEB_VIEW_TARGET_INFO_HTML, > > + WEBKIT_WEB_VIEW_TARGET_INFO_TEXT > > } WebKitWebViewTargetInfo; > > > > it works. > it can use Ctrl-C to Copy rich text from WebkitGtk page now. > > > but I'm not sure if it will be a new patch. Please file a new bug with the problem that you just encountered (including your patch too). Please CC me on the bug as well. Thanks
Jiahua Huang
Comment 27 2009-07-07 06:45:21 PDT
(In reply to comment #26) > Please file a new bug with the problem that you just encountered (including > your patch too). Please CC me on the bug as well. > > Thanks Well, I see, wait for a moment。 Thanks. Should I mark this bug as CLOSED.
Jiahua Huang
Comment 28 2009-07-07 07:24:48 PDT
(In reply to comment #26) Hi, Jan, I had file a new bug on https://bugs.webkit.org/show_bug.cgi?id=27028 and CC you on the bug. but when I submitted for bug 27028, it say: > Excluding: > <your email>
Note You need to log in before you can comment on or make changes to this bug.