Summary: | REGRESSION (r223440): Copying & pasting a list from Microsoft Word to TinyMCE fails | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | WebKit | Reporter: | Ryosuke Niwa <rniwa> | ||||||||||||||||||||||||||||||||||
Component: | HTML Editing | Assignee: | Ryosuke Niwa <rniwa> | ||||||||||||||||||||||||||||||||||
Status: | RESOLVED FIXED | ||||||||||||||||||||||||||||||||||||
Severity: | Normal | CC: | bfulgham, commit-queue, ews-watchlist, m.lewandowski, m.samsel, thespyder, wenson_hsieh | ||||||||||||||||||||||||||||||||||
Priority: | P2 | Keywords: | InRadar | ||||||||||||||||||||||||||||||||||
Version: | WebKit Nightly Build | ||||||||||||||||||||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||||||||||||||||||||
Bug Depends on: | 124391 | ||||||||||||||||||||||||||||||||||||
Bug Blocks: | 182938, 182954 | ||||||||||||||||||||||||||||||||||||
Attachments: |
|
Description
Ryosuke Niwa
2018-02-06 23:53:15 PST
Created attachment 333269 [details]
WIP
Created attachment 333361 [details]
WIP2
The code change is done. Will write tests next. Created attachment 333362 [details]
WIP3
Oops, there was a tiny bug there.
My initial impressions are that this gives us most (if not all) of what we need. It would be really nice if we could get a valid HTML document, though, instead of "<html><style></style>content" :) I'll adjust my code next week to pick up the style tag this way and see if there are any problems. Created attachment 333448 [details]
Fixes the bug
Created attachment 333449 [details]
Updated for ToT
(In reply to Andrew Herron from comment #6) > My initial impressions are that this gives us most (if not all) of what we > need. It would be really nice if we could get a valid HTML document, though, > instead of "<html><style></style>content" :) > > I'll adjust my code next week to pick up the style tag this way and see if > there are any problems. Oh, what kind of markup do you want to see instead? <html><body><style>? I can technically handle anything, but a proper HTML document would be easiest to work with :) <html><head><style></style></head><body>content</body></html> Comment on attachment 333449 [details] Updated for ToT Attachment 333449 [details] did not pass win-ews (win): Output: http://webkit-queues.webkit.org/results/6429691 New failing tests: http/tests/security/clipboard/copy-paste-html-across-origin-strips-mso-list.html Created attachment 333475 [details]
Archive of layout-test-results from ews205 for win-future
The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews205 Port: win-future Platform: CYGWIN_NT-6.1-2.9.0-0.318-5-3-x86_64-64bit
(In reply to Andrew Herron from comment #10) > I can technically handle anything, but a proper HTML document would be > easiest to work with :) > > <html><head><style></style></head><body>content</body></html> <html><style>content is a perfectly a valid HTML5 document. Having said that, we can make that change. (In reply to Ryosuke Niwa from comment #13) > (In reply to Andrew Herron from comment #10) > > I can technically handle anything, but a proper HTML document would be > > easiest to work with :) > > > > <html><head><style></style></head><body>content</body></html> > > <html><style>content is a perfectly a valid HTML5 document. Having said > that, we can make that change. Never mind. This is actually tricky because we're using the fragment parsing algorithm which strips away head. Comment on attachment 333449 [details] Updated for ToT View in context: https://bugs.webkit.org/attachment.cgi?id=333449&action=review Just some preliminary comments — I'll continue looking. > Source/WebCore/ChangeLog:25 > + that don't enable custom pasteboard data (detected by the content origin being null), and the HTML that don't => that doesn't > Source/WebCore/ChangeLog:30 > + PasteHTML.SripsMSOListWhenMissingMSOHTMLElement Srips => Strips > Source/WebCore/ChangeLog:58 > + set to Enable whenever the content COULD be the one generated by Microsoft Word. Not very intuitive, but sensible I suppose :/ Maybe MSOListQuirks should be renamed to something more like `MayUseMSOListQuirks { No, Yes }` to make this more obvious? > Source/WebCore/ChangeLog:65 > + attributes, one containing computed styles, and another containing mso-list. Luckily, HTML parsering parsering => parsing > Source/WebCore/ChangeLog:66 > + rule dictats that the first attribute always wins when more than attributes of the same name appears, dictats => dictates more than attributes of the same name appears => multiple attributes of the same name appear > Source/WebCore/editing/markup.cpp:596 > +bool StyledMarkupAccumulator::appendNodeToPreserveMSOList(Node& node, bool& inMSOList) Maybe set `inMSOList = false` at the beginning of this function so it always has a correct value after this call? (In reply to Wenson Hsieh from comment #15) > Comment on attachment 333449 [details] > Updated for ToT > > View in context: > https://bugs.webkit.org/attachment.cgi?id=333449&action=review > > Just some preliminary comments — I'll continue looking. > > > Source/WebCore/ChangeLog:25 > > + that don't enable custom pasteboard data (detected by the content origin being null), and the HTML > > that don't => that doesn't Fixed. > > Source/WebCore/ChangeLog:30 > > + PasteHTML.SripsMSOListWhenMissingMSOHTMLElement > > Srips => Strips Fixed. > > Source/WebCore/ChangeLog:58 > > + set to Enable whenever the content COULD be the one generated by Microsoft Word. > > Not very intuitive, but sensible I suppose :/ > > Maybe MSOListQuirks should be renamed to something more like > `MayUseMSOListQuirks { No, Yes }` to make this more obvious? So the idea here is that MSOListQuirks is like a runtime feature. When the feature is enabled, we'd check for html tag, and do the quirks if needed. > > Source/WebCore/ChangeLog:65 > > + attributes, one containing computed styles, and another containing mso-list. Luckily, HTML parsering > > parsering => parsing Fixed. > > > Source/WebCore/ChangeLog:66 > > + rule dictats that the first attribute always wins when more than attributes of the same name appears, > > dictats => dictates Fixed. > more than attributes of the same name appears => multiple attributes of the > same name appear Fixed. > > > Source/WebCore/editing/markup.cpp:596 > > +bool StyledMarkupAccumulator::appendNodeToPreserveMSOList(Node& node, bool& inMSOList) > > Maybe set `inMSOList = false` at the beginning of this function so it always > has a correct value after this call? Oh, the whole point of that boolean is it changes over time during the traversal. It gets updated in this function. Made it a member variable instead to make this clear. Created attachment 333520 [details]
Skip the test on Windows
> > Not very intuitive, but sensible I suppose :/ > > > > Maybe MSOListQuirks should be renamed to something more like > > `MayUseMSOListQuirks { No, Yes }` to make this more obvious? > > So the idea here is that MSOListQuirks is like a runtime feature. When the > feature is enabled, we'd check for html tag, and do the quirks if needed. Right, I understand that — it's just that the purpose wasn't obvious to me from the name alone (MSOListQuirks) and I was initially confused why there were 2 flags. Though, I suppose the enum values, ::Enable and ::Disable should be enough to make this clear, so it's fine. > > Maybe set `inMSOList = false` at the beginning of this function so it always > > has a correct value after this call? > > Oh, the whole point of that boolean is it changes over time during the > traversal. It gets updated in this function. Made it a member variable > instead to make this clear. Oh ok, now I see! Your new approach makes this a lot clearer. 👍🏻 Comment on attachment 333520 [details] Skip the test on Windows View in context: https://bugs.webkit.org/attachment.cgi?id=333520&action=review > Source/WebCore/editing/MarkupAccumulator.cpp:204 > + ASSERT(length <= text.data().length()); Perhaps `start + length <= text.data().length()`? > Source/WebCore/editing/markup.cpp:487 > + // Unfortunately, TinyMCE doesn't recognize mso-list inlie style if the style attribute contains properties. inlie => inline > Source/WebCore/editing/markup.cpp:600 > + Comment& commentNode = downcast<Comment>(node); Nit - auto&? (In reply to Wenson Hsieh from comment #19) > Comment on attachment 333520 [details] > Skip the test on Windows > > View in context: > https://bugs.webkit.org/attachment.cgi?id=333520&action=review > > > Source/WebCore/editing/MarkupAccumulator.cpp:204 > > + ASSERT(length <= text.data().length()); > > Perhaps `start + length <= text.data().length()`? Oh oops, nice catch. Fixed. > > Source/WebCore/editing/markup.cpp:487 > > + // Unfortunately, TinyMCE doesn't recognize mso-list inlie style if the style attribute contains properties. > > inlie => inline Fixed. > > Source/WebCore/editing/markup.cpp:600 > > + Comment& commentNode = downcast<Comment>(node); > > Nit - auto&? Fixed. (In reply to Wenson Hsieh from comment #18) > > > Not very intuitive, but sensible I suppose :/ > > > > > > Maybe MSOListQuirks should be renamed to something more like > > > `MayUseMSOListQuirks { No, Yes }` to make this more obvious? > > > > So the idea here is that MSOListQuirks is like a runtime feature. When the > > feature is enabled, we'd check for html tag, and do the quirks if needed. > > Right, I understand that — it's just that the purpose wasn't obvious to me > from the name alone (MSOListQuirks) and I was initially confused why there > were 2 flags. Though, I suppose the enum values, ::Enable and ::Disable > should be enough to make this clear, so it's fine. Okay, I'd rename MSOListQuirks::Enabled to MSOListQuirks::CheckIfNeeded as we discussed in person. Created attachment 333550 [details]
Patch for landing
Comment on attachment 333550 [details]
Patch for landing
Wait for EWS first.
Committed r228352: <https://trac.webkit.org/changeset/228352> (In reply to Ryosuke Niwa from comment #14) > (In reply to Ryosuke Niwa from comment #13) > > > > <html><style>content is a perfectly a valid HTML5 document. Having said > > that, we can make that change. > > Never mind. This is actually tricky because we're using the fragment parsing > algorithm which strips away head. I realise it's valid, however previously we could rely on the raw clipboard being a "proper" document. But as I said it's not a huge issue it just means some adjustment to how I handle the structure before I can validate if this includes everything we need. Reopening to attach new patch. Created attachment 333610 [details]
Fixes the off-by-one error
Attachment 333610 [details] did not pass style-queue:
ERROR: Tools/ChangeLog:8: Line contains tab character. [whitespace/tab] [5]
ERROR: Tools/ChangeLog:9: Line contains tab character. [whitespace/tab] [5]
Total errors found: 2 in 5 files
If any of these errors are false positives, please file a bug against check-webkit-style.
Comment on attachment 333610 [details] Fixes the off-by-one error View in context: https://bugs.webkit.org/attachment.cgi?id=333610&action=review r=me > Tools/ChangeLog:8 > + Added assertions to make sure the trailing } is includd in the style. Nit - it looks like there's a tab character here. (In reply to Wenson Hsieh from comment #29) > Comment on attachment 333610 [details] > Fixes the off-by-one error > > View in context: > https://bugs.webkit.org/attachment.cgi?id=333610&action=review > > r=me > > > Tools/ChangeLog:8 > > + Added assertions to make sure the trailing } is includd in the style. > > Nit - it looks like there's a tab character here. Oops, yeah, will fix before landing. Created attachment 333614 [details]
Fixed off by one error
Comment on attachment 333614 [details]
Fixed off by one error
This doesn't quite cut it. The original patch broke Evernote & Wordpress.com :(
(In reply to Wenson Hsieh from comment #29) > Comment on attachment 333610 [details] > Fixes the off-by-one error > > View in context: > https://bugs.webkit.org/attachment.cgi?id=333610&action=review > > r=me > > > Tools/ChangeLog:8 > > + Added assertions to make sure the trailing } is includd in the style. > > Nit - it looks like there's a tab character here. You've also missed the e in "included" :) Created attachment 333653 [details]
Fixes Wordpress, evernote, and Gmail
I like where this patch is going with extra style definitions included. Those are the main thing I was concerned we might need to fully process lists :)
I wonder whether the variations in behaviour across TinyMCE versions you mention in your changelog are due to plugins; only our PowerPaste plugin uses detailed list processing (it's enabled on tinymce.com), the default paste implementation has very basic word import.
Treating the style tag as text may be a bug that I can address rather than needing a webkit workaround (although putting it inside a head tag definitely helps). It's most likely because the content is not inside a body tag - in the current release if a body tag isn't found it assumes the entire clipboard is content. This is related to why I asked for the structure change earlier. I'm going to need to release a plugin update to support this change anyway so if the changes here break paste on the current release that's not the end of the world.
I have some feedback on the current patch:
> if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(classAttr).startsWith("MsoListParagraph"))
This won't catch all list items in the content. Just from a quick search I'm seeing paragraphs with class "MsoHeading7" that have mso-list attributes, as well as "MsoNormal". And I'm sure there are others. Having a whitelist for this seems like the sort of thing that could easily break between versions of word.
In addition, this "MsoListParagraph" filter isn't applied to the comment nodes in appendNodeToPreserveMSOList so the results are inconsistent. I'm seeing paragraphs without style mso-list but with the "supportsLists" comments inside.
I made a custom patch on my machine that relaxes the override checks a bit and I'm getting _really_ good results.
(In reply to Andrew Herron from comment #35) > > > if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(classAttr).startsWith("MsoListParagraph")) > > This won't catch all list items in the content. Just from a quick search I'm > seeing paragraphs with class "MsoHeading7" that have mso-list attributes, as > well as "MsoNormal". And I'm sure there are others. Having a whitelist for > this seems like the sort of thing that could easily break between versions > of word. That doesn't seem to affect the end result. > In addition, this "MsoListParagraph" filter isn't applied to the comment > nodes in appendNodeToPreserveMSOList so the results are inconsistent. I'm > seeing paragraphs without style mso-list but with the "supportsLists" > comments inside. Those nodes are taken care of by m_inMSOList check. > I made a custom patch on my machine that relaxes the override checks a bit > and I'm getting _really_ good results. What are the changes you made? Comment on attachment 333653 [details] Fixes Wordpress, evernote, and Gmail View in context: https://bugs.webkit.org/attachment.cgi?id=333653&action=review r=me > Tools/ChangeLog:33 > + Did you mean to have these two separate ChangeLog entries? Created attachment 333727 [details]
Fixes older versions of TinyMCE
(In reply to Ryosuke Niwa from comment #36) > (In reply to Andrew Herron from comment #35) > > > > > if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(classAttr).startsWith("MsoListParagraph")) > > > > This won't catch all list items in the content. Just from a quick search I'm > > seeing paragraphs with class "MsoHeading7" that have mso-list attributes, as > > well as "MsoNormal". And I'm sure there are others. Having a whitelist for > > this seems like the sort of thing that could easily break between versions > > of word. > > That doesn't seem to affect the end result. This issue means mso-list styles are stripped from actual list items, and every time that happens we can't process the item. If the documents you're testing with all appear fine I will try to find some more complex documents that I can share - my test data mostly comes from customers so I need to be careful. > > In addition, this "MsoListParagraph" filter isn't applied to the comment > > nodes in appendNodeToPreserveMSOList so the results are inconsistent. I'm > > seeing paragraphs without style mso-list but with the "supportsLists" > > comments inside. > > Those nodes are taken care of by m_inMSOList check. But that isn't used for the style attribute output. I think the two should be linked somehow. As I said, in some documents the HTML I get on paste has paragraphs with the supportsLists comments and related data but no mso-list style on the P tag. That's the problem. If you're ever going to strip mso-list from a paragraph (which per above is a bad idea imo) it seems like the comments should be removed as well. My code deals with these comments, but only when processing a list paragraph which is detected by the mso-list style. Without mso-list, these comments make it into customer content (as a part of fake list-like paragraphs that are just confusing). > > I made a custom patch on my machine that relaxes the override checks a bit > > and I'm getting _really_ good results. > > What are the changes you made? I targeted two checks: - Style attribute preservation relaxed to element.getAttribute(classAttr).startsWith("Mso") - The msoListMode check relaxed to originalMarkup.startsWith("<html xmlns") The second change relates to an email I sent you, the check you use to detect MS Word documents is too specific and doesn't catch everything. I realise these changes are unlikely to be suitable for release, but I wanted to verify if more complete data would work with our list import routines. At first glance the answer appears to be yes. (In reply to Andrew Herron from comment #39) > > - Style attribute preservation relaxed to > element.getAttribute(classAttr).startsWith("Mso") > - The msoListMode check relaxed to originalMarkup.startsWith("<html xmlns") > > The second change relates to an email I sent you, the check you use to > detect MS Word documents is too specific and doesn't catch everything. These changes are probably too generic. That would match more documents than we'd like. It would be great if you can share specific documents from which copying & pasting would fail with the latest two patches. Created attachment 333761 [details]
Challenging document
I found a really old Ephox press release in our test data that is an extreme example of what I'm talking about, it's not handled properly even with the changes I made :)
There are newer documents with similar issues but they all have private data in them. I can try to generate my own or sanitise one of them if this document isn't suitable.
When pasting from Word v16.9:
- the <html> tag doesn't have xmlns:o as the first attribute, failing the overall list check
- none of the list paragraphs (inside the table) have a class at all, completely failing the "preserve style attribute" check
As a result, even with my relaxed checks this produces paragraphs with no "mso-list" style but retaining the "if supportsLists" comments.
Created attachment 333768 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode
(In reply to Andrew Herron from comment #41) > Created attachment 333761 [details] > Challenging document Thanks for the document. This was super helpful! > There are newer documents with similar issues but they all have private data > in them. I can try to generate my own or sanitise one of them if this > document isn't suitable. > > When pasting from Word v16.9: > > - the <html> tag doesn't have xmlns:o as the first attribute, failing the > overall list check Indeed. This seems to only occur with the compatibility mode too. I've adjusted the check for look for xmlns:o and xmlns:w within <html ~> instead. > - none of the list paragraphs (inside the table) have a class at all, > completely failing the "preserve style attribute" check Indeed. I've fixed it to look for mso-list CSS property instead. Sounds good! I'm building with the patch now. I'll try to confirm whether this appears to fix our issues before I finish work today - once this is merged I can get other developers to help verify the changes across our entire test document set. We will still need a new plugin release to be more flexible about the document structure it accepts, but other than that I think the changes on our end will be minor. I pasted a few of our test documents with the patch and it looks good. Comment on attachment 333768 [details] Fixed Evernote & Wordpress.com & MS Word's compat mode View in context: https://bugs.webkit.org/attachment.cgi?id=333768&action=review > Source/WebCore/editing/markup.cpp:433 > + if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(styleAttr).contains(";mso-list:")) Nit - maybe add a new helper method for this called something like `shouldPreserveMSOListStyleForElement`, and turn the above into a one-liner? Created attachment 333810 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode
Comment on attachment 333810 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode
Already been reviewed. Waiting for EWS.
Committed r228482: <https://trac.webkit.org/changeset/228482> I've come across a document that when copied to the clipboard has a much simpler style attribute than others we've been testing with. The entire opening paragraph tag is:
> <p class=Ctrl4Opsomming style='mso-list:l0 level1 lfo1'>
As such it fails the ";mso-list:" check this patch uses and the style attribute is overridden. Could it be made to check if the style starts with "mso-list:" as well? :)
Created attachment 334136 [details] Document with heading lists I've found a new failing document that I can attach publicly. I remember thinking when I looked at the "element tag name is p" check that there would probably be cases of heading tags with mso-list attributes, but I don't think I ever mentioned it here. One of our test documents does indeed have that :) I've made some adjustments to the shouldPreserve check and verified this retains the correct style attributes both with this document and the private one mentioned in my previous comment: > bool StyledMarkupAccumulator::shouldPreserveMSOListStyleForElement(const Element& element) > { > return m_inMSOList || (m_shouldPreserveMSOList && (element.getAttribute(styleAttr).contains(";mso-list:") || element.getAttribute(styleAttr).startsWith("mso-list:"))); > } Although that's again probably too relaxed - it's really only the primary content block tags (P, H1-6) that can be lists. Not _all_ tags. And to be fair TinyMCE doesn't import this document correctly even when given the right styles. But that's a bug I plan to fix so it'd be nice to have heading list styles retained :) Let's track this bug in https://bugs.webkit.org/show_bug.cgi?id=182938. Should I log a new bug for the heading lists as well? The document I attached was for that, not the "starts with" issue. (In reply to Andrew Herron from comment #53) > Should I log a new bug for the heading lists as well? The document I > attached was for that, not the "starts with" issue. Oh, okay. Please do file one. |