Bug 182564 - REGRESSION (r223440): Copying & pasting a list from Microsoft Word to TinyMCE fails
Summary: REGRESSION (r223440): Copying & pasting a list from Microsoft Word to TinyMCE...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Ryosuke Niwa
URL:
Keywords: InRadar
Depends on: 124391
Blocks: 182938 182954
  Show dependency treegraph
 
Reported: 2018-02-06 23:53 PST by Ryosuke Niwa
Modified: 2018-02-19 21:44 PST (History)
7 users (show)

See Also:


Attachments
WIP (9.65 KB, patch)
2018-02-06 23:55 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
WIP2 (19.41 KB, patch)
2018-02-08 00:32 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
WIP3 (19.41 KB, patch)
2018-02-08 00:42 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Fixes the bug (94.83 KB, patch)
2018-02-08 23:38 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Updated for ToT (94.82 KB, patch)
2018-02-08 23:41 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews205 for win-future (11.50 MB, application/zip)
2018-02-09 03:40 PST, Build Bot
no flags Details
Skip the test on Windows (99.31 KB, patch)
2018-02-09 14:43 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Patch for landing (99.33 KB, patch)
2018-02-09 19:45 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Fixes the off-by-one error (9.08 KB, patch)
2018-02-12 11:48 PST, Ryosuke Niwa
wenson_hsieh: review+
Details | Formatted Diff | Diff
Fixed off by one error (9.08 KB, patch)
2018-02-12 12:09 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Fixes Wordpress, evernote, and Gmail (27.43 KB, patch)
2018-02-12 17:32 PST, Ryosuke Niwa
bfulgham: review+
Details | Formatted Diff | Diff
Fixes older versions of TinyMCE (26.97 KB, patch)
2018-02-13 14:44 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Challenging document (40.50 KB, application/msword)
2018-02-13 20:13 PST, Andrew Herron
no flags Details
Fixed Evernote & Wordpress.com & MS Word's compat mode (81.75 KB, patch)
2018-02-13 22:16 PST, Ryosuke Niwa
wenson_hsieh: review+
Details | Formatted Diff | Diff
Fixed Evernote & Wordpress.com & MS Word's compat mode (82.63 KB, patch)
2018-02-14 09:58 PST, Ryosuke Niwa
no flags Details | Formatted Diff | Diff
Document with heading lists (32.31 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2018-02-18 21:36 PST, Andrew Herron
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ryosuke Niwa 2018-02-06 23:53:15 PST
As reported by the developers of TinyMCE, my work to sanitize the pasted content introduced a new regression that copying & pasting a list item from Microsoft Word into TinyMCE would cause the list items to be plain text.
Comment 1 Ryosuke Niwa 2018-02-06 23:55:05 PST
Created attachment 333269 [details]
WIP
Comment 2 Ryosuke Niwa 2018-02-06 23:55:30 PST
<rdar://problem/36035103>
Comment 3 Ryosuke Niwa 2018-02-08 00:32:55 PST
Created attachment 333361 [details]
WIP2
Comment 4 Ryosuke Niwa 2018-02-08 00:33:21 PST
The code change is done. Will write tests next.
Comment 5 Ryosuke Niwa 2018-02-08 00:42:06 PST
Created attachment 333362 [details]
WIP3

Oops, there was a tiny bug there.
Comment 6 Andrew Herron 2018-02-08 22:45:19 PST
My initial impressions are that this gives us most (if not all) of what we need. It would be really nice if we could get a valid HTML document, though, instead of "<html><style></style>content" :)

I'll adjust my code next week to pick up the style tag this way and see if there are any problems.
Comment 7 Ryosuke Niwa 2018-02-08 23:38:00 PST
Created attachment 333448 [details]
Fixes the bug
Comment 8 Ryosuke Niwa 2018-02-08 23:41:29 PST
Created attachment 333449 [details]
Updated for ToT
Comment 9 Ryosuke Niwa 2018-02-08 23:42:02 PST
(In reply to Andrew Herron from comment #6)
> My initial impressions are that this gives us most (if not all) of what we
> need. It would be really nice if we could get a valid HTML document, though,
> instead of "<html><style></style>content" :)
> 
> I'll adjust my code next week to pick up the style tag this way and see if
> there are any problems.

Oh, what kind of markup do you want to see instead? <html><body><style>?
Comment 10 Andrew Herron 2018-02-09 01:51:45 PST
I can technically handle anything, but a proper HTML document would be easiest to work with :)

<html><head><style></style></head><body>content</body></html>
Comment 11 Build Bot 2018-02-09 03:40:25 PST
Comment on attachment 333449 [details]
Updated for ToT

Attachment 333449 [details] did not pass win-ews (win):
Output: http://webkit-queues.webkit.org/results/6429691

New failing tests:
http/tests/security/clipboard/copy-paste-html-across-origin-strips-mso-list.html
Comment 12 Build Bot 2018-02-09 03:40:35 PST
Created attachment 333475 [details]
Archive of layout-test-results from ews205 for win-future

The attached test failures were seen while running run-webkit-tests on the win-ews.
Bot: ews205  Port: win-future  Platform: CYGWIN_NT-6.1-2.9.0-0.318-5-3-x86_64-64bit
Comment 13 Ryosuke Niwa 2018-02-09 12:17:48 PST
(In reply to Andrew Herron from comment #10)
> I can technically handle anything, but a proper HTML document would be
> easiest to work with :)
> 
> <html><head><style></style></head><body>content</body></html>

<html><style>content is a perfectly a valid HTML5 document. Having said that, we can make that change.
Comment 14 Ryosuke Niwa 2018-02-09 12:56:53 PST
(In reply to Ryosuke Niwa from comment #13)
> (In reply to Andrew Herron from comment #10)
> > I can technically handle anything, but a proper HTML document would be
> > easiest to work with :)
> > 
> > <html><head><style></style></head><body>content</body></html>
> 
> <html><style>content is a perfectly a valid HTML5 document. Having said
> that, we can make that change.

Never mind. This is actually tricky because we're using the fragment parsing algorithm which strips away head.
Comment 15 Wenson Hsieh 2018-02-09 13:54:47 PST
Comment on attachment 333449 [details]
Updated for ToT

View in context: https://bugs.webkit.org/attachment.cgi?id=333449&action=review

Just some preliminary comments — I'll continue looking.

> Source/WebCore/ChangeLog:25
> +        that don't enable custom pasteboard data (detected by the content origin being null), and the HTML

that don't => that doesn't

> Source/WebCore/ChangeLog:30
> +              PasteHTML.SripsMSOListWhenMissingMSOHTMLElement

Srips => Strips

> Source/WebCore/ChangeLog:58
> +        set to Enable whenever the content COULD be the one generated by Microsoft Word.

Not very intuitive, but sensible I suppose :/

Maybe MSOListQuirks should be renamed to something more like `MayUseMSOListQuirks { No, Yes }` to make this more obvious?

> Source/WebCore/ChangeLog:65
> +        attributes, one containing computed styles, and another containing mso-list. Luckily, HTML parsering

parsering => parsing

> Source/WebCore/ChangeLog:66
> +        rule dictats that the first attribute always wins when more than attributes of the same name appears,

dictats => dictates

more than attributes of the same name appears => multiple attributes of the same name appear

> Source/WebCore/editing/markup.cpp:596
> +bool StyledMarkupAccumulator::appendNodeToPreserveMSOList(Node& node, bool& inMSOList)

Maybe set `inMSOList = false` at the beginning of this function so it always has a correct value after this call?
Comment 16 Ryosuke Niwa 2018-02-09 14:04:26 PST
(In reply to Wenson Hsieh from comment #15)
> Comment on attachment 333449 [details]
> Updated for ToT
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=333449&action=review
> 
> Just some preliminary comments — I'll continue looking.
> 
> > Source/WebCore/ChangeLog:25
> > +        that don't enable custom pasteboard data (detected by the content origin being null), and the HTML
> 
> that don't => that doesn't

Fixed.

> > Source/WebCore/ChangeLog:30
> > +              PasteHTML.SripsMSOListWhenMissingMSOHTMLElement
> 
> Srips => Strips

Fixed.

> > Source/WebCore/ChangeLog:58
> > +        set to Enable whenever the content COULD be the one generated by Microsoft Word.
> 
> Not very intuitive, but sensible I suppose :/
> 
> Maybe MSOListQuirks should be renamed to something more like
> `MayUseMSOListQuirks { No, Yes }` to make this more obvious?

So the idea here is that MSOListQuirks is like a runtime feature. When the feature is enabled, we'd check for html tag, and do the quirks if needed.

> > Source/WebCore/ChangeLog:65
> > +        attributes, one containing computed styles, and another containing mso-list. Luckily, HTML parsering
> 
> parsering => parsing

Fixed.

> 
> > Source/WebCore/ChangeLog:66
> > +        rule dictats that the first attribute always wins when more than attributes of the same name appears,
> 
> dictats => dictates

Fixed.

> more than attributes of the same name appears => multiple attributes of the
> same name appear

Fixed.

> 
> > Source/WebCore/editing/markup.cpp:596
> > +bool StyledMarkupAccumulator::appendNodeToPreserveMSOList(Node& node, bool& inMSOList)
> 
> Maybe set `inMSOList = false` at the beginning of this function so it always
> has a correct value after this call?

Oh, the whole point of that boolean is it changes over time during the traversal. It gets updated in this function. Made it a member variable instead to make this clear.
Comment 17 Ryosuke Niwa 2018-02-09 14:43:43 PST
Created attachment 333520 [details]
Skip the test on Windows
Comment 18 Wenson Hsieh 2018-02-09 16:05:27 PST
> > Not very intuitive, but sensible I suppose :/
> > 
> > Maybe MSOListQuirks should be renamed to something more like
> > `MayUseMSOListQuirks { No, Yes }` to make this more obvious?
> 
> So the idea here is that MSOListQuirks is like a runtime feature. When the
> feature is enabled, we'd check for html tag, and do the quirks if needed.

Right, I understand that — it's just that the purpose wasn't obvious to me from the name alone (MSOListQuirks) and I was initially confused why there were 2 flags. Though, I suppose the enum values, ::Enable and ::Disable should be enough to make this clear, so it's fine.

> > Maybe set `inMSOList = false` at the beginning of this function so it always
> > has a correct value after this call?
> 
> Oh, the whole point of that boolean is it changes over time during the
> traversal. It gets updated in this function. Made it a member variable
> instead to make this clear.

Oh ok, now I see! Your new approach makes this a lot clearer. 👍🏻
Comment 19 Wenson Hsieh 2018-02-09 17:00:47 PST
Comment on attachment 333520 [details]
Skip the test on Windows

View in context: https://bugs.webkit.org/attachment.cgi?id=333520&action=review

> Source/WebCore/editing/MarkupAccumulator.cpp:204
> +    ASSERT(length <= text.data().length());

Perhaps `start + length <= text.data().length()`?

> Source/WebCore/editing/markup.cpp:487
> +            // Unfortunately, TinyMCE doesn't recognize mso-list inlie style if the style attribute contains properties.

inlie => inline

> Source/WebCore/editing/markup.cpp:600
> +        Comment& commentNode = downcast<Comment>(node);

Nit - auto&?
Comment 20 Ryosuke Niwa 2018-02-09 17:22:33 PST
(In reply to Wenson Hsieh from comment #19)
> Comment on attachment 333520 [details]
> Skip the test on Windows
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=333520&action=review
> 
> > Source/WebCore/editing/MarkupAccumulator.cpp:204
> > +    ASSERT(length <= text.data().length());
> 
> Perhaps `start + length <= text.data().length()`?

Oh oops, nice catch. Fixed.

> > Source/WebCore/editing/markup.cpp:487
> > +            // Unfortunately, TinyMCE doesn't recognize mso-list inlie style if the style attribute contains properties.
> 
> inlie => inline

Fixed.

> > Source/WebCore/editing/markup.cpp:600
> > +        Comment& commentNode = downcast<Comment>(node);
> 
> Nit - auto&?

Fixed.
Comment 21 Ryosuke Niwa 2018-02-09 19:17:31 PST
(In reply to Wenson Hsieh from comment #18)
> > > Not very intuitive, but sensible I suppose :/
> > > 
> > > Maybe MSOListQuirks should be renamed to something more like
> > > `MayUseMSOListQuirks { No, Yes }` to make this more obvious?
> > 
> > So the idea here is that MSOListQuirks is like a runtime feature. When the
> > feature is enabled, we'd check for html tag, and do the quirks if needed.
> 
> Right, I understand that — it's just that the purpose wasn't obvious to me
> from the name alone (MSOListQuirks) and I was initially confused why there
> were 2 flags. Though, I suppose the enum values, ::Enable and ::Disable
> should be enough to make this clear, so it's fine.

Okay, I'd rename MSOListQuirks::Enabled to MSOListQuirks::CheckIfNeeded as we discussed in person.
Comment 22 Ryosuke Niwa 2018-02-09 19:45:39 PST
Created attachment 333550 [details]
Patch for landing
Comment 23 Ryosuke Niwa 2018-02-09 19:56:40 PST
Comment on attachment 333550 [details]
Patch for landing

Wait for EWS first.
Comment 24 Ryosuke Niwa 2018-02-09 21:07:58 PST
Committed r228352: <https://trac.webkit.org/changeset/228352>
Comment 25 Andrew Herron 2018-02-11 20:48:11 PST
(In reply to Ryosuke Niwa from comment #14)
> (In reply to Ryosuke Niwa from comment #13)
> > 
> > <html><style>content is a perfectly a valid HTML5 document. Having said
> > that, we can make that change.
> 
> Never mind. This is actually tricky because we're using the fragment parsing
> algorithm which strips away head.

I realise it's valid, however previously we could rely on the raw clipboard being a "proper" document. But as I said it's not a huge issue it just means some adjustment to how I handle the structure before I can validate if this includes everything we need.
Comment 26 Ryosuke Niwa 2018-02-12 11:48:22 PST
Reopening to attach new patch.
Comment 27 Ryosuke Niwa 2018-02-12 11:48:23 PST
Created attachment 333610 [details]
Fixes the off-by-one error
Comment 28 Build Bot 2018-02-12 11:49:32 PST
Attachment 333610 [details] did not pass style-queue:


ERROR: Tools/ChangeLog:8:  Line contains tab character.  [whitespace/tab] [5]
ERROR: Tools/ChangeLog:9:  Line contains tab character.  [whitespace/tab] [5]
Total errors found: 2 in 5 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 29 Wenson Hsieh 2018-02-12 12:02:17 PST
Comment on attachment 333610 [details]
Fixes the off-by-one error

View in context: https://bugs.webkit.org/attachment.cgi?id=333610&action=review

r=me

> Tools/ChangeLog:8
> +	Added assertions to make sure the trailing } is includd in the style.

Nit - it looks like there's a tab character here.
Comment 30 Ryosuke Niwa 2018-02-12 12:07:51 PST
(In reply to Wenson Hsieh from comment #29)
> Comment on attachment 333610 [details]
> Fixes the off-by-one error
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=333610&action=review
> 
> r=me
> 
> > Tools/ChangeLog:8
> > +	Added assertions to make sure the trailing } is includd in the style.
> 
> Nit - it looks like there's a tab character here.

Oops, yeah, will fix before landing.
Comment 31 Ryosuke Niwa 2018-02-12 12:09:47 PST
Created attachment 333614 [details]
Fixed off by one error
Comment 32 Ryosuke Niwa 2018-02-12 14:07:45 PST
Comment on attachment 333614 [details]
Fixed off by one error

This doesn't quite cut it. The original patch broke Evernote & Wordpress.com :(
Comment 33 Andrew Herron 2018-02-12 16:37:46 PST
(In reply to Wenson Hsieh from comment #29)
> Comment on attachment 333610 [details]
> Fixes the off-by-one error
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=333610&action=review
> 
> r=me
> 
> > Tools/ChangeLog:8
> > +	Added assertions to make sure the trailing } is includd in the style.
> 
> Nit - it looks like there's a tab character here.

You've also missed the e in "included" :)
Comment 34 Ryosuke Niwa 2018-02-12 17:32:01 PST
Created attachment 333653 [details]
Fixes Wordpress, evernote, and Gmail
Comment 35 Andrew Herron 2018-02-12 23:32:59 PST
I like where this patch is going with extra style definitions included. Those are the main thing I was concerned we might need to fully process lists :)

I wonder whether the variations in behaviour across TinyMCE versions you mention in your changelog are due to plugins; only our PowerPaste plugin uses detailed list processing (it's enabled on tinymce.com), the default paste implementation has very basic word import.

Treating the style tag as text may be a bug that I can address rather than needing a webkit workaround (although putting it inside a head tag definitely helps). It's most likely because the content is not inside a body tag - in the current release if a body tag isn't found it assumes the entire clipboard is content. This is related to why I asked for the structure change earlier. I'm going to need to release a plugin update to support this change anyway so if the changes here break paste on the current release that's not the end of the world.

I have some feedback on the current patch:

> if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(classAttr).startsWith("MsoListParagraph"))

This won't catch all list items in the content. Just from a quick search I'm seeing paragraphs with class "MsoHeading7" that have mso-list attributes, as well as "MsoNormal". And I'm sure there are others. Having a whitelist for this seems like the sort of thing that could easily break between versions of word.

In addition, this "MsoListParagraph" filter isn't applied to the comment nodes in appendNodeToPreserveMSOList so the results are inconsistent. I'm seeing paragraphs without style mso-list but with the "supportsLists" comments inside.

I made a custom patch on my machine that relaxes the override checks a bit and I'm getting _really_ good results.
Comment 36 Ryosuke Niwa 2018-02-13 13:03:40 PST
(In reply to Andrew Herron from comment #35)
>
> > if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(classAttr).startsWith("MsoListParagraph"))
> 
> This won't catch all list items in the content. Just from a quick search I'm
> seeing paragraphs with class "MsoHeading7" that have mso-list attributes, as
> well as "MsoNormal". And I'm sure there are others. Having a whitelist for
> this seems like the sort of thing that could easily break between versions
> of word.

That doesn't seem to affect the end result.

> In addition, this "MsoListParagraph" filter isn't applied to the comment
> nodes in appendNodeToPreserveMSOList so the results are inconsistent. I'm
> seeing paragraphs without style mso-list but with the "supportsLists"
> comments inside.

Those nodes are taken care of by m_inMSOList check.

> I made a custom patch on my machine that relaxes the override checks a bit
> and I'm getting _really_ good results.

What are the changes you made?
Comment 37 Brent Fulgham 2018-02-13 13:43:48 PST
Comment on attachment 333653 [details]
Fixes Wordpress, evernote, and Gmail

View in context: https://bugs.webkit.org/attachment.cgi?id=333653&action=review

r=me

> Tools/ChangeLog:33
> +

Did you mean to have these two separate ChangeLog entries?
Comment 38 Ryosuke Niwa 2018-02-13 14:44:05 PST
Created attachment 333727 [details]
Fixes older versions of TinyMCE
Comment 39 Andrew Herron 2018-02-13 16:08:40 PST
(In reply to Ryosuke Niwa from comment #36)
> (In reply to Andrew Herron from comment #35)
> >
> > > if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(classAttr).startsWith("MsoListParagraph"))
> > 
> > This won't catch all list items in the content. Just from a quick search I'm
> > seeing paragraphs with class "MsoHeading7" that have mso-list attributes, as
> > well as "MsoNormal". And I'm sure there are others. Having a whitelist for
> > this seems like the sort of thing that could easily break between versions
> > of word.
> 
> That doesn't seem to affect the end result.

This issue means mso-list styles are stripped from actual list items, and every time that happens we can't process the item. If the documents you're testing with all appear fine I will try to find some more complex documents that I can share - my test data mostly comes from customers so I need to be careful.

 
> > In addition, this "MsoListParagraph" filter isn't applied to the comment
> > nodes in appendNodeToPreserveMSOList so the results are inconsistent. I'm
> > seeing paragraphs without style mso-list but with the "supportsLists"
> > comments inside.
> 
> Those nodes are taken care of by m_inMSOList check.

But that isn't used for the style attribute output. I think the two should be linked somehow.

As I said, in some documents the HTML I get on paste has paragraphs with the supportsLists comments and related data but no mso-list style on the P tag. That's the problem. If you're ever going to strip mso-list from a paragraph (which per above is a bad idea imo) it seems like the comments should be removed as well.

My code deals with these comments, but only when processing a list paragraph which is detected by the mso-list style. Without mso-list, these comments make it into customer content (as a part of fake list-like paragraphs that are just confusing).

 
> > I made a custom patch on my machine that relaxes the override checks a bit
> > and I'm getting _really_ good results.
> 
> What are the changes you made?

I targeted two checks:

- Style attribute preservation relaxed to element.getAttribute(classAttr).startsWith("Mso")
- The msoListMode check relaxed to originalMarkup.startsWith("<html xmlns")

The second change relates to an email I sent you, the check you use to detect MS Word documents is too specific and doesn't catch everything.

I realise these changes are unlikely to be suitable for release, but I wanted to verify if more complete data would work with our list import routines. At first glance the answer appears to be yes.
Comment 40 Ryosuke Niwa 2018-02-13 16:38:30 PST
(In reply to Andrew Herron from comment #39)
> 
> - Style attribute preservation relaxed to
> element.getAttribute(classAttr).startsWith("Mso")
> - The msoListMode check relaxed to originalMarkup.startsWith("<html xmlns")
> 
> The second change relates to an email I sent you, the check you use to
> detect MS Word documents is too specific and doesn't catch everything.

These changes are probably too generic. That would match more documents than we'd like. It would be great if you can share specific documents from which copying & pasting would fail with the latest two patches.
Comment 41 Andrew Herron 2018-02-13 20:13:10 PST
Created attachment 333761 [details]
Challenging document

I found a really old Ephox press release in our test data that is an extreme example of what I'm talking about, it's not handled properly even with the changes I made :)

There are newer documents with similar issues but they all have private data in them. I can try to generate my own or sanitise one of them if this document isn't suitable.

When pasting from Word v16.9:

- the <html> tag doesn't have xmlns:o as the first attribute, failing the overall list check
- none of the list paragraphs (inside the table) have a class at all, completely failing the "preserve style attribute" check

As a result, even with my relaxed checks this produces paragraphs with no "mso-list" style but retaining the "if supportsLists" comments.
Comment 42 Ryosuke Niwa 2018-02-13 22:16:26 PST
Created attachment 333768 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode
Comment 43 Ryosuke Niwa 2018-02-13 22:17:54 PST
(In reply to Andrew Herron from comment #41)
> Created attachment 333761 [details]
> Challenging document

Thanks for the document. This was super helpful!

> There are newer documents with similar issues but they all have private data
> in them. I can try to generate my own or sanitise one of them if this
> document isn't suitable.
> 
> When pasting from Word v16.9:
> 
> - the <html> tag doesn't have xmlns:o as the first attribute, failing the
> overall list check

Indeed. This seems to only occur with the compatibility mode too. I've adjusted the check for look for xmlns:o and xmlns:w within <html ~> instead.

> - none of the list paragraphs (inside the table) have a class at all,
> completely failing the "preserve style attribute" check

Indeed. I've fixed it to look for mso-list CSS property instead.
Comment 44 Andrew Herron 2018-02-13 22:32:43 PST
Sounds good! I'm building with the patch now.

I'll try to confirm whether this appears to fix our issues before I finish work today - once this is merged I can get other developers to help verify the changes across our entire test document set.

We will still need a new plugin release to be more flexible about the document structure it accepts, but other than that I think the changes on our end will be minor.
Comment 45 Andrew Herron 2018-02-13 23:21:32 PST
I pasted a few of our test documents with the patch and it looks good.
Comment 46 Wenson Hsieh 2018-02-14 07:42:16 PST
Comment on attachment 333768 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode

View in context: https://bugs.webkit.org/attachment.cgi?id=333768&action=review

> Source/WebCore/editing/markup.cpp:433
> +    if (m_shouldPreserveMSOList && element.hasTagName(pTag) && element.getAttribute(styleAttr).contains(";mso-list:"))

Nit - maybe add a new helper method for this called something like `shouldPreserveMSOListStyleForElement`, and turn the above into a one-liner?
Comment 47 Ryosuke Niwa 2018-02-14 09:58:35 PST
Created attachment 333810 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode
Comment 48 Ryosuke Niwa 2018-02-14 09:59:08 PST
Comment on attachment 333810 [details]
Fixed Evernote & Wordpress.com & MS Word's compat mode

Already been reviewed. Waiting for EWS.
Comment 49 Ryosuke Niwa 2018-02-14 13:07:44 PST
Committed r228482: <https://trac.webkit.org/changeset/228482>
Comment 50 Andrew Herron 2018-02-15 21:47:32 PST
I've come across a document that when copied to the clipboard has a much simpler style attribute than others we've been testing with. The entire opening paragraph tag is:

> <p class=Ctrl4Opsomming style='mso-list:l0 level1 lfo1'>

As such it fails the ";mso-list:" check this patch uses and the style attribute is overridden. Could it be made to check if the style starts with "mso-list:" as well? :)
Comment 51 Andrew Herron 2018-02-18 21:36:52 PST
Created attachment 334136 [details]
Document with heading lists

I've found a new failing document that I can attach publicly. I remember thinking when I looked at the "element tag name is p" check that there would probably be cases of heading tags with mso-list attributes, but I don't think I ever mentioned it here. One of our test documents does indeed have that :)

I've made some adjustments to the shouldPreserve check and verified this retains the correct style attributes both with this document and the private one mentioned in my previous comment:

> bool StyledMarkupAccumulator::shouldPreserveMSOListStyleForElement(const Element& element)
> {
>     return m_inMSOList || (m_shouldPreserveMSOList && (element.getAttribute(styleAttr).contains(";mso-list:") || element.getAttribute(styleAttr).startsWith("mso-list:")));
> }

Although that's again probably too relaxed - it's really only the primary content block tags (P, H1-6) that can be lists. Not _all_ tags.

And to be fair TinyMCE doesn't import this document correctly even when given the right styles. But that's a bug I plan to fix so it'd be nice to have heading list styles retained :)
Comment 52 Ryosuke Niwa 2018-02-19 13:23:53 PST
Let's track this bug in https://bugs.webkit.org/show_bug.cgi?id=182938.
Comment 53 Andrew Herron 2018-02-19 15:53:58 PST
Should I log a new bug for the heading lists as well? The document I attached was for that, not the "starts with" issue.
Comment 54 Ryosuke Niwa 2018-02-19 17:17:29 PST
(In reply to Andrew Herron from comment #53)
> Should I log a new bug for the heading lists as well? The document I
> attached was for that, not the "starts with" issue.

Oh, okay. Please do file one.