Bug 213907 - Text manipulation should ignore white spaces between nodes
Summary: Text manipulation should ignore white spaces between nodes
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: HTML Editing (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Sihui Liu
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2020-07-02 16:22 PDT by Sihui Liu
Modified: 2020-07-08 11:30 PDT (History)
7 users (show)

See Also:


Attachments
Patch (7.50 KB, patch)
2020-07-02 16:54 PDT, Sihui Liu
no flags Details | Formatted Diff | Diff
Patch (9.75 KB, patch)
2020-07-06 12:40 PDT, Sihui Liu
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sihui Liu 2020-07-02 16:22:36 PDT
...
Comment 1 Sihui Liu 2020-07-02 16:54:47 PDT
Created attachment 403417 [details]
Patch
Comment 2 Sihui Liu 2020-07-02 19:50:36 PDT
<rdar://problems/65055131>
Comment 3 Darin Adler 2020-07-03 14:11:37 PDT
Comment on attachment 403417 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=403417&action=review

> Source/WebCore/editing/TextManipulationController.cpp:226
> +                if (iteratorText == "\n") {

This seems close, but not quite right.

There can be newlines even in non-collapsed ranges in text that uses appropriate styling, such as the default style inside <pre>, so it seems not exactly right to only handle newlines that don’t correspond to an actual node.

There seem to be two different issues here: One issue is that we want to handle newlines appropriately, breaking separate lines into separate strings in the vector. That really needs to be done more generally, without assuming newlines are always generated. The second issue is that we want to strip all non-newline whitespace that is generated by the iterator. That should probably be done with an iterator flag telling it not to generate that whitespace, rather than letting it be generated and then filtering it out because the DOM range is collapsed.
Comment 4 Sihui Liu 2020-07-05 22:53:10 PDT
(In reply to Darin Adler from comment #3)
> Comment on attachment 403417 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=403417&action=review
> 
> > Source/WebCore/editing/TextManipulationController.cpp:226
> > +                if (iteratorText == "\n") {
> 
> This seems close, but not quite right.
> 
> There can be newlines even in non-collapsed ranges in text that uses
> appropriate styling, such as the default style inside <pre>, so it seems not
> exactly right to only handle newlines that don’t correspond to an actual
> node.
> 
In the <pre> case, the line breaks is part of text content, so we want to keep them as special tokens and add them back in manipulated node. In the collapsed range case, we don't want to add those spaces when creating new text nodes.

> There seem to be two different issues here: One issue is that we want to
> handle newlines appropriately, breaking separate lines into separate strings
> in the vector. That really needs to be done more generally, without assuming
> newlines are always generated. The second issue is that we want to strip all
> non-newline whitespace that is generated by the iterator. That should
> probably be done with an iterator flag telling it not to generate that
> whitespace, rather than letting it be generated and then filtering it out
> because the DOM range is collapsed.

I am not sure I understand the first issue. Reading the TextIterator code, it looks like line breaks are emitted either there is line break in text content of node, or the style of node requires to generate line breaks. As long as the text content is not changed, we can expect line breaks of the first type to be generated, right?

For the second issue, I think we don't want to strip all non-newline whitespaces... e.g. when space is emitted to replace line break in text content, we want to return "hello world" instead of "helloworld" for "hello \n world". The white spaces we want to filter out are the ones not generated from the text content of the current position node, and those spaces happen to have collapsed range.
Comment 5 Darin Adler 2020-07-06 10:45:37 PDT
Just add test cases. We don’t need to convince each other if we have test coverage.
Comment 6 Sihui Liu 2020-07-06 12:40:44 PDT
Created attachment 403606 [details]
Patch
Comment 7 Sihui Liu 2020-07-06 12:45:43 PDT
(In reply to Darin Adler from comment #5)
> Just add test cases. We don’t need to convince each other if we have test
> coverage.

I only added one new test as the line break case was covered by existing tests. I updated the ChangeLog with those tests. 
Is there any other test case you think should be added?
Comment 8 Darin Adler 2020-07-06 13:34:19 PDT
(In reply to Sihui Liu from comment #4)
> In the <pre> case, the line breaks is part of text content, so we want to
> keep them as special tokens and add them back in manipulated node.

I guess I’m out of my depth. I don’t understand what this means, but I trust it makes sense.

> I am not sure I understand the first issue. Reading the TextIterator code,
> it looks like line breaks are emitted either there is line break in text
> content of node, or the style of node requires to generate line breaks. As
> long as the text content is not changed, we can expect line breaks of the
> first type to be generated, right?

Yes, they simply won’t necessarily be generated as part of a text iterator step with a collapsed range.

> The white spaces we want to filter out are the ones not generated
> from the text content of the current position node, and those spaces happen
> to have collapsed range.

Honestly really confused about this. If text has a newline in it, then space has to be generated because the newline is transformed into a space. If text has a space in it, then the space is part of the text content. I don’t understand how we’d handle these two. But maybe I don’t need to as long as we have enough test coverage.
Comment 9 Sihui Liu 2020-07-06 15:58:48 PDT
(In reply to Darin Adler from comment #8)
> (In reply to Sihui Liu from comment #4)
> > In the <pre> case, the line breaks is part of text content, so we want to
> > keep them as special tokens and add them back in manipulated node.
> 
> I guess I’m out of my depth. I don’t understand what this means, but I trust
> it makes sense.
> 
> > I am not sure I understand the first issue. Reading the TextIterator code,
> > it looks like line breaks are emitted either there is line break in text
> > content of node, or the style of node requires to generate line breaks. As
> > long as the text content is not changed, we can expect line breaks of the
> > first type to be generated, right?
> 
> Yes, they simply won’t necessarily be generated as part of a text iterator
> step with a collapsed range.
> 
> > The white spaces we want to filter out are the ones not generated
> > from the text content of the current position node, and those spaces happen
> > to have collapsed range.
> 
> Honestly really confused about this. If text has a newline in it, then space
> has to be generated because the newline is transformed into a space. If text
> has a space in it, then the space is part of the text content. I don’t
> understand how we’d handle these two. But maybe I don’t need to as long as
> we have enough test coverage.

Both cases you mentioned are space generated from text content, so it's Okay to keep them and view them as normal space. 

Here is more background information: 
We use TextIterator to iterate the whole document and extract text from nodes. We  advance by node by concatenating the text of the same node and parse tokens from the concatenated text.

Client can reply with new tokens to replace the old tokens. To do that, we just create some new text node with new tokens and remove the old text node. Before replacement, we iterate the node with TextIterator again to make sure its content is unchanged after extraction. 

The problem is the sequence of text of one node returned by TextIterator is affected by the range of TextIterator and style of node. 
For example, iterating on "<p>hello<p>webkit" returns {"hello", "\n", "webkit"}. TextIterator's node path is {Text "hello", nullptr, Text "webkit"}. In our previous implementation, TextManipulationController would think the "content" of Text "webkit" is "\nwebkit" (with the concatenation). Then the content check would fail because iterating on Text "webkit" only returns "webkit".

In r262778, we tried to solve this by not concatenating text "\n" with node nullptr, and that didn't work for " " or "\t". This patch just makes TextManipulationController to ignore all those characters with collapsed range, meaning they are not part of the content.
Comment 10 EWS 2020-07-08 11:30:49 PDT
Committed r264120: <https://trac.webkit.org/changeset/264120>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 403606 [details].