211476 – TextManipulationController should merge pieces from the same token

Bug 211476 - TextManipulationController should merge pieces from the same token

Summary: TextManipulationController should merge pieces from the same token

Status:	NEW

Alias:	None

Product:	WebKit
Classification:	Unclassified
Component:	HTML Editing (show other bugs)
Version:	WebKit Nightly Build
Hardware:	Unspecified Unspecified

Importance:	P2 Normal
Assignee:	Sihui Liu

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-05-05 16:15 PDT by Sihui Liu
Modified:	2020-09-04 10:27 PDT (History)
CC List:	6 users (show)

See Also:

Attachments
Patch (12.76 KB, patch) 2020-05-05 16:29 PDT, Sihui Liu	no flags	Details \| Formatted Diff \| Diff
Patch (12.71 KB, patch) 2020-05-05 17:55 PDT, Sihui Liu	no flags	Details \| Formatted Diff \| Diff
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sihui Liu 2020-05-05 16:15:39 PDT

There are cases where one token is split into multiple tokens with the same token identifier when completing text manipulation.
In https://bugs.webkit.org/show_bug.cgi?id=210750, Wenson fixed that for title and option element, by merging adjacent tokens with same identifier into one token.
We are seeing other elements having the same issue, so extend that to all elements.

Comment 1 Sihui Liu 2020-05-05 16:29:40 PDT

Created attachment 398565 [details]
Patch

Comment 2 Sihui Liu 2020-05-05 17:55:19 PDT

Created attachment 398574 [details]
Patch

Comment 3 Ryosuke Niwa 2020-05-06 15:27:00 PDT

Comment on attachment 398574 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review

> Source/WebCore/editing/TextManipulationController.cpp:564
> +    if (replacementTokens.isEmpty())
> +        return ManipulationFailureType::InvalidToken;

This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty.

> Source/WebCore/editing/TextManipulationController.cpp:572
> +            if (!token.content.isEmpty() && !currentToken.content.isEmpty())
> +                currentToken.content.append(' ');

This doesn't seem right. Why should we always assume to insert a space between two tokens?
That may be correct in Latin languages but definitely not in Chinese or Japanese.

Comment 4 Sihui Liu 2020-05-06 15:57:30 PDT

Comment on attachment 398574 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review

>> Source/WebCore/editing/TextManipulationController.cpp:564
>> +        return ManipulationFailureType::InvalidToken;
> 
> This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty.

We use InvalidToken when replacementTokens.size() > 1  so I think the InvalidToken means invalid replacement tokens too. Or should I use ManipulationFailureType::InvalidItem?

>> Source/WebCore/editing/TextManipulationController.cpp:572
>> +                currentToken.content.append(' ');
> 
> This doesn't seem right. Why should we always assume to insert a space between two tokens?
> That may be correct in Latin languages but definitely not in Chinese or Japanese.

I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it.

Comment 5 Ryosuke Niwa 2020-05-06 16:32:25 PDT

Comment on attachment 398574 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review

>>> Source/WebCore/editing/TextManipulationController.cpp:564
>>> +        return ManipulationFailureType::InvalidToken;
>> 
>> This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty.
> 
> We use InvalidToken when replacementTokens.size() > 1  so I think the InvalidToken means invalid replacement tokens too. Or should I use ManipulationFailureType::InvalidItem?

Well, we use that because in the case of a single element replacement,
we can't possibly have multiple tokens of different identifiers.

>>> Source/WebCore/editing/TextManipulationController.cpp:572
>>> +                currentToken.content.append(' ');
>> 
>> This doesn't seem right. Why should we always assume to insert a space between two tokens?
>> That may be correct in Latin languages but definitely not in Chinese or Japanese.
> 
> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it.

I don't know what the story of that change log is (perhaps you should talk to Wenson)
but I don't think always inserting a whitespace is right even if it worked for some languages.
This is something whatever application / framework using WebKit should take care of.

Comment 6 Wenson Hsieh 2020-05-06 16:44:38 PDT

Comment on attachment 398574 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review

>>>> Source/WebCore/editing/TextManipulationController.cpp:572
>>>> +                currentToken.content.append(' ');
>>> 
>>> This doesn't seem right. Why should we always assume to insert a space between two tokens?
>>> That may be correct in Latin languages but definitely not in Chinese or Japanese.
>> 
>> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it.
> 
> I don't know what the story of that change log is (perhaps you should talk to Wenson)
> but I don't think always inserting a whitespace is right even if it worked for some languages.
> This is something whatever application / framework using WebKit should take care of.

This was admittedly a hack to deal with how the relevant system framework /sometimes/ splits results into multiple tokens around spaces, and then removes the space from each of the tokens that it split into. For instance, a single token “foo bar” might be returned as two tokens “foo” and “bar”.

I hadn’t tested with CJK, and if the relevant system framework splits a single word (“foo") into multiple tokens (“fo” “o”), then the logic I added to try and stitch the tokens back together is indeed wrong :/

Comment 7 Ryosuke Niwa 2020-05-06 16:55:46 PDT

(In reply to Wenson Hsieh from comment #6)
> Comment on attachment 398574 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=398574&action=review
> 
> >>>> Source/WebCore/editing/TextManipulationController.cpp:572
> >>>> +                currentToken.content.append(' ');
> >>> 
> >>> This doesn't seem right. Why should we always assume to insert a space between two tokens?
> >>> That may be correct in Latin languages but definitely not in Chinese or Japanese.
> >> 
> >> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it.
> > 
> > I don't know what the story of that change log is (perhaps you should talk to Wenson)
> > but I don't think always inserting a whitespace is right even if it worked for some languages.
> > This is something whatever application / framework using WebKit should take care of.
> 
> This was admittedly a hack to deal with how the relevant system framework
> /sometimes/ splits results into multiple tokens around spaces, and then
> removes the space from each of the tokens that it split into. For instance,
> a single token “foo bar” might be returned as two tokens “foo” and “bar”.
> 
> I hadn’t tested with CJK, and if the relevant system framework splits a
> single word (“foo") into multiple tokens (“fo” “o”), then the logic I added
> to try and stitch the tokens back together is indeed wrong :/

I mean... we really don't want WebKit to be doing these kinds of text manipulations. We don't want to be in the business of detecting what kind of language the text is in in insert or not insert a whitespace based on that...

Comment 8 Sihui Liu 2020-05-07 09:08:10 PDT

(In reply to Ryosuke Niwa from comment #7)
> (In reply to Wenson Hsieh from comment #6)
> > Comment on attachment 398574 [details]
> > Patch
> > 
> > View in context:
> > https://bugs.webkit.org/attachment.cgi?id=398574&action=review
> > 
> > >>>> Source/WebCore/editing/TextManipulationController.cpp:572
> > >>>> +                currentToken.content.append(' ');
> > >>> 
> > >>> This doesn't seem right. Why should we always assume to insert a space between two tokens?
> > >>> That may be correct in Latin languages but definitely not in Chinese or Japanese.
> > >> 
> > >> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it.
> > > 
> > > I don't know what the story of that change log is (perhaps you should talk to Wenson)
> > > but I don't think always inserting a whitespace is right even if it worked for some languages.
> > > This is something whatever application / framework using WebKit should take care of.
> > 
> > This was admittedly a hack to deal with how the relevant system framework
> > /sometimes/ splits results into multiple tokens around spaces, and then
> > removes the space from each of the tokens that it split into. For instance,
> > a single token “foo bar” might be returned as two tokens “foo” and “bar”.
> > 
> > I hadn’t tested with CJK, and if the relevant system framework splits a
> > single word (“foo") into multiple tokens (“fo” “o”), then the logic I added
> > to try and stitch the tokens back together is indeed wrong :/
> 
> I mean... we really don't want WebKit to be doing these kinds of text
> manipulations. We don't want to be in the business of detecting what kind of
> language the text is in in insert or not insert a whitespace based on that...

I agree with this. I am confused by this behavior too.
But currently we don't have clear rules for replacement token. Once we decide that (e.g. replacement token must have a 1-to-n mapping to tokens in original item.), we can communicate with API clients and ask them to change. Before that, we need some hacks to make things work.

Comment 9 Ryosuke Niwa 2020-05-07 10:24:19 PDT

(In reply to Sihui Liu from comment #8) 
> > I mean... we really don't want WebKit to be doing these kinds of text
> > manipulations. We don't want to be in the business of detecting what kind of
> > language the text is in in insert or not insert a whitespace based on that...
> 
> I agree with this. I am confused by this behavior too.
> But currently we don't have clear rules for replacement token. Once we
> decide that (e.g. replacement token must have a 1-to-n mapping to tokens in
> original item.), we can communicate with API clients and ask them to change.
> Before that, we need some hacks to make things work.

That’s fine but whatever clients relying on this behavior should fix their code before this stuff ships. We don’t want to be maintaining half broken code for years to come.

Comment 10 Wenson Hsieh 2020-05-07 10:28:39 PDT

(In reply to Ryosuke Niwa from comment #9)
> (In reply to Sihui Liu from comment #8) 
> > > I mean... we really don't want WebKit to be doing these kinds of text
> > > manipulations. We don't want to be in the business of detecting what kind of
> > > language the text is in in insert or not insert a whitespace based on that...
> > 
> > I agree with this. I am confused by this behavior too.
> > But currently we don't have clear rules for replacement token. Once we
> > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in
> > original item.), we can communicate with API clients and ask them to change.
> > Before that, we need some hacks to make things work.
> 
> That’s fine but whatever clients relying on this behavior should fix their
> code before this stuff ships. We don’t want to be maintaining half broken
> code for years to come.

Agreed. Maybe we should add a FIXME here explaining that this “stitching” code is a temporary workaround until the relevant client fixes their behavior (for instance, by sending back at most 1 result token, if it is given an item with a single token).

Comment 11 Ryosuke Niwa 2020-05-07 13:39:33 PDT

(In reply to Wenson Hsieh from comment #10)
> (In reply to Ryosuke Niwa from comment #9)
> > (In reply to Sihui Liu from comment #8) 
> > > > I mean... we really don't want WebKit to be doing these kinds of text
> > > > manipulations. We don't want to be in the business of detecting what kind of
> > > > language the text is in in insert or not insert a whitespace based on that...
> > > 
> > > I agree with this. I am confused by this behavior too.
> > > But currently we don't have clear rules for replacement token. Once we
> > > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in
> > > original item.), we can communicate with API clients and ask them to change.
> > > Before that, we need some hacks to make things work.
> > 
> > That’s fine but whatever clients relying on this behavior should fix their
> > code before this stuff ships. We don’t want to be maintaining half broken
> > code for years to come.
> 
> Agreed. Maybe we should add a FIXME here explaining that this “stitching”
> code is a temporary workaround until the relevant client fixes their
> behavior (for instance, by sending back at most 1 result token, if it is
> given an item with a single token).

I don’t see why clients can’t give multiple consecutive tokens with the same identifier. That’s okay. What’s not okay is for those clients to rely on WebKit to insert whitespaces. That’s simply not how this API was ever designed to work. White space and other text manipulations have to happen in the clients, not WebKit.

Comment 12 Ryosuke Niwa 2020-05-07 13:43:27 PDT

By the way, there is a related case where we’d currently strip away whitespaces. Namely, whitespaces between elements’s open/start tags are supposed to be rendered. With the last algorithm, unfortunately, some of those whitespaces can be stripped away. If you’re investigating bugs in this area, that’s something else to pay attention to.

P.S. I’m not trying to review this patch so if anyone can review it, please go ahead. I was just leaving a drive-by comment.

Comment 13 Ryosuke Niwa 2020-05-07 13:45:55 PDT

(In reply to Ryosuke Niwa from comment #12)
> By the way, there is a related case where we’d currently strip away
> whitespaces. Namely, whitespaces between elements’s open/start tags

And between two end/close tags. See CSS2.1’s appendix where they talk about whitespaces collapsing. Or you can ask Antti / Zalan / Simon about that too :)