There are cases where one token is split into multiple tokens with the same token identifier when completing text manipulation. In https://bugs.webkit.org/show_bug.cgi?id=210750, Wenson fixed that for title and option element, by merging adjacent tokens with same identifier into one token. We are seeing other elements having the same issue, so extend that to all elements.
Created attachment 398565 [details] Patch
Created attachment 398574 [details] Patch
Comment on attachment 398574 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review > Source/WebCore/editing/TextManipulationController.cpp:564 > + if (replacementTokens.isEmpty()) > + return ManipulationFailureType::InvalidToken; This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty. > Source/WebCore/editing/TextManipulationController.cpp:572 > + if (!token.content.isEmpty() && !currentToken.content.isEmpty()) > + currentToken.content.append(' '); This doesn't seem right. Why should we always assume to insert a space between two tokens? That may be correct in Latin languages but definitely not in Chinese or Japanese.
Comment on attachment 398574 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review >> Source/WebCore/editing/TextManipulationController.cpp:564 >> + return ManipulationFailureType::InvalidToken; > > This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty. We use InvalidToken when replacementTokens.size() > 1 so I think the InvalidToken means invalid replacement tokens too. Or should I use ManipulationFailureType::InvalidItem? >> Source/WebCore/editing/TextManipulationController.cpp:572 >> + currentToken.content.append(' '); > > This doesn't seem right. Why should we always assume to insert a space between two tokens? > That may be correct in Latin languages but definitely not in Chinese or Japanese. I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it.
Comment on attachment 398574 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review >>> Source/WebCore/editing/TextManipulationController.cpp:564 >>> + return ManipulationFailureType::InvalidToken; >> >> This seems like untested code. Also, it's incorrect to say we had an invalid token just because the replacement is empty. > > We use InvalidToken when replacementTokens.size() > 1 so I think the InvalidToken means invalid replacement tokens too. Or should I use ManipulationFailureType::InvalidItem? Well, we use that because in the case of a single element replacement, we can't possibly have multiple tokens of different identifiers. >>> Source/WebCore/editing/TextManipulationController.cpp:572 >>> + currentToken.content.append(' '); >> >> This doesn't seem right. Why should we always assume to insert a space between two tokens? >> That may be correct in Latin languages but definitely not in Chinese or Japanese. > > I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it. I don't know what the story of that change log is (perhaps you should talk to Wenson) but I don't think always inserting a whitespace is right even if it worked for some languages. This is something whatever application / framework using WebKit should take care of.
Comment on attachment 398574 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=398574&action=review >>>> Source/WebCore/editing/TextManipulationController.cpp:572 >>>> + currentToken.content.append(' '); >>> >>> This doesn't seem right. Why should we always assume to insert a space between two tokens? >>> That may be correct in Latin languages but definitely not in Chinese or Japanese. >> >> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it. > > I don't know what the story of that change log is (perhaps you should talk to Wenson) > but I don't think always inserting a whitespace is right even if it worked for some languages. > This is something whatever application / framework using WebKit should take care of. This was admittedly a hack to deal with how the relevant system framework /sometimes/ splits results into multiple tokens around spaces, and then removes the space from each of the tokens that it split into. For instance, a single token “foo bar” might be returned as two tokens “foo” and “bar”. I hadn’t tested with CJK, and if the relevant system framework splits a single word (“foo") into multiple tokens (“fo” “o”), then the logic I added to try and stitch the tokens back together is indeed wrong :/
(In reply to Wenson Hsieh from comment #6) > Comment on attachment 398574 [details] > Patch > > View in context: > https://bugs.webkit.org/attachment.cgi?id=398574&action=review > > >>>> Source/WebCore/editing/TextManipulationController.cpp:572 > >>>> + currentToken.content.append(' '); > >>> > >>> This doesn't seem right. Why should we always assume to insert a space between two tokens? > >>> That may be correct in Latin languages but definitely not in Chinese or Japanese. > >> > >> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it. > > > > I don't know what the story of that change log is (perhaps you should talk to Wenson) > > but I don't think always inserting a whitespace is right even if it worked for some languages. > > This is something whatever application / framework using WebKit should take care of. > > This was admittedly a hack to deal with how the relevant system framework > /sometimes/ splits results into multiple tokens around spaces, and then > removes the space from each of the tokens that it split into. For instance, > a single token “foo bar” might be returned as two tokens “foo” and “bar”. > > I hadn’t tested with CJK, and if the relevant system framework splits a > single word (“foo") into multiple tokens (“fo” “o”), then the logic I added > to try and stitch the tokens back together is indeed wrong :/ I mean... we really don't want WebKit to be doing these kinds of text manipulations. We don't want to be in the business of detecting what kind of language the text is in in insert or not insert a whitespace based on that...
(In reply to Ryosuke Niwa from comment #7) > (In reply to Wenson Hsieh from comment #6) > > Comment on attachment 398574 [details] > > Patch > > > > View in context: > > https://bugs.webkit.org/attachment.cgi?id=398574&action=review > > > > >>>> Source/WebCore/editing/TextManipulationController.cpp:572 > > >>>> + currentToken.content.append(' '); > > >>> > > >>> This doesn't seem right. Why should we always assume to insert a space between two tokens? > > >>> That may be correct in Latin languages but definitely not in Chinese or Japanese. > > >> > > >> I am following the pattern of http://trac.webkit.org/changeset/260393/webkit. I can remove it. > > > > > > I don't know what the story of that change log is (perhaps you should talk to Wenson) > > > but I don't think always inserting a whitespace is right even if it worked for some languages. > > > This is something whatever application / framework using WebKit should take care of. > > > > This was admittedly a hack to deal with how the relevant system framework > > /sometimes/ splits results into multiple tokens around spaces, and then > > removes the space from each of the tokens that it split into. For instance, > > a single token “foo bar” might be returned as two tokens “foo” and “bar”. > > > > I hadn’t tested with CJK, and if the relevant system framework splits a > > single word (“foo") into multiple tokens (“fo” “o”), then the logic I added > > to try and stitch the tokens back together is indeed wrong :/ > > I mean... we really don't want WebKit to be doing these kinds of text > manipulations. We don't want to be in the business of detecting what kind of > language the text is in in insert or not insert a whitespace based on that... I agree with this. I am confused by this behavior too. But currently we don't have clear rules for replacement token. Once we decide that (e.g. replacement token must have a 1-to-n mapping to tokens in original item.), we can communicate with API clients and ask them to change. Before that, we need some hacks to make things work.
(In reply to Sihui Liu from comment #8) > > I mean... we really don't want WebKit to be doing these kinds of text > > manipulations. We don't want to be in the business of detecting what kind of > > language the text is in in insert or not insert a whitespace based on that... > > I agree with this. I am confused by this behavior too. > But currently we don't have clear rules for replacement token. Once we > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in > original item.), we can communicate with API clients and ask them to change. > Before that, we need some hacks to make things work. That’s fine but whatever clients relying on this behavior should fix their code before this stuff ships. We don’t want to be maintaining half broken code for years to come.
(In reply to Ryosuke Niwa from comment #9) > (In reply to Sihui Liu from comment #8) > > > I mean... we really don't want WebKit to be doing these kinds of text > > > manipulations. We don't want to be in the business of detecting what kind of > > > language the text is in in insert or not insert a whitespace based on that... > > > > I agree with this. I am confused by this behavior too. > > But currently we don't have clear rules for replacement token. Once we > > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in > > original item.), we can communicate with API clients and ask them to change. > > Before that, we need some hacks to make things work. > > That’s fine but whatever clients relying on this behavior should fix their > code before this stuff ships. We don’t want to be maintaining half broken > code for years to come. Agreed. Maybe we should add a FIXME here explaining that this “stitching” code is a temporary workaround until the relevant client fixes their behavior (for instance, by sending back at most 1 result token, if it is given an item with a single token).
(In reply to Wenson Hsieh from comment #10) > (In reply to Ryosuke Niwa from comment #9) > > (In reply to Sihui Liu from comment #8) > > > > I mean... we really don't want WebKit to be doing these kinds of text > > > > manipulations. We don't want to be in the business of detecting what kind of > > > > language the text is in in insert or not insert a whitespace based on that... > > > > > > I agree with this. I am confused by this behavior too. > > > But currently we don't have clear rules for replacement token. Once we > > > decide that (e.g. replacement token must have a 1-to-n mapping to tokens in > > > original item.), we can communicate with API clients and ask them to change. > > > Before that, we need some hacks to make things work. > > > > That’s fine but whatever clients relying on this behavior should fix their > > code before this stuff ships. We don’t want to be maintaining half broken > > code for years to come. > > Agreed. Maybe we should add a FIXME here explaining that this “stitching” > code is a temporary workaround until the relevant client fixes their > behavior (for instance, by sending back at most 1 result token, if it is > given an item with a single token). I don’t see why clients can’t give multiple consecutive tokens with the same identifier. That’s okay. What’s not okay is for those clients to rely on WebKit to insert whitespaces. That’s simply not how this API was ever designed to work. White space and other text manipulations have to happen in the clients, not WebKit.
By the way, there is a related case where we’d currently strip away whitespaces. Namely, whitespaces between elements’s open/start tags are supposed to be rendered. With the last algorithm, unfortunately, some of those whitespaces can be stripped away. If you’re investigating bugs in this area, that’s something else to pay attention to. P.S. I’m not trying to review this patch so if anyone can review it, please go ahead. I was just leaving a drive-by comment.
(In reply to Ryosuke Niwa from comment #12) > By the way, there is a related case where we’d currently strip away > whitespaces. Namely, whitespaces between elements’s open/start tags And between two end/close tags. See CSS2.1’s appendix where they talk about whitespaces collapsing. Or you can ask Antti / Zalan / Simon about that too :)