Bug 231606

Summary: ASSERT hit in surrogatePairAwareIndex and surrogatePairAwareStart lambdas for text with unpaired surrogates
Product: WebKit Reporter: Gabriel Nava Marino <gnavamarino>
Component: TextAssignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: darin, mmaxfield, webkit-bug-importer, zalan
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch none

Description Gabriel Nava Marino 2021-10-12 12:19:11 PDT
surrogatePairAwareIndex and surrogatePairAwareStart lambdas in TextUtil::MidWordBreak TextUtil::midWordBreak are assuming there are no unpaired surrogates but that is not guaranteed as they don't always come in pairs.
Comment 1 Gabriel Nava Marino 2021-10-12 14:23:35 PDT
Created attachment 440982 [details]
Patch
Comment 2 Myles C. Maxfield 2021-10-12 14:31:55 PDT
Comment on attachment 440982 [details]
Patch

Awesome 💯
Comment 3 Darin Adler 2021-10-12 15:36:11 PDT
Comment on attachment 440982 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=440982&action=review

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:150
> +        bool isLead = U16_IS_LEAD(text[index]) && (index + 1) < text.length() && U16_IS_TRAIL(text[index + 1]);

Given that this is WTF::String, we don’t need the rang checks on index. The subscript operator does enough range checking.

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:-152
> -        RELEASE_ASSERT(index + 1 < text.length());

Why remove this?

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:153
>          return ++index;

This is bizarre. It should return index + 1; why use ++?

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:172
> +                bool isTrail = index && U16_IS_LEAD(text[index - 1]) && index < text.length() && U16_IS_TRAIL(text[index]);

Given that this is WTF::String, we don’t need the additional range checks on index. The subscript operator does enough range checking.

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:176
>                  return --index;

This is bizarre. It should return index - 1; why use --?
Comment 4 EWS 2021-10-12 15:47:43 PDT
Committed r284050 (242879@main): <https://commits.webkit.org/242879@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 440982 [details].
Comment 5 Radar WebKit Bug Importer 2021-10-12 15:48:39 PDT
<rdar://problem/84170185>
Comment 6 Gabriel Nava Marino 2021-10-12 15:53:31 PDT
View in context: https://bugs.webkit.org/attachment.cgi?id=440982&action=review

>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:150
>> +        bool isLead = U16_IS_LEAD(text[index]) && (index + 1) < text.length() && U16_IS_TRAIL(text[index + 1]);
> 
> Given that this is WTF::String, we don’t need the rang checks on index. The subscript operator does enough range checking.

Thank you for the clarification. I will remove the range checks.

>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:-152
>> -        RELEASE_ASSERT(index + 1 < text.length());
> 
> Why remove this?

This is enforced via early return in the if statement range check above. However, since I will remove the range checks, I will re-add this back here.

>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:153
>>          return ++index;
> 
> This is bizarre. It should return index + 1; why use ++?

I will update to use index + 1.

>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:172
>> +                bool isTrail = index && U16_IS_LEAD(text[index - 1]) && index < text.length() && U16_IS_TRAIL(text[index]);
> 
> Given that this is WTF::String, we don’t need the additional range checks on index. The subscript operator does enough range checking.

Thank you for the clarification. I will remove the range checks from here as well.

>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:176
>>                  return --index;
> 
> This is bizarre. It should return index - 1; why use --?

I will update to use index - 1.
Comment 7 Gabriel Nava Marino 2021-10-12 15:58:52 PDT
Reopened to address latest review comments.
Comment 8 Darin Adler 2021-10-12 16:06:20 PDT
Comment on attachment 440982 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=440982&action=review

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:178
>              // When the substring does not fit, the right side is supposed to be the start of the surrogate pair if applicable. 

One other thought, not something we need to do, but maybe worth considering.

The ICU library includes a macro for this specific operation, U16_SET_CP_START. We could just write:

    right = middle;
    U16_SET_CP_START(text, 0, right);

Then we would not need to write the surrogatePairAwareStart function at all. Perhaps there is a suitable macro for the surrogatePairAwareIndex operation too; I didn’t find it.
Comment 9 Gabriel Nava Marino 2021-10-12 16:50:47 PDT
(In reply to Darin Adler from comment #8)
> Comment on attachment 440982 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=440982&action=review
> 
> > Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:178
> >              // When the substring does not fit, the right side is supposed to be the start of the surrogate pair if applicable. 
> 
> One other thought, not something we need to do, but maybe worth considering.
> 
> The ICU library includes a macro for this specific operation,
> U16_SET_CP_START. We could just write:
> 
>     right = middle;
>     U16_SET_CP_START(text, 0, right);
> 
> Then we would not need to write the surrogatePairAwareStart function at all.

Thank you for pointing this out. I will try it!

> Perhaps there is a suitable macro for the surrogatePairAwareIndex operation
> too; I didn’t find it.

It looks like this could have been the closest one, but not quite:
U16_SET_CP_LIMIT(s, start, i, length)
Comment 10 Darin Adler 2021-10-12 17:22:27 PDT
(In reply to Gabriel Nava Marino from comment #9)
> It looks like this could have been the closest one, but not quite:
> U16_SET_CP_LIMIT(s, start, i, length)

Yes, I do think clients of surrogatePairAwareIndex can be rewritten to use U16_SET_CP_LIMIT.
Comment 11 Darin Adler 2021-10-12 17:22:39 PDT
(In reply to Darin Adler from comment #10)
> (In reply to Gabriel Nava Marino from comment #9)
> > It looks like this could have been the closest one, but not quite:
> > U16_SET_CP_LIMIT(s, start, i, length)
> 
> Yes, I do think clients of surrogatePairAwareIndex can be rewritten to use
> U16_SET_CP_LIMIT.

Simply and fairly elegantly.
Comment 12 Darin Adler 2021-10-12 17:35:52 PDT
Comment on attachment 440982 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=440982&action=review

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:155
>  

It would be:

    auto right = std::min<unsigned>(left + offset, (startPosition + length - 1));
    U16_SET_CP_LIMIT(text, 0, right, text.length());

    auto middle = surrogatePairAwareIndex((left + right) / 2);
    U16_SET_CP_LIMIT(text, 0, middle, text.length());
Comment 13 Gabriel Nava Marino 2021-10-12 17:46:42 PDT
(In reply to Darin Adler from comment #12)
> Comment on attachment 440982 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=440982&action=review
> 
> > Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:155
> >  
> 
> It would be:
> 
>     auto right = std::min<unsigned>(left + offset, (startPosition + length -
> 1));
>     U16_SET_CP_LIMIT(text, 0, right, text.length());
> 
>     auto middle = surrogatePairAwareIndex((left + right) / 2);
>     U16_SET_CP_LIMIT(text, 0, middle, text.length());

Thank you, I will try this! Although I just notice that currently, surrogatePairAwareIndex returns the index at trail whereas U16_SET_CP_LIMIT returns the index past it:

#define U16_SET_CP_LIMIT(s, start, i, length) UPRV_BLOCK_MACRO_BEGIN { \
    if((start)<(i) && ((i)<(length) || (length)<0) && U16_IS_LEAD((s)[(i)-1]) && U16_IS_TRAIL((s)[i])) { \
        ++(i); \
    } \
} UPRV_BLOCK_MACRO_END
Comment 14 Darin Adler 2021-10-12 17:50:54 PDT
Comment on attachment 440982 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=440982&action=review

>>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:155
>>>  
>> 
>> It would be:
>> 
>>     auto right = std::min<unsigned>(left + offset, (startPosition + length - 1));
>>     U16_SET_CP_LIMIT(text, 0, right, text.length());
>> 
>>     auto middle = surrogatePairAwareIndex((left + right) / 2);
>>     U16_SET_CP_LIMIT(text, 0, middle, text.length());
> 
> Thank you, I will try this! Although I just notice that currently, surrogatePairAwareIndex returns the index at trail whereas U16_SET_CP_LIMIT returns the index past it:
> 
> #define U16_SET_CP_LIMIT(s, start, i, length) UPRV_BLOCK_MACRO_BEGIN { \
>     if((start)<(i) && ((i)<(length) || (length)<0) && U16_IS_LEAD((s)[(i)-1]) && U16_IS_TRAIL((s)[i])) { \
>         ++(i); \
>     } \
> } UPRV_BLOCK_MACRO_END

Oh, OK, good point. So my code is wrong. It would be more like this:

    auto right = std::min<unsigned>(left + offset + 1, startPosition + length);
    U16_SET_CP_LIMIT(text, 0, right, text.length());
    --right;
Comment 15 Darin Adler 2021-10-12 17:53:08 PDT
Comment on attachment 440982 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=440982&action=review

>>>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:155
>>>>  
>>> 
>>> It would be:
>>> 
>>>     auto right = std::min<unsigned>(left + offset, (startPosition + length - 1));
>>>     U16_SET_CP_LIMIT(text, 0, right, text.length());
>>> 
>>>     auto middle = surrogatePairAwareIndex((left + right) / 2);
>>>     U16_SET_CP_LIMIT(text, 0, middle, text.length());
>> 
>> Thank you, I will try this! Although I just notice that currently, surrogatePairAwareIndex returns the index at trail whereas U16_SET_CP_LIMIT returns the index past it:
>> 
>> #define U16_SET_CP_LIMIT(s, start, i, length) UPRV_BLOCK_MACRO_BEGIN { \
>>     if((start)<(i) && ((i)<(length) || (length)<0) && U16_IS_LEAD((s)[(i)-1]) && U16_IS_TRAIL((s)[i])) { \
>>         ++(i); \
>>     } \
>> } UPRV_BLOCK_MACRO_END
> 
> Oh, OK, good point. So my code is wrong. It would be more like this:
> 
>     auto right = std::min<unsigned>(left + offset + 1, startPosition + length);
>     U16_SET_CP_LIMIT(text, 0, right, text.length());
>     --right;

And perhaps then could change the logic to get rid of the need for the "--right".

Maybe a waste of time, but I do think that using the U16 macros is still a little easier to get correct than writing our own code.
Comment 16 Gabriel Nava Marino 2021-10-12 18:07:21 PDT
(In reply to Darin Adler from comment #15)
> Comment on attachment 440982 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=440982&action=review
> 
> >>>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:155
> >>>>  
> >>> 
> >>> It would be:
> >>> 
> >>>     auto right = std::min<unsigned>(left + offset, (startPosition + length - 1));
> >>>     U16_SET_CP_LIMIT(text, 0, right, text.length());
> >>> 
> >>>     auto middle = surrogatePairAwareIndex((left + right) / 2);
> >>>     U16_SET_CP_LIMIT(text, 0, middle, text.length());
> >> 
> >> Thank you, I will try this! Although I just notice that currently, surrogatePairAwareIndex returns the index at trail whereas U16_SET_CP_LIMIT returns the index past it:
> >> 
> >> #define U16_SET_CP_LIMIT(s, start, i, length) UPRV_BLOCK_MACRO_BEGIN { \
> >>     if((start)<(i) && ((i)<(length) || (length)<0) && U16_IS_LEAD((s)[(i)-1]) && U16_IS_TRAIL((s)[i])) { \
> >>         ++(i); \
> >>     } \
> >> } UPRV_BLOCK_MACRO_END
> > 
> > Oh, OK, good point. So my code is wrong. It would be more like this:
> > 
> >     auto right = std::min<unsigned>(left + offset + 1, startPosition + length);
> >     U16_SET_CP_LIMIT(text, 0, right, text.length());
> >     --right;
> 
> And perhaps then could change the logic to get rid of the need for the
> "--right".

I will work to update the logic to avoid the extra decrement. Thank you for the suggestion Darin.

> 
> Maybe a waste of time, but I do think that using the U16 macros is still a
> little easier to get correct than writing our own code.

Yes, agreed! I will work to get this macro working. The upside being that the "safe" macro offers handling of unpaired surrogates and checks for string boundaries.
Comment 17 Gabriel Nava Marino 2021-10-13 15:41:46 PDT
(In reply to Darin Adler from comment #15)
> Comment on attachment 440982 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=440982&action=review
> 
> >>>> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:155
> >>>>  
> >>> 
> >>> It would be:
> >>> 
> >>>     auto right = std::min<unsigned>(left + offset, (startPosition + length - 1));
> >>>     U16_SET_CP_LIMIT(text, 0, right, text.length());
> >>> 
> >>>     auto middle = surrogatePairAwareIndex((left + right) / 2);
> >>>     U16_SET_CP_LIMIT(text, 0, middle, text.length());
> >> 
> >> Thank you, I will try this! Although I just notice that currently, surrogatePairAwareIndex returns the index at trail whereas U16_SET_CP_LIMIT returns the index past it:
> >> 
> >> #define U16_SET_CP_LIMIT(s, start, i, length) UPRV_BLOCK_MACRO_BEGIN { \
> >>     if((start)<(i) && ((i)<(length) || (length)<0) && U16_IS_LEAD((s)[(i)-1]) && U16_IS_TRAIL((s)[i])) { \
> >>         ++(i); \
> >>     } \
> >> } UPRV_BLOCK_MACRO_END
> > 
> > Oh, OK, good point. So my code is wrong. It would be more like this:
> > 
> >     auto right = std::min<unsigned>(left + offset + 1, startPosition + length);
> >     U16_SET_CP_LIMIT(text, 0, right, text.length());
> >     --right;
> 
> And perhaps then could change the logic to get rid of the need for the
> "--right".

I tried adjusting the logic so that we could get rid of the need for the "--right". However, because of the loop condition "while (left < right)" and need for two U16_SET_CP_LIMIT calls and then the subsequent U16_SET_CP_START, we would need at least one "--right" after U16_SET_CP_START.

To avoid this, I have opted to keep surrogatePairAwareIndex and return index - 1 instead. Although we are still using this lambda, it is only using the U16 U16_SET_CP_LIMIT macro now.

> 
> Maybe a waste of time, but I do think that using the U16 macros is still a
> little easier to get correct than writing our own code.
Comment 18 Darin Adler 2021-10-13 15:42:33 PDT
(In reply to Gabriel Nava Marino from comment #17)
> To avoid this, I have opted to keep surrogatePairAwareIndex and return index
> - 1 instead. Although we are still using this lambda, it is only using the
> U16 U16_SET_CP_LIMIT macro now.

OK, sounds pretty good.
Comment 19 Gabriel Nava Marino 2021-10-13 15:46:03 PDT
Created attachment 441150 [details]
Patch
Comment 20 Darin Adler 2021-10-13 16:05:13 PDT
Comment on attachment 441150 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=441150&action=review

Thanks for taking advantage of my suggestion to use higher level U16 macros instead of writing the code ourselves. I think it’s looking good.

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:149
> +        U16_SET_CP_LIMIT(text, 0, index, text.length());

I think it’s a little bit less clear to modify an argument, index, than to modify a copy of it. It’s an informal coding style rule in WebKit that we typically don’t use arguments that way, even when they are passed by value and can be used that way.

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:159
> +    auto right = surrogatePairAwareIndex(std::min<unsigned>(left + offset + 1, (startPosition + length)));

Why add the + 1 for the the argument to surrogatePairAwareIndex at each place it’s called? Seems like the + 1 could be done inside that function.

Maybe you think the function makes more sense with these semantics? If so, that’s OK, but if not, I’d do it there.
Comment 21 Gabriel Nava Marino 2021-10-13 16:12:41 PDT
(In reply to Darin Adler from comment #20)
> Comment on attachment 441150 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=441150&action=review
> 
> Thanks for taking advantage of my suggestion to use higher level U16 macros
> instead of writing the code ourselves. I think it’s looking good.
> 
> > Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:149
> > +        U16_SET_CP_LIMIT(text, 0, index, text.length());
> 
> I think it’s a little bit less clear to modify an argument, index, than to
> modify a copy of it. It’s an informal coding style rule in WebKit that we
> typically don’t use arguments that way, even when they are passed by value
> and can be used that way.
> 

Thank you for the suggestion. I will make a copy of the argument.

> > Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:159
> > +    auto right = surrogatePairAwareIndex(std::min<unsigned>(left + offset + 1, (startPosition + length)));
> 
> Why add the + 1 for the the argument to surrogatePairAwareIndex at each
> place it’s called? Seems like the + 1 could be done inside that function.
> 
> Maybe you think the function makes more sense with these semantics? If so,
> that’s OK, but if not, I’d do it there.

It was done primarily to avoid doing the + 1 inside the function, but if we are also making a copy of the argument, then I will do the + 1 inside the function at the same time. Thank you for the suggestion.
Comment 22 Gabriel Nava Marino 2021-10-13 16:17:18 PDT
Created attachment 441156 [details]
Patch
Comment 23 Darin Adler 2021-10-13 16:32:08 PDT
Comment on attachment 441156 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=441156&action=review

> Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:151
> +        RELEASE_ASSERT(offset - 1 < text.length());

Not 100% sure we need this RELEASE_ASSERT.
Comment 24 Gabriel Nava Marino 2021-10-13 16:42:17 PDT
(In reply to Darin Adler from comment #23)
> Comment on attachment 441156 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=441156&action=review
> 
> > Source/WebCore/layout/formattingContexts/inline/text/TextUtil.cpp:151
> > +        RELEASE_ASSERT(offset - 1 < text.length());
> 
> Not 100% sure we need this RELEASE_ASSERT.

We don't need this. I will update this to remove it. Thank you.
Comment 25 Gabriel Nava Marino 2021-10-13 16:43:21 PDT
Created attachment 441158 [details]
Patch
Comment 26 EWS 2021-10-13 17:57:22 PDT
Committed r284141 (242961@main): <https://commits.webkit.org/242961@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 441158 [details].