Bug 192945 - Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in userVisibleString
Summary: Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in user...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: PC Linux
: P2 Normal
Assignee: Michael Catanzaro
URL:
Keywords: InRadar
Depends on:
Blocks: 174816
  Show dependency treegraph
 
Reported: 2018-12-20 12:07 PST by Michael Catanzaro
Modified: 2019-01-15 19:27 PST (History)
12 users (show)

See Also:


Attachments
Patch (2.64 KB, patch)
2018-12-20 13:02 PST, Michael Catanzaro
no flags Details | Formatted Diff | Diff
Patch (2.65 KB, patch)
2018-12-20 13:18 PST, Michael Catanzaro
no flags Details | Formatted Diff | Diff
Patch (2.86 KB, patch)
2018-12-20 13:30 PST, Michael Catanzaro
no flags Details | Formatted Diff | Diff
Patch (2.87 KB, patch)
2018-12-20 13:46 PST, Michael Catanzaro
no flags Details | Formatted Diff | Diff
Patch (2.87 KB, patch)
2018-12-20 13:55 PST, Michael Catanzaro
no flags Details | Formatted Diff | Diff
Patch (2.91 KB, patch)
2018-12-22 16:48 PST, Michael Catanzaro
no flags Details | Formatted Diff | Diff
Patch (3.39 KB, patch)
2019-01-14 17:41 PST, Michael Catanzaro
achristensen: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Catanzaro 2018-12-20 12:07:51 PST
Replace use of the nice NSString function precomposedStringWithCanonicalMapping with the ICU API unorm2_normalize. This is to prep the code for translation to cross-platform C++.

Please congratulate the ICU developers for creating an API that is impossible to safely use except when calling it twice in succession.

Note this code was all written by Ms2ger; I'm just splitting it into its own patch because I fear we've lost him to end of year holidays.
Comment 1 Michael Catanzaro 2018-12-20 12:58:35 PST
(In reply to Michael Catanzaro from comment #0)
> Note this code was all written by Ms2ger; I'm just splitting it into its own
> patch because I fear we've lost him to end of year holidays.

OK, I've mucked with it quite a bit to attempt to integrate it into NSURLExtras.mm. Let's say credit to Ms2ger if it works and my fault if EWS comes back red red red.....
Comment 2 Michael Catanzaro 2018-12-20 13:02:14 PST
Created attachment 357853 [details]
Patch
Comment 3 Michael Catanzaro 2018-12-20 13:18:39 PST
Created attachment 357856 [details]
Patch
Comment 4 Michael Catanzaro 2018-12-20 13:30:36 PST
Created attachment 357858 [details]
Patch
Comment 5 Michael Catanzaro 2018-12-20 13:46:10 PST
Created attachment 357862 [details]
Patch
Comment 6 Michael Catanzaro 2018-12-20 13:55:40 PST
Created attachment 357865 [details]
Patch
Comment 7 Michael Catanzaro 2018-12-20 14:34:00 PST
Comment on attachment 357865 [details]
Patch

Unsetting r? until the Mac EWS are working again:

Exception: Error auto-installing the keyring package to:
 "/Volumes/Data/EWS/WebKit/Tools/Scripts/webkitpy/thirdparty/autoinstalled/keyring"
 --> Inner message: Could not download Python modules from URL "None".
 Make sure you are connected to the internet.
 You must be connected to the internet when downloading needed modules for the first time.
 --> Inner message: <urlopen error [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)>
Comment 8 Michael Catanzaro 2018-12-20 17:00:02 PST
Comment on attachment 357865 [details]
Patch

OK, all the bots are happy except for the network error....

This is my first Objective C[++] ever and I am very proud of it! Even if it is mostly copied from Ms2ger.

Note that after this lands, I will immediately reland the patch in bug #174816, deleting all this code. This commit is just to make sure that if the switch from the NSString normalization to the ICU normalization introduces an unexpected regression, it will bisect down to this small revision instead of the huge patch in that bug.
Comment 9 Alex Christensen 2018-12-20 17:05:09 PST
Comment on attachment 357865 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review

> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170
> +    ASSERT(sourceBuffer.last() == '\0');

This assertion seems redundant.

> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175
> +    const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror);

Can unorm2_getNFCInstance never fail?

> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177
> +    if (uerror == U_BUFFER_OVERFLOW_ERROR) {

Is this case covered by tests?  Are there tests with emoji?
Comment 10 Michael Catanzaro 2018-12-20 19:47:11 PST
Comment on attachment 357865 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review

>> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170
>> +    ASSERT(sourceBuffer.last() == '\0');
> 
> This assertion seems redundant.

It is, but it makes the next line sourceBuffer.removeLast() easier to understand, so seems worth keeping.

>> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175
>> +    const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror);
> 
> Can unorm2_getNFCInstance never fail?

Good catch. It can fail if there is a memory allocation error. So not in practice, but we should check it regardless, especially since we create an error, pass it in, and don't check it... that's no good. And maybe in the future the function could change to fail in different ways.

>> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177
>> +    if (uerror == U_BUFFER_OVERFLOW_ERROR) {
> 
> Is this case covered by tests?  Are there tests with emoji?

There is a test with a snowman emoji.

None of the tests intentionally triggers this "error condition". (Of course it's not a real error, just bad API design to require the caller to allocate a bigger buffer instead of ICU allocating the right size and returning it.) I can check the patch in bug #174816 tomorrow to see if any of those tests hit this codepath in the big patch.
Comment 11 Michael Catanzaro 2018-12-22 16:38:59 PST
(In reply to Michael Catanzaro from comment #10)
> None of the tests intentionally triggers this "error condition". (Of course
> it's not a real error, just bad API design to require the caller to allocate
> a bigger buffer instead of ICU allocating the right size and returning it.)
> I can check the patch in bug #174816 tomorrow to see if any of those tests
> hit this codepath in the big patch.

None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further.

I'll update the patch to add the missing error check, though.
Comment 12 Michael Catanzaro 2018-12-22 16:48:15 PST
Created attachment 358027 [details]
Patch
Comment 13 Alexey Proskuryakov 2018-12-23 10:46:29 PST
> None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further.

Is this actually about Unicode, or just strings longer than 2048 characters?

Unicode can certainly be tricky because conversions can make strings longer, but I don't see how that comes into play in this code.
Comment 14 Michael Catanzaro 2018-12-24 10:34:02 PST
Comment on attachment 358027 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=358027&action=review

> Source/WTF/wtf/cocoa/NSURLExtras.mm:1180
> +        normalizedLength = unorm2_normalize(normalizer, sourceBuffer.data(), sourceBuffer.size(), normalizedCharacters.data(), normalizedCharacters.size(), &uerror);
> +        if (uerror == U_BUFFER_OVERFLOW_ERROR) {

This "error" occurs if the output size is greater than input size (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit inline capacity of sourceBuffer.

In all of the URLHelpers.cpp tests from bug #174816, the output size is always equal to the input size, so it is not currently tested (probably; there are a couple more tests in URLExtras.mm that I have not run).
Comment 15 Michael Catanzaro 2018-12-24 10:35:33 PST
(In reply to Michael Catanzaro from comment #14)
> This "error" occurs if the output size is greater than input size
> (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit
> inline capacity of sourceBuffer.

More specifically: the error occurs if the output size is greater than the size of the buffer provided for the output, normalizedCharacters.size(). And normalizedCharacters.size() is initially the same as sourceBuffer.size().
Comment 16 Myles C. Maxfield 2019-01-02 14:52:56 PST
Alex Christensen should review this.
Comment 17 Alex Christensen 2019-01-03 08:49:44 PST
Comment on attachment 358027 [details]
Patch

I think this code should be put in a subroutine with a name that takes a const String& parameter and returns a String.
Comment 18 Michael Catanzaro 2019-01-14 17:39:13 PST
Adding a new subroutine might be harder for me than you'd think, but here goes.

A sacrifice to the EWS gods: 🔥🐑🔥
Comment 19 Michael Catanzaro 2019-01-14 17:41:25 PST
Created attachment 359107 [details]
Patch
Comment 20 Michael Catanzaro 2019-01-14 19:24:40 PST
Thanks!

(In reply to Michael Catanzaro from comment #18)
> A sacrifice to the EWS gods: 🔥🐑🔥

This must have worked. Can't believe it.
Comment 21 Michael Catanzaro 2019-01-14 19:26:08 PST
Committed r239970: <https://trac.webkit.org/changeset/239970>
Comment 22 Radar WebKit Bug Importer 2019-01-14 19:29:29 PST
<rdar://problem/47274320>
Comment 23 Darin Adler 2019-01-15 08:56:13 PST
Why does this use charactersWithNullTermination? That’s the worst part of this new code, I think.

Do we need a fast path for any simple cases, like all ASCII?
Comment 24 Darin Adler 2019-01-15 08:57:22 PST
I think this can and should easily be improved so that it works without allocating memory just to copy the characters into it.
Comment 25 Darin Adler 2019-01-15 08:57:47 PST
I’m referring to charactersWithNullTermination, not the Vector on the stack.
Comment 26 Michael Catanzaro 2019-01-15 09:35:35 PST
(In reply to Darin Adler from comment #23)
> Why does this use charactersWithNullTermination? That’s the worst part of
> this new code, I think.

We need a UChar* buffer. It could also be done using String::characters16 and copying the resultant buffer, but that seems worse.

> Do we need a fast path for any simple cases, like all ASCII?

That's a question for the performance tests, but I'd guess probably not, because displaying URLs should never be performance-sensitive. (And, without looking into implementation, it's possible that ICU has a fast path already internally, and also possible that the NSString function we've replaced did not. We'd have to check the implementations to decide whether to expect the new one to be slower.)

Regardless, I'd like to handle suggested improvements in bug #174816, please, so we can do them in C++ instead of prayers. (I've never written Objective C++ before, and have just been throwing patches at EWS hoping they'd build and maybe also work.)
Comment 27 Darin Adler 2019-01-15 12:54:28 PST
(In reply to Michael Catanzaro from comment #26)
> (In reply to Darin Adler from comment #23)
> > Why does this use charactersWithNullTermination? That’s the worst part of
> > this new code, I think.
> 
> We need a UChar* buffer. It could also be done using String::characters16
> and copying the resultant buffer, but that seems worse.

The idiom to efficiently get a UChar* buffer would be:

    auto characters = StringView { string }.upconvertedCharacters();

Then "characters" can be passed anywhere that needs a "const UChar*". We should switch to that I think.

I don’t know why you say "copying the resultant buffer". I don’t see any reason we’d need to copy it.

Feel free to handle this in bug 174816. Nothing super urgent, just unnecessarily underperforming idiom. Nothing should use charactersWithNullTermination unless it needs null termination.
Comment 28 Darin Adler 2019-01-15 12:55:53 PST
(In reply to Michael Catanzaro from comment #26)
> > Do we need a fast path for any simple cases, like all ASCII?
> 
> That's a question for the performance tests, but I'd guess probably not,
> because displaying URLs should never be performance-sensitive. (And, without
> looking into implementation, it's possible that ICU has a fast path already
> internally, and also possible that the NSString function we've replaced did
> not. We'd have to check the implementations to decide whether to expect the
> new one to be slower.)

I think we should consider a special case for when the string is already normalized. That will not only help speed but might even help memory use because we’d share the string instead of allocating a new one with all the same characters.

But maybe I’m missing the big picture. This might be such a specialized corner of the code that these kinds of optimization aren’t important.
Comment 29 Michael Catanzaro 2019-01-15 13:23:13 PST
(In reply to Darin Adler from comment #27)
> The idiom to efficiently get a UChar* buffer would be:
> 
>     auto characters = StringView { string }.upconvertedCharacters();
> 
> Then "characters" can be passed anywhere that needs a "const UChar*". We
> should switch to that I think.
> 
> I don’t know why you say "copying the resultant buffer". I don’t see any
> reason we’d need to copy it.
> 
> Feel free to handle this in bug 174816. Nothing super urgent, just
> unnecessarily underperforming idiom. Nothing should use
> charactersWithNullTermination unless it needs null termination.

OK, I'll copy this into bug #174816.

(In reply to Darin Adler from comment #28)
> I think we should consider a special case for when the string is already
> normalized. That will not only help speed but might even help memory use
> because we’d share the string instead of allocating a new one with all the
> same characters.

Problem is: how to detect that the string is already normalized? We could pass it through this API and verify that the result is unchanged. :D

> But maybe I’m missing the big picture. This might be such a specialized
> corner of the code that these kinds of optimization aren’t important.

Hopefully, since this code only runs when formatting a URL for display (e.g. in the address bar).
Comment 30 Darin Adler 2019-01-15 17:14:42 PST
(In reply to Michael Catanzaro from comment #29)
> Problem is: how to detect that the string is already normalized?

unorm2_quickCheck

> We could
> pass it through this API and verify that the result is unchanged. :D

Or that.
Comment 31 Michael Catanzaro 2019-01-15 19:27:52 PST
(In reply to Darin Adler from comment #30)
> (In reply to Michael Catanzaro from comment #29)
> > Problem is: how to detect that the string is already normalized?
> 
> unorm2_quickCheck

OK, cool: piled that onto bug #174816.