192945 – Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in userVisibleString

RESOLVED FIXED 192945

Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in userVisibleString

https://bugs.webkit.org/show_bug.cgi?id=192945

Summary Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in user...

Michael Catanzaro

Reported 2018-12-20 12:07:51 PST

Replace use of the nice NSString function precomposedStringWithCanonicalMapping with the ICU API unorm2_normalize. This is to prep the code for translation to cross-platform C++. Please congratulate the ICU developers for creating an API that is impossible to safely use except when calling it twice in succession. Note this code was all written by Ms2ger; I'm just splitting it into its own patch because I fear we've lost him to end of year holidays.

Attachments
Patch (2.64 KB, patch) 2018-12-20 13:02 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (2.65 KB, patch) 2018-12-20 13:18 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (2.86 KB, patch) 2018-12-20 13:30 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (2.87 KB, patch) 2018-12-20 13:46 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (2.87 KB, patch) 2018-12-20 13:55 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (2.91 KB, patch) 2018-12-22 16:48 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (3.39 KB, patch) 2019-01-14 17:41 PST, Michael Catanzaro	achristensen: review+	Details Formatted Diff Diff
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Michael Catanzaro

Comment 1 2018-12-20 12:58:35 PST

(In reply to Michael Catanzaro from comment #0) > Note this code was all written by Ms2ger; I'm just splitting it into its own > patch because I fear we've lost him to end of year holidays. OK, I've mucked with it quite a bit to attempt to integrate it into NSURLExtras.mm. Let's say credit to Ms2ger if it works and my fault if EWS comes back red red red.....

Michael Catanzaro

Comment 2 2018-12-20 13:02:14 PST

Created attachment 357853 [details] Patch

Michael Catanzaro

Comment 3 2018-12-20 13:18:39 PST

Created attachment 357856 [details] Patch

Michael Catanzaro

Comment 4 2018-12-20 13:30:36 PST

Created attachment 357858 [details] Patch

Michael Catanzaro

Comment 5 2018-12-20 13:46:10 PST

Created attachment 357862 [details] Patch

Michael Catanzaro

Comment 6 2018-12-20 13:55:40 PST

Created attachment 357865 [details] Patch

Michael Catanzaro

Comment 7 2018-12-20 14:34:00 PST

Comment on attachment 357865 [details] Patch Unsetting r? until the Mac EWS are working again: Exception: Error auto-installing the keyring package to: "/Volumes/Data/EWS/WebKit/Tools/Scripts/webkitpy/thirdparty/autoinstalled/keyring" --> Inner message: Could not download Python modules from URL "None". Make sure you are connected to the internet. You must be connected to the internet when downloading needed modules for the first time. --> Inner message: <urlopen error [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)>

Michael Catanzaro

Comment 8 2018-12-20 17:00:02 PST

Comment on attachment 357865 [details] Patch OK, all the bots are happy except for the network error.... This is my first Objective C[++] ever and I am very proud of it! Even if it is mostly copied from Ms2ger. Note that after this lands, I will immediately reland the patch in bug #174816, deleting all this code. This commit is just to make sure that if the switch from the NSString normalization to the ICU normalization introduces an unexpected regression, it will bisect down to this small revision instead of the huge patch in that bug.

Alex Christensen

Comment 9 2018-12-20 17:05:09 PST

Comment on attachment 357865 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review > Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 > + ASSERT(sourceBuffer.last() == '\0'); This assertion seems redundant. > Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 > + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); Can unorm2_getNFCInstance never fail? > Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 > + if (uerror == U_BUFFER_OVERFLOW_ERROR) { Is this case covered by tests? Are there tests with emoji?

Michael Catanzaro

Comment 10 2018-12-20 19:47:11 PST

Comment on attachment 357865 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 >> + ASSERT(sourceBuffer.last() == '\0'); > > This assertion seems redundant. It is, but it makes the next line sourceBuffer.removeLast() easier to understand, so seems worth keeping. >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 >> + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); > > Can unorm2_getNFCInstance never fail? Good catch. It can fail if there is a memory allocation error. So not in practice, but we should check it regardless, especially since we create an error, pass it in, and don't check it... that's no good. And maybe in the future the function could change to fail in different ways. >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 >> + if (uerror == U_BUFFER_OVERFLOW_ERROR) { > > Is this case covered by tests? Are there tests with emoji? There is a test with a snowman emoji. None of the tests intentionally triggers this "error condition". (Of course it's not a real error, just bad API design to require the caller to allocate a bigger buffer instead of ICU allocating the right size and returning it.) I can check the patch in bug #174816 tomorrow to see if any of those tests hit this codepath in the big patch.

Michael Catanzaro

Comment 11 2018-12-22 16:38:59 PST

(In reply to Michael Catanzaro from comment #10) > None of the tests intentionally triggers this "error condition". (Of course > it's not a real error, just bad API design to require the caller to allocate > a bigger buffer instead of ICU allocating the right size and returning it.) > I can check the patch in bug #174816 tomorrow to see if any of those tests > hit this codepath in the big patch. None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. I'll update the patch to add the missing error check, though.

Michael Catanzaro

Comment 12 2018-12-22 16:48:15 PST

Created attachment 358027 [details] Patch

Alexey Proskuryakov

Comment 13 2018-12-23 10:46:29 PST

> None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. Is this actually about Unicode, or just strings longer than 2048 characters? Unicode can certainly be tricky because conversions can make strings longer, but I don't see how that comes into play in this code.

Michael Catanzaro

Comment 14 2018-12-24 10:34:02 PST

Comment on attachment 358027 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=358027&action=review > Source/WTF/wtf/cocoa/NSURLExtras.mm:1180 > + normalizedLength = unorm2_normalize(normalizer, sourceBuffer.data(), sourceBuffer.size(), normalizedCharacters.data(), normalizedCharacters.size(), &uerror); > + if (uerror == U_BUFFER_OVERFLOW_ERROR) { This "error" occurs if the output size is greater than input size (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit inline capacity of sourceBuffer. In all of the URLHelpers.cpp tests from bug #174816, the output size is always equal to the input size, so it is not currently tested (probably; there are a couple more tests in URLExtras.mm that I have not run).

Michael Catanzaro

Comment 15 2018-12-24 10:35:33 PST

(In reply to Michael Catanzaro from comment #14) > This "error" occurs if the output size is greater than input size > (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit > inline capacity of sourceBuffer. More specifically: the error occurs if the output size is greater than the size of the buffer provided for the output, normalizedCharacters.size(). And normalizedCharacters.size() is initially the same as sourceBuffer.size().

Myles C. Maxfield

Comment 16 2019-01-02 14:52:56 PST

Alex Christensen should review this.

Alex Christensen

Comment 17 2019-01-03 08:49:44 PST

Comment on attachment 358027 [details] Patch I think this code should be put in a subroutine with a name that takes a const String& parameter and returns a String.

Michael Catanzaro

Comment 18 2019-01-14 17:39:13 PST

Adding a new subroutine might be harder for me than you'd think, but here goes. A sacrifice to the EWS gods: 🔥🐑🔥

Michael Catanzaro

Comment 19 2019-01-14 17:41:25 PST

Created attachment 359107 [details] Patch

Michael Catanzaro

Comment 20 2019-01-14 19:24:40 PST

Thanks! (In reply to Michael Catanzaro from comment #18) > A sacrifice to the EWS gods: 🔥🐑🔥 This must have worked. Can't believe it.

Michael Catanzaro

Comment 21 2019-01-14 19:26:08 PST

Committed r239970: <https://trac.webkit.org/changeset/239970>

Radar WebKit Bug Importer

Comment 22 2019-01-14 19:29:29 PST

<rdar://problem/47274320>

Darin Adler

Comment 23 2019-01-15 08:56:13 PST

Why does this use charactersWithNullTermination? That’s the worst part of this new code, I think. Do we need a fast path for any simple cases, like all ASCII?

Darin Adler

Comment 24 2019-01-15 08:57:22 PST

I think this can and should easily be improved so that it works without allocating memory just to copy the characters into it.

Darin Adler

Comment 25 2019-01-15 08:57:47 PST

I’m referring to charactersWithNullTermination, not the Vector on the stack.

Michael Catanzaro

Comment 26 2019-01-15 09:35:35 PST

(In reply to Darin Adler from comment #23) > Why does this use charactersWithNullTermination? That’s the worst part of > this new code, I think. We need a UChar* buffer. It could also be done using String::characters16 and copying the resultant buffer, but that seems worse. > Do we need a fast path for any simple cases, like all ASCII? That's a question for the performance tests, but I'd guess probably not, because displaying URLs should never be performance-sensitive. (And, without looking into implementation, it's possible that ICU has a fast path already internally, and also possible that the NSString function we've replaced did not. We'd have to check the implementations to decide whether to expect the new one to be slower.) Regardless, I'd like to handle suggested improvements in bug #174816, please, so we can do them in C++ instead of prayers. (I've never written Objective C++ before, and have just been throwing patches at EWS hoping they'd build and maybe also work.)

Darin Adler

Comment 27 2019-01-15 12:54:28 PST

(In reply to Michael Catanzaro from comment #26) > (In reply to Darin Adler from comment #23) > > Why does this use charactersWithNullTermination? That’s the worst part of > > this new code, I think. > > We need a UChar* buffer. It could also be done using String::characters16 > and copying the resultant buffer, but that seems worse. The idiom to efficiently get a UChar* buffer would be: auto characters = StringView { string }.upconvertedCharacters(); Then "characters" can be passed anywhere that needs a "const UChar*". We should switch to that I think. I don’t know why you say "copying the resultant buffer". I don’t see any reason we’d need to copy it. Feel free to handle this in bug 174816. Nothing super urgent, just unnecessarily underperforming idiom. Nothing should use charactersWithNullTermination unless it needs null termination.

Darin Adler

Comment 28 2019-01-15 12:55:53 PST

(In reply to Michael Catanzaro from comment #26) > > Do we need a fast path for any simple cases, like all ASCII? > > That's a question for the performance tests, but I'd guess probably not, > because displaying URLs should never be performance-sensitive. (And, without > looking into implementation, it's possible that ICU has a fast path already > internally, and also possible that the NSString function we've replaced did > not. We'd have to check the implementations to decide whether to expect the > new one to be slower.) I think we should consider a special case for when the string is already normalized. That will not only help speed but might even help memory use because we’d share the string instead of allocating a new one with all the same characters. But maybe I’m missing the big picture. This might be such a specialized corner of the code that these kinds of optimization aren’t important.

Michael Catanzaro

Comment 29 2019-01-15 13:23:13 PST

(In reply to Darin Adler from comment #27) > The idiom to efficiently get a UChar* buffer would be: > > auto characters = StringView { string }.upconvertedCharacters(); > > Then "characters" can be passed anywhere that needs a "const UChar*". We > should switch to that I think. > > I don’t know why you say "copying the resultant buffer". I don’t see any > reason we’d need to copy it. > > Feel free to handle this in bug 174816. Nothing super urgent, just > unnecessarily underperforming idiom. Nothing should use > charactersWithNullTermination unless it needs null termination. OK, I'll copy this into bug #174816. (In reply to Darin Adler from comment #28) > I think we should consider a special case for when the string is already > normalized. That will not only help speed but might even help memory use > because we’d share the string instead of allocating a new one with all the > same characters. Problem is: how to detect that the string is already normalized? We could pass it through this API and verify that the result is unchanged. :D > But maybe I’m missing the big picture. This might be such a specialized > corner of the code that these kinds of optimization aren’t important. Hopefully, since this code only runs when formatting a URL for display (e.g. in the address bar).

Darin Adler

Comment 30 2019-01-15 17:14:42 PST

(In reply to Michael Catanzaro from comment #29) > Problem is: how to detect that the string is already normalized? unorm2_quickCheck > We could > pass it through this API and verify that the result is unchanged. :D Or that.

Michael Catanzaro

Comment 31 2019-01-15 19:27:52 PST

(In reply to Darin Adler from comment #30) > (In reply to Michael Catanzaro from comment #29) > > Problem is: how to detect that the string is already normalized? > > unorm2_quickCheck OK, cool: piled that onto bug #174816.

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware PC

OS Linux

Product WebKit

Component Web Template Framework

Assignee

Michael Catanzaro

Reported

2018-12-20 12:07 PST

Modified

2019-01-15 19:27 PST History

CC List

12 users Show

URL

Keywords InRadar

Depends on

Blocks

174816

Dependencies

tree graph