RESOLVED FIXED 192945
Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in userVisibleString
https://bugs.webkit.org/show_bug.cgi?id=192945
Summary Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in user...
Michael Catanzaro
Reported 2018-12-20 12:07:51 PST
Replace use of the nice NSString function precomposedStringWithCanonicalMapping with the ICU API unorm2_normalize. This is to prep the code for translation to cross-platform C++. Please congratulate the ICU developers for creating an API that is impossible to safely use except when calling it twice in succession. Note this code was all written by Ms2ger; I'm just splitting it into its own patch because I fear we've lost him to end of year holidays.
Attachments
Patch (2.64 KB, patch)
2018-12-20 13:02 PST, Michael Catanzaro
no flags
Patch (2.65 KB, patch)
2018-12-20 13:18 PST, Michael Catanzaro
no flags
Patch (2.86 KB, patch)
2018-12-20 13:30 PST, Michael Catanzaro
no flags
Patch (2.87 KB, patch)
2018-12-20 13:46 PST, Michael Catanzaro
no flags
Patch (2.87 KB, patch)
2018-12-20 13:55 PST, Michael Catanzaro
no flags
Patch (2.91 KB, patch)
2018-12-22 16:48 PST, Michael Catanzaro
no flags
Patch (3.39 KB, patch)
2019-01-14 17:41 PST, Michael Catanzaro
achristensen: review+
Michael Catanzaro
Comment 1 2018-12-20 12:58:35 PST
(In reply to Michael Catanzaro from comment #0) > Note this code was all written by Ms2ger; I'm just splitting it into its own > patch because I fear we've lost him to end of year holidays. OK, I've mucked with it quite a bit to attempt to integrate it into NSURLExtras.mm. Let's say credit to Ms2ger if it works and my fault if EWS comes back red red red.....
Michael Catanzaro
Comment 2 2018-12-20 13:02:14 PST
Michael Catanzaro
Comment 3 2018-12-20 13:18:39 PST
Michael Catanzaro
Comment 4 2018-12-20 13:30:36 PST
Michael Catanzaro
Comment 5 2018-12-20 13:46:10 PST
Michael Catanzaro
Comment 6 2018-12-20 13:55:40 PST
Michael Catanzaro
Comment 7 2018-12-20 14:34:00 PST
Comment on attachment 357865 [details] Patch Unsetting r? until the Mac EWS are working again: Exception: Error auto-installing the keyring package to: "/Volumes/Data/EWS/WebKit/Tools/Scripts/webkitpy/thirdparty/autoinstalled/keyring" --> Inner message: Could not download Python modules from URL "None". Make sure you are connected to the internet. You must be connected to the internet when downloading needed modules for the first time. --> Inner message: <urlopen error [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)>
Michael Catanzaro
Comment 8 2018-12-20 17:00:02 PST
Comment on attachment 357865 [details] Patch OK, all the bots are happy except for the network error.... This is my first Objective C[++] ever and I am very proud of it! Even if it is mostly copied from Ms2ger. Note that after this lands, I will immediately reland the patch in bug #174816, deleting all this code. This commit is just to make sure that if the switch from the NSString normalization to the ICU normalization introduces an unexpected regression, it will bisect down to this small revision instead of the huge patch in that bug.
Alex Christensen
Comment 9 2018-12-20 17:05:09 PST
Comment on attachment 357865 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review > Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 > + ASSERT(sourceBuffer.last() == '\0'); This assertion seems redundant. > Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 > + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); Can unorm2_getNFCInstance never fail? > Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 > + if (uerror == U_BUFFER_OVERFLOW_ERROR) { Is this case covered by tests? Are there tests with emoji?
Michael Catanzaro
Comment 10 2018-12-20 19:47:11 PST
Comment on attachment 357865 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 >> + ASSERT(sourceBuffer.last() == '\0'); > > This assertion seems redundant. It is, but it makes the next line sourceBuffer.removeLast() easier to understand, so seems worth keeping. >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 >> + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); > > Can unorm2_getNFCInstance never fail? Good catch. It can fail if there is a memory allocation error. So not in practice, but we should check it regardless, especially since we create an error, pass it in, and don't check it... that's no good. And maybe in the future the function could change to fail in different ways. >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 >> + if (uerror == U_BUFFER_OVERFLOW_ERROR) { > > Is this case covered by tests? Are there tests with emoji? There is a test with a snowman emoji. None of the tests intentionally triggers this "error condition". (Of course it's not a real error, just bad API design to require the caller to allocate a bigger buffer instead of ICU allocating the right size and returning it.) I can check the patch in bug #174816 tomorrow to see if any of those tests hit this codepath in the big patch.
Michael Catanzaro
Comment 11 2018-12-22 16:38:59 PST
(In reply to Michael Catanzaro from comment #10) > None of the tests intentionally triggers this "error condition". (Of course > it's not a real error, just bad API design to require the caller to allocate > a bigger buffer instead of ICU allocating the right size and returning it.) > I can check the patch in bug #174816 tomorrow to see if any of those tests > hit this codepath in the big patch. None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. I'll update the patch to add the missing error check, though.
Michael Catanzaro
Comment 12 2018-12-22 16:48:15 PST
Alexey Proskuryakov
Comment 13 2018-12-23 10:46:29 PST
> None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. Is this actually about Unicode, or just strings longer than 2048 characters? Unicode can certainly be tricky because conversions can make strings longer, but I don't see how that comes into play in this code.
Michael Catanzaro
Comment 14 2018-12-24 10:34:02 PST
Comment on attachment 358027 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=358027&action=review > Source/WTF/wtf/cocoa/NSURLExtras.mm:1180 > + normalizedLength = unorm2_normalize(normalizer, sourceBuffer.data(), sourceBuffer.size(), normalizedCharacters.data(), normalizedCharacters.size(), &uerror); > + if (uerror == U_BUFFER_OVERFLOW_ERROR) { This "error" occurs if the output size is greater than input size (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit inline capacity of sourceBuffer. In all of the URLHelpers.cpp tests from bug #174816, the output size is always equal to the input size, so it is not currently tested (probably; there are a couple more tests in URLExtras.mm that I have not run).
Michael Catanzaro
Comment 15 2018-12-24 10:35:33 PST
(In reply to Michael Catanzaro from comment #14) > This "error" occurs if the output size is greater than input size > (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit > inline capacity of sourceBuffer. More specifically: the error occurs if the output size is greater than the size of the buffer provided for the output, normalizedCharacters.size(). And normalizedCharacters.size() is initially the same as sourceBuffer.size().
Myles C. Maxfield
Comment 16 2019-01-02 14:52:56 PST
Alex Christensen should review this.
Alex Christensen
Comment 17 2019-01-03 08:49:44 PST
Comment on attachment 358027 [details] Patch I think this code should be put in a subroutine with a name that takes a const String& parameter and returns a String.
Michael Catanzaro
Comment 18 2019-01-14 17:39:13 PST
Adding a new subroutine might be harder for me than you'd think, but here goes. A sacrifice to the EWS gods: 🔥🐑🔥
Michael Catanzaro
Comment 19 2019-01-14 17:41:25 PST
Michael Catanzaro
Comment 20 2019-01-14 19:24:40 PST
Thanks! (In reply to Michael Catanzaro from comment #18) > A sacrifice to the EWS gods: 🔥🐑🔥 This must have worked. Can't believe it.
Michael Catanzaro
Comment 21 2019-01-14 19:26:08 PST
Radar WebKit Bug Importer
Comment 22 2019-01-14 19:29:29 PST
Darin Adler
Comment 23 2019-01-15 08:56:13 PST
Why does this use charactersWithNullTermination? That’s the worst part of this new code, I think. Do we need a fast path for any simple cases, like all ASCII?
Darin Adler
Comment 24 2019-01-15 08:57:22 PST
I think this can and should easily be improved so that it works without allocating memory just to copy the characters into it.
Darin Adler
Comment 25 2019-01-15 08:57:47 PST
I’m referring to charactersWithNullTermination, not the Vector on the stack.
Michael Catanzaro
Comment 26 2019-01-15 09:35:35 PST
(In reply to Darin Adler from comment #23) > Why does this use charactersWithNullTermination? That’s the worst part of > this new code, I think. We need a UChar* buffer. It could also be done using String::characters16 and copying the resultant buffer, but that seems worse. > Do we need a fast path for any simple cases, like all ASCII? That's a question for the performance tests, but I'd guess probably not, because displaying URLs should never be performance-sensitive. (And, without looking into implementation, it's possible that ICU has a fast path already internally, and also possible that the NSString function we've replaced did not. We'd have to check the implementations to decide whether to expect the new one to be slower.) Regardless, I'd like to handle suggested improvements in bug #174816, please, so we can do them in C++ instead of prayers. (I've never written Objective C++ before, and have just been throwing patches at EWS hoping they'd build and maybe also work.)
Darin Adler
Comment 27 2019-01-15 12:54:28 PST
(In reply to Michael Catanzaro from comment #26) > (In reply to Darin Adler from comment #23) > > Why does this use charactersWithNullTermination? That’s the worst part of > > this new code, I think. > > We need a UChar* buffer. It could also be done using String::characters16 > and copying the resultant buffer, but that seems worse. The idiom to efficiently get a UChar* buffer would be: auto characters = StringView { string }.upconvertedCharacters(); Then "characters" can be passed anywhere that needs a "const UChar*". We should switch to that I think. I don’t know why you say "copying the resultant buffer". I don’t see any reason we’d need to copy it. Feel free to handle this in bug 174816. Nothing super urgent, just unnecessarily underperforming idiom. Nothing should use charactersWithNullTermination unless it needs null termination.
Darin Adler
Comment 28 2019-01-15 12:55:53 PST
(In reply to Michael Catanzaro from comment #26) > > Do we need a fast path for any simple cases, like all ASCII? > > That's a question for the performance tests, but I'd guess probably not, > because displaying URLs should never be performance-sensitive. (And, without > looking into implementation, it's possible that ICU has a fast path already > internally, and also possible that the NSString function we've replaced did > not. We'd have to check the implementations to decide whether to expect the > new one to be slower.) I think we should consider a special case for when the string is already normalized. That will not only help speed but might even help memory use because we’d share the string instead of allocating a new one with all the same characters. But maybe I’m missing the big picture. This might be such a specialized corner of the code that these kinds of optimization aren’t important.
Michael Catanzaro
Comment 29 2019-01-15 13:23:13 PST
(In reply to Darin Adler from comment #27) > The idiom to efficiently get a UChar* buffer would be: > > auto characters = StringView { string }.upconvertedCharacters(); > > Then "characters" can be passed anywhere that needs a "const UChar*". We > should switch to that I think. > > I don’t know why you say "copying the resultant buffer". I don’t see any > reason we’d need to copy it. > > Feel free to handle this in bug 174816. Nothing super urgent, just > unnecessarily underperforming idiom. Nothing should use > charactersWithNullTermination unless it needs null termination. OK, I'll copy this into bug #174816. (In reply to Darin Adler from comment #28) > I think we should consider a special case for when the string is already > normalized. That will not only help speed but might even help memory use > because we’d share the string instead of allocating a new one with all the > same characters. Problem is: how to detect that the string is already normalized? We could pass it through this API and verify that the result is unchanged. :D > But maybe I’m missing the big picture. This might be such a specialized > corner of the code that these kinds of optimization aren’t important. Hopefully, since this code only runs when formatting a URL for display (e.g. in the address bar).
Darin Adler
Comment 30 2019-01-15 17:14:42 PST
(In reply to Michael Catanzaro from comment #29) > Problem is: how to detect that the string is already normalized? unorm2_quickCheck > We could > pass it through this API and verify that the result is unchanged. :D Or that.
Michael Catanzaro
Comment 31 2019-01-15 19:27:52 PST
(In reply to Darin Adler from comment #30) > (In reply to Michael Catanzaro from comment #29) > > Problem is: how to detect that the string is already normalized? > > unorm2_quickCheck OK, cool: piled that onto bug #174816.
Note You need to log in before you can comment on or make changes to this bug.