Replace use of the nice NSString function precomposedStringWithCanonicalMapping with the ICU API unorm2_normalize. This is to prep the code for translation to cross-platform C++. Please congratulate the ICU developers for creating an API that is impossible to safely use except when calling it twice in succession. Note this code was all written by Ms2ger; I'm just splitting it into its own patch because I fear we've lost him to end of year holidays.
(In reply to Michael Catanzaro from comment #0) > Note this code was all written by Ms2ger; I'm just splitting it into its own > patch because I fear we've lost him to end of year holidays. OK, I've mucked with it quite a bit to attempt to integrate it into NSURLExtras.mm. Let's say credit to Ms2ger if it works and my fault if EWS comes back red red red.....
Created attachment 357853 [details] Patch
Created attachment 357856 [details] Patch
Created attachment 357858 [details] Patch
Created attachment 357862 [details] Patch
Created attachment 357865 [details] Patch
Comment on attachment 357865 [details] Patch Unsetting r? until the Mac EWS are working again: Exception: Error auto-installing the keyring package to: "/Volumes/Data/EWS/WebKit/Tools/Scripts/webkitpy/thirdparty/autoinstalled/keyring" --> Inner message: Could not download Python modules from URL "None". Make sure you are connected to the internet. You must be connected to the internet when downloading needed modules for the first time. --> Inner message: <urlopen error [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)>
Comment on attachment 357865 [details] Patch OK, all the bots are happy except for the network error.... This is my first Objective C[++] ever and I am very proud of it! Even if it is mostly copied from Ms2ger. Note that after this lands, I will immediately reland the patch in bug #174816, deleting all this code. This commit is just to make sure that if the switch from the NSString normalization to the ICU normalization introduces an unexpected regression, it will bisect down to this small revision instead of the huge patch in that bug.
Comment on attachment 357865 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review > Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 > + ASSERT(sourceBuffer.last() == '\0'); This assertion seems redundant. > Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 > + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); Can unorm2_getNFCInstance never fail? > Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 > + if (uerror == U_BUFFER_OVERFLOW_ERROR) { Is this case covered by tests? Are there tests with emoji?
Comment on attachment 357865 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=357865&action=review >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 >> + ASSERT(sourceBuffer.last() == '\0'); > > This assertion seems redundant. It is, but it makes the next line sourceBuffer.removeLast() easier to understand, so seems worth keeping. >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 >> + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); > > Can unorm2_getNFCInstance never fail? Good catch. It can fail if there is a memory allocation error. So not in practice, but we should check it regardless, especially since we create an error, pass it in, and don't check it... that's no good. And maybe in the future the function could change to fail in different ways. >> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 >> + if (uerror == U_BUFFER_OVERFLOW_ERROR) { > > Is this case covered by tests? Are there tests with emoji? There is a test with a snowman emoji. None of the tests intentionally triggers this "error condition". (Of course it's not a real error, just bad API design to require the caller to allocate a bigger buffer instead of ICU allocating the right size and returning it.) I can check the patch in bug #174816 tomorrow to see if any of those tests hit this codepath in the big patch.
(In reply to Michael Catanzaro from comment #10) > None of the tests intentionally triggers this "error condition". (Of course > it's not a real error, just bad API design to require the caller to allocate > a bigger buffer instead of ICU allocating the right size and returning it.) > I can check the patch in bug #174816 tomorrow to see if any of those tests > hit this codepath in the big patch. None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. I'll update the patch to add the missing error check, though.
Created attachment 358027 [details] Patch
> None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. Is this actually about Unicode, or just strings longer than 2048 characters? Unicode can certainly be tricky because conversions can make strings longer, but I don't see how that comes into play in this code.
Comment on attachment 358027 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=358027&action=review > Source/WTF/wtf/cocoa/NSURLExtras.mm:1180 > + normalizedLength = unorm2_normalize(normalizer, sourceBuffer.data(), sourceBuffer.size(), normalizedCharacters.data(), normalizedCharacters.size(), &uerror); > + if (uerror == U_BUFFER_OVERFLOW_ERROR) { This "error" occurs if the output size is greater than input size (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit inline capacity of sourceBuffer. In all of the URLHelpers.cpp tests from bug #174816, the output size is always equal to the input size, so it is not currently tested (probably; there are a couple more tests in URLExtras.mm that I have not run).
(In reply to Michael Catanzaro from comment #14) > This "error" occurs if the output size is greater than input size > (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit > inline capacity of sourceBuffer. More specifically: the error occurs if the output size is greater than the size of the buffer provided for the output, normalizedCharacters.size(). And normalizedCharacters.size() is initially the same as sourceBuffer.size().
Alex Christensen should review this.
Comment on attachment 358027 [details] Patch I think this code should be put in a subroutine with a name that takes a const String& parameter and returns a String.
Adding a new subroutine might be harder for me than you'd think, but here goes. A sacrifice to the EWS gods: 🔥🐑🔥
Created attachment 359107 [details] Patch
Thanks! (In reply to Michael Catanzaro from comment #18) > A sacrifice to the EWS gods: 🔥🐑🔥 This must have worked. Can't believe it.
Committed r239970: <https://trac.webkit.org/changeset/239970>
<rdar://problem/47274320>
Why does this use charactersWithNullTermination? That’s the worst part of this new code, I think. Do we need a fast path for any simple cases, like all ASCII?
I think this can and should easily be improved so that it works without allocating memory just to copy the characters into it.
I’m referring to charactersWithNullTermination, not the Vector on the stack.
(In reply to Darin Adler from comment #23) > Why does this use charactersWithNullTermination? That’s the worst part of > this new code, I think. We need a UChar* buffer. It could also be done using String::characters16 and copying the resultant buffer, but that seems worse. > Do we need a fast path for any simple cases, like all ASCII? That's a question for the performance tests, but I'd guess probably not, because displaying URLs should never be performance-sensitive. (And, without looking into implementation, it's possible that ICU has a fast path already internally, and also possible that the NSString function we've replaced did not. We'd have to check the implementations to decide whether to expect the new one to be slower.) Regardless, I'd like to handle suggested improvements in bug #174816, please, so we can do them in C++ instead of prayers. (I've never written Objective C++ before, and have just been throwing patches at EWS hoping they'd build and maybe also work.)
(In reply to Michael Catanzaro from comment #26) > (In reply to Darin Adler from comment #23) > > Why does this use charactersWithNullTermination? That’s the worst part of > > this new code, I think. > > We need a UChar* buffer. It could also be done using String::characters16 > and copying the resultant buffer, but that seems worse. The idiom to efficiently get a UChar* buffer would be: auto characters = StringView { string }.upconvertedCharacters(); Then "characters" can be passed anywhere that needs a "const UChar*". We should switch to that I think. I don’t know why you say "copying the resultant buffer". I don’t see any reason we’d need to copy it. Feel free to handle this in bug 174816. Nothing super urgent, just unnecessarily underperforming idiom. Nothing should use charactersWithNullTermination unless it needs null termination.
(In reply to Michael Catanzaro from comment #26) > > Do we need a fast path for any simple cases, like all ASCII? > > That's a question for the performance tests, but I'd guess probably not, > because displaying URLs should never be performance-sensitive. (And, without > looking into implementation, it's possible that ICU has a fast path already > internally, and also possible that the NSString function we've replaced did > not. We'd have to check the implementations to decide whether to expect the > new one to be slower.) I think we should consider a special case for when the string is already normalized. That will not only help speed but might even help memory use because we’d share the string instead of allocating a new one with all the same characters. But maybe I’m missing the big picture. This might be such a specialized corner of the code that these kinds of optimization aren’t important.
(In reply to Darin Adler from comment #27) > The idiom to efficiently get a UChar* buffer would be: > > auto characters = StringView { string }.upconvertedCharacters(); > > Then "characters" can be passed anywhere that needs a "const UChar*". We > should switch to that I think. > > I don’t know why you say "copying the resultant buffer". I don’t see any > reason we’d need to copy it. > > Feel free to handle this in bug 174816. Nothing super urgent, just > unnecessarily underperforming idiom. Nothing should use > charactersWithNullTermination unless it needs null termination. OK, I'll copy this into bug #174816. (In reply to Darin Adler from comment #28) > I think we should consider a special case for when the string is already > normalized. That will not only help speed but might even help memory use > because we’d share the string instead of allocating a new one with all the > same characters. Problem is: how to detect that the string is already normalized? We could pass it through this API and verify that the result is unchanged. :D > But maybe I’m missing the big picture. This might be such a specialized > corner of the code that these kinds of optimization aren’t important. Hopefully, since this code only runs when formatting a URL for display (e.g. in the address bar).
(In reply to Michael Catanzaro from comment #29) > Problem is: how to detect that the string is already normalized? unorm2_quickCheck > We could > pass it through this API and verify that the result is unchanged. :D Or that.
(In reply to Darin Adler from comment #30) > (In reply to Michael Catanzaro from comment #29) > > Problem is: how to detect that the string is already normalized? > > unorm2_quickCheck OK, cool: piled that onto bug #174816.