WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
192945
Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in userVisibleString
https://bugs.webkit.org/show_bug.cgi?id=192945
Summary
Use unorm2_normalize instead of precomposedStringWithCanonicalMapping in user...
Michael Catanzaro
Reported
2018-12-20 12:07:51 PST
Replace use of the nice NSString function precomposedStringWithCanonicalMapping with the ICU API unorm2_normalize. This is to prep the code for translation to cross-platform C++. Please congratulate the ICU developers for creating an API that is impossible to safely use except when calling it twice in succession. Note this code was all written by Ms2ger; I'm just splitting it into its own patch because I fear we've lost him to end of year holidays.
Attachments
Patch
(2.64 KB, patch)
2018-12-20 13:02 PST
,
Michael Catanzaro
no flags
Details
Formatted Diff
Diff
Patch
(2.65 KB, patch)
2018-12-20 13:18 PST
,
Michael Catanzaro
no flags
Details
Formatted Diff
Diff
Patch
(2.86 KB, patch)
2018-12-20 13:30 PST
,
Michael Catanzaro
no flags
Details
Formatted Diff
Diff
Patch
(2.87 KB, patch)
2018-12-20 13:46 PST
,
Michael Catanzaro
no flags
Details
Formatted Diff
Diff
Patch
(2.87 KB, patch)
2018-12-20 13:55 PST
,
Michael Catanzaro
no flags
Details
Formatted Diff
Diff
Patch
(2.91 KB, patch)
2018-12-22 16:48 PST
,
Michael Catanzaro
no flags
Details
Formatted Diff
Diff
Patch
(3.39 KB, patch)
2019-01-14 17:41 PST
,
Michael Catanzaro
achristensen
: review+
Details
Formatted Diff
Diff
Show Obsolete
(6)
View All
Add attachment
proposed patch, testcase, etc.
Michael Catanzaro
Comment 1
2018-12-20 12:58:35 PST
(In reply to Michael Catanzaro from
comment #0
)
> Note this code was all written by Ms2ger; I'm just splitting it into its own > patch because I fear we've lost him to end of year holidays.
OK, I've mucked with it quite a bit to attempt to integrate it into NSURLExtras.mm. Let's say credit to Ms2ger if it works and my fault if EWS comes back red red red.....
Michael Catanzaro
Comment 2
2018-12-20 13:02:14 PST
Created
attachment 357853
[details]
Patch
Michael Catanzaro
Comment 3
2018-12-20 13:18:39 PST
Created
attachment 357856
[details]
Patch
Michael Catanzaro
Comment 4
2018-12-20 13:30:36 PST
Created
attachment 357858
[details]
Patch
Michael Catanzaro
Comment 5
2018-12-20 13:46:10 PST
Created
attachment 357862
[details]
Patch
Michael Catanzaro
Comment 6
2018-12-20 13:55:40 PST
Created
attachment 357865
[details]
Patch
Michael Catanzaro
Comment 7
2018-12-20 14:34:00 PST
Comment on
attachment 357865
[details]
Patch Unsetting r? until the Mac EWS are working again: Exception: Error auto-installing the keyring package to: "/Volumes/Data/EWS/WebKit/Tools/Scripts/webkitpy/thirdparty/autoinstalled/keyring" --> Inner message: Could not download Python modules from URL "None". Make sure you are connected to the internet. You must be connected to the internet when downloading needed modules for the first time. --> Inner message: <urlopen error [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)>
Michael Catanzaro
Comment 8
2018-12-20 17:00:02 PST
Comment on
attachment 357865
[details]
Patch OK, all the bots are happy except for the network error.... This is my first Objective C[++] ever and I am very proud of it! Even if it is mostly copied from Ms2ger. Note that after this lands, I will immediately reland the patch in
bug #174816
, deleting all this code. This commit is just to make sure that if the switch from the NSString normalization to the ICU normalization introduces an unexpected regression, it will bisect down to this small revision instead of the huge patch in that bug.
Alex Christensen
Comment 9
2018-12-20 17:05:09 PST
Comment on
attachment 357865
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=357865&action=review
> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 > + ASSERT(sourceBuffer.last() == '\0');
This assertion seems redundant.
> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 > + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror);
Can unorm2_getNFCInstance never fail?
> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 > + if (uerror == U_BUFFER_OVERFLOW_ERROR) {
Is this case covered by tests? Are there tests with emoji?
Michael Catanzaro
Comment 10
2018-12-20 19:47:11 PST
Comment on
attachment 357865
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=357865&action=review
>> Source/WTF/wtf/cocoa/NSURLExtras.mm:1170 >> + ASSERT(sourceBuffer.last() == '\0'); > > This assertion seems redundant.
It is, but it makes the next line sourceBuffer.removeLast() easier to understand, so seems worth keeping.
>> Source/WTF/wtf/cocoa/NSURLExtras.mm:1175 >> + const UNormalizer2 *normalizer = unorm2_getNFCInstance(&uerror); > > Can unorm2_getNFCInstance never fail?
Good catch. It can fail if there is a memory allocation error. So not in practice, but we should check it regardless, especially since we create an error, pass it in, and don't check it... that's no good. And maybe in the future the function could change to fail in different ways.
>> Source/WTF/wtf/cocoa/NSURLExtras.mm:1177 >> + if (uerror == U_BUFFER_OVERFLOW_ERROR) { > > Is this case covered by tests? Are there tests with emoji?
There is a test with a snowman emoji. None of the tests intentionally triggers this "error condition". (Of course it's not a real error, just bad API design to require the caller to allocate a bigger buffer instead of ICU allocating the right size and returning it.) I can check the patch in
bug #174816
tomorrow to see if any of those tests hit this codepath in the big patch.
Michael Catanzaro
Comment 11
2018-12-22 16:38:59 PST
(In reply to Michael Catanzaro from
comment #10
)
> None of the tests intentionally triggers this "error condition". (Of course > it's not a real error, just bad API design to require the caller to allocate > a bigger buffer instead of ICU allocating the right size and returning it.) > I can check the patch in
bug #174816
tomorrow to see if any of those tests > hit this codepath in the big patch.
None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further. I'll update the patch to add the missing error check, though.
Michael Catanzaro
Comment 12
2018-12-22 16:48:15 PST
Created
attachment 358027
[details]
Patch
Alexey Proskuryakov
Comment 13
2018-12-23 10:46:29 PST
> None of the tests hit this path. In fact, input size is always equal to output size. I doubt that's guaranteed to be the case, but I don't know enough about Unicode to pursue it further.
Is this actually about Unicode, or just strings longer than 2048 characters? Unicode can certainly be tricky because conversions can make strings longer, but I don't see how that comes into play in this code.
Michael Catanzaro
Comment 14
2018-12-24 10:34:02 PST
Comment on
attachment 358027
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=358027&action=review
> Source/WTF/wtf/cocoa/NSURLExtras.mm:1180 > + normalizedLength = unorm2_normalize(normalizer, sourceBuffer.data(), sourceBuffer.size(), normalizedCharacters.data(), normalizedCharacters.size(), &uerror); > + if (uerror == U_BUFFER_OVERFLOW_ERROR) {
This "error" occurs if the output size is greater than input size (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit inline capacity of sourceBuffer. In all of the URLHelpers.cpp tests from
bug #174816
, the output size is always equal to the input size, so it is not currently tested (probably; there are a couple more tests in URLExtras.mm that I have not run).
Michael Catanzaro
Comment 15
2018-12-24 10:35:33 PST
(In reply to Michael Catanzaro from
comment #14
)
> This "error" occurs if the output size is greater than input size > (sourceBuffer.size()). I don't think it has anything to do with the 2048-bit > inline capacity of sourceBuffer.
More specifically: the error occurs if the output size is greater than the size of the buffer provided for the output, normalizedCharacters.size(). And normalizedCharacters.size() is initially the same as sourceBuffer.size().
Myles C. Maxfield
Comment 16
2019-01-02 14:52:56 PST
Alex Christensen should review this.
Alex Christensen
Comment 17
2019-01-03 08:49:44 PST
Comment on
attachment 358027
[details]
Patch I think this code should be put in a subroutine with a name that takes a const String& parameter and returns a String.
Michael Catanzaro
Comment 18
2019-01-14 17:39:13 PST
Adding a new subroutine might be harder for me than you'd think, but here goes. A sacrifice to the EWS gods: 🔥🐑🔥
Michael Catanzaro
Comment 19
2019-01-14 17:41:25 PST
Created
attachment 359107
[details]
Patch
Michael Catanzaro
Comment 20
2019-01-14 19:24:40 PST
Thanks! (In reply to Michael Catanzaro from
comment #18
)
> A sacrifice to the EWS gods: 🔥🐑🔥
This must have worked. Can't believe it.
Michael Catanzaro
Comment 21
2019-01-14 19:26:08 PST
Committed
r239970
: <
https://trac.webkit.org/changeset/239970
>
Radar WebKit Bug Importer
Comment 22
2019-01-14 19:29:29 PST
<
rdar://problem/47274320
>
Darin Adler
Comment 23
2019-01-15 08:56:13 PST
Why does this use charactersWithNullTermination? That’s the worst part of this new code, I think. Do we need a fast path for any simple cases, like all ASCII?
Darin Adler
Comment 24
2019-01-15 08:57:22 PST
I think this can and should easily be improved so that it works without allocating memory just to copy the characters into it.
Darin Adler
Comment 25
2019-01-15 08:57:47 PST
I’m referring to charactersWithNullTermination, not the Vector on the stack.
Michael Catanzaro
Comment 26
2019-01-15 09:35:35 PST
(In reply to Darin Adler from
comment #23
)
> Why does this use charactersWithNullTermination? That’s the worst part of > this new code, I think.
We need a UChar* buffer. It could also be done using String::characters16 and copying the resultant buffer, but that seems worse.
> Do we need a fast path for any simple cases, like all ASCII?
That's a question for the performance tests, but I'd guess probably not, because displaying URLs should never be performance-sensitive. (And, without looking into implementation, it's possible that ICU has a fast path already internally, and also possible that the NSString function we've replaced did not. We'd have to check the implementations to decide whether to expect the new one to be slower.) Regardless, I'd like to handle suggested improvements in
bug #174816
, please, so we can do them in C++ instead of prayers. (I've never written Objective C++ before, and have just been throwing patches at EWS hoping they'd build and maybe also work.)
Darin Adler
Comment 27
2019-01-15 12:54:28 PST
(In reply to Michael Catanzaro from
comment #26
)
> (In reply to Darin Adler from
comment #23
) > > Why does this use charactersWithNullTermination? That’s the worst part of > > this new code, I think. > > We need a UChar* buffer. It could also be done using String::characters16 > and copying the resultant buffer, but that seems worse.
The idiom to efficiently get a UChar* buffer would be: auto characters = StringView { string }.upconvertedCharacters(); Then "characters" can be passed anywhere that needs a "const UChar*". We should switch to that I think. I don’t know why you say "copying the resultant buffer". I don’t see any reason we’d need to copy it. Feel free to handle this in
bug 174816
. Nothing super urgent, just unnecessarily underperforming idiom. Nothing should use charactersWithNullTermination unless it needs null termination.
Darin Adler
Comment 28
2019-01-15 12:55:53 PST
(In reply to Michael Catanzaro from
comment #26
)
> > Do we need a fast path for any simple cases, like all ASCII? > > That's a question for the performance tests, but I'd guess probably not, > because displaying URLs should never be performance-sensitive. (And, without > looking into implementation, it's possible that ICU has a fast path already > internally, and also possible that the NSString function we've replaced did > not. We'd have to check the implementations to decide whether to expect the > new one to be slower.)
I think we should consider a special case for when the string is already normalized. That will not only help speed but might even help memory use because we’d share the string instead of allocating a new one with all the same characters. But maybe I’m missing the big picture. This might be such a specialized corner of the code that these kinds of optimization aren’t important.
Michael Catanzaro
Comment 29
2019-01-15 13:23:13 PST
(In reply to Darin Adler from
comment #27
)
> The idiom to efficiently get a UChar* buffer would be: > > auto characters = StringView { string }.upconvertedCharacters(); > > Then "characters" can be passed anywhere that needs a "const UChar*". We > should switch to that I think. > > I don’t know why you say "copying the resultant buffer". I don’t see any > reason we’d need to copy it. > > Feel free to handle this in
bug 174816
. Nothing super urgent, just > unnecessarily underperforming idiom. Nothing should use > charactersWithNullTermination unless it needs null termination.
OK, I'll copy this into
bug #174816
. (In reply to Darin Adler from
comment #28
)
> I think we should consider a special case for when the string is already > normalized. That will not only help speed but might even help memory use > because we’d share the string instead of allocating a new one with all the > same characters.
Problem is: how to detect that the string is already normalized? We could pass it through this API and verify that the result is unchanged. :D
> But maybe I’m missing the big picture. This might be such a specialized > corner of the code that these kinds of optimization aren’t important.
Hopefully, since this code only runs when formatting a URL for display (e.g. in the address bar).
Darin Adler
Comment 30
2019-01-15 17:14:42 PST
(In reply to Michael Catanzaro from
comment #29
)
> Problem is: how to detect that the string is already normalized?
unorm2_quickCheck
> We could > pass it through this API and verify that the result is unchanged. :D
Or that.
Michael Catanzaro
Comment 31
2019-01-15 19:27:52 PST
(In reply to Darin Adler from
comment #30
)
> (In reply to Michael Catanzaro from
comment #29
) > > Problem is: how to detect that the string is already normalized? > > unorm2_quickCheck
OK, cool: piled that onto
bug #174816
.
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug