WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
Bug 195330
Improve normalization code, including moving from unorm.h to unorm2.h
https://bugs.webkit.org/show_bug.cgi?id=195330
Summary
Improve normalization code, including moving from unorm.h to unorm2.h
Darin Adler
Reported
2019-03-05 09:20:48 PST
Improve the toNormalizationFormC function
Attachments
Patch
(4.28 KB, patch)
2019-03-05 09:27 PST
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Patch
(29.43 KB, patch)
2019-03-09 09:33 PST
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Archive of layout-test-results from ews115 for mac-highsierra
(2.31 MB, application/zip)
2019-03-09 11:34 PST
,
EWS Watchlist
no flags
Details
Patch
(30.06 KB, patch)
2019-03-10 12:23 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Patch
(35.58 KB, patch)
2019-03-16 11:53 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Patch
(35.58 KB, patch)
2019-03-16 15:00 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Patch
(35.60 KB, patch)
2019-03-16 17:00 PDT
,
Darin Adler
no flags
Details
Formatted Diff
Diff
Show Obsolete
(6)
View All
Add attachment
proposed patch, testcase, etc.
Darin Adler
Comment 1
2019-03-05 09:27:31 PST
Comment hidden (obsolete)
Created
attachment 363643
[details]
Patch
Darin Adler
Comment 2
2019-03-05 09:28:40 PST
The only part of this I am not sure about is the three different passes through the string it takes. Otherwise, I think this is better and maybe should be moved somewhere where it can be reused.
Darin Adler
Comment 3
2019-03-05 10:32:28 PST
Comment on
attachment 363643
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=363643&action=review
> Source/WTF/wtf/URLHelpers.cpp:779 > + if (string.is8Bit() || string.isEmpty())
This isEmpty check isn’t needed.
Michael Catanzaro
Comment 4
2019-03-05 11:02:24 PST
Comment on
attachment 363643
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=363643&action=review
>> Source/WTF/wtf/URLHelpers.cpp:779 >> + if (string.is8Bit() || string.isEmpty()) > > This isEmpty check isn’t needed.
Why not? Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried.
> Source/WTF/wtf/URLHelpers.cpp:786 > + ASSERT(U_SUCCESS(error)); > + if (U_FAILURE(error)) > + return string;
I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled.... Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust.
> Source/WTF/wtf/URLHelpers.cpp:795 > + auto normalizedLength = unorm2_normalize(normalizer, string.characters16(), string.length(), nullptr, 0, &error); > + ASSERT(error == U_BUFFER_OVERFLOW_ERROR);
Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.
Darin Adler
Comment 5
2019-03-05 19:17:04 PST
Comment on
attachment 363643
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=363643&action=review
>>> Source/WTF/wtf/URLHelpers.cpp:779 >>> + if (string.is8Bit() || string.isEmpty()) >> >> This isEmpty check isn’t needed. > > Why not? > > Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried.
There are multiple reasons, but the main reason is that the empty string is guaranteed to be 8-bit.
>> Source/WTF/wtf/URLHelpers.cpp:786 >> + return string; > > I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled.... > > Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust.
Three options: ASSERT with early return (so we silently fail if the unthinkable happens) RELEASE_ASSERT (so we abort if the unthinkable happens) ASSERT (so something unpredictable happens if the unthinkable happens -- in this case probably a crash since it returns null, but different at other call sites) Not sure which I prefer. Maybe RELEASE_ASSERT.
>> Source/WTF/wtf/URLHelpers.cpp:795 >> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); > > Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. > > As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.
I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.
Michael Catanzaro
Comment 6
2019-03-06 07:30:19 PST
Comment on
attachment 363643
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=363643&action=review
>>> Source/WTF/wtf/URLHelpers.cpp:795 >>> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); >> >> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. >> >> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. > > I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.
Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again. I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me.
Darin Adler
Comment 7
2019-03-08 09:31:23 PST
Comment on
attachment 363643
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=363643&action=review
>>>> Source/WTF/wtf/URLHelpers.cpp:795 >>>> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); >>> >>> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. >>> >>> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. >> >> I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR. > > Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again. > > I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me.
For this minor point (whether to compute the size or optimistically allocate a buffer and try at the starting size), there are at least 4 considerations, all relatively small and minor themselves: 1) The cost of normalizing multiple times. As you pointed out, the patch we are reviewing here normalizes twice (and arguably three times since the isNormalized check might be roughly as expensive as normalizing in edge cases). 2) The cost of allocating memory once or twice. The patch we are reviewing here allocates memory exactly once, for the StringImpl. Other approaches involve allocating memory multiple times for intermediate buffers before allocating the final string. For example pre-patch there is the Vector for the original characters (allocation #1), the Vector for the normalized characters (sometimes allocation #2 if the string is long enough to not fit in the stack buffer), and the StringImpl (allocation #2 or #3). 3) The cost of copying characters. The patch we are reviewing does not copy characters outside of copying done by the ICU normalization function. Normalizing into a Vector and then converting that into a String with String::adopt involves copying all the characters in the case where the string is small enough to fit into the buffer on the stack. 4) The cost of keeping around a string in one memory block of two. The patch we are reviewing always creates a string that uses the buffer inside the StringImpl, so has only takes one memory block overhead, rather than two; can matter if we keep it around for a long time. Using a heap-allocated Vector (or even a StringBuffer) to create a string creates a string that uses more storage, for its entire lifetime, because it uses two memory blocks, one for the StringImpl and one for the characters. It’s also worth noting that normalizing to form "C" (the "compact" one) almost always results in a string that is *smaller* than the original string. Almost never the same size unless the string is already normalized (which we handle separately, first). I believe that it *can* result in a string that is larger, but I am not sure in what cases that can occur since normalization to C almost always means composing pairs that are in their decomposed form. I’m gong to post a new patch that moves this function and another related one to the StringView header soon. The other call site I found doing normalization was already using the "call once to just get the size" approach. Not sure whether I will stick with that, or move to the "Vector with stack capacity" approach.
Darin Adler
Comment 8
2019-03-09 09:33:41 PST
Comment hidden (obsolete)
Created
attachment 364132
[details]
Patch
EWS Watchlist
Comment 9
2019-03-09 09:35:33 PST
Comment hidden (obsolete)
Attachment 364132
[details]
did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 10 files If any of these errors are false positives, please file a bug against check-webkit-style.
Darin Adler
Comment 10
2019-03-09 10:07:54 PST
Comment on
attachment 364132
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=364132&action=review
> Source/WTF/wtf/URLHelpers.cpp:779 > + for (size_t j = 0; j < characterLength; ++j)
I suppose this should be unsigned rather than size_t.
EWS Watchlist
Comment 11
2019-03-09 11:34:02 PST
Comment hidden (obsolete)
Comment on
attachment 364132
[details]
Patch
Attachment 364132
[details]
did not pass mac-debug-ews (mac): Output:
https://webkit-queues.webkit.org/results/11439957
New failing tests: fast/text/find-kana.html
EWS Watchlist
Comment 12
2019-03-09 11:34:04 PST
Comment hidden (obsolete)
Created
attachment 364135
[details]
Archive of layout-test-results from ews115 for mac-highsierra The attached test failures were seen while running run-webkit-tests on the mac-debug-ews. Bot: ews115 Port: mac-highsierra Platform: Mac OS X 10.13.6
EWS Watchlist
Comment 13
2019-03-09 12:11:22 PST
Comment hidden (obsolete)
Comment on
attachment 364132
[details]
Patch
Attachment 364132
[details]
did not pass jsc-ews (mac): Output:
https://webkit-queues.webkit.org/results/11439980
New failing tests: ChakraCore.yaml/ChakraCore/test/es6/normalize.js.default stress/string-normalize.js.no-ftl stress/string-normalize.js.no-llint stress/string-normalize.js.ftl-eager stress/string-normalize.js.ftl-no-cjit-b3o0 stress/string-normalize.js.ftl-eager-no-cjit-b3o1 stress/string-normalize.js.ftl-no-cjit-no-inline-validate stress/string-normalize.js.dfg-maximal-flush-validate-no-cjit stress/string-normalize.js.ftl-eager-no-cjit stress/string-normalize.js.bytecode-cache stress/string-normalize.js.no-cjit-validate-phases stress/string-normalize.js.default stress/string-normalize.js.no-cjit-collect-continuously stress/string-normalize.js.ftl-no-cjit-small-pool stress/string-normalize.js.ftl-no-cjit-no-put-stack-validate stress/string-normalize.js.dfg-eager stress/string-normalize.js.ftl-no-cjit-validate-sampling-profiler stress/string-normalize.js.dfg-eager-no-cjit-validate
Darin Adler
Comment 14
2019-03-09 13:54:58 PST
Comment hidden (obsolete)
Oops.
Darin Adler
Comment 15
2019-03-10 12:01:17 PDT
Turns out the 8-bit special case is correct for NFC, but not NFD, for example.
Darin Adler
Comment 16
2019-03-10 12:23:08 PDT
Comment hidden (obsolete)
Created
attachment 364195
[details]
Patch
EWS Watchlist
Comment 17
2019-03-10 12:26:33 PDT
Comment hidden (obsolete)
Attachment 364195
[details]
did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 10 files If any of these errors are false positives, please file a bug against check-webkit-style.
Darin Adler
Comment 18
2019-03-10 13:35:43 PDT
Hooray, this one is passing the tests. Looks like I fixed the two problems in the earlier version so ready for review now.
Michael Catanzaro
Comment 19
2019-03-11 10:15:57 PDT
Comment on
attachment 364195
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=364195&action=review
There's also FontCairoHarfbuzzNG.cpp which could be converted to use the new normalizedNFC() function. Should be straightforward, if you're brave and want to touch a file you can't build.
> Source/WTF/wtf/text/StringView.h:216 > +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView);
Wouldn't it be simpler to just return a String and let the API user create the StringView?
> Source/WebCore/editing/TextIterator.cpp:2017 > static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer)
Shame we can't use your new normalizedNFC() function here. Oh well. :(
Darin Adler
Comment 20
2019-03-11 20:51:19 PDT
Comment on
attachment 364195
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=364195&action=review
I’ll take a look at the call site in FontCairoHarfbuzzNG.cpp too.
>> Source/WTF/wtf/text/StringView.h:216 >> +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView); > > Wouldn't it be simpler to just return a String and let the API user create the StringView?
When the existing StringView is already normalized, this returns a null String and the same StringView. When it’s not it returns a non-null String and a StringView pointing to that. I am not sure what the best way of documenting this is. Perhaps I should just add a comment saying that?
>> Source/WebCore/editing/TextIterator.cpp:2017 >> static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) > > Shame we can't use your new normalizedNFC() function here. Oh well. :(
Maybe we can eventually find a way to share the code. So far I didn’t find the perfect way.
Michael Catanzaro
Comment 21
2019-03-12 09:12:11 PDT
(In reply to Darin Adler from
comment #20
)
> > Wouldn't it be simpler to just return a String and let the API user create the StringView? > > When the existing StringView is already normalized, this returns a null > String and the same StringView. When it’s not it returns a non-null String > and a StringView pointing to that. I am not sure what the best way of > documenting this is. Perhaps I should just add a comment saying that?
Hm, this does seem ideal, then. I think a comment would suffice. The name could also indicate nullability: StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the function could use two out parameters instead of a return value, to avoid the need to create a new type, but that doesn't seem necessarily any better.
Darin Adler
Comment 22
2019-03-16 11:53:08 PDT
Comment hidden (obsolete)
Created
attachment 364939
[details]
Patch
EWS Watchlist
Comment 23
2019-03-16 11:54:56 PDT
Comment hidden (obsolete)
Attachment 364939
[details]
did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Darin Adler
Comment 24
2019-03-16 11:55:28 PDT
(In reply to Michael Catanzaro from
comment #21
)
> The name could also indicate nullability: > StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the > function could use two out parameters instead of a return value, to avoid > the need to create a new type, but that doesn't seem necessarily any better.
I think we might eventually want to update the struct name, but I am not sure it makes it better. The existing call site that uses StringViewWithUnderlyingString uses it in another way: In that case, the StringView might be a substring of the larger underlying string.
Darin Adler
Comment 25
2019-03-16 15:00:34 PDT
Comment hidden (obsolete)
Created
attachment 364945
[details]
Patch
EWS Watchlist
Comment 26
2019-03-16 15:03:27 PDT
Comment hidden (obsolete)
Attachment 364945
[details]
did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
Darin Adler
Comment 27
2019-03-16 17:00:28 PDT
Created
attachment 364949
[details]
Patch
EWS Watchlist
Comment 28
2019-03-16 17:01:42 PDT
Attachment 364949
[details]
did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Commit Bot
Comment 29
2019-03-16 19:20:57 PDT
Comment on
attachment 364949
[details]
Patch Clearing flags on attachment: 364949 Committed
r243049
: <
https://trac.webkit.org/changeset/243049
>
WebKit Commit Bot
Comment 30
2019-03-16 19:20:59 PDT
All reviewed patches have been landed. Closing bug.
Radar WebKit Bug Importer
Comment 31
2019-03-16 19:22:22 PDT
<
rdar://problem/48958229
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug