195330 – Improve normalization code, including moving from unorm.h to unorm2.h

RESOLVED FIXED 195330

Improve normalization code, including moving from unorm.h to unorm2.h

https://bugs.webkit.org/show_bug.cgi?id=195330

Summary Improve normalization code, including moving from unorm.h to unorm2.h

Darin Adler

Reported 2019-03-05 09:20:48 PST

Improve the toNormalizationFormC function

Attachments
Patch (4.28 KB, patch) 2019-03-05 09:27 PST, Darin Adler	no flags	Details Formatted Diff Diff
Patch (29.43 KB, patch) 2019-03-09 09:33 PST, Darin Adler	no flags	Details Formatted Diff Diff
Archive of layout-test-results from ews115 for mac-highsierra (2.31 MB, application/zip) 2019-03-09 11:34 PST, EWS Watchlist	no flags	Details
Patch (30.06 KB, patch) 2019-03-10 12:23 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Patch (35.58 KB, patch) 2019-03-16 11:53 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Patch (35.58 KB, patch) 2019-03-16 15:00 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Patch (35.60 KB, patch) 2019-03-16 17:00 PDT, Darin Adler	no flags	Details Formatted Diff Diff
Show Obsolete (6) View All Add attachment proposed patch, testcase, etc.

Darin Adler

Comment 1 2019-03-05 09:27:31 PST Comment hidden (obsolete)

Created attachment 363643 [details] Patch

Darin Adler

Comment 2 2019-03-05 09:28:40 PST

The only part of this I am not sure about is the three different passes through the string it takes. Otherwise, I think this is better and maybe should be moved somewhere where it can be reused.

Darin Adler

Comment 3 2019-03-05 10:32:28 PST

Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review > Source/WTF/wtf/URLHelpers.cpp:779 > + if (string.is8Bit() || string.isEmpty()) This isEmpty check isn’t needed.

Michael Catanzaro

Comment 4 2019-03-05 11:02:24 PST

Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >> Source/WTF/wtf/URLHelpers.cpp:779 >> + if (string.is8Bit() || string.isEmpty()) > > This isEmpty check isn’t needed. Why not? Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried. > Source/WTF/wtf/URLHelpers.cpp:786 > + ASSERT(U_SUCCESS(error)); > + if (U_FAILURE(error)) > + return string; I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled.... Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust. > Source/WTF/wtf/URLHelpers.cpp:795 > + auto normalizedLength = unorm2_normalize(normalizer, string.characters16(), string.length(), nullptr, 0, &error); > + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.

Darin Adler

Comment 5 2019-03-05 19:17:04 PST

Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >>> Source/WTF/wtf/URLHelpers.cpp:779 >>> + if (string.is8Bit() || string.isEmpty()) >> >> This isEmpty check isn’t needed. > > Why not? > > Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried. There are multiple reasons, but the main reason is that the empty string is guaranteed to be 8-bit. >> Source/WTF/wtf/URLHelpers.cpp:786 >> + return string; > > I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled.... > > Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust. Three options: ASSERT with early return (so we silently fail if the unthinkable happens) RELEASE_ASSERT (so we abort if the unthinkable happens) ASSERT (so something unpredictable happens if the unthinkable happens -- in this case probably a crash since it returns null, but different at other call sites) Not sure which I prefer. Maybe RELEASE_ASSERT. >> Source/WTF/wtf/URLHelpers.cpp:795 >> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); > > Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. > > As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.

Michael Catanzaro

Comment 6 2019-03-06 07:30:19 PST

Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >>> Source/WTF/wtf/URLHelpers.cpp:795 >>> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); >> >> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. >> >> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. > > I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR. Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again. I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me.

Darin Adler

Comment 7 2019-03-08 09:31:23 PST

Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >>>> Source/WTF/wtf/URLHelpers.cpp:795 >>>> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); >>> >>> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. >>> >>> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. >> >> I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR. > > Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again. > > I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me. For this minor point (whether to compute the size or optimistically allocate a buffer and try at the starting size), there are at least 4 considerations, all relatively small and minor themselves: 1) The cost of normalizing multiple times. As you pointed out, the patch we are reviewing here normalizes twice (and arguably three times since the isNormalized check might be roughly as expensive as normalizing in edge cases). 2) The cost of allocating memory once or twice. The patch we are reviewing here allocates memory exactly once, for the StringImpl. Other approaches involve allocating memory multiple times for intermediate buffers before allocating the final string. For example pre-patch there is the Vector for the original characters (allocation #1), the Vector for the normalized characters (sometimes allocation #2 if the string is long enough to not fit in the stack buffer), and the StringImpl (allocation #2 or #3). 3) The cost of copying characters. The patch we are reviewing does not copy characters outside of copying done by the ICU normalization function. Normalizing into a Vector and then converting that into a String with String::adopt involves copying all the characters in the case where the string is small enough to fit into the buffer on the stack. 4) The cost of keeping around a string in one memory block of two. The patch we are reviewing always creates a string that uses the buffer inside the StringImpl, so has only takes one memory block overhead, rather than two; can matter if we keep it around for a long time. Using a heap-allocated Vector (or even a StringBuffer) to create a string creates a string that uses more storage, for its entire lifetime, because it uses two memory blocks, one for the StringImpl and one for the characters. It’s also worth noting that normalizing to form "C" (the "compact" one) almost always results in a string that is *smaller* than the original string. Almost never the same size unless the string is already normalized (which we handle separately, first). I believe that it *can* result in a string that is larger, but I am not sure in what cases that can occur since normalization to C almost always means composing pairs that are in their decomposed form. I’m gong to post a new patch that moves this function and another related one to the StringView header soon. The other call site I found doing normalization was already using the "call once to just get the size" approach. Not sure whether I will stick with that, or move to the "Vector with stack capacity" approach.

Darin Adler

Comment 8 2019-03-09 09:33:41 PST Comment hidden (obsolete)

Created attachment 364132 [details] Patch

EWS Watchlist

Comment 9 2019-03-09 09:35:33 PST Comment hidden (obsolete)

Attachment 364132 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 10 files If any of these errors are false positives, please file a bug against check-webkit-style.

Darin Adler

Comment 10 2019-03-09 10:07:54 PST

Comment on attachment 364132 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=364132&action=review > Source/WTF/wtf/URLHelpers.cpp:779 > + for (size_t j = 0; j < characterLength; ++j) I suppose this should be unsigned rather than size_t.

EWS Watchlist

Comment 11 2019-03-09 11:34:02 PST Comment hidden (obsolete)

Comment on attachment 364132 [details] Patch Attachment 364132 [details] did not pass mac-debug-ews (mac): Output: https://webkit-queues.webkit.org/results/11439957 New failing tests: fast/text/find-kana.html

EWS Watchlist

Comment 12 2019-03-09 11:34:04 PST Comment hidden (obsolete)

Created attachment 364135 [details] Archive of layout-test-results from ews115 for mac-highsierra The attached test failures were seen while running run-webkit-tests on the mac-debug-ews. Bot: ews115 Port: mac-highsierra Platform: Mac OS X 10.13.6

EWS Watchlist

Comment 13 2019-03-09 12:11:22 PST Comment hidden (obsolete)

Comment on attachment 364132 [details] Patch Attachment 364132 [details] did not pass jsc-ews (mac): Output: https://webkit-queues.webkit.org/results/11439980 New failing tests: ChakraCore.yaml/ChakraCore/test/es6/normalize.js.default stress/string-normalize.js.no-ftl stress/string-normalize.js.no-llint stress/string-normalize.js.ftl-eager stress/string-normalize.js.ftl-no-cjit-b3o0 stress/string-normalize.js.ftl-eager-no-cjit-b3o1 stress/string-normalize.js.ftl-no-cjit-no-inline-validate stress/string-normalize.js.dfg-maximal-flush-validate-no-cjit stress/string-normalize.js.ftl-eager-no-cjit stress/string-normalize.js.bytecode-cache stress/string-normalize.js.no-cjit-validate-phases stress/string-normalize.js.default stress/string-normalize.js.no-cjit-collect-continuously stress/string-normalize.js.ftl-no-cjit-small-pool stress/string-normalize.js.ftl-no-cjit-no-put-stack-validate stress/string-normalize.js.dfg-eager stress/string-normalize.js.ftl-no-cjit-validate-sampling-profiler stress/string-normalize.js.dfg-eager-no-cjit-validate

Darin Adler

Comment 14 2019-03-09 13:54:58 PST Comment hidden (obsolete)

Oops.

Darin Adler

Comment 15 2019-03-10 12:01:17 PDT

Turns out the 8-bit special case is correct for NFC, but not NFD, for example.

Darin Adler

Comment 16 2019-03-10 12:23:08 PDT Comment hidden (obsolete)

Created attachment 364195 [details] Patch

EWS Watchlist

Comment 17 2019-03-10 12:26:33 PDT Comment hidden (obsolete)

Attachment 364195 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 10 files If any of these errors are false positives, please file a bug against check-webkit-style.

Darin Adler

Comment 18 2019-03-10 13:35:43 PDT

Hooray, this one is passing the tests. Looks like I fixed the two problems in the earlier version so ready for review now.

Michael Catanzaro

Comment 19 2019-03-11 10:15:57 PDT

Comment on attachment 364195 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=364195&action=review There's also FontCairoHarfbuzzNG.cpp which could be converted to use the new normalizedNFC() function. Should be straightforward, if you're brave and want to touch a file you can't build. > Source/WTF/wtf/text/StringView.h:216 > +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView); Wouldn't it be simpler to just return a String and let the API user create the StringView? > Source/WebCore/editing/TextIterator.cpp:2017 > static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) Shame we can't use your new normalizedNFC() function here. Oh well. :(

Darin Adler

Comment 20 2019-03-11 20:51:19 PDT

Comment on attachment 364195 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=364195&action=review I’ll take a look at the call site in FontCairoHarfbuzzNG.cpp too. >> Source/WTF/wtf/text/StringView.h:216 >> +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView); > > Wouldn't it be simpler to just return a String and let the API user create the StringView? When the existing StringView is already normalized, this returns a null String and the same StringView. When it’s not it returns a non-null String and a StringView pointing to that. I am not sure what the best way of documenting this is. Perhaps I should just add a comment saying that? >> Source/WebCore/editing/TextIterator.cpp:2017 >> static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) > > Shame we can't use your new normalizedNFC() function here. Oh well. :( Maybe we can eventually find a way to share the code. So far I didn’t find the perfect way.

Michael Catanzaro

Comment 21 2019-03-12 09:12:11 PDT

(In reply to Darin Adler from comment #20) > > Wouldn't it be simpler to just return a String and let the API user create the StringView? > > When the existing StringView is already normalized, this returns a null > String and the same StringView. When it’s not it returns a non-null String > and a StringView pointing to that. I am not sure what the best way of > documenting this is. Perhaps I should just add a comment saying that? Hm, this does seem ideal, then. I think a comment would suffice. The name could also indicate nullability: StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the function could use two out parameters instead of a return value, to avoid the need to create a new type, but that doesn't seem necessarily any better.

Darin Adler

Comment 22 2019-03-16 11:53:08 PDT Comment hidden (obsolete)

Created attachment 364939 [details] Patch

EWS Watchlist

Comment 23 2019-03-16 11:54:56 PDT Comment hidden (obsolete)

Attachment 364939 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.

Darin Adler

Comment 24 2019-03-16 11:55:28 PDT

(In reply to Michael Catanzaro from comment #21) > The name could also indicate nullability: > StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the > function could use two out parameters instead of a return value, to avoid > the need to create a new type, but that doesn't seem necessarily any better. I think we might eventually want to update the struct name, but I am not sure it makes it better. The existing call site that uses StringViewWithUnderlyingString uses it in another way: In that case, the StringView might be a substring of the larger underlying string.

Darin Adler

Comment 25 2019-03-16 15:00:34 PDT Comment hidden (obsolete)

Created attachment 364945 [details] Patch

EWS Watchlist

Comment 26 2019-03-16 15:03:27 PDT Comment hidden (obsolete)

Attachment 364945 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.

Darin Adler

Comment 27 2019-03-16 17:00:28 PDT

Created attachment 364949 [details] Patch

EWS Watchlist

Comment 28 2019-03-16 17:01:42 PDT

Attachment 364949 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.

WebKit Commit Bot

Comment 29 2019-03-16 19:20:57 PDT

Comment on attachment 364949 [details] Patch Clearing flags on attachment: 364949 Committed r243049: <https://trac.webkit.org/changeset/243049>

WebKit Commit Bot

Comment 30 2019-03-16 19:20:59 PDT

All reviewed patches have been landed. Closing bug.

Radar WebKit Bug Importer

Comment 31 2019-03-16 19:22:22 PDT

<rdar://problem/48958229>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version WebKit Nightly Build

Hardware Unspecified

OS Unspecified

Product WebKit

Component Web Template Framework

Assignee

Darin Adler

Reported

2019-03-05 09:20 PST

Modified

2019-03-16 19:22 PDT History

CC List

11 users Show

URL

Keywords InRadar

Depends on

Blocks