RESOLVED FIXED Bug 195330
Improve normalization code, including moving from unorm.h to unorm2.h
https://bugs.webkit.org/show_bug.cgi?id=195330
Summary Improve normalization code, including moving from unorm.h to unorm2.h
Darin Adler
Reported 2019-03-05 09:20:48 PST
Improve the toNormalizationFormC function
Attachments
Patch (4.28 KB, patch)
2019-03-05 09:27 PST, Darin Adler
no flags
Patch (29.43 KB, patch)
2019-03-09 09:33 PST, Darin Adler
no flags
Archive of layout-test-results from ews115 for mac-highsierra (2.31 MB, application/zip)
2019-03-09 11:34 PST, EWS Watchlist
no flags
Patch (30.06 KB, patch)
2019-03-10 12:23 PDT, Darin Adler
no flags
Patch (35.58 KB, patch)
2019-03-16 11:53 PDT, Darin Adler
no flags
Patch (35.58 KB, patch)
2019-03-16 15:00 PDT, Darin Adler
no flags
Patch (35.60 KB, patch)
2019-03-16 17:00 PDT, Darin Adler
no flags
Darin Adler
Comment 1 2019-03-05 09:27:31 PST Comment hidden (obsolete)
Darin Adler
Comment 2 2019-03-05 09:28:40 PST
The only part of this I am not sure about is the three different passes through the string it takes. Otherwise, I think this is better and maybe should be moved somewhere where it can be reused.
Darin Adler
Comment 3 2019-03-05 10:32:28 PST
Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review > Source/WTF/wtf/URLHelpers.cpp:779 > + if (string.is8Bit() || string.isEmpty()) This isEmpty check isn’t needed.
Michael Catanzaro
Comment 4 2019-03-05 11:02:24 PST
Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >> Source/WTF/wtf/URLHelpers.cpp:779 >> + if (string.is8Bit() || string.isEmpty()) > > This isEmpty check isn’t needed. Why not? Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried. > Source/WTF/wtf/URLHelpers.cpp:786 > + ASSERT(U_SUCCESS(error)); > + if (U_FAILURE(error)) > + return string; I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled.... Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust. > Source/WTF/wtf/URLHelpers.cpp:795 > + auto normalizedLength = unorm2_normalize(normalizer, string.characters16(), string.length(), nullptr, 0, &error); > + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.
Darin Adler
Comment 5 2019-03-05 19:17:04 PST
Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >>> Source/WTF/wtf/URLHelpers.cpp:779 >>> + if (string.is8Bit() || string.isEmpty()) >> >> This isEmpty check isn’t needed. > > Why not? > > Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried. There are multiple reasons, but the main reason is that the empty string is guaranteed to be 8-bit. >> Source/WTF/wtf/URLHelpers.cpp:786 >> + return string; > > I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled.... > > Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust. Three options: ASSERT with early return (so we silently fail if the unthinkable happens) RELEASE_ASSERT (so we abort if the unthinkable happens) ASSERT (so something unpredictable happens if the unthinkable happens -- in this case probably a crash since it returns null, but different at other call sites) Not sure which I prefer. Maybe RELEASE_ASSERT. >> Source/WTF/wtf/URLHelpers.cpp:795 >> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); > > Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. > > As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.
Michael Catanzaro
Comment 6 2019-03-06 07:30:19 PST
Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >>> Source/WTF/wtf/URLHelpers.cpp:795 >>> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); >> >> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. >> >> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. > > I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR. Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again. I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me.
Darin Adler
Comment 7 2019-03-08 09:31:23 PST
Comment on attachment 363643 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review >>>> Source/WTF/wtf/URLHelpers.cpp:795 >>>> + ASSERT(error == U_BUFFER_OVERFLOW_ERROR); >>> >>> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing. >>> >>> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer. >> >> I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR. > > Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again. > > I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me. For this minor point (whether to compute the size or optimistically allocate a buffer and try at the starting size), there are at least 4 considerations, all relatively small and minor themselves: 1) The cost of normalizing multiple times. As you pointed out, the patch we are reviewing here normalizes twice (and arguably three times since the isNormalized check might be roughly as expensive as normalizing in edge cases). 2) The cost of allocating memory once or twice. The patch we are reviewing here allocates memory exactly once, for the StringImpl. Other approaches involve allocating memory multiple times for intermediate buffers before allocating the final string. For example pre-patch there is the Vector for the original characters (allocation #1), the Vector for the normalized characters (sometimes allocation #2 if the string is long enough to not fit in the stack buffer), and the StringImpl (allocation #2 or #3). 3) The cost of copying characters. The patch we are reviewing does not copy characters outside of copying done by the ICU normalization function. Normalizing into a Vector and then converting that into a String with String::adopt involves copying all the characters in the case where the string is small enough to fit into the buffer on the stack. 4) The cost of keeping around a string in one memory block of two. The patch we are reviewing always creates a string that uses the buffer inside the StringImpl, so has only takes one memory block overhead, rather than two; can matter if we keep it around for a long time. Using a heap-allocated Vector (or even a StringBuffer) to create a string creates a string that uses more storage, for its entire lifetime, because it uses two memory blocks, one for the StringImpl and one for the characters. It’s also worth noting that normalizing to form "C" (the "compact" one) almost always results in a string that is *smaller* than the original string. Almost never the same size unless the string is already normalized (which we handle separately, first). I believe that it *can* result in a string that is larger, but I am not sure in what cases that can occur since normalization to C almost always means composing pairs that are in their decomposed form. I’m gong to post a new patch that moves this function and another related one to the StringView header soon. The other call site I found doing normalization was already using the "call once to just get the size" approach. Not sure whether I will stick with that, or move to the "Vector with stack capacity" approach.
Darin Adler
Comment 8 2019-03-09 09:33:41 PST Comment hidden (obsolete)
EWS Watchlist
Comment 9 2019-03-09 09:35:33 PST Comment hidden (obsolete)
Darin Adler
Comment 10 2019-03-09 10:07:54 PST
Comment on attachment 364132 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=364132&action=review > Source/WTF/wtf/URLHelpers.cpp:779 > + for (size_t j = 0; j < characterLength; ++j) I suppose this should be unsigned rather than size_t.
EWS Watchlist
Comment 11 2019-03-09 11:34:02 PST Comment hidden (obsolete)
EWS Watchlist
Comment 12 2019-03-09 11:34:04 PST Comment hidden (obsolete)
EWS Watchlist
Comment 13 2019-03-09 12:11:22 PST Comment hidden (obsolete)
Darin Adler
Comment 14 2019-03-09 13:54:58 PST Comment hidden (obsolete)
Darin Adler
Comment 15 2019-03-10 12:01:17 PDT
Turns out the 8-bit special case is correct for NFC, but not NFD, for example.
Darin Adler
Comment 16 2019-03-10 12:23:08 PDT Comment hidden (obsolete)
EWS Watchlist
Comment 17 2019-03-10 12:26:33 PDT Comment hidden (obsolete)
Darin Adler
Comment 18 2019-03-10 13:35:43 PDT
Hooray, this one is passing the tests. Looks like I fixed the two problems in the earlier version so ready for review now.
Michael Catanzaro
Comment 19 2019-03-11 10:15:57 PDT
Comment on attachment 364195 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=364195&action=review There's also FontCairoHarfbuzzNG.cpp which could be converted to use the new normalizedNFC() function. Should be straightforward, if you're brave and want to touch a file you can't build. > Source/WTF/wtf/text/StringView.h:216 > +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView); Wouldn't it be simpler to just return a String and let the API user create the StringView? > Source/WebCore/editing/TextIterator.cpp:2017 > static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) Shame we can't use your new normalizedNFC() function here. Oh well. :(
Darin Adler
Comment 20 2019-03-11 20:51:19 PDT
Comment on attachment 364195 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=364195&action=review I’ll take a look at the call site in FontCairoHarfbuzzNG.cpp too. >> Source/WTF/wtf/text/StringView.h:216 >> +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView); > > Wouldn't it be simpler to just return a String and let the API user create the StringView? When the existing StringView is already normalized, this returns a null String and the same StringView. When it’s not it returns a non-null String and a StringView pointing to that. I am not sure what the best way of documenting this is. Perhaps I should just add a comment saying that? >> Source/WebCore/editing/TextIterator.cpp:2017 >> static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer) > > Shame we can't use your new normalizedNFC() function here. Oh well. :( Maybe we can eventually find a way to share the code. So far I didn’t find the perfect way.
Michael Catanzaro
Comment 21 2019-03-12 09:12:11 PDT
(In reply to Darin Adler from comment #20) > > Wouldn't it be simpler to just return a String and let the API user create the StringView? > > When the existing StringView is already normalized, this returns a null > String and the same StringView. When it’s not it returns a non-null String > and a StringView pointing to that. I am not sure what the best way of > documenting this is. Perhaps I should just add a comment saying that? Hm, this does seem ideal, then. I think a comment would suffice. The name could also indicate nullability: StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the function could use two out parameters instead of a return value, to avoid the need to create a new type, but that doesn't seem necessarily any better.
Darin Adler
Comment 22 2019-03-16 11:53:08 PDT Comment hidden (obsolete)
EWS Watchlist
Comment 23 2019-03-16 11:54:56 PDT Comment hidden (obsolete)
Darin Adler
Comment 24 2019-03-16 11:55:28 PDT
(In reply to Michael Catanzaro from comment #21) > The name could also indicate nullability: > StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the > function could use two out parameters instead of a return value, to avoid > the need to create a new type, but that doesn't seem necessarily any better. I think we might eventually want to update the struct name, but I am not sure it makes it better. The existing call site that uses StringViewWithUnderlyingString uses it in another way: In that case, the StringView might be a substring of the larger underlying string.
Darin Adler
Comment 25 2019-03-16 15:00:34 PDT Comment hidden (obsolete)
EWS Watchlist
Comment 26 2019-03-16 15:03:27 PDT Comment hidden (obsolete)
Darin Adler
Comment 27 2019-03-16 17:00:28 PDT
EWS Watchlist
Comment 28 2019-03-16 17:01:42 PDT
Attachment 364949 [details] did not pass style-queue: ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808: enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums. [readability/enum_casing] [4] Total errors found: 1 in 11 files If any of these errors are false positives, please file a bug against check-webkit-style.
WebKit Commit Bot
Comment 29 2019-03-16 19:20:57 PDT
Comment on attachment 364949 [details] Patch Clearing flags on attachment: 364949 Committed r243049: <https://trac.webkit.org/changeset/243049>
WebKit Commit Bot
Comment 30 2019-03-16 19:20:59 PDT
All reviewed patches have been landed. Closing bug.
Radar WebKit Bug Importer
Comment 31 2019-03-16 19:22:22 PDT
Note You need to log in before you can comment on or make changes to this bug.