Bug 195330 - Improve normalization code, including moving from unorm.h to unorm2.h
Summary: Improve normalization code, including moving from unorm.h to unorm2.h
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Web Template Framework (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Darin Adler
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2019-03-05 09:20 PST by Darin Adler
Modified: 2019-03-16 19:22 PDT (History)
11 users (show)

See Also:


Attachments
Patch (4.28 KB, patch)
2019-03-05 09:27 PST, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (29.43 KB, patch)
2019-03-09 09:33 PST, Darin Adler
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews115 for mac-highsierra (2.31 MB, application/zip)
2019-03-09 11:34 PST, Build Bot
no flags Details
Patch (30.06 KB, patch)
2019-03-10 12:23 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (35.58 KB, patch)
2019-03-16 11:53 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (35.58 KB, patch)
2019-03-16 15:00 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
Patch (35.60 KB, patch)
2019-03-16 17:00 PDT, Darin Adler
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Darin Adler 2019-03-05 09:20:48 PST
Improve the toNormalizationFormC function
Comment 1 Darin Adler 2019-03-05 09:27:31 PST Comment hidden (obsolete)
Comment 2 Darin Adler 2019-03-05 09:28:40 PST
The only part of this I am not sure about is the three different passes through the string it takes. Otherwise, I think this is better and maybe should be moved somewhere where it can be reused.
Comment 3 Darin Adler 2019-03-05 10:32:28 PST
Comment on attachment 363643 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review

> Source/WTF/wtf/URLHelpers.cpp:779
> +    if (string.is8Bit() || string.isEmpty())

This isEmpty check isn’t needed.
Comment 4 Michael Catanzaro 2019-03-05 11:02:24 PST
Comment on attachment 363643 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review

>> Source/WTF/wtf/URLHelpers.cpp:779
>> +    if (string.is8Bit() || string.isEmpty())
> 
> This isEmpty check isn’t needed.

Why not?

Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried.

> Source/WTF/wtf/URLHelpers.cpp:786
> +    ASSERT(U_SUCCESS(error));
> +    if (U_FAILURE(error))
> +        return string;

I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled....

Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust.

> Source/WTF/wtf/URLHelpers.cpp:795
> +    auto normalizedLength = unorm2_normalize(normalizer, string.characters16(), string.length(), nullptr, 0, &error);
> +    ASSERT(error == U_BUFFER_OVERFLOW_ERROR);

Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing.

As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.
Comment 5 Darin Adler 2019-03-05 19:17:04 PST
Comment on attachment 363643 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review

>>> Source/WTF/wtf/URLHelpers.cpp:779
>>> +    if (string.is8Bit() || string.isEmpty())
>> 
>> This isEmpty check isn’t needed.
> 
> Why not?
> 
> Anyway, returning early for 8-bit strings was a good call. Now, because you know the WTFString is internally 16-bit, you're able to use the nice characters16 function. Much better than what I came up with when I tried.

There are multiple reasons, but the main reason is that the empty string is guaranteed to be 8-bit.

>> Source/WTF/wtf/URLHelpers.cpp:786
>> +        return string;
> 
> I'm surprised to see you handling the case where the assertion fails. Normally I would complain that U_SUCCESS(error) is guaranteed to be true here, because of the assert, but I suppose you prefer to handle it for release builds with asserts disabled....
> 
> Anyway, I do like that you're returning the original string on error, instead of the empty string like before. More robust.

Three options:

ASSERT with early return (so we silently fail if the unthinkable happens)
RELEASE_ASSERT (so we abort if the unthinkable happens)
ASSERT (so something unpredictable happens if the unthinkable happens -- in this case probably a crash since it returns null, but different at other call sites)

Not sure which I prefer. Maybe RELEASE_ASSERT.

>> Source/WTF/wtf/URLHelpers.cpp:795
>> +    ASSERT(error == U_BUFFER_OVERFLOW_ERROR);
> 
> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing.
> 
> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.

I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.
Comment 6 Michael Catanzaro 2019-03-06 07:30:19 PST
Comment on attachment 363643 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review

>>> Source/WTF/wtf/URLHelpers.cpp:795
>>> +    ASSERT(error == U_BUFFER_OVERFLOW_ERROR);
>> 
>> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing.
>> 
>> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.
> 
> I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.

Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again.

I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me.
Comment 7 Darin Adler 2019-03-08 09:31:23 PST
Comment on attachment 363643 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=363643&action=review

>>>> Source/WTF/wtf/URLHelpers.cpp:795
>>>> +    ASSERT(error == U_BUFFER_OVERFLOW_ERROR);
>>> 
>>> Er, what? You assume the normalized string will *always* be larger than the original string? I see it passes the tests, which surprises me, so that's good, and I'm no expert on Unicode normalization... but this doesn't seem like a good assumption at all. Shouldn't the code be prepared to handle the normalized result being smaller than, or equal to, the original? Please reconsider this before landing.
>>> 
>>> As we discussed previously, there are many places in WebKit outside URLHelpers.cpp currently using unorm_normalize() or unorm2_normalize() manually, which would benefit from use of this function. So even if for some strange reason normalized URLs are guaranteed to be larger, I would still make this more general. The original code looks safer.
>> 
>> I am passing a length of 0 here for the target buffer. That’s what guarantees U_BUFFER_OVERFLOW_ERROR.
> 
> Ahh, that's what I missed, OK then. That way you don't have to allocate the wrong size of buffer and then try again.
> 
> I doubt this is really well-optimized, though, since now you have to do the normalization twice, discarding the result the first time. The heuristic we were using before -- assuming that the string will usually fit into the same size buffer as was required prior to normalization -- seemed good to me.

For this minor point (whether to compute the size or optimistically allocate a buffer and try at the starting size), there are at least 4 considerations, all relatively small and minor themselves:

1) The cost of normalizing multiple times. As you pointed out, the patch we are reviewing here normalizes twice (and arguably three times since the isNormalized check might be roughly as expensive as normalizing in edge cases).

2) The cost of allocating memory once or twice. The patch we are reviewing here allocates memory exactly once, for the StringImpl. Other approaches involve allocating memory multiple times for intermediate buffers before allocating the final string. For example pre-patch there is the Vector for the original characters (allocation #1), the Vector for the normalized characters (sometimes allocation #2 if the string is long enough to not fit in the stack buffer), and the StringImpl (allocation #2 or #3).

3) The cost of copying characters. The patch we are reviewing does not copy characters outside of copying done by the ICU normalization function. Normalizing into a Vector and then converting that into a String with String::adopt involves copying all the characters in the case where the string is small enough to fit into the buffer on the stack.

4) The cost of keeping around a string in one memory block of two. The patch we are reviewing always creates a string that uses the buffer inside the StringImpl, so has only takes one memory block overhead, rather than two; can matter if we keep it around for a long time. Using a heap-allocated Vector (or even a StringBuffer) to create a string creates a string that uses more storage, for its entire lifetime, because it uses two memory blocks, one for the StringImpl and one for the characters.

It’s also worth noting that normalizing to form "C" (the "compact" one) almost always results in a string that is *smaller* than the original string. Almost never the same size unless the string is already normalized (which we handle separately, first). I believe that it *can* result in a string that is larger, but I am not sure in what cases that can occur since normalization to C almost always means composing pairs that are in their decomposed form.

I’m gong to post a new patch that moves this function and another related one to the StringView header soon.

The other call site I found doing normalization was already using the "call once to just get the size" approach. Not sure whether I will stick with that, or move to the "Vector with stack capacity" approach.
Comment 8 Darin Adler 2019-03-09 09:33:41 PST Comment hidden (obsolete)
Comment 9 Build Bot 2019-03-09 09:35:33 PST Comment hidden (obsolete)
Comment 10 Darin Adler 2019-03-09 10:07:54 PST
Comment on attachment 364132 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=364132&action=review

> Source/WTF/wtf/URLHelpers.cpp:779
> +            for (size_t j = 0; j < characterLength; ++j)

I suppose this should be unsigned rather than size_t.
Comment 11 Build Bot 2019-03-09 11:34:02 PST Comment hidden (obsolete)
Comment 12 Build Bot 2019-03-09 11:34:04 PST Comment hidden (obsolete)
Comment 13 Build Bot 2019-03-09 12:11:22 PST Comment hidden (obsolete)
Comment 14 Darin Adler 2019-03-09 13:54:58 PST Comment hidden (obsolete)
Comment 15 Darin Adler 2019-03-10 12:01:17 PDT
Turns out the 8-bit special case is correct for NFC, but not NFD, for example.
Comment 16 Darin Adler 2019-03-10 12:23:08 PDT Comment hidden (obsolete)
Comment 17 Build Bot 2019-03-10 12:26:33 PDT Comment hidden (obsolete)
Comment 18 Darin Adler 2019-03-10 13:35:43 PDT
Hooray, this one is passing the tests. Looks like I fixed the two problems in the earlier version so ready for review now.
Comment 19 Michael Catanzaro 2019-03-11 10:15:57 PDT
Comment on attachment 364195 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=364195&action=review

There's also FontCairoHarfbuzzNG.cpp which could be converted to use the new normalizedNFC() function. Should be straightforward, if you're brave and want to touch a file you can't build.

> Source/WTF/wtf/text/StringView.h:216
> +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView);

Wouldn't it be simpler to just return a String and let the API user create the StringView?

> Source/WebCore/editing/TextIterator.cpp:2017
>  static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer)

Shame we can't use your new normalizedNFC() function here. Oh well. :(
Comment 20 Darin Adler 2019-03-11 20:51:19 PDT
Comment on attachment 364195 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=364195&action=review

I’ll take a look at the call site in FontCairoHarfbuzzNG.cpp too.

>> Source/WTF/wtf/text/StringView.h:216
>> +WTF_EXPORT_PRIVATE StringViewWithUnderlyingString normalizedNFC(StringView);
> 
> Wouldn't it be simpler to just return a String and let the API user create the StringView?

When the existing StringView is already normalized, this returns a null String and the same StringView. When it’s not it returns a non-null String and a StringView pointing to that. I am not sure what the best way of documenting this is. Perhaps I should just add a comment saying that?

>> Source/WebCore/editing/TextIterator.cpp:2017
>>  static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer)
> 
> Shame we can't use your new normalizedNFC() function here. Oh well. :(

Maybe we can eventually find a way to share the code. So far I didn’t find the perfect way.
Comment 21 Michael Catanzaro 2019-03-12 09:12:11 PDT
(In reply to Darin Adler from comment #20)
> > Wouldn't it be simpler to just return a String and let the API user create the StringView?
> 
> When the existing StringView is already normalized, this returns a null
> String and the same StringView. When it’s not it returns a non-null String
> and a StringView pointing to that. I am not sure what the best way of
> documenting this is. Perhaps I should just add a comment saying that?

Hm, this does seem ideal, then. I think a comment would suffice.

The name could also indicate nullability: StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the function could use two out parameters instead of a return value, to avoid the need to create a new type, but that doesn't seem necessarily any better.
Comment 22 Darin Adler 2019-03-16 11:53:08 PDT Comment hidden (obsolete)
Comment 23 Build Bot 2019-03-16 11:54:56 PDT Comment hidden (obsolete)
Comment 24 Darin Adler 2019-03-16 11:55:28 PDT
(In reply to Michael Catanzaro from comment #21)
> The name could also indicate nullability:
> StringViewPotentiallyWithUnderlyingString. But that's kinda yucky. Or the
> function could use two out parameters instead of a return value, to avoid
> the need to create a new type, but that doesn't seem necessarily any better.

I think we might eventually want to update the struct name, but I am not sure it makes it better.

The existing call site that uses StringViewWithUnderlyingString uses it in another way: In that case, the StringView might be a substring of the larger underlying string.
Comment 25 Darin Adler 2019-03-16 15:00:34 PDT Comment hidden (obsolete)
Comment 26 Build Bot 2019-03-16 15:03:27 PDT Comment hidden (obsolete)
Comment 27 Darin Adler 2019-03-16 17:00:28 PDT
Created attachment 364949 [details]
Patch
Comment 28 Build Bot 2019-03-16 17:01:42 PDT
Attachment 364949 [details] did not pass style-queue:


ERROR: Source/JavaScriptCore/runtime/StringPrototype.cpp:1808:  enum members should use InterCaps with an initial capital letter or initial 'k' for C-style enums.  [readability/enum_casing] [4]
Total errors found: 1 in 11 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 29 WebKit Commit Bot 2019-03-16 19:20:57 PDT
Comment on attachment 364949 [details]
Patch

Clearing flags on attachment: 364949

Committed r243049: <https://trac.webkit.org/changeset/243049>
Comment 30 WebKit Commit Bot 2019-03-16 19:20:59 PDT
All reviewed patches have been landed.  Closing bug.
Comment 31 Radar WebKit Bug Importer 2019-03-16 19:22:22 PDT
<rdar://problem/48958229>