Bug 194272

Summary: URLHelpers should use unorm2_quickCheck before converting to NFC
Product: WebKit Reporter: Michael Catanzaro <mcatanzaro>
Component: WebKitGTKAssignee: Michael Catanzaro <mcatanzaro>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, benjamin, bugs-noreply, cdumez, cmarcelo, commit-queue, darin, dbates, ews-watchlist, mcatanzaro
Priority: P2    
Version: WebKit Nightly Build   
Hardware: PC   
OS: Linux   
Attachments:
Description Flags
Patch
none
Patch none

Description Michael Catanzaro 2019-02-04 19:48:00 PST
Bug #174816 turned gargantuan so I'll use a new bug to address these follow-ups:

> Any tests we're adding to the cross-platform test should be removed from the
> platform-specific test. Ideally the entire test would become cross-platform,
> but I see that would require a bit more work and we're just days away from
> API freeze, so I understand if that can't happen now.
> 
> This is minor, but Darin did request that we not use
> charactersWithNullTermination if we are just going to cut off the null
> termination right away. His suggested idiom was:
> 
> auto characters = StringView { string }.upconvertedCharacters();
> 
> Were there any troubles with that?
> 
> Then the unorm2_quickCheck should be one extra function call towards the top
> of toNormalizationFormC(), just to avoid calling unorm2_normalize() if it's
> not really needed. That could be done in a follow-up, of course, but it
> seems easy?
Comment 1 Michael Catanzaro 2019-02-19 18:55:23 PST
Might be able to share code with FontCairoHarfbuzzNG.cpp.
Comment 2 Michael Catanzaro 2019-02-23 16:34:20 PST
(In reply to Michael Catanzaro from comment #0)
> > This is minor, but Darin did request that we not use
> > charactersWithNullTermination if we are just going to cut off the null
> > termination right away. His suggested idiom was:
> > 
> > auto characters = StringView { string }.upconvertedCharacters();
> > 
> > Were there any troubles with that?

Not happy with the gymnastics required to keep length separately:

    auto characters = StringView { string }.upconvertedCharacters();
    const UChar* sourceBuffer = characters.get();
    auto length = string.length();

The code is more clear when we just have a Vector<UChar>, like currently.
Comment 3 Michael Catanzaro 2019-02-23 16:34:50 PST
(In reply to Michael Catanzaro from comment #1)
> Might be able to share code with FontCairoHarfbuzzNG.cpp.

It's used in so many other places. That should be left for a different bug.
Comment 4 Michael Catanzaro 2019-02-23 16:45:13 PST
(In reply to Michael Catanzaro from comment #0)
> Bug #174816 turned gargantuan so I'll use a new bug to address these
> follow-ups:
> 
> > Any tests we're adding to the cross-platform test should be removed from the
> > platform-specific test. Ideally the entire test would become cross-platform,
> > but I see that would require a bit more work and we're just days away from
> > API freeze, so I understand if that can't happen now

I'm going to punt on this too since the Cocoa test relies on some APIs that we don't plan to make cross-platform.
Comment 5 Michael Catanzaro 2019-02-23 16:56:13 PST
Created attachment 362841 [details]
Patch
Comment 6 Darin Adler 2019-03-03 00:37:32 PST
Comment on attachment 362841 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=362841&action=review

This seems OK as is, but I think there is room for improvement.

> Source/WTF/wtf/URLHelpers.cpp:778
> -    auto sourceBuffer = string.charactersWithNullTermination();
> +    Vector<UChar> sourceBuffer = string.charactersWithNullTermination();

I guess you don’t like auto in a case like this. I do; keeps the code less wordy.

I don’t understand why we need the null termination. I think unorm2_quickCheck can work on characters in place; charactersWithNullTermination is an expensive operation that should be avoided if possible.

> Source/WTF/wtf/URLHelpers.cpp:783
> +    const UNormalizer2* normalizer = unorm2_getNFCInstance(&uerror);

I prefer auto to writing out const UNormalizer2*, but I suppose tastes differ.

> Source/WTF/wtf/URLHelpers.cpp:787
> +    UNormalizationCheckResult checkResult = unorm2_quickCheck(normalizer, sourceBuffer.data(), sourceBuffer.size(), &uerror);

I prefer auto to writing out UNormalizationCheckResult, but I suppose tastes differ.

> Source/WTF/wtf/URLHelpers.cpp:793
> +        return String(sourceBuffer.data(), sourceBuffer.size());

Better performance if this does:

    return string;

Then we share the StringImpl instead of allocating a new one with the same characters in it.

> Source/WTF/wtf/URLHelpers.cpp:798
> +    uerror = U_ZERO_ERROR;

Is this needed? If there was an error, we would have returned, I think.

> Source/WTF/wtf/URLHelpers.cpp:799
> +    if (U_SUCCESS(uerror)) {

This is strange. It checks the value of something we set the line before. I think we instead want to remove this if statement.
Comment 7 Darin Adler 2019-03-03 00:39:33 PST
(In reply to Michael Catanzaro from comment #2)
> Not happy with the gymnastics required to keep length separately:
> 
>     auto characters = StringView { string }.upconvertedCharacters();
>     const UChar* sourceBuffer = characters.get();
>     auto length = string.length();
> 
> The code is more clear when we just have a Vector<UChar>, like currently.

I’ll come back and do the optimization. Because that clarity comes at a real cost. The upconvertedCharacters() function is free when the string already is stored as UChar, whereas the Vector thing means copying all the characters into a newly allocated memory buffer!

Might also want a fast case for LChar strings to avoid upconverting them.

All depends on how often this function is used. A high quality implementation for extensive re-use should probably eventually have these optimizations.
Comment 8 Michael Catanzaro 2019-03-03 11:25:21 PST
(In reply to Darin Adler from comment #6)
> > Source/WTF/wtf/URLHelpers.cpp:778
> > -    auto sourceBuffer = string.charactersWithNullTermination();
> > +    Vector<UChar> sourceBuffer = string.charactersWithNullTermination();
> 
> I guess you don’t like auto in a case like this. I do; keeps the code less
> wordy.

I think we don't have guidelines on this. My rule of thumb is: is the code easier to read with auto? Less-wordy code is usually easier to read, but sometimes it's more important to be able to see the type at a glance without having to open WTFString.h and search for "charactersWithNullTermination," which is what I probably had to do here to figure out what type I was working with. It's important that this is a Vector and that it's a vector of UChar, not a char* or whatever. So I usually find myself using auto more for (a) integer types, (b) really long types, or (c) when the type is obvious, e.g. when there is a cast on the rhs of an assignment.

> I don’t understand why we need the null termination. I think
> unorm2_quickCheck can work on characters in place;
> charactersWithNullTermination is an expensive operation that should be
> avoided if possible.

We don't need the null termination. String just doesn't provide a function to get a Vector<UChar> without the null termination.

> > Source/WTF/wtf/URLHelpers.cpp:783
> > +    const UNormalizer2* normalizer = unorm2_getNFCInstance(&uerror);
> 
> I prefer auto to writing out const UNormalizer2*, but I suppose tastes
> differ.

I wish we had stronger guidelines so I didn't have to think about this whenever declaring variables. Here, I could go either way. You're probably not going to be able to modify this code without looking at the ICU API reference, so having the type written out seems much less important here than in the example above. But I don't see much advantage of using auto here, either. I think we normally also write out const auto* in WebKit to avoid hiding those modifiers, even though they're not necessary. (Especially the * is important for readability.)

> > Source/WTF/wtf/URLHelpers.cpp:793
> > +        return String(sourceBuffer.data(), sourceBuffer.size());
> 
> Better performance if this does:
> 
>     return string;
> 
> Then we share the StringImpl instead of allocating a new one with the same
> characters in it.

Oh, right. Let me fix this....

> > Source/WTF/wtf/URLHelpers.cpp:798
> > +    uerror = U_ZERO_ERROR;
> 
> Is this needed? If there was an error, we would have returned, I think.
> 
> > Source/WTF/wtf/URLHelpers.cpp:799
> > +    if (U_SUCCESS(uerror)) {
> 
> This is strange. It checks the value of something we set the line before. I
> think we instead want to remove this if statement.

That's just a refactoring bug. Oops. Will fix this too.

I'll let you handle optimizing it by using upconvertedCharacters(), and convert to auto if you wish.
Comment 9 Michael Catanzaro 2019-03-03 11:45:50 PST
Created attachment 363462 [details]
Patch
Comment 10 Michael Catanzaro 2019-03-03 11:46:36 PST
And I removed the inadvertently-committed test file in r242329. I wish I had more time to work on making the URLExtras.mm test cross-platform, but I don't right now, sorry.
Comment 11 Darin Adler 2019-03-03 11:59:46 PST
(In reply to Michael Catanzaro from comment #8)
> I'll let you handle optimizing it by using upconvertedCharacters()

Might do that. Especially if we add enough tests that it’s easy for me to know I didn’t break it.

> and convert to auto

Definitely won’t do that!
Comment 12 Michael Catanzaro 2019-03-04 06:53:23 PST
There are already tons of tests in URLExtras.mm, I just wish they were cross-platform. :(
Comment 13 WebKit Commit Bot 2019-03-04 07:18:48 PST
Comment on attachment 363462 [details]
Patch

Clearing flags on attachment: 363462

Committed r242353: <https://trac.webkit.org/changeset/242353>
Comment 14 WebKit Commit Bot 2019-03-04 07:18:50 PST
All reviewed patches have been landed.  Closing bug.