Bug 194272

Summary:

URLHelpers should use unorm2_quickCheck before converting to NFC

Product:

WebKit

Reporter:

Michael Catanzaro <mcatanzaro>

Component:

WebKitGTK

Assignee:

Michael Catanzaro <mcatanzaro>

Status:

RESOLVED FIXED

Severity:

Normal

CC:

ap, benjamin, bugs-noreply, cdumez, cmarcelo, commit-queue, darin, dbates, ews-watchlist, mcatanzaro

Priority:

Version:

WebKit Nightly Build

Hardware:

OS:

Linux

Attachments:

Description	Flags
Patch	none
Patch	none

Michael Catanzaro

Reported 2019-02-04 19:48:00 PST

Bug #174816 turned gargantuan so I'll use a new bug to address these follow-ups: > Any tests we're adding to the cross-platform test should be removed from the > platform-specific test. Ideally the entire test would become cross-platform, > but I see that would require a bit more work and we're just days away from > API freeze, so I understand if that can't happen now. > > This is minor, but Darin did request that we not use > charactersWithNullTermination if we are just going to cut off the null > termination right away. His suggested idiom was: > > auto characters = StringView { string }.upconvertedCharacters(); > > Were there any troubles with that? > > Then the unorm2_quickCheck should be one extra function call towards the top > of toNormalizationFormC(), just to avoid calling unorm2_normalize() if it's > not really needed. That could be done in a follow-up, of course, but it > seems easy?

Attachments
Patch (2.80 KB, patch) 2019-02-23 16:56 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Patch (3.29 KB, patch) 2019-03-03 11:45 PST, Michael Catanzaro	no flags	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Michael Catanzaro

Comment 1 2019-02-19 18:55:23 PST

Might be able to share code with FontCairoHarfbuzzNG.cpp.

Michael Catanzaro

Comment 2 2019-02-23 16:34:20 PST

(In reply to Michael Catanzaro from comment #0) > > This is minor, but Darin did request that we not use > > charactersWithNullTermination if we are just going to cut off the null > > termination right away. His suggested idiom was: > > > > auto characters = StringView { string }.upconvertedCharacters(); > > > > Were there any troubles with that? Not happy with the gymnastics required to keep length separately: auto characters = StringView { string }.upconvertedCharacters(); const UChar* sourceBuffer = characters.get(); auto length = string.length(); The code is more clear when we just have a Vector<UChar>, like currently.

Michael Catanzaro

Comment 3 2019-02-23 16:34:50 PST

(In reply to Michael Catanzaro from comment #1) > Might be able to share code with FontCairoHarfbuzzNG.cpp. It's used in so many other places. That should be left for a different bug.

Michael Catanzaro

Comment 4 2019-02-23 16:45:13 PST

(In reply to Michael Catanzaro from comment #0) > Bug #174816 turned gargantuan so I'll use a new bug to address these > follow-ups: > > > Any tests we're adding to the cross-platform test should be removed from the > > platform-specific test. Ideally the entire test would become cross-platform, > > but I see that would require a bit more work and we're just days away from > > API freeze, so I understand if that can't happen now I'm going to punt on this too since the Cocoa test relies on some APIs that we don't plan to make cross-platform.

Michael Catanzaro

Comment 5 2019-02-23 16:56:13 PST

Created attachment 362841 [details] Patch

Darin Adler

Comment 6 2019-03-03 00:37:32 PST

Comment on attachment 362841 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=362841&action=review This seems OK as is, but I think there is room for improvement. > Source/WTF/wtf/URLHelpers.cpp:778 > - auto sourceBuffer = string.charactersWithNullTermination(); > + Vector<UChar> sourceBuffer = string.charactersWithNullTermination(); I guess you don’t like auto in a case like this. I do; keeps the code less wordy. I don’t understand why we need the null termination. I think unorm2_quickCheck can work on characters in place; charactersWithNullTermination is an expensive operation that should be avoided if possible. > Source/WTF/wtf/URLHelpers.cpp:783 > + const UNormalizer2* normalizer = unorm2_getNFCInstance(&uerror); I prefer auto to writing out const UNormalizer2*, but I suppose tastes differ. > Source/WTF/wtf/URLHelpers.cpp:787 > + UNormalizationCheckResult checkResult = unorm2_quickCheck(normalizer, sourceBuffer.data(), sourceBuffer.size(), &uerror); I prefer auto to writing out UNormalizationCheckResult, but I suppose tastes differ. > Source/WTF/wtf/URLHelpers.cpp:793 > + return String(sourceBuffer.data(), sourceBuffer.size()); Better performance if this does: return string; Then we share the StringImpl instead of allocating a new one with the same characters in it. > Source/WTF/wtf/URLHelpers.cpp:798 > + uerror = U_ZERO_ERROR; Is this needed? If there was an error, we would have returned, I think. > Source/WTF/wtf/URLHelpers.cpp:799 > + if (U_SUCCESS(uerror)) { This is strange. It checks the value of something we set the line before. I think we instead want to remove this if statement.

Darin Adler

Comment 7 2019-03-03 00:39:33 PST

(In reply to Michael Catanzaro from comment #2) > Not happy with the gymnastics required to keep length separately: > > auto characters = StringView { string }.upconvertedCharacters(); > const UChar* sourceBuffer = characters.get(); > auto length = string.length(); > > The code is more clear when we just have a Vector<UChar>, like currently. I’ll come back and do the optimization. Because that clarity comes at a real cost. The upconvertedCharacters() function is free when the string already is stored as UChar, whereas the Vector thing means copying all the characters into a newly allocated memory buffer! Might also want a fast case for LChar strings to avoid upconverting them. All depends on how often this function is used. A high quality implementation for extensive re-use should probably eventually have these optimizations.

Michael Catanzaro

Comment 8 2019-03-03 11:25:21 PST

(In reply to Darin Adler from comment #6) > > Source/WTF/wtf/URLHelpers.cpp:778 > > - auto sourceBuffer = string.charactersWithNullTermination(); > > + Vector<UChar> sourceBuffer = string.charactersWithNullTermination(); > > I guess you don’t like auto in a case like this. I do; keeps the code less > wordy. I think we don't have guidelines on this. My rule of thumb is: is the code easier to read with auto? Less-wordy code is usually easier to read, but sometimes it's more important to be able to see the type at a glance without having to open WTFString.h and search for "charactersWithNullTermination," which is what I probably had to do here to figure out what type I was working with. It's important that this is a Vector and that it's a vector of UChar, not a char* or whatever. So I usually find myself using auto more for (a) integer types, (b) really long types, or (c) when the type is obvious, e.g. when there is a cast on the rhs of an assignment. > I don’t understand why we need the null termination. I think > unorm2_quickCheck can work on characters in place; > charactersWithNullTermination is an expensive operation that should be > avoided if possible. We don't need the null termination. String just doesn't provide a function to get a Vector<UChar> without the null termination. > > Source/WTF/wtf/URLHelpers.cpp:783 > > + const UNormalizer2* normalizer = unorm2_getNFCInstance(&uerror); > > I prefer auto to writing out const UNormalizer2*, but I suppose tastes > differ. I wish we had stronger guidelines so I didn't have to think about this whenever declaring variables. Here, I could go either way. You're probably not going to be able to modify this code without looking at the ICU API reference, so having the type written out seems much less important here than in the example above. But I don't see much advantage of using auto here, either. I think we normally also write out const auto* in WebKit to avoid hiding those modifiers, even though they're not necessary. (Especially the * is important for readability.) > > Source/WTF/wtf/URLHelpers.cpp:793 > > + return String(sourceBuffer.data(), sourceBuffer.size()); > > Better performance if this does: > > return string; > > Then we share the StringImpl instead of allocating a new one with the same > characters in it. Oh, right. Let me fix this.... > > Source/WTF/wtf/URLHelpers.cpp:798 > > + uerror = U_ZERO_ERROR; > > Is this needed? If there was an error, we would have returned, I think. > > > Source/WTF/wtf/URLHelpers.cpp:799 > > + if (U_SUCCESS(uerror)) { > > This is strange. It checks the value of something we set the line before. I > think we instead want to remove this if statement. That's just a refactoring bug. Oops. Will fix this too. I'll let you handle optimizing it by using upconvertedCharacters(), and convert to auto if you wish.

Michael Catanzaro

Comment 9 2019-03-03 11:45:50 PST

Created attachment 363462 [details] Patch

Michael Catanzaro

Comment 10 2019-03-03 11:46:36 PST

And I removed the inadvertently-committed test file in r242329. I wish I had more time to work on making the URLExtras.mm test cross-platform, but I don't right now, sorry.

Darin Adler

Comment 11 2019-03-03 11:59:46 PST

(In reply to Michael Catanzaro from comment #8) > I'll let you handle optimizing it by using upconvertedCharacters() Might do that. Especially if we add enough tests that it’s easy for me to know I didn’t break it. > and convert to auto Definitely won’t do that!

Michael Catanzaro

Comment 12 2019-03-04 06:53:23 PST

There are already tons of tests in URLExtras.mm, I just wish they were cross-platform. :(

WebKit Commit Bot

Comment 13 2019-03-04 07:18:48 PST

Comment on attachment 363462 [details] Patch Clearing flags on attachment: 363462 Committed r242353: <https://trac.webkit.org/changeset/242353>

WebKit Commit Bot

Comment 14 2019-03-04 07:18:50 PST

All reviewed patches have been landed. Closing bug.

Note You need to log in before you can comment on or make changes to this bug.