Bug 174816 - [GTK][WPE] Need a function to convert internal URI to display ("pretty") URI
Summary: [GTK][WPE] Need a function to convert internal URI to display ("pretty") URI
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: WebKitGTK (show other bugs)
Version: Other
Hardware: PC Linux
: P2 Normal
Assignee: Ms2ger (he/him; ⌚ UTC+1/+2)
URL:
Keywords:
Depends on: 192765 192945
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-25 05:37 PDT by Michael Catanzaro
Modified: 2019-02-05 08:04 PST (History)
16 users (show)

See Also:


Attachments
Patch (7.12 KB, patch)
2018-01-07 10:22 PST, Gabriel Ivașcu
no flags Details | Formatted Diff | Diff
Patch (21.72 KB, patch)
2018-01-08 04:59 PST, Gabriel Ivașcu
no flags Details | Formatted Diff | Diff
Patch (59.15 KB, patch)
2018-01-15 08:58 PST, Gabriel Ivașcu
no flags Details | Formatted Diff | Diff
Patch (59.70 KB, patch)
2018-01-15 09:09 PST, Gabriel Ivașcu
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews101 for mac-sierra (2.28 MB, application/zip)
2018-01-15 10:11 PST, Build Bot
no flags Details
Archive of layout-test-results from ews106 for mac-sierra-wk2 (2.76 MB, application/zip)
2018-01-15 10:17 PST, Build Bot
no flags Details
Archive of layout-test-results from ews114 for mac-sierra (2.94 MB, application/zip)
2018-01-15 10:42 PST, Build Bot
no flags Details
Patch (61.60 KB, patch)
2018-01-20 03:47 PST, Gabriel Ivașcu
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews102 for mac-sierra (2.21 MB, application/zip)
2018-01-20 04:47 PST, Build Bot
no flags Details
Archive of layout-test-results from ews106 for mac-sierra-wk2 (2.59 MB, application/zip)
2018-01-20 04:54 PST, Build Bot
no flags Details
Archive of layout-test-results from ews114 for mac-sierra (2.97 MB, application/zip)
2018-01-20 05:13 PST, Build Bot
no flags Details
Patch (60.75 KB, patch)
2018-01-27 10:50 PST, Gabriel Ivașcu
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews103 for mac-sierra (2.22 MB, application/zip)
2018-01-27 11:51 PST, Build Bot
no flags Details
Archive of layout-test-results from ews104 for mac-sierra-wk2 (2.55 MB, application/zip)
2018-01-27 11:56 PST, Build Bot
no flags Details
Archive of layout-test-results from ews114 for mac-sierra (2.94 MB, application/zip)
2018-01-27 12:17 PST, Build Bot
no flags Details
Patch (63.51 KB, patch)
2018-06-12 09:03 PDT, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (68.21 KB, patch)
2018-06-19 08:37 PDT, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (68.24 KB, patch)
2018-06-19 09:04 PDT, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from ews126 for ios-simulator-wk2 (2.24 MB, application/zip)
2018-06-19 10:50 PDT, Build Bot
no flags Details
Patch (69.09 KB, patch)
2018-06-28 06:07 PDT, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (71.22 KB, patch)
2018-07-04 09:15 PDT, Ms2ger (he/him; ⌚ UTC+1/+2)
cgarcia: review-
ews: commit-queue-
Details | Formatted Diff | Diff
Archive of layout-test-results from ews125 for ios-simulator-wk2 (2.24 MB, application/zip)
2018-07-04 11:03 PDT, Build Bot
no flags Details
Archive of layout-test-results from ews201 for win-future (12.76 MB, application/zip)
2018-07-04 13:23 PDT, Build Bot
no flags Details
Patch (114.10 KB, patch)
2018-12-07 05:14 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
mcatanzaro: review-
ews: commit-queue-
Details | Formatted Diff | Diff
Archive of layout-test-results from ews122 for ios-simulator-wk2 (9.72 MB, application/zip)
2018-12-07 07:09 PST, Build Bot
no flags Details
Patch (108.73 KB, patch)
2018-12-11 01:41 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (109.44 KB, patch)
2018-12-14 08:32 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (109.52 KB, patch)
2018-12-18 01:04 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (109.52 KB, patch)
2018-12-18 01:12 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (112.49 KB, patch)
2019-01-28 09:00 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (112.48 KB, patch)
2019-01-29 00:51 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (112.36 KB, patch)
2019-01-29 08:55 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (112.35 KB, patch)
2019-01-30 00:52 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
no flags Details | Formatted Diff | Diff
Patch (112.42 KB, patch)
2019-01-30 03:50 PST, Ms2ger (he/him; ⌚ UTC+1/+2)
mcatanzaro: review+
mcatanzaro: commit-queue-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Catanzaro 2017-07-25 05:37:59 PDT
WebKitWebView should have a "pretty" URI property. I don't know what we should name it; perhaps presentable-uri or decoded-uri or something along those lines. This URI would be for use in user-visible locations. It should percent-decode the URI, so it will contain UTF-8 characters instead of ASCII. It should also perform IDN decoding. Currently, Epiphany performs both tasks itself, which is not good because most applications really ought to be doing the same thing, using the same code, and it is a significant amount of code. See ephy_uri_decode() in ephy-uri-helpers.c: it's about 40 lines of code to convert WebKitWebView:uri to be suitable for displaying to the user. That's not good!

If we add this property, we would need to include a defense against IDN homograph attacks. Epiphany does not currently have such a defense. So some research on IDN homograph defenses will be needed. I like the Firefox approach better than Chrome's approach. (Chrome completely bans characters that look like Latin characters, essentially banning Cyrillic URIs entirely. I'd rather have lower security than harm Cyrillic users like that.)

Finally, in Epiphany, we should see if we can remove ephy_uri_decode() (and maybe also ephy_uri_normalize()?) and use this new property instead.

I won't be working on this right away. Just reporting it for now!
Comment 1 Michael Catanzaro 2017-07-25 05:45:50 PDT
Also, ephy_uri_decode() currently contains a mutex, because there's no convenient location to store the ICU UIDNA context so it needs to be created inside the helper function. That's not good as it entails an unnecessary locking cost. It's also overkill, because the function is probably only ever used from the UI thread, but we don't want helper functions modifying global state like that unless they are threadsafe as that'd be quite a footgun. So we can't do better in Epiphany.

We can do better for WebKit. The UIDNA context should become a private member of WebKitWebView, so it would be created once on startup and no mutex would be needed at all.
Comment 2 Adrian Perez 2017-07-25 12:49:36 PDT
(In reply to Michael Catanzaro from comment #1)

> Also, ephy_uri_decode() currently contains a mutex [...]
> 
> We can do better for WebKit. The UIDNA context should become a private
> member of WebKitWebView, so it would be created once on startup and no mutex
> would be needed at all.

Couldn't we do _even_ better? It looks to me that the UIDNA context could
be a global singleton, AFAIU it does not need to be associated with a
WebKitWebView.
Comment 3 Michael Catanzaro 2017-07-25 13:48:58 PDT
If it consumes a substantial amount of memory (I suppose it might), then we could do that. It would be an error to access this property from any thread other than the UI thread, so there is no harm in that. I generally prefer to avoid global singletons since they've caused us so much trouble in the past, but if a UIDNA context requires a substantial amount of memory then we'd better not create a new one for each web view.
Comment 4 Ms2ger (he/him; ⌚ UTC+1/+2) 2017-10-23 06:53:53 PDT
Looking into this, I'm not sure if solving this just for WebKitWebView's uri makes sense. ephy_uri_decode is also used with urls returned from webkit_download_get_destination and webkit_hit_test_result_get_link_uri, for example. Perhaps exposing something like ephy_uri_decode directly would be better?
Comment 5 Michael Catanzaro 2017-10-23 07:10:09 PDT
Hm.

First of all: good point.

One problem with that is that we do not currently have any "free functions" -- functions that are not methods of objects -- exposed in our API. It would be a significant change for us to expose a function with no associated object. And yet, you're right, that seems to be required here. What we really want is for the functionality of ephy_uri_decode() to move into WebKit (with some added defenses against IDN homograph attacks).

I would ask Carlos Garcia if he has an API recommendation.
Comment 6 Carlos Garcia Campos 2017-10-23 23:35:54 PDT
I don't see any problem in adding global functions not attached to any particular object. We had some of those in WebKit1 API, under a globals section in the docs (not sure it's a good idea to keep using globals, but still).

https://webkitgtk.org/reference/webkitgtk/stable/webkitgtk-Global-functions.html
Comment 7 Adrian Perez 2017-10-24 00:17:40 PDT
(In reply to Carlos Garcia Campos from comment #6)
> I don't see any problem in adding global functions not attached to any
> particular object. We had some of those in WebKit1 API, under a globals
> section in the docs (not sure it's a good idea to keep using globals, but
> still).
> 
> https://webkitgtk.org/reference/webkitgtk/stable/webkitgtk-Global-functions.
> html

I don't see any problem either in having functions which are “pure” (only
read their arguments, produce a result which only depends on the input) and
don't need state (= an instance). From a functional POV one could argue that
there are the best kind of functions ;-)
Comment 8 Gabriel Ivașcu 2017-11-08 04:25:38 PST
(In reply to Michael Catanzaro from comment #0)
> If we add this property, we would need to include a defense against IDN
> homograph attacks. Epiphany does not currently have such a defense. So some
> research on IDN homograph defenses will be needed. I like the Firefox
> approach better than Chrome's approach. (Chrome completely bans characters
> that look like Latin characters, essentially banning Cyrillic URIs entirely.
> I'd rather have lower security than harm Cyrillic users like that.)

I did a bit research and if you wish to follow Mozilla's way, I think there are two approaches for this problem:

(1) Allow strings containing characters from the same language script only, e.g. all chars are Greek, all chars are Cyrillic, etc. This is easier to implement, but is more restrictive since it "blocks" non-Latin domains with characters from multiple scripts.

(2) Allow strings containing characters from the same language script and characters from some allowed combinations of scripts. This is what Mozilla does actually (https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm). It's more permissive than (1), but a bit harder to implement.

I'm willing to work on either (1) or (2). We can implement it in Epiphany first, and then move it to WebKit along with the code in ephy_uri_decode().
Comment 9 Michael Catanzaro 2017-11-08 07:39:42 PST
I think (2) would be ideal, if there's a code comment linking to the Mozilla page. But (1) would be OK if (2) is difficult.

Agreed that implementing it in Epiphany first would be best.
Comment 10 Gabriel Ivașcu 2018-01-06 03:36:34 PST
Sorry for the late reply. Do we have a resolution on whether this should be a "global" function rather than a method of WebKitWebView?
Comment 11 Michael Catanzaro 2018-01-06 09:08:46 PST
I failed to get a recommendation from Carlos Garcia as to where he wants it, so I would just add a new file (WebKitURIUtilities.h) and a new section in the documentation "URI Utilities" for them. We can always reorganize those later; the exact layout of the headers is not API.
Comment 12 Gabriel Ivașcu 2018-01-07 10:22:25 PST
Created attachment 330658 [details]
Patch
Comment 13 Gabriel Ivașcu 2018-01-07 10:52:37 PST
(In reply to Gabriel Ivașcu from comment #12)
> Created attachment 330658 [details]
> Patch

OK, seems that the new files were not added to the patch. Is there any way to tell the webkit-patch script to add the new files too to the patch?
Comment 14 Michael Catanzaro 2018-01-07 14:28:24 PST
(In reply to Gabriel Ivașcu from comment #13) 
> OK, seems that the new files were not added to the patch. Is there any way
> to tell the webkit-patch script to add the new files too to the patch?

Use 'git add' (only for the new files)

Yes, the workflow is not very good....
Comment 15 Carlos Alberto Lopez Perez 2018-01-08 04:01:54 PST
(In reply to Gabriel Ivașcu from comment #13)
> (In reply to Gabriel Ivașcu from comment #12)
> > Created attachment 330658 [details]
> > Patch
> 
> OK, seems that the new files were not added to the patch. Is there any way
> to tell the webkit-patch script to add the new files too to the patch?

I usually create a new branch, commit the (going to be uploaded) patch locally and then pass to webkit-patch the sha1 hash of the local commit_id I want to upload with the flag "-g"
Comment 16 Gabriel Ivașcu 2018-01-08 04:59:28 PST
Created attachment 330687 [details]
Patch
Comment 17 Carlos Garcia Campos 2018-01-08 05:25:20 PST
Comment on attachment 330687 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=330687&action=review

Isn't this what WebCore::userVisibleString() does? Can't we implement that in WebCore instead? or even make the current objc implementation cross-platform?

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:47
> +static const UIDNA* internationalDomainNameTranscoder()
> +{
> +    static UIDNA* transcoder;
> +    static std::once_flag onceFlag;
> +    std::call_once(onceFlag, [] {
> +        UErrorCode error = U_ZERO_ERROR;
> +        // These flags should be synced with URLParser::internationalDomainNameTranscoder() in URLParser.cpp.
> +        transcoder = uidna_openUTS46(UIDNA_CHECK_BIDI | UIDNA_CHECK_CONTEXTJ | UIDNA_NONTRANSITIONAL_TO_UNICODE | UIDNA_NONTRANSITIONAL_TO_ASCII, &error);
> +        RELEASE_ASSERT(U_SUCCESS(error));
> +        RELEASE_ASSERT(transcoder);
> +    });
> +    return transcoder;
> +}

Why are we duplicating this? Can't we just use URLParser::internationalDomainNameTranscoder()?
Comment 18 Michael Catanzaro 2018-01-08 05:48:04 PST
(In reply to Carlos Garcia Campos from comment #17)
> Isn't this what WebCore::userVisibleString() does? Can't we implement that
> in WebCore instead? or even make the current objc implementation
> cross-platform?

I didn't know about this. WebCore::userVisibleString only exists in Cocoa ports. It's declared in the Objective C++ header WebCoreNSURLExtras.h. It converts an NSURL* to an NSString*. We can certainly create a new function with the same name, converting a const char* to a char* instead, and move it to the platform layer. Then it could be used in the pasteboard code, same as on Mac, if we ever decide that it'd be desirable to copy the prettified URL to the clipboard. But I'm not sure that's actually what we want to do with the pasteboard, and we can always move it down from the API layer to the platform layer in the future if we do decide to do that. So I think it's probably better where it is now.

> > Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:47
> > +static const UIDNA* internationalDomainNameTranscoder()
> > +{
> > +    static UIDNA* transcoder;
> > +    static std::once_flag onceFlag;
> > +    std::call_once(onceFlag, [] {
> > +        UErrorCode error = U_ZERO_ERROR;
> > +        // These flags should be synced with URLParser::internationalDomainNameTranscoder() in URLParser.cpp.
> > +        transcoder = uidna_openUTS46(UIDNA_CHECK_BIDI | UIDNA_CHECK_CONTEXTJ | UIDNA_NONTRANSITIONAL_TO_UNICODE | UIDNA_NONTRANSITIONAL_TO_ASCII, &error);
> > +        RELEASE_ASSERT(U_SUCCESS(error));
> > +        RELEASE_ASSERT(transcoder);
> > +    });
> > +    return transcoder;
> > +}
> 
> Why are we duplicating this? Can't we just use
> URLParser::internationalDomainNameTranscoder()?

Yes, that's a good idea.
Comment 19 Carlos Garcia Campos 2018-01-08 06:00:32 PST
(In reply to Michael Catanzaro from comment #18)
> (In reply to Carlos Garcia Campos from comment #17)
> > Isn't this what WebCore::userVisibleString() does? Can't we implement that
> > in WebCore instead? or even make the current objc implementation
> > cross-platform?
> 
> I didn't know about this. WebCore::userVisibleString only exists in Cocoa
> ports. It's declared in the Objective C++ header WebCoreNSURLExtras.h. It
> converts an NSURL* to an NSString*. We can certainly create a new function
> with the same name, converting a const char* to a char* instead, and move it
> to the platform layer. Then it could be used in the pasteboard code, same as
> on Mac, if we ever decide that it'd be desirable to copy the prettified URL
> to the clipboard. But I'm not sure that's actually what we want to do with
> the pasteboard, and we can always move it down from the API layer to the
> platform layer in the future if we do decide to do that. So I think it's
> probably better where it is now.

I disagree, there's no point on duplicating the code. The fact that it's cocoa only doesn't mean that the code is 100% cocoa specific. It converts NSURL to NSString probably because it's more convenient, being cocoa specific code, but maybe we can share 99% of the implementation. I would check it first.
Comment 20 Michael Catanzaro 2018-01-08 07:04:10 PST
Comment on attachment 330687 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=330687&action=review

Good job as usual, I see you even converted from GLib types like GHashTable to WTF types without me having to suggest that. And you updated the WPE headers as well. Great.

Like Carlos suggests, this code is indeed lower-level than what we normally have up here in the GTK API layer. So it would indeed make sense to move the implementation down to a lower layer, and call that from the API layer. But instead of implementing a new version of WebCore::userVisibleString, I'm going to suggest moving the code directly into the URL class instead. How about URL::userVisibleURL? You'd declare it like this in the header file Source/WebCore/platform/URL.h:

#if USE(SOUP)
    URL(SoupURI*);
    GUniquePtr<SoupURI> createSoupURI() const;

    String userVisibleURL() const;
#endif

The implementation would go in Source/WebCore/platform/soup/URLSoup.cpp. You can move the webkit_uri_prettify() code there. To convert it to a SoupURI you can just use URL::createSoupURI. Then webkit_uri_prettify() just has to create a new WebCore::URL and, call the userVisibleURL method, and convert the result of that from String to char*. Should look like: g_strdup(url.userVisibleURL().utf8.data()).

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:60
> +    WTF::HashMap<guint, guint> scriptMap;

Style: use "unsigned" instead of guint in implementation code (except when working with APIs that use guint).

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:125
> + * webkit_uri_prettify:

Hm, I don't like the name.

Perhaps: webkit_uri_decode_for_display()? Or webkit_decode_uri_for_display()?

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:130
> + * WebKit may contain percent-encoded characters or Punycode, which may not be
> + * always suitable to display to users. This function provides protection

"which are not generally suitable"

> Tools/TestWebKitAPI/Tests/WebKitGLib/TestWebKitURIUtilities.cpp:31
> +        // Latin-only, OK.
> +        { "http://abcdef.com/", "http://abcdef.com/" },

Is this list copied from somewhere? Add a link so that it will be possible to understand and update it in the future. At least add a link to the Mozilla design doc, so there's some indication of what you're testing here.

> Tools/TestWebKitAPI/Tests/WebKitGLib/TestWebKitURIUtilities.cpp:69
> +        char* prettyURI = webkit_uri_prettify(items[i].input);

You can and should use smart pointers in the API tests:

GUniquePtr<char> prettyURI(webkit_uri_prettify(items[i].input));

Then you don't have to free it manually.
Comment 21 Michael Catanzaro 2018-01-08 07:46:27 PST
Also: there is a comment in URLParser::internationalDomainNameTranscoder that needs to be deleted now, since we'll be removing this code from Epiphany.

(In reply to Carlos Garcia Campos from comment #19)
> I disagree, there's no point on duplicating the code. The fact that it's
> cocoa only doesn't mean that the code is 100% cocoa specific. It converts
> NSURL to NSString probably because it's more convenient, being cocoa
> specific code, but maybe we can share 99% of the implementation. I would
> check it first.

I was going to say "looks like a lot of code to just reimplement g_uri_unescape_string()", but at the bottom it also does some map hostnames thing... then I scrolled up... and, OK, the mapping is a similar calculation to what Gabriel has done here. So you're right: it would be better to share this code with Mac, rather than to implement our own version with a separate set of bugs that won't get updated in the future. It looks like WebCoreNSURLExtras implements even more checks than ours, or at least different checks.

So, Gabriel, sharing the code with Mac will probably consume the rest of your project hours. Do you think you have enough hours left to work on this? It won't be easy, because you probably don't have a Mac to be able to run the Objective C++ code, and who knows how all these NS APIs work... the cost of having solved hard problems in the past is that people then give you more hard problems. :P I guess the goal would be to (a) understand how WebCore::userVisibleString in WebCoreNSURLExtras.mm works, (b) move the code into either WebCore::URL (or maybe WebCore::URLParser?), eliminating the use of Objective C++ and the NS dependencies, and (c) decide how to handle the differences between its implementation and yours. Then (d) move your test to Tools/TestWebKitAPI/Tests/WebCore/URL.cpp (or maybe URLParser.cpp).

I see Carlos has already CCed Alex Christensen from Apple on this bug. He wrote URLParser and is an expert on these URL classes, so if you have questions, he might know the answer.
Comment 22 Carlos Garcia Campos 2018-01-09 00:48:53 PST
Note there are already unit tests in Tools/TestWebKitAPI/Tests/WebCore/cocoa/URLExtras.mm
Comment 23 Carlos Garcia Campos 2018-01-09 00:49:56 PST
Comment on attachment 330687 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=330687&action=review

>> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:125
>> + * webkit_uri_prettify:
> 
> Hm, I don't like the name.
> 
> Perhaps: webkit_uri_decode_for_display()? Or webkit_decode_uri_for_display()?

Or simply webkit_uri_for_display().
Comment 24 Michael Catanzaro 2018-01-09 07:56:23 PST
(In reply to Carlos Garcia Campos from comment #23)
> Or simply webkit_uri_for_display().

Yes, that's a good name!
Comment 25 Gabriel Ivașcu 2018-01-09 13:12:26 PST
(In reply to Michael Catanzaro from comment #21)
> So, Gabriel, sharing the code with Mac will probably consume the rest of
> your project hours. Do you think you have enough hours left to work on this?
> It won't be easy, because you probably don't have a Mac to be able to run
> the Objective C++ code, and who knows how all these NS APIs work... the cost
> of having solved hard problems in the past is that people then give you more
> hard problems. :P I guess the goal would be to (a) understand how
> WebCore::userVisibleString in WebCoreNSURLExtras.mm works, (b) move the code
> into either WebCore::URL (or maybe WebCore::URLParser?), eliminating the use
> of Objective C++ and the NS dependencies, and (c) decide how to handle the
> differences between its implementation and yours. Then (d) move your test to
> Tools/TestWebKitAPI/Tests/WebCore/URL.cpp (or maybe URLParser.cpp).

Thanks for summarizing it like this. This will not be a trivial task for me because I don't have any knowledge of Objective C++ and NS API, but I'm willing to give it a try. That's also why this is going to probably take me a bit more time than usual. I'll try to come up with a patch as soon as possible. I still need to ask: are you and Carlos sure that this is going to work as intended, i.e. there isn't any Mac-specific code in WebCoreNSURLExtras that would prevent us from moving the implementation to the WebCore layer?
Comment 26 Michael Catanzaro 2018-01-09 20:23:45 PST
Not at all sure. Would it still be fun if you knew in advance that the task is possible? :D

I think it should be, though, and you have a couple advantages:

 * This file contains very few Objective C style function calls, the weird [[ ]] things. I've never been able to wrap my head around those.
 * It also contains fairly few NS and CF API calls, it looks like almost all can be eliminated by converting from NSString to String, and these APIs are well-documented on the web. That will probably be most of the work, and it looks like the main reason this file is not cross-platform.
 * There are a bunch of comments to help.

But working against you is that (a) reading other people's code is hard, (b) understanding other people's code well enough to rewrite it in another language is harder, (c) this code is complex, and (d) it's not a small amount. So this is certainly a hard task.

Let's narrow the task a bit, to something I'm more confident is doable. I would try to start with only a small portion, the most important part of the code to share: the part that handles the decision of whether to decode the IDN or not. The code that's directly in WebCore::userVisibleString looks less-important to share; looks like most of that code is unescaping the string in the manner of g_uri_unescape_string(). Even the code in WebCore::mapHostNames and WebCore::applyHostNameFunctionToURLString looks non-essential to share: it looks like it's mostly splitting up the URL string into the host component (which SoupURI does for us) and then into labels (which your patch already does). That also looks like the hardest portion of the code to rewrite, since it makes heavy use of NSString and NS/CF API calls. So forget about sharing that, at least for now. The part that looks most interesting to share begins in WebCore::mapHostNameWithRange. I think we really want to share WebCore::allCharactersInIDNScriptWhiteList, WebCore::allCharactersAllowedByTLDRules, and all the code higher than that in WebCoreNSURLExtras.mm. Forget about the code below it: that's too much work to rewrite to be cross-platform. (If Carlos isn't satisfied by that, maybe he can convert the rest. :D But I think he'll be pleased to just share the IDN code.) So I think you can keep your code to percent-decode the URL, and your code to split it into labels, and just focus on sharing the IDN-related code. Fortunately, that looks like it uses very little Objective C++ and very few NS/CF API calls.

Probably the best place for all that code, once it's rewritten as C++, would be WebCore::URLParser. Or maybe WebCore::URL. I'd start with WebCore::URLParser, knowing it will be easy to move later if requested, once it's out of WebCoreNSURLExtras.h.

Now, it does look like Apple's strategy for handling IDN homographs is different from Firefox's. It seems to be based on hardcoded knowledge of the different character set policies of different TLDs. I'll leave you to study that and suggest changes if you think it appropriate, keeping in mind that any changes here would need to be approved by Apple. Usually we just match Apple's behavior,  since that's easiest, unless we have a strong reason to try to change it, but if you think the Firefox approach you already implemented is better, that would certainly be worth discussing.

P.S. Remember to delete the comment in URLParser::internationalDomainNameTranscoder!
Comment 27 Michael Catanzaro 2018-01-09 20:26:03 PST
That should cut your task roughly in half, BTW, so hopefully that will improve your confidence. ;)
Comment 28 Gabriel Ivașcu 2018-01-15 08:58:50 PST
Created attachment 331338 [details]
Patch
Comment 29 Gabriel Ivașcu 2018-01-15 09:06:19 PST
Please let me know what you think of the above patch.

As far as I could reckon, Apple uses a different policy than Mozilla when it comes to interpreting Unicode characters in URLs, virtually banning almost every Unicode character. I removed all own script code computations from my previous patch and went with what was already provided in WebCoreNSURLExtras.mm, therefore reducing the number of test cases from my previous patch.
Comment 30 Gabriel Ivașcu 2018-01-15 09:09:34 PST
Created attachment 331339 [details]
Patch
Comment 31 Build Bot 2018-01-15 10:11:05 PST Comment hidden (obsolete)
Comment 32 Build Bot 2018-01-15 10:11:06 PST Comment hidden (obsolete)
Comment 33 Build Bot 2018-01-15 10:17:46 PST Comment hidden (obsolete)
Comment 34 Build Bot 2018-01-15 10:17:48 PST Comment hidden (obsolete)
Comment 35 Build Bot 2018-01-15 10:42:18 PST Comment hidden (obsolete)
Comment 36 Build Bot 2018-01-15 10:42:20 PST Comment hidden (obsolete)
Comment 37 Michael Catanzaro 2018-01-15 11:18:56 PST
That's exactly what I was hoping to see. Alex, does it look like a good approach to you?

Now the "fun" part will be investigating why those four tests are failing. Hopefully the problem with the tests is in the cross-platform code, and you'll be able to reproduce them with WebKitGTK+. Try 'Tools/Scripts/run-webkit-tests --force fast/url/user-visible/'. But the failures could also be due to your changes in NSWebCoreURLExtras.mm. In that case, you're unlucky. If you don't think you can guess the problem for the test output, the easiest way to do this is then to upload patches with added debugging and then wait for the bots to spit out the results.
Comment 38 Michael Catanzaro 2018-01-15 11:21:28 PST
(In reply to Michael Catanzaro from comment #37)
> Try 'Tools/Scripts/run-webkit-tests --force fast/url/user-visible/'

If you haven't already, you'll first need to:

$ Tools/Scripts/update-webkitgtk-libs
$ Tools/Scripts/build-webkit --gtk

That will get you a release build under WebKitBuild/Release, where the test runner expects it to be. You could also try copying your build directory from wherever it is now (it would be WebKitBuild/GNOME if you've followed my instructions for WebKit/Epiphany development), but I'm not sure if that will work.

I haven't found a better way to handle this, I just build WebKit again whenever I need to switch between working on tests and working on Epiphany.
Comment 39 Michael Catanzaro 2018-01-15 11:34:18 PST
Comment on attachment 331339 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=331339&action=review

r- for the failing tests.

> Source/WebCore/platform/URLParser.cpp:3265
> +String URLParser::ICUConvertHostName(const String& hostName, bool encode, const uint32_t (&IDNScriptBlackList)[(USCRIPT_CODE_LIMIT + 31) / 32])

Hm, I didn't  know you could declare array arguments like this, except in templates. Does the size specification actually do anything? I bet it's ignored. A better declaration would be:

const uint32_t* IDNScriptBlackList

which everyone will understand.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:46
> + * Return value: (transfer full): @uri suitable for display

Returns:

("Return value" is an older syntax.)

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:52
> +    auto URI = WebCore::URL(WebCore::ParsedURLStringTag::ParsedURLString, WTF::String(uri));

WTF::String is hoisted into the global namespace with a using declaration in its header file, so you don't have to qualify it with WTF::.

You do need to use String::fromUTF8 to create the string, though, since WebKit unfortunately does not use UTF-8 internally, like GNOME code does.

Also, URI in all caps is not a great name. How about "coreURI"?

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:55
> +    URI.setPass(WTF::emptyString());

Ditto, you don't need WTF:: here. (See the bottom of WTFString.h.)

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:58
> +    URI.setHost(WebCore::URLParser::ICUConvertHostName(String(percentDecodedHost.get()), false));

Use String::fromUTF8
Comment 40 Gabriel Ivașcu 2018-01-17 04:55:22 PST
Seems that each of the failing tests makes use of at least one function from WebCoreNSURLExtras.h that internally calls mapHostNameWithRange(), which is the function from WebCoreNSURLExtras.mm that I've modified. It's odd that so far I got the expected output when testing the new URLParser::ICUConvertHostName() function in WebKitGTK+ against the host names of the test URLs from Mac. I'll keep investigating, it must be something that I'm missing.
Comment 41 Gabriel Ivașcu 2018-01-20 03:47:01 PST
Created attachment 331840 [details]
Patch
Comment 42 Build Bot 2018-01-20 04:47:53 PST Comment hidden (obsolete)
Comment 43 Build Bot 2018-01-20 04:47:54 PST Comment hidden (obsolete)
Comment 44 Build Bot 2018-01-20 04:54:16 PST Comment hidden (obsolete)
Comment 45 Build Bot 2018-01-20 04:54:18 PST Comment hidden (obsolete)
Comment 46 Build Bot 2018-01-20 05:13:23 PST Comment hidden (obsolete)
Comment 47 Build Bot 2018-01-20 05:13:24 PST Comment hidden (obsolete)
Comment 48 Michael Catanzaro 2018-01-24 18:02:34 PST
At this point, instead of guessing in the dark as to what might be wrong, I would add some debug statements and upload them in a patch. I know it's weird to upload a debug patch to Bugzilla, but anything you print to stderr will show up in the layout test results that Build Bot spits out. That way, you can check the stderr output for each of the four failing tests. If you add enough debug then you should eventually find the problem.

Alternatively, I could try to get you access to Igalia's Mac server, but I've never used it before, and I bet it would be harder to learn how to use it than to just fight Build Bot the old fashioned way.
Comment 49 Alex Christensen 2018-01-25 11:04:40 PST
Comment on attachment 331840 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=331840&action=review

> Source/WebCore/platform/URLParser.h:56
> +    template<typename CharacterType> static bool isLookalikeCharacter(CharacterType, CharacterType anotherCharacter);
> +    static String ICUConvertHostName(const String& hostName, bool encode, const uint32_t (&IDNScriptBlackList)[(USCRIPT_CODE_LIMIT + 31) / 32] = { 0 });

I'm not against moving this from a .mm to a sharable location, but the URLParser is an implementation of https://url.spec.whatwg.org so let's make a new class for this.

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:48
> -static uint32_t IDNScriptWhiteList[(USCRIPT_CODE_LIMIT + 31) / 32];
> +static uint32_t IDNScriptBlackList[(USCRIPT_CODE_LIMIT + 31) / 32];

Why are you making this change?  Let's not change the functionality of the ObjC code.

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:639
> -    std::optional<UChar32> previousCodePoint;
> +    UChar32 previousCodePoint = INT32_MAX;

This doesn't look good.  We should use std::optional instead of having a sentinel value.  Also, there's a signed/unsigned mismatch, but we shouldn't do this anyway.
Comment 50 Gabriel Ivașcu 2018-01-27 04:14:01 PST
(In reply to Alex Christensen from comment #49)
> Comment on attachment 331840 [details]

> > Source/WebCore/platform/URLParser.h:56
> > +    template<typename CharacterType> static bool isLookalikeCharacter(CharacterType, CharacterType anotherCharacter);
> > +    static String ICUConvertHostName(const String& hostName, bool encode, const uint32_t (&IDNScriptBlackList)[(USCRIPT_CODE_LIMIT + 31) / 32] = { 0 });
> 
> I'm not against moving this from a .mm to a sharable location, but the
> URLParser is an implementation of https://url.spec.whatwg.org so let's make
> a new class for this.

Hmm OK but that will make a class with only two static methods.

> > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:48
> > -static uint32_t IDNScriptWhiteList[(USCRIPT_CODE_LIMIT + 31) / 32];
> > +static uint32_t IDNScriptBlackList[(USCRIPT_CODE_LIMIT + 31) / 32];
> 
> Why are you making this change?  Let's not change the functionality of the
> ObjC code.

I changed this so that callers that don't use script white/black lists wouldn't have to create explicitly a whitelist with all bits set to one to pass to the function. This way, they would leave the function's default blacklist's value of all zeros, i.e. all scripts are allowed. I can change it back however.
 
> > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:639
> > -    std::optional<UChar32> previousCodePoint;
> > +    UChar32 previousCodePoint = INT32_MAX;
> 
> This doesn't look good.  We should use std::optional instead of having a
> sentinel value.  Also, there's a signed/unsigned mismatch, but we shouldn't
> do this anyway.

OK. But UChar32 is defined as int32_t in WTF/icu/unicode/umachine.h.
Comment 51 Gabriel Ivașcu 2018-01-27 04:15:20 PST
(In reply to Michael Catanzaro from comment #48)
> At this point, instead of guessing in the dark as to what might be wrong, I
> would add some debug statements and upload them in a patch. I know it's
> weird to upload a debug patch to Bugzilla, but anything you print to stderr
> will show up in the layout test results that Build Bot spits out. That way,
> you can check the stderr output for each of the four failing tests. If you
> add enough debug then you should eventually find the problem.

I will give it one more try before resorting to this.
Comment 52 Michael Catanzaro 2018-01-27 09:50:02 PST
(In reply to Gabriel Ivașcu from comment #50)
> Hmm OK but that will make a class with only two static methods.

In that case, I would use a namespace instead of a class. It's fine to have global functions outside of any class in C++. The namespace is even optional, but I think it's helpful to group related functions together in cases like this.

> > > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:48
> I changed this so that callers that don't use script white/black lists
> wouldn't have to create explicitly a whitelist with all bits set to one to
> pass to the function. This way, they would leave the function's default
> blacklist's value of all zeros, i.e. all scripts are allowed. I can change
> it back however.

I think the change is good (as long as it's not what's causing the tests to break ;)
  
> > > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:639
> > > -    std::optional<UChar32> previousCodePoint;
> > > +    UChar32 previousCodePoint = INT32_MAX;
> > 
> > This doesn't look good.  We should use std::optional instead of having a
> > sentinel value.  Also, there's a signed/unsigned mismatch, but we shouldn't
> > do this anyway.
> 
> OK. But UChar32 is defined as int32_t in WTF/icu/unicode/umachine.h.

The U means "Unicode", not "unsigned"... but yes, much better to use std::optional here instead. Also in hasCharactersInIDNScriptBlackList as well.

(In reply to Alex Christensen from comment #49)
> I'm not against moving this from a .mm to a sharable location, but the
> URLParser is an implementation of https://url.spec.whatwg.org so let's make
> a new class for this.

Gabriel, the tricky part here is that you cannot create new cross-platform files without breaking the Mac build. You can add the new file in WebCore/CMakeLists.txt, but you'll need help from Alex to modify your patch in order to add the new file to the XCode build system, or the Mac EWS is going to be red, which will not be helpful in trying to figure out if your patch passes the Mac tests or not.

So for now, my recommendation is to move the code to the URL class instead of URLParser, and make sure it works there. Then, only once it passes tests on Mac, we can move it into a new file if Alex wants to help with that. I kinda think that URL might be a good permanent home for this code, since that keeps it out of URLParser, and being able to call url.displayURL() would be nice, but we'll do whatever Alex prefers.
Comment 53 Gabriel Ivașcu 2018-01-27 10:50:41 PST
Created attachment 332474 [details]
Patch
Comment 54 Build Bot 2018-01-27 11:51:10 PST Comment hidden (obsolete)
Comment 55 Build Bot 2018-01-27 11:51:11 PST Comment hidden (obsolete)
Comment 56 Build Bot 2018-01-27 11:56:37 PST Comment hidden (obsolete)
Comment 57 Build Bot 2018-01-27 11:56:38 PST Comment hidden (obsolete)
Comment 58 Build Bot 2018-01-27 12:17:00 PST Comment hidden (obsolete)
Comment 59 Build Bot 2018-01-27 12:17:02 PST Comment hidden (obsolete)
Comment 60 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-11 03:37:41 PDT
Comment on attachment 332474 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=332474&action=review

> Source/WebCore/platform/URLParser.cpp:3073
> +        if (previousCodePoint != INT32_MAX && URLParser::isLookalikeCharacter(c, previousCodePoint))

Adding the `previousCodePoint != INT32_MAX` seems likely to change the behaviour here.

> Source/WebCore/platform/URLParser.cpp:3290
> +    if (allCharactersInIDNScriptWhiteList(outputBuffer, outputLength, IDNScriptWhiteList) && allCharactersAllowedByTLDRules(outputBuffer, outputLength))

In inverting the condition, you changed !(!A && !B) to (A && B) where it should have been (A || B); this is (at least one of the) bug(s) that caused the test failures.
Comment 61 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-11 03:37:48 PDT
View in context: https://bugs.webkit.org/attachment.cgi?id=332474&action=review

> Source/WebCore/platform/URLParser.cpp:3073
> +        if (previousCodePoint != INT32_MAX && URLParser::isLookalikeCharacter(c, previousCodePoint))

Adding the `previousCodePoint != INT32_MAX` seems likely to change the behaviour here.

> Source/WebCore/platform/URLParser.cpp:3290
> +    if (allCharactersInIDNScriptWhiteList(outputBuffer, outputLength, IDNScriptWhiteList) && allCharactersAllowedByTLDRules(outputBuffer, outputLength))

In inverting the condition, you changed !(!A && !B) to (A && B) where it should have been (A || B); this is (at least one of the) bug(s) that caused the test failures.
Comment 62 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-11 06:56:49 PDT
(I'm preparing a new patch for review.)
Comment 63 Gabriel Ivașcu 2018-06-12 02:27:42 PDT
(In reply to Ms2ger from comment #60)
> In inverting the condition, you changed !(!A && !B) to (A && B) where it
> should have been (A || B); this is (at least one of the) bug(s) that caused
> the test failures.

You're so right, too bad for me for missing this.

(In reply to Ms2ger from comment #62)
> (I'm preparing a new patch for review.)

Thank you for doing this!
Comment 64 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-12 09:03:11 PDT
Created attachment 342542 [details]
Patch
Comment 65 Build Bot 2018-06-12 09:06:02 PDT Comment hidden (obsolete)
Comment 66 Michael Catanzaro 2018-06-12 09:40:21 PDT
(In reply to Build Bot from comment #65)
> Attachment 342542 [details] did not pass style-queue:
> 
> 
> WARNING: File exempt from style guide. Skipping:
> "Source/WebKit/UIProcess/API/gtk/webkit2.h"
> ERROR: Source/WebCore/platform/URLParser.cpp:2928:  A case label should not
> be indented, but line up with its switch statement.  [whitespace/indent] [4]
> ERROR: Source/WebCore/platform/URLParser.cpp:3120:  Tests for true/false,
> null/non-null, and zero/non-zero should all be done without equality
> comparisons.  [readability/comparison_to_zero] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3140:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3149:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3161:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3171:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3184:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3194:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3205:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3214:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3226:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3238:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3250:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3262:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3274:  One space before end of
> line comments  [whitespace/comments] [5]
> ERROR: Source/WebCore/platform/URLParser.cpp:3304:  One line control clauses
> should not use braces.  [whitespace/braces] [4]

I know you're just moving code, but best fix these nits at the same time.

> ERROR: Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:65:  Missing
> space inside { }.  [whitespace/braces] [5]
> ERROR: Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:73:  One line
> control clauses should not use braces.  [whitespace/braces] [4]
> WARNING: File exempt from style guide. Skipping:
> "Source/WebKit/UIProcess/API/wpe/webkit.h"
> WARNING: File exempt from style guide. Skipping:
> "Source/WebKit/UIProcess/API/gtk/WebKitURIUtilities.h"
> WARNING: File exempt from style guide. Skipping:
> "Source/WebKit/UIProcess/API/wpe/WebKitURIUtilities.h"
> ERROR: Source/WebCore/ChangeLog:10:  You should remove the 'No new tests'
> and either add and list tests, or explain why no new tests were possible. 
> [changelog/nonewtests] [5]
> Total errors found: 19 in 19 files

And the rest of these ERRORs, too.
Comment 67 Michael Catanzaro 2018-06-12 10:07:54 PDT
Comment on attachment 342542 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=342542&action=review

Thanks a bunch for resurrecting this work.

The main issue is that Alex requested that we not add this to URLParser:

(In reply to Alex Christensen from comment #49)
> I'm not against moving this from a .mm to a sharable location, but the
> URLParser is an implementation of https://url.spec.whatwg.org so let's make
> a new class for this.

So some new home will be needed, e.g. URLHelpers.cpp or URLDecoder.cpp, something like that. r- for this

> Source/WebCore/ChangeLog:8
> +        This code is based almost entirely on code by Gabriel Ivascu.

We often credit multiple authors on the authorship line:

2018-06-12 Ms2ger <Ms2ger@igalia.com> and Gabriel Ivascu <gabrielivascu@gnome.org>

> Source/WebCore/platform/URLParser.cpp:2886
> +
> +

Let's drop down to one blank line between functions.

> Source/WebCore/platform/URLParser.cpp:2910
> +
> +
> +

Ditto.

> Source/WebCore/platform/URLParser.h:56
> +    static String ICUConvertHostName(const String& hostName, bool encode, const uint32_t (&IDNScriptWhiteList)[(USCRIPT_CODE_LIMIT + 31) / 32], bool* error);

This function name is rather ambiguous... how about DecodeHostnameForDisplay?

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:43
> + * against IDN homograph attacks so in some cases the host part of the returned

comma after "attacks"

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:48
> +gchar* webkit_uri_for_display(const gchar* uri)

We might want to bikeshed about the name. It's fine with me, but please ask Carlos Garcia if he has a preference here.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:52
> +    auto coreURI = WebCore::URL(WebCore::URL(), String::fromUTF8(uri));

I think we normally use bracket initialization for URLs:

auto coreURI = WebCore::URL { { }, String::fromUTF8(uri) };

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:54
> +    if (!coreURI.isValid())
> +        return g_strdup(uri);

Hm, we should document that if the input is not a valid URI, it returns the original input. I wonder if it would be better to return nullptr when the input is not a valid URI, though. Probably? I'd like an opinion from Carlos Garcia on this. The behavior of strdup()ing the original input is coming from ephy_uri_decode(), which is designed to always return a value for convenience, but it might be better for public API to indicate the error. In that case, the return value would need to be annotated as (nullable) and everywhere we currently return g_strdup(uri) we would need to instead return nullptr.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:60
> +    // Remove password and percent-decode host name.
> +    coreURI.setPass(emptyString());
> +    auto soupURI = coreURI.createSoupURI();
> +    if (!soupURI.get()->host)
> +        return g_strdup(uri);

If the URI contains a password but not a hostname, which is valid, then we fail to remove the password. Oops. That's true for the subsequent early return on line 69, as well. Returning null would avoid this.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:65
> +    uint32_t IDNScriptWhiteList[(USCRIPT_CODE_LIMIT + 31) / 32] = {};

The third parameter of ICUConvertHostName should be declared to use a typedef (or using statement), then this line could become something much saner, like:

ICUConvertHostnameWhitelist whitelist = {};

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:73
> +    if (convertedHostName.isNull()) {
> +        convertedHostName = percentDecodedHost;
> +    }

No brackets around a one-line condition

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:80
> +    // Now, decode any percent-encoded characters in the URI. If there are null
> +    // characters or escaped slashes, this returns nullptr, so just display the
> +    // encoded URI in that case.

I'm confused, why are percent-decoding the hostname twice? Epiphany feeds the percent-encoded URI into ICU to decode the punycode, then percent-decodes. I think it did not work properly when doing it the other way around. But here you are doing it twice, once before decoding punycode and once after. What is the correct approach?

This would also be simpler if we return nullptr on error.
Comment 68 Michael Catanzaro 2018-06-12 10:10:25 PDT
(In reply to Michael Catanzaro from comment #67)
> This function name is rather ambiguous... how about DecodeHostnameForDisplay?

I guess that name would only be appropriate if the function were to perform the percent-decoding as well. Perhaps simply: DecodePunycode?
Comment 69 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-18 07:18:07 PDT
(In reply to Michael Catanzaro from comment #67)
> Comment on attachment 342542 [details]
> Patch
> 
> > Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:52
> > +    auto coreURI = WebCore::URL(WebCore::URL(), String::fromUTF8(uri));
> 
> I think we normally use bracket initialization for URLs:
> 
> auto coreURI = WebCore::URL { { }, String::fromUTF8(uri) };

Interestingly, this interprets the { } as initializing a WebCore::ParsedURLStringTag. Using WebCore::URL { } helps. (Yay, C++.)
Comment 70 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-19 08:37:14 PDT
Created attachment 343057 [details]
Patch
Comment 71 Build Bot 2018-06-19 08:39:59 PDT Comment hidden (obsolete)
Comment 72 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-19 09:04:28 PDT
Created attachment 343059 [details]
Patch
Comment 73 Build Bot 2018-06-19 10:50:09 PDT Comment hidden (obsolete)
Comment 74 Build Bot 2018-06-19 10:50:10 PDT Comment hidden (obsolete)
Comment 75 Michael Catanzaro 2018-06-19 13:57:01 PDT
Comment on attachment 343059 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=343059&action=review

Great! Just a couple more comments from me regarding the public functions of the new URLHelpers class. Other than that, it looks ready.

> Source/WebCore/ChangeLog:20
> +        * platform/URLParser.cpp:
> +        (WebCore::isArmenianLookalikeCharacter): code moved without semantic changes.
> +        (WebCore::isArmenianScriptCharacter): code moved without semantic changes.
> +        (WebCore::isASCIIDigitOrValidHostCharacter): code moved without semantic changes.
> +        (WebCore::URLParser::isLookalikeCharacter): code moved without semantic changes.
> +        (WebCore::allCharactersInIDNScriptWhiteList): code moved without semantic changes.
> +        (WebCore::isSecondLevelDomainNameAllowedByTLDRules): code moved without semantic changes.
> +        (WebCore::isRussianDomainNameCharacter): code moved without semantic changes.
> +        (WebCore::allCharactersAllowedByTLDRules): code moved without semantic changes.
> +        (WebCore::URLParser::ICUConvertHostName): add.
> +        * platform/URLParser.h: add signatures.

The ChangeLog is outdated and should be regenerated with prepare-ChangeLog.

> Source/WebCore/platform/URLHelpers.h:34
> +namespace std {
> +template<typename> class optional;
> +}

I'm glad you tried to avoid a forward declaration, but this is too much work IMO, I would just #include <wtf/Optional.h>

> Source/WebCore/platform/URLHelpers.h:43
> +    static bool isLookalikeCharacter(std::optional<UChar32>, UChar32);
> +    static String decodePunycode(const String& hostName, bool encode, const ICUConvertHostnameWhitelist& IDNScriptWhiteList, bool* error);

Hm, this is still not great... the function signatures are designed as if they're file-private (which they used to be), so they are complex to make them easier to use within their file, but now they are public and it's more important for the signatures to be easy to use.

What are the two arguments to isLookalikeCharacter? They should have descriptive names in the header (previousCodePoint and charCode).

Now that you've renamed it to decodePunycode, I notice that function named decodePunycode should really probably not have a bool encode parameter. This public interface should probably offer a separate encodePunycode function, with the bool parameter removed.

And lastly, returning error via a bool out parameter is pretty weird. Shouldn't it at least be a reference, rather than a pointer?

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:111
>          if (!readIDNScriptWhiteListFile([bundle pathForResource:@"IDNScriptWhiteList" ofType:@"txt"]))
>              CRASH();

What's this whitelist file? I see we do not use the whitelist. Is that bad?
Comment 76 Tim Horton 2018-06-19 15:33:07 PDT
Comment on attachment 343059 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=343059&action=review

> Source/WebCore/platform/URLHelpers.cpp:12
> + * Copyright (C) 2018 Metrological Group B.V.
> + * Copyright (C) 2018 Igalia S.L.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.

You can't replace the copyright header on this much moved code. You can add to it, but not replace it.
Comment 77 Tim Horton 2018-06-19 15:38:28 PDT
Comment on attachment 343059 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=343059&action=review

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:137
> +    NSString* substring = [string substringWithRange:range];

All your NSString stars are on the wrong side (C++ on the left, ObjC on the right). It's a little weird, but it's what we do.

> Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:145
> +    NSString* convertedString_(convertedString);

What is this line for? Why not just return convertedString and/or cast at the return site if needed?
Comment 78 Michael Catanzaro 2018-06-25 08:55:49 PDT
Comment on attachment 343059 [details]
Patch

This is really close, but not quite ready for r+ as there is outstanding review feedback from myself and Tim.
Comment 79 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-28 06:07:57 PDT
Created attachment 343806 [details]
Patch
Comment 80 Michael Catanzaro 2018-06-28 12:19:37 PDT
LGTM, except for one unresolved question:

(In reply to Michael Catanzaro from comment #75)
> > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:111
> >          if (!readIDNScriptWhiteListFile([bundle pathForResource:@"IDNScriptWhiteList" ofType:@"txt"]))
> >              CRASH();
> 
> What's this whitelist file? I see we do not use the whitelist. Is that bad?
Comment 81 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-06-28 23:58:51 PDT
(In reply to Michael Catanzaro from comment #80)
> LGTM, except for one unresolved question:
> 
> (In reply to Michael Catanzaro from comment #75)
> > > Source/WebCore/platform/mac/WebCoreNSURLExtras.mm:111
> > >          if (!readIDNScriptWhiteListFile([bundle pathForResource:@"IDNScriptWhiteList" ofType:@"txt"]))
> > >              CRASH();
> > 
> > What's this whitelist file? I see we do not use the whitelist. Is that bad?

That's not really clear to me; maybe Gabriel remembers?
Comment 82 Gabriel Ivașcu 2018-06-29 07:31:22 PDT
(In reply to Ms2ger from comment #81)
> > > What's this whitelist file? I see we do not use the whitelist. Is that bad?
> 
> That's not really clear to me; maybe Gabriel remembers?

IIRC the current Objective C++ code that runs on Mac and iOS reads the whitelisted characters from a file on disk and builds a bit array used to verify that all characters from a string from a given script are allowed. In WebKitGTK+ I think all characters can be considered whitelisted for now so the bit array can be filled with all ones.
Comment 83 Michael Catanzaro 2018-06-29 07:39:51 PDT
Comment on attachment 343806 [details]
Patch

I found it at Source/WebCore/Resources and it looks important. :) I think we should add it to WebKitResources in Source/WebKit/PlatformGTK.cmake and Source/WebKit/PlatformWPE.cmake and load it using g_resources_lookup_data(). Almost there....
Comment 84 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-07-04 09:15:45 PDT
Created attachment 344291 [details]
Patch
Comment 85 Build Bot 2018-07-04 11:03:50 PDT Comment hidden (obsolete)
Comment 86 Build Bot 2018-07-04 11:03:52 PDT Comment hidden (obsolete)
Comment 87 Build Bot 2018-07-04 13:23:11 PDT Comment hidden (obsolete)
Comment 88 Build Bot 2018-07-04 13:23:23 PDT Comment hidden (obsolete)
Comment 89 Carlos Garcia Campos 2018-07-05 00:42:22 PDT
I still don't understand why we need to use internal parts of WebCoreNSURLExtras. What's the difference between userVisibleString and webkit_uri_for_display? is that we also decoded percentage encoded parts? To make all this simpler I would split this patch in two:

 - The first patch would do the WebCore refactoring, making userVisibleString() cross-platform. We could use platform specific bits for reading the whitelist file, but everything done in WebCore. This patch should also make Tools/TestWebKitAPI/Tests/WebCore/cocoa/URLExtras.mm cross-platform, at least the tests for userVisibleString(). And we could also make Internals::userVisibleString cross platform and enable the layout tests using it for al platforms. There are a few other places in WebCore cross-platform code where we might remove cocoa ifdefs.

 - The second patch simply uses WebCore::userVisibleString() to expose webkit_uri_for_display(). This should include its own unit test since it seems we also decode the percentage encoded parts. I will add specific comments about this part in this bug.
Comment 90 Carlos Garcia Campos 2018-07-05 01:07:09 PDT
Comment on attachment 344291 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=344291&action=review

> Source/WebKit/SourcesGTK.txt:169
> +UIProcess/API/glib/WebKitURIUtilities.cpp @no-unify

I think this could be shared, you might want to use it in a web extension. I would move it to Shared/API/glib.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:36
> + **/

**/ -> */

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:40
> +static void readIDNScriptWhiteListFile()

As I commented before I think this could also be refactored to be done by WebCore in cross-platform code. We only need to had a function to read the data that would be platform specific.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:105
> + **/

Since: 2.22
**/ -> */

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:121
> +    if (!coreURI.isValid())
> +        return nullptr;
> +
> +    // Remove password and percent-decode host name.
> +    coreURI.setPass(emptyString());
> +    auto soupURI = coreURI.createSoupURI();
> +    if (!soupURI.get()->host)
> +        return nullptr;
> +
> +    GUniquePtr<gchar> percentDecodedHostChars(soup_uri_decode(soupURI.get()->host));
> +    auto percentDecodedHost = String::fromUTF8(percentDecodedHostChars.get());

I wonder if we could use WebCore::URLParser for this instead of libsoup to avoid the utf16 -> utf8 -> utf16 conversions

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:128
> +    auto convertedHostName = WebCore::URLHelpers::decodePunycode(percentDecodedHost, IDNScriptWhiteList, error);
> +    if (error)
> +        return nullptr;

Maybe we could add a GError** parameter to provide more information about what failed.

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:137
> +    // Now, decode any percent-encoded characters in the URI.
> +    GUniquePtr<char> percentEncodedURI(soup_uri_to_string(soupURI.get(), FALSE));

I don't understand this. The comment says we are decoding the uri, but the variable name says it's encoded. Does soup_uri_to_string() decode percentage encoded parts or not? Why do we need to do the hostname part separately? Can't we simply call userVisibleString() and decode the resulting string?

> Source/WebKit/UIProcess/API/gtk/WebKitURIUtilities.h:20
> +#if !defined(__WEBKIT2_H_INSIDE__) && !defined(WEBKIT2_COMPILATION)

&& !defined(__WEBKIT_WEB_EXTENSION_H_INSIDE__) if we make this shared.

> Source/WebKit/UIProcess/API/wpe/WebKitURIUtilities.h:20
> +#if !defined(__WEBKIT_H_INSIDE__) && !defined(WEBKIT2_COMPILATION)

Ditto.

> Tools/TestWebKitAPI/Tests/WebKitGLib/TestWebKitURIUtilities.cpp:53
> +        { "http://www.%7Bexample%7D.com/", "http://www.{example}.com/" },
> +        { "http://example.com/a%2Fb", nullptr }, // '/' in path needs to remain encoded.

I would like to see more test cases about the percent-decoding thing which is what we make different compared to WebCore:userVisibleString().
Comment 91 Michael Catanzaro 2018-07-17 18:56:30 PDT
This looks at risk of slipping; keep in mind that API freeze begins July 30.
Comment 92 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-12-07 05:14:24 PST
Created attachment 356804 [details]
Patch
Comment 93 Build Bot 2018-12-07 05:16:48 PST Comment hidden (obsolete)
Comment 94 Build Bot 2018-12-07 07:09:07 PST Comment hidden (obsolete)
Comment 95 Build Bot 2018-12-07 07:09:10 PST Comment hidden (obsolete)
Comment 96 Michael Catanzaro 2018-12-07 19:32:46 PST
From WPE EWS:

lib/libWTF.a(lib/../Source/WTF/wtf/CMakeFiles/WTF.dir/URLHelpers.cpp.o):URLHelpers.cpp:function WTF::mapHostName(WTF::String, std::optional<WTF::String (&)(WTF::String)>, bool*): error: undefined reference to 'WTF::loadIDNScriptWhiteList()'
Comment 97 Michael Catanzaro 2018-12-07 20:27:39 PST
Comment on attachment 356804 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=356804&action=review

Amazing monster patch!

I have a lot of comments, but don't worry: all are for quick trivial stuff. I sense the finish line is very near. My only important complaint is to fix the build failures related to loadIDNScriptWhiteList, which I explain how to do below. The red ios-sim EWS looks unrelated and can be ignored.

> LayoutTests/TestExpectations:-71
>  # These tests don't have to be platform-specific, but they are only implemented on Mac now.
> -fast/url/user-visible [ Skip ]

Very nice! Good that you saw the opportunity to implement the TestController function.

> Source/WTF/wtf/PlatformGTK.cmake:29
> +    glib/URLHelpersGlib.cpp

URLHelpersGLib.cpp (capital L)

We should probably rename other files ending in Glib.cpp at some point since I know there's a lot of inconsistency currently.

> Source/WTF/wtf/URLHelpers.cpp:41
> +// Needs to be big enough to hold an IDN-encoded name.
> +// For host names bigger than this, we won't do IDN encoding, which is almost certainly OK.
> +#define HOST_NAME_BUFFER_LENGTH 2048
> +#define URL_BYTES_BUFFER_LENGTH 2048

Since this is C++ now you should use const unsigned or const int instead of #define, and move them into the WTF namespace.

> Source/WTF/wtf/URLHelpers.cpp:43
> +static uint32_t IDNScriptWhiteList[(USCRIPT_CODE_LIMIT + 31) / 32];

This should also go inside the WTF namespace.

> Source/WTF/wtf/URLHelpers.cpp:67
> +}
> +
> +
> +template<typename CharacterType> inline bool isASCIIDigitOrValidHostCharacter(CharacterType charCode)

One blank line between functions.

> Source/WTF/wtf/URLHelpers.cpp:92
> +}
> +
> +
> +
> +static bool isLookalikeCharacter(std::optional<UChar32> previousCodePoint, UChar32 charCode)
> +{

Ditto regarding too many blank lines.

Also, careful not to copy optionals around unnecessarily. This should be const std::optional<UChar32>& previousCodePoint, or std::optional<UChar32>&& previousCodePoint if you add the WTFMove() at its callsite. Probably the later, but whichever you prefer is fine.

> Source/WTF/wtf/URLHelpers.cpp:284
> +static bool allCharactersInIDNScriptWhiteList(const UChar *buffer, int32_t length)

const UChar* buffer

> Source/WTF/wtf/URLHelpers.cpp:341
> +#define CHECK_RULES_IF_SUFFIX_MATCHES(suffix, function) \
> +    { \
> +        static const int32_t suffixLength = sizeof(suffix) / sizeof(suffix[0]); \
> +        if (length > suffixLength && 0 == memcmp(buffer + length - suffixLength, suffix, sizeof(suffix))) \
> +            return isSecondLevelDomainNameAllowedByTLDRules(buffer, length - suffixLength, function); \
> +    }

Eh, couldn't this become a function instead of a macro?

> Source/WTF/wtf/URLHelpers.cpp:507
> +String mapHostName(String string, std::optional<DecodeFunction> decodeFunction, bool *error)

bool* error (the * hangs on the type).

Or better: bool& error, unless the error parameter is really supposed to be optional, which I doubt.

Then I think we need to not copy the String or std::optional by value here. So const String& string and const std::optional<DecodeFunction>& decodeFunction would be good safe declarations. But if the values are being "sunk" then String&& string and std::optional<DecodeFunction>>&& decodeFunction would be better. Looks like they are being sunk throughout this file, but I see that they are APIs accessible to other files, so that swings the balance in favor of const-ref. So that leaves us with:

String mapHostName(const String& string, const std::optional<DecodeFunction> decodeFunction, bool& error)

> Source/WTF/wtf/URLHelpers.cpp:511
> +        return String();

This is old C++. Modern C++:

return { };

> Source/WTF/wtf/URLHelpers.cpp:514
> +        return String();

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:528
> +        return String();

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:532
> +        return String();

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:535
> +        return String();

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:542
> +static void collectRangesThatNeedMapping(String string, unsigned location, unsigned length, MappingRangesVector *array, std::optional<DecodeFunction> decodeFunction)

const String& string

MappingRangesVector& array -- the * hangs on the type, and then since you dereference it unconditionally, it's not optional/nullable and should use & instead of *

const std::optional<DecodeFunction>& decodeFunction

> Source/WTF/wtf/URLHelpers.cpp:545
> +    // Therefore, we use nil to indicate no mapping here and an empty array to indicate error.

It's not Objective C++ anymore so "nil" is no longer right.

> Source/WTF/wtf/URLHelpers.cpp:561
> +static void applyHostNameFunctionToMailToURLString(String string, std::optional<DecodeFunction> decodeFunction, MappingRangesVector *array)

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:633
> +static void applyHostNameFunctionToURLString(String string, std::optional<DecodeFunction> decodeFunction, MappingRangesVector *array)

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:692
> +String mapHostNames(String string, std::optional<DecodeFunction> decodeFunction)

Ditto.

> Source/WTF/wtf/URLHelpers.cpp:706
> +        return String();

return { };

> Source/WTF/wtf/URLHelpers.cpp:759
> +        return emptyString();

return { };

> Source/WTF/wtf/URLHelpers.cpp:763
> +    Vector<char, URL_BYTES_BUFFER_LENGTH> after(bufferLength); // large enough to %-escape every character

Comment should be a complete sentence.

> Source/WTF/wtf/URLHelpers.cpp:765
> +    char *q = after.data();

char* q

> Source/WTF/wtf/URLHelpers.cpp:767
> +        const unsigned char *p = before;

const unsigned char* p

> Source/WTF/wtf/URLHelpers.cpp:803
> +        char *p = after.data() + bufferLength - afterlength - 1;

char* p

> Source/WTF/wtf/URLHelpers.cpp:805
> +        char *q = after.data();

char* q

> Source/WTF/wtf/URLHelpers.cpp:846
> +        return String();

return { };

> Source/WTF/wtf/URLHelpers.h:37
> +using DecodeFunction = String(&)(String);

WTF is a big namespace and DecodeFunction could mean a lot of things. I'd call it URLDecodeFunction.

> Source/WTF/wtf/URLHelpers.h:39
> +WTF_EXPORT_PRIVATE String userVisibleString(CString URL);

const CString& URL?

> Source/WTF/wtf/cocoa/NSURLExtras.h:29
> +#import <wtf/text/WTFString.h>

Are you sure this is needed here? Can't it go in NSURLExtras.mm?

> Source/WTF/wtf/cocoa/NSURLExtras.mm:34
> +#import <wtf/URLHelpers.h>

This isn't sorted properly?

> Source/WTF/wtf/cocoa/NSURLExtras.mm:42
>  #define URL_BYTES_BUFFER_LENGTH 2048

This is unused and can be deleted.

> Source/WTF/wtf/cocoa/NSURLExtras.mm:109
> +    return !host ? string : (NSString*)host;

Above you used (NSString *)host, which I suspect is the right style here. Not sure. Any Apple developer should know, or you could poke around in the *.mm files more looking for the answer.

> Source/WTF/wtf/cocoa/NSURLExtras.mm:118
> +    return !host ? string : (NSString*)host;

(NSString *)?

> Source/WTF/wtf/glib/URLHelpersGlib.cpp:39
> +void loadIDNScriptWhiteList()
> +{
> +    static std::once_flag flag;
> +    std::call_once(flag, initializeDefaultIDNScriptWhiteList);
> +}

The Windows build failure is because there's no cross-platform implementation. And the WPE build failure is because you forgot to add this file to PlatformWPE.cmake.

Well we need a cross-platform implementation because we have to care for JSCOnly, PlayStation, and Windows. And there's nothing GLib-specific here, so this can be that implementation, right? Just rename the file to URLHelpers.cpp, build it on all platforms in the cross-platform CMakeLists.txt (you can even add it to XCode build in case we want to add more functions here in the future), and slap an #if !PLATFORM(COCOA) around the definition here.

> Source/WebKit/PlatformGTK.cmake:106
> +    ${WEBKIT_DIR}/UIProcess/API/gtk/WebKitURIUtilities.h

You remembered to install it AND add docs... nice.

> Tools/TestWebKitAPI/Tests/WebKitGLib/TestWebKitURIUtilities.cpp:112
> +void beforeAll()
> +{
> +    Test::add("WebKitURIUtilities", "uri-for-display-unaffected", testURIForDisplayUnaffected);
> +    Test::add("WebKitURIUtilities", "uri-for-display-affected", testURIForDisplayAffected);
> +}

Although having a test for webkit_uri_for_display() is important, copying the entire thing from the cross-platform test is not valuable. The tests will get out of sync, anyway. Better to test the implementation details in the cross-platform test, since that will be run for all developers. Then the GLib test can become way simpler. E.g. just pass in a %-encoded URL and verify that the result is decoded. The GLib test can be simpler because you have a good cross-platform test.
Comment 98 Michael Catanzaro 2018-12-10 11:16:37 PST
Comment on attachment 356804 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=356804&action=review

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:45
> +/**
> + * webkit_uri_for_display:
> + * @uri: the URI to be converted
> + *
> + * Use this function to format a URI for display. The URIs used internally by
> + * WebKit may contain percent-encoded characters or Punycode, which are not
> + * generally suitable to display to users. This function provides protection
> + * against IDN homograph attacks, so in some cases the host part of the returned
> + * URI may be in Punycode if the safety check fails.
> + *
> + * Returns: (nullable) (transfer full): @uri suitable for display, or %NULL in
> + *    case of error.
> + **/

This is missing the since tag:

Since: 2.24
Comment 99 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-12-11 01:41:19 PST
Created attachment 357043 [details]
Patch

(In reply to Michael Catanzaro from comment #97)
> Comment on attachment 356804 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=356804&action=review
> 
> > Source/WTF/wtf/URLHelpers.cpp:341
> > +#define CHECK_RULES_IF_SUFFIX_MATCHES(suffix, function) \
> > +    { \
> > +        static const int32_t suffixLength = sizeof(suffix) / sizeof(suffix[0]); \
> > +        if (length > suffixLength && 0 == memcmp(buffer + length - suffixLength, suffix, sizeof(suffix))) \
> > +            return isSecondLevelDomainNameAllowedByTLDRules(buffer, length - suffixLength, function); \
> > +    }
> 
> Eh, couldn't this become a function instead of a macro?

Not easily; you'd at least need a templated function to extract the length of the array, and you'd need to return a tri-state (continue / return true / return false), and your call site would need to become check + return. Not worth it, I think.

> > Source/WTF/wtf/cocoa/NSURLExtras.mm:42
> >  #define URL_BYTES_BUFFER_LENGTH 2048
> 
> This is unused and can be deleted.

It's used in URLByTruncatingOneCharacterBeforeComponent().

> > Source/WebKit/PlatformGTK.cmake:106
> > +    ${WEBKIT_DIR}/UIProcess/API/gtk/WebKitURIUtilities.h
> 
> You remembered to install it AND add docs... nice.

I must admit I copied those parts from Gabriel's patch :)
Comment 100 Build Bot 2018-12-11 01:44:51 PST Comment hidden (obsolete)
Comment 101 Michael Catanzaro 2018-12-11 14:49:04 PST
Comment on attachment 357043 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=357043&action=review

> Source/WTF/wtf/URLHelpers.h:40
> +WTF_EXPORT_PRIVATE String userVisibleString(const CString& URL);

Sorry for more nitpicking. Nitpicking's all I have left. Looking at how this is used in Internals.cpp, I realize it's a bit unclear: what's a user-visible string? It's clear in the context of this header, but outside of any class... I would add all of these to a new namespace, URLHelpers. The namespace will make it more clear. It's even more important for the non-exported functions, like WTF::mapHostName which is weird to have as a toplevel function. As WTF::URLHelpers::mapHostName it becomes more clear that the function is associated with these other URLHelpers functions. I would also rename userVisibleString to userVisibleURL to make it even more clear.
Comment 102 Michael Catanzaro 2018-12-11 14:49:30 PST
r=me if you placate the style checker.

Since this patch adds new API, it requires approval of a second GTK reviewer. Carlos Garcia would be best, since he reviewed it previously.
Comment 103 Carlos Garcia Campos 2018-12-12 00:29:51 PST
Comment on attachment 357043 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=357043&action=review

Thanks! API looks good to me, note that wpe now has documentation files too.

> Source/WTF/wtf/URLHelpers.cpp:514
> +// Return value of nil means no mapping is necessary.
> +// If makeString is NO, then return value is either nil or self to indicate mapping is necessary.
> +// If makeString is YES, then return value is either nil or the mapped string.

You should update this comment that still refers to nil and makeString that I don't know what it is.

> Source/WTF/wtf/URLHelpers.cpp:515
> +String mapHostName(const String& string_, const std::optional<URLDecodeFunction>& decodeFunction, bool& error)

string_ -> hostname

> Source/WTF/wtf/URLHelpers.cpp:528
> +    if (decodeFunction && string.contains('%')) {
> +        string = (*decodeFunction)(string_);
> +    } else {
> +        string = string_;
> +    }

Remove the brackets here.

> Source/WTF/wtf/URLHelpers.cpp:539
> +        error = true;
> +        return { };

Could we make this return std::optional<String> and then return std::nullopt instead of using an out bool parameter for error case.

> Source/WTF/wtf/URLHelpers.cpp:553
> +static void collectRangesThatNeedMapping(const String& string, unsigned location, unsigned length, MappingRangesVector& array, std::optional<URLDecodeFunction> decodeFunction)

should decodeFunction be const & like in the other functions?

> Source/WTF/wtf/URLHelpers.cpp:556
> +    // Generally, we want to optimize for the case where there is one host name that does not need mapping.
> +    // Therefore, we use null to indicate no mapping here and an empty array to indicate error.

Can we use std::optional now instead of nullptr?

> Source/WTF/wtf/URLHelpers.cpp:730
> +static void createStringWithEscapedUnsafeCharacters(const Vector<UChar, URL_BYTES_BUFFER_LENGTH>& sourceBuffer, Vector<UChar, URL_BYTES_BUFFER_LENGTH>& outBuffer)

outBuffer could be the return value instead.

> Source/WTF/wtf/URLHelpers.cpp:765
> +String userVisibleString(const CString& URL)

I find weird that this function receives a CString but returns a String.
URL -> url

> Source/WTF/wtf/URLHelpers.cpp:767
> +    auto before = reinterpret_cast<const unsigned char*>(URL.data());

auto*

> Source/WTF/wtf/URLHelpers.cpp:864
> +    Vector<UChar, URL_BYTES_BUFFER_LENGTH> outBuffer;
> +    createStringWithEscapedUnsafeCharacters(normalizedCharacters, outBuffer);
> +    return String::adopt(WTFMove(outBuffer));

Can we make createStringWithEscapedUnsafeCharacters return a String?

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:31
> + **/

*/

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:47
> + **/

*/

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:52
> +    CString uri_string(uri);

uri_string -> uriString, but I don't think you need the local variable

> Source/WebKit/UIProcess/API/glib/WebKitURIUtilities.cpp:53
> +    String result = WTF::userVisibleString(uri_string);

Just pass uri to userVisibleString() the CString will be created implicitly I think
Comment 104 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-12-14 08:32:30 PST
Created attachment 357316 [details]
Patch

I think this addresses all outstanding comments.
Comment 105 WebKit Commit Bot 2018-12-17 06:08:55 PST
Comment on attachment 357316 [details]
Patch

Clearing flags on attachment: 357316

Committed r239265: <https://trac.webkit.org/changeset/239265>
Comment 106 WebKit Commit Bot 2018-12-17 06:08:58 PST
All reviewed patches have been landed.  Closing bug.
Comment 107 Antoine Quint 2018-12-17 10:06:24 PST
I can no longer build WTF following this change:

Source/WTF/wtf/URLHelpers.cpp:853:32: error: 'unorm_normalize' is deprecated [-Werror,-Wdeprecated-declarations]
    int32_t normalizedLength = unorm_normalize(sourceBuffer.data(), sourceBuffer.size(), UNORM_NFC, 0, normalizedCharacters.data(), sourceBuffer.size(), &uerror);
                               ^
Comment 108 Keith Rollin 2018-12-17 10:37:32 PST
Ditto.
Comment 109 Alex Christensen 2018-12-17 10:58:27 PST
Comment on attachment 357316 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=357316&action=review

> Source/WTF/wtf/URLHelpers.cpp:853
> +    int32_t normalizedLength = unorm_normalize(sourceBuffer.data(), sourceBuffer.size(), UNORM_NFC, 0, normalizedCharacters.data(), sourceBuffer.size(), &uerror);

unorm_normalize is deprecated.  If you absolutely need to use it, please only use it in GTK-specific code, but I think the deprecation is an indication that this is done the wrong way.  This was also a behavior change, which I requested be avoided with this code move.  This also broke an internal build, so I intend to roll it out.
Comment 110 WebKit Commit Bot 2018-12-17 11:18:23 PST
Re-opened since this is blocked by bug 192765
Comment 111 Alex Christensen 2018-12-17 13:23:01 PST
If you try this again, please split it into two patches, one that does nothing but move code with no change in behavior or organization, then another that adds functionality for linux.
Comment 112 Michael Catanzaro 2018-12-17 14:09:53 PST
Comment on attachment 357316 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=357316&action=review

>> Source/WTF/wtf/URLHelpers.cpp:853
>> +    int32_t normalizedLength = unorm_normalize(sourceBuffer.data(), sourceBuffer.size(), UNORM_NFC, 0, normalizedCharacters.data(), sourceBuffer.size(), &uerror);
> 
> unorm_normalize is deprecated.  If you absolutely need to use it, please only use it in GTK-specific code, but I think the deprecation is an indication that this is done the wrong way.  This was also a behavior change, which I requested be avoided with this code move.  This also broke an internal build, so I intend to roll it out.

It should be using unorm2_normalize: http://icu-project.org/apiref/icu4c/unorm2_8h.html#a0a596802db767da410b4b04cb75cbc53. It's been deprecated since ICU 56 released in 2015, so Ms2ger you should have noticed a build warning, right? Unless you were working with a truly ancient host system? If it got lost in other warnings, then we might need to consider turning on -Werror for GTK. 😈

Why is normalization needed here but not in the original code? Is it a necessary consequence of the change from NSString to String?

I'll water down Alex's request slightly, since we requested a bunch of small improvements to the organization of the code during many rounds of patch review, and trying to disentangle them and land separately could be a significant effort, and spending a bunch of time making the patch worse doesn't make sense. But any functional behavior changes in the moved code, like an extra normalization step, should be split out if feasible. And if not, they should be prominently mentioned in the ChangeLog, so we know what changed. Otherwise, it looks like burying a small important change in a huge code reorganization commit.  And we don't want potential regressions to bisect down to a massive code move commit like this.

If there are not other behavior changes in URLHelpers.[cpp,h], then splitting the rest into two patches should be easy, since we have a nice clean split between the URLHelpers/NSURLExtras code and the Linux-specific stuff in the layers above.
Comment 113 Alex Christensen 2018-12-17 17:32:17 PST
This is quite carefully crafted code that has evolved over many years. You may have requested a bunch of small improvements during review, but this functionality should not change at all on Apple platforms without an internally-reviewed security bug. You are welcome to use this code on Linux but please make no functional change, however convinced you may be that it is an improvement.
Comment 114 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-12-18 01:04:04 PST
Created attachment 357550 [details]
Patch

The normalization is, AFAIK, a translation of the following code in ObjC:

    result = [result precomposedStringWithCanonicalMapping];

... which can't really work in GTK.

If you're interested in the individual changes, see https://github.com/WebKit/webkit/compare/master...Ms2ger:userVisibleString?expand=1
Comment 115 Build Bot 2018-12-18 01:07:38 PST Comment hidden (obsolete)
Comment 116 Ms2ger (he/him; ⌚ UTC+1/+2) 2018-12-18 01:12:27 PST
Created attachment 357551 [details]
Patch
Comment 117 Michael Catanzaro 2018-12-18 08:06:57 PST
So Ms2ger, there should be no functional changes, correct? You're just replacing use of the NSString function precomposedStringWithCanonicalMapping -- documented to return "[a] string made by normalizing the string’s contents using the Unicode Normalization Form C." -- with ICU's unorm2_normalize, documented to return "the normalized form of the source string"? If so, then I agree it's the minimum-required change to translate this from Objective C++ to C++.

I would still split out all the changes from Source/WebKit and Tools/TestWebKitAPI into a separate patch, that way all the functional changes can go into a separate bug, even if they are Linux-specific.

(In reply to Build Bot from comment #115)
> ERROR: Source/WTF/wtf/URLHelpers.cpp:853:  Declaration has space between
> type name and * in const UNormalizer2 *normalizer  [whitespace/declaration]
> [3]

Careful here, this needs fixed.
Comment 118 Michael Catanzaro 2018-12-18 08:16:04 PST
(In reply to Alex Christensen from comment #113)
> This is quite carefully crafted code that has evolved over many years. You
> may have requested a bunch of small improvements during review, but this
> functionality should not change at all on Apple platforms without an
> internally-reviewed security bug. You are welcome to use this code on Linux
> but please make no functional change, however convinced you may be that it
> is an improvement.

Code style and organization improvements of the sort appropriate when translating a file from one language to another, not functionality changes. Well, you can see the history here up above. There shouldn't be behavior changes. Ms2ger, is there anything in here as risky as the switch from the NSString to the ICU normalization function? (We know in theory that should not result in behavior changes if both implement the Unicode normalization algorithm correctly, but in practice no doubt these functions will have different sets of bugs.) That change can be done in Objective C++, so if Alex wants to be really strict, he might ask us to do that change and other similar changes first in its own patch, then the code move patch, then the WPE/GTK API patch.
Comment 119 Michael Catanzaro 2018-12-18 08:24:58 PST
Unicode has four normalization forms:

https://unicode.org/reports/tr15/#Norm_Forms

Just for additional due diligence, I checked to ensure you're using unorm2_getNFCInstance(), which corresponds to form C, the same form as the NSString function, so it should be OK. I think we should go ahead and do that separately in its own initial patch, like I suggested above, since it can be done in Objective C++. That way, if that change does introduce unexpected problems, they'll bisect down to a nice separate commit and not the jumbo commit.
Comment 120 Alex Christensen 2018-12-18 09:37:16 PST
(In reply to Ms2ger (he/him; ⌚ UTC+1/+2) from comment #114)
> Created attachment 357550 [details]
> Patch
> 
> The normalization is, AFAIK, a translation of the following code in ObjC:
> 
>     result = [result precomposedStringWithCanonicalMapping];

This normalization must not change for Apple platforms.  This call has important functionality that cannot be emulated by any one call to a unicode function.  To make it work on linux, please add an abstraction for this operation.
Comment 121 Alex Christensen 2018-12-18 09:47:50 PST
(In reply to Michael Catanzaro from comment #119)
> Unicode has four normalization forms:
> 
> https://unicode.org/reports/tr15/#Norm_Forms
> 
> Just for additional due diligence, I checked to ensure you're using
> unorm2_getNFCInstance(), which corresponds to form C, the same form as the
> NSString function, so it should be OK. I think we should go ahead and do
> that separately in its own initial patch, like I suggested above, since it
> can be done in Objective C++. That way, if that change does introduce
> unexpected problems, they'll bisect down to a nice separate commit and not
> the jumbo commit.

Actually, you're right.  unorm2_getNFCInstance is probably similar enough.  A small patch would be great.
Comment 122 Michael Catanzaro 2018-12-20 19:38:21 PST
Note: the tests are duplicated with URLExtras.mm. The duplicated tests should be removed from the URLExtras.mm test now that they are added in the cross-platform test.
Comment 123 Michael Catanzaro 2018-12-22 16:42:40 PST
Also, in bug #192945 Alex noticed a missing error check, should be:

    const UNormalizer2* normalizer = unorm2_getNFCInstance(&uerror);
    if (U_FAILURE(uerror))
        return { };
Comment 124 Michael Catanzaro 2019-01-14 17:43:07 PST
Also, we need to not lose the recent change from https://trac.webkit.org/changeset/239586/webkit.
Comment 125 Michael Catanzaro 2019-01-14 19:26:50 PST
Also: https://trac.webkit.org/changeset/239967/webkit
Comment 126 Michael Catanzaro 2019-01-15 13:26:54 PST
Two more requests:

 * Alex wants the unorm2_getNFCInstance normalization to be split into a subroutine, like it already is after bug #192945.

 * Darin has requested we not use String::charactersWithNullTermination here, because we don't need the null-termination:

> The idiom to efficiently get a UChar* buffer would be:
> 
>     auto characters = StringView { string }.upconvertedCharacters();
> 
> Then "characters" can be passed anywhere that needs a "const UChar*". We
> should switch to that I think.
Comment 127 Michael Catanzaro 2019-01-15 19:29:19 PST
Darin has also suggested checking with unorm2_quickCheck() to see if normalization is likely required before performing the normalization.

OK, maybe that will be enough suggested improvements? This bug has reached epic length!
Comment 128 Ms2ger (he/him; ⌚ UTC+1/+2) 2019-01-28 09:00:16 PST
Created attachment 360348 [details]
Patch
Comment 129 Build Bot 2019-01-28 09:02:19 PST
Attachment 360348 [details] did not pass style-queue:


WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/gtk/webkit2.h"
WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/gtk/WebKitURIUtilities.h"
ERROR: Source/WTF/wtf/cocoa/NSURLExtras.mm:111:  Use 'WTF::Optional<>' instead of 'std::optional<>'.  [runtime/wtf_optional] [4]
ERROR: Source/WTF/wtf/cocoa/NSURLExtras.mm:119:  Use 'WTF::Optional<>' instead of 'std::optional<>'.  [runtime/wtf_optional] [4]
WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/wpe/WebKitURIUtilities.h"
ERROR: Source/WTF/wtf/URLHelpers.cpp:786:  Declaration has space between type name and * in const UNormalizer2 *normalizer  [whitespace/declaration] [3]
WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/wpe/webkit.h"
Total errors found: 3 in 29 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 130 Ms2ger (he/him; ⌚ UTC+1/+2) 2019-01-29 00:51:04 PST
Created attachment 360451 [details]
Patch
Comment 131 Ms2ger (he/him; ⌚ UTC+1/+2) 2019-01-29 08:55:25 PST
Created attachment 360462 [details]
Patch
Comment 132 Build Bot 2019-01-29 08:57:35 PST
Attachment 360462 [details] did not pass style-queue:


WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/gtk/webkit2.h"
WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/gtk/WebKitURIUtilities.h"
WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/wpe/WebKitURIUtilities.h"
ERROR: Source/WTF/wtf/URLHelpers.cpp:786:  Declaration has space between type name and * in const UNormalizer2 *normalizer  [whitespace/declaration] [3]
WARNING: File exempt from style guide. Skipping: "Source/WebKit/UIProcess/API/wpe/webkit.h"
Total errors found: 1 in 29 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 133 Michael Catanzaro 2019-01-29 09:48:26 PST
EWS is still pretty sad.

(In reply to Build Bot from comment #132)
> ERROR: Source/WTF/wtf/URLHelpers.cpp:786:  Declaration has space between
> type name and * in const UNormalizer2 *normalizer  [whitespace/declaration]
> [3]

UNormalizer2* normalizer
Comment 134 Ms2ger (he/him; ⌚ UTC+1/+2) 2019-01-30 00:52:59 PST
Created attachment 360561 [details]
Patch
Comment 135 Ms2ger (he/him; ⌚ UTC+1/+2) 2019-01-30 03:50:23 PST
Created attachment 360567 [details]
Patch

*sigh*
Comment 136 Alex Christensen 2019-01-30 08:54:57 PST
Comment on attachment 360567 [details]
Patch

Wow.  I'll have to hear what Tim thinks of this patch.
Comment 137 Michael Catanzaro 2019-01-30 09:29:01 PST
Comment on attachment 360567 [details]
Patch

Note it's already been reviewed by Alex in... OMG, comment #49, that was over a year ago. :O

Note also: API freeze is Monday, Feb 4, and this remains a rollout risk.

Anyway, great to see EWS green. Checklist of changes since the first time it landed:

 * Avoid test duplication: fail, it seems the tests added in URLHelpers.cpp are still duplicates of the tests in URLExtras.mm. 
 * Added missing error check after use of unorm2_getNFCInstance, good
 * Change in r239586 has been ported, good
 * Change in r239967 has been ported, good
 * unorm2_getNFCInstance normalization split into toNormalizationFormC, good
 * charactersWithNullTermination use removed from toNormalizationFormC... fail, that was missed.
 * Add unorm2_quickCheck(), also fail, that was missed too.

So:

Any tests we're adding to the cross-platform test should be removed from the platform-specific test. Ideally the entire test would become cross-platform, but I see that would require a bit more work and we're just days away from API freeze, so I understand if that can't happen now.

This is minor, but Darin did request that we not use charactersWithNullTermination if we are just going to cut off the null termination right away. His suggested idiom was:

auto characters = StringView { string }.upconvertedCharacters();

Were there any troubles with that?

Then the unorm2_quickCheck should be one extra function call towards the top of toNormalizationFormC(), just to avoid calling unorm2_normalize() if it's not really needed. That could be done in a follow-up, of course, but it seems easy?
Comment 138 Michael Catanzaro 2019-02-01 09:21:56 PST
(In reply to Alex Christensen from comment #136)
> Comment on attachment 360567 [details]
> Patch
> 
> Wow.  I'll have to hear what Tim thinks of this patch.

BTW please feel free to review today if you want to... don't want this to slip yet another release cycle. I'm giving r- for the duplicated tests, but it's still  long past ready for reviews.
Comment 139 Michael Catanzaro 2019-02-01 09:23:05 PST
Another option, if you don't have time to make the tests cross-platform, is to keep them Cocoa-specific and we can just rely on them passing on the Apple bots to know that it's working. This is not the greatest solution, but it's better than duplicating the same list of tests in two different places.
Comment 140 Alex Christensen 2019-02-04 10:27:10 PST
Comment on attachment 360567 [details]
Patch

LGTM other than the test issue mentioned by Michael.
Comment 141 Michael Catanzaro 2019-02-04 13:15:23 PST
(In reply to Michael Catanzaro from comment #137)
> Any tests we're adding to the cross-platform test should be removed from the
> platform-specific test. Ideally the entire test would become cross-platform,
> but I see that would require a bit more work and we're just days away from
> API freeze, so I understand if that can't happen now.
> 
> This is minor, but Darin did request that we not use
> charactersWithNullTermination if we are just going to cut off the null
> termination right away. His suggested idiom was:
> 
> auto characters = StringView { string }.upconvertedCharacters();
> 
> Were there any troubles with that?
> 
> Then the unorm2_quickCheck should be one extra function call towards the top
> of toNormalizationFormC(), just to avoid calling unorm2_normalize() if it's
> not really needed. That could be done in a follow-up, of course, but it
> seems easy?

I'm going to create a follow-up bug to handle all three of these issues, since they are all relatively minor. (The tests will take some work, but no harm in keeping the Cocoa tests for now.) Ms2ger, thanks for shepherding the patch so far!
Comment 142 Michael Catanzaro 2019-02-04 19:48:21 PST
This patch is the same as Ms2ger's, except I've removed Tools/TestWebKitAPI/Tests/WTF/URLHelpers.cpp. I will try to sort that out in the follow-up bug #194272.
Comment 143 Michael Catanzaro 2019-02-04 20:04:35 PST
Committed r240962: <https://trac.webkit.org/changeset/240962>