Bug 144320 - URL paths should not be normalized when encoded
Summary: URL paths should not be normalized when encoded
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-28 03:02 PDT by Carlos Garcia Campos
Modified: 2023-01-02 12:28 PST (History)
6 users (show)

See Also:


Attachments
Patch (9.63 KB, patch)
2015-04-28 03:11 PDT, Carlos Garcia Campos
no flags Details | Formatted Diff | Diff
Updated patch (9.19 KB, patch)
2015-07-22 02:07 PDT, Carlos Garcia Campos
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Garcia Campos 2015-04-28 03:02:10 PDT
I noticed that some files containing the character 'ñ' were not loaded when used in img or object tags. The problem is that the character 'ñ' is encoded in the filename as U+006E U+0303. The URL encodes the given path normalizing the string, which converts the 'ñ' to U+00F1. When the network backend tries to load the URL, it fails because the resource is not found at the given path.
Comment 1 Carlos Garcia Campos 2015-04-28 03:11:45 PDT
Created attachment 251834 [details]
Patch

There was a FIXME in TextEncoding::encode() wondering if encode(9 is the best place for the normalization. This bug proves, it indeed depends on the caller, but I think it's very convenient that it happens in encode(), so we can add a parameter to let the caller decide. I've added a text case to URL unit test.

**FAIL** URLTest.URLPathEnconding

../../Tools/TestWebKitAPI/Tests/WebCore/URL.cpp:97
Value of: url.lastPathComponent()
  Actual: test-ni%C3%B1o.png
Expected: encodeWithURLEscapeSequences(path)
Which is: test-nin%CC%83o.png

This is what happens without the patch.
Comment 2 Darin Adler 2015-04-28 08:15:36 PDT
(In reply to comment #1)
> but I think it's very convenient that it happens in encode()

Really? I’m not so sure. How many different call sites do we have for encode()?
Comment 3 Darin Adler 2015-04-28 08:16:39 PDT
Comment on attachment 251834 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=251834&action=review

> Source/WebCore/platform/text/TextEncoding.h:73
> +        enum EncodeNormalization { NormalizeBeforeEncoding, DoNotNormalizeBeforeEncoding };
> +        CString encode(StringView, UnencodableHandling, EncodeNormalization = NormalizeBeforeEncoding) const;

This is OK, but I’d also like us to consider doing the normalization with a separate function as the comment suggested. I’d like to understand how many different call sites there are that do this.
Comment 4 Carlos Garcia Campos 2015-04-28 08:48:03 PDT
(In reply to comment #2)
> (In reply to comment #1)
> > but I think it's very convenient that it happens in encode()
> 
> Really? I’m not so sure. How many different call sites do we have for
> encode()?

Not because they are many, but because of StringView -> Uchar conversions. Le me check all the callers
Comment 5 Carlos Garcia Campos 2015-04-28 08:57:36 PDT
WebCore/fileapi/WebKitBlobBuilder.cpp: BlobBuilder::append()
WebCore/html/FormDataList.cpp: FormDataList::appendString()
WebCore/page/PageSerializer.cpp: PageSerializer::serializeFrame() and PageSerializer::serializeCSSStyleSheet()
WebCore/platform/URL.cpp: 
WebCore/platform/graphics/avfoundation/objc/MediaPlayerPrivateAVFoundationObjC.mm: MediaPlayerPrivateAVFoundationObjC::shouldWaitForLoadingOfResource()
WebCore/platform/graphics/harfbuzz/HarfBuzzFaceCairo.cpp: harfBuzzGetGlyph()
WebCore/platform/network/DataURL.cpp: handleDataURL()
WebCore/platform/network/FormDataBuilder.cpp: FormDataBuilder::addFilenameToMultiPartHeader()
WebCore/xml/XMLHttpRequest.cpp: XMLHttpRequest::send()
Comment 6 Darin Adler 2015-04-28 09:23:52 PDT
So I think we have two questions:

1) If we did want to make normalization separate, what is a good idiom that would keep things brief and efficient and just use the StringView as-is without memory allocation when no normalization is needed?

2) Is normalization helpful for all those other clients?

I suppose we can leave it inside encode for now, but we should make sure that all those other call sites want normalization and have someone write test cases for them.
Comment 7 Carlos Garcia Campos 2015-04-28 09:59:03 PDT
(In reply to comment #6)
> So I think we have two questions:
> 
> 1) If we did want to make normalization separate, what is a good idiom that
> would keep things brief and efficient and just use the StringView as-is
> without memory allocation when no normalization is needed?
> 
> 2) Is normalization helpful for all those other clients?
> 
> I suppose we can leave it inside encode for now, but we should make sure
> that all those other call sites want normalization and have someone write
> test cases for them.

Ok, I'll add a FIXME then.
Comment 8 Alexey Proskuryakov 2015-04-28 20:56:46 PDT
This keeps coming up as something that looks like unusual behavior at s first glance, however this is an intentional behavior for compatibility. 

I expect that this change will break more than it fixes, because NFC is s lot more common in general. You have an example of a file and a server that only accepts the decomposed form, but do we know that it's more common than the opposite?

Mac is special here, because the platform uses decomposed form. But consider the case where a user types into a search form, and the form is submitted as a GET. Currently, we normalize to NFC, and the server gets the same request from Safari as it gets from e.g. Firefox on Windows. With the proposed change, the request will be different, and a server that doesn't implement normalization will misbehave.
Comment 9 Carlos Garcia Campos 2015-04-28 23:04:57 PDT
(In reply to comment #8)
> This keeps coming up as something that looks like unusual behavior at s
> first glance, however this is an intentional behavior for compatibility. 
> 
> I expect that this change will break more than it fixes, because NFC is s
> lot more common in general. You have an example of a file and a server that
> only accepts the decomposed form, but do we know that it's more common than
> the opposite?

Not exactly, this is not about client/server normalizing or not. The file itself in the file system can be normalized or not, in the particular case of a filename containing a 'ñ', it can be encoded as U+006E U+0303, or U+00F1, but they end up being different files, because the bytes are different, even if the visual representation is the same. This was the problem I was having, a server containing files in the file system with 'ñ' encoded as U+006E U+0303, but the URL in WebKit was normalized to U+00F1 and the files failed mto load by the network bacckend. The very same files worked in chrome and firefox.

> Mac is special here, because the platform uses decomposed form. But consider
> the case where a user types into a search form, and the form is submitted as
> a GET. Currently, we normalize to NFC, and the server gets the same request
> from Safari as it gets from e.g. Firefox on Windows. With the proposed
> change, the request will be different, and a server that doesn't implement
> normalization will misbehave.

Form data decoding hasn't changed, except for filenames, so what the user types in a search form is still normalized.
Comment 10 Alexey Proskuryakov 2015-04-28 23:22:48 PDT
> The file itself in the file system can be normalized or not, in the particular case of a filename containing a 'ñ', it can be encoded as U+006E U+0303, or U+00F1, but they end up being different files, because the bytes are different, even if the visual representation is the same.

Yes, I understand that this is what you are saying. This is a bug in server's file system - everything that supports Unicode must treat different normalization forms as equivalent, so a filesystem may not have two files whose names only differ in normalization form.

> The very same files worked in chrome and firefox.

Yes, they work for you in a test case, but they won't work in other scenarios, most notably those that involve user input on a Mac. This is as I said, the behavior in WebKit is intentionally different to have a more common Unicode form on the wire. Windows browsers have the luxury of letting the bytes through unchanged because their OS and Internet both use the same form, but for Safari, it is not as straightforward.

> Form data decoding hasn't changed, except for filenames, so what the user types in a search form is still normalized.

The changes in encodeRelativeString() are quite confusing, I'm not sure if that's correct. There is some "otherDecoded" string that is counter-intuitively a result of calling encode(), and that's separate from where the path is handled.

Another change in this patch is that filenames in form data are not encoded. This means that a file uploaded from Mac will retain the custom HFS normalization form that is not used anywhere else - how if that the right thing to do?
Comment 11 Alexey Proskuryakov 2015-04-28 23:23:47 PDT
> filenames in form data are not encoded

I meant, not normalized.
Comment 12 Carlos Garcia Campos 2015-04-28 23:49:10 PDT
(In reply to comment #10)
> > The file itself in the file system can be normalized or not, in the particular case of a filename containing a 'ñ', it can be encoded as U+006E U+0303, or U+00F1, but they end up being different files, because the bytes are different, even if the visual representation is the same.
> 
> Yes, I understand that this is what you are saying. This is a bug in
> server's file system - everything that supports Unicode must treat different
> normalization forms as equivalent, so a filesystem may not have two files
> whose names only differ in normalization form.

I don't think Linux and the most common file systems used in Linux know anything about encoding, filenames are just bytes (only exceptions are 0 and /, I think), so two files with different bytes in their name are just different. It seems HFS does care about encodings and normalization, I didn't know it. So, maybe this change should be made specific to Linux (or other unix systems, except mac)

> > The very same files worked in chrome and firefox.
> 
> Yes, they work for you in a test case, but they won't work in other
> scenarios, most notably those that involve user input on a Mac. This is as I
> said, the behavior in WebKit is intentionally different to have a more
> common Unicode form on the wire. Windows browsers have the luxury of letting
> the bytes through unchanged because their OS and Internet both use the same
> form, but for Safari, it is not as straightforward.

Well, I isolated the problem in a test case, but the issue was happening in real cases. It's not that the server doesn't normalize the filenames, the server just uses what there's in the filesystem.

> > Form data decoding hasn't changed, except for filenames, so what the user types in a search form is still normalized.
> 
> The changes in encodeRelativeString() are quite confusing, I'm not sure if
> that's correct. There is some "otherDecoded" string that is
> counter-intuitively a result of calling encode(), and that's separate from
> where the path is handled.
> 
> Another change in this patch is that filenames in form data are not encoded.
> This means that a file uploaded from Mac will retain the custom HFS
> normalization form that is not used anywhere else - how if that the right
> thing to do?

I assumed all file systems handled files as just bytes, I didn't know HFS worked differently. We need to make this depending on the platform.
Comment 13 Alexey Proskuryakov 2015-04-29 00:11:05 PDT
With a quick web search, it appears that on Windows and Linux, normalization is basically left to input methods, and they don't consistently produce NFC. This means that a filename typed on one system will not necessarily match the same filename typed on another.

This makes me wonder if we actually need to standardize on WebKit behavior of having a well defined normalization form on the wire. It doesn't seem all that Mac specific any more.
Comment 14 Carlos Garcia Campos 2015-07-22 02:07:33 PDT
Created attachment 257254 [details]
Updated patch

Updated patch to do the normalization or not depending on the platform. Also removed the form data filename, as I'm not sure about that case. Darin, is it correct to assume that PLATFORM(COCOA) means HFS?
Comment 15 Darin Adler 2015-07-22 13:56:21 PDT
(In reply to comment #14)
> Updated patch to do the normalization or not depending on the platform. Also
> removed the form data filename, as I'm not sure about that case. Darin, is
> it correct to assume that PLATFORM(COCOA) means HFS?

It’s not correct to assume that. Macs support NFS, for example.

It’s true that most older Unix systems have filenames that are an arbitrary string of bytes; the typical restriction is simply that none of the bytes can be '\0' or '/'. But note that the string of bytes can’t necessarily be converted into a sequence of UTF-16 code points. There could easily be a filename that’s not a valid UTF-8 sequence. How would we handle that?

When we are talking about URLs, what matters is the file system on the server, not on the client.
Comment 16 Carlos Garcia Campos 2015-07-23 01:15:54 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > Updated patch to do the normalization or not depending on the platform. Also
> > removed the form data filename, as I'm not sure about that case. Darin, is
> > it correct to assume that PLATFORM(COCOA) means HFS?
> 
> It’s not correct to assume that. Macs support NFS, for example.
> 
> It’s true that most older Unix systems have filenames that are an arbitrary
> string of bytes; the typical restriction is simply that none of the bytes
> can be '\0' or '/'. But note that the string of bytes can’t necessarily be
> converted into a sequence of UTF-16 code points. There could easily be a
> filename that’s not a valid UTF-8 sequence. How would we handle that?
> 

So, that's another reason not to normalize filenames in URL, no?

> When we are talking about URLs, what matters is the file system on the
> server, not on the client.

Oops, of course, you are right.
Comment 17 Csaba Osztrogonác 2015-09-14 11:14:27 PDT
Comment on attachment 251834 [details]
Patch

Cleared Darin Adler's review+ from obsolete attachment 251834 [details] so that this bug does not appear in http://webkit.org/pending-commit.
Comment 18 Ahmad Saleem 2022-06-04 04:34:20 PDT
I think Darin's patch based on last Comment got landed.

https://github.com/WebKit/WebKit/commit/7e077530d958327a7716853db6dade3e6f0e7ff5

Do we need this or it is about something else? Thanks!
Comment 19 Alexey Proskuryakov 2022-06-05 13:37:25 PDT
That patch landed in 2006, so it was a different change.

We had other discussions about the trade-offs of matching Windows behavior vs. round-tripping cleanliness in various contexts, and even something like a plan (normalizing text as it enters WebKit via system APIs including text input and file names), but I don't think that anything systemic got implemented yet.