Bug 144320

Summary: URL paths should not be normalized when encoded
Product: WebKit Reporter: Carlos Garcia Campos <cgarcia>
Component: TextAssignee: Nobody <webkit-unassigned>
Status: RESOLVED CONFIGURATION CHANGED    
Severity: Normal CC: achristensen, ahmad.saleem792, annevk, ap, darin, wenson_hsieh
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Updated patch none

Carlos Garcia Campos
Reported 2015-04-28 03:02:10 PDT
I noticed that some files containing the character 'ñ' were not loaded when used in img or object tags. The problem is that the character 'ñ' is encoded in the filename as U+006E U+0303. The URL encodes the given path normalizing the string, which converts the 'ñ' to U+00F1. When the network backend tries to load the URL, it fails because the resource is not found at the given path.
Attachments
Patch (9.63 KB, patch)
2015-04-28 03:11 PDT, Carlos Garcia Campos
no flags
Updated patch (9.19 KB, patch)
2015-07-22 02:07 PDT, Carlos Garcia Campos
no flags
Carlos Garcia Campos
Comment 1 2015-04-28 03:11:45 PDT
Created attachment 251834 [details] Patch There was a FIXME in TextEncoding::encode() wondering if encode(9 is the best place for the normalization. This bug proves, it indeed depends on the caller, but I think it's very convenient that it happens in encode(), so we can add a parameter to let the caller decide. I've added a text case to URL unit test. **FAIL** URLTest.URLPathEnconding ../../Tools/TestWebKitAPI/Tests/WebCore/URL.cpp:97 Value of: url.lastPathComponent() Actual: test-ni%C3%B1o.png Expected: encodeWithURLEscapeSequences(path) Which is: test-nin%CC%83o.png This is what happens without the patch.
Darin Adler
Comment 2 2015-04-28 08:15:36 PDT
(In reply to comment #1) > but I think it's very convenient that it happens in encode() Really? I’m not so sure. How many different call sites do we have for encode()?
Darin Adler
Comment 3 2015-04-28 08:16:39 PDT
Comment on attachment 251834 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=251834&action=review > Source/WebCore/platform/text/TextEncoding.h:73 > + enum EncodeNormalization { NormalizeBeforeEncoding, DoNotNormalizeBeforeEncoding }; > + CString encode(StringView, UnencodableHandling, EncodeNormalization = NormalizeBeforeEncoding) const; This is OK, but I’d also like us to consider doing the normalization with a separate function as the comment suggested. I’d like to understand how many different call sites there are that do this.
Carlos Garcia Campos
Comment 4 2015-04-28 08:48:03 PDT
(In reply to comment #2) > (In reply to comment #1) > > but I think it's very convenient that it happens in encode() > > Really? I’m not so sure. How many different call sites do we have for > encode()? Not because they are many, but because of StringView -> Uchar conversions. Le me check all the callers
Carlos Garcia Campos
Comment 5 2015-04-28 08:57:36 PDT
WebCore/fileapi/WebKitBlobBuilder.cpp: BlobBuilder::append() WebCore/html/FormDataList.cpp: FormDataList::appendString() WebCore/page/PageSerializer.cpp: PageSerializer::serializeFrame() and PageSerializer::serializeCSSStyleSheet() WebCore/platform/URL.cpp: WebCore/platform/graphics/avfoundation/objc/MediaPlayerPrivateAVFoundationObjC.mm: MediaPlayerPrivateAVFoundationObjC::shouldWaitForLoadingOfResource() WebCore/platform/graphics/harfbuzz/HarfBuzzFaceCairo.cpp: harfBuzzGetGlyph() WebCore/platform/network/DataURL.cpp: handleDataURL() WebCore/platform/network/FormDataBuilder.cpp: FormDataBuilder::addFilenameToMultiPartHeader() WebCore/xml/XMLHttpRequest.cpp: XMLHttpRequest::send()
Darin Adler
Comment 6 2015-04-28 09:23:52 PDT
So I think we have two questions: 1) If we did want to make normalization separate, what is a good idiom that would keep things brief and efficient and just use the StringView as-is without memory allocation when no normalization is needed? 2) Is normalization helpful for all those other clients? I suppose we can leave it inside encode for now, but we should make sure that all those other call sites want normalization and have someone write test cases for them.
Carlos Garcia Campos
Comment 7 2015-04-28 09:59:03 PDT
(In reply to comment #6) > So I think we have two questions: > > 1) If we did want to make normalization separate, what is a good idiom that > would keep things brief and efficient and just use the StringView as-is > without memory allocation when no normalization is needed? > > 2) Is normalization helpful for all those other clients? > > I suppose we can leave it inside encode for now, but we should make sure > that all those other call sites want normalization and have someone write > test cases for them. Ok, I'll add a FIXME then.
Alexey Proskuryakov
Comment 8 2015-04-28 20:56:46 PDT
This keeps coming up as something that looks like unusual behavior at s first glance, however this is an intentional behavior for compatibility. I expect that this change will break more than it fixes, because NFC is s lot more common in general. You have an example of a file and a server that only accepts the decomposed form, but do we know that it's more common than the opposite? Mac is special here, because the platform uses decomposed form. But consider the case where a user types into a search form, and the form is submitted as a GET. Currently, we normalize to NFC, and the server gets the same request from Safari as it gets from e.g. Firefox on Windows. With the proposed change, the request will be different, and a server that doesn't implement normalization will misbehave.
Carlos Garcia Campos
Comment 9 2015-04-28 23:04:57 PDT
(In reply to comment #8) > This keeps coming up as something that looks like unusual behavior at s > first glance, however this is an intentional behavior for compatibility. > > I expect that this change will break more than it fixes, because NFC is s > lot more common in general. You have an example of a file and a server that > only accepts the decomposed form, but do we know that it's more common than > the opposite? Not exactly, this is not about client/server normalizing or not. The file itself in the file system can be normalized or not, in the particular case of a filename containing a 'ñ', it can be encoded as U+006E U+0303, or U+00F1, but they end up being different files, because the bytes are different, even if the visual representation is the same. This was the problem I was having, a server containing files in the file system with 'ñ' encoded as U+006E U+0303, but the URL in WebKit was normalized to U+00F1 and the files failed mto load by the network bacckend. The very same files worked in chrome and firefox. > Mac is special here, because the platform uses decomposed form. But consider > the case where a user types into a search form, and the form is submitted as > a GET. Currently, we normalize to NFC, and the server gets the same request > from Safari as it gets from e.g. Firefox on Windows. With the proposed > change, the request will be different, and a server that doesn't implement > normalization will misbehave. Form data decoding hasn't changed, except for filenames, so what the user types in a search form is still normalized.
Alexey Proskuryakov
Comment 10 2015-04-28 23:22:48 PDT
> The file itself in the file system can be normalized or not, in the particular case of a filename containing a 'ñ', it can be encoded as U+006E U+0303, or U+00F1, but they end up being different files, because the bytes are different, even if the visual representation is the same. Yes, I understand that this is what you are saying. This is a bug in server's file system - everything that supports Unicode must treat different normalization forms as equivalent, so a filesystem may not have two files whose names only differ in normalization form. > The very same files worked in chrome and firefox. Yes, they work for you in a test case, but they won't work in other scenarios, most notably those that involve user input on a Mac. This is as I said, the behavior in WebKit is intentionally different to have a more common Unicode form on the wire. Windows browsers have the luxury of letting the bytes through unchanged because their OS and Internet both use the same form, but for Safari, it is not as straightforward. > Form data decoding hasn't changed, except for filenames, so what the user types in a search form is still normalized. The changes in encodeRelativeString() are quite confusing, I'm not sure if that's correct. There is some "otherDecoded" string that is counter-intuitively a result of calling encode(), and that's separate from where the path is handled. Another change in this patch is that filenames in form data are not encoded. This means that a file uploaded from Mac will retain the custom HFS normalization form that is not used anywhere else - how if that the right thing to do?
Alexey Proskuryakov
Comment 11 2015-04-28 23:23:47 PDT
> filenames in form data are not encoded I meant, not normalized.
Carlos Garcia Campos
Comment 12 2015-04-28 23:49:10 PDT
(In reply to comment #10) > > The file itself in the file system can be normalized or not, in the particular case of a filename containing a 'ñ', it can be encoded as U+006E U+0303, or U+00F1, but they end up being different files, because the bytes are different, even if the visual representation is the same. > > Yes, I understand that this is what you are saying. This is a bug in > server's file system - everything that supports Unicode must treat different > normalization forms as equivalent, so a filesystem may not have two files > whose names only differ in normalization form. I don't think Linux and the most common file systems used in Linux know anything about encoding, filenames are just bytes (only exceptions are 0 and /, I think), so two files with different bytes in their name are just different. It seems HFS does care about encodings and normalization, I didn't know it. So, maybe this change should be made specific to Linux (or other unix systems, except mac) > > The very same files worked in chrome and firefox. > > Yes, they work for you in a test case, but they won't work in other > scenarios, most notably those that involve user input on a Mac. This is as I > said, the behavior in WebKit is intentionally different to have a more > common Unicode form on the wire. Windows browsers have the luxury of letting > the bytes through unchanged because their OS and Internet both use the same > form, but for Safari, it is not as straightforward. Well, I isolated the problem in a test case, but the issue was happening in real cases. It's not that the server doesn't normalize the filenames, the server just uses what there's in the filesystem. > > Form data decoding hasn't changed, except for filenames, so what the user types in a search form is still normalized. > > The changes in encodeRelativeString() are quite confusing, I'm not sure if > that's correct. There is some "otherDecoded" string that is > counter-intuitively a result of calling encode(), and that's separate from > where the path is handled. > > Another change in this patch is that filenames in form data are not encoded. > This means that a file uploaded from Mac will retain the custom HFS > normalization form that is not used anywhere else - how if that the right > thing to do? I assumed all file systems handled files as just bytes, I didn't know HFS worked differently. We need to make this depending on the platform.
Alexey Proskuryakov
Comment 13 2015-04-29 00:11:05 PDT
With a quick web search, it appears that on Windows and Linux, normalization is basically left to input methods, and they don't consistently produce NFC. This means that a filename typed on one system will not necessarily match the same filename typed on another. This makes me wonder if we actually need to standardize on WebKit behavior of having a well defined normalization form on the wire. It doesn't seem all that Mac specific any more.
Carlos Garcia Campos
Comment 14 2015-07-22 02:07:33 PDT
Created attachment 257254 [details] Updated patch Updated patch to do the normalization or not depending on the platform. Also removed the form data filename, as I'm not sure about that case. Darin, is it correct to assume that PLATFORM(COCOA) means HFS?
Darin Adler
Comment 15 2015-07-22 13:56:21 PDT
(In reply to comment #14) > Updated patch to do the normalization or not depending on the platform. Also > removed the form data filename, as I'm not sure about that case. Darin, is > it correct to assume that PLATFORM(COCOA) means HFS? It’s not correct to assume that. Macs support NFS, for example. It’s true that most older Unix systems have filenames that are an arbitrary string of bytes; the typical restriction is simply that none of the bytes can be '\0' or '/'. But note that the string of bytes can’t necessarily be converted into a sequence of UTF-16 code points. There could easily be a filename that’s not a valid UTF-8 sequence. How would we handle that? When we are talking about URLs, what matters is the file system on the server, not on the client.
Carlos Garcia Campos
Comment 16 2015-07-23 01:15:54 PDT
(In reply to comment #15) > (In reply to comment #14) > > Updated patch to do the normalization or not depending on the platform. Also > > removed the form data filename, as I'm not sure about that case. Darin, is > > it correct to assume that PLATFORM(COCOA) means HFS? > > It’s not correct to assume that. Macs support NFS, for example. > > It’s true that most older Unix systems have filenames that are an arbitrary > string of bytes; the typical restriction is simply that none of the bytes > can be '\0' or '/'. But note that the string of bytes can’t necessarily be > converted into a sequence of UTF-16 code points. There could easily be a > filename that’s not a valid UTF-8 sequence. How would we handle that? > So, that's another reason not to normalize filenames in URL, no? > When we are talking about URLs, what matters is the file system on the > server, not on the client. Oops, of course, you are right.
Csaba Osztrogonác
Comment 17 2015-09-14 11:14:27 PDT
Comment on attachment 251834 [details] Patch Cleared Darin Adler's review+ from obsolete attachment 251834 [details] so that this bug does not appear in http://webkit.org/pending-commit.
Ahmad Saleem
Comment 18 2022-06-04 04:34:20 PDT
I think Darin's patch based on last Comment got landed. https://github.com/WebKit/WebKit/commit/7e077530d958327a7716853db6dade3e6f0e7ff5 Do we need this or it is about something else? Thanks!
Alexey Proskuryakov
Comment 19 2022-06-05 13:37:25 PDT
That patch landed in 2006, so it was a different change. We had other discussions about the trade-offs of matching Windows behavior vs. round-tripping cleanliness in various contexts, and even something like a plan (normalizing text as it enters WebKit via system APIs including text input and file names), but I don't think that anything systemic got implemented yet.
Anne van Kesteren
Comment 20 2024-08-28 01:46:18 PDT
This patch is attempting to change the URL parser which got substantially rewritten since 2015. Pretty sure we have a separate bug tracking the Unicode normalization that happens here and there.
Note You need to log in before you can comment on or make changes to this bug.