Bug 81270

Summary: FileApi does not handle files with NFD encoded umlaut in file name
Product: WebKit Reporter: Toni Barzic <tbarzic>
Component: PlatformAssignee: Nobody <webkit-unassigned>
Status: UNCONFIRMED    
Severity: Normal CC: abarth, ap, ericu, zelidrag
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch ap: review-

Toni Barzic
Reported 2012-03-15 14:52:58 PDT
Files with NFD encoded characters (e.g. a umlaut encoded as a%CC%88) in name are not handled well in Chromium OS file manager. (see http://crosbug.com/ 26028) When we do file system operations we use file's url as an argument. When the url gets created from the file name, it gets normalized to NFC encoding. When the file name gets extracted from the url on Chromium side, it contains %C3%A4 (NFC encoding of a umlaut) instead of a%CC%88, and file operations fail because we cannot find the specified file path on the disk. (we search for foo0xc30xa4.foo instead of fooa0xcc0x88). The file gets shown in the file manager, but it becomes useless (we cannot do anything with it). Note that fileEntry object contains correctly encoded file name/path.
Attachments
Patch (17.18 KB, patch)
2012-03-15 15:42 PDT, Toni Barzic
no flags
Patch (17.11 KB, patch)
2012-03-20 10:10 PDT, Toni Barzic
ap: review-
Toni Barzic
Comment 1 2012-03-15 15:42:18 PDT
Eric U.
Comment 2 2012-03-19 17:43:52 PDT
Comment on attachment 132136 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132136&action=review Seems like a reasonable approach to me. > Source/WebCore/ChangeLog:9 > + -create new WebCore::TextEncoding::encode method that will only dot the text encoding (without normalization). s/dot/do/
Toni Barzic
Comment 3 2012-03-20 10:10:13 PDT
Adam Barth
Comment 4 2012-03-20 14:49:20 PDT
I've removed the [Chromium] tag because this bug changes code that isn't Chromium-specific.
Adam Barth
Comment 5 2012-03-20 14:53:10 PDT
Comment on attachment 132844 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132844&action=review > Source/WebCore/ChangeLog:10 > + -switch all calls to encode to normalizeAndEncode except the ones in platform/KURLGoogle.cpp Does that introduce another difference between KURL and GURL? Can you write a test in LayoutTest/fast/url that shows the difference for this patch?
Adam Barth
Comment 6 2012-03-20 14:54:00 PDT
+ap for thoughts about normalization and encoding. It's unclear to me how to decide which of these call sites want both normalization and encoding and which just want encoding.
Alexey Proskuryakov
Comment 7 2012-03-20 15:22:52 PDT
Comment on attachment 132844 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132844&action=review My understanding is that to match the Web, file system implementation will need to decode percent escapes and do its own normalization. Or just work with NFC internally, which just makes sense for any Web-related system. > Source/WebCore/ChangeLog:3 > + [Chromium] FileApi does not handle files with NFD encoded umlaut in file name Did you mean the other way around? TextEncoding::encode produces NFC, which is the standard encoding on the Web. >> Source/WebCore/ChangeLog:10 >> + -switch all calls to encode to normalizeAndEncode except the ones in platform/KURLGoogle.cpp > > Does that introduce another difference between KURL and GURL? Can you write a test in LayoutTest/fast/url that shows the difference for this patch? The goal of existing encode() behavior is to make WebKit on any platform generate the same kind of data as Windows browsers do. So when faced with input that's not in NFC form (like a file name on OS X), it converts the data to NFC. I think that this just makes sense. Sending randomly normalized data across the wire (in URLs or otherwise) does not. This patch seems to affect all URLs, not just file system ones. > Source/WebCore/platform/text/TextEncoding.h:72 > + CString normalizeAndEncode(const UChar*, size_t length, UnencodableHandling) const; This new method does not make a lot of sense conceptually. 1. There are multiple normalization schemes. Normalize to what? 2. You only need to normalize when using one of Unicode encodings (such as UTF-*). So, exposing this as a public method on TextEncoding is confusing. 3. As mentioned above, are there any cases where one wants non-NFC data? > LayoutTests/fast/filesystem/file-from-file-entry-nfd-name.html:7 > +<script src="resources/file-from-file-entry-nfd-name.js"></script> Please don't split tests into .html and .js. The only exception is fast/js directory, and then the .js part should be pure JavaScript, no DOM.
Toni Barzic
Comment 8 2012-03-20 15:52:45 PDT
Comment on attachment 132844 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=132844&action=review >> Source/WebCore/ChangeLog:3 >> + [Chromium] FileApi does not handle files with NFD encoded umlaut in file name > > Did you mean the other way around? TextEncoding::encode produces NFC, which is the standard encoding on the Web. No, I did not mean another way around.. The problem occurs on ChromeOS, where we can access files on the external storage. These files may have names encoded in NFD form. When we try to access, we break because the file url is in NFC form, and the file cannot be fond on the disk. >>> Source/WebCore/ChangeLog:10 >>> + -switch all calls to encode to normalizeAndEncode except the ones in platform/KURLGoogle.cpp >> >> Does that introduce another difference between KURL and GURL? Can you write a test in LayoutTest/fast/url that shows the difference for this patch? > > The goal of existing encode() behavior is to make WebKit on any platform generate the same kind of data as Windows browsers do. So when faced with input that's not in NFC form (like a file name on OS X), it converts the data to NFC. > > I think that this just makes sense. Sending randomly normalized data across the wire (in URLs or otherwise) does not. This patch seems to affect all URLs, not just file system ones. Do you think it would make sense to introduce encodeWithoutNormalization() method and call it only when generating file urls for filesystem operations? >> Source/WebCore/platform/text/TextEncoding.h:72 >> + CString normalizeAndEncode(const UChar*, size_t length, UnencodableHandling) const; > > This new method does not make a lot of sense conceptually. > > 1. There are multiple normalization schemes. Normalize to what? > 2. You only need to normalize when using one of Unicode encodings (such as UTF-*). So, exposing this as a public method on TextEncoding is confusing. > 3. As mentioned above, are there any cases where one wants non-NFC data? see above >> LayoutTests/fast/filesystem/file-from-file-entry-nfd-name.html:7 >> +<script src="resources/file-from-file-entry-nfd-name.js"></script> > > Please don't split tests into .html and .js. The only exception is fast/js directory, and then the .js part should be pure JavaScript, no DOM. Hm, most of the tests in fast/filesystem are sepoarated this way..
Adam Barth
Comment 9 2012-03-20 15:54:33 PDT
I don't think it makes sense to normalize filesystem URLs differently than other kinds of URLs. That's just asking for pain later.
Eric U.
Comment 10 2012-03-20 16:03:17 PDT
(In reply to comment #8) > (From update of attachment 132844 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=132844&action=review > > >> Source/WebCore/ChangeLog:3 > >> + [Chromium] FileApi does not handle files with NFD encoded umlaut in file name > > > > Did you mean the other way around? TextEncoding::encode produces NFC, which is the standard encoding on the Web. > > No, I did not mean another way around.. > The problem occurs on ChromeOS, where we can access files on the external storage. These files may have names encoded in NFD form. When we try to access, we break because the file url is in NFC form, and the file cannot be fond on the disk. > > >>> Source/WebCore/ChangeLog:10 > >>> + -switch all calls to encode to normalizeAndEncode except the ones in platform/KURLGoogle.cpp > >> > >> Does that introduce another difference between KURL and GURL? Can you write a test in LayoutTest/fast/url that shows the difference for this patch? > > > > The goal of existing encode() behavior is to make WebKit on any platform generate the same kind of data as Windows browsers do. So when faced with input that's not in NFC form (like a file name on OS X), it converts the data to NFC. > > > > I think that this just makes sense. Sending randomly normalized data across the wire (in URLs or otherwise) does not. This patch seems to affect all URLs, not just file system ones. > > Do you think it would make sense to introduce encodeWithoutNormalization() method and call it only when generating file urls for filesystem operations? > > >> Source/WebCore/platform/text/TextEncoding.h:72 > >> + CString normalizeAndEncode(const UChar*, size_t length, UnencodableHandling) const; > > > > This new method does not make a lot of sense conceptually. > > > > 1. There are multiple normalization schemes. Normalize to what? > > 2. You only need to normalize when using one of Unicode encodings (such as UTF-*). So, exposing this as a public method on TextEncoding is confusing. > > 3. As mentioned above, are there any cases where one wants non-NFC data? > > see above > > >> LayoutTests/fast/filesystem/file-from-file-entry-nfd-name.html:7 > >> +<script src="resources/file-from-file-entry-nfd-name.js"></script> > > > > Please don't split tests into .html and .js. The only exception is fast/js directory, and then the .js part should be pure JavaScript, no DOM. > > Hm, most of the tests in fast/filesystem are sepoarated this way.. Those are generally split so that the javascript can be shared between the document and worker tests; the HTML files are nearly empty, and just load the required helper files for each environment and the test file.
Alexey Proskuryakov
Comment 11 2012-03-20 16:07:34 PDT
> I don't think it makes sense to normalize filesystem URLs differently than other kinds of URLs. That's just asking for pain later. Right. It's up to file system to be Unicode conformant, and treat all Unicode forms as equivalent. The presence of percent encoding obviously complicates things, but the situation is pretty much the same it is for Web servers, so FileSystem should try to match Apache behavior. I think that means to decode percent escapes, and then normalize the path. > Those are generally split so that the javascript can be shared between the document and worker tests Oh, that's also a valid reason. Tests that are not shared with workers should not be split though.
Adam Barth
Comment 12 2012-03-20 16:13:20 PDT
> It's up to file system to be Unicode conformant, and treat all Unicode forms as equivalent. That's not how Linux file systems work, for what it's worth. They treat file names as octet sequences, not as character sequences.
Eric U.
Comment 13 2012-03-20 16:14:22 PDT
(In reply to comment #8) > (From update of attachment 132844 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=132844&action=review > > >> Source/WebCore/ChangeLog:3 > >> + [Chromium] FileApi does not handle files with NFD encoded umlaut in file name > > > > Did you mean the other way around? TextEncoding::encode produces NFC, which is the standard encoding on the Web. > > No, I did not mean another way around.. > The problem occurs on ChromeOS, where we can access files on the external storage. These files may have names encoded in NFD form. When we try to access, we break because the file url is in NFC form, and the file cannot be fond on the disk. Just to expand on this, the problem comes when the path does a round-trip from UTF-8 to normalized UTF-16 and back to UTF-8 [now normalized]. ChromeOS then tries to find the original file on disk using the normalized filename, and fails. It's not a matter of the FileSystem code treating all Unicode forms as equivalent; it's that the underlying host filesystem doesn't. To most [all?] Linux filesystems, paths are just strings of bytes, and character sets/encodings don't come into it. This is an odd case that's not previously existed in the web platform AFAIK, so it's not clear what the perfect solution is. But what's there now isn't working for that particular use case; they need a non-destructive round-trip. We had also considered keeping the original string around [plumbing it all the way through as an opaque data blob, held in the directory entry]. But then if you pulled the name out of the DirectoryEntry into a JS variable, that sideband data wouldn't go with it, and so calls using that string would mysteriously fail. > >>> Source/WebCore/ChangeLog:10 > >>> + -switch all calls to encode to normalizeAndEncode except the ones in platform/KURLGoogle.cpp > >> > >> Does that introduce another difference between KURL and GURL? Can you write a test in LayoutTest/fast/url that shows the difference for this patch? > > > > The goal of existing encode() behavior is to make WebKit on any platform generate the same kind of data as Windows browsers do. So when faced with input that's not in NFC form (like a file name on OS X), it converts the data to NFC. > > > > I think that this just makes sense. Sending randomly normalized data across the wire (in URLs or otherwise) does not. This patch seems to affect all URLs, not just file system ones. > > Do you think it would make sense to introduce encodeWithoutNormalization() method and call it only when generating file urls for filesystem operations? > > >> Source/WebCore/platform/text/TextEncoding.h:72 > >> + CString normalizeAndEncode(const UChar*, size_t length, UnencodableHandling) const; > > > > This new method does not make a lot of sense conceptually. > > > > 1. There are multiple normalization schemes. Normalize to what? > > 2. You only need to normalize when using one of Unicode encodings (such as UTF-*). So, exposing this as a public method on TextEncoding is confusing. > > 3. As mentioned above, are there any cases where one wants non-NFC data? > > see above > > >> LayoutTests/fast/filesystem/file-from-file-entry-nfd-name.html:7 > >> +<script src="resources/file-from-file-entry-nfd-name.js"></script> > > > > Please don't split tests into .html and .js. The only exception is fast/js directory, and then the .js part should be pure JavaScript, no DOM. > > Hm, most of the tests in fast/filesystem are sepoarated this way..
Adam Barth
Comment 14 2012-03-21 09:27:04 PDT
How did the non-normalized file name get stored on disk in the first place?
Eric U.
Comment 15 2012-03-21 09:32:03 PDT
(In reply to comment #14) > How did the non-normalized file name get stored on disk in the first place? There are several ways it can happen. In the particular manifestations that were reported, it came via the download manager. That's certainly fixable, however that doesn't solve the whole problem. The user could also just plug in a USB drive with denormalized paths on it, and the ChromeOS file browser would fail.
Alexey Proskuryakov
Comment 16 2012-03-21 09:49:45 PDT
This sounds more like a Linux file system issue than anything else. Are all Linux file systems like this, or just some of them? This can be seen as a configuration issue - users that want Unicode in file names to work should use a Unicode aware file system. > The user could also just plug in a USB drive with denormalized paths on it Is this a practical scenario? How would they get non-NFC paths on the USB drive?
Eric U.
Comment 17 2012-03-21 09:57:55 PDT
(In reply to comment #16) > This sounds more like a Linux file system issue than anything else. Are all Linux file systems like this, or just some of them? This can be seen as a configuration issue - users that want Unicode in file names to work should use a Unicode aware file system. > > > The user could also just plug in a USB drive with denormalized paths on it > > Is this a practical scenario? How would they get non-NFC paths on the USB drive? Is NFD so rare? The ChromeOS bug [http://code.google.com/p/chromium-os/issues/detail?id=26028] has multiple reports from different users and several different sources of paths. Granted, several were from screen captures, so there's probably a single screencapture app that's causing 2-3 of the reports, but there were other sources as well.
Eric U.
Comment 18 2012-03-21 10:08:18 PDT
(In reply to comment #16) > This sounds more like a Linux file system issue than anything else. Are all Linux file systems like this, or just some of them? This can be seen as a configuration issue - users that want Unicode in file names to work should use a Unicode aware file system. It's more a system issue than a filesystem issue, I believe, and yes, all Linux filesystems are like this. It's not a configuration option--ext[23], reiser, etc. just don't store or deal with encoding information, and you set environment variables to tell your tools what to expect. The kernel just treats strings as bytes, with the exceptions that a NUL is a terminator and a slash is a directory separator. Beyond that, it knows nothing. The Linux ntfs drivers probably deal with unicode interpretation, but those aren't common as main Linux filesystems. http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html suggests that mounting FAT drives [including USB] with the right options might help, but I've not seen that proven. Still, if that fixed USB, and the download manager fixed downloaded paths, we'd have solved the known sources. Doesn't seem a robust solution, though, since there might be other sources. > > The user could also just plug in a USB drive with denormalized paths on it > > Is this a practical scenario? How would they get non-NFC paths on the USB drive?
Adam Barth
Comment 19 2012-03-21 10:20:00 PDT
I think you're running into a basic impedance mismatch in that the web operates in terms of Unicode and the Linux filesystem operates in terms of octet sequences. It's not going to be possible to round-trip arbitrary octet sequences through Unicode, even if we muck around with the normalizations here.
Alexey Proskuryakov
Comment 20 2012-03-21 10:21:41 PDT
> Is NFD so rare? My understanding is that NFD is pretty much limited to OS X. Windows uses precomposed form, and most other systems that became Unicode aware later (including the Web as a whole) naturally leaned towards Windows compatibility. I do not know very much about modern Linux though. > The kernel just treats strings as bytes Sounds like a robust solution would also have to deal with malformed strings (unpaired surrogates etc). I'd prefer if we didn't need to round trip these. Perhaps a way to hide this issue in practice would be for chromium filesystem code to try a different normalization if a file is not found at first.
Eric U.
Comment 21 2012-03-21 10:23:16 PDT
(In reply to comment #19) > I think you're running into a basic impedance mismatch in that the web operates in terms of Unicode and the Linux filesystem operates in terms of octet sequences. It's not going to be possible to round-trip arbitrary octet sequences through Unicode, even if we muck around with the normalizations here. Sure--there's no fix that takes care of all cases. But if we can at least handle all valid UTF-8, I think we're in good shape.
Eric U.
Comment 22 2012-03-21 10:26:57 PDT
(In reply to comment #20) > > Is NFD so rare? > > My understanding is that NFD is pretty much limited to OS X. Windows uses precomposed form, and most other systems that became Unicode aware later (including the Web as a whole) naturally leaned towards Windows compatibility. > > I do not know very much about modern Linux though. > > > The kernel just treats strings as bytes > > Sounds like a robust solution would also have to deal with malformed strings (unpaired surrogates etc). I'd prefer if we didn't need to round trip these. No argument here; I think handling all valid UTF-8 would be sufficient. > Perhaps a way to hide this issue in practice would be for chromium filesystem code to try a different normalization if a file is not found at first. That seems a bit slow and error-prone. It would be pretty easy to attempt to overwrite a file, only to create a new one with the same name but different encoding. At best, all failing file-open operations would have to be done twice.
Adam Barth
Comment 23 2012-03-21 10:36:08 PDT
My concern is that this change has implications far wider than just the FileSystem API on Linux. You're changing how normalization is done for all URLs on the web. That seems a bit like the tail wagging the dog. Limiting the change to just the filesystem scheme is also unattractive because then the filesystem scheme is treated subtly differently than every other URL scheme. Most folks who process filesystem URLs (or just URLs without specific knowledge of the filesystem scheme) will screw this up, creating an even bigger mess.
Eric U.
Comment 24 2012-03-21 10:42:19 PDT
(In reply to comment #23) > My concern is that this change has implications far wider than just the FileSystem API on Linux. You're changing how normalization is done for all URLs on the web. That seems a bit like the tail wagging the dog. > > Limiting the change to just the filesystem scheme is also unattractive because then the filesystem scheme is treated subtly differently than every other URL scheme. Most folks who process filesystem URLs (or just URLs without specific knowledge of the filesystem scheme) will screw this up, creating an even bigger mess. I hear you. Those are valid concerns. What do you suggest? We can try looking at why downloads aren't normalized, and testing to see if the FAT driver deals with USB drives, but I'm a bit nervous about the robustness of that solution.
Adam Barth
Comment 25 2012-03-21 11:01:22 PDT
> What do you suggest? I'm not sure. I don't have a complete understanding of the system. Perhaps filesystem URLs should let you represent non-Unicode file paths using base64.
Toni Barzic
Comment 26 2012-03-21 11:15:10 PDT
I verified we see the issue with a file on a thumb drive. Also, we mount zip files with avfs, and it reproduces for files in mounted archives..
Eric U.
Comment 27 2012-03-26 13:01:07 PDT
(In reply to comment #25) > > What do you suggest? > > I'm not sure. I don't have a complete understanding of the system. Perhaps filesystem URLs should let you represent non-Unicode file paths using base64. In this case, we've got problems with Unicode paths, though, and it would be really nice not to have to base64 anything that wasn't NFC. I suppose we could have a pass that %-encoded any non-NFC chars in filesystem URL paths. That would contain the ugliness, at least, and not surprise people who don't use the filesystem.
Toni Barzic
Comment 28 2012-03-27 15:54:41 PDT
fyi, I landed cl that normalizes downloaded files' names, so ChromeOS Downloads directory should not have this problem anymore.. The problem with files on removable devices and mounted archives is still there..
Note You need to log in before you can comment on or make changes to this bug.