Bug 91817 - Shouldn't normalise file names on submission
Summary: Shouldn't normalise file names on submission
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Page Loading
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-19 21:30 PDT by Ian 'Hixie' Hickson
Modified: 2013-10-04 09:56 PDT
CC List: 3 users

See Also:


Description Ian 'Hixie' Hickson 2012-07-19 21:30:59 PDT
The names of uploaded files are normalised to NFC. This means that if you upload two files that have names that are different on the filesystem, it's possible that the server will, even if it faithfully round-trips the filenames, return them with the same filename.

Quoting NARUSE, Yui from W3C bug number 14526 comment number 18 (https://www.w3.org/Bugs/Public/show_bug.cgi?id=14526#c18):
>
> Imagine following situation, a directory has two file, U+795E.txt and
> U+FA19.txt.
> And the user want to upload them. As you can notice, DOM and uploaded server
> can't distinguish them. Normalization considered harmful.
> [...]
> Yes, current WebKit normalizes those Kanjis, and it is considered breakage.
> You can see the breakage by uploading U+FA19.txt.
> After uploading, it become U+795E.txt and you can find the left part of the
> Kanji is changed.
> These kanjis have the same meaning "god", and specified as compatibility
> character thorough some political reason, but people don't want to
> normalize them other than the true normalization situation.
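
The collision described in the quote can be reproduced outside the browser. The following is only a minimal sketch in Python, assuming that submission applies plain NFC to the file name; `submitted_name` is a hypothetical helper for illustration, not WebKit code.

```python
import unicodedata

# Two names that are distinct on the filesystem: U+795E and the
# CJK compatibility ideograph U+FA19.
name_a = "\u795e.txt"
name_b = "\ufa19.txt"


def submitted_name(filename):
    """Hypothetical helper: the name as it would look after NFC normalisation."""
    return unicodedata.normalize("NFC", filename)


# The names differ on disk but compare equal once normalised.
distinct_on_disk = name_a != name_b
collide_after_nfc = submitted_name(name_a) == submitted_name(name_b)
print(distinct_on_disk, collide_after_nfc)
```

U+FA19 has a canonical (singleton) decomposition to U+795E, so NFC replaces it and a server that round-trips the submitted names has no way to tell the two files apart.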
Comment 1 Alexey Proskuryakov 2012-07-20 09:40:54 PDT
I think that what we're doing is right. If we didn't normalize file names, we'd send decomposed form to servers from Mac, while every Windows browser always sends precomposed form.

It is very likely that sites would have trouble with that (either break on any decomposed Unicode because they were only tested with Windows clients, or get confused when a file is touched by multiple platforms). Indeed, imagine a file that's uploaded from Windows, then edited on Mac and uploaded again. Chances are that the server would show two copies if the name were in a different form when re-uploaded.

This is a much more practical situation than the one presented in the bug description.
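
To illustrate the re-upload scenario, here is a rough sketch in Python. It assumes a server that keys stored files on the raw submitted name; `store` and the sample file name are made up for the example and do not describe any real site.

```python
import unicodedata


def store(existing, uploaded, normalise):
    """Hypothetical server-side store keyed on the submitted file name.

    When normalise is True the name is converted to NFC first, which
    mimics a client that normalises on submission.
    """
    key = unicodedata.normalize("NFC", uploaded) if normalise else uploaded
    return existing | {key}


# Precomposed form, as a Windows client would typically send it.
precomposed = "caf\u00e9.txt"
# Decomposed form of the same name, as it would come off a Mac filesystem.
decomposed = unicodedata.normalize("NFD", precomposed)

# Without normalisation the server ends up with two entries for one file.
raw_store = store(store(set(), precomposed, False), decomposed, False)
# With NFC applied on submission, both uploads map to the same entry.
nfc_store = store(store(set(), precomposed, True), decomposed, True)
print(len(raw_store), len(nfc_store))
```

The two strings render identically but differ in code points, which is why a byte-wise comparison on the server shows two copies.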
Comment 2 Alexey Proskuryakov 2013-07-31 09:45:09 PDT
If the Unicode spec disagrees with what people want, this is something to bring up with the Unicode committee. It makes no sense for implementations to preserve normalization forms.
Comment 3 Julian Reschke 2013-10-04 09:30:44 PDT
(In reply to comment #1)
> I think that what we're doing is right. If we didn't normalize file names, we'd send decomposed form to servers from Mac, while every Windows browser always sends precomposed form.

Not entirely true; Firefox (and probably IE too) send whatever the FS layer gave them, and that *can* be decomposed as well (yes, tested).
Comment 4 Alexey Proskuryakov 2013-10-04 09:56:12 PDT
OK.

I don't think that this factoid changes anything though - manually adjusting file names at the FS level is not a common scenario, so the fact that Windows browsers don't normalize them is not a practical consideration. It's still true that this never happens in practice, with the exception of your testing.