Bug 4931 - Unicode format characters (Cf) other than soft hyphen should be removed from JavaScript source
Summary: Unicode format characters (Cf) other than soft hyphen should be removed from ...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore (show other bugs)
Version: 420+
Hardware: Mac OS X 10.4
: P2 Normal
Assignee: Alexey Proskuryakov
URL:
Keywords:
: 16694 19070 (view as bug list)
Depends on:
Blocks: 4885
  Show dependency treegraph
 
Reported: 2005-09-11 02:16 PDT by Alexey Proskuryakov
Modified: 2008-05-14 22:29 PDT (History)
4 users (show)

See Also:


Attachments
test case for formatting characters in literals (759 bytes, text/html)
2005-09-11 02:16 PDT, Alexey Proskuryakov
no flags Details
test case for formatting characters outside literals (685 bytes, text/html)
2005-09-11 02:17 PDT, Alexey Proskuryakov
no flags Details
Darin's patch extracted from 4885 (4.61 KB, patch)
2005-10-03 09:35 PDT, Kimmo Kinnunen
no flags Details | Formatted Diff | Diff
first cut, not tested yet (2.97 KB, patch)
2008-04-21 10:34 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
patch, tested now -- needs regression test cases and ChangeLog (2.98 KB, patch)
2008-04-24 12:30 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
patch (43.94 KB, patch)
2008-05-05 10:08 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
only remove the BOM (7.54 KB, patch)
2008-05-14 04:21 PDT, Alexey Proskuryakov
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Proskuryakov 2005-09-11 02:16:08 PDT
Section 7.1 of ECMA-262 says:

"The format control characters can occur anywhere in the source text of an ECMAScript program. These 
characters are removed from the source text before applying the lexical grammar. <...> these characters 
are removed before processing string and regular expression literals"

Besides Safari, this doesn't appear to work in Firefox, Opera or MacIE.
Comment 1 Alexey Proskuryakov 2005-09-11 02:16:58 PDT
Created attachment 3859 [details]
test case for formatting characters in literals
Comment 2 Alexey Proskuryakov 2005-09-11 02:17:18 PDT
Created attachment 3860 [details]
test case for formatting characters outside literals
Comment 3 Kimmo Kinnunen 2005-10-03 09:35:59 PDT
Created attachment 4179 [details]
Darin's patch extracted from 4885

This patch strips Cf chars from the input stream. It was extracted from a patch
done by Daring, from bug 4885.

Note that it modifies a testcase.
Comment 4 Maciej Stachowiak 2005-10-04 20:28:27 PDT
Does this work in any browser? It may be a compatibility problem if we diverge on this. I can imagine 
people wanting to use soft hyphens in JS strings.
Comment 5 Darin Adler 2005-10-05 10:04:17 PDT
I don't think it's correct in the other browsers either. I agree that it could cause a problem. Wonder what 
we should do.
Comment 6 Darin Adler 2005-10-06 07:18:57 PDT
Comment on attachment 4179 [details]
Darin's patch extracted from 4885

I think the patch is fine, but I am worried about Maciej's comment now. How can
we decide whether to do this or not?
Comment 7 Alexey Proskuryakov 2005-10-06 11:47:12 PDT
I have looked in Mozilla's bugzilla, and also googled for relevant discussions, but found nothing. Perhaps, 
I didn't look close enough, as it's hard to expect that such a feature has never been tested in Firefox (see 
http://www.mozilla.org/js/language/js20/formal/stages.html).

Besides the soft hyphen, I would expect various joiners/non-joiners and bidi overrides to be used in string 
literals sometimes, but I do not have any evidence.
Comment 8 Darin Adler 2005-10-06 11:52:40 PDT
Alexey, I'm not sure what relevance the JavaScript 2.0 specification has to this discussion.

We've tested Firefox, and it does *not* remove soft hyphens.

I'd love to fix this if we can be sure it won't cause compatibility problems.
Comment 9 Alexey Proskuryakov 2005-10-06 12:11:58 PDT
(In reply to comment #8)
> Alexey, I'm not sure what relevance the JavaScript 2.0 specification has to this discussion.

  Only to underline that I'm surprised to have found nothing in Mozilla bugzilla.

> We've tested Firefox, and it does *not* remove soft hyphens.

  Right, that's what I wrote in the description. Later, I have tested WinIE, and it also doesn't remove Cf 
characters, just the other browsers.

> I'd love to fix this if we can be sure it won't cause compatibility problems.

  Well, I was only splitting bug 4885, and don't really have an opinion about this part (unlike the other 
ones :) )
Comment 10 Darin Adler 2005-10-11 08:41:27 PDT
Comment on attachment 4179 [details]
Darin's patch extracted from 4885

I think we should probably land this even thought we are uneasy about the soft
hyphen thing.

If Maciej disagrees, he can review- it.
Comment 11 Timothy Hatcher 2005-10-24 14:27:54 PDT
John Sullivan landed this.

sullivan    05/10/24 14:22:21

  Modified:    .        ChangeLog
               kjs      lexer.cpp
               tests/mozilla/ecma/Array 15.4.5.1-1.js
Comment 12 Alexey Proskuryakov 2006-09-23 11:11:40 PDT
This fix has been rolled out in bug 10183. Not re-opening this bug, since the problem was not with the implementation, but rather with site compatibility.
Comment 13 Darin Adler 2006-09-25 08:55:46 PDT
We won't fix this because of bad effects on site compatibility.
Comment 14 Darin Adler 2008-01-01 21:57:50 PST
*** Bug 16694 has been marked as a duplicate of this bug. ***
Comment 15 Alexey Proskuryakov 2008-04-21 09:11:15 PDT
In fact, it looks like Firefox only preserves U+00AD (soft hyphen), and other Cf characters are removed as required by the spec. We probably want to match this, if only to fix ecma_3/Unicode/uc-001.js, which verifies that \u200E is removed.
Comment 16 Darin Adler 2008-04-21 10:16:08 PDT
OK. Will be very easy to fix.
Comment 17 Darin Adler 2008-04-21 10:34:36 PDT
Created attachment 20731 [details]
first cut, not tested yet
Comment 18 Darin Adler 2008-04-24 12:30:19 PDT
Created attachment 20796 [details]
patch, tested now -- needs regression test cases and ChangeLog

Here's a new patch, ready to go, except needs change log and some regression tests.
Comment 19 Darin Adler 2008-05-05 10:08:56 PDT
Created attachment 20970 [details]
patch
Comment 20 Alexey Proskuryakov 2008-05-05 10:26:19 PDT
Comment on attachment 20970 [details]
patch

r=me. Looks like we are testing for the soft hyphen not being removed in a very indirect way though - perhaps it would help to have a dedicated test case fir this. Or maybe the existing test and the long comment that you added are enough.
Comment 21 Alexey Proskuryakov 2008-05-14 02:54:26 PDT
Now that we are not removing BOM characters at decoding time, this is causing compatibility issues, see <rdar://problem/5934376>. So, I decided to make an additional test case, and land this.

But turns out that Firefox is changing (almost) as we speak - see <https://bugzilla.mozilla.org/show_bug.cgi?id=274152> and <https://bugzilla.mozilla.org/show_bug.cgi?id=368516>.

I'm going to investigate this further.
Comment 22 Alexey Proskuryakov 2008-05-14 04:21:44 PDT
Created attachment 21123 [details]
only remove the BOM
Comment 23 Darin Adler 2008-05-14 09:14:47 PDT
Comment on attachment 21123 [details]
only remove the BOM

You need to use &lt; in the description string instead of the less-than symbol.

r=me
Comment 24 Alexey Proskuryakov 2008-05-14 09:35:56 PDT
Committed <http://trac.webkit.org/changeset/33443>.
Comment 25 Mark Rowe (bdash) 2008-05-14 22:29:10 PDT
*** Bug 19070 has been marked as a duplicate of this bug. ***