Bug 4931 - Unicode format characters (Cf) other than soft hyphen should be removed from JavaScript source
: Unicode format characters (Cf) other than soft hyphen should be removed from ...
Status: RESOLVED FIXED
: WebKit
JavaScriptCore
: 420+
: Macintosh Mac OS X 10.4
: P2 Normal
Assigned To:
:
:
:
: 4885
  Show dependency treegraph
 
Reported: 2005-09-11 02:16 PST by
Modified: 2008-05-14 22:29 PST (History)


Attachments
test case for formatting characters in literals (759 bytes, text/html)
2005-09-11 02:16 PST, Alexey Proskuryakov
no flags Details
test case for formatting characters outside literals (685 bytes, text/html)
2005-09-11 02:17 PST, Alexey Proskuryakov
no flags Details
Darin's patch extracted from 4885 (4.61 KB, patch)
2005-10-03 09:35 PST, Kimmo Kinnunen
no flags Review Patch | Details | Formatted Diff | Diff
first cut, not tested yet (2.97 KB, patch)
2008-04-21 10:34 PST, Darin Adler
no flags Review Patch | Details | Formatted Diff | Diff
patch, tested now -- needs regression test cases and ChangeLog (2.98 KB, patch)
2008-04-24 12:30 PST, Darin Adler
no flags Review Patch | Details | Formatted Diff | Diff
patch (43.94 KB, patch)
2008-05-05 10:08 PST, Darin Adler
no flags Review Patch | Details | Formatted Diff | Diff
only remove the BOM (7.54 KB, patch)
2008-05-14 04:21 PST, Alexey Proskuryakov
darin: review+
Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2005-09-11 02:16:08 PST
Section 7.1 of ECMA-262 says:

"The format control characters can occur anywhere in the source text of an ECMAScript program. These 
characters are removed from the source text before applying the lexical grammar. <...> these characters 
are removed before processing string and regular expression literals"

Besides Safari, this doesn't appear to work in Firefox, Opera or MacIE.
------- Comment #1 From 2005-09-11 02:16:58 PST -------
Created an attachment (id=3859) [details]
test case for formatting characters in literals
------- Comment #2 From 2005-09-11 02:17:18 PST -------
Created an attachment (id=3860) [details]
test case for formatting characters outside literals
------- Comment #3 From 2005-10-03 09:35:59 PST -------
Created an attachment (id=4179) [details]
Darin's patch extracted from 4885

This patch strips Cf chars from the input stream. It was extracted from a patch
done by Daring, from bug 4885.

Note that it modifies a testcase.
------- Comment #4 From 2005-10-04 20:28:27 PST -------
Does this work in any browser? It may be a compatibility problem if we diverge on this. I can imagine 
people wanting to use soft hyphens in JS strings.
------- Comment #5 From 2005-10-05 10:04:17 PST -------
I don't think it's correct in the other browsers either. I agree that it could cause a problem. Wonder what 
we should do.
------- Comment #6 From 2005-10-06 07:18:57 PST -------
(From update of attachment 4179 [details])
I think the patch is fine, but I am worried about Maciej's comment now. How can
we decide whether to do this or not?
------- Comment #7 From 2005-10-06 11:47:12 PST -------
I have looked in Mozilla's bugzilla, and also googled for relevant discussions, but found nothing. Perhaps, 
I didn't look close enough, as it's hard to expect that such a feature has never been tested in Firefox (see 
http://www.mozilla.org/js/language/js20/formal/stages.html).

Besides the soft hyphen, I would expect various joiners/non-joiners and bidi overrides to be used in string 
literals sometimes, but I do not have any evidence.
------- Comment #8 From 2005-10-06 11:52:40 PST -------
Alexey, I'm not sure what relevance the JavaScript 2.0 specification has to this discussion.

We've tested Firefox, and it does *not* remove soft hyphens.

I'd love to fix this if we can be sure it won't cause compatibility problems.
------- Comment #9 From 2005-10-06 12:11:58 PST -------
(In reply to comment #8)
> Alexey, I'm not sure what relevance the JavaScript 2.0 specification has to this discussion.

  Only to underline that I'm surprised to have found nothing in Mozilla bugzilla.

> We've tested Firefox, and it does *not* remove soft hyphens.

  Right, that's what I wrote in the description. Later, I have tested WinIE, and it also doesn't remove Cf 
characters, just the other browsers.

> I'd love to fix this if we can be sure it won't cause compatibility problems.

  Well, I was only splitting bug 4885, and don't really have an opinion about this part (unlike the other 
ones :) )
------- Comment #10 From 2005-10-11 08:41:27 PST -------
(From update of attachment 4179 [details])
I think we should probably land this even thought we are uneasy about the soft
hyphen thing.

If Maciej disagrees, he can review- it.
------- Comment #11 From 2005-10-24 14:27:54 PST -------
John Sullivan landed this.

sullivan    05/10/24 14:22:21

  Modified:    .        ChangeLog
               kjs      lexer.cpp
               tests/mozilla/ecma/Array 15.4.5.1-1.js
------- Comment #12 From 2006-09-23 11:11:40 PST -------
This fix has been rolled out in bug 10183. Not re-opening this bug, since the problem was not with the implementation, but rather with site compatibility.
------- Comment #13 From 2006-09-25 08:55:46 PST -------
We won't fix this because of bad effects on site compatibility.
------- Comment #14 From 2008-01-01 21:57:50 PST -------
*** Bug 16694 has been marked as a duplicate of this bug. ***
------- Comment #15 From 2008-04-21 09:11:15 PST -------
In fact, it looks like Firefox only preserves U+00AD (soft hyphen), and other Cf characters are removed as required by the spec. We probably want to match this, if only to fix ecma_3/Unicode/uc-001.js, which verifies that \u200E is removed.
------- Comment #16 From 2008-04-21 10:16:08 PST -------
OK. Will be very easy to fix.
------- Comment #17 From 2008-04-21 10:34:36 PST -------
Created an attachment (id=20731) [details]
first cut, not tested yet
------- Comment #18 From 2008-04-24 12:30:19 PST -------
Created an attachment (id=20796) [details]
patch, tested now -- needs regression test cases and ChangeLog

Here's a new patch, ready to go, except needs change log and some regression tests.
------- Comment #19 From 2008-05-05 10:08:56 PST -------
Created an attachment (id=20970) [details]
patch
------- Comment #20 From 2008-05-05 10:26:19 PST -------
(From update of attachment 20970 [details])
r=me. Looks like we are testing for the soft hyphen not being removed in a very indirect way though - perhaps it would help to have a dedicated test case fir this. Or maybe the existing test and the long comment that you added are enough.
------- Comment #21 From 2008-05-14 02:54:26 PST -------
Now that we are not removing BOM characters at decoding time, this is causing compatibility issues, see <rdar://problem/5934376>. So, I decided to make an additional test case, and land this.

But turns out that Firefox is changing (almost) as we speak - see <https://bugzilla.mozilla.org/show_bug.cgi?id=274152> and <https://bugzilla.mozilla.org/show_bug.cgi?id=368516>.

I'm going to investigate this further.
------- Comment #22 From 2008-05-14 04:21:44 PST -------
Created an attachment (id=21123) [details]
only remove the BOM
------- Comment #23 From 2008-05-14 09:14:47 PST -------
(From update of attachment 21123 [details])
You need to use &lt; in the description string instead of the less-than symbol.

r=me
------- Comment #24 From 2008-05-14 09:35:56 PST -------
Committed <http://trac.webkit.org/changeset/33443>.
------- Comment #25 From 2008-05-14 22:29:10 PST -------
*** Bug 19070 has been marked as a duplicate of this bug. ***