Bug 4931

Summary: Unicode format characters (Cf) other than soft hyphen should be removed from JavaScript source
Product: WebKit Reporter: Alexey Proskuryakov <ap>
Component: JavaScriptCoreAssignee: Alexey Proskuryakov <ap>
Status: RESOLVED FIXED    
Severity: Normal CC: darin, eric, erik, mjs
Priority: P2    
Version: 420+   
Hardware: Mac   
OS: OS X 10.4   
Bug Depends on:    
Bug Blocks: 4885    
Attachments:
Description Flags
test case for formatting characters in literals
none
test case for formatting characters outside literals
none
Darin's patch extracted from 4885
none
first cut, not tested yet
none
patch, tested now -- needs regression test cases and ChangeLog
none
patch
none
only remove the BOM darin: review+

Description Alexey Proskuryakov 2005-09-11 02:16:08 PDT
Section 7.1 of ECMA-262 says:

"The format control characters can occur anywhere in the source text of an ECMAScript program. These 
characters are removed from the source text before applying the lexical grammar. <...> these characters 
are removed before processing string and regular expression literals"

Besides Safari, this doesn't appear to work in Firefox, Opera or MacIE.
Comment 1 Alexey Proskuryakov 2005-09-11 02:16:58 PDT
Created attachment 3859 [details]
test case for formatting characters in literals
Comment 2 Alexey Proskuryakov 2005-09-11 02:17:18 PDT
Created attachment 3860 [details]
test case for formatting characters outside literals
Comment 3 Kimmo Kinnunen 2005-10-03 09:35:59 PDT
Created attachment 4179 [details]
Darin's patch extracted from 4885

This patch strips Cf chars from the input stream. It was extracted from a patch
done by Daring, from bug 4885.

Note that it modifies a testcase.
Comment 4 Maciej Stachowiak 2005-10-04 20:28:27 PDT
Does this work in any browser? It may be a compatibility problem if we diverge on this. I can imagine 
people wanting to use soft hyphens in JS strings.
Comment 5 Darin Adler 2005-10-05 10:04:17 PDT
I don't think it's correct in the other browsers either. I agree that it could cause a problem. Wonder what 
we should do.
Comment 6 Darin Adler 2005-10-06 07:18:57 PDT
Comment on attachment 4179 [details]
Darin's patch extracted from 4885

I think the patch is fine, but I am worried about Maciej's comment now. How can
we decide whether to do this or not?
Comment 7 Alexey Proskuryakov 2005-10-06 11:47:12 PDT
I have looked in Mozilla's bugzilla, and also googled for relevant discussions, but found nothing. Perhaps, 
I didn't look close enough, as it's hard to expect that such a feature has never been tested in Firefox (see 
http://www.mozilla.org/js/language/js20/formal/stages.html).

Besides the soft hyphen, I would expect various joiners/non-joiners and bidi overrides to be used in string 
literals sometimes, but I do not have any evidence.
Comment 8 Darin Adler 2005-10-06 11:52:40 PDT
Alexey, I'm not sure what relevance the JavaScript 2.0 specification has to this discussion.

We've tested Firefox, and it does *not* remove soft hyphens.

I'd love to fix this if we can be sure it won't cause compatibility problems.
Comment 9 Alexey Proskuryakov 2005-10-06 12:11:58 PDT
(In reply to comment #8)
> Alexey, I'm not sure what relevance the JavaScript 2.0 specification has to this discussion.

  Only to underline that I'm surprised to have found nothing in Mozilla bugzilla.

> We've tested Firefox, and it does *not* remove soft hyphens.

  Right, that's what I wrote in the description. Later, I have tested WinIE, and it also doesn't remove Cf 
characters, just the other browsers.

> I'd love to fix this if we can be sure it won't cause compatibility problems.

  Well, I was only splitting bug 4885, and don't really have an opinion about this part (unlike the other 
ones :) )
Comment 10 Darin Adler 2005-10-11 08:41:27 PDT
Comment on attachment 4179 [details]
Darin's patch extracted from 4885

I think we should probably land this even thought we are uneasy about the soft
hyphen thing.

If Maciej disagrees, he can review- it.
Comment 11 Timothy Hatcher 2005-10-24 14:27:54 PDT
John Sullivan landed this.

sullivan    05/10/24 14:22:21

  Modified:    .        ChangeLog
               kjs      lexer.cpp
               tests/mozilla/ecma/Array 15.4.5.1-1.js
Comment 12 Alexey Proskuryakov 2006-09-23 11:11:40 PDT
This fix has been rolled out in bug 10183. Not re-opening this bug, since the problem was not with the implementation, but rather with site compatibility.
Comment 13 Darin Adler 2006-09-25 08:55:46 PDT
We won't fix this because of bad effects on site compatibility.
Comment 14 Darin Adler 2008-01-01 21:57:50 PST
*** Bug 16694 has been marked as a duplicate of this bug. ***
Comment 15 Alexey Proskuryakov 2008-04-21 09:11:15 PDT
In fact, it looks like Firefox only preserves U+00AD (soft hyphen), and other Cf characters are removed as required by the spec. We probably want to match this, if only to fix ecma_3/Unicode/uc-001.js, which verifies that \u200E is removed.
Comment 16 Darin Adler 2008-04-21 10:16:08 PDT
OK. Will be very easy to fix.
Comment 17 Darin Adler 2008-04-21 10:34:36 PDT
Created attachment 20731 [details]
first cut, not tested yet
Comment 18 Darin Adler 2008-04-24 12:30:19 PDT
Created attachment 20796 [details]
patch, tested now -- needs regression test cases and ChangeLog

Here's a new patch, ready to go, except needs change log and some regression tests.
Comment 19 Darin Adler 2008-05-05 10:08:56 PDT
Created attachment 20970 [details]
patch
Comment 20 Alexey Proskuryakov 2008-05-05 10:26:19 PDT
Comment on attachment 20970 [details]
patch

r=me. Looks like we are testing for the soft hyphen not being removed in a very indirect way though - perhaps it would help to have a dedicated test case fir this. Or maybe the existing test and the long comment that you added are enough.
Comment 21 Alexey Proskuryakov 2008-05-14 02:54:26 PDT
Now that we are not removing BOM characters at decoding time, this is causing compatibility issues, see <rdar://problem/5934376>. So, I decided to make an additional test case, and land this.

But turns out that Firefox is changing (almost) as we speak - see <https://bugzilla.mozilla.org/show_bug.cgi?id=274152> and <https://bugzilla.mozilla.org/show_bug.cgi?id=368516>.

I'm going to investigate this further.
Comment 22 Alexey Proskuryakov 2008-05-14 04:21:44 PDT
Created attachment 21123 [details]
only remove the BOM
Comment 23 Darin Adler 2008-05-14 09:14:47 PDT
Comment on attachment 21123 [details]
only remove the BOM

You need to use &lt; in the description string instead of the less-than symbol.

r=me
Comment 24 Alexey Proskuryakov 2008-05-14 09:35:56 PDT
Committed <http://trac.webkit.org/changeset/33443>.
Comment 25 Mark Rowe (bdash) 2008-05-14 22:29:10 PDT
*** Bug 19070 has been marked as a duplicate of this bug. ***